Fixing confused bayesian matching data?
Lincoln A Baxter
lab at lincolnbaxter.com
Fri Jul 29 20:48:32 EDT 2016
On Fri, 2016-07-29 at 02:55 -0700, Jim DeLaHunt wrote:
> Philip:
>
> Sorry for the delay in responding to your message. I waited until I
> had
> something useful posted where you could see it.
>
> I was in exactly your situation back in February.
>
> On Sun, 17 Jul 2016 21:24:56 -0400, Philip Matthews
> <philip_matthews at magma.ca> wrote:
> > Just wondering if anyone has any advice on what to do with some
> very confused bayesian matching data?
> >
> > Right now, when I import new transactions (either CSV or QFX), they
> mostly don't find a match anymore. Only around 20 - 30% match. This
> is probably because I like to rejig my accounts from time to time as
> I continue to figure out what works best for me. Looking through the
> ".gnucash" file, I see lots of slot entries with account names that
> don't exist any more....
> >
> > ...Write a Python program that goes through the .gnucash file and
> deletes slot entries that point at accounts that don't exist any
> more.
> >
> > Comments? Other thoughts?
> >
> > Running GnuCash 2.6.11 on a Mac.
>
> My solution? XSLT processing <https://en.wikipedia.org/wiki/XSLT>.
> GnuCash files can be saved as XML format data, and XSLT is a tool
> for
> modifying XML data in a controlled, reliable way. I wrote a set of
> XSLT
> filters which:
>
> 1. lists the Bayes mapping data for each account in a gnucash XML
> file;
> 2. resets the import mapping, by deleting all the Bayes mapping data
> for every account in a gnucash XML file; and
> 3. prunes the import mapping data for certain target accounts in a
> gnucash XML file.
>
> A rather brief explanation of the situation and the GnuCash file
> format,
> with listings of all the XSLT filters, is in my freshly-written blog
> post, /Resetting GnuCash’s import transaction matching/
> <http://blog.jdlh.com/en/2016/07/29/resetting-gnucashs-import-transaction-matching/>.
> Take a look. I hope it's helpful.
This prompts me to reply with a newer version of the Perl script I
posted some time back to do this (and a whole lot more). I have also
attached the text of the full man page embedded in this script.
Another GC user (Cheryl Wheeler) provided a patch that is incorporated
into this version. I have been discussing with Chris Good a more
permanent location for this script, which can then be referenced in the
GnuCash wiki. I have not had time to do much work on that, however,
so I repost the script here for now.
Lincoln
-------------- next part --------------
NAME
gc_prune_bayes_data.pl
COPYRIGHT
Copyright (C) 2015-2016 Lincoln A Baxter
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
The GNU General Public License can be found at
<http://www.gnu.org/licenses/>
The following people have contributed to this script:
Cheryl Wheeler: Patch provided 7/26/2016 -- fixed argument
assignment in SetKeyNodeText()
ABSTRACT
Remove Bayes data associated with one or all accounts in an uncompressed
GnuCash XML data file.
This enables the user to analyze, modify, trim (purge), or delete
Bayesian data from their GnuCash data file.
The script reads an uncompressed "version 2" GnuCash XML datafile and
removes all Bayes data that references non-existent accounts.
The script reads the GnuCash XML file *as XML*, not as text. It uses
the XML::LibXML CPAN Perl module to read, traverse, modify, and output a
new XML file. It is not dependent on the formatting of the GnuCash XML,
unlike most Perl scripts this author has seen on the GnuCash users email
list.
Because this script does not treat the GnuCash file as text, it is not
subject to the breakage that would occur if the formatting of the XML
data were to change.
Instead, this script reads the GnuCash data as a DOM structure, and then
manipulates that structure.
The script does not modify the input GnuCash XML datafile. To save
changes, the user must specify an output data filename. The results of
the script's operations are written to this file, which should then be
opened and checked in GnuCash before replacing the original file.
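The difference between reading the file as XML and reading it as text can be illustrated with a small Python sketch. It is not taken from the script, which uses Perl and XML::LibXML. The namespace URI and the sample data are assumptions; check the root element of your own file. The same data is parsed twice, once pretty-printed and once on a single line, and the extracted slot keys are identical either way.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "slot:" prefix; verify against your file's root element.
SLOT_NS = "http://www.gnucash.org/XML/slot"
KEY = f"{{{SLOT_NS}}}key"

# Made-up sample data in the shape of a Bayes slot.
PRETTY = f"""<slot xmlns:slot="{SLOT_NS}">
  <slot:key>import-map-bayes</slot:key>
  <slot:value type="frame">
    <slot>
      <slot:key>HOME</slot:key>
      <slot:value type="frame"/>
    </slot>
  </slot:value>
</slot>"""

# The same data on a single line, as a different serializer might write it.
COMPACT = "".join(line.strip() for line in PRETTY.splitlines())

def slot_keys(xml_text):
    """Return the text of every <slot:key> element, in document order."""
    root = ET.fromstring(xml_text)
    return [key.text for key in root.iter(KEY)]

print(slot_keys(PRETTY))
print(slot_keys(PRETTY) == slot_keys(COMPACT))
```

A line-oriented text match would have to be rewritten for the single-line form; the parsed version does not care.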
To print a usage synopsis: gc_prune_bayes_data.pl --help
To print the synopsis, plus option descriptions: gc_prune_bayes_data.pl
--help --verbose
To print the entire man page: gc_prune_bayes_data.pl --man
SYNOPSIS
gc_prune_bayes_data.pl [options] GC-file.xml [modified-GC-file.xml]
options:
--help                          print this synopsis help text
--man                           print the full man page (this file's POD)
--verbose                       increased verbosity (mainly for debugging)
--only=only_acts.txt            (input) only operate on these accounts
--rename=slot_acts.txt          (input) for renaming slot keys
--listKVP=kvps.txt              (output) print a Bayes KVP report by account
--listAccounts=accounts.txt     (output) print full GC path of all accounts
--listBayesAccounts=filename    (output) accounts with Bayes data
--removed=removed.txt           (output) writes the XML of removed slots
--orphaned=orphaned.txt         (output) writes orphaned slot keys
--pathSeparator=char            GnuCash account path separator (default=:)
--removeRegex=regex             prune slots matching regular expressions (repeatable)
--substExpresssions=expression  sed-like substitution patterns applied
                                to slot key names (repeatable)
The second file argument is optional. With no destination output
file, gc_prune_bayes_data.pl runs in analytical/trial mode: it reports
all actions taken and produces all specified analysis outputs.
The source input file is never modified.
If your GnuCash data file is compressed, you must uncompress it first (on
Unix-based OSes) as follows:
cat gnucash_data_file | gunzip > uncompressed_gnucash_file.xml
Or you can just uncheck the "compressed file" option in GnuCash and
save.
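Where gunzip is not at hand, the same step can be sketched in Python. This is an illustration only, not part of the script; the function name uncompress_gnucash is made up here. It relies on the fact that gzip files begin with the two bytes 0x1f 0x8b, and it simply copies a file that is already uncompressed.

```python
import gzip
import shutil

def uncompress_gnucash(src, dst):
    """Copy src to dst, gunzipping it if it starts with the gzip magic bytes.

    Returns True if src was compressed, False if it was copied unchanged.
    """
    with open(src, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return is_gzip

# Example call (file names are placeholders):
# uncompress_gnucash("gnucash_data_file", "uncompressed_gnucash_file.xml")
```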
DISCUSSION
Initially (when you start out with GnuCash, or with a new account), you
have to "train" the matcher by manually identifying the destination
account for every transaction. When you do (assuming you have enabled
Bayesian matching on import), the matcher will remember your choices by
storing transaction description data and the account you selected in
key/value pairs called "slots" under the account you were importing into.
The values of existing slots will be incremented, making the match
stronger for the balancing account selected.
When the matcher builds up a strong enough match for a particular
account, the importer will propose that account as the destination
account, and it will show you the match "strength."
This works very well when one is consistent.
When the matcher cannot make up its mind, the importer will ask you to
choose a destination account.
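The bookkeeping described above can be pictured with a toy Python model. This is only a sketch of the counting, assuming tokens are split on white space; it is not GnuCash's code and does not attempt to reproduce how the importer turns counts into a match strength. The descriptions and account names are invented.

```python
from collections import defaultdict

def train(table, description, account):
    """Increment the count for every whitespace-delimited token of description."""
    for token in description.split():
        table[token][account] += 1

# token -> balancing account -> number of times the user chose it
table = defaultdict(lambda: defaultdict(int))
train(table, "HOME DEPOT #1342", "Expenses:Home:Repairs")
train(table, "HOME DEPOT #1342", "Expenses:Garden")
train(table, "HOME DEPOT #1359", "Expenses:Home:Repairs")

for token, counts in table.items():
    print(token, dict(counts))
```

Note how the token HOME ends up pointing at two different accounts as soon as the same vendor is balanced in two ways.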
Problems can creep in for the following reasons:
1 Accounts have been renamed
In this case the Bayes keys end up referencing account names that
no longer exist. Unfortunately, as someone on the GC user list
pointed out, the Bayes key values reference accounts by name, not
by the account's UUID (GUID). This means that when accounts are
renamed in GnuCash, all the Bayes keys that reference the original
account name become, as this author puts it, "orphaned" -- they no
longer reference existing accounts.
This author had done a lot of renames over time.
2 Inconsistent balancing accounts selected during import
This happens when you purchase items from the same vendor -- Home
Depot, for instance -- but the purchases were for different
purposes, so you balance the transactions with different expense
accounts. In this author's experience, the matcher almost never even
selects a destination account during import. And when it does, it
is often not the right account for the subject transaction.
3 You have imported unbalanced transactions
If you allow transactions to be imported without balancing accounts,
then GnuCash will balance them in a default "Imbalance" account.
This might happen because you want to get the transactions imported
now and fix the imbalanced transactions later. Maybe you had
already spent a lot of time selecting accounts and wanted to get the
import finished, but there were still some transactions that were
unbalanced. These will be balanced by default to the Imbalance account.
*The Bayes slot data will remember this!* Perhaps the importer
should explicitly NOT remember unbalanced transactions. The author
may decide to volunteer some time to work on the importer, and if I
do, this is one "feature" I would want to fix.
Degradation over time
Over a long period of time, even when you have been careful with
balancing account selections, the matching account selection can degrade
significantly (see reason 2 above). Eventually this author's frustration
rose to the level of desiring to "fix something" or delete all the Bayes
data and start over. Finally, this led to this script.
Repairing GnuCash Bayes slot data
This script allows one to remove or repair Bayes slot data associated
with all or individual accounts in a GnuCash XML data file. If you
remove all Bayes data in an account, the account will be returned to the
state it was in at the time you first created it, and you will have to
identify balancing accounts for transactions imported into it.
Instead of starting over, however, this script allows you to "prune"
or edit the Bayes slot data. By default, all Bayes data referencing
accounts that no longer exist in the GnuCash data file is removed.
The script also allows you to "adjust" or tune the Bayes data in
one or all accounts.
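The default pruning rule can be sketched in Python as follows. This is an illustration, not the script's Perl code. The namespace URI, the function name prune_orphans, and the decision to also drop a token slot once it has no accounts left are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SLOT_NS = "http://www.gnucash.org/XML/slot"  # assumed; check your file's root element
KEY = f"{{{SLOT_NS}}}key"
VALUE = f"{{{SLOT_NS}}}value"

def prune_orphans(bayes_frame, existing_paths):
    """Remove account slots naming unknown accounts; return removed (token, account) pairs."""
    removed = []
    for token_slot in list(bayes_frame.findall("slot")):
        token = token_slot.find(KEY).text
        accounts = token_slot.find(VALUE)
        for acct_slot in list(accounts.findall("slot")):
            name = acct_slot.find(KEY).text
            if name not in existing_paths:
                accounts.remove(acct_slot)
                removed.append((token, name))
        if accounts.find("slot") is None:  # assumption: drop tokens left with no accounts
            bayes_frame.remove(token_slot)
    return removed

# Made-up frame: the value of an import-map-bayes slot.
FRAME = ET.fromstring(f"""<slot:value xmlns:slot="{SLOT_NS}" type="frame">
  <slot>
    <slot:key>#000001342</slot:key>
    <slot:value type="frame">
      <slot><slot:key>Expenses:Auto:Repair and Maintenance</slot:key><slot:value type="integer">1</slot:value></slot>
      <slot><slot:key>Expenses:Auto:Fuel</slot:key><slot:value type="integer">1</slot:value></slot>
    </slot:value>
  </slot>
  <slot>
    <slot:key>#000001359</slot:key>
    <slot:value type="frame">
      <slot><slot:key>Expenses:Gifts</slot:key><slot:value type="integer">1</slot:value></slot>
    </slot:value>
  </slot>
</slot:value>""")

removed = prune_orphans(FRAME, existing_paths={"Expenses:Auto:Fuel"})
print(removed)
```

The returned list plays the role of the orphaned and removed reports: it records what was dropped so the result can be reviewed before the original file is replaced.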
Details: GnuCash Bayes Matching Data
The Bayes matching data is stored in the GnuCash data file, in "slots"
associated with accounts that have been the target of transaction imports,
as hierarchical "Key/Value" pairs. The Bayes data slots are stored in a
top-level "frame" slot (array) with the name import-map-bayes:
<slot:key>import-map-bayes</slot:key>
The array of slots within this slot's "frame" contains the KVPs written by
the importer when Bayesian matching is enabled.
<slot:key>import-map-bayes</slot:key>
<slot:value type="frame">
  <slot>
    <slot:key>#000001342</slot:key>
    <slot:value type="frame">
      <slot>
        <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
        <slot:value type="integer">1</slot:value>
      </slot>
      <slot>
        <slot:key>Expenses:Auto:Fuel</slot:key>
        <slot:value type="integer">1</slot:value>
      </slot>
    </slot:value>
  </slot>
  <slot>
    <slot:key>#000001359</slot:key>
    <slot:value type="frame">
      <slot>
        <slot:key>Expenses:Gifts</slot:key>
        <slot:value type="integer">1</slot:value>
      </slot>
    </slot:value>
  </slot>
The line
<slot:key>#000001342</slot:key>
in the above represents a word from a transaction description that has
been imported. The value of this slot key is a frame (array) of slots
that provide account names
<slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
<slot:key>Expenses:Auto:Fuel</slot:key>
and the matching "strength" values for the named account
<slot:value type="integer">1</slot:value>
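To make the nesting concrete, here is a short Python sketch that turns such a frame into a token / account / count table, roughly the kind of report --listKVP is described as producing. It is an illustration only: the namespace URI and the function name bayes_table are assumptions, and the report layout is invented.

```python
import xml.etree.ElementTree as ET

SLOT_NS = "http://www.gnucash.org/XML/slot"  # assumed; check your file's root element
KEY = f"{{{SLOT_NS}}}key"
VALUE = f"{{{SLOT_NS}}}value"

def bayes_table(bayes_frame):
    """Return {token: {account_path: count}} for one import-map-bayes frame."""
    table = {}
    for token_slot in bayes_frame.findall("slot"):
        counts = {}
        for acct_slot in token_slot.find(VALUE).findall("slot"):
            counts[acct_slot.find(KEY).text] = int(acct_slot.find(VALUE).text)
        table[token_slot.find(KEY).text] = counts
    return table

# Sample frame with the same shape as the listing above.
FRAME = ET.fromstring(f"""<slot:value xmlns:slot="{SLOT_NS}" type="frame">
  <slot>
    <slot:key>#000001342</slot:key>
    <slot:value type="frame">
      <slot><slot:key>Expenses:Auto:Repair and Maintenance</slot:key><slot:value type="integer">1</slot:value></slot>
      <slot><slot:key>Expenses:Auto:Fuel</slot:key><slot:value type="integer">1</slot:value></slot>
    </slot:value>
  </slot>
  <slot>
    <slot:key>#000001359</slot:key>
    <slot:value type="frame">
      <slot><slot:key>Expenses:Gifts</slot:key><slot:value type="integer">1</slot:value></slot>
    </slot:value>
  </slot>
</slot:value>""")

table = bayes_table(FRAME)
for token, counts in table.items():
    print(token)
    for account, count in counts.items():
        print(f"    {account} = {count}")
```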
ENVIRONMENT
Because gc_prune_bayes_data.pl reads the input XML file *as XML* using
the CPAN XML::LibXML module (with an XPath expression to find the Bayes
matching data slots in each account), the script requires that your Perl
environment have the CPAN XML::LibXML module installed.
The command
perl -c -MXML::LibXML </dev/null
will report an error if the module is not installed in your environment.
Of course the script will report this also, because if it is not
present, the script will not compile.
Unix/Linux environments
Most Linux distributions make this available via their standard
package managers. On Debian-based distributions this can be installed with
the following command:
sudo apt-get install libxml-libxml-perl
On Unix/Linux environments this script should be made executable with
the chmod command
chmod +x gc_prune_bayes_data.pl
Windows environments
XML::LibXML is also available in ActivePerl and in the Cygwin
environment.
This author has not tested this script on Windows, but there should be
no reason why it will not work once the required environment is
installed.
On Windows the easiest way to run the script would be by using perl from
a cmd prompt:
perl gc_prune_bayes_data.pl
Macintosh environments
This author is not familiar with the OS X environment. Patches to these
instructions are welcome.
OPTIONS
All options may be abbreviated as long as the abbreviation is distinct from
all other options.
--listKVP=filename
Specifies a file into which will be written the Bayes key/value pairs
for each account in the GnuCash file. This is useful for analyzing your
Bayes matching data in an easy-to-read abbreviated (indented) format.
--only=filename
Specifies an input control file which is expected to contain a list of
account paths from which import-map-bayes slots will be pruned.
--listBayesAccounts=filename
Specifies a file to which the account paths with Bayes import data will
be written (just names the accounts that contain import data). This can
be used to create an input file for --onlyFile=filename.
--listAccounts=filename
Specifies a file to which will be written a list of accounts in the GnuCash
file. The full account path name is written. If filename is '-' then the
accounts will be listed to STDOUT.
This may be useful for creating the input files specified
by other options.
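How a full account path can be assembled from the account records is sketched below in Python. This is not the script's code. It assumes the usual GnuCash XML layout, in which each gnc:account carries act:name, act:id, act:type and (except for the root) act:parent, and it assumes the namespace URIs shown; both should be checked against your own file. The GUIDs in the sample are shortened placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against the root element of your file.
NS = {
    "gnc": "http://www.gnucash.org/XML/gnc",
    "act": "http://www.gnucash.org/XML/act",
}

def account_paths(root, sep=":"):
    """Return the sorted full path of every non-root account."""
    accounts = {}
    for acct in root.iter(f"{{{NS['gnc']}}}account"):
        guid = acct.findtext("act:id", namespaces=NS)
        accounts[guid] = (
            acct.findtext("act:name", namespaces=NS),
            acct.findtext("act:parent", namespaces=NS),  # None for the root account
            acct.findtext("act:type", namespaces=NS),
        )

    def path(guid):
        name, parent, _type = accounts[guid]
        if parent is None or accounts[parent][2] == "ROOT":
            return name  # top-level accounts do not include the root's name
        return path(parent) + sep + name

    return sorted(path(g) for g, (_n, _p, t) in accounts.items() if t != "ROOT")

SAMPLE = f"""<gnc-v2 xmlns:gnc="{NS['gnc']}" xmlns:act="{NS['act']}">
  <gnc:account version="2.0.0">
    <act:name>Root Account</act:name><act:id type="guid">r</act:id><act:type>ROOT</act:type>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Expenses</act:name><act:id type="guid">e</act:id><act:type>EXPENSE</act:type>
    <act:parent type="guid">r</act:parent>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Auto</act:name><act:id type="guid">a</act:id><act:type>EXPENSE</act:type>
    <act:parent type="guid">e</act:parent>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Fuel</act:name><act:id type="guid">f</act:id><act:type>EXPENSE</act:type>
    <act:parent type="guid">a</act:parent>
  </gnc:account>
</gnc-v2>"""

paths = account_paths(ET.fromstring(SAMPLE))
print("\n".join(paths))
```

The resulting set of paths is what the Bayes account-name keys have to be compared against to decide whether a reference is orphaned.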
--orphanedFile=filename
Specifies an output file to which will be written a list of orphaned
account paths. This is the list of references that do not refer to an
existing account.
This can be used in conjunction with the --listAccounts file to create a
renameFile.
--renameFile=filename
Specifies a file of from/to account paths. Bayes slot data that matches
the 'from' path will be replaced with the 'to' path. This is the best
way to retain slot data for accounts that have been renamed.
The file is expected in the form of single-quoted (') 'from' and 'to'
paths, one pair per line. The regular expression used to parse this is
'(.*)'[:,\/\s]{1,}'(.*)'$
Any of ":", ",", "/", space, or tab (one or more) can be used as a
separator between the from and to account paths.
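The rename-file format can be tried out with the same regular expression in Python, whose syntax agrees with Perl's for this pattern. The sketch below is an illustration with invented account names; the function name parse_rename_lines is hypothetical, and silently skipping lines that do not match is an assumption about behaviour, not something the man page states.

```python
import re

# The pattern quoted in the man page, used unchanged.
RENAME_LINE = re.compile(r"'(.*)'[:,\/\s]{1,}'(.*)'$")

def parse_rename_lines(lines):
    """Return {from_path: to_path} for every line that matches the pattern."""
    renames = {}
    for line in lines:
        m = RENAME_LINE.search(line.rstrip("\n"))
        if m:
            renames[m.group(1)] = m.group(2)
    return renames

renames = parse_rename_lines([
    "'Expenses:Car' : 'Expenses:Auto'",
    "'Expenses:Home Depot','Expenses:Home:Repairs'",
    "this line has no quoted paths and is skipped",
])
print(renames)
```

The quotes matter: account paths themselves contain colons and may contain spaces, so the separator can only be recognised between the closing quote of the 'from' path and the opening quote of the 'to' path.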
--removeRegex='regex'
The bayesian matcher tokenizes transaction descriptions on word
boundaries (delimited by white space). It is argued that this creates a
lot of fairly useless key to account relationships. This option can be
used to remove key/value pairs using regular expression matching against
the slot keys.
Purely numeric keys can be removed as follows:
--removeRegex='^\d{1,}$'
Similarly,
--removeRegex='^#\d{1,}$'
would remove all slots whose keys start with a hash (#) and are purely
numeric after the hash.
This switch may be specified multiple times to specify multiple Regexes.
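The effect of several --removeRegex patterns on a list of slot keys can be previewed with a few lines of Python. This is only a sketch with invented keys; the script itself uses Perl regular expressions, which agree with Python's for the two patterns shown above but are not identical in general.

```python
import re

def split_keys(keys, remove_patterns):
    """Split keys into (kept, removed); a key is removed if any pattern matches it."""
    compiled = [re.compile(p) for p in remove_patterns]
    kept, removed = [], []
    for key in keys:
        (removed if any(rx.search(key) for rx in compiled) else kept).append(key)
    return kept, removed

kept, removed = split_keys(
    ["12345", "#000001342", "HOME", "DEPOT", "#4"],
    [r"^\d{1,}$", r"^#\d{1,}$"],
)
print("kept:", kept)
print("removed:", removed)
```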
--saveslots=filename
Specifies a file to which will be written the removed slots in XML
format. Use if you want to save the XML version of the slot data for
some paranoid reason. This was implemented because the author was
originally paranoid.
--verbose
Print very verbose output to STDOUT. Used for debugging. Don't bother
--help (or -h)
Print help text.
--path(Separator)=:
The character used as the path separator in your GC datafile. The
default value is a colon (:).
--substString=substitution_pattern
--subst='s/fromRegex/toRegex/' #or
--subst='s,fromRegex,toRegex,'
Repeat as often as you need.
Substitution expressions used to edit slot key values. The regular
expression is executed literally in a Perl eval(). If the eval fails,
a message will be written to STDERR, and the substitution regex will be
removed from the list.
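The behaviour described here (apply each expression to each key, and drop any expression that fails) can be mimicked in Python. This is a loose sketch, not an equivalent of Perl's eval: it handles only the plain s<delimiter>from<delimiter>to<delimiter> form, without flags or escaped delimiters, and Python's replacement syntax differs from Perl's (\1 rather than $1). The function names and sample keys are invented.

```python
import re
import sys

def compile_subst(expr):
    """Return a function applying 's<d>from<d>to<d>' once, or None if expr is unusable."""
    if len(expr) < 4 or expr[0] != "s":
        return None
    parts = expr[2:].split(expr[1])        # expr[1] is the delimiter, e.g. "/" or ","
    if len(parts) != 3 or parts[2] != "":  # flags and escaped delimiters are not handled
        return None
    try:
        pattern = re.compile(parts[0])
    except re.error:
        return None
    return lambda key: pattern.sub(parts[1], key, count=1)

def apply_all(exprs, keys):
    """Apply every usable expression, in order, to every key."""
    substs = []
    for expr in exprs:
        fn = compile_subst(expr)
        if fn is None:
            print(f"ignoring bad substitution: {expr}", file=sys.stderr)
        else:
            substs.append(fn)
    result = []
    for key in keys:
        for fn in substs:
            key = fn(key)
        result.append(key)
    return result

new_keys = apply_all(
    ["s/^Expenses:Car/Expenses:Auto/", "s,Maintainance,Maintenance,", "s/(unclosed/x/"],
    ["Expenses:Car:Fuel", "Expenses:Car:Maintainance", "Income:Salary"],
)
print(new_keys)
```

The third expression has an unbalanced parenthesis, so it is reported on STDERR and skipped while the other two are still applied.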
Examples
gc_prune_bayes_data.pl --listA=- uncompressed_gcfile
List all accounts in the input file to STDOUT. A file may be provided
instead of -.
gc_prune_bayes_data.pl --onlyFile=only.txt source-GC-file.xml rewritten-GC-file.xml
Operate only on the accounts specified in only.txt. Produce
rewritten-GC-file.xml with the results of the operations specified.
gc_prune_bayes_data.pl uncompressed_gcfile modified_gcfile
Prune Bayes data from all accounts and write an updated gcfile.
Lincoln A. Baxter email: my initials (all three) (at) lincolnbaxter (dot)
com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc_prune_bayes_data.pl
Type: application/x-perl
Size: 39095 bytes
Desc: not available
URL: <http://lists.gnucash.org/pipermail/gnucash-user/attachments/20160729/76b48a2c/attachment-0001.pl>