Fixing confused bayesian matching data?

Lincoln A Baxter lab at lincolnbaxter.com
Fri Jul 29 20:48:32 EDT 2016


On Fri, 2016-07-29 at 02:55 -0700, Jim DeLaHunt wrote:
> Philip:
> 
> Sorry for the delay in responding to your message. I waited until I had
> something useful posted where you could see it.
> 
> I was in exactly your situation back in February.
> 
> On Sun, 17 Jul 2016 21:24:56 -0400, Philip Matthews 
> <philip_matthews at magma.ca> wrote:
> > Just wondering if anyone has any advice on what to do with some very
> > confused bayesian matching data?
> >
> > Right now, when I import new transactions (either CSV or QFX), they
> > mostly don't find a match anymore. Only around 20 - 30% match. This is
> > probably because I like to rejig my accounts from time to time as I
> > continue to figure out what works best for me. Looking through the
> > ".gnucash" file, I see lots of slot entries with account names that
> > don't exist any more....
> >
> > ...Write a Python program that goes through the .gnucash file and
> > deletes slot entries that point at accounts that don't exist any more.
> >
> > Comments?  Other thoughts?
> >
> > Running GnuCash 2.6.11 on a Mac.
> 
> My solution? XSLT processing <https://en.wikipedia.org/wiki/XSLT>.
> GnuCash files can be saved as XML format data, and XSLT is a tool for
> modifying XML data in a controlled, reliable way. I wrote a set of XSLT
> filters which:
>
>  1. lists the Bayes mapping data for each account in a gnucash XML
>     file;
>  2. resets the import mapping, by deleting all the Bayes mapping data
>     for every account in a gnucash XML file; and
>  3. prunes the import mapping data for certain target accounts in a
>     gnucash XML file.
> 
> A rather brief explanation of the situation and the GnuCash file format,
> with listings of all the XSLT filters, is in my freshly-written blog
> post, /Resetting GnuCash’s import transaction matching/
> <http://blog.jdlh.com/en/2016/07/29/resetting-gnucashs-import-transaction-matching/>.
> Take a look. I hope it's helpful.

This prompts me to reply with a newer version of the perl script I posted
some time back to do this (and a whole lot more). I have also attached the
text of the full man page embedded in this script.

Another GC user (Cheryl Wheeler) provided a patch that is incorporated
into this version. I have been discussing with Chris Good a more
permanent location for this script, which can then be referenced in the
GnuCash wiki. I have not had time to do much work on that, however, so I
am reposting the script here for now.

Lincoln



-------------- next part --------------
NAME
    gc_prune_bayes_data.pl

COPYRIGHT
    Copyright (C) 2015-2016 Lincoln A Baxter

    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation, either version 3 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    The GNU General Public License can be found at
    <http://www.gnu.org/licenses/>

    The following people have contributed to this script:

       Cheryl Wheeler: Patch provided 7/26/2016 -- fixed argument
                       assignment in SetKeyNodeText()

ABSTRACT
    Remove Bayes data associated with one or all accounts in an uncompressed
    GnuCash XML data file.

    This enables the user to analyze, modify, trim (purge), or delete
    bayesian data from his GnuCash data file.

    The script reads an uncompressed "version 2" GnuCash XML datafile and
    removes all Bayes data that references non-existent accounts.

    The script reads the GnuCash XML file *as XML*, not as text. The script
    uses the XML::LibXML CPAN perl module to read, traverse, modify, and
    output a new XML file. It is not dependent on the formatting of the
    GnuCash XML, unlike most perl scripts this author has seen on the
    GnuCash users email list.

    Because this script does not treat the GnuCash file as text, it is not
    subject to the breakage that would occur if the formatting of the XML
    data were to change.

    Instead, this script reads the GnuCash data as a DOM structure, and then
    manipulates that structure.
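
    To illustrate why parsing beats text matching, here is a minimal Python
    sketch (not the perl script's own code). It extracts slot keys from two
    differently formatted copies of the same data and gets the same answer.
    The namespace URI and the <root> wrapper element are assumptions made
    for the example; check the root element of your own file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for slot elements; verify against your own file.
SLOT_NS = "http://www.gnucash.org/XML/slot"

# The same data twice: once on a single line, once pretty-printed.
COMPACT = (
    '<root xmlns:slot="http://www.gnucash.org/XML/slot">'
    '<slot><slot:key>import-map-bayes</slot:key>'
    '<slot:value type="frame"/></slot></root>'
)
PRETTY = """
<root xmlns:slot="http://www.gnucash.org/XML/slot">
  <slot>
    <slot:key>
      import-map-bayes
    </slot:key>
    <slot:value type="frame"/>
  </slot>
</root>
"""

def slot_keys(xml_text):
    """Return the text of every slot:key element, ignoring layout whitespace."""
    root = ET.fromstring(xml_text)
    return [key.text.strip() for key in root.iter(f"{{{SLOT_NS}}}key")]

# Both layouts yield the same keys because the parser sees structure, not lines.
keys_compact = slot_keys(COMPACT)
keys_pretty = slot_keys(PRETTY)
```

    A line-oriented regular expression would need special handling for the
    second layout; the parsed tree does not.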

    The script does not modify the input GnuCash XML datafile. To save the
    results, the user must specify an output data filename. The results of
    the script's operations are written to this file, which should then be
    opened and checked in GnuCash before replacing the original file.

    To print a usage synopsis: gc_prune_bayes_data.pl --help

    To print the synopsis, plus option descriptions: gc_prune_bayes_data.pl
    --help --verbose

    To print the entire man page: gc_prune_bayes_data.pl --man

SYNOPSIS
    gc_prune_bayes_data.pl [options] GC-file.xml [modified-GC-file.xml]

    options:

     --help                         print this synopsis help text
     --man                          print the full man page (this file's POD)
     --verbose                      increased verbosity (mainly for debugging)
     --only=only_acts.txt           (input)  only operate on these accounts
     --rename=slot_acts.txt         (input)  for renaming slot keys 
     --listKVP=kvps.txt             (output) print a Bayes KVP report by account
     --listAccounts=accounts.txt    (output) print full GC path of all accounts 
     --listBayesAccounts=filename   (output) accounts with bayes data
     --removed=removed.txt          (output) writes the XML of removed slots
     --orphaned=orphaned.txt        (output) writes orphaned slot keys
     --pathSeparator=char           GnuCash account path separator (default=:)
     --removeRegex=regex            prune slots matching regular expressions (repeatable)
     --substExpresssions=expression sed-like substitution patterns applied
                                    to slot key names (repeatable)

    The second file argument is optional. With no destination output file,
    gc_prune_bayes_data.pl runs in analytical/trial mode, and reports all
    actions taken and produces all specified analysis outputs.

    The source input file is never modified.

    If your gnucash data file is compressed, you must uncompress it first (on
    unix based OSes) as follows:

      cat gnucash_data_file | gunzip > uncompressed_gnucash_file.xml

    Or you can just uncheck the "compressed file" option in GnuCash and
    save.
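
    Where gunzip is not available (for example on Windows), the same step
    can be sketched in Python using only the standard library. The function
    name and file names here are hypothetical; the source file is only read.

```python
import gzip
import shutil

def uncompress_gnucash(src_path, dst_path):
    """Write an uncompressed copy of src_path to dst_path.

    src_path is opened read-only, so the original file is left untouched.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

    Usage would look like uncompress_gnucash("mybooks.gnucash",
    "uncompressed_gnucash_file.xml"), after which the uncompressed copy can
    be passed to the script.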

DISCUSSION
    Initially (when you start out with GnuCash, or with a new account), you
    have to "train" the matcher by manually identifying the destination
    account for every transaction. When you do (assuming you have enabled
    Bayesian matching on import), the matcher will remember your choices by
    storing transaction description data and the account you selected in
    key/value pairs called "slots" under the account you were importing
    into. The values of existing slots will be incremented, making the
    match stronger for the balancing account selected.
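
    The bookkeeping described above can be pictured with a small Python
    sketch. This is an illustration only, not GnuCash's importer code: the
    function name, the sample description, and the account names are all
    made up, and splitting on white space is a simplifying assumption.

```python
from collections import defaultdict

def record_choice(bayes, description, account):
    """Increment the count for `account` under every word of `description`."""
    for token in description.split():
        bayes[token][account] += 1

# token -> {account path -> count}, mirroring the nested slot layout
bayes = defaultdict(lambda: defaultdict(int))

record_choice(bayes, "HOME DEPOT #4512", "Expenses:Home:Repairs")
record_choice(bayes, "HOME DEPOT #4512", "Expenses:Home:Repairs")
record_choice(bayes, "HOME DEPOT #4512", "Expenses:Garden")
```

    After these three imports the token HOME carries a count of 2 for
    Expenses:Home:Repairs and 1 for Expenses:Garden, which is the kind of
    split evidence that weakens a match when choices are inconsistent.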

    When the matcher builds up a strong enough match for a particular
    account, the importer will propose that account as the destination
    account, and it will show you the match "strength."

    This works very well when one is consistent.

    When the matcher cannot make up its mind, the importer will ask you to
    choose a destination account.

    Problems can creep in for the following reasons:

    1 Accounts have been renamed
        In this case the bayes keys end up referencing account names that
        no longer exist. Unfortunately, as someone on the GC user list
        pointed out, the bayes key values reference accounts by name, not
        by the account's UUID (GUID). This means that when accounts are
        renamed in GnuCash, all the bayes slots that reference the original
        account name become, as this author puts it, "orphaned" -- they no
        longer reference existing accounts.

        This author had done a lot of renames over time.

    2 Inconsistent balancing accounts selected during import
        This happens when you purchase items from the same vendor -- Home
        Depot, for instance -- but the purchases were for different
        purposes, so you balance the transactions with different expense
        accounts. In this author's experience, the matcher then almost never
        selects a destination account during import. And when it does, it
        is often not the right account for the subject transaction.

    3 You have imported unbalanced transactions
        If you allow transactions to be imported without balancing accounts,
        then GnuCash will balance them in a default "Imbalance" account.
        This might happen because you want to get the transactions imported,
        but you want to fix the imbalance transactions later. Maybe you had
        already spent a lot of time selecting accounts and wanted to get
        the import finished, but there were still some transactions that
        were unbalanced. These will be balanced by default to the Imbalance
        account. *The Bayes slot data will remember this!* Perhaps the
        importer should explicitly NOT remember unbalanced transactions. The
        author may decide to volunteer some time to work on the importer,
        and if I do, this is one "feature" I would want to fix.

  Degradation over time
    Over a long period of time, even when you have been careful with
    balancing account selections, the matching account selection can degrade
    significantly (see reason 2 above). Eventually this author's frustration
    rose to the level of desiring to "fix something" or delete all the Bayes
    data and start over. Finally this led to this script.

  Repairing GnuCash Bayes slot data
    This script will allow one to remove or repair Bayes slot data
    associated with all or individual accounts in a GnuCash xml data file.
    If you remove all Bayes data in an account, the account will be returned
    to the state it was in at the time you first created it, and you will
    have to identify balancing accounts for transactions imported into it.

    Instead of starting over, however, this script will allow you to "prune"
    or edit the bayes slot data. By default, all bayes data referencing
    accounts that no longer exist in the GnuCash data file are removed.
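
    The default pruning rule can be sketched in a few lines of Python. This
    is not the perl script's implementation; it assumes the Bayes data has
    already been loaded into a plain dictionary of the hypothetical shape
    token -> {account path -> count}.

```python
def find_orphans(bayes, existing_accounts):
    """Return sorted account paths that the Bayes data references but that
    are not in existing_accounts."""
    referenced = {name for accounts in bayes.values() for name in accounts}
    return sorted(referenced - set(existing_accounts))

def prune_orphans(bayes, existing_accounts):
    """Return a copy of bayes with orphaned account entries dropped.

    Tokens left with no accounts at all are dropped too. The input dictionary
    is not modified, in the spirit of never touching the source file.
    """
    existing = set(existing_accounts)
    pruned = {}
    for token, accounts in bayes.items():
        kept = {name: n for name, n in accounts.items() if name in existing}
        if kept:
            pruned[token] = kept
    return pruned

# Made-up sample data: "Expenses:Auto:Gas" was renamed to "Expenses:Auto:Fuel".
bayes = {
    "SHELL": {"Expenses:Auto:Gas": 4, "Expenses:Auto:Fuel": 1},
    "OLDCO": {"Expenses:Misc": 2},
}
accounts = ["Expenses:Auto:Fuel", "Expenses:Gifts"]

orphans = find_orphans(bayes, accounts)
pruned = prune_orphans(bayes, accounts)
```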

    The script will also allow you to "adjust" or tune the bayes data in
    one or all accounts.

  Details: GnuCash Bayes Matching Data
    The Bayes matching data is stored in the GnuCash data file, in "slots"
    associated with accounts that have been the target of transaction
    imports, as hierarchical "Key/Value" pairs. The Bayes data slots are
    stored in a top-level "frame" slot (array) with the name
    import-map-bayes:

       <slot:key>import-map-bayes</slot:key>

    The array of slots within this slot's "frame" contains the KVPs written
    by the importer when Bayesian matching is enabled.

          <slot:key>import-map-bayes</slot:key>
          <slot:value type="frame">
            <slot>
              <slot:key>#000001342</slot:key>
              <slot:value type="frame">
                <slot>
                  <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
                  <slot:value type="integer">1</slot:value>
                </slot>
                <slot>
                  <slot:key>Expenses:Auto:Fuel</slot:key>
                  <slot:value type="integer">1</slot:value>
                </slot>
              </slot:value>
            </slot>
            <slot>
              <slot:key>#000001359</slot:key>
              <slot:value type="frame">
                <slot>
                  <slot:key>Expenses:Gifts</slot:key>
                  <slot:value type="integer">1</slot:value>
                </slot>
              </slot:value>
            </slot>
          </slot:value>

    The line

       <slot:key>#000001342</slot:key>

    in the above represents a word from a transaction description that has
    been imported. The value of this slot key is a frame (array) of slots
    that provide account names

       <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
       <slot:key>Expenses:Auto:Fuel</slot:key>

    and the matching "strength" values for the named account

       <slot:value type="integer">1</slot:value>
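
    As a rough illustration of walking this structure, the Python sketch
    below turns one import-map-bayes slot into a nested dictionary. It is
    not the perl script's code. The sample wraps the fragment shown above
    in an enclosing <slot> element, and the namespace URI is an assumption
    to be checked against your own file.

```python
import xml.etree.ElementTree as ET

SLOT_NS = "http://www.gnucash.org/XML/slot"  # assumed namespace URI
KEY = f"{{{SLOT_NS}}}key"
VALUE = f"{{{SLOT_NS}}}value"

SAMPLE = """
<slot xmlns:slot="http://www.gnucash.org/XML/slot">
  <slot:key>import-map-bayes</slot:key>
  <slot:value type="frame">
    <slot>
      <slot:key>#000001342</slot:key>
      <slot:value type="frame">
        <slot>
          <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
        <slot>
          <slot:key>Expenses:Auto:Fuel</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
      </slot:value>
    </slot>
    <slot>
      <slot:key>#000001359</slot:key>
      <slot:value type="frame">
        <slot>
          <slot:key>Expenses:Gifts</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
      </slot:value>
    </slot>
  </slot:value>
</slot>
"""

def bayes_map(bayes_slot):
    """Map each token key to {account path: count} for one import-map-bayes slot."""
    result = {}
    # Outer frame: one child slot per token from a transaction description.
    for token_slot in bayes_slot.find(VALUE).findall("slot"):
        counts = {}
        # Inner frame: one child slot per account, holding an integer count.
        for acct_slot in token_slot.find(VALUE).findall("slot"):
            counts[acct_slot.find(KEY).text] = int(acct_slot.find(VALUE).text)
        result[token_slot.find(KEY).text] = counts
    return result

data = bayes_map(ET.fromstring(SAMPLE))
```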

ENVIRONMENT
    Because gc_prune_bayes_data.pl reads the input XML file *as XML* using
    the CPAN XML::LibXML module, using an XPath expression to find the bayes
    matching data slots in each account, the script requires that your perl
    environment have the CPAN XML::LibXML module installed.

    The command

      perl -c -MXML::LibXML </dev/null

    will report an error if the module is not installed in your environment.
    Of course the script will report this also, because if it is not
    present, the script will not compile.

  Unix/Linux environments
    Most Linux distributions make this available via their standard
    package managers. On Debian based distributions this can be installed
    with the following command:

       sudo apt-get install libxml-libxml-perl

    On Unix/Linux environments this script should be made executable with
    the chmod command

       chmod +x gc_prune_bayes_data.pl

  Windows environments
    XML::LibXML is also available in ActivePerl and in Cygwin
    environments.

    This author has not tested this script on Windows, but there should be
    no reason why it will not work once the required environment is
    installed.

    On Windows the easiest way to run the script would be by using perl from
    a cmd prompt:

       perl gc_prune_bayes_data.pl

  Macintosh environments
    This author is not familiar with the OSX environment. Patches to these
    instructions are welcome.

OPTIONS
    All options may be abbreviated as long as the option is distinct from
    all other options.

  --listKVP=filename
    Specifies a file into which will be written the Bayes key/value pairs
    for each account in the GnuCash file. This is useful for analyzing your
    bayes matching data in an easy to read abbreviated (indented) format.

  --only=filename
    Specifies an input control file which is expected to contain a list of
    account paths from which import-map-bayes slots will be pruned.

  --listBayesAccounts=filename
    Specifies a file to which the account paths with bayes import data will
    be written (just names the accounts that contain import data). This can
    be used to create an input file for --onlyFile=filename.

  --listAccounts=filename
    Specifies a file to which will be written a list of accounts in the
    gnucash file. The full account path name is written. If filename is '-'
    then the accounts will be listed to STDOUT.

    This may be useful for creating the input files specified by other
    options.
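
    A full account path is the chain of account names from the top level
    down, joined by the path separator. The Python sketch below shows one
    way to build such paths; it is an illustration, not the script's code.
    The input shape (guid -> (name, parent guid)) is hypothetical, and it
    assumes a single root account with no parent that is left out of the
    paths.

```python
def full_paths(accounts, separator=":"):
    """Return {guid: full account path} for every non-root account.

    accounts maps guid -> (name, parent_guid); the root has parent None.
    """
    def path(guid):
        name, parent = accounts[guid]
        if accounts[parent][1] is None:
            # Parent is the root, so this is a top-level account.
            return name
        return path(parent) + separator + name

    return {
        guid: path(guid)
        for guid, (_name, parent) in accounts.items()
        if parent is not None
    }

# Made-up guids and names for the example.
accounts = {
    "r": ("Root Account", None),
    "e": ("Expenses", "r"),
    "a": ("Auto", "e"),
    "f": ("Fuel", "a"),
}
paths = full_paths(accounts)
slash_paths = full_paths(accounts, separator="/")
```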

  --orphanedFile=filename
    Specifies an output file to which will be written a list of orphaned
    account paths. This is the list of references that do not refer to an
    existing account.

    This can be used in conjunction with the --listAccounts file to create a
    renameFile.

  --renameFile=filename
    Specifies a file of from/to account paths. Bayes slot data that matches
    the 'from' path will be replaced with the 'to' path. This is the best
    way to retain slot data for accounts that have been renamed.

    The file is expected in the form of single-quoted (') 'from' and 'to'
    paths, one pair per line. The regular expression used to parse this is

         '(.*)'[:,\/\s]{1,}'(.*)'$

    Any of ":", ",", "/", space, or tab (one or more) can be used as a
    separator between the from and to account paths.
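
    To check a rename file before running the script, the same regular
    expression can be tried in Python. This is a sketch with made-up
    account names; the helper functions are hypothetical and not part of
    the script.

```python
import re

# The pattern quoted above, unchanged.
RENAME_LINE = re.compile(r"'(.*)'[:,\/\s]{1,}'(.*)'$")

def parse_renames(lines):
    """Return {from_path: to_path} for every line that matches the pattern.

    Lines that do not match (blank lines, notes) are silently skipped here.
    """
    renames = {}
    for line in lines:
        match = RENAME_LINE.search(line.strip())
        if match:
            renames[match.group(1)] = match.group(2)
    return renames

def rename_account(name, renames):
    """Return the new path for name, or name itself if it was not renamed."""
    return renames.get(name, name)

sample = [
    "'Expenses:Auto:Gas' : 'Expenses:Auto:Fuel'",
    "not a rename line",
    "'Expenses:Misc','Expenses:Miscellaneous'",
]
renames = parse_renames(sample)
```

    Note that account paths containing a single quote would confuse this
    pattern, since the quotes are what delimit the two paths.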

  --removeRegex='regex'
    The bayesian matcher tokenizes transaction descriptions on word
    boundaries (delimited by white space). It is argued that this creates a
    lot of fairly useless key to account relationships. This option can be
    used to remove key/value pairs using regular expression matching against
    the slot keys.

    Purely numeric keys can be removed as follows:

       --removeRegex='^\d{1,}$'

    Similarly,

       --removeRegex='^#\d{1,}$'

    would remove all slots whose keys start with a hash (#) and are purely
    numeric after the hash.

    This switch may be specified multiple times to specify multiple Regexes.
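
    The effect of several patterns can be previewed with a short Python
    sketch. It is not the script's code and assumes the Bayes data is
    held in a hypothetical dictionary of token -> {account path -> count};
    the patterns are the two shown above.

```python
import re

def remove_matching(bayes, patterns):
    """Split bayes into (kept, removed) by matching each token key against
    every pattern; a token is removed if any pattern matches it."""
    compiled = [re.compile(p) for p in patterns]
    kept, removed = {}, {}
    for token, accounts in bayes.items():
        target = removed if any(rx.search(token) for rx in compiled) else kept
        target[token] = accounts
    return kept, removed

# Made-up sample data.
bayes = {
    "#000001342": {"Expenses:Auto:Fuel": 1},
    "12345": {"Expenses:Gifts": 2},
    "HOME": {"Expenses:Home:Repairs": 3},
}
kept, removed = remove_matching(bayes, [r"^\d{1,}$", r"^#\d{1,}$"])
```

    Returning the removed entries alongside the kept ones makes it easy to
    review what a pattern would discard before committing to it.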

  --saveslots=filename
    Specifies a file to which will be written the removed slots in XML
    format. Use if you want to save the XML version of the slot data for
    some paranoid reason. This was implemented because the author was
    originally paranoid.

  --verbose
    Print very verbose output to STDOUT. Used for debugging. Don't bother.

  --help (or -h)            : print help text
  --path(Separator)=:       : the character used as the path separator in 
                                your GC datafile default value is a colon (:)
  --substString=substitution_pattern
       --subst='s/fromRegex/toRegex/' #or 
       --subst='s,fromRegex,toRegex,'

    Repeat as often as you need.

    Substitution expressions used to edit slot key values. The regular
    expression is executed literally in a perl eval(). If the eval fails,
    a message will be written to STDERR, and the substitution regex will be
    removed from the list.
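
    The idea can be approximated in Python as below. This is a simplified
    sketch and not equivalent to the perl eval(): it supports no flags and
    no escaped delimiters, Python replacement syntax differs from perl's
    (for example \1 rather than $1), and the function names are
    hypothetical. As in the script, an expression that fails is reported to
    STDERR and skipped.

```python
import re
import sys

def parse_subst(expr):
    """Split 's<d>from<d>to<d>' into (compiled from-regex, replacement)."""
    if len(expr) < 4 or expr[0] != "s":
        raise ValueError("expected s<delim>from<delim>to<delim>")
    parts = expr[2:].split(expr[1])  # expr[1] is the delimiter character
    if len(parts) != 3 or parts[2] != "":
        raise ValueError("expected exactly three delimiters and no flags")
    return re.compile(parts[0]), parts[1]

def apply_substs(key, exprs):
    """Apply each substitution to key in order, skipping invalid expressions."""
    for expr in exprs:
        try:
            regex, replacement = parse_subst(expr)
        except (ValueError, re.error) as err:
            print(f"skipping {expr!r}: {err}", file=sys.stderr)
            continue
        key = regex.sub(replacement, key)
    return key

# Made-up key and expressions; "bogus" is rejected and skipped.
new_key = apply_substs(
    "Expenses:Auto:Gas",
    ["s/Gas$/Fuel/", "bogus", "s,^Expenses,Spending,"],
)
```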

Examples
  gc_prune_bayes_data.pl --listA=- uncompressed_gcfile
    List all accounts in the input file to stdout. A file may be provided
    instead of -

  gc_prune_bayes_data.pl --onlyFile=only.txt source-GC-file.xml rewritten-GC-file.xml
    Operate only on the accounts specified in only.txt. Produce
    rewritten-GC-file.xml with the results of the operations specified.

  gc_prune_bayes_data.pl uncompressed_gcfile modified_gcfile
    Prune bayes data from all accounts and write an updated gcfile.

    Lincoln A. Baxter email: my initials (all three) (at) lincolnbaxter (dot)
    com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc_prune_bayes_data.pl
Type: application/x-perl
Size: 39095 bytes
Desc: not available
URL: <http://lists.gnucash.org/pipermail/gnucash-user/attachments/20160729/76b48a2c/attachment-0001.pl>

