Questions for a fresh GNUCASH ledger in 2016

Lincoln A Baxter lab at lincolnbaxter.com
Sun Jan 3 23:27:09 EST 2016


Hi Thevenin,

Your Bayes import Slot data was carried over to the new GC files so the
importer is confused about how to do the matches.  But it's fixable.

Attached is a perl script (and documentation -- which is embedded in
the script too).  I've been working on this on and off, and I believe it
will help you clean up your bayes matching data.  The script does not
modify the GC file it reads; it creates a new GC file with the name you
specify.  You can then check that file with GC before overwriting your
original GC file.

Larry (CC'ed): sorry for not getting back to you sooner. I finally got
back to working on this last weekend.  I believe I have fixed the problem
you found.  And I have subsequently run this a bunch of times to clean
up my bayes data, after a bunch of account renames etc.

The script can be used to get rid of all the data that references
accounts that no longer exist in the file.  And if you renamed accounts,
you can re-target the bayes data.

Enjoy...

Lincoln

ps: I also have a script to remove all transactions older than a
specified date (and I plan to modify it to remove all transactions after
a given date), thus allowing one to split a large GC file on a given
date.

And I just wrote one to move transaction splits.  I needed that when I
consolidated too many accounts and wanted to split things back out
again.  That script uses the from account, the to account, and a
regular expression searching transaction descriptions to find the
transactions one wants to move.



On Sat, 2016-01-02 at 12:16 -0800, Thevenin wrote:
> I've been using Gnucash for many years now, ever since living abroad.
> During
> that time, I have stopped using some foreign accounts or working in
> these
> currencies. This week I decided to create a new account by saving the
> account tree structure and recording my end-of-year balances.
> 
> Here is my problem: when I opened the new zero account file, I
> deleted the
> accounts I no longer need (i.e. pre-school or AUD currency accounts,
> etc.).
> When I enter transactions, though, the transfer suggestions persist
> in
> suggesting these deleted accounts.
> 
> Can anyone offer some help on this?
> 
> <http://gnucash.1415818.n4.nabble.com/file/n4682289/Gnucash_problem_2
> 016-01-02_temp.png> 
> 
> Thevenin
> 
> 
> 
> --
> View this message in context: http://gnucash.1415818.n4.nabble.com/Qu
> estions-for-a-fresh-GNUCASH-ledger-in-2016-tp4682289.html
> Sent from the GnuCash - User mailing list archive at Nabble.com.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc_prune_bayes_data.pl
Type: application/x-perl
Size: 38580 bytes
Desc: not available
URL: <http://lists.gnucash.org/pipermail/gnucash-user/attachments/20160103/d0df6c83/attachment-0001.pl>
-------------- next part --------------
NAME
    gc_prune_bayes_data.pl

COPYRIGHT
    Copyright (C) 2015 Lincoln A Baxter

    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation, either version 3 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    Please see the GNU General Public License at
    <http://www.gnu.org/licenses/>

ABSTRACT
    Remove Bayes data associated with one or all accounts in an uncompressed
    GnuCash XML data file.

    This enables the user to analyze, modify, trim (purge), or delete
    bayesian data from his GnuCash data file.

    The script reads an uncompressed "version 2" GnuCash XML datafile and
    removes all Bayes data that references non-existent accounts.

    The script reads the GnuCash xml file *as XML*, not as text. The script
    uses the XML::LibXML CPAN perl module to read, traverse, modify, and
    output a new XML file. It is not dependent on the formatting of the
    GnuCash XML, unlike most perl scripts this author has seen on the
    GnuCash users email list.

    Because this script does not treat the GnuCash file as text, it is not
    subject to the breakage that would occur if the formatting of the XML
    data were to change.

    Instead, this script reads the GnuCash data as a DOM structure, and
    then manipulates that structure.

    The script does not modify the input GnuCash XML datafile. To save the
    results, the user must specify an output data filename. The results of
    the script's operations are written to this file, which should then be
    opened and checked in GnuCash before replacing the original file.

    To print a usage synopsis: gc_prune_bayes_data.pl --help

    To print the synopsis, plus option descriptions: gc_prune_bayes_data.pl
    --help --verbose

    To print the entire man page: gc_prune_bayes_data.pl --man

SYNOPSIS
    gc_prune_bayes_data.pl [options] GC-file.xml [modified-GC-file.xml]

    options:

     --help                         print this synopsis help text
     --man                          print the full man page (this file's POD)
     --verbose                      increased verbosity (mainly for debugging)
     --only=only_acts.txt           (input)  only operate on these accounts
     --rename=slot_acts.txt         (input)  for renaming slot keys 
     --listKVP=kvps.txt             (output) print a Bayes KVP report by account
     --listAccounts=accounts.txt    (output) print full GC path of all accounts 
     --listBayesAccounts=filename   (output) accounts with bayes data
     --removed=removed.txt          (output) writes the XML of removed slots
     --orphaned=orphaned.txt        (output) writes orphaned slot keys
     --pathSeparator=char           GnuCash account path separator (default=:)
     --removeRegex=regex            prune slots matching regular expressions (repeatable)
     --substExpresssions=expression sed-like substitution patterns applied
                                    to slot key names (repeatable)

    The second file argument is optional. With no destination output
    file, gc_prune_bayes_data.pl runs in analytical/trial mode, and reports
    all actions taken and produces all specified analysis outputs.

    The source input file is never modified.

    If your gnucash data file is compressed, you must uncompress it first
    (on unix based OSes) as follows:

      cat gnucash_data_file | gunzip > uncompressed_gnucash_file.xml

    Or you can just uncheck the "compressed file" option in GnuCash and
    save.
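
    For readers without gunzip at hand, the same step can be sketched in
    Python (chosen here only for illustration; the script itself is perl).
    The function name and file names are hypothetical, and the sketch
    assumes the compressed file is ordinary gzip data, as the gunzip command
    above implies.

```python
import gzip
import shutil


def uncompress_gnucash(src_path, dst_path):
    """Write an uncompressed copy of a gzip-compressed file.

    The source file is only read, never modified.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

    Calling uncompress_gnucash("gnucash_data_file",
    "uncompressed_gnucash_file.xml") mirrors the shell command above.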

DISCUSSION
    Initially (when you start out with GnuCash, or with a new account), you
    have to "train" the matcher by manually identifying the destination
    account for every transaction. When you do (assuming you have enabled
    Bayesian matching on import), the matcher will remember your choices by
    storing transaction description data and the account you selected in
    key/value pairs called "slots" under the account you were importing
    into. The value of an existing slot is incremented, making the match
    stronger for the balancing account selected.

    When the matcher builds up a strong enough match for a particular
    account, the importer will propose that account as the destination
    account, and it will show you the match "strength."

    This works very well when one is consistent.

    When the matcher cannot make up its mind, the importer will ask you to
    choose a destination account.
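
    The bookkeeping described above can be pictured with a small Python
    sketch. It is not GnuCash's code: the function names, the account names,
    and the scoring (a plain sum of counts rather than a real Bayesian
    probability) are all assumptions made for illustration. Only the idea of
    per-word counters keyed by account name comes from the description
    above.

```python
from collections import defaultdict


def train(bayes, description, account):
    """Increment the counter of every whitespace-delimited word for account."""
    for token in description.split():
        bayes[token][account] += 1


def scores(bayes, description):
    """Sum the stored counts per account (a crude stand-in for real scoring)."""
    totals = defaultdict(int)
    for token in description.split():
        for account, count in bayes.get(token, {}).items():
            totals[account] += count
    return dict(totals)


# Hypothetical training data: the same vendor balanced to two accounts.
bayes_data = defaultdict(lambda: defaultdict(int))
train(bayes_data, "HOME DEPOT #1342", "Expenses:Home:Repairs")
train(bayes_data, "HOME DEPOT #1359", "Expenses:Garden")
```

    With this data, scores(bayes_data, "HOME DEPOT") gives both accounts the
    same total, which is the kind of tie in which an importer has to ask the
    user.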

    Problems can creep in for the following reasons:

    1 Accounts have been renamed
        In this case the bayes keys end up referencing account names that
        no longer exist. Unfortunately, as someone on the GC user list
        pointed out, the bayes keys reference accounts by name, not by the
        account's UUID (GUID). This means that when accounts are renamed in
        GnuCash, all the bayes data that references the original account
        name becomes, as this author puts it, "orphaned" -- it no longer
        references existing accounts.

        This author had done a lot of renames over time.

    2 Inconsistent balancing accounts selected during import
        This happens when you purchase items from the same vendor -- Home
        Depot, for instance -- but the purchases were for different
        purposes, so you balance the transactions with different expense
        accounts. In this author's experience, the matcher almost never even
        selects a destination account during import. And when it does, it
        is often not the right account for the subject transaction.

    3 You have imported unbalanced transactions
        If you allow transactions to be imported without balancing accounts,
        then GnuCash will balance them in a default "Imbalance" account.
        This might happen because you want to get the transactions imported,
        but you want to fix the imbalance transactions later. Maybe you had
        already spent a lot of time selecting accounts and wanted to get
        the import finished, but there were still some transactions that
        were unbalanced. These will be balanced by default to the Imbalance
        account. *The Bayes slot data will remember this!* Perhaps the
        importer should explicitly NOT remember unbalanced transactions.
        The author may decide to volunteer some time to work on the
        importer, and if I do, this is one "feature" I would want to fix.

  Degradation over time
    Over a long period of time, even when you have been careful with
    balancing account selections, the matching account selection can degrade
    significantly (see reason 2 above). Eventually this author's frustration
    rose to the level of desiring to "fix something" or delete all the Bayes
    data and start over. Finally this led to this script.

  Repairing GnuCash Bayes slot data
    This script will allow one to remove or repair Bayes slot data
    associated with all or individual accounts in a GnuCash xml data file.
    If you remove all Bayes data in an account, the account will be returned
    to the state it was in at the time you first created it, and you will
    have to identify balancing accounts for transactions imported into it.

    Instead of starting over, however, this script will allow you to "prune"
    or edit the bayes slot data. By default, all bayes data referencing
    accounts that no longer exist in the GnuCash data file is removed.

    The script will also allow you to "adjust" or tune the bayes data in
    one or all accounts.
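
    The default pruning rule can be sketched in Python over a plain nested
    dictionary (word -> account name -> count). This is an illustration
    only, not the script's perl code: the function name is hypothetical, and
    dropping a word entirely once it has no surviving accounts is an
    assumption of this sketch.

```python
def prune_orphans(bayes, existing_accounts):
    """Split bayes data into kept entries and the set of orphaned names.

    bayes maps word -> {account path: count}; existing_accounts is the set
    of account paths that still exist in the file.
    """
    kept, orphaned = {}, set()
    for token, accounts in bayes.items():
        live = {}
        for name, count in accounts.items():
            if name in existing_accounts:
                live[name] = count
            else:
                orphaned.add(name)  # references an account that is gone
        if live:  # assumption: words with nothing left are dropped
            kept[token] = live
    return kept, orphaned
```

    The second return value corresponds to what an orphaned-keys report
    would list, and is a natural starting point for a rename file.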

  Details: GnuCash Bayes Matching Data
    The Bayes matching data is stored in the GnuCash data file, in "slots"
    associated with accounts that have been the target of transaction
    imports, as hierarchical "Key/Value" pairs. The Bayes data slots are
    stored in a top-level "frame" slot (array) with the name
    import-map-bayes:

       <slot:key>import-map-bayes</slot:key>

    The array of slots within this slot's "frame" contains the KVPs written
    by the importer when Bayesian matching is enabled.

          <slot:key>import-map-bayes</slot:key>
          <slot:value type="frame">
            <slot>
              <slot:key>#000001342</slot:key>
              <slot:value type="frame">
                <slot>
                  <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
                  <slot:value type="integer">1</slot:value>
                </slot>
                <slot>
                  <slot:key>Expenses:Auto:Fuel</slot:key>
                  <slot:value type="integer">1</slot:value>
                </slot>
              </slot:value>
            </slot>
            <slot>
              <slot:key>#000001359</slot:key>
              <slot:value type="frame">
                <slot>
                  <slot:key>Expenses:Gifts</slot:key>
                  <slot:value type="integer">1</slot:value>
                </slot>
              </slot:value>
            </slot>

    The line

       <slot:key>#000001342</slot:key>

    in the above represents a word from a transaction description that has
    been imported. The value of this slot key is a frame (array) of slots
    that provide account names

       <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
       <slot:key>Expenses:Auto:Fuel</slot:key>

    And the matching "strength" values for the named account

       <slot:value type="integer">1</slot:value>
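
    To make the nesting concrete, here is a minimal Python sketch that reads
    such a frame with the standard xml.etree.ElementTree module instead of
    the script's XML::LibXML. The sample is a trimmed, closed-off copy of
    the fragment above; the wrapping <slot> element, the namespace URI bound
    to the slot: prefix, and the function name are assumptions of this
    sketch.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "slot:" prefix.
NS = {"slot": "http://www.gnucash.org/XML/slot"}

SAMPLE = """<slot xmlns:slot="http://www.gnucash.org/XML/slot">
  <slot:key>import-map-bayes</slot:key>
  <slot:value type="frame">
    <slot>
      <slot:key>#000001342</slot:key>
      <slot:value type="frame">
        <slot>
          <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
        <slot>
          <slot:key>Expenses:Auto:Fuel</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
      </slot:value>
    </slot>
    <slot>
      <slot:key>#000001359</slot:key>
      <slot:value type="frame">
        <slot>
          <slot:key>Expenses:Gifts</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
      </slot:value>
    </slot>
  </slot:value>
</slot>"""


def read_bayes_frame(frame_slot):
    """Return {word: {account path: strength}} for an import-map-bayes slot."""
    result = {}
    outer = frame_slot.find("slot:value", NS)
    for word_slot in outer.findall("slot", NS):
        word = word_slot.find("slot:key", NS).text
        inner = word_slot.find("slot:value", NS)
        result[word] = {
            s.find("slot:key", NS).text: int(s.find("slot:value", NS).text)
            for s in inner.findall("slot", NS)
        }
    return result


bayes_frame = read_bayes_frame(ET.fromstring(SAMPLE))
```

    Working on the parsed tree rather than on lines of text is the point the
    ABSTRACT makes: the result does not change if the file is re-indented.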

ENVIRONMENT
    Because gc_prune_bayes_data.pl reads the input XML file *as XML* using
    the CPAN XML::LibXML module, using an XPath expression to find the bayes
    matching data slots in each account, the script requires that your perl
    environment have the CPAN XML::LibXML module installed.

    The command

      perl -c -MXML::LibXML </dev/null

    will report an error if the module is not installed in your environment.
    Of course the script will report this also, because if it is not
    present, the script will not compile.

  Unix/Linux environments
    Most Linux distributions make this available via their standard
    package managers. On Debian based distributions this can be installed
    with the following command:

       sudo apt-get install libxml-libxml-perl

    On Unix/Linux environments this script should be made executable with
    the chmod command

       chmod +x gc_prune_bayes_data.pl

  Windows environments
    XML::LibXML is also available in ActivePerl, and in the Cygwin
    environment.

    This author has not tested this script on Windows, but there should be
    no reason why it will not work once the required environment is
    installed.

    On Windows the easiest way to run the script would be by using perl from
    a cmd prompt:

       perl gc_prune_bayes_data.pl

  Macintosh environments
    This author is not familiar with the OSX environment. Patches to these
    instructions are welcome.

OPTIONS
    All options may be abbreviated as long as the option is distinct from
    all other options.

  --listKVP=filename
    Specifies a file into which will be written the Bayes key/value pairs
    for each account in the GnuCash file. This is useful for analyzing your
    bayes matching data in an easy to read abbreviated (indented) format.

  --only=filename
    Specifies an input control file which is expected to contain a list of
    account paths from which import-map-bayes slots will be pruned.

  --listBayesAccounts=filename
    Specifies a file to which the account paths with bayes import data will
    be written (just names the accounts that contain import data). This can
    be used to create an input file for --onlyFile=filename.

  --listAccounts=filename
    Specifies a file to which will be written a list of accounts in the
    gnucash file. The full account path name is written. If filename is '-'
    then the accounts will be listed to STDOUT.

    This may be useful for creating the input files specified by other
    options.
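
    As a rough picture of what a "full account path" is, the following
    Python sketch joins each account's name with those of its ancestors
    using the path separator. The record layout (id, name, parent, type) and
    the rule that the root account is left out of the path are assumptions
    of this sketch, not a description of how the script walks the XML.

```python
def account_paths(accounts, separator=":"):
    """Return the sorted full paths of all non-root accounts.

    accounts is a list of dicts with the hypothetical keys
    "id", "name", "parent" (id of the parent, or None) and "type".
    """
    by_id = {a["id"]: a for a in accounts}

    def path(acct):
        parts = []
        # Walk up the parent chain, stopping at the root account.
        while acct is not None and acct.get("type") != "ROOT":
            parts.append(acct["name"])
            acct = by_id.get(acct.get("parent"))
        return separator.join(reversed(parts))

    return sorted(path(a) for a in accounts if a.get("type") != "ROOT")
```

    Paths built this way have the same shape as the slot keys shown earlier,
    such as Expenses:Auto:Fuel, which is why a renamed account no longer
    matches its old keys.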

  --orphanedFile=filename
    Specifies an output file to which will be written a list of orphaned
    account paths. This is the list of references that do not refer to an
    existing account.

    This can be used in conjunction with the --listAccounts file to create a
    renameFile.

  --renameFile=filename
    Specifies a file of from/to account paths. Bayes slot data that matches
    the 'from' path will be replaced with the 'to' path. This is the best
    way to retain slot data for accounts that have been renamed.

    The file is expected in the form of single-quoted (') 'from' and 'to'
    paths, one pair per line. The regular expression used to parse this is

         '(.*)'[:,\/\s]{1,}'(.*)'$

    Any of ":", ",", "/", space, or tab (one or more) can be used as a
    separator between the from and to account paths.
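
    The file format can be tried out with a short Python sketch that uses
    the same regular expression. The function names and sample account paths
    are hypothetical, and summing the counts when a renamed key collides
    with an existing one is an assumption of this sketch; the script may
    handle that case differently.

```python
import re

# Same pattern as quoted above (the escaped "/" needs no escape in Python).
RENAME_RE = re.compile(r"'(.*)'[:,/\s]{1,}'(.*)'$")


def parse_renames(lines):
    """Return {from_path: to_path} for every line matching the pattern."""
    renames = {}
    for line in lines:
        match = RENAME_RE.search(line.rstrip())
        if match:
            renames[match.group(1)] = match.group(2)
    return renames


def apply_renames(bayes, renames):
    """Re-target account names in {word: {account path: count}} data."""
    result = {}
    for token, accounts in bayes.items():
        merged = {}
        for name, count in accounts.items():
            target = renames.get(name, name)
            # Assumption: counts are added when two names end up identical.
            merged[target] = merged.get(target, 0) + count
        result[token] = merged
    return result
```

    A line such as 'Expenses:Auto:Gas' 'Expenses:Auto:Fuel' yields one
    from/to pair; lines that do not match the pattern are ignored here.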

  --removeRegex='regex'
    The bayesian matcher tokenizes transaction descriptions on word
    boundaries (delimited by white space). It is argued that this creates a
    lot of fairly useless key to account relationships. This option can be
    used to remove key/value pairs using regular expression matching against
    the slot keys.

    Purely numeric keys can be removed as follows:

       --removeRegex='^\d{1,}$'

    Similarly,

       --removeRegex='^#\d{1,}$'

    would remove all slots whose keys start with a hash (#) and are purely
    numeric after the hash.

    This switch may be specified multiple times to specify multiple Regexes.
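
    The effect of several such patterns can be sketched in Python as
    follows. The function name and sample keys are hypothetical; the two
    patterns are the ones shown above. Python's regular expression syntax is
    close to perl's for patterns this simple, but the two are not identical
    in general.

```python
import re


def remove_matching_keys(bayes, patterns):
    """Split {word: {...}} data into (kept, removed) by matching the words."""
    compiled = [re.compile(p) for p in patterns]
    kept, removed = {}, {}
    for token, accounts in bayes.items():
        if any(rx.search(token) for rx in compiled):
            removed[token] = accounts
        else:
            kept[token] = accounts
    return kept, removed
```

    Returning the removed entries as well as the kept ones makes it easy to
    review, or save, what a pattern would throw away before committing to
    it.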

  --saveslots=filename
    Specifies a file to which will be written the removed slots in XML
    format. Use if you want to save the XML version of the slot data for
    some paranoid reason. This was implemented because the author was
    originally paranoid.

  --verbose
    Print very verbose output to STDOUT. Used for debugging. Don't bother

  --help (or -h)            : print help text
  --path(Separator)=:       : the character used as the path separator in 
                                your GC datafile default value is a colon (:)
  --substString=substitution_pattern
       --subst='s/fromRegex/toRegex/' #or 
       --subst='s,fromRegex,toRegex,'

    repeat as often as you need

    Substitution expressions used to edit slot key values. The regular
    expression is executed literally in a perl eval(). If the eval fails,
    a message will be written to STDERR, and the substitution regex will be
    removed from the list.
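
    A simplified Python analogue of this behaviour is sketched below. It is
    not equivalent to a perl eval(): it accepts only the plain
    s<delim>from<delim>to<delim> form with no flags and no escaped
    delimiters, replaces the first match only, and uses Python rather than
    perl replacement syntax. The function names and sample keys are
    hypothetical.

```python
import re
import sys


def parse_subst(expr):
    """Parse 's/from/to/' (any single delimiter) into (compiled regex, to)."""
    if len(expr) < 4 or expr[0] != "s":
        raise ValueError("not a substitution expression: %r" % expr)
    parts = expr[2:].split(expr[1])  # expr[1] is the delimiter
    if len(parts) != 3 or parts[2] != "":
        raise ValueError("unsupported substitution expression: %r" % expr)
    return re.compile(parts[0]), parts[1]


def apply_substs(key, exprs):
    """Apply each expression in order; report and skip the ones that fail."""
    for expr in exprs:
        try:
            pattern, replacement = parse_subst(expr)
        except (ValueError, re.error) as err:
            print("skipping %r: %s" % (expr, err), file=sys.stderr)
            continue
        key = pattern.sub(replacement, key, count=1)
    return key
```

    For example, applying 's/Gas$/Fuel/' to the key Expenses:Auto:Gas
    produces Expenses:Auto:Fuel, while a malformed expression is reported on
    STDERR and skipped.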

Examples
  gc_prune_bayes_data.pl --listA=- uncompressed_gcfile
    List all accounts in the input file to stdout. A file may be provided
    instead of -

  gc_prune_bayes_data.pl --onlyFile=only.txt source-GC-file.xml rewritten-GC-file.xml
    Operate only on the accounts specified in only.txt. Produce
    rewritten-GC-file.xml with the results of the operations specified.

  gc_prune_bayes_data.pl
    Prune bayes data from all accounts and write an updated gcfile:

    gc_prune_bayes_data.pl uncompressed_gcfile modified_gcfile

    Lincoln A. Baxter email: my initials (all three) (at) lincolnbaxter (dot)
    com


