Questions for a fresh GNUCASH ledger in 2016
Lincoln A Baxter
lab at lincolnbaxter.com
Sun Jan 3 23:27:09 EST 2016
Hi Thevenin,
Your Bayes import Slot data was carried over to the new GC files so the
importer is confused about how to do the matches. But it's fixable.
Attached is a perl script (and documentation -- which is embedded in
the script too). I've been working on this on and off, and I believe it
will help you clean up your bayes matching data. The script does not
modify the GC file it reads; it creates a new GC file with the name you
specify. You can then check that file with GC before overwriting your
original GC file.
Larry (CC'ed): sorry for not getting back to you sooner. I finally got
back to working on this last weekend. I believe I have fixed the problem
you found. And I have subsequently run this a bunch of times to clean
up my bayes data, after a bunch of account renames etc.
The script can be used to get rid of all the data that references
accounts that no longer exist in the file. And if you renamed accounts,
you can re-target the bayes data.
Enjoy...
Lincoln
ps: I also have a script to remove all transactions older than a
specified date (and I plan to modify it to remove all transactions after
a given date, thus allowing one to split a large GC file on a given
date).
And I just wrote one to move transaction splits. I needed that when I
consolidated too many accounts and wanted to split things back out
again. That script uses the from account, the to account, and a
regular expression searching transaction descriptions to find the
transactions one wants to move.
On Sat, 2016-01-02 at 12:16 -0800, Thevenin wrote:
> I've been using Gnucash for many years now, ever since living abroad.
> During
> that time, I have stopped using some foreign accounts or working in
> these
> currencies. This week I decided to create a new account by saving the
> account tree structure and recording my end-of-year balances.
>
> Here is my problem: when I opened the new zero account file, I
> deleted the
> accounts I no longer need (i.e. pre-school or AUD currency accounts,
> etc.).
> When I enter transactions, though, the transfer suggestions persist
> in
> suggesting these deleted accounts.
>
> Can anyone offer some help on this?
>
> <http://gnucash.1415818.n4.nabble.com/file/n4682289/Gnucash_problem_2
> 016-01-02_temp.png>
>
> Thevenin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc_prune_bayes_data.pl
Type: application/x-perl
Size: 38580 bytes
Desc: not available
URL: <http://lists.gnucash.org/pipermail/gnucash-user/attachments/20160103/d0df6c83/attachment-0001.pl>
-------------- next part --------------
NAME
gc_prune_bayes_data.pl
COPYRIGHT
Copyright (C) 2015 Lincoln A Baxter
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
Please see the GNU General Public License at
<http://www.gnu.org/licenses/>
ABSTRACT
Remove Bayes data associated with one or all accounts in an uncompressed
GnuCash XML data file.
This enables the user to analyze, modify, trim (purge), or delete
Bayesian data from a GnuCash data file.
The script reads an uncompressed "version 2" GnuCash XML datafile and
removes all Bayes data that references non-existent accounts.
The script reads the GnuCash XML file *as XML*, not as text. The script
uses the XML::LibXML CPAN perl module to read, traverse, modify, and
output a new XML file. It is not dependent on the formatting of the
GnuCash XML, unlike most perl scripts this author has seen on the
GnuCash users email list.
Because this script does not treat the GnuCash file as text, it is not
subject to the breakage that would occur if the formatting of the XML
data were to change.
Instead, this script reads the GnuCash data into a DOM structure, and
then manipulates that structure.
The script does not modify the input GnuCash XML datafile. The user must
specify an output data filename. The results of the script's operations
are written to this file, which should then be opened and checked in
GnuCash before replacing the original file.
To print a usage synopsis: gc_prune_bayes_data.pl --help
To print the synopsis, plus option descriptions: gc_prune_bayes_data.pl
--help --verbose
To print the entire man page: gc_prune_bayes_data.pl --man
SYNOPSIS
gc_prune_bayes_data.pl [options] GC-file.xml [modified-GC-file.xml]
options:
  --help                          print this synopsis help text
  --man                           print the full man page (this file's POD)
  --verbose                       increased verbosity (mainly for debugging)
  --only=only_acts.txt            (input) only operate on these accounts
  --rename=slot_acts.txt          (input) for renaming slot keys
  --listKVP=kvps.txt              (output) print a Bayes KVP report by account
  --listAccounts=accounts.txt     (output) print full GC path of all accounts
  --listBayesAccounts=filename    (output) accounts with bayes data
  --removed=removed.txt           (output) writes the XML of removed slots
  --orphaned=orphaned.txt         (output) writes orphaned slot keys
  --pathSeparator=char            GnuCash account path separator (default=:)
  --removeRegex=regex             prune slots matching regular expressions (repeatable)
  --substExpresssions=expression  sed-like substitution patterns applied
                                  to slot key names (repeatable)
The second file argument is optional. With no destination output
file, gc_prune_bayes_data.pl runs in analytical/trial mode: it reports
all actions taken and produces all specified analysis outputs.
The source input file is never modified.
If your GnuCash data file is compressed you must uncompress it first (on
Unix-based OSes) as follows:
cat gnucash_data_file | gunzip > uncompressed_gnucash_file.xml
Or you can just uncheck the "compressed file" option in GnuCash and
save.
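Where gunzip is not available (for example on Windows), the same step can be sketched in Python. This is only an illustration and not part of gc_prune_bayes_data.pl; the function name and the file names are hypothetical.

```python
import gzip
import shutil


def uncompress_gnucash_file(src, dst):
    """Write an uncompressed copy of a gzip-compressed file; src is left untouched."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Example call (hypothetical file names):
# uncompress_gnucash_file("gnucash_data_file", "uncompressed_gnucash_file.xml")
```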
DISCUSSION
Initially (when you start out with GnuCash, or with a new account), you
have to "train" the matcher by manually identifying the destination
account for every transaction. When you do (assuming you have enabled
Bayesian matching on import), the matcher will remember your choices by
storing transaction description data and the account you selected in
key/value pairs called "slots" under the account you were importing into.
The values of existing slots will be incremented, making the match
stronger for the balancing account selected.
When the matcher builds up a strong enough match for a particular
account, the importer will propose that account as the destination
account, and it will show you the match "strength."
This works very well when one is consistent.
When the matcher cannot make up its mind, the importer will ask you to
choose a destination account.
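The bookkeeping described above can be pictured with a small Python sketch. It only shows how per-word counts accumulate for each chosen account; it is not GnuCash's matcher code, it does not compute any match probability, and the descriptions and account names are made up for illustration.

```python
from collections import defaultdict

# word from a description -> account path -> count (the "strength" value)
counts = defaultdict(lambda: defaultdict(int))


def train(description, account):
    """Record that a transaction with this description was balanced to account."""
    for token in description.split():  # split on white space
        counts[token][account] += 1


train("HOME DEPOT #1342", "Expenses:Home:Repairs")
train("HOME DEPOT #1359", "Expenses:Garden")
train("HOME DEPOT #1342", "Expenses:Home:Repairs")

for token in sorted(counts):
    print(token, dict(counts[token]))
```

In this toy data the words HOME and DEPOT point at two different accounts, which is the kind of inconsistency that weakens the suggestions over time.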
Problems can creep in for the following reasons:
1 Accounts have been renamed
In this case the bayes keys end up referencing account names that
no longer exist. Unfortunately, as someone on the GC user list
pointed out, the bayes keys reference accounts by name, not by the
account's UUID (GUID). This means that when accounts are renamed in
GnuCash, all the bayes keys that reference the original account name
become, as this author puts it, "orphaned" -- they no longer
reference existing accounts.
This author had done a lot of renames over time.
2 Inconsistent balancing accounts selected during import
This happens when you purchase items from the same vendor -- Home
Depot, for instance -- but the purchases were for different
purposes, so you balance the transactions with different expense
accounts. In this author's experience, the matcher almost never even
selects a destination account during import. And when it does, it
is often not the right account for the subject transaction.
3 You have imported unbalanced transactions
If you allow transactions to be imported without balancing accounts,
then GnuCash will balance them in a default "Imbalance" account.
This might happen because you want to get the transactions imported,
but you want to fix the imbalanced transactions later. Maybe you had
already spent a lot of time selecting accounts and wanted to get the
import finished, but there were still some transactions that were
unbalanced. These will be balanced by default to the Imbalance account.
*The Bayes slot data will remember this!* Perhaps the importer
should explicitly NOT remember unbalanced transactions. The author
may decide to volunteer some time to work on the importer, and if I
do, this is one "feature" I would want to fix.
Degradation over time
Over a long period of time, even when you have been careful with
balancing account selections, the matching account selection can degrade
significantly (see reason 2 above). Eventually this author's frustration
rose to the level of desiring to "fix something" or delete all the Bayes
data and start over. Finally this led to this script.
Repairing GnuCash Bayes slot data
This script will allow one to remove or repair Bayes slot data associated
with all or individual accounts in a GnuCash XML data file. If you
remove all Bayes data in an account, the account will be returned to the
state it was in at the time you first created it, and you will have to
identify balancing accounts for transactions imported into it.
Instead of starting over, however, this script will allow you to "prune"
or edit the bayes slot data. By default, all bayes data referencing
accounts that no longer exist in the GnuCash data file is removed.
The script will also allow you to "adjust" or tune the bayes data in
one or all accounts.
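The default pruning rule (drop bayes data whose account name no longer exists) can be sketched in Python as follows. This is an illustration only, not the perl script's code: the namespace URI, the sample data, and the choice to also drop a word slot once it has no accounts left are all assumptions.

```python
import xml.etree.ElementTree as ET

SLOT_NS = "http://www.gnucash.org/XML/slot"  # assumed URI for the slot: prefix
KEY = "{%s}key" % SLOT_NS
VALUE = "{%s}value" % SLOT_NS


def prune_orphans(bayes_slot, existing_paths):
    """Remove account slots whose key is not an existing account path.

    Word slots left with no accounts are removed as well (an assumption).
    Returns the removed (word, account path) pairs.
    """
    removed = []
    frame = bayes_slot.find(VALUE)
    for token_slot in list(frame.findall("slot")):
        inner = token_slot.find(VALUE)
        for acct_slot in list(inner.findall("slot")):
            path = acct_slot.find(KEY).text
            if path not in existing_paths:
                inner.remove(acct_slot)
                removed.append((token_slot.find(KEY).text, path))
        if not inner.findall("slot"):
            frame.remove(token_slot)
    return removed


# Made-up sample: one import-map-bayes slot with two word slots.
SAMPLE = """<slot xmlns:slot="http://www.gnucash.org/XML/slot">
<slot:key>import-map-bayes</slot:key>
<slot:value type="frame">
<slot><slot:key>DEPOT</slot:key><slot:value type="frame">
<slot><slot:key>Expenses:Home</slot:key><slot:value type="integer">3</slot:value></slot>
<slot><slot:key>Expenses:Old Name</slot:key><slot:value type="integer">2</slot:value></slot>
</slot:value></slot>
<slot><slot:key>PRESCHOOL</slot:key><slot:value type="frame">
<slot><slot:key>Expenses:Pre-school</slot:key><slot:value type="integer">5</slot:value></slot>
</slot:value></slot>
</slot:value>
</slot>"""

bayes = ET.fromstring(SAMPLE)
removed = prune_orphans(bayes, {"Expenses:Home", "Expenses:Gifts"})
print(removed)
```

The list of removed pairs plays the role of a report; the modified tree would then be written to a new file rather than over the input.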
Details: GnuCash Bayes Matching Data
The Bayes matching data is stored in the GnuCash data file, in "slots"
associated with accounts that have been the target of transaction
imports, as hierarchical "Key/Value" pairs. The Bayes data slots are
stored in a top-level "frame" slot (array) with the name import-map-bayes:
<slot:key>import-map-bayes</slot:key>
The array of slots within this slot's "frame" contains the KVPs written by
the importer when Bayesian matching is enabled.
<slot:key>import-map-bayes</slot:key>
<slot:value type="frame">
  <slot>
    <slot:key>#000001342</slot:key>
    <slot:value type="frame">
      <slot>
        <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
        <slot:value type="integer">1</slot:value>
      </slot>
      <slot>
        <slot:key>Expenses:Auto:Fuel</slot:key>
        <slot:value type="integer">1</slot:value>
      </slot>
    </slot:value>
  </slot>
  <slot>
    <slot:key>#000001359</slot:key>
    <slot:value type="frame">
      <slot>
        <slot:key>Expenses:Gifts</slot:key>
        <slot:value type="integer">1</slot:value>
      </slot>
    </slot:value>
  </slot>
The line
<slot:key>#000001342</slot:key>
in the above represents a word from a transaction description that has
been imported. The value of this slot key is a frame (array) of slots
that provide account names
<slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
<slot:key>Expenses:Auto:Fuel</slot:key>
and the matching "strength" values for the named account
<slot:value type="integer">1</slot:value>
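To make the nesting concrete, here is a minimal Python sketch that reads one import-map-bayes slot into a dictionary of word -> account -> strength. It is not the script's own code. The sample reuses the keys and values from the excerpt above, wrapped in a complete slot element; the namespace URI is an assumption.

```python
import xml.etree.ElementTree as ET

SLOT_NS = "http://www.gnucash.org/XML/slot"  # assumed URI for the slot: prefix
KEY = "{%s}key" % SLOT_NS
VALUE = "{%s}value" % SLOT_NS

SAMPLE = """<slot xmlns:slot="http://www.gnucash.org/XML/slot">
  <slot:key>import-map-bayes</slot:key>
  <slot:value type="frame">
    <slot>
      <slot:key>#000001342</slot:key>
      <slot:value type="frame">
        <slot>
          <slot:key>Expenses:Auto:Repair and Maintenance</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
        <slot>
          <slot:key>Expenses:Auto:Fuel</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
      </slot:value>
    </slot>
    <slot>
      <slot:key>#000001359</slot:key>
      <slot:value type="frame">
        <slot>
          <slot:key>Expenses:Gifts</slot:key>
          <slot:value type="integer">1</slot:value>
        </slot>
      </slot:value>
    </slot>
  </slot:value>
</slot>"""


def read_bayes_frame(bayes_slot):
    """Return {word: {account path: strength}} for one import-map-bayes slot."""
    table = {}
    for token_slot in bayes_slot.find(VALUE).findall("slot"):
        accounts = {}
        for acct_slot in token_slot.find(VALUE).findall("slot"):
            accounts[acct_slot.find(KEY).text] = int(acct_slot.find(VALUE).text)
        table[token_slot.find(KEY).text] = accounts
    return table


table = read_bayes_frame(ET.fromstring(SAMPLE))
for word, accounts in table.items():
    print(word, accounts)
```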
ENVIRONMENT
Because gc_prune_bayes_data.pl reads the input XML file *as XML* using
the CPAN XML::LibXML module, using an XPath expression to find the bayes
matching data slots in each account, the script requires that your perl
environment have the CPAN XML::LibXML module installed.
The command
perl -c -MXML::LibXML </dev/null
will report an error if the module is not installed in your environment.
Of course the script will report this also, because if it is not
present, the script will not compile.
Unix/Linux environments
Most Linux distributions make this available via their standard
package managers. On Debian-based distributions this can be installed with
the following command:
sudo apt-get install libxml-libxml-perl
On Unix/Linux environments this script should be made executable with
the chmod command
chmod +x gc_prune_bayes_data.pl
Windows environments
XML::LibXML is also available in ActivePerl, and in the Cygwin
environment.
This author has not tested this script on Windows, but there should be
no reason why it will not work once the required environment is
installed.
On Windows the easiest way to run the script would be by using perl from
a cmd prompt:
perl gc_prune_bayes_data.pl
Macintosh environments
This author is not familiar with the OSX environment. Patches to these
instructions are welcome.
OPTIONS
All options may be abbreviated as long as the option is distinct from
all other options.
--listKVP=filename
Specifies a file into which will be written the Bayes key/value pairs
for each account in the GnuCash file. This is useful for analyzing your
bayes matching data in an easy to read abbreviated (indented) format.
--only=filename
Specifies an input control file which is expected to contain a list of
account paths from which import-map-bayes slots will be pruned.
--listBayesAccounts=filename
Specifies a file to which the account paths with bayes import data will
be written (just names the accounts that contain import data). This can
be used to create an input file for --onlyFile=filename.
--listAccounts=filename
Specifies a file to which will be written a list of accounts in the
GnuCash file. The full account path name is written. If filename is '-'
then the accounts will be listed to STDOUT.
This may be useful for creating the input files specified
by other options.
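How a full account path can be derived from the file is sketched below in Python. This is a guess at the approach, not the script's code: it assumes each account element carries a name, a GUID, a type, and its parent's GUID under the element names and namespace URIs shown, and that the top-level ROOT account is left out of the path. The sample accounts and GUIDs are made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the gnc: and act: prefixes.
NS = {
    "gnc": "http://www.gnucash.org/XML/gnc",
    "act": "http://www.gnucash.org/XML/act",
}


def account_paths(root, separator=":"):
    """Return {guid: full account path}, leaving out the ROOT account."""
    names, parents, types = {}, {}, {}
    for acct in root.iter("{%s}account" % NS["gnc"]):
        guid = acct.findtext("act:id", namespaces=NS)
        names[guid] = acct.findtext("act:name", namespaces=NS)
        parents[guid] = acct.findtext("act:parent", namespaces=NS)
        types[guid] = acct.findtext("act:type", namespaces=NS)

    def path(guid):
        parts = []
        # Walk up the parent chain until the ROOT account or a missing parent.
        while guid in names and types[guid] != "ROOT":
            parts.append(names[guid])
            guid = parents[guid]
        return separator.join(reversed(parts))

    return {guid: path(guid) for guid in names if types[guid] != "ROOT"}


SAMPLE = """<book xmlns:gnc="http://www.gnucash.org/XML/gnc"
      xmlns:act="http://www.gnucash.org/XML/act">
  <gnc:account version="2.0.0">
    <act:name>Root Account</act:name>
    <act:id type="guid">r0</act:id>
    <act:type>ROOT</act:type>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Expenses</act:name>
    <act:id type="guid">e1</act:id>
    <act:type>EXPENSE</act:type>
    <act:parent type="guid">r0</act:parent>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Auto</act:name>
    <act:id type="guid">a2</act:id>
    <act:type>EXPENSE</act:type>
    <act:parent type="guid">e1</act:parent>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Fuel</act:name>
    <act:id type="guid">f3</act:id>
    <act:type>EXPENSE</act:type>
    <act:parent type="guid">a2</act:parent>
  </gnc:account>
</book>"""

paths = account_paths(ET.fromstring(SAMPLE))
for p in sorted(paths.values()):
    print(p)
```

The separator argument corresponds to the path separator option; a set of these paths is what bayes slot keys would be compared against.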
--orphanedFile=filename
Specifies an output file to which will be written a list of orphaned
account paths. This is the list of references that do not refer to an
existing account.
This can be used in conjunction with the --listAccounts file to create a
renameFile.
--renameFile=filename
Specifies a file of from/to account paths. Bayes slot data that matches
the 'from' path will be replaced with the 'to' path. This is the best
way to retain slot data for accounts that have been renamed.
The file is expected in the form of single-quoted (') 'from' and 'to'
paths, one pair per line. The regular expression used to parse this is
'(.*)'[:,\/\s]{1,}'(.*)'$
Any of ":", ",", "/", space, or tab (one or more) can be used as
separators between the from and to account paths.
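The line format can be tried out with the same regular expression in Python. The sketch below is illustrative only: the account names are made up, and skipping lines that do not match is an assumption about the behavior, not something the documentation states.

```python
import re

# The regular expression quoted above, unchanged.
RENAME_LINE = re.compile(r"'(.*)'[:,\/\s]{1,}'(.*)'$")


def parse_rename_lines(lines):
    """Return {from_path: to_path} for every line that matches the pattern."""
    renames = {}
    for line in lines:
        match = RENAME_LINE.search(line.rstrip("\n"))
        if match:  # assumption: non-matching lines are ignored
            renames[match.group(1)] = match.group(2)
    return renames


sample = [
    "'Expenses:Auto:Gas' 'Expenses:Auto:Fuel'\n",
    "'Expenses:Kids:Pre-school', 'Expenses:Kids:School'\n",
    "this line has no quoted paths and is skipped\n",
]
renames = parse_rename_lines(sample)
for old, new in renames.items():
    print(old, "->", new)
```

Because both paths are quoted, colons inside an account path are not confused with a colon used as the separator.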
--removeRegex='regex'
The bayesian matcher tokenizes transaction descriptions on word
boundaries (delimited by white space). It is argued that this creates a
lot of fairly useless key to account relationships. This option can be
used to remove key/value pairs using regular expression matching against
the slot keys.
Purely numeric keys can be removed as follows:
--removeRegex='^\d{1,}$'
Similarly,
--removeRegex='^#\d{1,}$'
would remove all slots whose keys start with a hash (#) and are purely
numeric after the hash.
This switch may be specified multiple times to specify multiple Regexes.
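The effect of the two patterns above can be checked against a handful of keys before running the script. The Python sketch below uses made-up keys and a hypothetical helper name; it only shows which keys each pattern selects and says nothing about how the script itself applies them.

```python
import re


def keys_to_remove(keys, patterns):
    """Return the keys matched by at least one of the regular expressions."""
    compiled = [re.compile(p) for p in patterns]
    return [k for k in keys if any(c.search(k) for c in compiled)]


keys = ["#000001342", "20160103", "DEPOT", "HOME", "#12A"]

purely_numeric = keys_to_remove(keys, [r"^\d{1,}$"])
hash_numeric = keys_to_remove(keys, [r"^#\d{1,}$"])
both = keys_to_remove(keys, [r"^\d{1,}$", r"^#\d{1,}$"])

print(purely_numeric)
print(hash_numeric)
print(both)
```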
--saveslots=filename
Specifies a file to which will be written the removed slots in XML
format. Use if you want to save the XML version of the slot data for
some paranoid reason. This was implemented because the author was
originally paranoid.
--verbose
Print very verbose output to STDOUT. Used for debugging. Don't bother
--help (or -h) : print help text
--path(Separator)=: : the character used as the path separator in
your GC datafile default value is a colon (:)
--substString=substitution_pattern
--subst='s/fromRegex/toRegex/' #or
--subst='s,fromRegex,toRegex,'
repeat as often as you need
Substitution expressions used to edit slot key values. The regular
expression is executed literally in a perl eval(). If the eval fails,
a message will be written to STDERR, and the substitution regex will be
removed from the list.
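A rough Python analogue of such a substitution is sketched below, to show what an expression does to a key. It is not how the script works (the script evaluates the expression in perl). The helper names are hypothetical, and this version handles only plain s/from/to/ forms with a single-character delimiter, no flags, no escaped delimiters, and a plain-text replacement.

```python
import re


def parse_subst(expr):
    """Split 's/from/to/' (any single delimiter) into (pattern, replacement)."""
    if len(expr) < 4 or expr[0] != "s":
        raise ValueError("not a substitution expression: %r" % expr)
    delim = expr[1]
    parts = expr[2:].split(delim)
    if len(parts) != 3 or parts[2] != "":
        raise ValueError("unsupported substitution expression: %r" % expr)
    return parts[0], parts[1]


def apply_subst(key, expr):
    """Apply the substitution to the first match only, like s/// without /g."""
    pattern, replacement = parse_subst(expr)
    return re.sub(pattern, replacement, key, count=1)


print(apply_subst("Expenses:Auto:Gas", "s/Gas$/Fuel/"))
print(apply_subst("Expenses:Kids:Pre-school", "s,^Expenses:Kids,Expenses:Children,"))
```

The comma-delimited form is convenient when the pattern itself contains a slash.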
Examples
gc_prune_bayes_data.pl --listA=- uncompressed_gcfile
List all accounts in the input file to stdout. A file may be provided
instead of -
gc_prune_bayes_data.pl --onlyFile=only.txt source-GC-file.xml rewritten-GC-file.xml
Operate only on the accounts specified in only.txt. Produce
rewritten-GC-file.xml with the results of the operations specified.
gc_prune_bayes_data.pl uncompressed_gcfile modified_gcfile
Prune bayes data from all accounts and write an updated gcfile.
Lincoln A. Baxter email: my intials (all three) (at) lincolnbaxter (dot)
com