Bayesian matching- Imbalance

Sun Jul 19 20:42:27 EDT 2015

On Thu, 2015-07-16 at 18:18 -0700, Mark Sutton wrote:
> On Thu, Jul 16, 2015 at 11:11:49AM -0400, David T. wrote:
> > I will mention that some time ago, a user wrote in about pruning
> > the Bayesian matching portion of their data file. You might search
> > the archives and see whether the code for that was made available or
> > not…
> 
> Yes that perl script was posted to the list. 
> Should be sometime before the 4th of March.

I found the posting... pulled down the script... and fixed the line
breaks introduced by the web page formatting. The script now compiles
(with perl -c).  I have been studying the script... and a copy of my
data file (unzipped).

I must say, I found the script a bit too hard to understand... (it is
not because I don't know perl, I do, I'm a CPAN contributor).  

The author had obviously analyzed the xml file, and knew just what he
was looking for, and the script obviously scratched his itch, and in a
relatively surgical way.  The script leaves Bayes data intact, and only
"prunes" out slot-key values that are targeted by the configuration
variables at the top of the script. In this respect the script is
pretty slick, albeit specific to the author's itch. 

What I don't like is that the script essentially just reads the entire
gunzip'ed data file into memory (as a TEXT file) and then goes through
it line by line using regex matches to find what it will operate on. It
depends on a _formatted_ layout of the GC xml datafile. This makes it
quite fragile, though at the moment it appears it would still work
(for book version = 2.0.0 at least).

I would have preferred that the script parse the XML structure AS xml
into an internal hashmap representing that structure, and then walk
that structure to perform its edits, but I can see why the author chose to
do what he did.  It would have been much more work (but much safer and
more generically useful and maintainable in the end.)  At least the
script stimulated me to dig a little deeper... now I don't think I'll
be using it. 

If I were to build some kind of external editor in perl (and I'm mildly
tempted) I would use the XML::LibXML CPAN module to actually read the
xml and operate on the resulting structure, and then rewrite the XML
from the manipulated structure.  (I started to look into this but it is
going to take me a little more time than I have right now). This would
at least be relatively safe, and it would be possible to create a
script that would be relatively easy to maintain... and modify.  
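The read / operate / rewrite round trip described above can be sketched
roughly as follows. This is NOT the perl XML::LibXML editor I have in
mind; it swaps in Python's standard-library ElementTree purely for
illustration. The function names load_tree and save_tree are made up,
and I have not verified that GC accepts a file rewritten this way
(ElementTree may reorder namespace declarations and drop comments), so
treat it as a sketch to try on a copy only.

```python
import gzip
import xml.etree.ElementTree as ET


def load_tree(path):
    """Parse a gzip-compressed XML file and return an ElementTree.

    Namespace prefixes found in the file are registered so that the
    same prefixes are used again when the tree is written back out.
    Note: register_namespace() changes global ElementTree state.
    """
    with gzip.open(path, "rb") as fh:
        parser = ET.iterparse(fh, events=("start-ns",))
        for _event, (prefix, uri) in parser:
            ET.register_namespace(prefix, uri)
        # .root is available once the iterator has been exhausted.
        return ET.ElementTree(parser.root)


def save_tree(tree, path):
    """Write the (possibly edited) tree back as gzip-compressed XML."""
    with gzip.open(path, "wb") as fh:
        tree.write(fh, encoding="utf-8", xml_declaration=True)
```

Usage would be along the lines of tree = load_tree("copy-of-book.gnucash"),
some edits on tree.getroot(), then save_tree(tree, "edited.gnucash"),
where both file names are placeholders.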

The thing that makes any editing of this data in an editor a bit
tricky is that "slots" are used to describe just about everything in
the GC account data file, except for the highest level GC entities,
which are described with tags like xxx:yyyyyy (where xxx:yyyyyy is a
specific key name known to the core program).  

In my data file, the following is a sample of the high level slot keys
associated with an account:

	color
	hbci
	import-map	 #imported transactions (used by the match editor?)
	import-map-bayes #HERE IT IS! (it's HUGE!) (11 years worth of imports)
	last-num	#maintaining a last counter for ???
	online_id	#I think this is used by aqbanking
	reconcile-info  #about the last reconciliation

Slots appear to be used throughout the data file to provide detailed
(additional) data, and theoretically (I think) could be used to
describe any GC entity. The definition of a slot is recursive and can
nest arbitrarily deep and wide, though for the most part slots describe
arrays (of slots) usually not more than two or three levels deep. These
arrays are contained within keys whose value type is frame.  This is
infinitely flexible (and fairly opaque if looking at the file in a text
editor)... hence all the warnings we receive. Slots have the value of
keeping ALL data associated with a GC entity, whether understood by the
program or not (I think), within the GC data file. (I would like to
think that the program would ignore slots whose key names it does not
understand -- I assume that if one did not use the aqbanking stuff, the
data would just be ignored... but carried along by the base GC program.)
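To make the recursive shape a little less opaque, here is a small
sketch (Python, standard library only) that walks nested slots and
reports each key, its value type, and how many slots are nested below
it. The sample XML is a made-up, heavily simplified fragment: the
namespace URI, the "integer" type, and the account name are
placeholders, not copied from a real GC file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not taken from a real data file.
SAMPLE = """<account xmlns:slot="urn:example:slot">
  <slots>
    <slot><slot:key>color</slot:key>
      <slot:value type="string">red</slot:value></slot>
    <slot><slot:key>import-map-bayes</slot:key>
      <slot:value type="frame">
        <slot><slot:key>-</slot:key>
          <slot:value type="frame">
            <slot><slot:key>Expenses:Food</slot:key>
              <slot:value type="integer">3</slot:value></slot>
          </slot:value></slot>
      </slot:value></slot>
  </slots>
</account>"""


def local_name(tag):
    """Strip the '{namespace}' part ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def walk_slots(elem, depth=0):
    """Yield (depth, key, value_type, nested_slot_count) for each slot."""
    for child in elem:
        if local_name(child.tag) != "slot":
            # Not a slot itself; keep looking further down at same depth.
            yield from walk_slots(child, depth)
            continue
        key, value = None, None
        for part in child:
            if local_name(part.tag) == "key":
                key = (part.text or "").strip()
            elif local_name(part.tag) == "value":
                value = part
        if value is None:
            yield depth, key, None, 0
            continue
        nested = sum(1 for e in value.iter() if local_name(e.tag) == "slot")
        yield depth, key, value.get("type"), nested
        # A frame holds further slots, so recurse one level deeper.
        yield from walk_slots(value, depth + 1)


if __name__ == "__main__":
    for depth, key, vtype, nested in walk_slots(ET.fromstring(SAMPLE)):
        print("  " * depth + f"{key} ({vtype}), nested slots: {nested}")
```

Run against a real (unzipped copy of a) data file, something like this
would show quickly which keys carry the bulk of the slot data.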

Along the way of looking at this, I decided that a standard text editor
like vi was not going to cut it for understanding/editing the GC file
format (and my data file in particular, which is close to half a
million lines.)   So I looked for a FOSS xml editor, and found and
installed xmlcopyeditor which was in the Debian repositories.  This is
a pretty nice little program.  

Embedding something like this within GC (disabled by default) for
manipulating GC account data, would be REALLY cool... as one could just
browse from the accounts editor, right into the slot data (and edit it)
if one wanted, in a relatively safe way. That would be awesome!  It would
provide a mechanism helping users (and programmers!), especially when
more function specific manipulation capabilities are not supplied. 

The bottom line for removing all the bayes matching data, if one wants
to start over with matching, is that one has to remove all the slots
with the key of 

	<slot:key>import-map-bayes</slot:key>

The frame that follows this is a LONG array of data associated with
this slot.
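For anyone who would rather script the removal than do it in an xml
editor, the same operation can be sketched with Python's standard
library. This is only an illustration under assumptions: the function
name remove_slots_by_key is made up, the sample fragment is simplified
with a placeholder namespace URI, and it is not what I actually ran (I
used xmlcopyeditor). Work on a copy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not taken from a real data file.
SAMPLE = """<account xmlns:slot="urn:example:slot">
  <slots>
    <slot><slot:key>color</slot:key>
      <slot:value type="string">red</slot:value></slot>
    <slot><slot:key>import-map-bayes</slot:key>
      <slot:value type="frame">
        <slot><slot:key>-</slot:key><slot:value type="frame"/></slot>
      </slot:value></slot>
  </slots>
</account>"""


def local_name(tag):
    """Strip the '{namespace}' part ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def slot_key(elem):
    """Return the key text of a <slot> element, or None if not a slot."""
    if local_name(elem.tag) != "slot":
        return None
    for part in elem:
        if local_name(part.tag) == "key":
            return (part.text or "").strip()
    return None


def remove_slots_by_key(elem, key_name="import-map-bayes"):
    """Remove every slot with the given key; return how many were removed.

    A matching slot is dropped together with its whole frame, so the
    nested slots inside it are not visited or counted separately.
    """
    removed = 0
    for child in list(elem):  # copy: we modify elem while looping
        if slot_key(child) == key_name:
            elem.remove(child)
            removed += 1
        else:
            removed += remove_slots_by_key(child, key_name)
    return removed


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print("slots removed:", remove_slots_by_key(root))
```

All other slots (color, reconcile-info, and so on) are left untouched,
since only the key text is compared.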

xmlcopyeditor does appear to provide a _relatively_ safe way to do this
at least on unix (linux) based platforms.  Other xml editors exist for
windows. That said, I still had to be very careful, and I operated on a
copy of the data.

I did this and tested the resulting file, and it seems to read into GC
without error.  A trial import definitely matched nothing. :-)

Lincoln

I have not used it yet, since if I have to edit
the file by hand to remove the massive
amount of junk it has accumulated over the years,
I want to be able to stop it from adding more afterwards.

It has gone from matching 90+% to not matching 70%, 
and exporting the chart of accounts includes all the bayes keys.

It is sad to hear the data in SQL is more difficult to work with
than the XML. I had hoped the bayes data would be in its own table
or column, or have keys a SELECT could fetch and delete.

I thought the trade-off with XML was an increase in storage requirements
in return for an easier ability for the uninitiated to read and understand
the organization of the data: self-describing data.
But 12 instances of the word "slot" to store a tuple of token, account name
and weight? For a token of "-"? 

The data file should really have the suffix .slot, not .gnucash

Don't get me wrong, I like gnucash, it makes perfect sense to me.
I'm not saying double entry bookkeeping was the only thing of value
to come from Venice, but it is at the top. Once one groks debit/credit
nothing else will suffice.

cheers, mark