Bayesian matching- Imbalance

Lincoln A Baxter lab at lincolnbaxter.com
Tue Jul 28 07:49:26 EDT 2015


Hi John and others,

OK, so I've been scratching my itch and learning the structure of the
GnuCash XML file and how slots are used.  I've now got several Perl
scripts which use the CPAN XML::LibXML module to edit a GnuCash file by
parsing it as XML and then traversing the Document Object Model
using LibXML functions and XPath queries.
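
To illustrate the general approach (parse the file as XML, then walk the
tree with path queries), here is a minimal sketch in Python using only the
standard library. It is not one of the Perl scripts described here. The
namespace URIs, the element layout, and the tiny sample document are
assumptions based on what an uncompressed GnuCash XML file typically looks
like, and top_level_slot_keys is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the root element of your own file.
NS = {
    "gnc": "http://www.gnucash.org/XML/gnc",
    "act": "http://www.gnucash.org/XML/act",
    "slot": "http://www.gnucash.org/XML/slot",
}

# Tiny made-up document standing in for a real (uncompressed) data file.
SAMPLE = """<gnc-v2 xmlns:gnc="http://www.gnucash.org/XML/gnc"
        xmlns:act="http://www.gnucash.org/XML/act"
        xmlns:slot="http://www.gnucash.org/XML/slot">
  <gnc:account version="2.0.0">
    <act:name>Checking</act:name>
    <act:slots>
      <slot>
        <slot:key>color</slot:key>
        <slot:value type="string">Not Set</slot:value>
      </slot>
      <slot>
        <slot:key>import-map-bayes</slot:key>
        <slot:value type="frame">
          <slot>
            <slot:key>GROCERY</slot:key>
            <slot:value type="frame"/>
          </slot>
        </slot:value>
      </slot>
    </act:slots>
  </gnc:account>
</gnc-v2>"""


def top_level_slot_keys(root):
    """Map each account name to the keys of its top-level slots."""
    result = {}
    for account in root.iterfind(".//gnc:account", NS):
        name = account.findtext("act:name", default="", namespaces=NS)
        # Only direct children of act:slots; nested slots are not visited.
        result[name] = [
            slot.findtext("slot:key", namespaces=NS)
            for slot in account.iterfind("act:slots/slot", NS)
        ]
    return result


root = ET.fromstring(SAMPLE)
print(top_level_slot_keys(root))
```

Because the tree is queried by element name rather than by line layout,
this does not care how the file happens to be indented or wrapped.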

One script simply removes all Bayes data from the GnuCash file (or
from specified accounts), while the other (more interesting) prunes
the Bayes slots that reference accounts that no longer exist in the file.
I'd like to offer these to the GnuCash community.
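
For a rough idea of what the first kind of script does, here is a minimal
Python sketch (standard library only, not the Perl script itself). It
drops every top-level slot keyed import-map-bayes, optionally only for
named accounts. The namespace URIs, the sample document, and the
remove_bayes_slots helper are all assumptions made for illustration. Work
on a copy of your data.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the root element of your own file.
NS = {
    "gnc": "http://www.gnucash.org/XML/gnc",
    "act": "http://www.gnucash.org/XML/act",
    "slot": "http://www.gnucash.org/XML/slot",
}
BAYES_KEY = "import-map-bayes"

# Tiny made-up document standing in for a real (uncompressed) data file.
SAMPLE = """<gnc-v2 xmlns:gnc="http://www.gnucash.org/XML/gnc"
        xmlns:act="http://www.gnucash.org/XML/act"
        xmlns:slot="http://www.gnucash.org/XML/slot">
  <gnc:account version="2.0.0">
    <act:name>Checking</act:name>
    <act:slots>
      <slot>
        <slot:key>color</slot:key>
        <slot:value type="string">Not Set</slot:value>
      </slot>
      <slot>
        <slot:key>import-map-bayes</slot:key>
        <slot:value type="frame"/>
      </slot>
    </act:slots>
  </gnc:account>
  <gnc:account version="2.0.0">
    <act:name>Savings</act:name>
    <act:slots>
      <slot>
        <slot:key>import-map-bayes</slot:key>
        <slot:value type="frame"/>
      </slot>
    </act:slots>
  </gnc:account>
</gnc-v2>"""


def remove_bayes_slots(root, account_names=None):
    """Remove import-map-bayes slots in place; return how many were removed.

    If account_names is given, only accounts with those names are touched.
    """
    removed = 0
    for account in root.iterfind(".//gnc:account", NS):
        name = account.findtext("act:name", default="", namespaces=NS)
        if account_names is not None and name not in account_names:
            continue
        slots = account.find("act:slots", NS)
        if slots is None:
            continue
        # Copy the child list first so removal does not disturb iteration.
        for slot in list(slots.findall("slot")):
            if slot.findtext("slot:key", namespaces=NS) == BAYES_KEY:
                slots.remove(slot)  # takes the whole nested frame with it
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
print(remove_bayes_slots(root, account_names={"Checking"}))  # one account only
print(remove_bayes_slots(root))                               # all remaining
print(BAYES_KEY in ET.tostring(root, encoding="unicode"))
```

Removing the slot element removes its whole value frame too, so the long
array of match data goes with it, while other slots such as color stay.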

What is the preferred method of sharing scripts?

Can I attach them to an email to the list, or will attachments be
stripped by the list server?

If they would be stripped, then I guess they would have to be pasted into
the text of the email.

Lincoln


On Sun, 2015-07-19 at 20:42 -0400, Lincoln A Baxter wrote:
> On Thu, 2015-07-16 at 18:18 -0700, Mark Sutton wrote:
> > On Thu, Jul 16, 2015 at 11:11:49AM -0400, David T. wrote:
> > > I will mention that some time ago, a user wrote in about pruning
> > > the Bayesian matching portion of their data file. You might search
> > > the archives and see whether the code for that was made available
> > > or not…
> > 
> > Yes that perl script was posted to the list. 
> > Should be sometime before the 4th of March.
> 
> I found the posting, pulled down the script, and fixed the line breaks
> created by the web page formatting. The script now compiles (with
> perl -c).  I have been studying the script and a copy of my data file
> (unzipped).
> 
> I must say, I found the script a bit too hard to understand (and it is
> not because I don't know Perl; I do, I'm a CPAN contributor).
> 
> The author had obviously analyzed the XML file and knew just what he
> was looking for, and the script obviously scratched his itch in a
> relatively surgical way.  The script leaves Bayes data intact, and only
> "prunes" out slot-key values that are targeted by the configuration
> variables at the top of the script. In this respect the script is
> pretty slick, though specific to the author's itch.
> 
> What I don't like is that the script essentially just reads the entire
> gunzip'ed data file into memory (as a TEXT file) and then goes through
> it line by line using regex matches to find what it will operate on. It
> depends on a _formatted_ layout of the GC XML data file. This makes it
> quite fragile, though at the moment it appears it would still work
> (for book version = 2.0.0 at least).
> 
> I would have preferred that the script parse the XML structure AS XML
> into an internal hashmap representing that structure, and then walk
> that structure to perform its edits, but I can see why the author chose
> to do what he did.  Parsing would have been much more work (but much
> safer and more generically useful and maintainable in the end).  At
> least the script stimulated me to dig a little deeper, though now I
> don't think I'll be using it.
> 
> If I were to build some kind of external editor in Perl (and I'm mildly
> tempted) I would use the XML::LibXML CPAN module to actually read the
> XML and operate on the resulting structure, and then rewrite the XML
> from the manipulated structure.  (I started to look into this but it is
> going to take me a little more time than I have right now.) This would
> at least be relatively safe, and it would be possible to create a
> script that would be relatively easy to maintain and modify.
> 
> The thing that makes any editing of this data in an editor a bit
> tricky is that "slots" are used to describe just about everything in
> the GC account data file except for the highest-level GC entities,
> which are described with tags like xxx:yyyyyy (where xxx:yyyyyy is a
> specific key name known to the core program).
> 
> In my data file, the following is a sample of the high-level slot data
> associated with an account:
> 
> 	color
> 	hbci
> 	import-map        #imported transactions (used by the match editor?)
> 	import-map-bayes  #HERE IT IS! (it's HUGE!) (11 years' worth of imports)
> 	last-num          #maintaining a last counter for ???
> 	online_id         #I think this is used by aqbanking
> 	reconcile-info    #about the last reconciliation
> 
> Slots appear to be used throughout the data file to provide detailed
> (additional) data, and theoretically (I think) could be used to
> describe any GC entity. The definition of a slot is recursive and can
> go infinitely deep and wide, though for the most part slots describe
> arrays (of slots) usually not more than two or three levels deep. These
> arrays are contained within keys whose value type is frame.  This is
> infinitely flexible (and fairly opaque if looking at the file in a text
> editor), hence all the warnings we receive. Slots have the value of
> keeping ALL data associated with a GC entity, whether understood by the
> program or not (I think), within the GC data file. (I would like to
> think that the program would ignore slots whose key names it does not
> understand -- I assume that if one did not use the aqbanking stuff, the
> data would just be ignored but carried along by the base GC program.)
> 
> Along the way of looking at this, I decided that a standard text editor
> like vi was not going to cut it for understanding/editing the GC file
> format (and my data file in particular, which is close to half a
> million lines).  So I looked for a FOSS XML editor, and found and
> installed xmlcopyeditor, which was in the Debian repositories.  This is
> a pretty nice little program.
> 
> Embedding something like this within GC (disabled by default) for
> manipulating GC account data would be REALLY cool, as one could just
> browse from the accounts editor right into the slot data (and edit it)
> if one wanted, in a relatively safe way. That would be awesome!  It
> would provide a mechanism for helping users (and programmers!),
> especially when more function-specific manipulation capabilities are
> not supplied.
> 
> The bottom line for removing all the Bayes matching data, if one wants
> to start over with matching, is that one has to remove all the slots
> with the key of
> 
> 	<slot:key>import-map-bayes</slot:key>
> 
> The frame that follows this is a LONG array of data associated with
> this slot.
> 
> xmlcopyeditor does appear to provide a _relatively_ safe way to do this,
> at least on Unix (Linux) based platforms.  Other XML editors exist for
> Windows. That said, I still had to be very careful, and I operated on a
> copy of the data.
> 
> I did this and tested the resulting file, and it seems to read into GC
> without error.  A trial import definitely matched nothing. :-)
> 
> Lincoln
