[GNC] Fixing bad Bayesian data

David Cousens davidcousens at bigpond.com
Tue Dec 18 20:32:25 EST 2018


Steve
My understanding is that the matcher tokenizes the description, and possibly the memo field, and constructs a probability
table of the tokens used as the basis for assigning a transaction to a given account. It then matches the tokens of an
incoming transaction against the stored tokens to calculate the probability of a match. I haven't looked at that part of
the code in detail, only skimmed over it so far, so it may be a bit more complicated than that. I am planning to explore
parts of the matcher and CSV import code in the next few weeks/months to get a better idea of how it works. Hence my
reluctance to go editing the data file.
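
As a rough illustration of that idea, and not of the actual GnuCash code, here is a minimal Python sketch. It assumes
whitespace tokenization and a simple naive-Bayes combination with add-one smoothing; the function names and the sample
description are hypothetical. The counts are copied from the slots Steve quotes below.

```python
# Counts copied from the slots quoted in this thread: token -> {account GUID: count}.
COUNTS = {
    "INN": {"c6447afebc9564fded7d1bafbe1e026e": 1},
    "IS": {"c6447afebc9564fded7d1bafbe1e026e": 4},
    "INTEREST": {"b572baae5a56a30ce384ab58ff12ed7d": 1},
    "INVESTM": {"5920c9dbe1d24308893a5eeb32d01e09": 3},
}


def tokenize(description):
    """Split on whitespace (an assumption about how the real tokenizer works)."""
    return description.split()


def account_probabilities(description, counts):
    """Score every known account for a description, naive-Bayes style.

    Each known token contributes (count + 1) / (total + number of accounts),
    i.e. add-one smoothing, so an account is never ruled out by one token.
    Tokens that were never stored are ignored. Scores are normalized to sum to 1.
    """
    accounts = sorted({guid for per_account in counts.values() for guid in per_account})
    if not accounts:
        return {}
    scores = {guid: 1.0 for guid in accounts}
    for token in tokenize(description):
        per_account = counts.get(token)
        if per_account is None:
            continue  # token never seen during an import
        total = sum(per_account.values())
        for guid in accounts:
            scores[guid] *= (per_account.get(guid, 0) + 1) / (total + len(accounts))
    norm = sum(scores.values())
    return {guid: score / norm for guid, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical description whose tokens point at two different accounts.
    probs = account_probabilities("INTEREST IS PAID", COUNTS)
    for guid, p in sorted(probs.items(), key=lambda item: -item[1]):
        print(f"{guid}  {p:.3f}")
```

In this sketch the tokens do not each pick an account on their own. Every token nudges the score of every account, and
the account with the highest combined score wins, so a token with a higher stored count ("IS", seen 4 times) outweighs
one seen only once ("INTEREST").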

I presume each account in the file that has had data imported into it will have similar data in it, and the account IDs
refer to the transfer account. The key entry identifies the token and the transfer account, and the value is the number
of times the token has occurred in association with a transfer to that account. The partial tokens may have occurred
where a line break had somehow been inserted in the incoming field and the parser has interpreted it as whitespace.
GnuCash stores a lot of its data, even in its data structures, as key-value pairs, which allows new data to be added
without having to rewrite a lot of code.
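
To make that reading of the keys concrete, here is a small Python sketch that splits keys of the form
import-map-bayes/TOKEN/GUID and groups the counts by account. The key strings and counts are taken from Steve's
message below; the helper names are hypothetical, and the layout is assumed only from those examples. The last lines
show how a stray line break inside a word would yield two partial tokens if the field is split on whitespace (the
"WASH" half is a guess for illustration only).

```python
from collections import defaultdict

PREFIX = "import-map-bayes/"

# (key, count) pairs copied from the slots quoted in this thread.
SLOTS = [
    ("import-map-bayes/INGTON/94ec6c9aae683c9125fb0dd2b1bb8846", 1),
    ("import-map-bayes/INN/c6447afebc9564fded7d1bafbe1e026e", 1),
    ("import-map-bayes/INTEREST/b572baae5a56a30ce384ab58ff12ed7d", 1),
    ("import-map-bayes/IS/c6447afebc9564fded7d1bafbe1e026e", 4),
]


def parse_key(key):
    """Split 'import-map-bayes/TOKEN/GUID' into (token, guid)."""
    if not key.startswith(PREFIX):
        raise ValueError(f"not a Bayes import-map key: {key!r}")
    # Split at the last slash so a token containing '/' stays in one piece.
    token, sep, guid = key[len(PREFIX):].rpartition("/")
    if not sep or not token or not guid:
        raise ValueError(f"malformed key: {key!r}")
    return token, guid


def counts_by_account(slots):
    """Group the stored counts as account GUID -> {token: count}."""
    table = defaultdict(dict)
    for key, count in slots:
        token, guid = parse_key(key)
        table[guid][token] = count
    return dict(table)


if __name__ == "__main__":
    for guid, tokens in counts_by_account(SLOTS).items():
        print(guid, tokens)
    # A line break inside a word is whitespace to str.split(), giving partial tokens.
    print("WASH\nINGTON".split())
```

Grouped this way, the account ending in 026e turns out to own both "INN" and "IS", which fits the idea that the
tokens of one description all get counted against the same transfer account.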

You seem to have worked out on your own that the .gnucash file is the XML format file and that a preference controls
whether it is saved compressed or uncompressed.
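
If you want to inspect a copy of the file without changing that preference, something like the following Python sketch
should work. It assumes the compressed form is ordinary gzip, which it detects from the two-byte gzip signature; the
function names are hypothetical. Only ever point it at a backup copy.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_gnucash_xml(path):
    """Open a data file for reading XML bytes, compressed or not.

    Assumes the compressed format is plain gzip; the file is opened read-only.
    """
    if is_gzip(path):
        return gzip.open(path, "rb")
    return open(path, "rb")


def count_bayes_slots(path):
    """Count the lines that mention an import-map-bayes key."""
    with open_gnucash_xml(path) as handle:
        return sum(1 for line in handle if b"import-map-bayes/" in line)
```

Calling count_bayes_slots on a copy of the data file gives a quick idea of how much matcher data is stored in it.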

Cheers

David


On Tue, 2018-12-18 at 18:34 -0600, Steve Cohen wrote:
> OK, I figured out that .gnucash does not describe the file format which 
> is either compressed or non-compressed XML depending on the compression 
> setting you choose.
> 
> So I switched to non-compressed and looked at the Bayesian elements and 
> it's not what I would have expected.  The expressions that are mapped 
> from are not phrases but "words."  Something like
> 
>      <slot>
>        <slot:key>import-map-bayes/INDEPENDENCE/f5cb4b5b31decc01c394dd7170078254</slot:key>
>        <slot:value type="integer">1</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/INDIA/af48360c2fb9b039b4707ad7d7517950</slot:key>
>        <slot:value type="integer">1</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/INGTON/94ec6c9aae683c9125fb0dd2b1bb8846</slot:key>
>        <slot:value type="integer">1</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/INN/c6447afebc9564fded7d1bafbe1e026e</slot:key>
>        <slot:value type="integer">1</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/INTEREST/b572baae5a56a30ce384ab58ff12ed7d</slot:key>
>        <slot:value type="integer">1</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/INTUIT/9c204d33baf137f4f0b078f9b61531d1</slot:key>
>        <slot:value type="integer">1</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/INVESTM/5920c9dbe1d24308893a5eeb32d01e09</slot:key>
>        <slot:value type="integer">3</slot:value>
>      </slot>
>      <slot>
>        <slot:key>import-map-bayes/IS/c6447afebc9564fded7d1bafbe1e026e</slot:key>
>        <slot:value type="integer">4</slot:value>
>      </slot>
> 
> So I am trying to understand how these are applied.  I get that the long 
> hex numbers are GUIDs representing accounts and that the expressions 
> before this are bits of the transaction description.  But what if the 
> transaction description is multiple words, each mapping to a different 
> account?  Obviously "INVESTM" and "IS" are going to be pulled in many 
> different directions.  How does "INGTON" get in there?  Why isn't it 
> "WASHINGTON"? So I'm trying to understand how this works at all.
> 
> I know that it does, but I can't imagine how.
> 
> The long hex numbers are GUIDs corresponding to accounts.
> On 12/18/18 5:59 PM, Stephen M. Butler wrote:
> > On 12/18/18 3:31 PM, Steve Cohen wrote:
> > > Thanks.
> > > 
> > > Seems like none of these solutions will work if your data is stored as 
> > > a .gnucash file, they only work with .xml files.
> > > 
> > > Is there a way to convert this?
> > > 
> > > Is the Bayesian matching applied to entries that are corrected in the 
> > > account editor, or is it only applied to entries made in the importer?
> > > 
> > > I am somewhat comfortable with the bleeding edge, but, when is the 
> > > release of version 4 expected?
> > > 
> > > 
> > > On 12/18/18 5:17 PM, David Cousens wrote:
> > > > Steve
> > > > 
> > > > These may help.
> > > > https://wiki.gnucash.org/wiki/Bayes
> > > > https://lists.gnucash.org/pipermail/gnucash-user/2016-July/066299.html
> > > > http://gnucash.1415818.n4.nabble.com/Fixing-confused-bayesian-matching-data-td4685819.html 
> > > > 
> > > > http://blog.jdlh.com/en/2016/07/29/resetting-gnucashs-import-transaction-matching/ 
> > > > 
> > > > 
> > > > If you attempt any of the solutions mentioned in the above posts, make a
> > > > backup of your data file and work only on a copy until you are sure it
> > > > still works after the change.
> > > > 
> > > > The importer stores the map data and probabilities during the final step of
> > > > the import process. If you let transactions go through to Imbalance then it
> > > > obviously gets no data to work with. If you assign all transactions to a
> > > > specific transfer account before import and continue to do that, it will
> > > > eventually correct itself. There are a few situations in which the Bayesian
> > > > matcher does not work. I find that where there is a unique transaction number
> > > > which changes with each periodic transaction, the matcher seems to run into
> > > > problems. A number identifying the payer/payee and not the transaction
> > > > itself is OK. Some of mine have both.
> > > > 
> > > > A feature to be added in GnuCash V4 allows multiple selection of
> > > > transactions and assignment of a single transfer account in the import
> > > > matcher, which speeds up the transaction matching process significantly.
> > > > It can be incorporated in V3.x as a patch if you build GnuCash from source,
> > > > but the risk is that future bug fixes in the importer which change the two
> > > > affected files could result in a non-working GnuCash. It is incorporated in
> > > > the master branch of the GitHub repository and can be built from that if you
> > > > are comfortable working with the bleeding edge.
> > > > 
> > > > David Cousens
> > > > 
> > 
> > 
> > Steve,
> > 
> > In GnC, click on the Tools menu and then on the Import Map Editor.  Once 
> > on the new screen you can see all the mappings that have been generated.
> > 
> > In my case, I did some restructuring of my accounts and found that the 
> > existing mappings no longer worked.  I highlighted the top levels and 
> > clicked on the DELETE key.  That reset everything for me and I'm in the 
> > process of building the new set of mappings.
> > 
> > The high level is based on the imports you do.  I had three: Checking 
> > account, Credit Card, and Savings account.  The last one is used so 
> > little that it isn't worth the hassle of downloading the 1-2 entries 
> > each month so I now enter them by hand.  That will leave me with just 
> > two imports -- which I plan to do multiple times each month to keep the 
> > number of transactions low.
> > 
> > Anyway, if you decide to clear everything out, the above is a nice and 
> > easy way to do that.
> > 
> > --Steve
> > 
> 