[GNC] Fixing bad Bayesian data

John Ralls jralls at ceridwen.us
Tue Dec 18 20:28:50 EST 2018


The slot value is a score. You may find that some fragments have several entries with different scores. The Bayes matcher finds all of the fragments in a string and sums up the scores for each GUID. The GUID with the highest sum is the match.

For example, suppose you have the string “Joe’s Fish Market” and the following table represents part of your match data:

Fragment   GUID   Score
Joe's      123    1
Joe's      7a3    2
Joe's      9b6    1
Fish       9b6    1
Meat       7a3    2
Barber     123    1
Emporium   7a3    2
Market     9b6    1
Shop       123    1

That would result in the following scores:

123   1
7a3   2
9b6   3

And the matcher would select account 9b6.
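
To make the arithmetic concrete, here is a minimal Python sketch of the summing described above. It is only an illustration, not GnuCash's actual code: the function names are hypothetical, the short GUIDs are the placeholders from the table, and splitting the description on whitespace is an assumption about how fragments are found. The key layout import-map-bayes/<fragment>/<GUID> is taken from the slot keys quoted further down.

```python
from collections import defaultdict

PREFIX = "import-map-bayes/"

# (fragment, account GUID, score) rows copied from the example table.
# The short GUIDs are the placeholders used in the example, not real GUIDs.
ENTRIES = [
    ("Joe's", "123", 1),
    ("Joe's", "7a3", 2),
    ("Joe's", "9b6", 1),
    ("Fish", "9b6", 1),
    ("Meat", "7a3", 2),
    ("Barber", "123", 1),
    ("Emporium", "7a3", 2),
    ("Market", "9b6", 1),
    ("Shop", "123", 1),
]


def split_bayes_key(key):
    """Split 'import-map-bayes/<fragment>/<GUID>' into (fragment, GUID).

    Hypothetical helper; it assumes the GUID is the last path component.
    """
    if not key.startswith(PREFIX):
        raise ValueError("not a Bayes import-map key: " + key)
    fragment, guid = key[len(PREFIX):].rsplit("/", 1)
    return fragment, guid


def score_accounts(description, entries):
    """Sum, per GUID, the scores of entries whose fragment is in the description."""
    # Assumption: fragments are the whitespace-separated words of the string.
    fragments = set(description.split())
    totals = defaultdict(int)
    for fragment, guid, score in entries:
        if fragment in fragments:
            totals[guid] += score
    return dict(totals)


def best_match(description, entries):
    """Return the GUID with the highest summed score, or None if nothing matches."""
    totals = score_accounts(description, entries)
    if not totals:
        return None
    return max(totals, key=totals.get)


print(score_accounts("Joe's Fish Market", ENTRIES))  # {'123': 1, '7a3': 2, '9b6': 3}
print(best_match("Joe's Fish Market", ENTRIES))      # 9b6
```

Entries for Meat, Barber, Emporium and Shop contribute nothing here because those fragments do not occur in the string.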

Regards,
John Ralls

> On Dec 18, 2018, at 4:34 PM, Steve Cohen <stevecoh2 at gmail.com> wrote:
> 
> OK, I figured out that the .gnucash extension does not describe the file format, which is either compressed or uncompressed XML depending on the compression setting you choose.
> 
> So I switched to uncompressed and looked at the Bayesian elements, and they are not what I would have expected.  The expressions that are mapped from are not phrases but "words."  Something like
> 
>    <slot>
>      <slot:key>import-map-bayes/INDEPENDENCE/f5cb4b5b31decc01c394dd7170078254</slot:key>
>      <slot:value type="integer">1</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/INDIA/af48360c2fb9b039b4707ad7d7517950</slot:key>
>      <slot:value type="integer">1</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/INGTON/94ec6c9aae683c9125fb0dd2b1bb8846</slot:key>
>      <slot:value type="integer">1</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/INN/c6447afebc9564fded7d1bafbe1e026e</slot:key>
>      <slot:value type="integer">1</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/INTEREST/b572baae5a56a30ce384ab58ff12ed7d</slot:key>
>      <slot:value type="integer">1</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/INTUIT/9c204d33baf137f4f0b078f9b61531d1</slot:key>
>      <slot:value type="integer">1</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/INVESTM/5920c9dbe1d24308893a5eeb32d01e09</slot:key>
>      <slot:value type="integer">3</slot:value>
>    </slot>
>    <slot>
>      <slot:key>import-map-bayes/IS/c6447afebc9564fded7d1bafbe1e026e</slot:key>
>      <slot:value type="integer">4</slot:value>
>    </slot>
> 
> So I am trying to understand how these are applied.  I get that the long hex numbers are GUIDs representing accounts and that the expressions before them are bits of the transaction description.  But what if the transaction description is multiple words, each mapping to a different account?  Obviously "INVESTM" and "IS" are going to be pulled in many different directions.  How does "INGTON" get in there?  Why isn't it "WASHINGTON"? So I'm trying to understand how this works at all.
> 
> I know that it does, but I can't imagine how.
> 
> The long hex numbers are GUIDs corresponding to accounts.
> On 12/18/18 5:59 PM, Stephen M. Butler wrote:
>> On 12/18/18 3:31 PM, Steve Cohen wrote:
>>> Thanks.
>>> 
>>> It seems that none of these solutions will work if your data is stored as a .gnucash file; they only work with .xml files.
>>> 
>>> Is there a way to convert this?
>>> 
>>> Is the Bayesian matching applied to entries that are corrected in the account editor, or is it only applied to entries made in the importer?
>>> 
>>> I am somewhat comfortable with the bleeding edge, but, when is the release of version 4 expected?
>>> 
>>> 
>>> On 12/18/18 5:17 PM, David Cousens wrote:
>>>> Steve
>>>> 
>>>> These may help.
>>>> https://wiki.gnucash.org/wiki/Bayes
>>>> https://lists.gnucash.org/pipermail/gnucash-user/2016-July/066299.html
>>>> http://gnucash.1415818.n4.nabble.com/Fixing-confused-bayesian-matching-data-td4685819.html 
>>>> http://blog.jdlh.com/en/2016/07/29/resetting-gnucashs-import-transaction-matching/ 
>>>> 
>>>> If you attempt any of the solutions mentioned in the above posts, make a
>>>> backup of your data file first and work only on a copy until you are sure
>>>> the changed file works.
>>>> 
>>>> The importer stores the map data and probabilities during the final step of
>>>> the import process. If you let transactions go through to Imbalance, it
>>>> obviously gets no data to work with. If you assign all transactions to a
>>>> specific transfer account before import and continue to do that, it will
>>>> eventually correct itself. There are a few situations in which the Bayesian
>>>> matcher does not work. I find that the matcher seems to run into problems
>>>> when the description contains a unique transaction number that changes with
>>>> each periodic transaction. A number identifying the payer/payee and not the
>>>> transaction itself is OK. Some of mine have both.
>>>> 
>>>> A feature to be added in GnuCash V4 allows selecting multiple transactions
>>>> and assigning a single transfer account to them in the import matcher,
>>>> which speeds up the transaction matching process significantly. It can be
>>>> incorporated in V3.x as a patch if you build GnuCash from source, but the
>>>> risk is that future bug fixes in the importer which change the two affected
>>>> files could result in a non-working GnuCash. It is incorporated in the
>>>> master branch of the GitHub repository and can be built from that if you
>>>> are comfortable working with the bleeding edge.
>>>> 
>>>> David Cousens
>> Steve,
>> In GnC, click on the Tools menu and then on the Import Map Editor.  Once on the new screen you can see all the mappings that have been generated.
>> In my case, I did some restructuring of my accounts and found that the existing mappings no longer worked.  I highlighted the top levels and clicked on the DELETE key.  That reset everything for me and I'm in the process of building the new set of mappings.
>> The high level is based on the imports you do.  I had three: Checking account, Credit Card, and Savings account.  The last one is used so little that it isn't worth the hassle of downloading the 1-2 entries each month so I now enter them by hand.  That will leave me with just two imports -- which I plan to do multiple times each month to keep the number of transactions low.
>> Anyway, if you decide to clear everything out, the above is a nice and easy way to do that.
>> --Steve
> 