OFX Bayesian import not working for me

Eliot Rosenbloom eliot at ejr.me
Wed Dec 2 00:35:24 EST 2015


For the benefit of other readers: I am presuming my first paragraph is 
correct, and your "It seems not," refers to non-examination of the Memo 
field for matching and/or the non-existence of "an easier way" to 
consider the detailed info in the Memo field.

(I'm not sure what to make of the comment you cite, but things do seem 
to be working.)   - Eliot

> John Ralls <mailto:jralls at ceridwen.us>
> December 1, 2015 at 10:00 PM
>
>
> Eliot,
>
> It seems not. The following comment may or may not explain:
>     /* Disable matching by memo, until bayesian filtering is implemented.
>      * It's currently unlikely to help, and has adverse effects,
>      * causing false positives, since very often the type of the
>      * transaction is stored there.
>
> Unfortunately the author of that comment didn’t explain what he meant 
> by bayesian filtering anywhere.
>
> Regards,
> John Ralls
>
> Eliot Rosenbloom <mailto:eliot at ejr.me>
> December 1, 2015 at 7:35 PM
> Ok, I think I understand now:  If GC identifies a single keyword 
> ("token") in the transaction's description, then it assigns the Amount 
> to the account with the highest "score."  If the transaction has more 
> than one token, then GC totals the scores for each account (across all 
> the relevant tokens) and, again, assigns the Amount to the account 
> with the highest (total) score.
>
> I did choose to delete the whole import-match-bayes slot for each 
> relevant account, and it seems to be working fine!  I'm VERY appreciative!
>
> Does anyone know if GC looks for tokens only in the Description / 
> <NAME> field, or does it also examine the Memo field?  My credit union 
> often fills the Description field with generic info such as "ACH 
> Withdrawal" and puts the more specific, helpful info into the Memo 
> field.  :-(
>
> I found that doing a global change in the .ofx import file CHANGING, 
> for example:
>           "<NAME>ACH Withdrawal<MEMO>"
> TO:    "<XXX>dummy<NAME>ACHW:  "
> moved the useful information into the Description field (pre-pended 
> with "ACHW:  ").
>
> Making 2-3 similar global changes on each month's .ofx file is not 
> prohibitively time consuming, but if there were an easier way, I'd be 
> happy to hear it.
>
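The global change described above could be scripted. What follows is only a sketch: the function name promote_memo and the GENERIC_NAMES table are made up for illustration, and it assumes the <MEMO> tag follows the generic <NAME> text directly (optionally after whitespace or a line break), as in the example. It mirrors the replacement text quoted above exactly, including the <XXX>dummy part.

```python
import re

# Hypothetical table: generic <NAME> text -> short prefix to put in its place.
GENERIC_NAMES = {"ACH Withdrawal": "ACHW"}


def promote_memo(ofx_text, generic_names=GENERIC_NAMES):
    """Move the <MEMO> text into <NAME> wherever <NAME> holds a generic label."""
    for generic, prefix in generic_names.items():
        # Match e.g. "<NAME>ACH Withdrawal<MEMO>", allowing whitespace
        # (including a line break) between the name text and <MEMO>.
        pattern = re.compile(r"<NAME>" + re.escape(generic) + r"\s*<MEMO>")
        replacement = "<XXX>dummy<NAME>" + prefix + ":  "
        # A function is used as the replacement so the text is inserted
        # literally, with no backslash or group processing.
        ofx_text = pattern.sub(lambda match, r=replacement: r, ofx_text)
    return ofx_text


print(promote_memo("<NAME>ACH Withdrawal<MEMO>CITY WATER DEPT"))
```

It would be applied to the text of a copy of the month's .ofx file before importing; entries whose <NAME> is not in the table are left untouched.
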
> Again, many thanks, John!
>
> Eliot
>
> John Ralls <mailto:jralls at ceridwen.us>
> November 28, 2015 at 2:38 PM
>
>
> Eliot,
>
> Please remember to copy the list on all replies. “Reply all” works well.
>
> It’s the .gnucash file without a date. It’s compressed with gzip; you 
> can uncompress it on the command line with gunzip, or you can unselect 
> “Compress Files” in Preferences>General and the next save will be 
> uncompressed.
>
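For anyone without gunzip at hand, the same decompression can be done with Python's standard gzip module. This is a minimal sketch; the function name and the file names in the usage note are placeholders, and it writes to a separate file so the original data file is left alone.

```python
import gzip
import shutil


def decompress_copy(src_path, dst_path):
    """Write an uncompressed copy of a gzip-compressed data file."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream the data across so large files are not read into memory at once.
        shutil.copyfileobj(src, dst)
```

For example, decompress_copy("mybook.gnucash", "mybook-plain.xml") would leave a plain-text XML copy that can be opened in a text editor.
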
> I summed across all three in the first MEDICARE example because I 
> deleted two of them with the unstated assumption that only one was 
> correct. I explained that that was just an example and that a real 
> case would be more complicated, which I thought that I’d clarified 
> later by explaining the way Bayesian matching tokenizes descriptions, 
> scores the token - account pair, and then sums the scores across the 
> tokens to select the matching account.
>
> So to “make sense” of a set of token scores you need to run that 
> process yourself for the tokens you intend to change: From a set of 
> import files find the descriptions containing each token you’re 
> contemplating changing, find the other tokens in those descriptions, 
> look at the token-account scores for each and work out what account 
> the matcher will select in each case. If it appears that the matcher 
> will do the right thing, remove only the “:” delimited account tokens; 
> you probably don’t need to change the scores of the remaining ones, 
> because the token-account scores for “:” delimited accounts were all 
> created together and you’ve decided that the other scores provide the 
> right answer. Deleting the “:” scores is still helpful because the 
> matcher won’t have to look at those scores any more and that will 
> speed it up. If the matcher is guessing wrong then by working out the 
> match process by hand you’ll understand why and can remove or adjust 
> token-account scores as necessary. If all of that seems like too much 
> work you can just delete the whole import-match-bayes slot and start 
> over generating new matches.
>
> If I understand your question about posting, just “reply all”. The 
> list is in the CC field of this message and “reply all” will ensure 
> that it’s in your reply as well.
>
> Regards,
> John Ralls
>
>
> Eliot Rosenbloom <mailto:eliot at ejr.me>
> November 28, 2015 at 1:03 PM
> Thanks John!
>
> 1.  I hate to be dumb, but where do I find the file with <slot> tags?  
> I see only 3 types of files:  .log (with minimal info in them);  and 
> .gnucash with and without a date (both are "garbage" when opened with 
> TextEditor).
>
> 2.  "look at slots for the other tokens to see that they all make 
> sense." -- I'm a bit fuzzy on the relation (below) between the two 
> entries for "Medicare" [I would delete the one with":" , I assume]  
> AND the one for "Health Insurance," and why you summed across all 3?  
> And can you say a bit more about what "making sense" means?  What 
> should I be looking to find or avoid?
>
> 3.  To post this (obviously w/o my personal files I sent you), do I 
> remember that there is an email address I can forward it to ... 
> probably preceded by a brief statement of the problem?
>
> Thanks!
>
> Eliot
> Mac  OS 10.11
> (And BTW:  I downloaded GC v. 2.6.9, but in Get Info and the Finder 
> regular column listing, it shows up as 2.6.7.  The .dmg was named 
> Intel 2.6.9-1)
>
> John Ralls <mailto:jralls at ceridwen.us>
> November 28, 2015 at 10:03 AM
>
>
> Eliot,
>
> This should go on the list, it might be useful for others. Also, 
> Robert Fewell has a pull request in on the development branch to add a 
> token viewer/editor and this case might be helpful to him.
>
> You can edit the data file, though it can get complicated as I’ll 
> explain later. I suggest making a copy first. TextEdit will do fine 
> for editing, just be sure to save as plain text. Delete whole xml 
> elements, so in the example I used earlier,
>
> <slot>
> <slot:key>MEDICARE</slot:key>
> <slot:value type="frame">
> <slot>
> <slot:key>Expenses,Medical Expenses,Health Insurance</slot:key>
> <slot:value type="integer">2</slot:value>
> </slot>
> <slot>
> <slot:key>Expenses,Medical Expenses,Medicare</slot:key>
> <slot:value type="integer">8</slot:value>
> </slot>
> <slot>
> <slot:key>Expenses:Medical Expenses:Medicare</slot:key>
> <slot:value type="integer">6</slot:value>
> </slot>
> </slot:value>
> </slot>
>
> you want to make sure that you delete each <slot> together with its 
> matching </slot>. So if you want to make it use Expenses,Medical 
> Expenses,Health Insurance with key MEDICARE you would delete the other 
> two slots. I suggest that you also change the score of the remaining 
> slot to the sum of all three:
>
> <slot>
> <slot:key>MEDICARE</slot:key>
> <slot:value type="frame">
> <slot>
> <slot:key>Expenses,Medical Expenses,Health Insurance</slot:key>
> <slot:value type="integer">16</slot:value>
> </slot>
> </slot:value>
> </slot>
>
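The edit just shown (keep one account sub-slot and give it the summed score) can also be sketched with Python's xml.etree.ElementTree. Assumptions to check against your own file: the namespace URI bound to the "slot" prefix is taken to be http://www.gnucash.org/XML/slot, and the function name keep_only is invented here. The sketch works on the fragment from this message only, not on a whole data file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "slot" prefix; verify it in your file's root element.
NS = {"slot": "http://www.gnucash.org/XML/slot"}

EXAMPLE = """<slot xmlns:slot="http://www.gnucash.org/XML/slot">
<slot:key>MEDICARE</slot:key>
<slot:value type="frame">
<slot>
<slot:key>Expenses,Medical Expenses,Health Insurance</slot:key>
<slot:value type="integer">2</slot:value>
</slot>
<slot>
<slot:key>Expenses,Medical Expenses,Medicare</slot:key>
<slot:value type="integer">8</slot:value>
</slot>
<slot>
<slot:key>Expenses:Medical Expenses:Medicare</slot:key>
<slot:value type="integer">6</slot:value>
</slot>
</slot:value>
</slot>"""


def keep_only(token_slot, account_key):
    """Delete every account sub-slot except account_key and give it the summed score."""
    frame = token_slot.find("slot:value", NS)
    subs = frame.findall("slot")
    keys = [sub.find("slot:key", NS).text for sub in subs]
    if account_key not in keys:
        # Fail before changing anything if the wanted account is absent.
        raise KeyError(account_key)
    total = sum(int(sub.find("slot:value", NS).text) for sub in subs)
    for sub, key in zip(subs, keys):
        if key == account_key:
            sub.find("slot:value", NS).text = str(total)
        else:
            # Removes the whole <slot>...</slot> element, never half of it.
            frame.remove(sub)
    return total


token_slot = ET.fromstring(EXAMPLE)
total = keep_only(token_slot, "Expenses,Medical Expenses,Health Insurance")
print(total)  # 2 + 8 + 6 = 16
```

Writing a complete data file back out through ElementTree may change its formatting, so hand editing a copy, as described above, remains the conservative route.
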
> Before you dive in it’s important to understand that the Bayesian 
> matcher tokenizes the description string on spaces and assigns scores 
> to each token. If the description in the import is “CMS MEDICARE” 
> (invented for illustration, I didn’t look at the OFX file) then there 
> will be another slot with key CMS and sub-slots with various accounts 
> and scores. Here’s where you need to be careful: You might have other 
> transactions with the string “CMS” in the description and when 
> combined with some word other than “MEDICARE” will go to a different 
> account. The sum of the scores for each token is what actually 
> determines which account the matcher selects, which is why I suggested 
> keeping the total MEDICARE score the same.  Note that the same applies 
> to the MEDICARE key: If you have different transactions with MEDICARE 
> in the description, some of which should go to Expenses,Medical 
> Expenses,Medicare and others to Expenses,Medical Expenses,Health 
> Insurance then you should delete only the slot whose key has ‘:’ 
> separators and look at slots for the other tokens to see that they all 
> make sense. Depending on how much token duplication there is and how 
> many tokens each description has the combinations can get rather 
> daunting, so you may elect to just delete the value tokens with ‘:’ 
> separated keys.
>
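The selection process described above (split the description on spaces, look up each token's account scores, sum per account, pick the highest total) can be worked through in a few lines of Python. This is a sketch of the process as explained in this thread, not GnuCash's actual code, which may combine scores differently. The MEDICARE scores come from the example; the CMS scores and the "Expenses,Taxes" account are invented for illustration.

```python
from collections import defaultdict

# token -> {account -> score}. MEDICARE scores are from the example above;
# the CMS entry is hypothetical.
TOKEN_SCORES = {
    "MEDICARE": {
        "Expenses,Medical Expenses,Health Insurance": 2,
        "Expenses,Medical Expenses,Medicare": 8,
    },
    "CMS": {
        "Expenses,Medical Expenses,Medicare": 3,
        "Expenses,Taxes": 1,
    },
}


def account_totals(description, token_scores):
    """Sum each account's scores over all tokens in the description."""
    totals = defaultdict(int)
    for token in description.split():
        for account, score in token_scores.get(token, {}).items():
            totals[account] += score
    return dict(totals)


def pick_account(description, token_scores):
    """Return the account with the highest summed score, or None if no token is known."""
    totals = account_totals(description, token_scores)
    if not totals:
        return None
    return max(totals, key=totals.get)


print(account_totals("CMS MEDICARE", TOKEN_SCORES))
print(pick_account("CMS MEDICARE", TOKEN_SCORES))
```

Running this over the descriptions in a set of import files, before and after a planned edit to the scores, shows which transactions would change account.
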
> Regards,
> John Ralls


