[GNC-dev] New bayesian account matching algorithm

Tue Dec 15 18:00:56 EST 2020

Hi devs,

I'd like to propose a new algorithm for matching imported transactions 
to accounts. I've been dealing with the bayesian account matcher for 
several months now during my spare free time. Starting point of my 
efforts was the observation, that sometimes matching fails, where I 
wouldn't expect that. For instance some recurrent transactions, which 
could be matched successfully for a long time, suddenly started to fail 
matching. Additionally several users in the german mailing list reported 
similar problems and asked for help on how to "improve" the matching 
results. The mostly recommended workaround was to "clean up" the import 
mapping table using the corresponding import map editor.

Therefore I started trying to understand the current implementation. See 
my mails from that thread:

    * 
http://gnucash.1415818.n4.nabble.com/GNC-dev-Understanding-the-bayesian-import-matching-algorithm-td4719812.html

I could fix some bugs already. See this issue on Bugzilla:

    * https://bugs.gnucash.org/show_bug.cgi?id=797587

Another issue is still open:

    * https://bugs.gnucash.org/show_bug.cgi?id=797744

Unfortunatelly I was not able to fully understand the implementation, 
see my mails from july in the mail thread mentioned above. Moreover it 
seemed, that nobody else could explain the current implementation anymore.

Therefore I stopped trying to further understand the algorithm and 
prepared a new and improved approach. I did a little research and tried 
different approaches. The final approach, I want to propose now, is a 
probalistic (bayesian) approach as well and should solve the following 
two problems of the current algorithm:

    * it should be possible to understand and explain the matching results

    * the account matching probabilities should be "real" probabilities 
and not only result to 1 or 0 (see my last mail from october in the mail 
thread mentioned above)

This new algorithm is even simpler, but still as effective as the 
current algorithm. And the current algorithm can be easily replaced 
using the existing import mapping table (token frequency table). It is 
provided as PR #839 on Github:

    * https://github.com/Gnucash/gnucash/pull/839

I tested the algorithm with the transactions from my personal account. I 
wrote a simple test application, which simulates the import of all my 
transactions one by one in chronological order. For each transaction 
this application compares the matched account calculated by the 
algorithm with the account I manually assigned the transaction to. I 
used this application to compare the new proposed algorithm with the 
current algorithm.

At first all transactions, which did match correctly before still match 
correctly using the new algorithm. Additionally several transactions, 
which did not match correcly before, match correctly now. The ratio of 
correct matches could be improved from approx. 60 % to 70 %.

But the most important improvement is, that false matching results can 
be explained now.

And this is a first draft of the new approach only. There is still space 
for further improvement. The algorithm has a new feature. You can select 
or exclude individual tokens from being considered for calculation of 
account matching probability. I already proposed this idea on Bugzilla

    * https://bugs.gnucash.org/show_bug.cgi?id=797779

>From my point of view this is the key for further improvement. The 
current as well as the new proposed algorithm can solve all the unique 
cases, i.e. transactions with have at least one token, which is unique 
to exactly one account (see also my last mail from october in the mail 
thread mentioned above). The more complicated cases are transactions, 
which can not be matched uniquely to one account, since similar 
transactions have been assigned to different accounts in the past by the 
user.

I played around with these transactions and could already achieve 
promising results by reducing the 'token_account_probability_threshold' 
used for automatic token selection. But this raises new problems, since 
there are some tokens, which can distract the matching result from the 
correct account. Furthermore it wasn't that easy anymore to find a 
meaningful threshold to detect "new" transactions, which haven't occured 
before and shouldn't be tried to match.

Therefore I decided for a very conservative threshold in the first draft 
and left this open for future improvements. I prefered a robust solution 
at the moment.

Ok, now you can have a look and try out the new approach on your own. 
And I'm curious about your feedback.

Christian