[GNC-dev] Understanding the bayesian import matching algorithm

Christian Gruber christian.gruber at posteo.de
Sun Oct 11 18:39:34 EDT 2020

Hi Derek,

I further investigated the implemented algorithm during the last months
using the transactions from my personal account. I wrote a simple
application which simulates the import of all my transactions one by one
in chronological order. For each transaction this application compares
the matched account calculated by the implemented bayes algorithm with
the account I manually assigned the transaction to. In case of a
mismatch the application provides several debug information like the
calculated probabilities for each account and the frequencies of each
token for instance to see what's going wrong.

What I found out was very surprising to me. Here are my results:

* the calculated probabilities for each account P(A_k | T_1, ..., T_N) are
always whether exactly 1 or almost 0, i.e. the matched account is actually a
binary decision
* sometimes there is more than one account with probability 1 and the finally
selected matching account (function highest_probability()) can be the
correct one or not, it's random
* the condition for a probability of 1 is, that there exists a token, which is
unique for a single account, i.e. all transactions containing this token have
always been assigned to the same account
* if there are several tokens, which are unique for different accounts, all
these accounts have probability 1
* if there is no token, which is unique to one account, the probability is
almost 0 for all accounts

That means, as soon as there is no unique token, which can be used for
classification, the implemented algorithm fails. Moreover if the
relevant tokens are not unique, but there are unique tokens, which are
actually irrelevant, these irrelevant tokens are used for
classification. This was the most common observed case for wrong account
selection, since the irrelevant tokens where mostly not assigned to the
correct account. In many cases these irrelevant tokens had frequencies
of 1, i.e. there was only one transaction before with randomly the same
token.

In addition this would mean, that the bayes classifier is actually
needless, since the same classification results can be achieved with a
much simpler rule-based decision algorithm and without any probalistic
approach.

These observations encourage my suspicion, that something is wrong with
the implemented algorithm.

Christian

Am 06.07.20 um 01:05 schrieb Christian Gruber:
>
> Am 02.07.20 um 21:28 schrieb Derek Atkins:
>> Hi,
>>
>> On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:
>>> Hi,
>>>
>>> while further studying the bayesian import matching algorithm I'm
>>> now at
>>> the point, where I wanted to understand, how the bayes formula is
>>> applied to the problem of matching transactions to accounts using
>>> tokens. But I need further information, since it doesn't come clear to
>>> me what is really calculated there.
>>>
>>> The implementation can be found in the following functions in
>>> Account.cpp:
>>>
>>>    * get_first_pass_probabilities()
>>>    * build_probabilities()
>>>    * highest_probability()
>>>
>>> Actually, the latter could be omitted as it only selects the account
>>> with the highest matching probability.
>>>
>>> Studying the code and the rare comments on the implementation it seems
>>> to be a variant of the naive bayes classifier
>>> <https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model>
>>>
>>> with the tokens used as (independent) "features" and the accounts used
>>> as "classes". But comparing this algorithm to the code leaves several
>>> questions open.
>>>
>>> Does anybody know a more precise algorithm description, on which the
>>> implementation in GnuCash is based on?
>> I'm not sure how detailed you need right now; I helped with some of the
>> initial implementations but I'm sure it's all been rewritten by now.
>> The
>
> No, I don't think so. The basic algorithm seems to be unchanged, since
> it was implemented the first time by you with commit b2ccbf6 in 2003.
> Yes, the code was moved between files and files have been renamed.
> Also the code was restructured in commit b3667c76f "Implement flat
> bayes kvp" and there were lots of smaller code changes over the time.
> But the algorithm still seems to be the same.
>
>> idea is that the description/memo strings are tokenized and used as
>> inputs
>> into the probabilities that the transaction would go into the target
>> account.  If you have a high-enough probability it will auto-select that
>> account for that transaction.
>>
>> When you assign an account (during import), it adds those tokens to the
>> account's list of tokens for future guessing.
> The main principle is clear to me.
>>
>> Did you have a specific question about the process?  For the complete
>> algorithm you can look at the code.  It's really not all that
>> complicated
>> (or at least it wasn't when first implemented).
>
> Ok, I try to be more precise.
>
> I. The implemented algorithm I read from the code is the following:
>
>                                 product_k
> P(A_k | T_1, ..., T_N) = --------------------------
>                          product_k + product_diff_k
> with
>
> product_k      =      P(A_k | T_1)  * ... *      P(A_k | T_N)
> product_diff_k = [1 - P(A_k | T_1)] * ... * [1 - P(A_k | T_N)]
>
> and with
>
> P(A_k | T_n)           ... probability of account A_k under the
>                            condition of token T_n
> P(A_k | T_1, ..., T_N) ... probability of account A_k under the
>                            condition of a list of tokens T_1, ..., T_N
> N                      ... number of tokens
>
>
> II. By applying the formulas of the Bayes classifier (see e.g.
> Wikipedia) I get the following:
>
>                                  product_k
> P(A_k | T_1, ..., T_N) = ---------------------------
>                          product_1 + ... + product_K
> with
>
> product_k = P(T_1 | A_k) * ... * P(T_N | A_k)
>
> and with
>
> P(T_n | A_k) ... probability of token T_n under the condition
>                  of account A_k
> K            ... number of accounts
> N            ... number of tokens
>
> assuming equal total probability P(A_k) for all accounts.
>
>
> III. Simplifying these formulas by only distinguishing between account
> A_k and all other accounts leads to the following formulas:
>
>                                 product_k
> P(A_k | T_1, ..., T_N) = -------------------------
>                          product_k + product_not_k
> with
>
> product_k     = P(T_1 |  A_k) * ... * P(T_N |  A_k)
> product_not_k = P(T_1 | ¬A_k) * ... * P(T_N | ¬A_k)
>
> and with
>
> P(T_n |  A_k) ... probability of token T_n under the condition
>                   of account A_k
> P(T_n | ¬A_k) ... probability of Token T_n under the condition
>                   of any other account than A_k
> K             ... number of accounts
> N             ... number of tokens
>
> assuming equal total probability for P(A_k) and P(¬A_k).
>
>
> As you can see neither the formulas in derivation II nor the formulas
> in derivation III fit to formulas in derivation I.
>
> One obvious difference is, that the implemented code uses
> probabilities P(A_k | T_n), whereas the formulas derived from bayes
> classification use P(T_n | A_k).
>
>>
>>> Regards,
>>> Christian
>> -derek