[GNC-dev] Understanding the bayesian import matching algorithm

Sun Jul 5 19:05:06 EDT 2020

Am 02.07.20 um 21:28 schrieb Derek Atkins:
> Hi,
>
> On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:
>> Hi,
>>
>> while further studying the bayesian import matching algorithm I'm now at
>> the point, where I wanted to understand, how the bayes formula is
>> applied to the problem of matching transactions to accounts using
>> tokens. But I need further information, since it doesn't come clear to
>> me what is really calculated there.
>>
>> The implementation can be found in the following functions in Account.cpp:
>>
>>    * get_first_pass_probabilities()
>>    * build_probabilities()
>>    * highest_probability()
>>
>> Actually, the latter could be omitted as it only selects the account
>> with the highest matching probability.
>>
>> Studying the code and the rare comments on the implementation it seems
>> to be a variant of the naive bayes classifier
>> <https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model>
>> with the tokens used as (independent) "features" and the accounts used
>> as "classes". But comparing this algorithm to the code leaves several
>> questions open.
>>
>> Does anybody know a more precise algorithm description, on which the
>> implementation in GnuCash is based on?
> I'm not sure how detailed you need right now; I helped with some of the
> initial implementations but I'm sure it's all been rewritten by now.  The

No, I don't think so. The basic algorithm seems to be unchanged, since 
it was implemented the first time by you with commit b2ccbf6 in 2003. 
Yes, the code was moved between files and files have been renamed. Also 
the code was restructured in commit b3667c76f "Implement flat bayes kvp" 
and there were lots of smaller code changes over the time. But the 
algorithm still seems to be the same.

> idea is that the description/memo strings are tokenized and used as inputs
> into the probabilities that the transaction would go into the target
> account.  If you have a high-enough probability it will auto-select that
> account for that transaction.
>
> When you assign an account (during import), it adds those tokens to the
> account's list of tokens for future guessing.
The main principle is clear to me.
>
> Did you have a specific question about the process?  For the complete
> algorithm you can look at the code.  It's really not all that complicated
> (or at least it wasn't when first implemented).

Ok, I try to be more precise.

I. The implemented algorithm I read from the code is the following:

                                 product_k
P(A_k | T_1, ..., T_N) = --------------------------
                          product_k + product_diff_k
with

product_k      =      P(A_k | T_1)  * ... *      P(A_k | T_N)
product_diff_k = [1 - P(A_k | T_1)] * ... * [1 - P(A_k | T_N)]

and with

P(A_k | T_n)           ... probability of account A_k under the
                            condition of token T_n
P(A_k | T_1, ..., T_N) ... probability of account A_k under the
                            condition of a list of tokens T_1, ..., T_N
N                      ... number of tokens

II. By applying the formulas of the Bayes classifier (see e.g. Wikipedia) I get the following:

                                  product_k
P(A_k | T_1, ..., T_N) = ---------------------------
                          product_1 + ... + product_K
with

product_k = P(T_1 | A_k) * ... * P(T_N | A_k)

and with

P(T_n | A_k) ... probability of token T_n under the condition
                  of account A_k
K            ... number of accounts
N            ... number of tokens

assuming equal total probability P(A_k) for all accounts.

III. Simplifying these formulas by only distinguishing between account A_k and all other accounts leads to the following formulas:

                                 product_k
P(A_k | T_1, ..., T_N) = -------------------------
                          product_k + product_not_k
with

product_k     = P(T_1 |  A_k) * ... * P(T_N |  A_k)
product_not_k = P(T_1 | ¬A_k) * ... * P(T_N | ¬A_k)

and with

P(T_n |  A_k) ... probability of token T_n under the condition
                   of account A_k
P(T_n | ¬A_k) ... probability of Token T_n under the condition
                   of any other account than A_k
K             ... number of accounts
N             ... number of tokens

assuming equal total probability for P(A_k) and P(¬A_k).

As you can see neither the formulas in derivation II nor the formulas in derivation III fit to formulas in derivation I.

One obvious difference is, that the implemented code uses probabilities P(A_k | T_n), whereas the formulas derived from bayes classification use P(T_n | A_k).

>
>> Regards,
>> Christian
> -derek