[GNC-dev] Understanding the bayesian import matching algorithm

Derek Atkins derek at ihtfp.com
Thu Jul 2 15:28:26 EDT 2020


Hi,

On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:
> Hi,
>
> while studying the Bayesian import matching algorithm further, I'm now
> at the point where I want to understand how Bayes' formula is applied
> to the problem of matching transactions to accounts using tokens. I
> need more information, though, because it isn't clear to me what is
> actually being calculated there.
>
> The implementation can be found in the following functions in Account.cpp:
>
>   * get_first_pass_probabilities()
>   * build_probabilities()
>   * highest_probability()
>
> Actually, the latter could be omitted as it only selects the account
> with the highest matching probability.
>
> From studying the code and the sparse comments on the implementation,
> it seems to be a variant of the naive Bayes classifier
> <https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model>
> with the tokens used as (independent) "features" and the accounts used
> as "classes". But comparing this algorithm to the code leaves several
> questions open.
>
> Does anybody know of a more precise description of the algorithm on
> which the GnuCash implementation is based?

I'm not sure how much detail you need right now; I helped with some of the
initial implementations but I'm sure it's all been rewritten by now.  The
idea is that the description/memo strings are tokenized, and the tokens are
used as inputs for computing the probability that the transaction belongs
in each candidate target account.  If the probability for an account is
high enough, the importer auto-selects that account for the transaction.
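
To make the idea concrete, here is a rough Python sketch of that scoring
step.  It is an illustration only, not the code in Account.cpp: the
function names, the sample token counts, the 0.9 cutoff, and the way the
per-token frequencies are combined (product of "hit" shares against
product of "miss" shares) are all assumptions chosen for the example.

```python
import re
from math import prod

# Hypothetical token statistics: token -> {account: times seen}.
TOKEN_COUNTS = {
    "shell": {"Expenses:Fuel": 9, "Expenses:Groceries": 1},
    "station": {"Expenses:Fuel": 4, "Expenses:Travel": 1},
}

THRESHOLD = 0.9  # assumed cutoff for auto-selection, not taken from GnuCash


def tokenize(text):
    """Split a description/memo string into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def account_probabilities(tokens, token_counts):
    """Combine per-token account frequencies into one score per account."""
    # Tokens never seen before carry no evidence and are skipped.
    known = [token_counts[t] for t in tokens if t in token_counts]
    accounts = {acct for counts in known for acct in counts}
    scores = {}
    for acct in accounts:
        # Share of each token's occurrences that went to this account.
        shares = [counts.get(acct, 0) / sum(counts.values()) for counts in known]
        hit = prod(shares)
        miss = prod(1.0 - s for s in shares)
        scores[acct] = hit / (hit + miss) if hit + miss > 0 else 0.0
    return scores


def guess_account(text, token_counts, threshold=THRESHOLD):
    """Return the best account if its score clears the threshold, else None."""
    scores = account_probabilities(tokenize(text), token_counts)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

With the sample counts, guess_account("SHELL Station 42", TOKEN_COUNTS)
picks "Expenses:Fuel", while a description made up entirely of unseen
tokens returns None, so nothing is auto-selected.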

When you assign an account to a transaction during import, the importer
adds that transaction's tokens to the account's list of tokens, so they can
be used for future guesses.
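
The learning side can be sketched in a few lines of Python as well.
Again, this is only an illustration under assumptions: the nested
token -> account -> count layout and the names new_token_store and learn
are hypothetical, and the real storage format in GnuCash may differ.

```python
from collections import defaultdict


def new_token_store():
    """Create an empty store mapping token -> account -> count."""
    return defaultdict(lambda: defaultdict(int))


def learn(store, tokens, account):
    """Record that each token appeared on a transaction assigned to account."""
    for token in tokens:
        store[token][account] += 1


# Example: three imported transactions, each confirmed by the user.
store = new_token_store()
learn(store, ["shell", "station"], "Expenses:Fuel")
learn(store, ["shell", "market"], "Expenses:Groceries")
learn(store, ["shell"], "Expenses:Fuel")
```

After these three assignments, "shell" has been seen twice for
Expenses:Fuel and once for Expenses:Groceries, which is the kind of count
a later guess can be based on.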

Did you have a specific question about the process?  For the complete
algorithm you can look at the code.  It's really not all that complicated
(or at least it wasn't when first implemented).

> Regards,
> Christian

-derek
-- 
       Derek Atkins                 617-623-3745
       derek at ihtfp.com             www.ihtfp.com
       Computer and Internet Security Consultant


