[GNC-dev] Understanding the bayesian import matching algorithm
derek at ihtfp.com
Thu Jul 2 15:28:26 EDT 2020
On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:
> While studying the Bayesian import matching algorithm further, I'm now at
> the point where I want to understand how Bayes' formula is
> applied to the problem of matching transactions to accounts using
> tokens. But I need further information, since it isn't clear to
> me what is actually calculated there.
> The implementation can be found in the following functions in Account.cpp:
> * get_first_pass_probabilities()
> * build_probabilities()
> * highest_probability()
> Actually, the latter could be omitted as it only selects the account
> with the highest matching probability.
> Judging from the code and the sparse comments on the implementation, it
> seems to be a variant of the naive Bayes classifier,
> with the tokens used as (independent) "features" and the accounts used
> as "classes". But comparing this algorithm to the code leaves several
> questions open.
> Does anybody know of a more precise description of the algorithm on
> which the implementation in GnuCash is based?
I'm not sure how much detail you need right now; I helped with some of the
initial implementations but I'm sure it's all been rewritten by now. The
idea is that the description/memo strings are tokenized, and the tokens are
used as inputs for computing the probability that the transaction belongs
in each candidate account. If the probability for an account is high
enough, the importer will auto-select that account for that transaction.
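
As a rough illustration of that idea, here is a generic naive Bayes sketch in Python. It is not the code in Account.cpp and may differ from what GnuCash actually computes. The function names, the per-account token-count layout, the add-one smoothing, and the 0.7 threshold are all assumptions made for the example.

```python
import math


def tokenize(text):
    """Lowercase a description/memo string and split it on whitespace."""
    return text.lower().split()


def account_probabilities(tokens, token_counts):
    """Return {account: probability} for the given tokens.

    token_counts maps account name -> {token: times seen}. This layout is
    an assumption made for the sketch, not GnuCash's storage format.
    """
    # Ignore accounts that have no recorded tokens yet.
    totals = {a: sum(c.values()) for a, c in token_counts.items()}
    totals = {a: n for a, n in totals.items() if n > 0}
    if not totals:
        return {}
    vocab_size = len({t for a in totals for t in token_counts[a]})
    grand_total = sum(totals.values())

    log_scores = {}
    for account, total in totals.items():
        # Prior: this account's share of all recorded tokens.
        score = math.log(total / grand_total)
        for token in tokens:
            # Likelihood with add-one smoothing, so an unseen token
            # lowers the score instead of zeroing it out.
            seen = token_counts[account].get(token, 0)
            score += math.log((seen + 1) / (total + vocab_size))
        log_scores[account] = score

    # Convert log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    weights = {a: math.exp(s - peak) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}


def pick_account(text, token_counts, threshold=0.7):
    """Return the most probable account, or None if it is below threshold."""
    probs = account_probabilities(tokenize(text), token_counts)
    if not probs:
        return None
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None
```

With counts recorded for two accounts, a description whose tokens were seen mostly under one of them would be auto-selected, while a description made of unseen tokens would fall below the threshold and be left for the user to assign.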
When you assign an account (during import), it adds those tokens to the
account's list of tokens for future guessing.
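
The learning step could be sketched like this in Python. Again, this is only an illustration under assumptions: `record_assignment` and the nested-counter layout are hypothetical, and keeping a count per token (rather than a plain list) is a choice made for the example, not a statement about how GnuCash stores the data.

```python
from collections import defaultdict


def record_assignment(token_counts, account, text):
    """Add the tokens of a description/memo string to an account's counts."""
    for token in text.lower().split():
        token_counts[account][token] += 1


# account name -> token -> number of times the token was seen
token_counts = defaultdict(lambda: defaultdict(int))

# The user assigns two imported transactions to the same account.
record_assignment(token_counts, "Expenses:Groceries", "SAFEWAY store 1234")
record_assignment(token_counts, "Expenses:Groceries", "Safeway fuel")
```

After these two assignments, the token "safeway" has been counted twice for that account, so it weighs more heavily in later guesses than tokens seen only once.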
Did you have a specific question about the process? For the complete
algorithm you can look at the code. It's really not all that complicated
(or at least it wasn't when first implemented).
Derek Atkins 617-623-3745
derek at ihtfp.com www.ihtfp.com
Computer and Internet Security Consultant