[GNC-dev] Understanding the bayesian import matching algorithm
Derek Atkins
derek at ihtfp.com
Thu Jul 2 15:28:26 EDT 2020
Hi,
On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:
> Hi,
>
> while further studying the Bayesian import matching algorithm I'm now at
> the point where I want to understand how Bayes' formula is
> applied to the problem of matching transactions to accounts using
> tokens. But I need further information, since it isn't clear to
> me what is actually calculated there.
>
> The implementation can be found in the following functions in Account.cpp:
>
> * get_first_pass_probabilities()
> * build_probabilities()
> * highest_probability()
>
> Actually, the latter could be omitted as it only selects the account
> with the highest matching probability.
>
> Studying the code and the sparse comments on the implementation, it seems
> to be a variant of the naive Bayes classifier
> <https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model>
> with the tokens used as (independent) "features" and the accounts used
> as "classes". But comparing this algorithm to the code leaves several
> questions open.
>
> Does anybody know of a more precise description of the algorithm on which
> the implementation in GnuCash is based?
I'm not sure how much detail you need right now; I helped with some of the
initial implementations but I'm sure it's all been rewritten by now. The
idea is that the description/memo strings are tokenized and used as inputs
into the probabilities that the transaction would go into the target
account. If the probability is high enough, it will auto-select that
account for that transaction.
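
To make the idea concrete, here is a minimal sketch of token-based scoring
in the spirit of a textbook naive Bayes classifier. It is not the code in
Account.cpp and may differ from what GnuCash actually computes. The names
tokenize, account_probabilities and pick_account, the whitespace
tokenizer, the add-one smoothing and the 0.7 threshold are all assumptions
made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a description/memo string into lowercase tokens (assumed scheme)."""
    return text.lower().split()


def account_probabilities(tokens, token_counts):
    """Return a normalized naive-Bayes-style probability per account.

    token_counts maps an account name to a Counter of token -> times seen.
    Add-one smoothing keeps unseen tokens from zeroing out a score.
    """
    if not token_counts:
        return {}
    vocab = set()
    for counts in token_counts.values():
        vocab.update(counts)
    total_all = sum(sum(c.values()) for c in token_counts.values())
    n_accounts = len(token_counts)

    log_scores = {}
    for account, counts in token_counts.items():
        total = sum(counts.values())
        # Smoothed prior: this account's share of all recorded tokens.
        log_p = math.log((total + 1) / (total_all + n_accounts))
        for tok in tokens:
            # Smoothed likelihood of seeing this token in this account.
            log_p += math.log((counts[tok] + 1) / (total + len(vocab) + 1))
        log_scores[account] = log_p

    # Normalize in log space so the probabilities sum to 1.
    peak = max(log_scores.values())
    weights = {a: math.exp(s - peak) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}


def pick_account(description, token_counts, threshold=0.7):
    """Return the best account if its probability clears the threshold, else None."""
    probs = account_probabilities(tokenize(description), token_counts)
    if not probs:
        return None
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None
```

With token counts learned for two accounts, a description that shares
tokens with one of them is auto-selected, while a description made up only
of unknown tokens scores evenly across accounts and is left for the user
to assign.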
When you assign an account (during import), it adds those tokens to the
account's list of tokens for future guessing.
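
The learning step described above could look roughly like the following
sketch. The TokenStore class, its learn method and the whitespace
tokenizer are hypothetical names chosen for illustration; they are not
taken from the GnuCash sources, which store this data differently.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split a description/memo string into lowercase tokens (assumed scheme)."""
    return text.lower().split()


class TokenStore:
    """Per-account token counts, updated each time the user assigns an account."""

    def __init__(self):
        # account name -> Counter of token -> number of times seen
        self.counts = defaultdict(Counter)

    def learn(self, account, description, memo=""):
        """Record the tokens of one imported transaction against an account."""
        for tok in tokenize(description) + tokenize(memo):
            self.counts[account][tok] += 1


store = TokenStore()
store.learn("Expenses:Fuel", "Shell Station", memo="fuel")
store.learn("Expenses:Fuel", "SHELL 0042")
```

After these two calls the token "shell" has been seen twice for
Expenses:Fuel, so a later import containing "shell" would lean toward that
account.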
Did you have a specific question about the process? For the complete
algorithm you can look at the code. It's really not all that complicated
(or at least it wasn't when first implemented).
> Regards,
> Christian
-derek
--
Derek Atkins 617-623-3745
derek at ihtfp.com www.ihtfp.com
Computer and Internet Security Consultant