Import transactions change proposal

Christopher Browne cbbrowne at
Mon Feb 10 17:59:02 CST 2003

>> I had a plan for a matching heuristic, but I think the bayesian
>> filter is a better idea. Any hard coded heuristic will work well for
>> some people but fail completely for others. A bayesian filter should
>> adapt to various systems with different data formats much better.

> I haven't looked at ifilter, so I don't know what kind of data-store
> it requires.  I still want to maintain a data store rather than trying
> to build it at runtime.  I just have no clue about what would need to
> be stored in such a database.

At first thought, ifilter sounded insane, but at second thought, it may
not be /too/ horribly bad.

What it does is to store statistics on the incidences of particular
words in association with each "mail folder."

So, for email, we'd have a list of folders:

- Spam = 0, GnuCash = 1, PostgreSQL = 2, and Family = 3

Then there are words that occur in email.  We collect stats on words, in
association with the folders.

Hot   --> Spam:100  GnuCash:0   PostgreSQL:5   Family:10
Sex   --> Spam:250  GnuCash:0   PostgreSQL:0   Family:1
David --> Spam:0    GnuCash:0   PostgreSQL:0   Family:45
Carla --> Spam:0    GnuCash:0   PostgreSQL:0   Family:30
QIF   --> Spam:1    GnuCash:40  PostgreSQL:0   Family:0
SQL   --> Spam:2    GnuCash:8   PostgreSQL:48  Family:0

There would presumably be a whole lot more words than that.

We then take a new message and look at what words it has in it.  For
each word, we compare its stats across the message folders.

One that has "David" and "Carla" in it will likely show a strong association
with "Family", and none with any of the other folders.

One that has the words "Hot" and "Sex" will likely have weak correlation
with PostgreSQL and Family, and show strong association with Spam.

There's a log-based weighting done so that /all/ the words get taken
jointly into consideration based on their various weightings.
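
To make the log-based weighting concrete, here is a rough sketch using
the word counts from the table above.  It is NOT ifilter's actual code;
I'm assuming a plain naive-Bayes style score with add-one smoothing and
no per-folder priors, and the names (COUNTS, score, classify) are made
up for illustration.

```python
import math

# Word counts per folder, copied from the table above.
FOLDERS = ["Spam", "GnuCash", "PostgreSQL", "Family"]
COUNTS = {
    "Hot":   {"Spam": 100, "GnuCash": 0,  "PostgreSQL": 5,  "Family": 10},
    "Sex":   {"Spam": 250, "GnuCash": 0,  "PostgreSQL": 0,  "Family": 1},
    "David": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 45},
    "Carla": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 30},
    "QIF":   {"Spam": 1,   "GnuCash": 40, "PostgreSQL": 0,  "Family": 0},
    "SQL":   {"Spam": 2,   "GnuCash": 8,  "PostgreSQL": 48, "Family": 0},
}


def score(words, counts=COUNTS, folders=FOLDERS, alpha=1.0):
    """Return {folder: sum of log word likelihoods} for the given words.

    Assumption: add-alpha smoothing so a zero count does not produce
    log(0); words that were never seen at all are simply skipped.
    """
    vocab_size = len(counts)
    totals = {f: sum(c[f] for c in counts.values()) for f in folders}
    scores = {}
    for f in folders:
        total = 0.0
        for w in words:
            per_folder = counts.get(w)
            if per_folder is None:
                continue  # unknown word carries no evidence
            total += math.log(
                (per_folder[f] + alpha) / (totals[f] + alpha * vocab_size)
            )
        scores[f] = total
    return scores


def classify(words):
    """Pick the folder with the highest joint log score."""
    scores = score(words)
    return max(scores, key=scores.get)
```

Summing logs is what lets /all/ the words contribute jointly: one word
with a strong association can outweigh several weak ones, but no single
zero count vetoes a folder outright.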

On the one hand, it seems an attractive idea to try to do something like
this with GnuCash; this should provide a nice way of adaptively having
incoming transactions categorize themselves to the "nearest" existing

Three challenges/problems leap to mind:

1.  Ifilter performance gets /greatly/ uglier as the number
of categories grows.  Doing this totally automagically means that each
transaction is a "category," and if there are thousands of transactions,
that's not terribly nice.

2.  Related to 1, what if the transaction "looks nearly like" a dozen
transactions?  Which do you choose?

3.  What if a bunch of transactions are already very similar?  For
instance, my monthly rent goes to the same payee for the same amount
on the same day of each month.  These transactions are 'essentially the
same' for our purposes, and really should get collected together into
one "category."  It would sure be nice to collect them together; that
cuts down on the number of categories, and means that rather than there
being a bunch of "nearly similar" categories, one for each transaction,
there's just one category.
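
As a rough illustration of collapsing such transactions into one
category, here is a sketch that groups on (payee, amount, day of
month).  The tuple layout, the integer-cents amounts, and the function
names are all my own assumptions, not anything GnuCash provides, and a
real matcher would surely need to be fuzzier than exact equality.

```python
from collections import defaultdict
from datetime import date


def recurrence_key(payee, amount_cents, when):
    """Key under which 'essentially the same' transactions collide.

    Assumption: same payee (ignoring case and stray whitespace), same
    amount in integer cents, same day of the month.
    """
    return (payee.strip().lower(), amount_cents, when.day)


def group_recurring(transactions):
    """Map each recurrence key to the list of dates it occurred on."""
    groups = defaultdict(list)
    for payee, amount_cents, when in transactions:
        groups[recurrence_key(payee, amount_cents, when)].append(when)
    return dict(groups)


# Hypothetical sample data: three rent payments and one grocery run.
SAMPLE = [
    ("Landlord", 85000, date(2003, 1, 1)),
    ("Landlord", 85000, date(2003, 2, 1)),
    ("landlord ", 85000, date(2003, 3, 1)),
    ("Grocer", 4217, date(2003, 2, 7)),
]
```

With the sample above, the three rent payments land in a single group,
so the filter would see one "rent" category rather than three.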

But how do we provide a "user interface" that allows them to be so

Solve #3 and the other two fall into insignificance.  

Jumping on to #5 (there is NO #4!)...

5.  Of course, give me the ability to "memorize" a transaction and have
it repeat each month and I may not even /want/ to import such
transactions anymore...

If I'm generating monthly rent transactions as scheduled
transactions, and doing the same with various other transactions,
loading data from the bank might become totally redundant...

(reverse (concatenate 'string "ac.notelrac.teneerf@" "454aa"))
Frisbeetarianism: The belief that when  you die, your  soul goes up on
the roof and gets stuck...
