Import transactions change proposal

Tue Feb 11 15:37:56 CST 2003

On Monday 10 February 2003 05:59 pm, Christopher Browne wrote:
> >> I had a plan for a matching heuristic, but I think the bayesian
> >> filter is a better idea. Any hard coded heuristic will work well for
> >> some people but fail completely for others. A bayesian filter should
> >> adapt to various systems with different data formats much better.
> >
> > I haven't looked at ifilter, so I don't know what kind of data-store
> > it requires.  I still want to maintain a data store rather than trying
> > to build it at runtime.  I just have no clue about what would need to
> > be stored in such a database.
>
> At first thought, ifilter sounded insane, but at second thought, it may
> not be /too/ horribly bad.
>
> What it does is to store statistics on the incidences of particular
> words in association with each "mail folder."
>
> So, for email, we'd have a list of folders:
>
> - Spam = 0, GnuCash = 1, PostgreSQL = 2, and Family = 3
>
> Then there are words that occur in email.  We collect stats on words, in
> association with the folders.
>
> Hot   --> Spam:100  GnuCash:0   PostgreSQL:5   Family:10
> Sex   --> Spam:250  GnuCash:0   PostgreSQL:0   Family:1
> David --> Spam:0    GnuCash:0   PostgreSQL:0   Family:45
> Carla --> Spam:0    GnuCash:0   PostgreSQL:0   Family:30
> QIF   --> Spam:1    GnuCash:40  PostgreSQL:0   Family:0
> SQL   --> Spam:2    GnuCash:8   PostgreSQL:48  Family:0
>
> There would presumably be a whole lot more words than that.
>
> We then look at a new message, and look at what words it has in it.  For
> each word, compare the stats with each of the message folders.
>
> One that has "David" and "Carla" in it will likely show a strong
> association with "Family", and none with any of the other folders.
>
> One that has the words "Hot" and "Sex" will likely have weak correlation
> with PostgreSQL and Family, and show strong association with Spam.
>
> There's a log-based weighting done so that /all/ the words get taken
> jointly into consideration based on their various weightings.
>
> On the one hand, it seems an attractive idea to try to do something like
> this with GnuCash; this should provide a nice way of adaptively having
> incoming transactions categorize themselves to the "nearest" existing
> transaction.
>

Yep, this is the plan, based on Mr. Graham's web page on spam filtering.
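
To make the word-statistics idea concrete, here is a rough sketch in Python
that scores a list of words against the example table quoted above. The
counts are copied from that table. Everything else is my own assumption for
illustration: the names score() and best_folder(), the add-one smoothing, and
the equal prior for every folder. It is a plain naive-Bayes style sum of
logs, not ifilter's code and not Mr. Graham's exact formula.

```python
import math

# Word counts per folder, copied from the example table in the quoted message.
COUNTS = {
    "Hot":   {"Spam": 100, "GnuCash": 0,  "PostgreSQL": 5,  "Family": 10},
    "Sex":   {"Spam": 250, "GnuCash": 0,  "PostgreSQL": 0,  "Family": 1},
    "David": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 45},
    "Carla": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 30},
    "QIF":   {"Spam": 1,   "GnuCash": 40, "PostgreSQL": 0,  "Family": 0},
    "SQL":   {"Spam": 2,   "GnuCash": 8,  "PostgreSQL": 48, "Family": 0},
}
FOLDERS = ["Spam", "GnuCash", "PostgreSQL", "Family"]


def score(words, counts=COUNTS, folders=FOLDERS, smoothing=1.0):
    """Return {folder: log-score} for a list of words.

    Assumptions (not from the original thread): add-one smoothing so a
    zero count does not produce log(0), and an equal prior for each folder.
    """
    # Total number of word occurrences recorded for each folder.
    totals = {f: sum(row[f] for row in counts.values()) for f in folders}
    vocab_size = len(counts)
    scores = {}
    for folder in folders:
        log_score = 0.0
        for word in words:
            if word not in counts:
                continue  # unknown words carry no evidence either way
            p = (counts[word][folder] + smoothing) / (
                totals[folder] + smoothing * vocab_size
            )
            log_score += math.log(p)  # summing logs combines all words jointly
        scores[folder] = log_score
    return scores


def best_folder(words):
    """Pick the folder with the highest combined log-score."""
    scores = score(words)
    return max(scores, key=scores.get)
```

With these numbers, best_folder(["David", "Carla"]) picks "Family" and
best_folder(["Hot", "Sex"]) picks "Spam", which matches the behaviour
described in the quoted message.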

> Three challenges/problems leap to mind:
>
> 1.  Ifilter performance gets increasingly /greatly/ ugly as the number
> of categories grows.  Doing this totally automagically means that each
> transaction is a "category," and if there are thousands of transactions,
> that's not terribly nice.
>
> 2.  Related to 1, what if the transaction "looks nearly like" a dozen
> transactions?  Which do you choose?
>

The match isn't based on a single transaction but rather on the sum of the
transactions in any given destination account, so the "categories" are the
accounts, not the individual transactions.  A patch that implements the whole
algorithm should be forthcoming in a few days.
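
To illustrate what "the sum of the transactions in a destination account"
means, here is a small Python sketch. This is only an illustration, not the
code in the patch. The function names (tokenize, train, guess_account), the
whitespace tokenizer, and the add-one smoothing are all assumptions of mine.

```python
import math
from collections import Counter, defaultdict


def tokenize(description):
    """Split a transaction description into lowercase tokens (assumed scheme)."""
    return description.lower().split()


def train(history):
    """Build one token counter per destination account.

    history is an iterable of (description, destination_account) pairs.
    Every transaction filed under an account adds its tokens to that
    account's counter, so the number of categories equals the number of
    accounts, not the number of transactions.
    """
    per_account = defaultdict(Counter)
    for description, account in history:
        per_account[account].update(tokenize(description))
    return per_account


def guess_account(description, per_account):
    """Return the account whose accumulated tokens best match the description.

    Uses a sum of log probabilities with add-one smoothing (an assumption).
    Returns None if nothing has been trained yet.
    """
    vocab = {tok for counter in per_account.values() for tok in counter}
    best, best_score = None, -math.inf
    for account, counter in per_account.items():
        total = sum(counter.values())
        log_score = sum(
            math.log((counter[tok] + 1) / (total + len(vocab)))
            for tok in tokenize(description)
        )
        if log_score > best_score:
            best, best_score = account, log_score
    return best
```

A transaction that "looks nearly like" a dozen earlier ones is then not a
problem in this sketch: all dozen have already been folded into the counters
of whichever accounts they were filed under, and only the accounts compete.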

> 3.  What if a bunch of transactions are already very similar?  For
> instance, my monthly rent goes in to the same payee for the same amount
> on the same day of each month.  These transactions are 'essentially the
> same' for our purposes, and really should get collected together into
> one "category."  It would sure be nice to collect them together; that
> cuts down on the number of categories, and means that rather than there
> being a bunch of "nearly similar" categories, one for each transaction,
> that there's just one category.
>

There is nothing that can be done with these transactions as they contain 
little or no information regarding what they are for.  Unfortunately they 
must be sorted with what little information they provide.

> But how do we provide a "user interface" that allows them to be so
> grouped???
>
> Solve #3 and the other two fall into insignificance.
>
> Jumping on to #5 (there is NO #4!)...
>
> 5.  Of course, give me the ability to "memorize" a transaction and have
> it repeat each month and I may not even /want/ to import such
> transactions anymore...
>

What does this have to do with importing transactions? ;-)

Chris