Import transactions change proposal

Christopher Browne gnucash at cbbrowne.com
Tue Feb 11 22:59:35 CST 2003


> On Monday 10 February 2003 05:59 pm, Christopher Browne wrote:
> > >> I had a plan for a matching heuristic, but I think the bayesian
> > >> filter is a better idea. Any hard coded heuristic will work well for
> > >> some people but fail completely for others. A bayesian filter should
> > >> adapt to various systems with different data formats much better.
> > >
> > > I haven't looked at ifilter, so I don't know what kind of data-store
> > > it requires.  I still want to maintain a data store rather than trying
> > > to build it at runtime.  I just have no clue about what would need to
> > > be stored in such a database.
> >
> > At first thought, ifilter sounded insane, but at second thought, it may
> > not be /too/ horribly bad.
> >
> > What it does is to store statistics on the incidences of particular
> > words in association with each "mail folder."
> >
> > So, for email, we'd have a list of folders:
> >
> > - Spam = 0, GnuCash = 1, PostgreSQL = 2, and Family = 3
> >
> > Then there are words that occur in email.  We collect stats on words, in
> > association with the folders.
> >
> > Hot   --> Spam:100  GnuCash:0   PostgreSQL:5   Family:10
> > Sex   --> Spam:250  GnuCash:0   PostgreSQL:0   Family:1
> > David --> Spam:0    GnuCash:0   PostgreSQL:0   Family:45
> > Carla --> Spam:0    GnuCash:0   PostgreSQL:0   Family:30
> > QIF   --> Spam:1    GnuCash:40  PostgreSQL:0   Family:0
> > SQL   --> Spam:2    GnuCash:8   PostgreSQL:48  Family:0
> >
> > There would presumably be a whole lot more words than that.
> >
> > We then look at a new message, and look at what words it has in it.  For
> > each word, compare the stats with each of the message folders.
> >
> > One that has "David" and "Carla" in it will likely show a strong
> > association with "Family", and none with any of the other folders.
> >
> > One that has the words "Hot" and "Sex" will likely have weak correlation
> > with PostgreSQL and Family, and show strong association with Spam.
> >
> > There's a log-based weighting done so that /all/ the words get taken
> > jointly into consideration based on their various weightings.
> >
> > On the one hand, it seems an attractive idea to try to do something like
> > this with GnuCash; this should provide a nice way of adaptively having
> > incoming transactions categorize themselves to the "nearest" existing
> > transaction.
> >
> 
> Yep, this is the plan based upon Mr. Graham's webpage on spam filtering.

No, it is most certainly not.

It is based on /my/ web page on spam filtering, which was in place several 
years before Mr. Graham ever considered the idea.  I presented a talk on this 
back in the mid-90s, and the code I'm using has been pretty solid since about 
1997.

> > Three challenges/problems leap to mind:
> >
> > 1.  Ifilter performance degrades /greatly/ as the number of categories
> > grows.  Doing this totally automagically means that each transaction
> > is a "category," and if there are thousands of transactions, that's
> > not terribly nice.
> >
> > 2.  Related to 1, what if the transaction "looks nearly like" a dozen
> > transactions?  Which do you choose?

> The match isn't based on a single transaction but rather the sum of the 
> transactions in any given destination account.  A patch should be forthcoming
> in a few days that implements the whole algorithm.

Including the logarithmic normalization associated with Bayesian Filtering?  
(Graham's method oversimplifies it...)
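
To make the question concrete, here is a rough sketch of the kind of
log-weighted scoring I mean, with one "folder" per destination account.
It is not ifilter's code and not the forthcoming patch; the class name,
the account names, the whitespace tokenizer, and the add-one smoothing
are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: split the description on whitespace, ignoring case.
    return text.lower().split()


class AccountClassifier:
    """Hypothetical per-account word statistics with log-based scoring."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # account -> word -> count
        self.txn_counts = Counter()              # account -> transactions seen

    def train(self, account, description):
        self.txn_counts[account] += 1
        self.word_counts[account].update(tokenize(description))

    def scores(self, description):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_txns = sum(self.txn_counts.values())
        result = {}
        for account, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the log of how common the account is overall.
            score = math.log(self.txn_counts[account] / total_txns)
            for word in tokenize(description):
                # Add-one smoothing keeps unseen words from giving log(0).
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            result[account] = score
        return result

    def classify(self, description):
        # Requires at least one trained transaction.
        s = self.scores(description)
        return max(s, key=s.get)


clf = AccountClassifier()
clf.train("Expenses:Rent", "ACME PROPERTY MGMT RENT")
clf.train("Expenses:Groceries", "SAFEWAY STORE 1234")
clf.train("Expenses:Groceries", "SAFEWAY STORE 0042")
print(clf.classify("SAFEWAY STORE 0977"))
```

Summing logs rather than multiplying raw ratios is what lets /all/ the
words contribute jointly without the product underflowing to zero.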

> > 3.  What if a bunch of transactions are already very similar?  For
> > instance, my monthly rent goes to the same payee for the same amount
> > on the same day of each month.  These transactions are 'essentially the
> > same' for our purposes, and really should get collected together into
> > one "category."  It would sure be nice to collect them together; that
> > cuts down on the number of categories, and means that rather than
> > there being a bunch of "nearly similar" categories, one for each
> > transaction, there's just one category.

> There is nothing that can be done with these transactions as they contain 
> little or no information regarding what they are for.  Unfortunately they 
> must be sorted with what little information they provide.

Actually, since the point is to figure out the destination category, the fact 
that the similar transactions /are/ in the same category means that they will 
all strengthen the association.
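
A toy illustration of that strengthening, under assumptions of my own:
the word lists, the counts for the "other" account, and the add-one
smoothing are invented, and account priors are left out to keep it short.
The margin in favour of the rent account should grow as more identical
rent transactions land in it.

```python
import math
from collections import Counter


def log_score(word_counts, words, vocab_size):
    # Sum of smoothed log word frequencies for one account.
    total = sum(word_counts.values())
    return sum(math.log((word_counts[w] + 1) / (total + vocab_size))
               for w in words)


rent_txn = ["acme", "property", "rent"]                   # hypothetical description
other = Counter({"safeway": 5, "store": 5, "rent": 1})    # hypothetical other account
vocab_size = 5                                            # acme, property, rent, safeway, store


def margin(n_rent_txns):
    # Train the rent account on n identical transactions, then compare
    # its log score for the same description against the other account.
    rent = Counter()
    for _ in range(n_rent_txns):
        rent.update(rent_txn)
    return (log_score(rent, rent_txn, vocab_size)
            - log_score(other, rent_txn, vocab_size))


margins = [margin(n) for n in (1, 3, 12)]
print(margins)
```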

> > But how do we provide a "user interface" that allows them to be so
> > grouped???
> >
> > Solve #3 and the other two fall into insignificance.
> >
> > Jumping on to #5 (there is NO #4!)...
> >
> > 5.  Of course, give me the ability to "memorize" a transaction and have
> > it repeat each month and I may not even /want/ to import such
> > transactions anymore...

> What does this have to do with importing transactions? ;-)

Someone was complaining today at the office about the fact that their bank
only supplied OFX files, and that they weren't sure what would load them.  I
suggested that scheduled transactions would be pretty helpful in handling this
even if the GnuCash OFX support /didn't/ exist...  There's certainly more than
one way to skin a cat...
--
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/ifilter.html
Whatever  you  do don't  mail  me at  pink-and-wobbly at asdkjlwelkj.com,
because then I'll know you're just an address-harvester, and blacklist
your IP until the end of time



