Import transactions change proposal
gnucash at cbbrowne.com
Tue Feb 11 22:59:35 CST 2003
> On Monday 10 February 2003 05:59 pm, Christopher Browne wrote:
> > >> I had a plan for a matching heuristic, but I think the bayesian
> > >> filter is a better idea. Any hard coded heuristic will work well for
> > >> some people but fail completely for others. A bayesian filter should
> > >> adapt to various systems with different data formats much better.
> > >
> > > I haven't looked at ifilter, so I don't know what kind of data-store
> > > it requires. I still want to maintain a data store rather than trying
> > > to build it at runtime. I just have no clue about what would need to
> > > be stored in such a database.
> > At first thought, ifilter sounded insane, but at second thought, it may
> > not be /too/ horribly bad.
> > What it does is to store statistics on the incidences of particular
> > words in association with each "mail folder."
> > So, for email, we'd have a list of folders:
> > - Spam = 0, GnuCash = 1, PostgreSQL = 2, and Family = 3
> > Then there are words that occur in email. We collect stats on words, in
> > association with the folders.
> > Hot   --> Spam:100 GnuCash:0  PostgreSQL:5  Family:10
> > Sex   --> Spam:250 GnuCash:0  PostgreSQL:0  Family:1
> > David --> Spam:0   GnuCash:0  PostgreSQL:0  Family:45
> > Carla --> Spam:0   GnuCash:0  PostgreSQL:0  Family:30
> > QIF   --> Spam:1   GnuCash:40 PostgreSQL:0  Family:0
> > SQL   --> Spam:2   GnuCash:8  PostgreSQL:48 Family:0
> > There would presumably be a whole lot more words than that.
> > We then look at a new message, and look at what words it has in it. For
> > each word, compare the stats with each of the message folders.
> > One that has "David" and "Carla" in it will likely show a strong
> > association with "Family", and none with any of the other folders.
> > One that has the words "Hot" and "Sex" will likely have weak correlation
> > with PostgreSQL and Family, and show strong association with Spam.
> > There's a log-based weighting done so that /all/ the words get taken
> > jointly into consideration based on their various weightings.
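The word-count scheme described above can be sketched roughly as follows. This is not ifilter's actual code: it is a minimal multinomial naive Bayes sketch that assumes add-one smoothing and equal weight for every folder, and the names COUNTS, folder_scores and best_folder are made up for illustration. The counts are the ones from the table in the message.

```python
import math

# Word counts per folder, copied from the table in the message.
COUNTS = {
    "hot":   {"Spam": 100, "GnuCash": 0,  "PostgreSQL": 5,  "Family": 10},
    "sex":   {"Spam": 250, "GnuCash": 0,  "PostgreSQL": 0,  "Family": 1},
    "david": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 45},
    "carla": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 30},
    "qif":   {"Spam": 1,   "GnuCash": 40, "PostgreSQL": 0,  "Family": 0},
    "sql":   {"Spam": 2,   "GnuCash": 8,  "PostgreSQL": 48, "Family": 0},
}
FOLDERS = ["Spam", "GnuCash", "PostgreSQL", "Family"]


def folder_scores(words):
    """Return a log-likelihood score per folder for a list of words."""
    vocab_size = len(COUNTS)
    # Total number of counted word occurrences in each folder.
    totals = {f: sum(c[f] for c in COUNTS.values()) for f in FOLDERS}
    scores = {f: 0.0 for f in FOLDERS}
    for word in words:
        per_folder = COUNTS.get(word.lower())
        if per_folder is None:
            continue  # assumption: words never seen before carry no evidence
        for f in FOLDERS:
            # Add-one smoothing (an assumption) keeps a zero count from
            # producing log(0); summing logs combines all words jointly.
            scores[f] += math.log((per_folder[f] + 1) / (totals[f] + vocab_size))
    return scores


def best_folder(words):
    """Pick the folder with the highest combined score."""
    scores = folder_scores(words)
    return max(scores, key=scores.get)
```

With these counts, a message containing "David" and "Carla" scores highest for Family, and one containing "Hot" and "Sex" scores highest for Spam, which matches the behaviour described in the message.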
> > On the one hand, it seems an attractive idea to try to do something like
> > this with GnuCash; this should provide a nice way of adaptively having
> > incoming transactions categorize themselves to the "nearest" existing
> > transaction.
> Yep, this is the plan based upon Mr. Graham's webpage on spam filtering.
No, it is most certainly not.
It is based on /my/ web page on spam filtering, which was in place several
years before Mr. Graham ever considered the idea. I presented a talk on this
back in the mid-90s, and the code I'm using has been pretty solid since about
> > Three challenges/problems leap to mind:
> > 1. Ifilter performance gets increasingly /greatly/ ugly as the number
> > of categories grows. Doing this totally automagically means that each
> > transaction is a "category," and if there are thousands of transactions,
> > that's not terribly nice.
> > 2. Related to 1, what if the transaction "looks nearly like" a dozen
> > transactions? Which do you choose?
> The match isn't based on a single transaction but rather the sum of the
> transactions in any given destination account. A patch should be forthcoming
> in a few days that implements the whole algorithm.
Including the logarithmic normalization associated with Bayesian Filtering?
(Graham's method oversimplifies it...)
> > 3. What if a bunch of transactions are already very similar? For
> > instance, my monthly rent goes to the same payee for the same amount
> > on the same day of each month. These transactions are 'essentially the
> > same' for our purposes, and really should get collected together into
> > one "category." It would sure be nice to collect them together; that
> > cuts down on the number of categories, and means that rather than there
> > being a bunch of "nearly similar" categories, one for each transaction,
> > there's just one category.
> There is nothing that can be done with these transactions as they contain
> little or no information regarding what they are for. Unfortunately they
> must be sorted with what little information they provide.
Actually, since the point is to figure out the destination category, the fact
that the similar transactions /are/ in the same category means that they will
all strengthen the association.
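To illustrate the point, here is a rough sketch in which the "category" is the destination account rather than an individual transaction, so that repeated similar transactions pile up counts in one place. It is not the forthcoming patch and not GnuCash code: the AccountMatcher class, its methods, the account names and the payee strings are all hypothetical, and the add-one smoothing is an assumption.

```python
import math
from collections import Counter, defaultdict


class AccountMatcher:
    """Hypothetical matcher: one bag of word counts per destination account."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # account -> token -> count
        self.vocab = set()

    def train(self, description, account):
        """Record the words of an already-categorized transaction."""
        for token in description.lower().split():
            self.counts[account][token] += 1
            self.vocab.add(token)

    def scores(self, description):
        """Log-likelihood score per account, using add-one smoothing."""
        tokens = [t for t in description.lower().split() if t in self.vocab]
        result = {}
        for account, counter in self.counts.items():
            total = sum(counter.values())
            result[account] = sum(
                math.log((counter[t] + 1) / (total + len(self.vocab)))
                for t in tokens
            )
        return result

    def guess(self, description):
        """Return the best-scoring account, or None if nothing was trained."""
        s = self.scores(description)
        return max(s, key=s.get) if s else None


# Twelve identical monthly rent payments all feed the same account.
m = AccountMatcher()
for _ in range(12):
    m.train("Acme Property Management rent", "Expenses:Rent")
m.train("Acme Hardware store", "Expenses:Household")

print(m.guess("ACME PROPERTY MANAGEMENT"))
print(m.guess("Acme Hardware"))
```

Because the twelve rent payments share one account, each of them adds to the same counts, and the number of categories is the number of accounts rather than the number of transactions.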
> > But how do we provide a "user interface" that allows them to be so
> > grouped???
> > Solve #3 and the other two fall into insignificance.
> > Jumping on to #5 (there is NO #4!)...
> > 5. Of course, give me the ability to "memorize" a transaction and have
> > it repeat each month and I may not even /want/ to import such
> > transactions anymore...
> What does this have to do with importing transactions? ;-)
Someone was complaining today at the office about the fact that their bank
only supplied OFX files, and that they weren't sure what would load it. I
suggested that scheduled transactions would be pretty helpful in handling this
even if the GnuCash OFX support /didn't/ exist... There's certainly more than
one way to skin a cat...
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
Whatever you do don't mail me at pink-and-wobbly at asdkjlwelkj.com,
because then I'll know you're just an address-harvester, and blacklist
your IP until the end of time