Import transactions change proposal

Sun Feb 9 12:54:28 CST 2003

Greg Stark <gsstark at mit.edu> writes:

> Derek Atkins <warlord at MIT.EDU> writes:
> 
> > Chris and I were talking on #gnucash about potentially expanding the
> > match mapper to use some sort of Bayesian filtering to determine the
> > destination account mapping.  However I'm not sure how such a system
> > would work -- or where the necessary databases would get stored (or
> > even what the databases would look like).
> 
> That would be what I proposed a few weeks ago. I've been thinking about it
> further, and I'm not sure the database would have to be stored anywhere. When
> the import begins you scan the last 1,000 or so transactions on the account
> you're importing to, load them into an in-memory database, and use that.

The problem is that data import is "lossy": you don't necessarily have
all the import information in the GNC Transaction.  For example, you
lose the QIF Category name, but you DEFINITELY want to be able to map
from QIF Category to GNC Account.  In order to just load txns and
build the map at runtime you'd need to be able to store all this
information.  You'd also lose badly when you try to go across
Accounting Periods.
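
To make the storage point concrete, here is a rough Python sketch of a
category map kept separately from the transactions, so it survives the
lossy import. Everything in it is hypothetical: the CategoryMap class, its
method names, the account names, and the JSON storage format are assumptions
for illustration, not anything that exists in GnuCash.

```python
import json
from collections import Counter, defaultdict


class CategoryMap:
    """Hypothetical map from QIF category name to GNC account name."""

    def __init__(self):
        # category -> Counter of accounts the user chose for that category
        self.counts = defaultdict(Counter)

    def record(self, qif_category, account):
        """Remember that a transaction with this category went to account."""
        self.counts[qif_category][account] += 1

    def lookup(self, qif_category):
        """Return the most frequently chosen account, or None if unseen."""
        seen = self.counts.get(qif_category)
        if not seen:
            return None
        return seen.most_common(1)[0][0]

    def dumps(self):
        """Serialize the map so it can be stored between imports."""
        return json.dumps(self.counts, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Rebuild a map from the output of dumps()."""
        restored = cls()
        for category, accounts in json.loads(text).items():
            restored.counts[category].update(accounts)
        return restored
```

Because the map is stored on its own, it does not depend on the category
name surviving in the imported transaction, and it is not tied to any one
accounting period.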

> > Adding in other information to the Bayesian mix would certainly be
> > possible, once we come up with an architecture.  But you really don't
> > have a lot of information to work with when trying to choose a
> > destination account.
> 
> My thinking was to use the Levenshtein distance (same idea as agrep) for the
> text fields, the percentage difference between the amounts, the day of
> month, day of week, etc.

Sure -- this makes total sense, provided there is some way to build a
reasonable database to match against (see above).
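
As a rough illustration of the attributes Greg lists, the following Python
sketch combines Levenshtein distance on the description, the relative
difference in amount, and the day of month and day of week into one score.
The Txn record, the txn_distance function, the normalization, and the equal
weighting of the four terms are all assumptions made for the example; nobody
has proposed specific weights.

```python
from dataclasses import dataclass
from datetime import date


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


@dataclass
class Txn:
    """Hypothetical minimal view of a transaction for matching."""
    description: str
    amount: float
    when: date


def txn_distance(x, y):
    """Smaller means more alike. The terms are unweighted (an assumption)."""
    longest = max(len(x.description), len(y.description), 1)
    text = levenshtein(x.description.lower(), y.description.lower()) / longest
    biggest = max(abs(x.amount), abs(y.amount))
    amount = abs(x.amount - y.amount) / biggest if biggest else 0.0
    day_of_month = abs(x.when.day - y.when.day) / 30.0
    day_of_week = 0.0 if x.when.weekday() == y.when.weekday() else 1.0
    return text + amount + day_of_month + day_of_week
```

Scanning the stored transactions for the smallest txn_distance would give
the approximate match, but it still needs a database of past transactions
to scan.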

> The algorithm would be a bit different from e-mail spam matching though.
> Instead of pulling out hundreds of attributes from an e-mail message and using
> an index to find the weights quickly, gnucash would have only a half dozen or
> so attributes but would have to scan the database completely to find
> approximate matches.

Yeah, I've read the email spam-matching schemes and you're absolutely
right that our dest-account matching would be different.  Actually I
think it would be VERY different.  With spam you only need a binary
(or perhaps tri-state) answer to the question, "is this spam?".  The
answers are yes, no, or maybe.

Choosing a destination account is much trickier -- you've got
potentially hundreds of choices to match into.  If you have ideas for
a decent matching algorithm I'd love to hear them.  Code would be
better, but we should work on designs before coding, IMHO.
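
As one possible starting point for that design discussion, here is a Python
sketch of a multi-class naive Bayes guesser that ranks every known account
for a description, rather than answering a yes/no question. It is an
assumption-laden toy: the AccountGuesser class, the whitespace tokenizer,
the add-one smoothing, and the use of the description alone are my choices
for the example, not a worked-out proposal.

```python
import math
from collections import Counter, defaultdict


class AccountGuesser:
    """Hypothetical naive Bayes ranking of destination accounts."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # account -> token -> count
        self.txn_counts = Counter()               # account -> transactions
        self.vocab = set()                        # every token seen so far

    @staticmethod
    def tokens(description):
        # Assumption: lowercase whitespace-separated words are good enough.
        return description.lower().split()

    def train(self, description, account):
        """Learn from one transaction already assigned to an account."""
        self.txn_counts[account] += 1
        for tok in self.tokens(description):
            self.token_counts[account][tok] += 1
            self.vocab.add(tok)

    def rank(self, description):
        """Return all known accounts, most likely first."""
        total = sum(self.txn_counts.values())
        scores = []
        for account, n in self.txn_counts.items():
            counts = self.token_counts[account]
            # Add-one smoothing, with one extra slot for unseen tokens.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n / total)
            for tok in self.tokens(description):
                score += math.log((counts[tok] + 1) / denom)
            scores.append((score, account))
        return [account for _, account in sorted(scores, reverse=True)]
```

Returning a ranked list rather than a single answer fits the "hundreds of
choices" problem: the importer could offer the top few accounts and let the
user confirm.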

> greg

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available