Import transactions change proposal

Sun Feb 9 13:23:29 CST 2003

Derek Atkins <warlord at MIT.EDU> writes:

> The problem is that data import is "lossy", you don't necessarily have
> all the import information in the GNC Transaction.  For example, you
> lose the QIF Category name, but you DEFINITELY want to be able to map
> from QIF Category to GNC Account.  

Well, my first reaction is that importing shouldn't be lossy. Lossy steps
close off options, such as being able to show the user where a transaction
came from and what information the bank actually presented. I could see that
being useful in a dispute with a bank, or when debugging problems after the
fact.
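
To make "not lossy" concrete, here is a rough sketch (Python only for
brevity). The ImportedTransaction type and from_qif_record helper are made-up
names for illustration, not existing GnuCash structures; the single-letter
keys are the usual QIF field codes (D date, T amount, P payee, L category).
The point is only that the raw record rides along with the transaction:

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Dict, Optional


@dataclass
class ImportedTransaction:
    """Hypothetical record: normalized fields plus the untouched import data."""
    date: str
    amount_cents: int
    description: str
    account: Optional[str] = None        # destination account, chosen later
    source_format: str = ""
    raw_fields: Dict[str, str] = field(default_factory=dict)


def from_qif_record(record: Dict[str, str]) -> ImportedTransaction:
    """Build a transaction from one QIF record without discarding anything."""
    amount = Decimal(record.get("T", "0").replace(",", ""))
    return ImportedTransaction(
        date=record.get("D", ""),
        amount_cents=int(amount * 100),
        description=record.get("P", ""),
        source_format="QIF",
        raw_fields=dict(record),         # keeps the category ("L") and the rest
    )
```

With the raw fields kept, a category-to-account map could be rebuilt at any
time from stored transactions rather than only at import time.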

> In order to just load txns and build the map at runtime you'd need to be
> able to store all this information. You'd also lose badly when you try to go
> across Accounting Periods.

That sounds like a case of denormalizing the underlying data representation in
order to implement a presentation-level feature. I would have expected
accounting periods to simply mark date boundaries or mark individual
transactions as unmodifiable. I wouldn't have expected to actually move the
transactions around and make accessing them require special actions.
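
A minimal sketch of the date-boundary idea, under my own assumptions (the
PeriodBook class and its method names are hypothetical, and this is not how
GnuCash implements periods). Closing a period only records a date; the
transactions stay where they are, and both "is this locked?" and "which
period is this in?" are computed from the boundaries:

```python
from bisect import bisect_left, insort
from datetime import date


class PeriodBook:
    """Hypothetical: accounting periods stored as a sorted list of end dates."""

    def __init__(self):
        self.closing_dates = []          # inclusive end date of each closed period

    def close_period(self, end: date) -> None:
        # Closing a period just records a boundary; no transaction is moved.
        insort(self.closing_dates, end)

    def is_locked(self, txn_date: date) -> bool:
        # Anything dated on or before the latest closing date is unmodifiable.
        return bool(self.closing_dates) and txn_date <= self.closing_dates[-1]

    def period_index(self, txn_date: date) -> int:
        # 0 is the earliest period; len(closing_dates) is the still-open one.
        return bisect_left(self.closing_dates, txn_date)
```

A matcher scanning history would then see one flat list of transactions and
would not need to care where the period boundaries fall.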

> Yea, I've read the email spam-matching schemes and you're absolutely
> right that our dest-account matching would be different..  Actually I
> think it would be VERY different.  With spam you only need a binary
> (or perhaps tri-state) answer to the question, "is this spam?".  The
> answers are yes, no, or maybe.

Well, not really. Spam-filtering is a degenerate case of a more general
algorithm. "ifilter" for example is a mail filtering program that sorts mail
into multiple folders automatically using the same algorithm.

The difference I was pointing out is that while those implementations are
optimized for having lots of fields and lots of data, GnuCash has a lot less
data to work with. As a result, those implementations use indexed databases,
but I'm thinking GnuCash will just iterate through a fixed number of
transactions of history and apply the matching heuristics to every entry.

I'm hoping it will be just as effective because the data is much more
structured. E-mail is free-form text, whereas the transaction information is
at least trying to identify itself.

> Choosing a destination account is much more tricky -- you've got
> potentially hundreds of choices to match into.  If you have ideas for
> a decent matching algorithm I'd love to hear it.  Code would be
> better, but we should work on designs before coding, IMHO.

I had a plan for a matching heuristic, but I think the Bayesian filter is a
better idea. Any hard-coded heuristic will work well for some people but fail
completely for others. A Bayesian filter should adapt to various systems with
different data formats much better.
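
For the sake of discussion, here is a rough sketch of what I mean, not a
finished design. It is a plain naive Bayes classifier with add-one smoothing
over the words in the transaction description. The function names, the
500-transaction window, and the tokenizer are all assumptions of mine. There
is no indexed database: it just walks the most recent transactions and counts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a description into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(history, window=500):
    """history: list of (description, account) pairs, oldest first.

    Only the last `window` entries are scanned; the window size is a guess.
    """
    account_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for description, account in history[-window:]:
        account_counts[account] += 1
        for tok in tokenize(description):
            token_counts[account][tok] += 1
            vocabulary.add(tok)
    return account_counts, token_counts, vocabulary


def rank_accounts(description, model):
    """Return all known accounts, most probable destination first."""
    account_counts, token_counts, vocabulary = model
    total = sum(account_counts.values())
    # Tokens never seen in the window carry no evidence, so skip them.
    tokens = [t for t in tokenize(description) if t in vocabulary]
    scores = {}
    for account, n in account_counts.items():
        log_p = math.log(n / total)                      # prior
        denom = sum(token_counts[account].values()) + len(vocabulary)
        for tok in tokens:                               # smoothed likelihoods
            log_p += math.log((token_counts[account][tok] + 1) / denom)
        scores[account] = log_p
    return sorted(scores, key=scores.get, reverse=True)
```

Usage would be something like rank_accounts("SAFEWAY STORE 4321",
train(history)), with the top few results offered to the user. Because it
returns a ranking over every account rather than a yes/no answer, the
"hundreds of choices" case is handled the same way as the two-choice one.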

-- 
greg