[GNC-dev] avoid the brain dead import

John Ralls jralls at ceridwen.us
Wed Aug 29 20:46:10 EDT 2018


No, the matcher doesn’t trawl through the books to construct matching data, though that’s an interesting idea. It works purely off of the matches you make during imports.

There’s a difference between the 2.6 and 3 match tables: The former uses the full account name (e.g. Expenses:Auto:Gas) and the latter uses the GUID like the description matcher always has. The first time you run the matcher in GnuCash 3 it updates the table, converting the account names to GUIDs. If you’ve got a large match table it can take a while to do the conversion.
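
Schematically the conversion is just a one-time rewrite of the table keys. A minimal sketch of the idea, with made-up structs and a stand-in GUID, not the actual code (the real matcher keeps its data in the account’s slots, not in an array like this):

    /* Sketch only: illustrates the 2.6 -> 3.x match-table key upgrade. */
    #include <stdio.h>
    #include <string.h>

    struct match_entry {
        char token[32];   /* a word seen in past import descriptions */
        char account[64]; /* 2.6: full account name; 3.x: GUID string */
        int  count;       /* how often the token matched that account */
    };

    /* Hypothetical lookup of an account's GUID from its full name. */
    static const char *guid_for_name(const char *full_name)
    {
        if (strcmp(full_name, "Expenses:Auto:Gas") == 0)
            return "b6cba76a0a9c11e9b210d663bd873d93"; /* stand-in GUID */
        return NULL; /* account gone: obsolete entry is left behind */
    }

    int main(void)
    {
        struct match_entry table[] = {
            { "shell",  "Expenses:Auto:Gas", 12 },
            { "texaco", "Expenses:Auto:Gas",  3 },
        };
        size_t n = sizeof table / sizeof table[0];

        /* The one-time upgrade pass: swap every key from name to GUID.
         * With a big table this walk is what you sit through on your
         * first GnuCash 3 import. */
        for (size_t i = 0; i < n; i++) {
            const char *guid = guid_for_name(table[i].account);
            if (!guid)
                continue;
            strncpy(table[i].account, guid, sizeof table[i].account - 1);
            table[i].account[sizeof table[i].account - 1] = '\0';
        }

        for (size_t i = 0; i < n; i++)
            printf("%-6s -> %s\n", table[i].token, table[i].account);
        return 0;
    }

The point of keying on GUIDs is that renaming or moving an account no longer orphans its match data.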

Note that each asset account has its own match table, so if you do imports for more than one account you’ll have to endure the conversion for each of them.

If you think you have a lot of obsolete match data that’s slowing down matches, you can tidy it up in GnuCash 3. Expect it to be a long and tedious process, though, unless you decide to nuke it all and retrain the matcher from scratch. After all, if there weren’t a *lot* of records it wouldn’t be slow.

None of which is to suggest that the matcher is a particularly efficient design; it’s not. There’s a lot of room for improvement if someone has an itch they want to scratch.

Regards,
John Ralls


> On Aug 29, 2018, at 3:52 PM, David Cousens <davidcousens at bigpond.com> wrote:
> 
> William,
> 
> I have experienced the importer trying to match data outside the range
> of dates in the current import. From memory, it only occurred when I
> first changed over to version 3.0. The matcher appeared to have lost
> all memory of what accounts to assign in the changeover from 2.6.
> However, I found that after importing 1-2 months of data it was
> functioning normally again. I have been using the OFX importer for
> 3-4 years without any significant problems.
> 
> Your point about large data files sounds valid. I haven't looked at
> the code for the match picker, so I don't know how it works, or
> whether it works on the historical data to extract the information it
> needs to choose an account to assign or data to match.
> 
> As it is a Bayesian mechanism, at some point it has to examine the
> existing data and construct some sort of probability table, so my
> guess is that this could be the step which is taking so long. Being
> able to set a preference for a date range or period to use in
> constructing the initial probability tables is probably a good idea
> if this is the case.
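>
> To make concrete what I mean by a probability table, a rough sketch
> (my own illustration with made-up names and numbers, not taken from
> the GnuCash code):
>
>     /* Counts of how often a description token was assigned to an
>      * account in past imports: the evidence a Bayesian matcher keeps. */
>     #include <stdio.h>
>     #include <string.h>
>
>     struct row { const char *token, *account; int count; };
>     static const struct row table[] = {
>         { "shell",  "Expenses:Auto:Gas", 12 },
>         { "shell",  "Expenses:Dining",    1 },
>         { "grocer", "Expenses:Food",     20 },
>     };
>     #define ROWS (sizeof table / sizeof table[0])
>
>     /* P(account | token) = count(token, account) / count(token) */
>     static double prob(const char *token, const char *account)
>     {
>         int hit = 0, total = 0;
>         for (size_t i = 0; i < ROWS; i++)
>             if (strcmp(table[i].token, token) == 0) {
>                 total += table[i].count;
>                 if (strcmp(table[i].account, account) == 0)
>                     hit += table[i].count;
>             }
>         return total ? (double)hit / total : 0.0;
>     }
>
>     int main(void)
>     {
>         printf("P(Auto:Gas | \"shell\") = %.2f\n",
>                prob("shell", "Expenses:Auto:Gas"));   /* 0.92 */
>         printf("P(Dining   | \"shell\") = %.2f\n",
>                prob("shell", "Expenses:Dining"));     /* 0.08 */
>         return 0;
>     }
>
> If a table like that has to be rebuilt from the whole book on every
> import, that would explain the delay; if it is only updated from each
> import, it would not.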
> 
> My experience on the changeover from 2.6 to 3.0, when it appeared to
> have lost any memory of previous import assignments, indicated that
> the importer was constructing those tables from the data it imports
> and not from the historical data, but I could be wrong. I would
> expect it to be using a Kalman filtering approach on the input data
> but can't be sure until I get a good look at the code. Initially it
> did attempt to match transactions that were otherwise similar to
> transactions from the previous month or two. I only have data going
> back ~8 years and have been retired for a large percentage of that,
> so my files aren't huge, and I may not be hitting your problem if it
> does look further back.
> 
> I think the decision about whether to enter a small number of
> transactions by hand is really one for the user and not the importer
> to make. I would import small batches, maybe 20-30, to test the
> importer function and ensure it was working as expected before
> attempting to import 10k.
> 
> 
> On Wed, 2018-08-29 at 22:00 +0100, Wm via gnucash-devel wrote:
>> On 25/08/2018 07:22, David Cousens wrote:
>> 
>> I thank David for his posting, which I have read; I don't address
>> all he said
>> 
>>> Keep trying. The brain dead importer does get less brain dead with
>>> repeated use.
>> 
>> I'm not sure it does get better as implemented, because two of the
>> bits of brain dead-ity are:
>> 1. the universe against which the importer is comparing imported tx
>> keeps growing, so as a strategy it is DOOMED to sluggishness and
>> eventually to not being used, unless there is some limit to the
>> universe (week / month / quarter / year / decade)
>> 
>> 2. unless there is something better, users are going to try to use
>> it, become more frustrated, and stop using it.
>> 
>> ====
>> 
>> It is fairly easy to think of ways of fixing 1., like "do you want
>> the importer to really, really, really compare the imported tx
>> against your stuff from the 1980s? y/N". At the moment this defaults
>> to Y without asking, and I don't think that makes sense.
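>>
>> In code the cutoff is trivial, something like this (names and
>> numbers made up by me, not from the gnucash source):
>>
>>     /* made-up sketch: time-limit the universe before any comparing */
>>     #include <stdio.h>
>>     #include <time.h>
>>
>>     struct tx { time_t posted; const char *desc; };
>>
>>     /* keep only candidates newer than the cutoff; everything from
>>      * the 1980s never even enters the comparison loop */
>>     static int filter(const struct tx *in, int n, time_t cutoff,
>>                       const struct tx **out)
>>     {
>>         int kept = 0;
>>         for (int i = 0; i < n; i++)
>>             if (in[i].posted >= cutoff)
>>                 out[kept++] = &in[i];
>>         return kept;
>>     }
>>
>>     int main(void)
>>     {
>>         time_t now = time(NULL);
>>         struct tx book[] = {
>>             { now - 90L * 24 * 3600,   "recent gas purchase" },
>>             { now - 9000L * 24 * 3600, "something from decades ago" },
>>         };
>>         const struct tx *candidates[2];
>>         /* e.g. only match against the last year of the book */
>>         int n = filter(book, 2, now - 365L * 24 * 3600, candidates);
>>         printf("%d of 2 tx left to compare against\n", n);
>>         return 0;
>>     }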
>> 
>> I mean, think of inflation: why would one of anything in 2018 be
>> sensibly matched against the same thing from 30 years ago?
>> 
>> There isn't even the opportunity to time-limit the universe, and
>> some folk have stuff going back much longer than me and have many
>> more tx than me.
>> 
>> Fixing 2. just involves some thought about the user, almost no
>> programming. An obvious question to put to the user would be, "you
>> are importing 3 tx, you have 10K tx in your file, this could take
>> fucking hours, do you want to continue or just type them in by hand?
>> if you want my advice, by hand is quicker"
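>>
>> and the scale check behind that question costs nothing, something
>> like (again, names and thresholds made up by me):
>>
>>     /* made-up sketch: warn when matching work dwarfs typing by hand */
>>     #include <stdio.h>
>>
>>     int main(void)
>>     {
>>         long incoming = 3, in_file = 10000;
>>         /* each incoming tx gets compared against every existing tx,
>>          * so the work grows with the product of the two counts */
>>         long comparisons = incoming * in_file;
>>         if (incoming < 10 && comparisons > 10000)
>>             printf("importing %ld tx against %ld already in your "
>>                    "file (%ld comparisons): typing them in by hand "
>>                    "may well be quicker. continue anyway? y/N\n",
>>                    incoming, in_file, comparisons);
>>         return 0;
>>     }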
>> 
>> See? The importer has no idea of scale. 3 tx incoming? I'll do them
>> by hand.