[GNC] importing multiple CSVs to Gnucash

Geert Janssens geert.gnucash at kobaltwit.be
Sat Jun 6 04:57:48 EDT 2020


On Friday, 5 June 2020 12:57:44 CEST, Gio Bacareza wrote:
> Thank you Adrien, David and @flywire for your very helpful responses.
> 
> I actually created a script that would transform different file formats and
> data structures and, through a machine learning classifier, autopopulate the
> "Transfer Account" field depending on the description.
> 
> My script would scour through the different formats since banks generally
> are not standardized and transform these different files into a standard
> simple CSV with DATE, DESCRIPTION, ACCOUNT, DEPOSIT, WITHDRAWAL.
> 
> After transformation, I run it through my classifier and that will add the
> TRANSFER ACCOUNT field and autopopulate with the predicted transfer account
> that is an exact string match, e.g. if the transfer account is "XXXYYYZZZ" in the
> csv, there is an account in GNUCASH with the exact string equivalent
> "XXXYYYZZZ". I figured that if I explicitly include the specific Transfer
> Account per line in the csv, it should help GNUCASH match.
> 
> This could save me a lot of time since I don't have to transform each bank
> account statement or log individually. I have lots of accounts. I can
> imagine I'm not alone in this case.
> 
> I understand from your comments that GNUCASH follows a Naive Bayes
> algorithm to predict the TRANSFER ACCOUNT given the description and you
> have to feed it transactions little by little for it to learn. But as I have
> said, I have already pre-processed that data so GNUCASH does not have to do
> that. I explicitly feed it with the exact transfer account.
> 

Not necessarily. The process is slightly more complicated.

In the absence of transfer account data, GnuCash can only fall back to a bayesian matching
algorithm based on the other data available (description, date, amounts, ...).

However, if there is transfer account data in the csv, GnuCash will use it and generate
transactions based on it. It will still use bayesian matching to check whether a similar transaction
(to that exact transfer account) already exists, but that check is much more accurate when the
transfer account is known.
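
To illustrate the difference, a toy sketch in Python (not the actual importer code; the column
names are just taken from your standardized csv, and the scoring is a crude token count rather
than the real bayesian implementation) could look like this:

from collections import defaultdict

class TokenAccountMatcher:
    # Toy token-frequency matcher, only loosely in the spirit of the importer's
    # bayesian matching; the real implementation differs in many details.
    def __init__(self):
        # token -> account -> count
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, description, account):
        for token in description.lower().split():
            self.counts[token][account] += 1

    def guess(self, description):
        scores = defaultdict(int)
        for token in description.lower().split():
            for account, n in self.counts[token].items():
                scores[account] += n
        return max(scores, key=scores.get) if scores else None

def pick_transfer_account(row, matcher):
    # An explicit TRANSFER ACCOUNT value is used directly; the token matcher
    # is only a fallback when that column is empty.
    if row.get("TRANSFER ACCOUNT"):
        return row["TRANSFER ACCOUNT"]
    return matcher.guess(row["DESCRIPTION"])

matcher = TokenAccountMatcher()
matcher.train("ACME PAYROLL JUNE", "Income:Salary")
print(pick_transfer_account(
    {"DESCRIPTION": "ACME PAYROLL JULY", "TRANSFER ACCOUNT": ""}, matcher))
# prints Income:Salary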

The caveat is that initially GnuCash won't know how to map transfer account data to actual
accounts in GnuCash. This is also something you have to teach GnuCash. However, this is not
done via bayesian training. Instead it is a direct mapping between a string in your csv data and
an account in GnuCash. This mapping currently has to be created manually, in a step of the csv
importer you have to go through. That unfortunately makes initial imports cumbersome, because
they require a lot of this mapping. As the mapped data grows over time, you will need to add a
manual account mapping much less often, so imports get faster the more you do them.
Contrary to the bayesian training (where small batches of training material are better), the
import file size doesn't matter at all for this account map creation. It is more important that your
preprocessing script is very consistent in generating transfer account names, because then each
transfer account has to be taught only once ever, be it on a first huge import or on any future
import that uses the account for the first time.
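
Again purely as a sketch of the idea (the file name and json format here are made up for
illustration, not how GnuCash stores its account map internally), such a one-time
string-to-account map boils down to something like:

import json
from pathlib import Path

MAP_FILE = Path("transfer_account_map.json")  # hypothetical storage, illustration only

def load_map():
    return json.loads(MAP_FILE.read_text()) if MAP_FILE.exists() else {}

def save_map(mapping):
    MAP_FILE.write_text(json.dumps(mapping, indent=2))

def resolve_account(csv_name, mapping):
    # A known csv name resolves instantly; an unknown one needs a single manual
    # answer, which is then remembered for every later import.
    if csv_name not in mapping:
        mapping[csv_name] = input("GnuCash account for '%s': " % csv_name)
    return mapping[csv_name]

mapping = load_map()
account = resolve_account("XXXYYYZZZ", mapping)
save_map(mapping)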

I will add that I believe there is room for improvement here. It would certainly help if GnuCash
could be made smarter at recognizing exact matches between transfer account names or codes
in the csv data and actual GnuCash accounts. While GnuCash can never automatically assume
that such a match is the proper one, it could suggest it as an account match and leave it to the
user to confirm. That could save a lot of clicks. But someone would have to implement this.
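
In principle such a suggestion could be as simple as the following (hypothetical sketch, this is
not an existing GnuCash feature):

def suggest_mapping(csv_name, gnucash_accounts):
    # gnucash_accounts: full account name -> account code
    for full_name, code in gnucash_accounts.items():
        if csv_name == full_name or csv_name == code:
            return full_name  # offered as a default only; the user still confirms
    return None

accounts = {"Assets:Checking": "1000", "Expenses:Groceries": "4100"}
print(suggest_mapping("Expenses:Groceries", accounts))  # -> Expenses:Groceries
print(suggest_mapping("9999", accounts))                # -> None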

Regards,

Geert

