[GNC-dev] Is the import match map still required?

Christian Gruber christian.gruber at posteo.de
Sat May 30 08:37:36 EDT 2020


David,

thanks for your detailed explanations. Implementing a procedure that can be
run as needed and that updates the frequency table from the current
transactions of an account seems to be a meaningful first step. It could
then be used to measure performance, and based on that we could decide
whether the procedure can also run on the fly.
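
To make that first step a bit more concrete, here is a rough sketch of what
such a rebuild procedure could look like. This is not GnuCash code: the
txn_t type and the tokenize() and rebuild_freq_table() helpers are invented
for illustration, and the real matcher stores its data with each account
rather than in a plain GLib hash table. The sketch only shows the idea of
discarding the stored counts and recounting token/account pairs from the
existing transactions of one account:

#include <glib.h>

typedef struct {
    const char *description;   /* imported or manually entered description */
    const char *xfer_account;  /* transfer account the user assigned */
} txn_t;

/* Split a description into whitespace-separated tokens. */
static gchar **tokenize(const char *description)
{
    return g_strsplit(description, " ", -1);
}

/* Rebuild the token -> (account -> count) table from scratch for the
 * existing transactions of one account, discarding any stale entries. */
static GHashTable *rebuild_freq_table(const txn_t *txns, gsize n_txns)
{
    GHashTable *freq = g_hash_table_new_full(
        g_str_hash, g_str_equal, g_free,
        (GDestroyNotify) g_hash_table_destroy);

    for (gsize i = 0; i < n_txns; i++) {
        gchar **tokens = tokenize(txns[i].description);
        for (gchar **t = tokens; *t != NULL; t++) {
            GHashTable *per_acct;
            gpointer count;

            if (**t == '\0')
                continue;
            per_acct = g_hash_table_lookup(freq, *t);
            if (per_acct == NULL) {
                per_acct = g_hash_table_new_full(g_str_hash, g_str_equal,
                                                 g_free, NULL);
                g_hash_table_insert(freq, g_strdup(*t), per_acct);
            }
            count = g_hash_table_lookup(per_acct, txns[i].xfer_account);
            g_hash_table_insert(per_acct, g_strdup(txns[i].xfer_account),
                                GINT_TO_POINTER(GPOINTER_TO_INT(count) + 1));
        }
        g_strfreev(tokens);
    }
    return freq;
}

int main(void)
{
    const txn_t txns[] = {
        { "GYM MEMBERSHIP 4711", "Expenses:Gym" },
        { "SUPERMARKET 0815",    "Expenses:Groceries" },
    };
    GHashTable *freq = rebuild_freq_table(txns, G_N_ELEMENTS(txns));

    g_print("%u distinct tokens\n", g_hash_table_size(freq));
    g_hash_table_destroy(freq);
    return 0;
}

Because the procedure always rebuilds from the transactions that actually
exist, running it for one account at a time should also give us a simple way
to time it and judge whether an on-the-fly variant is feasible.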

I have also thought more about how the user interacts with the frequency
table. The current situation seems a bit like "hacking" the frequency table
to achieve better matching results: you can remove entries that appear
wrong or that seem to corrupt the matching. If the user did not change the
frequency table directly but could instead set personal preferences on how
the data is used, those preferences would not be lost when the procedure
that updates the frequency table is run. Regularly rebuilding the frequency
table would then reliably remove wrong or outdated entries and keep the
data up to date. The user could, for example, exclude tokens that are not
relevant for him from the Bayesian algorithm; a sketch of that idea follows
below.
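
One way such a preference could work, purely as an illustration and not as
existing GnuCash API (the filter_tokens() helper and the exclusion set are
invented here), is to filter the token list against a user-maintained
exclusion set before it reaches the Bayesian matcher, so the stored
frequency table itself never has to be edited by hand:

#include <glib.h>

/* Drop tokens the user has marked as irrelevant (e.g. dates or reference
 * numbers) before they reach the Bayesian matcher, instead of letting the
 * user delete entries from the stored frequency table directly. */
static GList *filter_tokens(GList *tokens, GHashTable *excluded)
{
    GList *kept = NULL;
    for (GList *l = tokens; l != NULL; l = l->next) {
        if (!g_hash_table_contains(excluded, l->data))
            kept = g_list_prepend(kept, g_strdup(l->data));
    }
    return g_list_reverse(kept);
}

int main(void)
{
    /* user preference: tokens to ignore during matching */
    GHashTable *excluded = g_hash_table_new(g_str_hash, g_str_equal);
    g_hash_table_add(excluded, (gpointer) "2020-05-30");

    GList *tokens = NULL;
    tokens = g_list_append(tokens, (gpointer) "GYM");
    tokens = g_list_append(tokens, (gpointer) "2020-05-30");

    GList *kept = filter_tokens(tokens, excluded);
    for (GList *l = kept; l != NULL; l = l->next)
        g_print("%s\n", (const char *) l->data);

    g_list_free_full(kept, g_free);
    g_list_free(tokens);
    g_hash_table_destroy(excluded);
    return 0;
}

The same filter could be applied both when matching and when rebuilding the
table, so the exclusions survive every update of the frequency data.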

Christian


On 25.05.20 at 01:13, David Cousens wrote:
> Christian,
>
> I haven't experimented, so I don't know whether constructing the frequency
> table on the fly creates a performance bottleneck, but I am guessing the
> original developer thought it might. It would require a detailed look at
> the code involved, but my suspicion is that the performance penalty is
> likely to be significant.
>
> My comment about bloat is that at present data is only maintained for
> accounts you specifically import data into, and only if that data is
> stored; if it isn't, bloat obviously doesn't apply. Any sort of generalized
> procedure could allow selection of the accounts for which Bayesian matching
> is required, i.e. those into which importing is used to input data. My
> initial thought was that you would run it for all accounts, but it is
> really only necessary for the specific subset of accounts into which you
> import data. It would then require the ability to run the procedure on an
> account that occurs in the import data but doesn't yet have matching data.
> If it runs on the fly, that is no problem: it can run whenever an account
> being imported into appears in the imported data for the first time. The
> most common use case is probably importing data into one specific account,
> but GnuCash can also take the account being imported into from the import
> data itself. I haven't looked at how the frequency table is currently
> stored in memory, but I am guessing it is constructed in memory when the
> data file is read in.
>
> The up-to-date aspect is one advantage: if the current procedure is changed
> to improve performance, it is not hampered by the presence of historical
> data, which would be updated automatically when the procedure is run. If
> the table is stored as it is at present and a procedure were available to
> trawl the current transactions of an account, the table could be kept up
> to date by running that procedure periodically. Data from manually entered
> transactions would then be incorporated as well, whether on the fly or
> just run as required.
>
> Having a standalone procedure that trawls an existing file to update the
> stored data for an account would allow exploring whether running it on the
> fly is likely to be a significant performance hit, so that could perhaps
> be a first step. The core code to store the data must already exist in the
> matcher, and it would be a case of wrapping this in a loop through the
> existing transactions of an account and setting up the GUI to select the
> accounts to run it on.
>
> The problem with pruning the data is that GnuCash has no way of knowing
> a priori which tokens are most relevant. I would think that date
> information is not really relevant, and amount/value information does
> little in most cases to identify a transfer account.
>
> The main difficulty I have with transfer account assignment is that some
> regular transactions use a unique code in the description each time they
> occur, with no separate unique identifier of the transaction source. My
> wife and I both have separate gym membership subscriptions, and the
> transaction descriptions identify neither the gym nor which of us the
> transaction applies to. The options are to persuade the source to include
> specific data or to record both in a single account, but I like to track
> both our individual and joint expenses.
>
> Some regular transactions also get matched to previous payments by the
> transaction matching within the date range window, where the amounts and
> descriptions are usually identical. The current 42-day window captures
> both fortnightly and monthly regular income transactions, for example.
> This only affects a few transactions each month, and I don't have huge
> numbers of transactions to process now that I have retired, but that may
> not be the case for other users. Making the date range window adjustable
> rather than fixed might be a cure for this. Setting it at <14 days would
> cure the problems I have, for example, but that again would not work for
> everybody.
>
> I am currently committed to some work on the documentation front, so I am
> unlikely to consider this in the near future in other than general terms,
> but someone else may be willing to take it up.
>
> David
>
>
>
> -----
> David Cousens
> --
> Sent from: http://gnucash.1415818.n4.nabble.com/GnuCash-Dev-f1435356.html
> _______________________________________________
> gnucash-devel mailing list
> gnucash-devel at gnucash.org
> https://lists.gnucash.org/mailman/listinfo/gnucash-devel

