[GNC-dev] Normalizing live data, a suggestion for discussion

Mon Feb 4 05:38:37 EST 2019

On Saturday, 2 February 2019 22:36:18 CET, Wm via gnucash-devel wrote:
> On 02/02/2019 15:24, Geert Janssens wrote:
> > Yes, if you use business features, you may have entered business
> > identifying data in File->Properties. I think that's what David is
> > referring to.
> I agree, the third party should not be identified.
> 
> > Similarly there may be customer and vendor data (names addresses) in the
> > book that should equally be obfuscated. Just random data is fine.
> 
> Yes.
> 
> Geert, at the moment I am putting guid in place of random, do you think
> that is a wrong way to approach this?
> 
I think GUIDs are probably fine as well.

Note I'm going by the theoretical goal of not being able to reconstruct the 
user's real financial data from the obfuscated file. Personally I'm not 
interested in doing that at all, but people's paranoia levels may vary.

So, talking of GUIDs: if I remember correctly, the default GUIDs for accounts
coming from GnuCash account templates are hard-coded (or at least they used to
be until somewhere in the 2.6 series).

So if that is still true, then using the GUID in place of the account name is
only fake obfuscation: a hard-coded GUID can be looked up in the templates to
recover the original account name. Perhaps these GUIDs should be replaced
throughout the book during the obfuscation, before replacing account names
with GUIDs.
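
To make that concrete, here is a minimal sketch of a consistent GUID
replacement pass. It assumes the book is available as plain text (for
example an uncompressed XML file) and that GUIDs appear as 32 lowercase hex
digits. The function name remap_guids is hypothetical and not part of
GnuCash.

```python
import re
import uuid

# Assumption: a GUID is written as exactly 32 lowercase hex digits.
GUID_RE = re.compile(r"\b[0-9a-f]{32}\b")


def remap_guids(text, mapping=None):
    """Replace every GUID in text with a fresh random one.

    The same old GUID always maps to the same new GUID, so references
    between objects (account <-> split, parent <-> child) stay intact.
    """
    mapping = {} if mapping is None else mapping

    def swap(match):
        old = match.group(0)
        if old not in mapping:
            mapping[old] = uuid.uuid4().hex
        return mapping[old]

    return GUID_RE.sub(swap, text), mapping
```

Running this before account names are replaced by GUIDs would mean the names
end up as the new, random GUIDs rather than the template ones.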

> Actually, the nearer we get to complete random the less useful the file
> becomes.  Actual random data is harder than most people think and pretty
> much defeats the purpose if you think about it.
> 
From a human's point of view a GUID is just random numbers. So I don't see how
that makes a difference. If the same random value is used wherever the data was
the same in the original book, it's just like using a GUID. And I'm not talking
about numbers for this part, I'm talking about customer names, vendor
addresses, that kind of stuff.
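
As a rough illustration of "the same random value wherever the data was the
same", a lookup table is enough. The class name Pseudonymizer is made up for
this example.

```python
import secrets


class Pseudonymizer:
    """Hand out one random token per distinct original value."""

    def __init__(self):
        self._seen = {}

    def replace(self, value):
        # First time we see a value: invent a random token for it.
        # Later occurrences of the same value get the same token back.
        if value not in self._seen:
            self._seen[value] = secrets.token_hex(8)
        return self._seen[value]
```

A customer name that appears on ten invoices then shows up as the same
meaningless token on all ten, which keeps the file useful for debugging.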

> > Continuing on that vein, if you have bills and invoices, aside from
> > randomizing the transaction's split amounts and values you'll also have to
> > do the same for invoice entries.
> 
> I don't think that is true in most situations and even if what you say
> is true, I don't see it as a good argument against *attempting* a
> normalized book for most people.
> 
It's true if the bug to investigate is somewhere in the business code. In that 
case what your invoice data says should match what the resulting transactions 
say. Those are stored in different parts in the book, but are interrelated.

But even if the bug is not in business data, the business data should be
properly anonymized or removed anyway, so that the user can confidently share
the file without risking that real financial or private info can be extracted
from it. Of course in that context the business data no longer has to be
consistent, though I still believe it makes debugging harder if it isn't.

> > And to make the book useful for detecting
> > business data bugs this should happen in such a way that invoice tax and
> > discount amounts remain consistent after multiplying with random numbers
> > *and* that the invoice totals continue to match the business transactions
> > amounts in AR/AP accounts.
> 
> There will be situations that involve the person doing the triage
> needing to see actual transactions, I have already commented on that.
> 
Sure. However that's not what I'm implying here. The extra business 
requirements are an extension of your initial concept that transactions should 
continue to balance. From a business data point of view invoices with their 
entries should continue to balance with their invoice transactions or the data 
quickly becomes meaningless.
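
A small sketch of what "continue to balance" means here, under simplifying
assumptions: each entry total is quantity * price * (1 - discount) * (1 + tax),
rounding to the currency's smallest unit is ignored, and the dict layout and
function names are invented for the example rather than taken from GnuCash.

```python
from fractions import Fraction


def entry_total(entry):
    """Total of one invoice entry after discount and tax (no rounding)."""
    net = entry["quantity"] * entry["price"] * (1 - entry["discount"])
    return net * (1 + entry["tax"])


def scale_invoice(entries, ar_amount, factor):
    """Scale entry prices and the posted A/R amount by the same factor.

    Discount and tax percentages are left alone, so every entry total,
    and therefore the invoice total, scales by exactly that factor.
    """
    scaled = [dict(e, price=e["price"] * factor) for e in entries]
    return scaled, ar_amount * factor


entries = [
    {"quantity": Fraction(2), "price": Fraction(50),
     "discount": Fraction(1, 10), "tax": Fraction(21, 100)},
    {"quantity": Fraction(1), "price": Fraction(20),
     "discount": Fraction(0), "tax": Fraction(0)},
]
ar_amount = sum(entry_total(e) for e in entries)
scaled_entries, scaled_ar = scale_invoice(entries, ar_amount, Fraction(7, 3))
```

If the transaction were multiplied by one random number and the entries by
another, the invoice total and the A/R posting would no longer agree.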

> > And to make that one level more complicated, after that the payment
> > transactions *also* have to continue to match the new randomized invoice
> > amount (if the invoice was paid in full).
> 
> Ummmm, I don't think that is true.  If the munged numbers match (and
> they will, that is what the script will do) the transaction stream will
> be OK.
> 
> It is possible I have missed your point, Geert, but I think it is
> looking like I understand the contents of the gnc files better than you :(
> 
You did miss the point. You only think of balancing transactions. I'm also 
thinking of balancing lots, a more hidden aspect of the business data that's 
crucial to debug payment issues. My next reservation was also about consistent 
lots.
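
To illustrate the lot problem, here is a sketch with made-up data. It assumes
a lot is simply a group of A/R splits that counts as fully paid when they sum
to zero, and that each invoice lot gets its own random factor. The field and
function names are hypothetical.

```python
from fractions import Fraction


def scale_splits(splits, factors):
    """Scale each split by the factor of the lot it belongs to."""
    return [dict(s, amount=s["amount"] * factors[s["lot"]]) for s in splits]


def totals_by(splits, key):
    """Sum split amounts grouped by the given field ('lot' or 'txn')."""
    totals = {}
    for s in splits:
        totals[s[key]] = totals.get(s[key], 0) + s["amount"]
    return totals


# Two invoices, each in its own lot, both paid in full by one payment.
splits = [
    {"lot": "A", "txn": "invoice-1", "amount": Fraction(100)},
    {"lot": "B", "txn": "invoice-2", "amount": Fraction(200)},
    {"lot": "A", "txn": "payment", "amount": Fraction(-100)},
    {"lot": "B", "txn": "payment", "amount": Fraction(-200)},
]
factors = {"A": Fraction(3, 2), "B": Fraction(1, 4)}

scaled = scale_splits(splits, factors)
lot_balances = totals_by(scaled, "lot")   # both lots still sum to zero
txn_totals = totals_by(scaled, "txn")     # payment A/R side is now -200
```

Both lots stay fully paid only because the payment splits were scaled with
the factor of their lot. The A/R side of the payment changes from -300 to
-200, so the bank split of that payment has to be recomputed as well, or the
payment transaction itself no longer balances.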

> > It doesn't end there, payments can be split over multiple invoices, so
> > again when one randomizes invoice amounts care must be taken to adjust
> > the payments in proportion to the invoice amount change or fully paid
> > invoices suddenly can become partially paid or overpaid.
> 
> Not true.
> 
> Geert, I don't want to say this but I believe you are actually wrong,
> for once.

It would be more useful to explain why you think that.
> 
> > While this is probably all possible I believe the resulting script will be
> > so complex that it will become a source of bugs in itself which would
> > divert developer time to debugging and maintaining this script rather
> > than working on the effectively reported bug for which a sample data file
> > was asked in the first place...
> 
> Hmmmm, I accept your point and disagree.
> 
I agree that may have been overly pessimistic :)

> > Up until a book with only transactions, no business data at all it sounded
> > like a useful tool.
> 
> Be a brave man, Geert, most people don't use the business functions :)
> 
Right. For those who do, that data still needs to be anonymized, or you should
explicitly state somewhere that it isn't, to avoid misunderstandings.

> > Oh and we haven't mentioned SXs and budgets yet...
> 
> Unless they are material to the file being investigated I suggest we
> just delete all SXs and budget stuff.
> 
Reasonable.

> As far as I am concerned this conversation is ongoing, if only because
> Geert says he still needs a file from me to replicate a basic problem
> that I don't think needs any data from me at all.

It has been a while, so you may want to refresh my memory... Which bug
triggered this again?

Geert