[GNC-dev] Normalizing live data, a suggestion for discussion

Wm wm_o_o_o at yahoo.co.uk
Fri Feb 1 08:36:52 EST 2019


Situation: someone reports a problem with gnc, and at triage it becomes 
clear that some data is going to be required to identify or solve the 
problem.  The normal question: can you give us a file?

Problem: for any number of reasons ranging from plain old personal 
privacy through to people that live in supposed liberal societies 
avoiding tax and people in supposed conservative societies avoiding 
persecution, sending live data isn't always appropriate.  The USA has 
become very weird about this and most of our development people are in 
the USA so hopefully they'll understand the politics of privacy, eventually.

Suggestion: we try to make providing a file easier for people.

My suggestion is we ask people to save a *copy* of their data in SQLite 
and then run a script across that copy that munges and obfuscates:

1. account names [1]

2. numbers [2]

[1] people following this will probably be aware that gnc doesn't know 
about account names much beyond broad classes, in spite of providing lots 
of names and not accommodating other accounting concepts such as the 
fact there is a level one up [3].  My point here is that account names 
are important to people but not to gnc, so why not just randomize them? 
Obvious way?  Copy the actual account name (the guid) to the user visible 
one.  This is a one way change unless someone has unusual settings on 
their SQLite file; if someone has those settings it seems reasonable to 
presume they also know how to turn them off and save the file again.
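
A minimal sketch of the guid-to-name copy described in [1], in Python 
with the standard sqlite3 module.  It assumes the SQLite file has an 
`accounts` table with `guid` and `name` columns; check that against your 
own file before relying on it.  The function name is made up for 
illustration, and it should only ever be run against a *copy*.

```python
import sqlite3


def anonymize_account_names(conn: sqlite3.Connection) -> int:
    """Overwrite every user visible account name with the account's guid.

    Assumption: a table `accounts` with columns `guid` and `name`.
    Returns the number of rows changed.
    """
    cur = conn.execute("UPDATE accounts SET name = guid")
    conn.commit()
    return cur.rowcount
```

Usage would be something like 
`anonymize_account_names(sqlite3.connect("copy-of-my-books.gnucash"))`, 
where the file name is only a placeholder.  Other free-text columns 
(descriptions, memos and so on) are not touched by this sketch.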

[2] as long as the transaction stream balances the actual numbers don't 
matter (there will be occasions where the numbers are important but 
these tend to be number extremes related to commodities rather than 
anyone using gnc to do a Mr Putin vs Mr Trump sports bet).  In most 
cases multiplying any matching numbers by the same semi-random factor 
should produce a good file for examination so long as it is done 
consistently [4].
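
The multiplication in [2] could look roughly like this.  It is a sketch 
under assumptions: a `splits` table holding amounts as integer 
numerators in `value_num` and `quantity_num`, with the denominators in 
separate columns that are left alone.  The function name and the range 
of the factor are arbitrary choices for illustration.

```python
import random
import sqlite3


def scale_split_amounts(conn: sqlite3.Connection, factor=None) -> int:
    """Multiply every split numerator by one shared integer factor.

    Assumption: a table `splits` with integer columns `value_num` and
    `quantity_num`.  Using the same factor everywhere means a transaction
    that summed to zero before still sums to zero afterwards.
    Returns the factor that was used.
    """
    if factor is None:
        factor = random.randint(2, 99)  # "semi-random" is good enough here
    conn.execute(
        "UPDATE splits SET value_num = value_num * ?, "
        "quantity_num = quantity_num * ?",
        (factor, factor),
    )
    conn.commit()
    return factor
```

A single integer factor is the simplest thing that keeps the books 
balanced; it is obfuscation rather than strong anonymization.  Other 
tables that store amounts (prices, for example) would need the same 
treatment and are not covered here.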

[3] that is a long argument I am interested in conceptually rather than 
personally; it doesn't affect me as a UK person but it makes me think 
internationally.

[4] I don't think a reductive discussion of true vs near true random [5] 
is appropriate.  The significant point is that the person viewing the 
data won't be able to work out the original number without significant 
effort and in most cases simply won't be able to work it out at all: 
we're talking about computing assets I doubt anyone here has access to.  
*And* I believe the gnc people are actually motivated by solving 
problems, belief in the project and ordinary stuff like that, so they 
won't even be looking.

[5] Random is fun if only because there are so many ways of doing it.

Questions: why SQLite rather than XML?  Because if a person runs an 
agreed script across their file we can be sure of the outcome.  Editing 
an XML file informally is scary; it immediately raises questions about 
consistency of data.  Other SQL formats are not widely used, so my 
proposal is we go for the lowest common denominator (LCD) where we can 
achieve normalization.
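
One way an agreed script could back up "sure of the outcome" is to check, 
after munging, that every transaction still balances.  This is a sketch 
only; it assumes a `splits` table with `tx_guid`, `value_num` and 
`value_denom` columns, and the function name is hypothetical.

```python
import sqlite3
from fractions import Fraction


def unbalanced_transactions(conn: sqlite3.Connection) -> list:
    """Return the guids of transactions whose split values don't sum to zero.

    Assumption: a table `splits` with columns `tx_guid`, `value_num`
    and `value_denom` (value = value_num / value_denom).
    """
    totals = {}
    query = "SELECT tx_guid, value_num, value_denom FROM splits"
    for tx_guid, num, denom in conn.execute(query):
        # Exact rational arithmetic avoids floating point rounding noise.
        totals[tx_guid] = totals.get(tx_guid, Fraction(0)) + Fraction(num, denom)
    return sorted(tx for tx, total in totals.items() if total != 0)
```

An empty list before and after the obfuscation step suggests the munged 
file is at least as consistent as the original was.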

Normalization will have to be balanced: privacy vs contribution to the 
project.

I definitely want contributions from other people who work well with 
SQL.  Let's think about this together, people.  I have written some 
scripts that confuse *my* data and I know that Geert is still waiting 
for me to send him a file.

Geert is a good person, I just don't want to show him very personal 
stuff in my file.

I have a plan for making it easier to show a file.  Is anyone interested?

This is the *start* of a conversation, I welcome thoughts.