[GNC-dev] Normalizing live data, a suggestion for discussion

Fri Feb 1 14:17:39 EST 2019

On 2/1/19 5:36 AM, Wm via gnucash-devel wrote:
> Situation: someone reports a problem with gnc, at triage it becomes
> clear some data is going to be required to identify or solve the
> problem. Normal question?  Can you give us a file.
>
> Problem: for any number of reasons ranging from plain old personal
> privacy through to people that live in supposed liberal societies
> avoiding tax and people in supposed conservative societies avoiding
> persecution, sending live data isn't always appropriate.  The USA has
> become very weird about this and most of our development people are in
> the USA so hopefully they'll understand the politics of privacy,
> eventually.
>
> Suggestion: we try to make providing a file easier for people.
>
> My suggestion is we ask people to save a *copy* of their data in
> SQLite and they then run a script across that copy that munges and
> obfuscates
>
> 1. account names [1]
>
> 2. numbers [2]
>
> [1] people following this will probably be aware that gnc doesn't know
> about account names much beyond broad classes in spite of providing
> lots of names and not accommodating other accounting concepts such as
> the fact there is a level one up [3]  My point here is that account
> names are important to people but not gnc so why not just randomize
> them? Obvious way? copy the actual account name (the guid) to the user
> visible one.  this is a one way change unless someone has unusual
> settings on their SQLite file, if someone has those settings it seems
> reasonable to presume they also know how to turn them off and save the
> file again.
>
> [2] as long as the transaction stream balances the actual numbers
> don't matter (there will be occasions where the numbers are important
> but these tend to be number extremes related to commodities rather
> than anyone using gnc to do a Mr Putin vs Mr Trump sports bet).  In
> most cases multiplying any matching numbers by the same semi-random
> should produce a good file for examination so long as it is done
> consistently [4]
>
> [3] that is a long argument I am interested in conceptually rather
> than personally, it doesn't affect me as a UK person but makes me
> think Internationally.
>
> [4] I don't think a reductive discussion of true vs near true random
> [5] is appropriate, the significant point is the person viewing the
> data won't be able to work out the original number without significant
> effort and in most cases simply won't be able to work it out at all,
> we're talking computing assets I doubt anyone here has access to in
> order to get back *and* I believe the gnc people are actually
> motivated by solving problems, belief in the project and ordinary
> stuff like that so they won't even be looking.
>
> [5] Random is fun if only because there are so many ways of doing it.
>
> Questions: why SQLite rather than XML?  Because if a person runs an
> agreed script across their file we can be sure of an outcome.  Editing
> an XML file informally is scary, it immediately raises questions about
> consistency of data. Other SQL formats are not widely used, my
> proposal is we go for LCD where we can achieve normalization.
>
> Normalization will have to be balanced: privacy vs contribution to the
> project.
>
> I definitely want contribution from other people that work well with
> SQL, let's think about this together, people, I have written some
> scripts that confuse *my* data and I know that Geert is still waiting
> for me to send him a file.
>
> Geert is a good person, I just don't want to show him very personal
> stuff in my file.
>
> I have a plan for making showing a file easier, is anyone interested?
>
> This is the *start* of a conversation, I welcome thoughts. 

It might be better to have a standardized test file that folks could
download, and run their scenario against. 

However, there are situations that arise where the only solution is to
look at the original file.  In that case some obfuscation would be
helpful.  I would think that memos and descriptions would also need to
be randomized.  After a careful read, I realized you did intend to
randomize the transaction amounts (which would have to be done carefully
to ensure the DR/CR remained balanced).  Otherwise, one could at least
work out the total Assets/Liabilities/Income/Expense values for the
submitter.  That may be sensitive information.  I know that I've shared
some information that, on later reflection, made me think "did I really
give them that!"
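
To make the obfuscation idea concrete, here is a rough sketch of what
such a script might do against a SQLite *copy* of a file.  The table and
column names below are a simplified toy schema invented for this
example; a real GnuCash SQLite file has more tables and columns (for
instance quantities and denominators) that would need the same
treatment, so the names should be checked before relying on this.  Every
split of a transaction is scaled by the same integer factor, so the
DR/CR sum of each transaction stays at zero.

```python
import random
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database using an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (guid TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE transactions (guid TEXT PRIMARY KEY, description TEXT);
        CREATE TABLE splits (guid TEXT PRIMARY KEY, tx_guid TEXT,
                             account_guid TEXT, memo TEXT,
                             value_num INTEGER, value_denom INTEGER);
    """)
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("a1", "Secret Savings"), ("a2", "Lawyer Fees")])
    conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                     [("t1", "Retainer"), ("t2", "Refund")])
    conn.executemany("INSERT INTO splits VALUES (?, ?, ?, ?, ?, ?)", [
        ("s1", "t1", "a1", "private", -12500, 100),
        ("s2", "t1", "a2", "private", 12500, 100),
        ("s3", "t2", "a1", "", 4000, 100),
        ("s4", "t2", "a2", "", -4000, 100),
    ])
    conn.commit()
    return conn


def obfuscate(conn, seed=None):
    """Hide names, free text and amounts while keeping transactions balanced."""
    rng = random.Random(seed)
    cur = conn.cursor()
    # Copy the guid over the user-visible account name.
    cur.execute("UPDATE accounts SET name = guid")
    # Memos and descriptions can be as revealing as account names.
    cur.execute("UPDATE transactions SET description = 'tx ' || guid")
    cur.execute("UPDATE splits SET memo = ''")
    # One integer factor per transaction, applied to all of its splits,
    # so the splits of that transaction still sum to zero.
    tx_guids = [row[0] for row in
                cur.execute("SELECT guid FROM transactions").fetchall()]
    for tx_guid in tx_guids:
        factor = rng.randint(2, 97)
        cur.execute("UPDATE splits SET value_num = value_num * ? "
                    "WHERE tx_guid = ?", (factor, tx_guid))
    conn.commit()


def unbalanced(conn):
    """Return the guids of transactions whose splits no longer sum to zero."""
    return conn.execute("SELECT tx_guid FROM splits GROUP BY tx_guid "
                        "HAVING SUM(value_num) <> 0").fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    obfuscate(db)
    print("unbalanced transactions:", unbalanced(db))
```

Using a different factor per transaction hides the account totals as
well as the individual amounts, but it also changes running balances, so
it may not suit every bug report; a single factor for the whole file
would keep the proportions and hide less.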

Now, to the XML vs SQLite argument.  Whatever script is applied to one
could easily have a counterpart that would apply to the other.  You
wouldn't have to manually (informally) edit the XML.  A known script
should provide a known outcome.  I suspect that many folks are using an
XML back-end and would rather not fiddle with a database back-end.  I'm
in that camp even though I'm a trained Oracle DBA and spent a couple
decades using that back-end professionally.
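
As an illustration of the "counterpart script" point, the following is a
minimal sketch of the same name and free-text scrubbing done on XML with
Python's standard library.  The element names and namespace URIs are
assumptions modeled on what an uncompressed GnuCash XML file appears to
use, and the sample document is a made-up fragment, so both should be
verified against a real file.  Amounts are not handled here.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the header of a real file.
ACT = "http://www.gnucash.org/XML/act"
TRN = "http://www.gnucash.org/XML/trn"
SPLIT = "http://www.gnucash.org/XML/split"

# Made-up fragment, far smaller than a real data file.
SAMPLE = """<gnc-v2 xmlns:gnc="http://www.gnucash.org/XML/gnc"
    xmlns:act="http://www.gnucash.org/XML/act"
    xmlns:trn="http://www.gnucash.org/XML/trn"
    xmlns:split="http://www.gnucash.org/XML/split">
  <gnc:account version="2.0.0">
    <act:name>Secret Savings</act:name>
    <act:id type="guid">a1</act:id>
  </gnc:account>
  <gnc:transaction version="2.0.0">
    <trn:description>Lawyer fee</trn:description>
    <trn:splits>
      <trn:split><split:memo>private</split:memo></trn:split>
    </trn:splits>
  </gnc:transaction>
</gnc-v2>"""


def obfuscate_xml(root):
    """Replace account names with their guid and blank out free text."""
    for parent in root.iter():
        name = parent.find(f"{{{ACT}}}name")
        guid = parent.find(f"{{{ACT}}}id")
        if name is not None and guid is not None:
            name.text = guid.text
    for desc in root.iter(f"{{{TRN}}}description"):
        desc.text = "redacted"
    for memo in root.iter(f"{{{SPLIT}}}memo"):
        memo.text = ""
    return root


if __name__ == "__main__":
    scrubbed = obfuscate_xml(ET.fromstring(SAMPLE))
    print(ET.tostring(scrubbed, encoding="unicode"))
```

Because the script parses the document rather than editing it by hand,
running it twice on the same input gives the same output, which is the
"known script, known outcome" property wanted here.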

I think the first step is having a standard test file that a user could
apply to their favorite back-end, run their scenario, check the
results.  If the problem is verified, then we have pretty good evidence
the problem is in the application.  If the problem doesn't show up, then
it indicates the problem may be in the data.  That would require a "data
forensic expert" (aka developer or some assistant) to look deeper into
the user's data file.  In that case a good obfuscation tool would come
in handy.

--Steve

-- 
Stephen M Butler, PMP, PSM
Stephen.M.Butler51 at gmail.com
kg7je at arrl.net
253-350-0166
-------------------------------------------
GnuPG Fingerprint:  8A25 9726 D439 758D D846 E5D4 282A 5477 0385 81D8