XML size (was: no subject)

Bob Willan bwillan@matrix-systems-inc.com
Wed, 3 Apr 2002 11:31:46 -0500


On Wednesday 03 April 2002 10:10 am, Bill Gribble wrote:
> On Wed, 2002-04-03 at 08:54, Derek Atkins wrote:
> > Note that this will not only fail to do what you want, but could
> > leave your data file unreadable and unusable.  This is _EXACTLY_ the
> > kind of thing that we DON'T want people to be doing!  If you want to
> > change your data you should use the application to do it.
>
> Extremely Strongly Disagree.
>
> I think it's a fundamental part of the Unix and free software
> philosophy that the data belongs to the user, not to the application.
> "It's none of your d**n business what I do with my data!"  If the user
> wants to pipe their data through perl or sed or whatnot that's their
> business.
>
> That's the main reason *I* wanted to go to the XML format to start
> with.  People *hate* applications that bottle their data up in opaque
> formats.  Databases get a special exemption because of the extremely
> delicate nature of the interrelationships between bits of data, but all
> real dbs have a way to dump text (SQL) that can be used to exactly
> restore the db.  Not just a text "export" (which is usually lossy) but
> a dump which exposes all of the data's guts.
>
> Sure, it's ill-advised to make precipitous changes to your XML data
> file, but it's also ill-advised to make precipitous changes to the
> kernel source code... does that mean it shouldn't be available for easy
> editing?

I usually just read these and move on, but these statements are a little
too much to pass up.

The core concern in all of data processing is the sanctity of the data.
The data must be correct and accessible.  Period.  All the rest is
decades of learning how to achieve this goal.  Transaction logging,
backups, audit trails, and the need for easy and correct access led to
SQL and relational databases, and I'm sure will lead further from there.

'Easy access' does not mean some user trying to read a big text file.
And easy access does NOT mean easy to modify/delete data from outside of
the applications that were created to do the job that requires the data
in the first place.  It means being able to read some portion of that
data, maybe process it in some ad hoc fashion, and then maybe save it to
another file or spreadsheet or whatever - NOT back into the database.
That is reserved for applications that have been specifically created for
that purpose and thoroughly TESTED and verified to be correct (minus the
hopefully few unfortunate bugs that always exist in anything people do).
Otherwise, your data may (and usually does, in my 20 years of data
processing experience) become less and less useful as it acquires errors
in consistency and/or outright invalid values because someone tried to
'fix' something they didn't really understand (or possibly did understand
but didn't test).  Yes, some people make changes that are correct (but
everyone makes mistakes sometime).  These same people could usually take
the time to write and thoroughly test a script/program to make the same
changes rather than making an ad hoc, quickie change in a text editor.

The data is not 'bottled up' because it's in a database or binary file
format - so long as the format is known (published).  The spirit of Unix
and Open Source is not 'data in text files'.  It is not data in ANY
particular format whatsoever.  The 'open' refers to the data (and
programs, of course) being accessible, in a known format, that anyone can
access by writing a script/program/whathaveyou.  The idea is still that
the data be correct and, so, useful.  And by the way, not all databases
have 'text' dumps for backups or whatever; some instead back up to a
binary format to save space.  If you want to dump the data, you can
always write a SQL routine (or use some other language) to output any/all
tables in text format, CSV, whatever you like.
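
As a rough illustration of that last point, here is a minimal sketch of
such a dump routine.  It assumes Python with the standard sqlite3 and csv
modules and an in-memory SQLite database; the "accounts" table and its
rows are hypothetical sample data, not anything taken from gnucash.

```python
import csv
import io
import sqlite3


def dump_tables_as_csv(conn):
    """Return {table_name: csv_text} for every table listed in the catalog."""
    dumps = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Names come from the database's own catalog, so simple quoting is
        # enough for this sketch.
        cursor = conn.execute('SELECT * FROM "{}"'.format(table))
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)  # one CSV line per database row
        dumps[table] = out.getvalue()
    return dumps


# Hypothetical sample data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [(1, "Checking"), (2, "Savings")]
)
dumps = dump_tables_as_csv(conn)
print(dumps["accounts"])
```

The dump only reads from the database; nothing in it writes back, which
is the kind of access the paragraph above has in mind.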

XML was created to go along with the new web/Internet world and be to
data definition what HTML is to programs: fast, easy to access, and
portable across the Internet.  But that's all it's really meant to be - a
way for disparate systems to exchange relatively small amounts of data
(for e-commerce between businesses, etc).  It's not meant to replace
databases, with their millions (and more) rows of data.  It doesn't
encompass transaction logging, etc, and the data takes up lots more space
than other formats (even with compression techniques, it's a factor of
10-100 or more, going up quickly as the number of columns/fields
increases, and with the quantity of those that are numeric - to say
nothing of problems with blob fields, which are binary by definition).
Not that we have this drastic a situation in gnucash with its relatively
limited number of tables and data.  XML is also more open to formatting
errors.

Which brings up the application (gnucash here) messing up the data.  Yes,
it can happen that a bug causes a problem with a single row or something
and it got by testing.  But I've never seen a case where any application
actually made all the data unreadable - if it was in a database.  The
problem you had in gnucash was almost certainly caused by the XML engine,
or gnucash's interface to it.  Can't say as I've heard of the Oracle
engine, or PostgreSQL engine, trashing their databases.  Nor is there the
case of a single comma or misplaced character in the 'file' causing the
'engine' (XML parser) to not be able to read any of the data.  It sounds
like you're arguing to keep the XML because you can manually fix the
problems that can only occur because it's in a flat XML file in the first
place - kind of circular logic.
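
The single-misplaced-character failure mentioned above can be reproduced
with a short sketch.  It assumes Python's standard xml.etree.ElementTree
parser and a made-up two-account document; it does not use gnucash's
actual file format or parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
GOOD = (
    "<accounts>"
    "<account>Checking</account>"
    "<account>Savings</account>"
    "</accounts>"
)
# The same document with one stray '<' typed into the first account name.
BAD = GOOD.replace("Checking", "Check<ing", 1)


def try_parse(text):
    """Return the account names, or None if the parser rejects the document."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # A strict XML parser gives up on the whole document, not just on
        # the damaged element.
        return None
    return [node.text for node in root.iter("account")]


good_result = try_parse(GOOD)
bad_result = try_parse(BAD)
print(good_result, bad_result)
```

With the stray character in place the parser returns nothing at all,
even though the second account entry is untouched.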

Bob