Performance improvement for xml loads (+comments)

07 Dec 2000 18:11:13 -0500

Rob Browning <rlb@cs.utexas.edu> writes:

> Derek Atkins <warlord@MIT.EDU> writes:
> 
> > Honestly, I think this is a red herring.  I'm not at all convinced
> > that if said tools did exist they would at all be useful.  Sure, you
> > have a tagged data tree, but you have to know what the tags mean in
> > order to do anything with them.
> 
> Well, for me it hasn't been a red herring.  I've already used
> perl/sgrep several times to check various things about my data file
> (number of accounts, number of transactions, count transactions
> containing foo and write total to a file, etc.).  Now granted, most
> people won't want/need to do this, but as a developer (and as a
> curious higher-level user), I've already found this quite valuable.

You're a developer.  You don't count.  A developer should be expected
to go to extra trouble to debug a program.  What we shouldn't do is
make end-users have to go through trouble (or, store lots of extra
data).

> Further, the binary format was completely opaque, and very hard to
> debug.  The XML one has been quite easy.  I was able to do a number of
> validity checks, and spot errors with obvious fixes just using diff.
> You can't say that of non-text formats.

With a good binary format, you can write a small program that prints
out the data-file in a text format for debugging.  Then you can 'diff'
that, if you like.  Binary isn't necessarily opaque (it may be opaque
to humans, but a human isn't the one supposed to read the data files).
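
As a rough illustration of such a dump tool, here is a minimal Python
sketch.  The record layout (a 32-bit account id, a 64-bit amount in
cents, and a 16-byte memo) and the names pack_records and dump_text
are made up for this example; this is not GnuCash's actual format.

```python
import struct

# Hypothetical fixed-size record, little-endian:
#   uint32 account id, int64 amount in cents, 16-byte NUL-padded memo.
# This layout is an assumption for illustration only.
RECORD = struct.Struct("<Iq16s")


def pack_records(records):
    """Serialize (account, cents, memo) tuples into one binary blob."""
    return b"".join(
        RECORD.pack(acct, cents, memo.encode("utf-8"))
        for acct, cents, memo in records
    )


def dump_text(blob):
    """Render a binary blob as one text line per record, suitable for diff."""
    lines = []
    for acct, cents, memo in RECORD.iter_unpack(blob):
        text = memo.rstrip(b"\x00").decode("utf-8")
        lines.append(f"account={acct} cents={cents} memo={text}")
    return "\n".join(lines)
```

Dumping two versions of a data file through dump_text and running
'diff' on the results gives the same line-by-line comparison a text
format would.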

> Further, say the file gets minor corruption for some reason.  With the
> text file, you can just open it up and fix it with emacs/vi/whatever.
> With a binary format, you're probably screwed unless you're *really*
> an expert, and have a lot more time.

Honestly, if the data file gets screwed up, you may never know it.
With a binary file, you KNOW that the file is screwed up.  With a
text file, the screwup may be somewhere where you'll never find it.
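
Knowing that a binary file is damaged generally depends on the format
carrying some integrity check.  A minimal sketch of one way to do
that, assuming a CRC-32 trailer appended to the payload (the seal and
verify names are hypothetical, and nothing here is an existing GnuCash
mechanism):

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 of the payload as a 4-byte trailer."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify(blob: bytes) -> bool:
    """Return True if the trailer matches the CRC-32 of the payload."""
    if len(blob) < 4:
        return False
    payload, trailer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack("<I", trailer)
    return zlib.crc32(payload) == stored
```

The same check could be attached to a text file, so this says more
about having a checksum than about binary versus text.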

> As I said, you and I may just have different perspectives here.  I've
> *already* found the text format useful.

Perhaps.  I'm trying to keep data storage and memory usage at a
minimum while still being extensible for future expansion.  I think
users would prefer that.  We developers just need better tools to help
us along.  Creating a format just because it's easier to debug is not
necessarily a good reason to do it that way.

> > Well, using MySQL or PostgreSQL is just one part of it.  It's a
> > storage mechanism, but you still need to create the data formats
> > that are stored.  You still need to define the transaction objects
> > or split objects or whatever that get stored in the database.  So,
> > defining a binary data format now would certainly be useful, IMHO,
> > down the road when we move to a DBMS.
> 
> But for the most part, this would just involve defining the SQL tables
> we need.  I don't see how that involves a "binary data format".  I
> must not understand what you mean.

Not everything in the GnuCash data is an SQL primitive data type.
That means we're still going to have to build data structures (or
tables).  And some data, most likely, might get stored as a BLOB,
because it doesn't necessarily make sense to store it any other way.
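
To make the BLOB point concrete, here is a small sketch that uses
Python's built-in sqlite3 module as a stand-in for MySQL or
PostgreSQL.  The table name, the column names, and the roundtrip_blob
helper are assumptions for illustration, not a proposed GnuCash
schema.

```python
import sqlite3


def roundtrip_blob(extra: bytes):
    """Store one row with primitive columns plus an opaque BLOB, then read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        # Primitive SQL types cover the simple fields; the BLOB column
        # holds data whose internal format the database knows nothing about.
        conn.execute(
            "CREATE TABLE txn ("
            " id INTEGER PRIMARY KEY,"
            " description TEXT,"
            " amount_cents INTEGER,"
            " extra BLOB)"
        )
        conn.execute(
            "INSERT INTO txn (description, amount_cents, extra) VALUES (?, ?, ?)",
            ("rent", -95000, extra),
        )
        return conn.execute(
            "SELECT description, amount_cents, extra FROM txn"
        ).fetchone()
    finally:
        conn.close()
```

The database hands the bytes back untouched, so the application still
has to define what those bytes mean.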

> Rob Browning <rlb@cs.utexas.edu> PGP=E80E0D04F521A094 532B97F5D64E3930

-derek
-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/      PP-ASEL      N1NWH
       warlord@MIT.EDU                        PGP key available