Performance improvement for xml loads (+comments)

Derek Atkins warlord@MIT.EDU
07 Dec 2000 12:36:23 -0500


Tyson Dowd <trd@cs.mu.OZ.AU> writes:

> On 06-Dec-2000, Derek Atkins <warlord@MIT.EDU> wrote:
> > Nobody is suggesting going back to the old binary format.  I'm
> > certainly not.  I *AM*, however, suggesting a NEW binary format.
> 
> Any new binary format will have to be at least as extensible as XML.
> After all, there's no point writing a nice tight binary format today,
> when tomorrow there will be another field that needs to be added.

Please define 'extensible'?  I can easily devise a binary format where
I can add new fields in the future.  The straightforward means of
doing this would imply that older file-parsers would not be able to
read newer file-formats, but I think this is a reasonable limitation.
This can be accomplished by simple object (or file-format) versioning.
It's also extremely easy to make sure you don't have byte-order or
word-size issues.  Indeed, this can easily be done with something like
XDR or ASN.1, where we just define our data structures and generate
parsing (and unparsing) routines to read and write data files/streams.
When our data structures change, we up the version number (and keep
the old structures around).  That way, when we see a file of version
#X, we know to use objects of version X.
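
A minimal sketch of the versioning idea, assuming a made-up file layout rather than real XDR or ASN.1 tooling. The magic string, the field choices, and the names `write_file` and `read_file` are all hypothetical. Fixed-width big-endian fields avoid byte-order and word-size issues, and the reader picks its record layout from the version number in the header:

```python
import struct

# Hypothetical layouts. ">" means big-endian with standard sizes and no
# padding, so the bytes are identical on every platform.
MAGIC = b"GNCB"                      # made-up file signature
HEADER = struct.Struct(">4sI")       # magic, format version
COUNT = struct.Struct(">I")          # number of records that follow
RECORDS = {
    1: struct.Struct(">Iq"),         # v1: uint32 id, int64 amount
    2: struct.Struct(">IqI"),        # v2: same, plus a uint32 date field
}

def write_file(version, rows):
    """Serialize rows using the record layout for the given version."""
    fmt = RECORDS[version]
    parts = [HEADER.pack(MAGIC, version), COUNT.pack(len(rows))]
    parts.extend(fmt.pack(*row) for row in rows)
    return b"".join(parts)

def read_file(data):
    """Read the header, then decode records with the matching layout."""
    magic, version = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a recognised file")
    if version not in RECORDS:
        # An older parser lands here when it meets a newer file.
        raise ValueError(f"unsupported format version {version}")
    fmt = RECORDS[version]
    (count,) = COUNT.unpack_from(data, HEADER.size)
    offset = HEADER.size + COUNT.size
    rows = []
    for _ in range(count):
        rows.append(fmt.unpack_from(data, offset))
        offset += fmt.size
    return version, rows
```

The old layout stays in the table when a new one is added, so a newer reader still loads version 1 files, while an older reader rejects a version it has never heard of.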

> If you are worried about load times and memory usage, we should consider
> using a SAX interface to read in the XML.  See this link for tradeoffs:
> http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html

Unfortunately the problem isn't just at read-time.  It seems that the
problem is also during file-writes.  And according to this web page,
the SAX interface won't affect file writes, only file reads.
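
For context, the read-side saving comes from handling parse events as they arrive instead of building the whole tree first. A minimal sketch using Python's standard xml.sax module; the element name `transaction` and the helper names are assumptions for illustration, not the actual file schema:

```python
import xml.sax

class TransactionCounter(xml.sax.ContentHandler):
    """Counts <transaction> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is kept in memory.
        if name == "transaction":
            self.count += 1

def count_transactions(xml_bytes):
    handler = TransactionCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.count
```

Nothing in this touches the write path, which is the point above.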

> Personally, I'm not convinced that performance of the XML routines is
> going to be a long term problem.  Besides, a lot of people feel more
> comfortable with XML (or compressed XML) than being "locked in" to a
> binary format (even if the source is available).  I'd much rather see
> improvements to the XML based system than a completely different system,
> because there's a lot of synergy to be gained by going with XML.

What synergy?  I was never enthused about XML (mostly because I don't
like ASCII file formats for large data objects or network protocols).
However, I was willing to let others take a gander at it (mostly
because I _DO_ think that XML input/output is necessary, especially
once we want OFX support).  The fact that storing 10000 transactions
requires 50M of RAM in order to build the XML tree is, IMHO,
unconscionable.

I think I'll actually try to write an XDR-based data storage system
and we'll see.  I just don't believe anymore that XML is a reasonable
way to store large data sets.  XML is a cool technology, but just
because a technology is cool doesn't mean that it's the right tool for
the job.

-derek
-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/      PP-ASEL      N1NWH
       warlord@MIT.EDU                        PGP key available