Performance improvement for xml loads (+comments)

Rob Browning rlb@cs.utexas.edu
08 Dec 2000 16:54:26 -0600


Derek Atkins <warlord@MIT.EDU> writes:

> Rob Browning <rlb@cs.utexas.edu> writes:
> 
> > The 50MB of RAM you're worried about is a *BUG*, and it needs to be
> > fixed.  Other than that, and some performance work that seems fairly
> 
> Is this a bug we can fix?  Or is it a bug in libXML?  I've not done
> any profiling to determine this.

It's most likely a bug in libxml, but I consider that a bug we can
fix.  We've been improving various libs we depend on for a while, and
that seems good both for us as a project and for the community as a
whole, though as you probably suspect, it's not *quite* as trivial as
sending a patch to dave :>

> Mostly because I haven't bought into the XML camp, and I don't see the
> value of jumping on board the latest technology craze until I firmly
> understand the exact requirements for which that particular technology
> is the only technology that can solve the problem.  Basically, I'm in
> the "reuse existing technology, and choose the best tool that solves
> the problem" camp, not the "choose the latest, hottest technology that
> maybe solves the problem if we squeeze a bit".  Clearly I'm in the
> minority here.  Perhaps once these implementation bugs are fixed, some
> people will "see the light"? ;)

You're not as much in the minority as it might sound.  Even though I
was the one to implement the XML stuff, XML's only a notch or so above
Java for me in the too-trendy-for-words category.

In fact, I initially wanted to use Scheme forms for our output
format, but for various reasons I was finally convinced that XML was
probably comparable in performance/aggravation, and it did have the
"people are becoming more used to it" factor.

Scheme write/read is far simpler than most of the alternatives :>

> Also, I'm looking at network protocols.  When designing a network
> protocol, I don't like assuming you have 100BaseT between your client
> and server.  That means you have to make protocols as compact as
> possible.  I also don't like to compress a network protocol, because
> that makes it REALLY hard to debug.  That's why I like binary formats:
> they are easy to parse (if you know the format, or at least have a tool
> that understands the format) and take up less space on the wire.
> Imagine sending XML over a 14.4 modem. :)

No argument here.  I totally understand that perspective.  I had to
worry about the same kinds of things when designing a semi-resilient
RPC-over-serial-line protocol that Bill and I needed to talk to one of
our robots back when we were in the lab.

However, I'm generally against binary formats unless there's a
situation like this one, where it's obvious that every byte is
critical, bandwidth/latency-wise.  In essentially all other cases, I
think text makes more sense.

An exception might be an extremely standard, reflexive
(self-describing) binary format with a rich enough set of types to
cover everything you need and handle all the platform dependence
issues gracefully, a format which also ships with a whole set of
tools for encoding/decoding/passthrough during debugging.

(However, I suspect that in many cases, the network and cross-process
 boundary latencies are going to dwarf any text->binary/binary->text
 conversion costs, so why not just use a text format which already
 satisfies all these properties...)

> Honestly, I don't believe that a normal user with a couple of years
> worth of data stored in GnuCash would have any prayer of being able
> to find, let alone fix, a data corruption problem in their data
> file, regardless of format.  Indeed, my empty set of accounts with
> no data at all was 15k in the old binary format and 100k in XML.
> And that's before I add any data to the system.

Well, finding problems was easier for me with the text format.  With
the right code in libxml and/or the IO code, we can report line
numbers, etc. for failures, and those are easy to act on with emacs,
grep, cut, etc.
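
For instance, something along these lines against libxml2's error
API (a sketch only; the function name and where it would hook into
our IO code are assumptions on my part):

    /* Sketch: surface the line number of an XML parse failure.
     * Assumes libxml2's xmlReadFile/xmlGetLastError API; the name
     * load_data_file and its place in the IO layer are hypothetical. */
    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/xmlerror.h>

    static xmlDocPtr
    load_data_file(const char *filename)
    {
        xmlDocPtr doc = xmlReadFile(filename, NULL, 0);
        if (doc == NULL) {
            const xmlError *err = xmlGetLastError();
            if (err != NULL)
                fprintf(stderr, "%s:%d: parse error: %s",
                        filename, err->line, err->message);
            else
                fprintf(stderr, "%s: parse error\n", filename);
        }
        return doc;
    }

A "file:line: message" report like that drops straight into an emacs
compilation buffer or a grep pipeline.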

> Sure, compressing the data would reduce the disk storage, but you
> cannot grep through a compressed file either.

zgrep :>

> Do all DBMS systems support the same set of primitive data types?

No, but I think there's a common SQL subset we could rely on, and I
suspect that subset is rich enough to build nearly everything else we
need -- e.g. we could implement gnc_numerics as two columns of large
integers, or something similar.
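
For example (a sketch: the struct below roughly mirrors the engine's
gnc_numeric, but the table and column names are made up):

    #include <glib.h>

    /* gnc_numeric is an exact rational: numerator over denominator. */
    typedef struct {
        gint64 num;    /* -> amount_num   column */
        gint64 denom;  /* -> amount_denom column */
    } gnc_numeric;

    /* Hypothetical schema fragment, sticking to a portable SQL subset:
     *
     *   CREATE TABLE splits (
     *       ...
     *       amount_num   NUMERIC(18,0) NOT NULL,
     *       amount_denom NUMERIC(18,0) NOT NULL
     *   );
     */

On load you rebuild the exact value as num/denom, so no
floating-point rounding ever touches the books.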

-- 
Rob Browning <rlb@cs.utexas.edu> PGP=E80E0D04F521A094 532B97F5D64E3930