Performance improvement for xml loads (+comments)

Derek Atkins warlord@MIT.EDU
08 Dec 2000 11:45:49 -0500


Rob Browning <rlb@cs.utexas.edu> writes:

> The 50MB of RAM you're worried about is a *BUG*, and it needs to be
> fixed.  Other than that, and some performance work that seems fairly

Is this a bug we can fix?  Or is it a bug in libXML?  I've not done
any profiling to determine this.

> straightforward, I still have a hard time seeing why you're pushing so
> hard for a whole new format.

Mostly because I haven't bought into the XML camp, and I don't see
the value of jumping on board the latest technology craze until I
firmly understand the exact requirements that only that particular
technology can satisfy.  Basically, I'm in the "reuse existing
technology, and choose the best tool for the problem" camp, not the
"choose the latest, hottest technology that maybe solves the problem
if we squeeze a bit" camp.  Clearly I'm in the minority here.
Perhaps once these implementation bugs are fixed, some people will
"see the light"? ;)

Also, I'm looking at network protocols.  When designing a network
protocol, I don't like to assume there's 100BaseT between the client
and the server.  That means you have to make the protocol as compact
as possible.  I also don't like compressing a network protocol,
because that makes it REALLY hard to debug.  That's why I like
binary formats: they are easy to parse (if you know the format, or
at least have a tool that understands the format), and they take up
less space on the wire.  Imagine sending XML over a 14.4 modem. :)
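
To make the size argument concrete, here's a sketch (the record
layout and field names are made up for the example, not our actual
format) comparing a packed binary transaction against the equivalent
XML:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-layout wire record for one transaction. */
struct txn_wire {
    uint32_t account_id;    /* numeric account handle          */
    uint32_t date;          /* days since the epoch            */
    int64_t  amount_num;    /* fixed-point amount, numerator   */
    uint32_t amount_denom;  /* fixed-point amount, denominator */
    uint32_t desc_len;      /* bytes of description following  */
};

int main(void)
{
    const char *desc = "Grocery";
    struct txn_wire t = { 42, 11299, -15099, 100, 7 };

    /* Binary: 24-byte header + 7-byte description on typical
     * platforms. */
    printf("binary: %lu bytes\n",
           (unsigned long)(sizeof t + strlen(desc)));

    /* The same data in XML comes out several times larger: */
    const char *xml =
        "<transaction>\n"
        "  <account>42</account>\n"
        "  <date>2000-12-08</date>\n"
        "  <amount num=\"-15099\" denom=\"100\"/>\n"
        "  <description>Grocery</description>\n"
        "</transaction>\n";
    printf("xml:    %lu bytes\n", (unsigned long)strlen(xml));
    return 0;
}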

Even when we move to a database, I think we'll still need to write
our own network protocol for a distributed system (assuming we move
to a multiple-client, single-datastore model).  That's because I
know of no database network protocol that actually implements data
encryption, data integrity, and strong user authentication.  Even
CORBA's network protocol is rather limited in terms of security.  I
would love to be proven wrong on this, as it would help me immensely
in various areas.
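
To illustrate what we'd be signing up for, here's a rough sketch of
tunneling a custom protocol through SSL via OpenSSL, which buys
encryption and integrity.  The endpoint and the tcp_connect() helper
are made up for the example, and strong user authentication would
still be extra work on top:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <openssl/ssl.h>
#include <openssl/err.h>

/* Made-up helper: open a plain TCP connection. */
static int tcp_connect(const char *host, const char *port)
{
    struct addrinfo hints, *res;
    int fd;

    memset(&hints, 0, sizeof hints);
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

int main(void)
{
    SSL_CTX *ctx;
    SSL *ssl;
    int fd;

    SSL_library_init();
    SSL_load_error_strings();

    ctx = SSL_CTX_new(SSLv23_client_method());
    fd = tcp_connect("localhost", "4443");    /* made-up endpoint */
    if (!ctx || fd < 0)
        return 1;

    /* Wrap the socket; everything after SSL_connect() is encrypted
     * and integrity-protected, whatever the inner protocol is. */
    ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);
    if (SSL_connect(ssl) != 1) {
        ERR_print_errors_fp(stderr);
        return 1;
    }

    SSL_write(ssl, "HELLO\n", 6);   /* our protocol rides inside */

    SSL_shutdown(ssl);
    SSL_free(ssl);
    SSL_CTX_free(ctx);
    close(fd);
    return 0;
}

And that's just the transport; certificate handling and user
authentication are still unsolved in this sketch.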

So, with that in mind, I was hoping that if the 'local-file' format
happened to be exactly the same as the 'network-transmission'
format, we could reuse a lot of code :)
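
For instance (the record type and function name here are invented
for illustration), if the serializer only ever talks to a file
descriptor, the exact same routine writes the local data file and
the network socket:

#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>   /* htonl() */

struct txn {
    uint32_t account_id;
    uint32_t date;
};

/* One serializer for both: 'fd' can come from open() on the local
 * data file or from socket()/connect() on the wire. */
int txn_serialize(int fd, const struct txn *t)
{
    uint32_t buf[2];

    buf[0] = htonl(t->account_id);  /* network byte order either way */
    buf[1] = htonl(t->date);
    return write(fd, buf, sizeof buf) == (ssize_t)sizeof buf ? 0 : -1;
}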

> > With a good binary format, you can write a small program that prints
> > out the data-file in a text format for debugging.
> 
> As long as it recovers gracefully from corruption, I'm fine with
> that.  It's just been my experience that writing binary->text
> converters that handle corruption gracefully is a very difficult
> business.  Humans are much better at that, and text lets them act.
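
(For what it's worth, the usual trick for making such a converter
survive damage is to tag every record with a magic word, so the
dumper can skip garbage and pick up again at the next intact record.
A rough sketch, with a made-up magic value and no real record
format:)

#include <stdint.h>
#include <stdio.h>

#define REC_MAGIC 0x47435452UL   /* "GCTR", made up for the example */

int main(int argc, char **argv)
{
    FILE *f;
    uint32_t window = 0;
    int c;

    if (argc < 2 || !(f = fopen(argv[1], "rb")))
        return 1;

    /* Slide a 4-byte window across the file; each match is (very
     * probably) a record boundary where dumping can resume. */
    while ((c = getc(f)) != EOF) {
        window = (window << 8) | (uint32_t)c;
        if (window == REC_MAGIC)
            printf("record at offset %ld\n", ftell(f) - 4);
    }
    fclose(f);
    return 0;
}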

Honestly, I don't believe that a normal user with a couple of years'
worth of data stored in GnuCash would have any prayer of finding,
let alone fixing, a data corruption problem in their data file,
regardless of format.  Indeed, my empty set of accounts, before I
added a single transaction, was already 15k in the old binary format
and 100k in XML.  Sure, compressing the data would reduce the disk
storage, but then you cannot grep through the compressed file
either.

At least with a binary file you can load it into emacs and search
for a string.  I read my Palm databases that way all the time. :)
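
You don't even need emacs; a dozen lines of C, doing essentially
what strings(1) does, will pull the readable text out of any binary
file without knowing the format at all:

#include <ctype.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *f;
    char buf[256];
    int c, n = 0;

    if (argc < 2 || !(f = fopen(argv[1], "rb")))
        return 1;

    while ((c = getc(f)) != EOF) {
        if (isprint(c)) {
            buf[n++] = (char)c;
            if (n == (int)sizeof buf - 1) {  /* flush a full buffer */
                buf[n] = '\0';
                puts(buf);
                n = 0;
            }
        } else {
            if (n >= 4) {            /* strings(1)'s default minimum */
                buf[n] = '\0';
                puts(buf);
            }
            n = 0;
        }
    }
    fclose(f);
    return 0;
}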

> > Not everything in the GnuCash data is an SQL primitive data type.
> 
> I'm having a hard time thinking of anything.  I was planning that
> we go out of our way to make sure we use primitives for all the
> primary stuff.  As I mentioned before, this might even mean
> getting rid of the arbitrary hierarchical frames stuff.

Do all DBMSs support the same set of primitive data types?

> Rob Browning <rlb@cs.utexas.edu> PGP=E80E0D04F521A094 532B97F5D64E3930

-derek
-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/      PP-ASEL      N1NWH
       warlord@MIT.EDU                        PGP key available