Performance improvement for xml loads (+comments)

Al Snell alaric@alaric-snell.com
Sat, 9 Dec 2000 02:34:56 +0000 (GMT)


On 8 Dec 2000, Derek Atkins wrote:

> Also, I'm looking at network protocols.  When designing a network
> protocol, I don't like assuming you have 100BaseT between your client
> and server.  That means you have to make protocols as compact as
> possible.  I also don't like to compress a network protocol, because
> that makes it REALLY hard to debug.  That's why I like binary formats,
> they are easy to parse (if you know the format, or at least has a tool
> that understands the format) and take up less space on the wire.
> Imagine sending XML over a 14.4 modem. :)

Yes. People talk at great length about the wonderfulness of XML's human
readability, then say that you can account for its verbosity by
gzipping it. A gzipped file is *very* corruption sensitive, more
so than many conventional binary formats. It's also impossible to
hand tweak without un-gzipping it. Gzipping and ungzipping are actually
quite memory and CPU intensive operations, not to mention the cost of
parsing XML syntax.
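
To illustrate the corruption-sensitivity point, here is a minimal sketch,
assuming Python's zlib module stands in for gzip (both wrap the same
deflate stream). The sample XML record and the helper names are
hypothetical. It flips every single bit of a compressed blob in turn and
counts how many flips make the original unrecoverable, whereas the same
flip in the plain text damages exactly one byte.

```python
import zlib


def flip_bit(blob: bytes, bit_index: int) -> bytes:
    """Return a copy of blob with exactly one bit inverted."""
    out = bytearray(blob)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def recovers(compressed: bytes, original: bytes) -> bool:
    """True only if the blob still decompresses to the original data."""
    try:
        return zlib.decompress(compressed) == original
    except zlib.error:
        # Invalid deflate data or a failed checksum: nothing is recovered.
        return False


def count_fatal_flips(original: bytes) -> tuple[int, int]:
    """Try every single-bit flip; return (fatal flips, total flips)."""
    compressed = zlib.compress(original)
    total = len(compressed) * 8
    fatal = sum(
        not recovers(flip_bit(compressed, i), original) for i in range(total)
    )
    return fatal, total


def damaged_bytes(original: bytes, corrupted: bytes) -> int:
    """Count the byte positions that differ between two equal-length blobs."""
    return sum(a != b for a, b in zip(original, corrupted))


# Hypothetical sample data, not the real GnuCash XML schema.
SAMPLE = b"<txn><date>2000-12-09</date><amount>12.50</amount></txn>\n" * 200
```

Calling count_fatal_flips(SAMPLE) gives the number of single-bit errors
that lose the file, and damaged_bytes(SAMPLE, flip_bit(SAMPLE, 0)) shows
the plain-text case for comparison.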

Instead of gzipping XML, it's far neater to use a binary data format
with a small suite of debugging tools to parse the files and hand-tweak
them - perhaps even a filter to convert to and from a textual
representation.

If I'd been around when people were first discussing an XML file format
for GnuCash, I'd have fought it like crazy, since it'll be so much more
effort than it's worth :-(

We have a working but inefficient implementation. How long has it taken?
An XDR version would take no time at all; I've got much of it already
written in a CVS repository. From the .x files I've written, rpcgen
will create the C type definitions for GnuCash data structures, and
C code to load and save them. Easy peasy!
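
For a feel of what XDR puts on the wire, here is a minimal hand-rolled
sketch in Python rather than rpcgen-generated C. The record layout
(description string, amount in cents as a hyper, date as an unsigned int)
is a hypothetical example, not the actual .x definitions. The encoding
rules used are the standard XDR ones: big-endian integers, and strings
stored as a 4-byte length followed by the bytes, zero-padded to a
multiple of four. The dump function hints at the "filter to a textual
representation" idea.

```python
import struct


def pack_string(text: str) -> bytes:
    """XDR string: 4-byte big-endian length, bytes, zero padding to 4."""
    raw = text.encode("ascii")
    pad = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * pad


def unpack_string(buf: bytes, offset: int) -> tuple[str, int]:
    """Decode an XDR string at offset; return (text, next offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    text = buf[start:start + length].decode("ascii")
    padded = (length + 3) // 4 * 4  # skip the zero padding as well
    return text, start + padded


def pack_transaction(description: str, amount_cents: int, date: int) -> bytes:
    """Hypothetical record: string, hyper (8 bytes), unsigned int (4 bytes)."""
    return pack_string(description) + struct.pack(">qI", amount_cents, date)


def unpack_transaction(buf: bytes) -> tuple[str, int, int]:
    """Inverse of pack_transaction."""
    description, offset = unpack_string(buf, 0)
    amount_cents, date = struct.unpack_from(">qI", buf, offset)
    return description, amount_cents, date


def dump_transaction(buf: bytes) -> str:
    """Debugging filter: render the binary record as one line of text."""
    description, amount_cents, date = unpack_transaction(buf)
    return f"{date} {amount_cents / 100:.2f} {description}"
```

A record such as pack_transaction("Groceries", 1250, 20001209) comes out
at 28 bytes, and dump_transaction turns it back into something a human
can read.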

> Even when we move to a database, I think we'll still need to write our
> own network protocol for a distributed system (assuming we move to a
> multiple-client, single datastore model).  That's because I know of no
> database network protocol that actually implements data encryption,
> data integrity, and strong user authentication.  Even CORBA's network
> protocol is rather limited in terms of security.  I would love to be
> proven wrong on this, as it would help me immensely in various areas.

...whereas XDR conventionally ties in nicely with ONC RPC. ONC RPC does have
advanced encryption available, just rarely implemented in most
environments, but we can hack in our own system quite easily. I'm on the
IETF working group for RPC, pushing to develop a portable implementation
of the security stuff anyway.

> Honestly, I don't believe that a normal user with a couple of years
> worth of data stored in GnuCash would have any prayer of being able to
[...]
> reduce the disk storage, but you cannot grep through a compressed file
> either.

With regards to error correction, hoping for users to fix broken files
is a dead end. Avoiding the problem in the first place is the
responsibility of file transfer protocols and disk storage subsystems. The
people who applaud Unix for storing critical configuration in text files,
saying that they can correct all sorts of errors in them, always amuse me
by storing those text files in a *binary filesystem*! Fix *that* with vi!

If we want the data files to be able to survive bit rot, we can write a
set of filters to encode arbitrary files into Hamming codes and back, and
release them to the general public to use for *any* file they fear will be
randomly corrupted.
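
Such a filter could look roughly like the following sketch, assuming the
simplest variant, Hamming(7,4): each 4-bit nibble becomes a 7-bit
codeword (stored one per byte here for clarity, which wastes a bit) and
any single flipped bit per codeword is corrected on decode. The function
names are hypothetical, and a real tool would want interleaving to cope
with burst errors.

```python
def encode_nibble(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def decode_codeword(code: int) -> int:
    """Correct up to one flipped bit, then return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)  # 1-based position of the error
    if syndrome:
        bits[syndrome - 1] ^= 1
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(bit << i for i, bit in enumerate(data))


def encode(data: bytes) -> bytes:
    """Encode arbitrary bytes: two codewords per input byte."""
    out = bytearray()
    for byte in data:
        out.append(encode_nibble(byte >> 4))
        out.append(encode_nibble(byte & 0x0F))
    return bytes(out)


def decode(coded: bytes) -> bytes:
    """Inverse of encode, repairing single-bit errors per codeword."""
    out = bytearray()
    for i in range(0, len(coded), 2):
        high = decode_codeword(coded[i])
        low = decode_codeword(coded[i + 1])
        out.append((high << 4) | low)
    return bytes(out)
```

With this, decode(encode(data)) returns data even after one bit has been
flipped in each stored codeword; two flips in the same codeword are
beyond what this code can repair.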

I've heard people say that XML is good because a semantic error ("bug") in
the application may cause it to produce an invalid output that you have to
fix. However, I think that the chances of it doing this without actually
destroying any data (and thus making it more worthwhile finding a backup
copy!) are so tiny that it's not worth the costs...

> Do all DBMS systems support the same set of primitive data types?

There's a basic set you can depend on, but it gets hairy above that. I
design RDBMS schemas for a living... in cross-DBMS
("heterogeneous") environments (MySQL and PostgreSQL a speciality -
www.upmystreet.com runs on my schemas, and since the introduction of the
classified ads systems, it's taking more load than the GnuCash databases
of a pretty large organisation will ever take :-)

Using an embedded SQL "server" within GnuCash will be a Good Thing. From
an outside perspective, it will just mean that GnuCash uses a file of some
weird binary format (that nobody has a hope in hell of hand-tweaking); but
it means that with little more than the flick of a compile-time switch, it
could also use a "live" RDBMS server, sharing access with other users and
all that.
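
As a rough sketch of the embedded-engine idea, assuming SQLite (via
Python's sqlite3 module) as the embedded engine and a made-up one-table
schema, neither of which is proposed above: the application only ever
talks SQL through a connection object, so pointing it at a live server
would mostly be a matter of swapping how the connection is opened.

```python
import sqlite3


def open_book(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a book; a file path gives the opaque binary file."""
    conn = sqlite3.connect(path)
    # Hypothetical schema for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS splits ("
        " account TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def post(conn: sqlite3.Connection, account: str, amount_cents: int) -> None:
    """Record one split against an account."""
    # The "?" placeholder style is sqlite3's; other drivers may differ.
    conn.execute(
        "INSERT INTO splits (account, amount_cents) VALUES (?, ?)",
        (account, amount_cents),
    )
    conn.commit()


def balance(conn: sqlite3.Connection, account: str) -> int:
    """Sum all splits for an account (0 if it has none)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM splits WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]
```

Only open_book knows which engine is behind the connection; post and
balance would stay the same against a shared server, apart from
driver-specific placeholder syntax.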

ABS

-- 
                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software