Performance improvement for xml loads (+comments)

Rob Browning <rlb@cs.utexas.edu>
05 Dec 2000 11:50:42 -0600


Bill Carlson <wwc@wwcnet.nu> writes:

>  	I've been trying to make the xml stuff go a bit faster.  The
> following patch will cut my large file load (~10000 transactions)
> from about 2 minutes to about 30 seconds.  This is obviously good
> (and along the lines of a change I made a while ago for the binary
> file load).

Thanks for the patch.  I'll look at it shortly - I need to swap back
in my xml brain, though Dave may have already put it in (the patch,
not my xml brain).

> The only problem I have now is that it is still MUCH slower than the
> binary.  The file size is about 6x the size before (9 Meg vs. 1.5
> Meg) and to actually do the write it seems to use about 50 Meg of
> ram, because xml builds the whole tree in memory before writing
> anything.

One thing that's on my to-do list is to integrate zlib.  For writing,
all you have to do is set a flag and libxml will compress the output,
but for reading, since we parse incrementally (to save RAM), we need
to use zlib directly, and I've been too busy with g-wrap to fix it
yet.  This won't help RAM usage, but it will help reduce storage space
tremendously (I believe it was actually smaller than the binary format
last time I tested), and it didn't noticeably affect performance (maybe
5%, as long as you don't use too high a zlib level).
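
To sketch what I mean (untested, and the push-parser calls are from
memory of libxml2's API, so treat the exact names as assumptions):
writing compressed output is just a flag on the document, while
reading would mean decompressing with gzread() ourselves and feeding
libxml one chunk at a time:

    #include <libxml/parser.h>  /* xmlCreatePushParserCtxt, xmlParseChunk */
    #include <libxml/tree.h>    /* xmlSetDocCompressMode, xmlSaveFile */
    #include <zlib.h>           /* gzopen, gzread, gzclose */

    /* Writing: libxml does the compression itself once the flag is set. */
    static void
    write_compressed (xmlDocPtr doc, const char *filename)
    {
        xmlSetDocCompressMode (doc, 3);  /* modest level; higher mostly costs CPU */
        xmlSaveFile (filename, doc);
    }

    /* Reading: decompress in small chunks so we never hold the whole
       uncompressed file in RAM. */
    static xmlDocPtr
    read_compressed (const char *filename)
    {
        char buf[4096];
        int len;
        xmlDocPtr doc;
        xmlParserCtxtPtr ctxt;
        gzFile in = gzopen (filename, "rb");  /* also reads plain files */

        if (!in) return NULL;

        len = gzread (in, buf, sizeof (buf));
        ctxt = xmlCreatePushParserCtxt (NULL, NULL, buf, len, filename);
        while ((len = gzread (in, buf, sizeof (buf))) > 0)
            xmlParseChunk (ctxt, buf, len, 0);
        xmlParseChunk (ctxt, buf, 0, 1);  /* terminate the parse */
        gzclose (in);

        doc = ctxt->myDoc;
        xmlFreeParserCtxt (ctxt);
        return doc;
    }

The nice bit is that gzopen() reads uncompressed files transparently,
so existing data files would keep working.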

Another thing that's on my to-do list that might help runtime RAM
usage tremendously is to investigate speeding up libxml itself.  If
you do profiling, much of the time is spent just
allocating/deallocating and moving strings around.  I suspect, though
I haven't had time to check, that libxml is allocating a new string
for every tag in the tree.  If so, then reworking it to use a string
hash (or a GCache) so that each tag name is only allocated once might
dramatically reduce runtime RAM usage, and could even help
performance.
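
For anyone who wants to poke at that, the idea is plain string
interning.  A rough glib-based sketch (intern_tag() is hypothetical,
not anything in libxml today):

    #include <glib.h>

    /* One shared copy of every distinct tag name; elements would keep
       the returned pointer instead of their own g_strdup'd string. */
    static GHashTable *tag_table = NULL;

    static const char *
    intern_tag (const char *name)
    {
        char *canonical;

        if (!tag_table)
            tag_table = g_hash_table_new (g_str_hash, g_str_equal);

        canonical = g_hash_table_lookup (tag_table, name);
        if (!canonical)
        {
            canonical = g_strdup (name);  /* the one and only copy */
            g_hash_table_insert (tag_table, canonical, canonical);
        }
        return canonical;
    }

With ~10000 transactions, that's tens of thousands of identical little
strings (tag names like "split" and "value") collapsing down to one
allocation apiece.  The catch is that the node-freeing code then
mustn't free tag names per-node, which is the invasive part of the
change.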

Another thing I wanted to investigate is whether there's any way we
could add incremental writing to libxml, and if so, how hard it would
be.
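
Even without touching libxml internals, we could fake it at our layer:
write the document skeleton by hand, but build, dump, and free one
transaction subtree at a time.  A sketch (build_transaction_node() and
the element names are made up for illustration):

    #include <stdio.h>
    #include <glib.h>
    #include <libxml/tree.h>

    /* Hypothetical helper that builds the subtree for one transaction;
       it would come from our existing xml-writing code. */
    extern xmlNodePtr build_transaction_node (xmlDocPtr doc, gpointer trans);

    static void
    write_incrementally (xmlDocPtr doc, GList *transactions, FILE *out)
    {
        GList *lp;

        fprintf (out, "<?xml version=\"1.0\"?>\n<gnc>\n");
        for (lp = transactions; lp; lp = lp->next)
        {
            xmlNodePtr t = build_transaction_node (doc, lp->data);
            xmlElemDump (out, doc, t);  /* serialize just this subtree */
            fprintf (out, "\n");
            xmlFreeNode (t);            /* ...and drop it immediately */
        }
        fprintf (out, "</gnc>\n");
    }

That would bound peak RAM to roughly one transaction's worth of tree
instead of the whole 50 Meg, at the cost of writing the framing tags
ourselves.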

> This all concerns me a great deal as xml seems to be the future for
> gnucash and this experience tells me that we are in trouble as far
> as even moderately large databases of transactions go.  Any comments?
> I'd be more than happy to help make things better.  One thought I
> had was to try a "io-gncsql-r.c" and "io-gncsql-w.c" which would
> talk to an sql database (this would be different than using the
> database as the engine, it would just read and write like a file).
> Will the architecture of gnucash allow for user preferences in the
> style of saving?

All my comments about speeding up libxml aside, I also have a little
hesitation about putting too much time into it.  Our long term plan is
probably to switch to SQL across the board.  In that scenario, the XML
engine would only be used as the import/export mechanism (as far as
File IO goes), so you have to wonder whether time spent making the XML
code faster would be better spent just integrating PostgreSQL or
MySQL.  I've spoken with developers of both, and they claim it's now
possible to use each as an embedded DB if you're willing to
accommodate a few tricks...

Though there are many arguments in favor of getting SQL soon, another
is that as I rework the quote system, it seems I'm going to have to
write a bunch of code to do things that would be natural (and trivial)
for any even semi-respectable DB.  I'm not looking forward to
re-inventing that wheel.

So, presuming we keep our current frame of mind, the XML format may
very well be the actual file format only in the short term.

> In any case, what follows is my brief patch which I'd appreciate if
> it were put into the CVS.  Thanks!

Thanks very much for the work.

-- 
Rob Browning <rlb@cs.utexas.edu> PGP=E80E0D04F521A094 532B97F5D64E3930