Startup speed

prl prl at ozemail.com.au
Fri Jun 21 23:37:37 EDT 2013


On 22/06/13 12:04, R. Victor Klassen wrote:
> Without actually looking at the code, I would assume that it is the manner in which the datafile is read and processed that is the bottleneck. Disk I/O is much slower than memory accesses, but any modern system can read a rather large data file in a single gulp in a matter of a fraction of a second. But parsing the file and building a data structure that uses the entire file's contents, now that's another matter. And if the file is being read a line (or a byte) at a time, that would also be slow.
If the I/O goes through the U*ix stdio package or anything similar, the 
file will be read blockwise, even if the user code is requesting that 
data one byte at a time (or one line at a time), so system overheads 
should not be too bad in this case.

If the file is really being read by byte-at-a-time system calls (for 
example, read(fd, &ch, 1)), then there could be significant system 
overhead.
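
To make the two cases concrete, here is a minimal sketch in Python, 
where os.read() is a thin wrapper over the read() system call and 
open() in binary mode returns a buffered reader by default. The 
function names and the throwaway sample file are hypothetical, and 
this is only an illustration, not how GnuCash reads its files.

```python
import os
import tempfile


def count_bytes_unbuffered(path):
    """Read one byte per os.read() call: one system call per byte."""
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)  # O_BINARY exists only on Windows
    fd = os.open(path, flags)
    count = 0
    try:
        while os.read(fd, 1):  # an empty bytes object signals end of file
            count += 1
    finally:
        os.close(fd)
    return count


def count_bytes_buffered(path):
    """Request one byte at a time, but through a buffered file object."""
    count = 0
    with open(path, "rb") as f:  # binary mode is buffered by default
        while f.read(1):  # mostly served from memory; the OS is asked for whole blocks
            count += 1
    return count


def make_sample_file(size=100_000):
    """Create a throwaway file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
    return path


if __name__ == "__main__":
    path = make_sample_file()
    try:
        print(count_bytes_unbuffered(path), count_bytes_buffered(path))
    finally:
        os.remove(path)
```

Both functions see the same bytes; the difference is only in how many 
times the kernel is entered, which is where the system overhead would 
come from.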

System CPU time is not more than 2-3% on my MacBook (2.26GHz Intel Core 
2 Duo) when I open GnuCash and read my accounts file, so it doesn't look 
as though system overheads are significant in the load speed, or at 
least not for me.
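
For anyone wanting to check the user/system split for a piece of code 
rather than for a whole application, a rough sketch using Python's 
os.times() follows. The helper names and the stand-in workload are 
hypothetical, and treating the figure as a share of total CPU time is 
an assumption about how such a percentage is read.

```python
import os


def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def system_share(user, system):
    """Fraction of total CPU time spent in the kernel (0.0 if nothing measured)."""
    total = user + system
    return system / total if total > 0 else 0.0


if __name__ == "__main__":
    # Stand-in workload: pure computation, so little time should be spent in the kernel.
    user, system = cpu_split(lambda: sum(i * i for i in range(2_000_000)))
    print(f"user {user:.2f}s, system {system:.2f}s, "
          f"system share {system_share(user, system):.1%}")
```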

I'd expect that if a stdio-like library is being used, the main 
limitation on file read speed would be the XML parse, as David Carlson 
notes in another post.

XML is not line-at-a-time, and parsing it that way would be less 
efficient than parsing it byte at a time through a good buffering I/O 
library. <tag1>stuff</tag1><tag2>morestuff</tag2> is perfectly legal and 
doesn't require any line breaks, and I doubt that many XML parsers would 
read line-at-a-time. I'd expect them to read byte at a time (and leave 
buffering for lower level libraries).
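
As a small illustration that line breaks carry no structural meaning, 
the sketch below parses the fragment above with and without line 
breaks, and also feeds it to the parser in arbitrary chunks. It uses 
Python's xml.etree.ElementTree; the <root> wrapper is my assumption 
(a well-formed document needs a single root element), and the helper 
names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The <root> wrapper is an assumption added so the fragment is a well-formed document.
COMPACT = "<root><tag1>stuff</tag1><tag2>morestuff</tag2></root>"
PRETTY = """<root>
  <tag1>stuff</tag1>
  <tag2>morestuff</tag2>
</root>"""


def leaf_items(xml_text):
    """Return (tag, text) for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


def leaf_items_chunked(xml_text, chunk_size=7):
    """Same result, feeding the parser fixed-size chunks that ignore line ends."""
    parser = ET.XMLParser()
    for i in range(0, len(xml_text), chunk_size):
        parser.feed(xml_text[i:i + chunk_size])
    root = parser.close()
    return [(child.tag, child.text) for child in root]


if __name__ == "__main__":
    print(leaf_items(COMPACT) == leaf_items(PRETTY) == leaf_items_chunked(PRETTY))
```

The chunk boundaries fall wherever they fall, including in the middle 
of a tag, and the parser does not care.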

The whitespace and line breaks in a GnuCash accounts file are there 
almost entirely for a human reader.

Some small speedup of the parse might be achieved by not pretty-printing 
the XML as it is written out, saving all the whitespace used for neat 
indenting from needing to be skipped by the parser. But there's not much 
to be gained there: my accounts file is ~6.5MB uncompressed, and of 
that, about 0.8MB is non-linebreak white space. There are a further 
~180000 line breaks, which translate to the same number of characters 
on OS X and Linux, and ~360000 characters on Windows if Windows rather 
than Linux line terminators are used, so again, the line breaks 
wouldn't seem to me to be imposing a large burden on the parser.
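
The arithmetic behind those figures can be checked with a few lines 
of Python. The constants are the approximate sizes quoted above for my 
file; the helper names are hypothetical, and the shares are taken 
against the 6.5MB figure without adjusting it for the terminator 
style.

```python
FILE_BYTES = 6_500_000    # ~6.5MB uncompressed accounts file
INDENT_BYTES = 800_000    # ~0.8MB of non-linebreak white space
LINE_BREAKS = 180_000     # ~180000 line breaks


def line_break_bytes(line_breaks, terminator):
    """Characters spent on line terminators for a given terminator string."""
    return line_breaks * len(terminator)


def share(part, whole):
    """Fraction of the file taken up by `part`."""
    return part / whole


if __name__ == "__main__":
    lf = line_break_bytes(LINE_BREAKS, "\n")      # OS X and Linux terminators
    crlf = line_break_bytes(LINE_BREAKS, "\r\n")  # Windows terminators
    print(f"indent white space: {share(INDENT_BYTES, FILE_BYTES):.1%}")
    print(f"LF line breaks:     {lf} chars, {share(lf, FILE_BYTES):.1%}")
    print(f"CRLF line breaks:   {crlf} chars, {share(crlf, FILE_BYTES):.1%}")
```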

The amount of non-linebreak white space could be about halved by using 
an indent of one space rather than two spaces in the pretty-printer, for 
only a small loss of human readability.
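
A quick way to see the effect of the indent width is to pretty-print 
the same tree twice and count the spaces. This sketch assumes Python 
3.9 or later (for xml.etree.ElementTree.indent); the element names are 
made up for illustration and are not GnuCash's actual schema.

```python
import xml.etree.ElementTree as ET


def build_sample(accounts=3, splits=2):
    """Build a small nested tree; tags and text contain no spaces."""
    root = ET.Element("book")
    for a in range(accounts):
        account = ET.SubElement(root, "account")
        for s in range(splits):
            ET.SubElement(account, "split").text = f"value{a}{s}"
    return root


def indent_spaces(indent_width):
    """Count the spaces used when pretty-printing with the given indent width."""
    root = build_sample()
    ET.indent(root, space=" " * indent_width)
    # Every space in the output is indentation, since the content has none.
    return ET.tostring(root, encoding="unicode").count(" ")


if __name__ == "__main__":
    print(indent_spaces(2), indent_spaces(1))
```

In a real file some of the non-linebreak white space sits inside 
attribute lists and text, so the saving would be somewhat less than 
exactly half.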

As well as the XML parsing and GnuCash data structure creation, there is 
also the cost of decoding the compressed binary stream, creating the 
decoder's data structures and uncompressing it. However, that's quite 
a small cost for me: 0.05 sec elapsed time to uncompress my 0.5MB 
compressed/6.5MB uncompressed accounts file.
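
A measurement of that kind can be sketched as follows, assuming the 
compressed stream is in gzip format. The sample data is synthetic and 
the function names are hypothetical; timings will of course vary with 
the machine and the data.

```python
import gzip
import time


def make_sample_xml(entries=50_000):
    """Build a repetitive XML-like byte string as stand-in data."""
    row = b"  <entry><id>12345</id><value>100/100</value></entry>\n"
    return b"<book>\n" + row * entries + b"</book>\n"


def timed_decompress(compressed):
    """Return (decompressed_bytes, elapsed_seconds) for a gzip byte string."""
    start = time.perf_counter()
    data = gzip.decompress(compressed)
    return data, time.perf_counter() - start


if __name__ == "__main__":
    raw = make_sample_xml()
    packed = gzip.compress(raw)
    data, elapsed = timed_decompress(packed)
    print(f"{len(packed)} bytes compressed, {len(data)} bytes uncompressed, "
          f"{elapsed:.3f} sec to uncompress")
```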

Peter

