prl at ozemail.com.au
Fri Jun 21 23:37:37 EDT 2013
On 22/06/13 12:04, R. Victor Klassen wrote:
> Without actually looking at the code, I would assume that it is the manner in which the datafile is read and processed that is the bottleneck. Disk I/O is much slower than memory accesses, but any modern system can read a rather large data file in a single gulp in a matter of a fraction of a second. But parsing the file and building a data structure that uses the entire file's contents, now that's another matter. And if the file is being read a line (or a byte) at a time, that would also be slow.
If the I/O goes through the U*ix stdio package or anything similar, the
file will be read blockwise, even if the user code is requesting that
data one byte at a time (or one line at a time), so system overheads
should not be too bad in this case.
If the file is really being read by byte-at-a-time system calls (for
example, read(fd, &ch, 1)), then there could be significant system
overhead.
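
As a rough illustration of the difference between the two cases above,
here is a small Python sketch (not GnuCash code, and Python rather than
C stdio). CountingRaw and count_raw_reads are names I made up for the
example; CountingRaw stands in for the operating system and simply
counts how many times the layer above asks it for data. The caller reads
one byte at a time in both cases; only the buffering differs.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a file descriptor; counts low-level reads."""

    def __init__(self, data):
        self._data = data
        self._pos = 0
        self.calls = 0  # number of requests reaching the "system" layer

    def readable(self):
        return True

    def readinto(self, b):
        # Each call here plays the role of one read() system call.
        self.calls += 1
        n = min(len(b), len(self._data) - self._pos)
        b[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n


def count_raw_reads(data, buffered, buffer_size=8192):
    """Read data one byte at a time; return (bytes read, low-level reads)."""
    raw = CountingRaw(data)
    # With buffering, the caller still asks for one byte at a time, but
    # the buffer is refilled from the raw layer a block at a time.
    stream = io.BufferedReader(raw, buffer_size) if buffered else raw
    nbytes = 0
    while stream.read(1):
        nbytes += 1
    return nbytes, raw.calls
```

Calling count_raw_reads(data, buffered=False) should report roughly one
low-level read per byte, while buffered=True should report only a
handful for the same data, which is the blockwise behaviour described
above.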
System CPU time is not more than 2-3% on my MacBook (2.26GHz Intel Core
2 Duo) when I open GnuCash and read my accounts file, so it doesn't look
as though system overheads are significant in the load speed, or at
least not for me.
I'd expect that if a stdio-like library is being used, the main
limitation on file read speed would be the XML parse, as David Carlson
notes in another post.
XML is not line-at-a-time, and parsing it that way would be less
efficient than parsing it byte at a time through a good buffering I/O
library. <tag1>stuff</tag1><tag2>morestuff</tag2> is perfectly legal and
doesn't require any line breaks, and I doubt that many XML parsers would
read line-at-a-time. I'd expect them to read byte at a time (and leave
buffering for lower level libraries).
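
To make the point about line breaks concrete, here is a sketch using
Python's standard xml.etree.ElementTree (I am not claiming this is the
parser GnuCash uses). The <root> wrapper is my addition, since a
well-formed document needs a single top-level element, and
parse_in_chunks is a hypothetical helper name. It feeds the parser
arbitrary-sized chunks, with no regard for where lines end.

```python
import xml.etree.ElementTree as ET

# The same content with and without pretty-printing. The <root> wrapper
# is an assumption added so that each sample is a well-formed document.
COMPACT = b"<root><tag1>stuff</tag1><tag2>morestuff</tag2></root>"
PRETTY = b"<root>\n  <tag1>stuff</tag1>\n  <tag2>morestuff</tag2>\n</root>\n"


def parse_in_chunks(xml_bytes, chunk_size):
    """Feed the parser fixed-size chunks; return (tag, text) per element."""
    parser = ET.XMLPullParser(events=("end",))
    for i in range(0, len(xml_bytes), chunk_size):
        parser.feed(xml_bytes[i:i + chunk_size])
    parser.close()
    # Collect the queued "end" events once all the input has been fed.
    return [(elem.tag, (elem.text or "").strip())
            for _event, elem in parser.read_events()]
```

parse_in_chunks(COMPACT, 1) and parse_in_chunks(PRETTY, 7) should give
the same list of elements: the chunk boundaries and the line breaks make
no difference to what the parser sees, only to how much white space it
has to skip.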
The whitespace and line breaks in a GnuCash accounts file are there
almost entirely for a human reader.
Some small speedup of the parse might be achieved by not pretty-printing
the XML as it is written out, saving all the whitespace used for neat
indenting from needing to be skipped by the parser. But there's not much
to be gained there: my accounts file is ~6.5MB uncompressed, and of
that, about 0.8MB is non-linebreak white space. There are a further
~180000 line breaks, which translate to the same number of characters
on OS X and Linux, and ~360000 characters on Windows if Windows rather
than Linux line terminators are used, so again, the line breaks
wouldn't seem to me to be imposing a large burden on the parser.
The amount of non-linebreak white space could be about halved by using
an indent of one space rather than two spaces in the pretty-printer, for
only a small loss of human readability.
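
For anyone who wants to make the same measurement on their own file,
here is a minimal sketch. whitespace_profile is a name I made up, and it
works on the uncompressed bytes of the file. Note that it counts every
space and tab, including those inside text content such as transaction
descriptions, so the "other_whitespace" figure is an upper bound on what
the pretty-printer's indenting contributes.

```python
def whitespace_profile(data):
    """Count line-break and other white-space bytes in uncompressed data."""
    return {
        "total": len(data),
        # One per line on Unix-like systems.
        "line_feeds": data.count(b"\n"),
        # Non-zero only if Windows (CRLF) line terminators are in use.
        "carriage_returns": data.count(b"\r"),
        # Spaces and tabs, whether from indenting or from text content.
        "other_whitespace": data.count(b" ") + data.count(b"\t"),
    }
```

Dividing each count by "total" gives the fraction of the file the parser
spends skipping that kind of character.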
As well as the XML parsing and GnuCash data structure creation, there is
also the cost of decoding the compressed binary stream, creating the
decoder's data structures and uncompressing it. However, that's quite a
small cost for me: 0.05sec elapsed time to uncompress my 0.5MB
compressed/6.5MB uncompressed accounts file.
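
A sketch of how the decompression cost could be timed in isolation,
assuming the accounts file is gzip-compressed (that is my understanding
of the compressed XML format, but check your own file). timed_gunzip is
a hypothetical helper, and the file is read into memory first so that
disk I/O is not included in the timing.

```python
import gzip
import time


def timed_gunzip(path):
    """Return (compressed size, uncompressed size, seconds to decompress)."""
    with open(path, "rb") as f:
        compressed = f.read()
    # Time only the decompression, not the read from disk.
    start = time.perf_counter()
    data = gzip.decompress(compressed)
    elapsed = time.perf_counter() - start
    return len(compressed), len(data), elapsed
```

The timing will vary with the machine and the file, so treat a single
run as indicative only.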