SQL backend performance

Tue Feb 23 09:15:27 EST 2010

On Tuesday 23 February 2010, Donald Allen wrote:
> As I've mentioned in other posts, I have a pretty large gnucash
> datafile -- more than 20 Mb uncompressed. I've been testing the SQL
> backend and I'm concerned about the performance, particularly startup
> performance.
> 
> I've been doing this testing on an inexpensive little HP desktop
> machine, dual-core 2.8 Ghz AMD Athlon, 2 Gb memory, 300 Gb 7200 rpm
> SATA drive, Arch Linux, 2.6.32 kernel. It may have been cheap, but
> it's damned fast (I've been in the computer business since 1964 until
> I retired last Oct. and the cost-performance of today's hardware just
> boggles my mind, especially when I think about what we put up with 20
> or 30 years ago; you haven't lived until you've tried getting your
> work done *and* stayed sane while sitting in front of a VT100 talking
> to an overloaded Vax 780 running 4.1BSD; motto: "It was hard to build,
> it ought to be hard to use"). From my gnucash xml file, I've created
> sqlite3 and postgresql databases containing the same data.
> 
> Here are the average data-load timings, in seconds, from just a couple
> of tests per storage method (this is from the point gnucash says
> "reading file" until it's up):
> 
> xml	9.10
> sqlite3	45.41
> postgresql	45.46
> 
> My mental model, which may be wrong, of what the SQL backend is doing
> is that it is reading the entire database at startup into memory, as
> opposed to retrieving the data as-needed (which is the way I'd guess
> gnucash+database would be architected if it was being written from
> scratch; I'll guess further that given the file-based history of
> gnucash, it's existing architecture precluded using the incremental
> approach when adding the database backend). I'm guessing this because
> of the amount of time it takes to start up when pointed either at
> postgresql or sqlite3 versions of the same data. I am further assuming
> that, with the SQL backend, as the user adds new things (accounts,
> transactions, etc.), new objects get inserted into the database and
> modified objects get updated, on the spot. I'm guessing this because
> as I make changes, the 'save' button remains grayed-out. So the
> primary advantage of the database backend is that file saves are not
> necessary. But, at least in my case, I am paying about a 36 second
> price at startup to avoid the file saves. File saves on my machine
> take about 2 seconds, much faster than reading the file, probably
> because the data is being written to the buffer cache. So I'd need to
> do 18 file saves during a gnucash session to be worse off (kept
> waiting) with the xml file than with the data in either flavor of
> database. And that assumes that I am always waiting for file saves to
> complete, which is not always the case (I frequently do them after
> completing a chunk of work, e.g., entering an investment transaction
> from a statement, and the file-save happens while I turn my attention
> to the paperwork to figure out what I need to do next).
> 
> While I didn't do a lot of timed tests and I've used the word "guess"
> an awful lot above (corrections to what I've said above from someone
> who actually knows what (s)he is talking about are perfectly welcome),
> the numbers confirm what I already knew from many more untimed tests,
> that at least in my case, startup takes an unacceptably long time. If
> this is the way the released version performs and my
> back-of-the-envelope analysis of the tradeoff is correct, I don't
> think I'd use the database stuff and just continue with the xml file.
> But if you think I've missed something here, please feel free.
> 
Your assumptions on how things work are correct.

And I noticed this performance decrease as well.

There is one difference between the xml and the sql backends that may 
influence this (at least in part): the sql backend writes lots of debug 
information to gnucash.trace at present. I don't know how much impact this 
has, I haven't tested without debug information, but if we disable the debug 
information before the 2.4 release, it will surely narrow the gap.

In the future this may still be improved such that data is only queried for 
when needed, but this requires a lot of internal modifications. This would not 
have been possible for the next stable release.

Geert