Report Engine Speed
stimming at tuhh.de
Fri Feb 19 15:06:28 EST 2010
Am Freitag, 19. Februar 2010 schrieb Jeff Kletsky:
> I found the "optimizing reports" thread from October 2007, but not a lot
> since then. Has there been any significant understanding gained since
> then about what is so slow on what should be a relatively
> straightforward task? Was it ever narrowed down to bad algorithms in the
> reporting, or something else?
I've created some of the oldest reports in there. I have never done real
profiling of the time spent in the report generation - but what I can recall
is this: All the "number collecting calculations" in all of the reports are
horribly inefficient, or more academic: coded in the wrong level of
abstraction. That is, the calculation of the balance difference between start
and end of a reporting period is usually done by obtaining the list of splits
(transactions) in question from within the scheme code, then summing over all
of those splits. Also, the intermediate results are not cached at all, even
during one report generation run, so they are calculated over and over again
when the report is generated. This turns your O(n) problem into an O(n^2)
As I said, the problem is the level of abstaction: Most of the report code
uses too low-level accessor functions into the txn data, like "querying for a
list of splits", then "accumulating the sum over these splits". It would go a
long way to choose more higher-level query functions here, like "querying for
a balance difference for splits obeying criterion xy". The "engine" and/or the
"libqof" part of gnucash in theory should provide sufficient such higher-level
functions (implemented in C); it's unclear if they actually do.
However, the problem doesn't stop here: Even if those few higher-level query
functions were being used from the schemed side of it, it would turn out their
implementation is probably highly inefficient as well, like O(n^2) instead of
something less than that. But as long as nobody really used these function for
time-critical applications (or cared about the time), nobody had any incentive
to work on the optimization here and its functional correctness was enough as
The optimization of the report generation obviously needs work on several
levels. I would recommend the following:
- Use profiling to identify the 10% part of the report generation that
consumes most of the time. Using the time-of-day is probably good enough to
identify this section. I would guess the problem is in the calculation of each
single amount that is displayed in the report, and most probably the amount
calculation is programmed in scheme for the most part and using the "engine"
query functions only on a very low level.
- Identify some higher-level query function that can fulfil the amount
generation task, but refactors its calculation away from the reporting code.
This higher-level query function can first be a scheme function. Once the
functional correctness is somewhat verified, you can try to port this to C
and/or to replace this by use of the libqof query object...
More information about the gnucash-devel