Report Engine Speed

Fri Feb 19 15:06:28 EST 2010

Am Freitag, 19. Februar 2010 schrieb Jeff Kletsky:
> I found the "optimizing reports" thread from October 2007, but not a lot
> since then. Has there been any significant understanding gained since
> then about what is so slow on what should be a relatively
> straightforward task? Was it ever narrowed down to bad algorithms in the
> reporting, or something else?

I've created some of the oldest reports in there. I have never done real 
profiling of the time spent in the report generation - but what I can recall 
is this: All the "number collecting calculations" in all of the reports are 
horribly inefficient, or more academic: coded in the wrong level of 
abstraction. That is, the calculation of the balance difference between start 
and end of a reporting period is usually done by obtaining the list of splits 
(transactions) in question from within the scheme code, then summing over all 
of those splits. Also, the intermediate results are not cached at all, even 
during one report generation run, so they are calculated over and over again 
when the report is generated. This turns your O(n) problem into an O(n^2) 
problem.

As I said, the problem is the level of abstaction: Most of the report code 
uses too low-level accessor functions into the txn data, like "querying for a 
list of splits", then "accumulating the sum over these splits". It would go a 
long way to choose more higher-level query functions here, like "querying for 
a balance difference for splits obeying criterion xy". The "engine" and/or the 
"libqof" part of gnucash in theory should provide sufficient such higher-level 
functions (implemented in C); it's unclear if they actually do.

However, the problem doesn't stop here: Even if those few higher-level query 
functions were being used from the schemed side of it, it would turn out their 
implementation is probably highly inefficient as well, like O(n^2) instead of 
something less than that. But as long as nobody really used these function for 
time-critical applications (or cared about the time), nobody had any incentive 
to work on the optimization here and its functional correctness was enough as 
a goal.

The optimization of the report generation obviously needs work on several 
levels. I would recommend the following:

- Use profiling to identify the 10% part of the report generation that 
consumes most of the time. Using the time-of-day is probably good enough to 
identify this section. I would guess the problem is in the calculation of each 
single amount that is displayed in the report, and most probably the amount 
calculation is programmed in scheme for the most part and using the "engine" 
query functions only on a very low level.

- Identify some higher-level query function that can fulfil the amount 
generation task, but refactors its calculation away from the reporting code. 
This higher-level query function can first be a scheme function. Once the 
functional correctness is somewhat verified, you can try to port this to C 
and/or to replace this by use of the libqof query object...

Good luck!

Regards,

Christian