GDA: PostgreSQL empty slots table

Mark Johnson mrj001 at shaw.ca
Sun Feb 10 01:31:07 EST 2008


Phil Longstaff wrote:
>
> I just saw an e-mail on the libgda mailing list which said that 
> GdaQuery might be deprecated in the 4.0 series.
I just saw that too.  It is disappointing.  You have put in a lot of 
work using the GdaQuery.

I had compiled the V4 branch with the idea of testing it with 
gnucash-gda.  Not necessarily a good idea, as it would mean testing both 
gda and gnucash-gda simultaneously.  Thus, so far, I have not installed 
it.  As you will see below, I am having enough trouble with the current 
(3.1.2) PostgreSQL provider.
>>>
>>> It looks as though the postgresql backend has a bug.  Using the 
>>> object model, I create an XML-based representation of the table.  I 
>>> create the 'slot_id' column with type integer and add the 'AUTOINC' 
>>> flag to it.  For mysql and sqlite, this is an INTEGER AUTO_INCREMENT 
>>> column, while for postgresql, it should be a SERIAL column.  
>> Agreed.  I have tracked this far enough to file a reasonable bug 
>> report against the PostgreSQL backend.  I just ran out of time to do 
>> it.  I'll get to it in the next couple of days.
>
> Unfortunately, any fix for this will take time to propagate to the 
> distributions and make it out to people who want to use the GDA 
> backend with postgresql.  We might need a workaround.
Yes, I think an "ALTER TABLE....." might work as a workaround.  It would 
not be a good thing to require the user to do, however.  And doing it 
from code gets one into exactly the kind of "for this version of this 
provider, do that" special-casing we want to avoid.

Another possibility would be to do a "SELECT MAX(slot_id) FROM slots;" 
followed by an INSERT with a specified slot_id.  Two problems: I don't 
like imposing this overhead on non-PostgreSQL users, and the PostgreSQL 
provider has especially poor performance with its GdaQuery (see below), 
so imposing an extra round trip on it carries a higher penalty than it 
should.
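Schematically, that second workaround looks like this (column names 
other than slot_id are made up for illustration):

SELECT MAX(slot_id) FROM slots;
-- then, using the value returned above plus one:
INSERT INTO slots (slot_id, obj_guid, name) VALUES (43, '...', '...');

(Strictly speaking there is a third problem: two connections could read 
the same MAX and collide on the INSERT.)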

That's why they're called "workarounds" - they aren't good enough to be 
called "solutions".

On the other hand, finishing this gnucash backend will itself take time, 
during which a fix could propagate.  Though I doubt that would be enough 
time.

I have now filed a bug report about this SERIAL problem:
http://bugzilla.gnome.org/show_bug.cgi?id=515528
>
>>
>>> I don't really want to special-case the code for various backends 
>>> since libgda is supposed to remove that need.  There are other cases 
>>> where it might be required.  
>> Agreed.  Not having special cases for different backends is a major 
>> reason to use libgda.  And it has enough backends that supporting 
>> special cases would be an undesirable amount of work.
>
> Well, we could limit support to sqlite, mysql and postgresql.  They 
> are the 3 backends for which the provider status is "fully functional".
I hadn't noticed that.  I had planned to limit my testing to those three 
backends though.  It would simply be too much work for me to test a 
dozen libgda backends.

I wonder what not being "fully functional" implies.  For example, when 
you execute a GdaQuery, libgda checks whether the provider object 
implements a function for this purpose.  If so, it calls that function 
on the provider.  If not, libgda falls back to code that builds a 
GdaCommand object and executes that.  PostgreSQL's provider implements 
the function (poorly, from a performance point of view) and MySQL's does 
not.  Both work.

The reason I say PostgreSQL's provider does this poorly from a 
performance point of view is that it converts all such queries into 
prepared statements.  This entails three round-trips to the db - one to 
prepare the statement, one to execute it, and another to deallocate it!  
This is done for every query.  There is a "FIXME" in the code here, but 
it looks like it will only help when the client uses GdaQuery with 
parameters, which gnucash-gda does not.  I foresee a risk that the fix 
to this FIXME may cause an extremely large buildup of cached prepared 
statements, which, in turn, could cause its own performance degradation.
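At the SQL level, the three round trips amount to the following 
(schematic only - the provider actually does the equivalent through 
libpq):

PREPARE q1 AS SELECT MAX(slot_id) FROM slots;  -- round trip 1: prepare
EXECUTE q1;                                    -- round trip 2: execute
DEALLOCATE q1;                                 -- round trip 3: deallocate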

Additionally, for every query, some other part of the PostgreSQL 
provider executes two queries per table column - checking for unique and 
primary keys.  I am told this comes from 
gda_data_model_describe_column(), but I haven't studied the code to see 
why that is called for each gnucash-gda query.  (E.g. would this happen 
for a "SELECT MAX(slot_id).." scalar query?)  I have filed a performance 
bug on this one:
http://bugzilla.gnome.org/show_bug.cgi?id=513149
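I have not dug out the provider's exact SQL, but the flavor of the 
per-column check is something like this (illustrative only - the 
provider may well go through pg_catalog rather than information_schema):

SELECT c.constraint_type
  FROM information_schema.table_constraints c
  JOIN information_schema.key_column_usage k
    ON k.constraint_name = c.constraint_name
 WHERE k.table_name = 'slots'
   AND k.column_name = 'slot_id'
   AND c.constraint_type IN ('PRIMARY KEY', 'UNIQUE');
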
Investigating why none of these metadata queries ever returned a single 
row led me to:
http://bugzilla.gnome.org/show_bug.cgi?id=513543

Added to the SERIAL problem above, I seem to have found just as many 
problems with the PostgreSQL provider as with the software I am 
nominally testing - namely gnucash-gda.
>>
>>> SQLite, for example, does not allow more than one row to be added in 
>>> an INSERT statement, whereas mysql and postgresql do.  I could speed 
>>> up initial save to a blank db by writing all accounts at once, for 
>>> example, but sqlite couldn't handle it and would need them to be 
>>> written one at a time.  Note that the libgda object model does not 
>>> allow more than 1 row to be written either, so that would need to be 
>>> done with SQL directly.
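(For concreteness: the distinction is between the multi-row VALUES form, 
which MySQL and PostgreSQL accept but SQLite does not, and one INSERT 
per row.  Column names below are illustrative only.)

INSERT INTO accounts (guid, name)
  VALUES ('g1', 'Root'), ('g2', 'Assets');  -- mysql/postgresql only

INSERT INTO accounts (guid, name) VALUES ('g1', 'Root');
INSERT INTO accounts (guid, name) VALUES ('g2', 'Assets');  -- sqlite: one row per statement
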
>> I wasn't aware of this limitation of SQLite.  I have not had occasion 
>> to use it.  I've recently looked a little into the manual.  Enough to 
>> see that it supports sub-queries and only does joins through nested 
>> loops.  That may cause performance issues with my suggested queries, 
>> especially the last one.
>
> Would the performance be any worse than "SELECT DISTINCT tx_guid FROM 
> splits..." followed by "SELECT * from SPLIT where tx_guid IN (...)" 
> which is a nested loop broken up into 2 pieces?
These are queries 1 and 2 from the register code.  In my other posting 
("GDA register open queries"), I propose merging them into one.  I do 
believe that would improve performance.  (Testing of said belief is 
required.)
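Schematically, the merge is (a sketch - the real queries carry more 
columns and conditions, and I am writing account_guid for whatever the 
register's actual filter is; see the other thread for the exact Query A):

-- queries 1 and 2, as the register code issues them today:
SELECT DISTINCT tx_guid FROM splits WHERE account_guid = '...';
SELECT * FROM splits WHERE tx_guid IN ( ...guids from the first query... );

-- merged into a single round trip with a sub-query:
SELECT * FROM splits
 WHERE tx_guid IN (SELECT DISTINCT tx_guid FROM splits
                    WHERE account_guid = '...');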

I can't do any real benchmarking right now - my test system is building 
the latest gnucash-gda.  However, roughly, Query A appears to be faster 
than query 1 followed by query 2.  I believe that Query A will be easier 
for you to maintain - e.g. no code to split up long query strings.  
Later, I'll do this more rigorously: without the unpredictable load of 
building gnucash-gda concurrently, without the .db file on an NFS mount, 
and also against MySQL and PostgreSQL.

Here is how I've benchmarked it:
$ time sqlite3 mydata.db <query1.sql >/dev/null
The first time I ran the above, I sent its output to a file, and edited 
the file to produce query2.sql.

Additionally, combining the two queries means one less IPC for MySQL.  
For the PostgreSQL provider it means three less, plus two for each 
column in each of the two tables.

Mark


