GDA: PostgreSQL empty slots table

Sun Feb 10 14:02:30 EST 2008

Mark Johnson wrote:
> Phil Longstaff wrote:
>>
>> I just saw an e-mail on the libgda mailing list which said that 
>> GdaQuery might be deprecated in the 4.0 series.
> I just saw that too.  It is disappointing.  You have put in a lot of 
> work using the GdaQuery.
>
> I had compiled the V4 branch with the idea of testing it with 
> gnucash-gda.  Not necessarily a good idea as it would be testing both 
> gda and gnucash-gda simultaneously.  Thus, so far, I have not 
> installed it.  As you will see below, I am having enough trouble with 
> the current (3.1.2) PostgreSQL provider.

It seems to me as though I am going to need to take a step back and 
think for a bit.  Any comments are welcome.  When I started this, I 
chose the 3.x series for libgda because it was almost ready for 
release.  I was not experienced enough in the linux release world to 
realize that, of course, it would take awhile for any particular project 
to be propagated out into all of the distros.  I am on opensuse 10.2, 
and opensuse 10.3, released oct 4/07 (according to distrowatch) still 
has the 1.3.x series of libgda.  So, I think the gda backend will need 
to support both the 1.3.x series and the 3.x series (1.3.x doesn't exist 
for windows).  There are some differences in APIs and the GdaQuery 
object tree does not exist in 1.3.x, so we're back to straight SQL.  So, 
I propose that the configure script will need to detect 1.3.x vs 3.x, 
and the gda backend will need to handle those 2 libgda revs and the 3 
backends (sqlite/mysql/postgresql) using SQL only (no GdaQuery).  I 
should be able to set things up with most of the code in common and only 
a few special cases.  I think the 1.3.x vs 3.x differences will be 
mostly api differences.

>>>>
>>>> It looks as though the postgresql backend has a bug.  Using the 
>>>> object model, I create an XML-based representation of the table.  I 
>>>> create the 'slot_id' column with type integer and add the 'AUTOINC' 
>>>> flag to it.  For mysql and sqlite, this is an INTEGER 
>>>> AUTO_INCREMENT column, while for postgresql, it should be a SERIAL 
>>>> column.  
>>> Agreed.  I have tracked this far enough to file a reasonable bug 
>>> report against the PostgreSQL backend.  I just ran out of time to do 
>>> it.  I'll get to it in the next couple of days.
>>
>> Unfortunately, any fix for this will take time to propagate to the 
>> distributions and make it out to people who want to use the GDA 
>> backend with postgresql.  We might need a workaround.
> Yes, I think an "ALTER TABLE....." might work as a workaround.  It 
> would not be a good thing to require the user to do, however.  Then 
> one gets into code like "for this version of this
>
> Another possibility would be to do a "Select MAX(slot_id) FROM slots;" 
> followed by an INSERT with a specified slot_id.  Two problems.  I 
> don't like imposing this overhead on non-PostgreSQL users.  And the 
> PostgreSQL provider has especially poor performance with its GdaQuery, 
> so that imposing an extra round trip on it has a higher penalty than 
> it should.

In addition, it doesn't handle multiple users accessing the same db 
unless you lock around the SELECT/INSERT so no other user can get a 
SELECT MAX(slot_id) between them.
>
> That's why they're called "workarounds" - they aren't good enough to 
> be called "solutions".
>
> On the other hand, there is time required to finish this gnucash 
> backend.  Though, I doubt if that would be enough time.
>
> I have now filed a bug report about this SERIAL problem:
> http://bugzilla.gnome.org/show_bug.cgi?id=515528

How do we tell whether any particular libgda in the field has the fix?  
By rev #?  We'd still need the workaround for libgda/postgres 
combinations without the fix.

>>
>>>
>>>> I don't really want to special-case the code for various backends 
>>>> since libgda is supposed to remove that need.  There are other 
>>>> cases where it might be required.  
>>> Agreed.  Not having special cases for different backends is a major 
>>> reason to use libgda.  And it has enough backends that supporting 
>>> special cases would be an undesirable amount of work.
>>
>> Well, we could limit support to sqlite, mysql and postgresql.  They 
>> are the 3 backends for which the provider status is "fully functional".
> I hadn't noticed that.  I had planned to limit my testing to those 
> three backends though.  It would simply be too much work for me to 
> test a dozen libgda backends.
>
> I wonder what not "fully functional" implies.  For example, when you 
> execute a GdaQuery, it checks to see if the provider object implements 
> a function for this purpose.  If so, it calls the function on the 
> provider.  If not libgda uses fallback code to implement a GdaCommand 
> object and calls that.  PostgreSQL's provider implements the function 
> (poorly from a performance point of view), and MySql's does not.  Both 
> work.
>
> The reason I say PostgreSQL's provider does this poorly from a 
> performance point of view is that it converts all such queries into 
> prepared statements.  This entails three round-trips to the db - one 
> to prepare the statement, one to execute it, and another to deallocate 
> it!  This is done for every query.  There is a "FIXME" in the code 
> here, but it looks like it will only be useful when the client uses 
> GdaQuery with parameters, which gnucash-gda does not.  I foresee a 
> risk that the fix to this FIXME may cause an extremely large buildup 
> of cached prepared statements, which, in turn, could cause its own 
> performance degradation.....
>
> Additionally, for every query, some other part of the PostgreSQL 
> provider executes two queries per table column - checking for unique 
> and primary keys.  I am told this comes from 
> gda_data_model_describe_column(), but I haven't studied the code to 
> see why that is called for each gnucash-gda query.  (Eg. would this 
> happen for a SELECT MAX(slot_id).. scalar query?)  I have filed a 
> performance bug on this one:
> http://bugzilla.gnome.org/show_bug.cgi?id=513149
> Investigating why none of these queries ever returned a single row led 
> me to:
> http://bugzilla.gnome.org/show_bug.cgi?id=513543
>
> Added to the SERIAL problem above, I seem to have found just as many 
> problems with the PostgreSQL provider as with the software I am 
> nominally testing - namely gnucash-gda.

I know.  I'm kind of concerned about the higher functions in libgda.  We 
might do best just using SQL and letting libgda provide the low level 
differences in the database libraries.
>>>
>>>> SQLite, for example, does not allow more than one row to be added 
>>>> in an INSERT statement, whereas mysql and postgresql do.  I could 
>>>> speed up initial save to a blank db by writing all accounts at 
>>>> once, for example, but sqlite couldn't handle it and would need 
>>>> them to be written one at a time.  Note that the libgda object 
>>>> model does not allow more than 1 row to be written either, so that 
>>>> would need to be done with SQL directly.
>>> I wasn't aware of this limitation of SQLite.  I have not had 
>>> occasion to use it.  I've recently looked a little into the manual.  
>>> Enough to see that it supports sub-queries and only does joins 
>>> through nested loops.  That may cause performance issues with my 
>>> suggested queries, especially the last one.
>>
>> Would the performance be any worse than "SELECT DISTINCT tx_guid FROM 
>> splits..." followed by "SELECT * from SPLIT where tx_guid IN (...)" 
>> which is a nested loop broken up into 2 pieces?
> These are queries 1 and 2 from the register code.  In my other posting 
> ("GDA register open queries"), I propose merging them into one.  I do 
> believe that would improve performance.  (Testing of said belief is 
> required.)
>
> I can't do any real benchmarking right now - my  test system is 
> building the latest gnucash-gda.  However, roughly, Query A appears to 
> be faster than query 1 followed by query 2 .  I believe that Query A 
> will be easier for you to maintain - eg. no code to split up long 
> query strings.  Later, I'll do this more rigorously without the 
> unpredictable load of building gnucash-gda concurrently, and not run 
> the .db file on an NFS mount.  And also for MySql, and PostgreSQL.
>
> Here is how I've benchmarked it:
> $ time sqlite3 mydata.db <query1.sql >/dev/null
> The first time I ran the above, I sent its output to a file, and 
> edited the file to produce query2.sql.
>
> Additionally, combining the two queries is one less IPC for MySQL 
> (three less (plus two for each column in each of the two tables) for 
> PostgreSQL provider).

I'd be interested in any benchmarks you have.  Re SQL statement length: 
the mysql document says there is a 16MB limit and I didn't see any limit 
in the sqlite documentation.  Is there a way of doing something like:
    SELECT DISTINCT tx_guid FROM splits ....   /* and save the results */
    SELECT * FROM transansactions WHERE guid IN results_from_above_query 
/* and return the results */
    SELECT * FROM splits WHERE tx_guid IN results_from_above_query /* 
and return the results */

i.e. have the engine cache the results of the 1st query and reuse it in 
subsequent queries without needing to recompute?

Phil