GDA: PostgreSQL empty slots table

Mon Feb 11 22:09:15 EST 2008

Phil Longstaff wrote:
>
> It seems to me as though I am going to need to take a step back and 
> think for a bit.  Any comments are welcome.  When I started this, I 
> chose the 3.x series for libgda because it was almost ready for 
> release.  I was not experienced enough in the linux release world to 
> realize that, of course, it would take awhile for any particular 
> project to be propagated out into all of the distros.  I am on 
> opensuse 10.2, and opensuse 10.3, released oct 4/07 (according to 
> distrowatch) still has the 1.3.x series of libgda.  So, I think the 
> gda backend will need to support both the 1.3.x series and the 3.x 
> series (1.3.x doesn't exist for windows).  There are some differences 
> in APIs and the GdaQuery object tree does not exist in 1.3.x, so we're 
> back to straight SQL.  So, I propose that the configure script will 
> need to detect 1.3.x vs 3.x, and the gda backend will need to handle 
> those 2 libgda revs and the 3 backends (sqlite/mysql/postgresql) using 
> SQL only (no GdaQuery).  I should be able to set things up with most 
> of the code in common and only a few special cases.  I think the 1.3.x 
> vs 3.x differences will be mostly api differences.
I'm afraid many of my comments take the form of questions.  Hopefully, 
they will be helpful in your thinking.

I just checked libgda's website.  Version 3.0.1 was released on May 10th 
of last year.  On the one hand, this seems like plenty of time to 
propagate.  On the other hand, it clearly hasn't propagated to 
OpenSuse.  It may be that it is not seen as necessary to include the 
newer versions until someone (several someones?) begin to use them.  
What about other distributions?

How much extra work would supporting both 1.3.x and 3.0.x entail?  Is it 
possible that this would delay gnucash-gda until 1.3.x really does fade 
away?

In looking at some of your source code, I see you use the GdaQuery to 
build the SQL statement by adding objects representing tables, 
conditions, etc.  Am I correct that this is the way to do it so that the 
provider can render the SQL?  i.e. is this the way that libgda 3.0.x 
gets around variations in SQL dialects?  If that is correct, how does 
libgda 1.3.x do it?  (There must be some benefit to doing it that way; 
otherwise it seems to me to be easier to render SQL directly using 
sprintf or some other string manipulation.)

 From your comments, I suppose that libgda 1.3.x depends upon the 
programmer to manage SQL dialect differences.  Doesn't this diminish the 
value of libgda?  By how much?  (I think not by much; for the most part 
I think the SQLs are sufficiently similar.)  Is it worth the trade-off 
of getting more distributions that support gnucash-gda by having this 
prerequisite built-in?

I see that version 4 will provide a GdaStatement object to replace the 
to-be-deprecated GdaQuery.  I assume that it will provide a similar 
capability.

Speaking of version 4 of libgda, what happens to gnucash-gda when that 
is released?  Obviously, it can go on using 3.0.x for some time, and it 
looks like the two will be able to co-exist (V4 has a libgda-4.0.pc and 
libgda-4.0.so files, etc.  I haven't thoroughly checked for files with 
the same name.).  Would support of version 4 be added as well as 3.0.x 
and 1.3.x?  (Not an immediate concern, of course.)

Users of distributions which include only libgda 1.3.x do have a backend 
they can use - the current XML one.  Therefore, they are not denied the 
use of gnucash.  Only their choice of backend is limited.

So supporting 1.3.x increases your workload in at least three ways:
1. API differences,
2. possible SQL dialect differences, AND
3. redoing the work you've already done with GdaQuery.

Now, I haven't looked at every source file in the gda backend, but I am 
guessing that that last one would be a lot of work.  I would not be 
inclined to suggest you support 1.3.x.
>
>>>>>
>>>>> It looks as though the postgresql backend has a bug.  Using the 
>>>>> object model, I create an XML-based representation of the table.  
>>>>> I create the 'slot_id' column with type integer and add the 
>>>>> 'AUTOINC' flag to it.  For mysql and sqlite, this is an INTEGER 
>>>>> AUTO_INCREMENT column, while for postgresql, it should be a SERIAL 
>>>>> column.  
>>>> Agreed.  I have tracked this far enough to file a reasonable bug 
>>>> report against the PostgreSQL backend.  I just ran out of time to 
>>>> do it.  I'll get to it in the next couple of days.
>>>
>>> Unfortunately, any fix for this will take time to propagate to the 
>>> distributions and make it out to people who want to use the GDA 
>>> backend with postgresql.  We might need a workaround.
>> Yes, I think an "ALTER TABLE....." might work as a workaround.  It 
>> would not be a good thing to require the user to do, however.  Then 
>> one gets into code like "for this version of this
>>
>> Another possibility would be to do a "Select MAX(slot_id) FROM 
>> slots;" followed by an INSERT with a specified slot_id.  Two 
>> problems.  I don't like imposing this overhead on non-PostgreSQL 
>> users.  And the PostgreSQL provider has especially poor performance 
>> with its GdaQuery, so that imposing an extra round trip on it has a 
>> higher penalty than it should.
>
> In addition, it doesn't handle multiple users accessing the same db 
> unless you lock around the SELECT/INSERT so no other user can get a 
> SELECT MAX(slot_id) between them.
Good point.  However, this may be deferrable until such time as one 
decides to support multiple concurrent access to the db. 

That raises the question of what to do about multiple (accidental) 
concurrent access.  The current backend uses a lock file.  Perhaps 
something similar is needed.
>
> How do we tell whether any particular libgda in the field has the 
> fix?  By rev #?  We'd still need the workaround for libgda/postgres 
> combinations without the fix.
I had assumed that the configure script would require a libgda version 
which incorporates the fix.  I don't think this is an unreasonable thing 
to do, but it does limit the audience until the fixed version has time 
to propagate.

This is probably best: for an unfixed version, you could have the gda 
backend issue the alter table statement as a PostgreSQL special case.
>>
>> Added to the SERIAL problem above, I seem to have found just as many 
>> problems with the PostgreSQL provider as with the software I am 
>> nominally testing - namely gnucash-gda.
>
> I know.  I'm kind of concerned about the higher functions in libgda.  
> We might do best just using SQL and letting libgda provide the low 
> level differences in the database libraries.
Then one may encounter differences in the SQL dialects.  For example, 
off the top of my head, the LIKE operator is case-insensitive in MySql 
but case-sensitive in PostgreSQL.  PostgreSQL provides an ILIKE operator 
for case-insensitive conditions.  (May be used if one needs to search 
for a transaction by one of its text fields.)
>> These are queries 1 and 2 from the register code.  In my other 
>> posting ("GDA register open queries"), I propose merging them into 
>> one.  I do believe that would improve performance.  (Testing of said 
>> belief is required.)
>>
>> I can't do any real benchmarking right now - my  test system is 
>> building the latest gnucash-gda.  However, roughly, Query A appears 
>> to be faster than query 1 followed by query 2 .  I believe that Query 
>> A will be easier for you to maintain - eg. no code to split up long 
>> query strings.  Later, I'll do this more rigorously without the 
>> unpredictable load of building gnucash-gda concurrently, and not run 
>> the .db file on an NFS mount.  And also for MySql, and PostgreSQL.
>>
>> Here is how I've benchmarked it:
>> $ time sqlite3 mydata.db <query1.sql >/dev/null
>> The first time I ran the above, I sent its output to a file, and 
>> edited the file to produce query2.sql.
>>
>> Additionally, combining the two queries is one less IPC for MySQL 
>> (three less (plus two for each column in each of the two tables) for 
>> PostgreSQL provider).
>
> I'd be interested in any benchmarks you have.  Re SQL statement 
> length: the mysql document says there is a 16MB limit and I didn't see 
> any limit in the sqlite documentation.  Is there a way of doing 
> something like:
>    SELECT DISTINCT tx_guid FROM splits ....   /* and save the results */
>    SELECT * FROM transansactions WHERE guid IN 
> results_from_above_query /* and return the results */
>    SELECT * FROM splits WHERE tx_guid IN results_from_above_query /* 
> and return the results */
>
> i.e. have the engine cache the results of the 1st query and reuse it 
> in subsequent queries without needing to recompute?
I'm not aware of any ability to do this.  I have been wondering about 
the performance of sending such long SQL strings to the db.  I haven't 
run them yet, but did prepare files containing the queries for register 
opening.  Here is the directory listing (for my checking account in my 
test file):
-rw-r--r--  1 mj users     110 Feb  9 22:49 query1.sql
-rw-r--r--  1 mj users   24854 Feb  9 23:09 query2.sql
-rw-r--r--  1 mj users   24852 Feb  9 23:13 query3.sql
-rw-r--r--  1 mj users   24852 Feb  9 23:15 query4.sql
-rw-r--r--  1 mj users   82056 Feb  9 23:18 query5.sql
-rw-r--r--  1 mj users     150 Feb  9 22:48 queryA.sql
-rw-r--r--  1 mj users     150 Feb  9 23:04 queryB.sql
-rw-r--r--  1 mj users     151 Feb  9 23:05 queryC1.sql
-rw-r--r--  1 mj users     137 Feb  9 23:06 queryC2.sql
-rw-r--r--  1 mj users     183 Feb  9 23:06 queryD.sql

Notice how much longer the queries are which contain all the guids, 
especially query 5.  I expect that such long query strings carry a 
performance penalty.
>
> Phil
>
Mark