QSF XML file backend for gnucash-gnome2-dev branch

Neil Williams linux at codehelp.co.uk
Wed Jan 26 03:58:44 EST 2005


On Wednesday 26 January 2005 1:58 am, Josh Sled wrote:
> I'd prefer to keep the discussion on -devel if you don't mind...?

Oops, it was getting late. When the list is in CC: rather than To:, the 
email ends up in the wrong folder, and when I'm tired I miss that.

> So, if I define an object which has a list of strings, and the
> application uses those strings to index some other dataset, at runtime,
> is it not possible to serialize only one half of the data?

Currently, yes.

> All I wanted to say was that the caller is responsible for not doing
> that.
> QSF can't read minds. 

OK, actually I could add a simple counter for the number of parameters, even 
the number of each type. The param foreach routines in QOF could do that 
really quickly - I'm already counting all the objects and checking their 
registration. I'll see how the map code finalises; it may be useful.
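
Something like this would do it - an untested sketch, assuming the 
qof_class_param_foreach() callback signature currently in qofclass.h:

#include <glib.h>
#include "qofclass.h"

/* Sketch only: count the parameters registered for one object type,
 * assuming qof_class_param_foreach() as declared in qofclass.h. */
static void
count_param_cb (QofParam *param, gpointer user_data)
{
	guint *count = (guint*) user_data;
	(*count)++;
	/* param->param_type could be inspected here to keep a
	 * per-type tally as well, e.g. QOF_TYPE_STRING. */
}

static guint
count_params_for_type (QofIdTypeConst obj_type)
{
	guint count = 0;
	qof_class_param_foreach (obj_type, count_param_cb, &count);
	return count;
}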

> > Q. Where should I put user-edited files? In with the reports?
>
> This should definitely be brought up on-list.

My mistake - tired - it has now.

> It sounds like there is application [subsystem] data -- the
> user-editable portions of these maps -- that needs to be managed by
> GnuCash on behalf of the QSF library/module, I s'pose.  At the same
> time, it's not really the responsibility of GnuCash, so...

All applications using QSF would have their own user-editable maps to convert 
data for other applications. The maps are the real inter-operability stuff. 
Application maps come with the installation; user-edited maps can go with the 
user-edited objects - it's just a case of coding how the library expects the 
application to 'go fetch'.
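
Purely as a hypothetical illustration of 'go fetch' - none of these names 
exist in QOF or GnuCash, it's just one way the handover could be coded - the 
application could register a callback that hands the backend the directories 
to search for user-edited maps:

#include <glib.h>

/* Hypothetical only: a callback the application could register so the
 * QSF backend knows where to find user-edited map files.  Neither the
 * typedef nor the paths below exist in QOF; they only illustrate the
 * 'go fetch' idea. */
typedef GList* (*QsfMapDirsCB) (gpointer app_data);

static GList*
gnc_qsf_map_dirs (gpointer app_data)
{
	GList *dirs = NULL;
	/* installed application maps first, then the user's own edits */
	dirs = g_list_append (dirs, g_strdup ("/usr/share/gnucash/qsf-maps"));
	dirs = g_list_append (dirs,
		g_build_filename (g_get_home_dir(), ".gnucash", "qsf-maps", NULL));
	return dirs;
}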

> Semantic integration is painful... really painful.  I've never really
> been sure why you're trying to undertake it in some general case.

Nah, it's not that bad.

I do it because I want to be able to interact with my data in my own way, and 
I want a permanent end to the griping on this list and the user list about how 
to get the reports to do precisely what others expect. I want to free the data 
from application handling and query it in new ways that applications just 
don't support.

e.g. I cannot currently identify the total business mileage for a given 
period of time. That's basic stuff! I can't sum the expenses, select only the 
Mileage expense types and output the total number of miles covered for 
customer X in any one period of time. When I do my tax return, even though 
I've painstakingly entered every single mileage claim under the appropriate 
customer in the expenses database on the Palm, I have to use a separate 
method to calculate how many miles I've covered for business purposes during 
the tax year. That's a horrendous situation. I can't even replicate that 
calculation from GnuCash - even though every invoice includes the number of 
miles covered as an invoice entry amount.

QOF will do that - by putting the data in XML as a single lump of objects, QOF 
can query the QofBook read from that XML and do all kinds of SQL-like things. 
Importantly, it will do it so that ANY query becomes possible with NO extra 
programming. If we write a Scheme report to do the calculation above, few 
people may use it; if someone else wants a slightly different report, we have 
to write a whole new set of Scheme! With QSF, you can say:
1. Export the data you need into QSF.
2. Run a SQL query on the XML (using DWI when it's ready).
3. - oh, it's done already?

I can finally use QOF for QUERIES rather than iteration. Do you know, I still 
haven't had to use any of the QUERY code of QOF? It might as well have been 
FOF - Foreach Object Framework - because all that happens in my code so far 
is foreach this and foreach that.
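
For illustration, once the QofBook has been read from QSF, a SQL-style query 
could look something like this sketch - assuming the QofSqlQuery calls from 
qofsql.h, with the object name and the WHERE clause invented for the example:

#include <glib.h>
#include "qofbook.h"
#include "qofsql.h"

/* Sketch only: run a SQL-style query over a QofBook read from QSF.
 * Assumes the QofSqlQuery API from qofsql.h; "gncEntry" and the
 * description value are illustrative, not definitive. */
static void
query_mileage (QofBook *book)
{
	QofSqlQuery *query = qof_sql_query_new ();
	GList *results, *node;

	qof_sql_query_set_book (query, book);
	results = qof_sql_query_run (query,
		"SELECT * FROM gncEntry WHERE description = 'Mileage';");
	for (node = results; node; node = node->next)
	{
		QofEntity *ent = node->data;
		/* sum the relevant quantity parameter from ent here ... */
	}
	qof_sql_query_destroy (query);
}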

> > > Hmm.  I've written many XML parsers, and none _required_ a schema
> > > definition to parse data....
> >
> > OK, but parsing always requires that the incoming format is detectable
> > and predictable. When these are user-edited object files, I would have to
> > reinvent the majority of the schema to be sure that QSF could work with
> > the files.
>
> Hmm.  I guess the program knows that it should [implicitly] be using
> some schema [baked into the code] because:
>
> * the user is explicitly telling it that the file is of some type,
>   e.g., because they go to `File > Import > QOF/QSF object`.
>
> * [and/or; in the case of XML:] the [fully-qualified] containing root
>   element of the XML file is what you expect to see.  If the
>   fully-qualified root element for the document
>   `urn:qof-qsf-mapqsf-map` [sic], then you can expect that it is a map,
>   and conforms to the given schema.  If it's
>   `urn:qof-qsf-containerqof-qsf`, then it's a container, and conforms
>   to that schema.
>
> Nothing in the schema should be required to correctly parse the file [at
> runtime], assuming that the file conforms to the schema.

Yes, the code checks the schema when determining the file type and when 
preparing to load the file, then leaves it alone until a file is ready to 
write out - at which point it checks the outgoing file against the schema too. 
That just catches bad use of the API where the object parameter types don't 
really match the content. It saves people complaining that their QSF can't be 
imported - if an invalid file can't be exported in the first place (even 
through user or developer error), that's an extra safeguard for not much 
effort. The principal reason for the schema is on import - using it on export 
is 'because I can'.
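
For reference, the runtime check amounts to little more than this, using the 
libxml2 xmlschemas calls (error handling trimmed; the schema path is just an 
example):

#include <glib.h>
#include <libxml/parser.h>
#include <libxml/xmlschemas.h>

/* Validate an already-parsed doc against one of the QSF schema files.
 * Sketch only: error handling is trimmed and the caller supplies the
 * path to the installed .xsd file. */
static gboolean
qsf_validate_doc (const char *schema_path, xmlDocPtr doc)
{
	xmlSchemaParserCtxtPtr parser_ctxt;
	xmlSchemaPtr schema;
	xmlSchemaValidCtxtPtr valid_ctxt;
	int result;

	parser_ctxt = xmlSchemaNewParserCtxt (schema_path);
	schema = xmlSchemaParse (parser_ctxt);
	valid_ctxt = xmlSchemaNewValidCtxt (schema);
	result = xmlSchemaValidateDoc (valid_ctxt, doc);
	xmlSchemaFreeValidCtxt (valid_ctxt);
	xmlSchemaFree (schema);
	xmlSchemaFreeParserCtxt (parser_ctxt);
	return (result == 0);
}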

> > No, the two are separate. I validate the file against the schema to
> > detect the filetype. Then I validate the content to see if a map is
> > required. Then I validate the map and cross reference the map with both
> > the incoming QSF objects and the existing QOF objects in the application
> > to determine the right map to use.
>
> Please explain more.  I see why you need the maps and such, but I don't
> see how schematic validation is required to determine which map to
> use...

Sorry, I was mixing terms:

No, the two are separate. I validate the file against the schema to
detect the filetype. Then I check the content to see if a map is
required. Then I validate the map against the map schema and cross reference 
the map with both the incoming QSF objects and the existing QOF objects in 
the application to determine the right map to use.

The cross-referencing is done after the schema, in C.
The checking of the content is an additional check, in C, that adds to the 
validation using the schema and (will) perform these steps:
1. Identify each object and call qof_class_registered().
2. If any objects are not registered in the calling application, the QSF has 
come from a different application and requires a map to convert the objects 
that are not already understood - a sketch of this check follows the list. 
This is where clashes in QOF object names will have to be watched. 
Personally, I'd prefer it if the GnuCash object names all had "gnc" prefixed.
3. Call a map - this code needs to know where to look for user-edited maps - 
and check through to find one that, when overlaid on the incoming QSF, leaves 
no objects unregistered, i.e. a map that converts all unknown objects into 
known objects.
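
A rough sketch of the check in step 2 - assuming qof_class_is_registered() 
from qofclass.h, and a hypothetical GList of object names collected while 
parsing the incoming QSF:

#include <glib.h>
#include "qofclass.h"

/* Sketch: decide whether an incoming QSF needs a map.  obj_names is a
 * hypothetical GList of QofIdType strings collected from the file;
 * qof_class_is_registered() is assumed from qofclass.h. */
static gboolean
qsf_needs_map (GList *obj_names)
{
	GList *node;

	for (node = obj_names; node; node = node->next)
	{
		if (!qof_class_is_registered ((QofIdTypeConst) node->data))
			return TRUE;	/* unknown object - a map must convert it */
	}
	return FALSE;		/* all objects already understood */
}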

> > It's determining the filetype and validating the content that would
> > require duplication of the schema *code* so that I know which tags will
> > occur where.
>
> The parser should implicitly have that knowledge.

Once the schema has been used to identify objects vs maps, yes.

> I thought that that 
> was the whole point of doing QSF at the level that it's at -- since it
> doesn't contain any of the application-object semantics, you don't
> _need_ to understand them in order to _parse_ the contents ... just to
> do the merge.
>
> I'm not understanding something.

You missed *code* in the quote: it's the C code that implements schema 
validation in libxml2 that would need to be reproduced in QOF. I really don't 
fancy that job.

> Why would the schema need to be 
> duplicated,

The schema itself, the xsd, doesn't. The code that handles the schema would - 
if we dropped the requirement for libxml2 >= 2.5.2.

> or in-fact be any different for any other object-type?  The 
> schema for both Map and QSF serializations should be static...

They will be static.

If the schema validation wasn't used, the schema *code* would need to be 
reproduced in the QSF code. I.e. without knowing that the XML validates 
against the schema, the parser cannot know whether it is a map or an object 
file unless at least some of the code that performs the schema validation is 
re-invented to check the tag sequence and identify the file type.
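
To illustrate the sort of hand-rolled check that would be needed just to tell 
a map from an object file, here's what reading the fully-qualified root 
element with plain libxml2 calls looks like - the URNs and element names are 
taken from your message and may not match the final schema:

#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Sketch: distinguish a QSF object file from a QSF map by the
 * fully-qualified root element.  The URNs and element names here are
 * illustrative, not definitive. */
typedef enum { QSF_UNKNOWN, QSF_OBJECT_FILE, QSF_MAP_FILE } QsfFileType;

static QsfFileType
qsf_detect_type (xmlDocPtr doc)
{
	xmlNodePtr root = xmlDocGetRootElement (doc);

	if (!root || !root->ns || !root->ns->href)
		return QSF_UNKNOWN;
	if (!strcmp ((char*) root->ns->href, "urn:qof-qsf-container") &&
	    !strcmp ((char*) root->name, "qof-qsf"))
		return QSF_OBJECT_FILE;
	if (!strcmp ((char*) root->ns->href, "urn:qof-qsf-map") &&
	    !strcmp ((char*) root->name, "qsf-map"))
		return QSF_MAP_FILE;
	return QSF_UNKNOWN;
}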

> > The code must know which tags to find where - this is absolutely
> > essential for the maps. The map is a complex sequential mask that
> > overlays one set of objects to create another set. It is imperative that
> > the map and the objects are 100% predictable. Data loss and data
> > corruption would be inevitable otherwise.
>
> Right: garbage in, garbage out.  I don't think that implies we need to
> perform runtime schematic validation, so much as the code needs to not
> be buggy.

Without runtime schema validation, I would have to implement a method of 
reliably distinguishing between QSF objects and QSF maps, check that the 
content of every parameter tag matches the expected definitions, check that 
each object and each parameter tag has the required attributes, and so on. 
E.g. I'd have to individually validate every incoming date string against the 
xsd format, and every boolean would have to be checked and converted to TRUE 
whether it arrives as true, 1 or T.

It's not about the code being buggy - if a required attribute like type="" is 
missing, how can that be handled? The schema bombs out immediately if a 
required attribute is missing. I'd have to reinvent all that, or simply ignore 
invalid data and end up with data loss.
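
As a flavour of what that hand-rolled checking means, here's a sketch of just 
the boolean and required-attribute cases, again with plain libxml2 calls (the 
attribute name and type string are illustrative):

#include <string.h>
#include <glib.h>
#include <libxml/tree.h>

/* Sketch of the validation the schema otherwise gives for free: reject
 * a parameter node whose required type="" attribute is missing, and
 * normalise an xsd:boolean, which may legally arrive as "true"/"false"
 * or "1"/"0" - a bare "T" is invalid and has to be rejected as well. */
static gboolean
check_boolean_param (xmlNodePtr node, gboolean *value)
{
	xmlChar *type = xmlGetProp (node, (const xmlChar*) "type");
	xmlChar *content;
	gboolean ok = FALSE;

	if (!type)
		return FALSE;		/* required attribute missing */
	content = xmlNodeGetContent (node);
	if (content && !strcmp ((char*) type, "boolean"))
	{
		if (!strcmp ((char*) content, "true") ||
		    !strcmp ((char*) content, "1"))
			{ *value = TRUE;  ok = TRUE; }
		else if (!strcmp ((char*) content, "false") ||
			 !strcmp ((char*) content, "0"))
			{ *value = FALSE; ok = TRUE; }
	}
	if (content)
		xmlFree (content);
	xmlFree (type);
	return ok;
}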

I can't know that the file IS valid unless I validate it at runtime. If I 
can't know 100% that the incoming file IS valid, I have to reimplement in 
code the majority of the checks that the schema would have done. These QSF 
objects are user-edited - all manner of garbage could be in them.

The map calculations get even worse: a conditional must follow a certain 
syntax, enforced in the schema; certain tags must be used; it just goes on.

I really can't do any of this without runtime schema validation. This is quite 
enough work as it is; re-inventing something that has ALREADY been released 
is more than a little pointless.

> > > I.e., we don't need to compare the output XML to
> > > some XML Schema definition [let alone a more reasonable
> > > schema-definition language].
> >
> > It's hard enough as it is, without doing the validation by hand.
>
> By hand?  FTR, I was talking about using RelaxNG [compact] instead of
> XML Schema.  I've found the latter to be painful, and the former to be
> pain-less, simple, wonderful, &c.

Personally, I found the opposite. I'm used to the schema; it naturally fits 
the way the data needs to be handled and verified.

-- 

Neil Williams
=============
http://www.dcglug.org.uk/
http://www.nosoftwarepatents.com/
http://sourceforge.net/projects/isbnsearch/
http://www.neil.williamsleesmill.me.uk/
http://www.biglumber.com/x/web?qs=0x8801094A28BCB3E3
