Creating Gnucash invoices with XML
linux at codehelp.co.uk
Tue Apr 6 04:54:55 EDT 2004
On Tuesday 06 April 2004 12:52, Derek Atkins wrote:
> Neil Williams <linux at codehelp.co.uk> writes:
> Hmm, not sure how you import an XML file from another application
> unless it's writing out gnucash-xml objects. Gnucash does NOT use
It would have to comply with the to-be-written Gnucash DTD - XML, unlike CSV, has no
single flat layout, so you can't expect XML files from another source to import
unless the data model, the DTD, matches and validates.
However, it's dead simple to write a Perl or PHP script to re-model the data
with user involvement.
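To show what I mean by re-modelling (a Python sketch rather than Perl/PHP - the element names are made up for illustration, not the real Gnucash layout):

```python
# Sketch: re-model a third-party invoice XML into a gnucash-style
# layout. The <bill>/<invoice> element names are illustrative only,
# not a real GnuCash schema.
import xml.etree.ElementTree as ET

def remodel_invoice(source_xml: str) -> ET.Element:
    """Map a simple <bill> document onto an invoice-like element."""
    src = ET.fromstring(source_xml)
    inv = ET.Element("invoice")
    ET.SubElement(inv, "id").text = src.findtext("number", default="")
    ET.SubElement(inv, "owner").text = src.findtext("customer", default="")
    for line in src.findall("item"):
        entry = ET.SubElement(inv, "entry")
        ET.SubElement(entry, "description").text = line.findtext("desc", default="")
        ET.SubElement(entry, "amount").text = line.findtext("price", default="0")
    return inv

sample = ("<bill><number>42</number><customer>ACME</customer>"
          "<item><desc>Widgets</desc><price>9.99</price></item></bill>")
invoice = remodel_invoice(sample)
```

The user involvement comes in deciding the field mapping; the script itself is trivial.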
> schema or dtd to describe the data; the XML parsers and generators are
> all hand-built. Annoying, but the state-of-the-world.
That's the first job: build the DTD and enforce it.
(abort loading of XML files that don't match the DTD structure - that will
need to be done by 'proper' C XML handling libraries, libxml etc.)
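The real validation belongs in libxml as said, but the reject-on-mismatch behaviour is roughly this (a stdlib Python stand-in, since Python's bundled parser can't validate against a DTD; the required elements are assumed, not Gnucash's real ones):

```python
# Stand-in for DTD validation: check that a document claiming to be an
# invoice carries the children a (hypothetical) invoice DTD would
# require, and reject it otherwise. Real enforcement would use
# libxml's DTD validation in C.
import xml.etree.ElementTree as ET

REQUIRED = ("id", "owner", "entry")  # assumed required children

def check_invoice(xml_text: str) -> bool:
    root = ET.fromstring(xml_text)
    if root.tag != "invoice":
        return False
    return all(root.find(tag) is not None for tag in REQUIRED)

ok = check_invoice("<invoice><id>1</id><owner>A</owner><entry/></invoice>")
# a "payment" document mislabelled as an invoice is rejected:
bad = check_invoice("<invoice><payment-amount>5</payment-amount></invoice>")
```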
> >> XML file would help in multiple ways. For example, imagine being able
> >> to re-run the Hierarchy Druid in order to add new sets-of-accounts to
> >> your data file! An XML importer would make this much easier.
> > If I'm correct in how Gnucash used XML for data storage, then you've
> > already done all that work. In order for Gnucash to save and reopen XML
> > storage files, XML definitions for every saved component must exist, even
> > if not explicitly. Not only that, but mechanisms already exist to convert
> > the incoming XML data (from the data file) into live Gnucash data (for
> > display and manipulation in the GUI).
> No, loading a data file is different than merging into an existing
> data file. More on this in a bit.
Yes I know, the collision problem. That's foremost in my mind but in terms of
the import engine, XML is less work than CSV because the bindings already
exist. It allows me to work on collisions / merges in real Gnucash structures
in RAM, rather than just in the XML handler.
To be honest.
> Just designing XML formats doesn't solve the merge problem.
True. There are two problems here, the XML format (small) and the
merge/collision problem (large).
> Close. You've broken down into multiple steps what is in actuality
> one big thing. Right now "open previous data file," "read XML,"
> "populate data structures in RAM" are ALL part of the same subsystem.
> They are NOT distinct sets.
> problem is that the XML subsystem does not have a "merge". There is
> no intermediate step of "load Datafile" that will merge into an
> existing open Datafile. That's the import step that needs to happen.
> Yes, we have the code that will read the data and load it into a bunch
> of objects in RAM. What we do NOT have is the GUI and logic to merge
> those object into an existing datafile-in-RAM.
OK. (I anticipated that it would be a new procedure.)
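The shape of that new procedure, as I understand it, is something like this (keys stand in for GUIDs; the dict structures are illustrative, not real Gnucash types):

```python
# Sketch of the missing "merge" step: fold newly loaded objects into
# an existing in-RAM book, collecting collisions for the user instead
# of overwriting silently.
def merge_books(existing: dict, loaded: dict) -> list:
    collisions = []
    for guid, obj in loaded.items():
        if guid in existing and existing[guid] != obj:
            collisions.append(guid)      # conflicting data: user decides
        elif guid not in existing:
            existing[guid] = obj         # plain new object: just add it
    return collisions

book = {"a1": {"name": "Assets"}}
incoming = {"a1": {"name": "Fixed Assets"}, "b2": {"name": "Income"}}
conflicts = merge_books(book, incoming)
```

The point being that loading and merging stay separate: the loader fills `incoming`, and the merge never touches existing data without flagging it.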
> > There must be some level of XML parsing already being performed within
> > Gnucash file operations. File->Open and File->Save etc.
> > This would simply be downgraded to import-export.
> Yes, but you're missing the necessary "merge" logic which currently
> does not exist. Yes, the actual I/O functions exist, that's not the
> hard part.
Not missing it, just concentrating on getting an accurate understanding of the
problem first.
> >> The downside is the challenge in mapping the GUIDs of an imported data
> >> to an existing data. How do you know if an account is the same? Or
> >> an invoice? or a customer? It's a huge can of worms to build an XML
> >> importer (which is why it hasn't been done, yet ;)
> > Not necessarily. In the help file that talked about XSLT, there were a
> > whole list of XSLT definitions for components. XML has the advantage over
> > CSV that these formats can be validated and are reliable. Therefore, an
> > XML file that claims to represent an invoice (from the choice of DTD) but
> > actually contains payment data can be rejected in a nice, informative,
> > operation.
> Uh, you don't understand. I'm not at all worried about formats here.
> I'm talking about data contents and merging.
XML formatting can assist in labelling data chunks so that it's easier to
handle collisions. A certain bit of data cannot occur in certain elements of
the XML, as dictated by the DTD - this precludes certain collision events and
limits the number of possible problems. It doesn't deal with all problems,
but it can help with problems that CSV would leave behind - right data in the
wrong place.
> There is more data than what's visible to the user, and you have to
> pay attention to that. Also, just because you merge does not mean you
> want to keep the GUID. Indeed, I would argue that you DON'T. The
> issues are murky. It's not as clear cut as you're making it out to
> be. Beware the devil in the details! It'll bite your ass if you're
> not careful.
Been there, done that. Not looking at the detail until I get the structure
right. Then I'll start on the detail.
> > Duplications would be handled in exactly the same way as now - by having
> > the unique ID stored / retrieved from the XML, missing ID -> new record.
> That's not sufficient. If I create account "foo" and you create
> account "foo", are those accounts the same or not? What if you
> transpose into a non-gnucash dataform and then back into gnucash's
> data form?
OK, I'll have to look at a complex merge rule set.
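A first stab at a rule for the "foo" question: treat two same-named accounts as identical only when their other attributes agree, and queue anything ambiguous for the user. (The field names are illustrative, not real Gnucash account fields.)

```python
# Sketch merge rule: same GUID means same object; same name with
# matching attributes is probably the same account; same name with
# different attributes must go to the user.
def classify(existing: dict, candidate: dict) -> str:
    if existing.get("guid") and existing.get("guid") == candidate.get("guid"):
        return "same"                     # identical GUID: trust it
    if existing["name"] != candidate["name"]:
        return "new"
    keys = ("type", "parent", "commodity")
    if all(existing.get(k) == candidate.get(k) for k in keys):
        return "same"                     # name and attributes agree
    return "ask-user"                     # same name, different data

a = {"name": "foo", "type": "BANK", "parent": "Assets", "commodity": "GBP"}
b = {"name": "foo", "type": "BANK", "parent": "Assets", "commodity": "GBP"}
c = {"name": "foo", "type": "EXPENSE", "parent": "Costs", "commodity": "GBP"}
```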
> > If you agree (and if my assumptions about Gnucash file operations above
> > are correct) I'd recommend dumping CSV as a data import mechanism and
> > using XML instead. No need for XSLT, by defining the formats, third-party
> > applications can write native Gnucash XML documents ready for import and
> > expect valid XML export documents in the same format. (native as in 'old
> > version native'.)
> Unfortunately there are places that still export data in CSV format --
> in particular transactional information, or even IIF! So we still
> need a CSV importer. Even if that "importer" is a program that
> converts CSV -> XML and uses an XML importer.
That could be a whole project in itself, CSV->XML!
At this point, I'm really not keen on CSV as a format for invoices and I think
I'll have my hands full with the XML data merge problem, so can I leave the
CSV import function to someone more able / with more time? Please?!
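For whoever does take it on: the front-end conversion itself is small - the hard part is the column mapping. A minimal sketch (column names assumed, not a real export layout):

```python
# Sketch of the CSV -> XML front-end Derek describes: turn each CSV
# row into an element the XML importer could then consume. The column
# names and <transaction> tag are assumptions for illustration.
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text: str, row_tag: str = "transaction") -> ET.Element:
    root = ET.Element("import")
    for row in csv.DictReader(io.StringIO(csv_text)):
        elem = ET.SubElement(root, row_tag)
        for field, value in row.items():
            ET.SubElement(elem, field).text = value
    return root

data = "date,description,amount\n2004-04-06,Stationery,12.50\n"
tree = csv_to_xml(data)
```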
> [ note: the following list assumed we were still talking about a CSV
> importer. Changing the context to an XML importer and keeping the
> list is both unfair and incorrect. While many of the issues involved
> are the same, the list for an XML importer _is_ different. -derek ]
> >> * field parsing (we already have a bunch of generic parsers)
> > already implemented in Open/Save and in need of customising to accept
> > only partial input.
> Yes, the xml parsers need to be modified to not require a full book.
> Not TOO difficult, I don't think.
As we both know, the problems come after the partial import.
> >> * transaction matching
> > I'll need help with that. The existing procedures are presumably not
> > anticipating a merge with existing data but are set to be read into an
> > otherwise empty memory allocation.
So that's the bulk of the task. OK.
> > Is it acceptable to have a very simple rule?
> Depends on the rule. Regardless, it requires user input.
So each collision event is raised with the user - correct?
> > Is there a unique ID specified?
> Where? In the XML? Yes. But is there any guarantee that a
> non-gnucash data source will provide the object GUID? I find that
> unlikely. That means you need to half-ignore the GUID and map based
> on other input.
No, they almost certainly won't.
I was thinking about my original idea about adding invoices to the data
file/source. That uses existing data as a reference but adds new data. I
guess the problem would be if the user tried to import the same file twice.
That brings in the problem above, as would a more extensive import of other
objects that may well require overwriting existing data.
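The import-twice case at least looks tractable: fingerprint each imported object and skip exact repeats, so a re-run adds nothing. (The hashing scheme is illustrative; real Gnucash objects would need a canonical serialisation first.)

```python
# Sketch of a guard against importing the same file twice: record a
# digest of each imported object's content and skip exact duplicates
# on a second run.
import hashlib

def import_objects(seen: set, objects: list) -> int:
    """Add unseen objects to the book; return how many were new."""
    added = 0
    for obj in objects:
        digest = hashlib.sha1(obj.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            added += 1
    return added

book = set()
first = import_objects(book, ["invoice-42", "invoice-43"])
second = import_objects(book, ["invoice-42", "invoice-43"])  # same file again
```

This only catches byte-identical repeats, of course - modified re-imports still hit the full collision problem above.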
> Yes. Too simple. You cannot guarantee that the GUID will always
> be provided in an import mechanism, nor can you guarantee that the
> imported GUID matches the data GUID.
> Not what I meant. You may need to perform a transaction match or
> duplicate check. This has nothing to do with XML input and everything
> to do with data coherency.
Could you tell me where I might find an example of a duplicate check and
transaction lookup in the existing CVS code? It'll help me to see how gnucash
is structured. (There's an awful lot of code to look through if I don't know
where to start). Also, which files in the CVS deal with the XML file I/O?