Creating Gnucash invoices with XML

Tue Apr 6 08:26:15 EDT 2004

Neil Williams <linux at codehelp.co.uk> writes:

> That's the first job, build the DTD and enforce it.
> (abort loading of XML files that don't match the DTD structure - that will 
> need to be done by 'proper' C XML handling libraries, libxml etc.)

So you're suggesting a re-write of all the existing XML i/o to use
verified schema instead of the existing hand-created generators and
parsers?  Not that I'm against this idea, but doesn't that go against
the grain of re-using the existing code?

I suppose that's fine provided your plan is:

  1) reverse-engineer the existing XML objects into Schemas
  2) re-write the xml code to use schemas
  3) drop rewrite in place of existing code

I just suspect this is a lot of work for (IMHO) little gain.
The existing parsers do some level of validation, just not cleanly
per a Schema.
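
To make "abort loading of files that don't match the structure"
concrete, here is a minimal sketch in Python.  It is NOT DTD or Schema
validation and it is not GnuCash code; the element names and the
ALLOWED_CHILDREN table are invented for illustration.  It only shows the
shape of a loader that checks structure first and reports every problem
it finds.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure table: allowed child tags for each element.
# A real DTD or Schema would express this (and much more) declaratively.
ALLOWED_CHILDREN = {
    "invoice": {"owner", "entry"},
    "owner": set(),
    "entry": {"description", "amount"},
    "description": set(),
    "amount": set(),
}


def check_structure(xml_text, root_tag="invoice"):
    """Return a list of structural problems; an empty list means it passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    if root.tag != root_tag:
        return [f"unexpected root <{root.tag}>"]

    problems = []
    stack = [root]
    while stack:
        node = stack.pop()
        allowed = ALLOWED_CHILDREN.get(node.tag, set())
        for child in node:
            if child.tag not in allowed:
                problems.append(
                    f"<{child.tag}> not allowed inside <{node.tag}>"
                )
            else:
                # Only descend into children that are valid here.
                stack.append(child)
    return problems
```

A caller would refuse to load the file whenever the returned list is
non-empty, and could show the whole list to the user in one go.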

>> problem is that the XML subsystem does not have a "merge".  There is
>> no intermediate step of "load Datafile" that will merge into an
>> existing open Datafile.  That's the import step that needs to happen.
>> Yes, we have the code that will read the data and load it into a bunch
>> of objects in RAM.  What we do NOT have is the GUI and logic to merge
>> those objects into an existing datafile-in-RAM.
>
> OK. (I anticipated that it would be a new procedure.)

I suspect this is the majority of the work.  Lots of gotchas.

>> > There must be some level of XML parsing already being performed within
>> > Gnucash file operations. File->Open and File->Save etc.
>> > This would simply be downgraded to import-export.
>>
>> Yes, but you're missing the necessary "merge" logic which currently
>> does not exist.  Yes, the actual I/O functions exist, that's not the
>> hard part.
>
> Not missing it, just concentrating on getting an accurate understanding of the 
> problem.

Ok.  The xml i/o is _mostly_ reusable.  I think it would be a small
amount of work to get it to be reusable.  The hard part is definitely
the merging.

> XML formatting can assist in the labelling of data chunks so that it's easier 
> to handle collisions. A certain bit of data cannot occur in certain elements 
> of the XML, as dictated by the DTD - this precludes certain collision events 
> and limits the number of possible problems. It doesn't deal with all problems, 
> but it can help with problems that CSV would leave behind - right data in the 
> wrong column.

I don't think the data-labeling has anything to do with
collision-detection.  See the QIF and OFX importers as examples.
Their data is labeled just fine; the majority of work is NOT in
reading the data file, but in merging the data into gnucash.
Duplicate detection, account determination, etc -- that's the tough
part.

> Been there, done that. Not looking at the detail until I get the structure 
> right. Then I'll start on the detail.
> :-)

But we already have the XML structures defined (albeit not in a
Schema).  Been there, done that -- reuse what we've got unless you
really want to re-write all the i/o.  But if you have limited time I
see little reason to do that.

> At this point, I'm really not keen on CSV as a format for invoices and I think 
> I'll have my hands full with the XML data merge problem, so can I leave the 
> CSV import function to someone more able / with more time? Please?!

Fair enough.

Honestly, if you want to implement a "gnucash XML import" I'm 100% behind
you, and I think it's a great idea.  I'm not trying to derail that concept.
I think it's a GREAT idea.

>> Yes, the xml parsers need to be modified to not require a full book.
>> Not TOO difficult, I don't think.
>
> As we both know, the problems come after the partial import.

Agreed.
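
A rough sketch of a parser that does not require a full book, written
in Python rather than against the real GnuCash parsers.  The <invoice>
element and its children are hypothetical names, not the actual GnuCash
file format.  The point is only that the reader looks for the objects it
understands wherever they sit, so a bare fragment and a complete file
both load.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a file holding only invoices, with no book wrapper.
FRAGMENT = """<invoices>
  <invoice id="INV-001"><owner>ACME</owner><amount>125.50</amount></invoice>
  <invoice id="INV-002"><owner>Globex</owner><amount>80.00</amount></invoice>
</invoices>"""


def load_invoices(xml_text):
    """Collect every <invoice> element, whatever the enclosing element is."""
    root = ET.fromstring(xml_text)
    invoices = []
    # iter() walks the whole tree, so it does not matter whether the
    # invoices are at the top level or nested inside a larger document.
    for node in root.iter("invoice"):
        invoices.append({
            "id": node.get("id"),
            "owner": node.findtext("owner"),
            "amount": node.findtext("amount"),
        })
    return invoices
```

The result is just a list of objects in memory; nothing here touches
the merge problem.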

>> >> * transaction matching
>> >
>> > I'll need help with that. The existing procedures are presumably not
>> > anticipating a merge with existing data but are set to be read into an
>> > otherwise empty memory allocation.
>>
>> EXACTLY!
>
> So that's the bulk of the task. OK.

IMHO, yes.  I think the bulk of the task is the merging.  You need to
determine what data in the import maps to the existing data, what data
is new, and what data is a duplication.  I think this is the hardest
part.  There's been a good deal of work to do this with Transactions,
but not with Accounts, or any of the business features.

NOTE: a general API to "merge two books" would be a good potential
solution.
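
A minimal sketch of that three-way split (existing, new, duplicate),
in Python for brevity.  The Record type and its fields are made up for
the example, and it assumes every object carries a unique identifier
that survives export and import, which a real import may not have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    guid: str    # hypothetical unique identifier
    name: str
    amount: int  # smallest currency unit, e.g. cents


def classify_import(existing, incoming):
    """Sort incoming records into new, duplicate and conflicting."""
    by_guid = {rec.guid: rec for rec in existing}
    result = {"new": [], "duplicate": [], "conflict": []}
    for rec in incoming:
        current = by_guid.get(rec.guid)
        if current is None:
            # Nothing in the open book has this identifier.
            result["new"].append(rec)
        elif current == rec:
            # Same identifier, same content: safe to skip.
            result["duplicate"].append(rec)
        else:
            # Same identifier, different content: needs a decision.
            result["conflict"].append((current, rec))
    return result
```

Importing the same file twice would put everything in "duplicate";
only the "conflict" list needs to go to the user.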

>> > Is it acceptable to have a very simple rule?
>>
>> Depends on the rule.  Regardless, it requires user input.
>
> So each collision event is raised with the user - correct?

Yes.  Although it might behoove you to collect all the events and save
the applause until all the names have been called.  If you've got 100
events, would you rather have 100 pop-ups or a window with a list of
100 items?  Personally I'd rather have the latter.
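
The "one list instead of 100 pop-ups" idea, sketched in Python.  The
dictionaries stand in for whatever objects are being merged, and the
function names are invented for the example; a real GUI would put the
list in a window rather than a string.

```python
def collect_collisions(existing, incoming):
    """Gather every collision first; do not stop to ask the user."""
    collisions = []
    for key, new_value in incoming.items():
        if key in existing and existing[key] != new_value:
            collisions.append((key, existing[key], new_value))
    return collisions


def format_report(collisions):
    """Render all collisions as a single numbered list."""
    lines = [f"{len(collisions)} collision(s) found:"]
    for i, (key, old, new) in enumerate(collisions, 1):
        lines.append(f"  {i}. {key}: existing={old!r} imported={new!r}")
    return "\n".join(lines)
```

The user then resolves everything from one report after the whole
import has been scanned.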

> I was thinking about my original idea about adding invoices to the data 
> file/source. That uses existing data as a reference but adds new data. I 
> guess the problem would be if the user tried to import the same file twice. 
> That brings in the problem above, as well as in cases of a more extensive 
> import of other objects that may well require overwriting existing data.

See above.  That's definitely one problem.  Another problem is trying
to add data if you don't (necessarily) have an internal data
reference.

>> Not what I meant.  You may need to perform a transaction match or
>> duplicate check.  This has nothing to do with XML input and everything
>> to do with data coherency.
>
> Could you tell me where I might find an example of a duplicate check and 
> transaction lookup in the existing CVS code? It'll help me to see how gnucash 
> is structured. (There's an awful lot of code to look through if I don't know 
> where to start). Also, which files in the CVS deal with the XML file I/O?

See the code under src/import-export -- you'll find a bunch of
transaction matching and duplicate detection work.  It would need to
get extended to general objects instead of just transactions.  But if
you do this I think the code could definitely be re-used.
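
To show what "extended to general objects" could look like, here is a
small Python sketch of heuristic matching for data that has no shared
internal reference.  It is not the algorithm in src/import-export; the
weights, the threshold and the field names are all assumptions made up
for the example.  The matcher itself is generic and each object type
supplies its own scoring function.

```python
from datetime import date


def score_transaction(a, b):
    """Hypothetical similarity score for two transactions (higher is closer)."""
    score = 0
    if a["amount"] == b["amount"]:
        score += 3
    days_apart = abs((a["date"] - b["date"]).days)
    if days_apart == 0:
        score += 2
    elif days_apart <= 3:
        score += 1
    if a["memo"].strip().lower() == b["memo"].strip().lower():
        score += 1
    return score


def best_match(candidate, existing, scorer, threshold):
    """Return the best-scoring existing item, or None if below threshold."""
    best, best_score = None, 0
    for item in existing:
        current = scorer(candidate, item)
        if current > best_score:
            best, best_score = item, current
    return best if best_score >= threshold else None
```

An account or invoice matcher would reuse best_match() unchanged and
pass in a different scorer.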

The xml i/o is in src/backend/file/*.

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available