Merging two GNCBook* objects

Fri Apr 9 13:32:10 EDT 2004

On Friday 09 April 2004 4:39, Derek Atkins wrote:
> > Something along the lines of:
> > GNCBook* g_merge(GNCBook* main, GNCBook* import) {}
> >
> > I anticipate creating another two books in memory:
> > GNCBook* collision and GNCBook* parsed - collision would be offered to
> > the user for confirmation / amendment and then the (amended) collision +
> > parsed would be committed to main and returned. This way, if the user
> > aborts (because of the number / type of collisions), I can just delete
> > collision, import and parsed and leave main untouched. Otherwise, I'd use
> > the amended collision object to add / modify records in main and add
> > parsed - containing records that are simple imports with no collision
> > problems, like new transactions in accounts possibly modified by the
> > merge.
>
> While good in theory, I don't think this is exactly the best approach.
> I think you want to break the import down into pieces, and I'm not

I was beginning to work the same way. There will need to be separate handling 
of the various component objects. I didn't put that in the summary as I 
wasn't sure whether those functions should be part of the API or just for 
internal scope.

> convinced that storing them in a new GNCBook* is the right thing to do
> (there is a lot of overhead in a GNCBook*).  I may be wrong here -- I'm
> hoping some of the other import developers can speak up.

Thanks.

> > dummy outline:
> >
> > GNCBook* g_merge(GNCBook* main, GNCBook* import) {
> > // Create the rule set object
> > // Use the set to make decision 1: Is this data going to conflict with
> > // main? Yes -> GNCBook* collision, No -> GNCBook* parsed
> > // repeat until import is exhausted
> > // I'll need some kind of tally of how collision has been amended by the
> > // user
> > // That tally can then tell me how to resolve each collision and the two
> > // books can be added to main.
> > }
> >
> I'm working on some "generic druid" code in the g2 branch.  You can
> probably use that.  This interface closely matches the "transaction
> duplicate detection" interface that already exists, except it would
> need to be extended to other data objects.
>
> > Two types of actions to resolve collisions:
> > 1. Main overrides import
> > 2. Import overrides main
>
> I think the list of rules is going to depend on the object-type.  For

In the internals yes, I should have specified:

Two types of *user* actions to resolve collisions as presented in the dialog:
1. Main overrides import
2. Import overrides main

i.e. for each specific collision (no matter how far down the tree), the user
only needs to decide whether to keep the original or import the new. I'd
rather not add a burdensome 'edit-in-place' feature.
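
A minimal sketch of how that two-way choice and the tally of responses
might be recorded. All the names (MergeChoice, Collision,
collision_resolve, collisions_pending) are hypothetical and not part of
GnuCash:

```c
#include <stddef.h>

/* All names here are hypothetical; none of them exist in GnuCash. */
typedef enum {
    MERGE_UNDECIDED = 0,  /* user has not answered yet */
    MERGE_KEEP_MAIN,      /* 1. main overrides import */
    MERGE_TAKE_IMPORT     /* 2. import overrides main */
} MergeChoice;

typedef struct {
    const char *label;    /* text shown to the user for this collision */
    MergeChoice choice;   /* the user's answer */
} Collision;

/* Record the user's answer for one collision in the tally. */
void collision_resolve(Collision *c, MergeChoice choice)
{
    c->choice = choice;
}

/* Count the collisions the user has not decided yet. */
size_t collisions_pending(const Collision *tally, size_t n)
{
    size_t pending = 0;
    for (size_t i = 0; i < n; i++) {
        if (tally[i].choice == MERGE_UNDECIDED)
            pending++;
    }
    return pending;
}
```

Nothing would be written to main while collisions_pending() is non-zero,
so an abort only has to throw the tally away.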

> example, how you "merge" an account is going to be different than how
> you merge a transaction, and how you merge an invoice is going to be
> different than how you merge the invoice item-list.

Yes, dealt with internally according to the rule set and user response.

> > So a simple tally of user response for each collision event can be
> > resolved to parse the collision book.
>
> Maybe.
>
> You might find that a simple list is not sufficient.  I don't know,

I feared as much. I like to start simple - that way more bits stay simple!

> but my gut feels that somewhere in the rules you need to determine
> for each object in import if it's:
>
> "the same"		(guids match)
> "maybe the same"	(guids don't match but something else matches,
>                          like maybe the account name or invoice owner/date)
> "new"			("clearly" new)

the same - I'll get the import engine to ignore the import data and leave
main untouched. I'd appreciate comments on just how strict this has to be,
e.g. if the description field doesn't match but the date, account, amount and
category do match, I'd still list that as a collision. But what if it's only
a difference in capitalisation, 'Lunch' instead of 'lunch'? Looking for
matches with case-insensitive patterns would save a lot of queries. There
would still be anomalies with abbreviations, whitespace etc.

maybe the same, maybe different - report to user using the collision object 
(probably not a complete GNCBook). Merge action is dictated by the nature of 
the import data involved in that specific collision.

new - don't list in the collision dialog, just store in case the user aborts 
and then commit when the other collisions are resolved.

There's a balance to be struck here between giving the user 2,000 collisions 
and leaving the user 100 transactions to adjust manually.
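
To make the three outcomes concrete, here is a rough sketch of one way
to classify a single transaction. TxnRecord, classify_txn and
equal_ignoring_case are made-up names, the record is a flattened
stand-in for the real objects (category is left out), and the exact
thresholds are an assumption that is still open for comment:

```c
#include <ctype.h>
#include <string.h>

typedef enum { MATCH_SAME, MATCH_MAYBE, MATCH_NEW } MatchResult;

/* Hypothetical flattened transaction; real GnuCash objects are richer. */
typedef struct {
    const char *guid;
    const char *date;         /* e.g. "2004-04-09" */
    const char *account;
    long amount_cents;
    const char *description;
} TxnRecord;

/* Compare two strings ignoring case only ('Lunch' equals 'lunch'). */
int equal_ignoring_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings end together */
}

/* Assumed policy: key fields plus description decide "the same";
 * a matching guid alone only ever gets as far as "maybe the same". */
MatchResult classify_txn(const TxnRecord *m, const TxnRecord *imp)
{
    int same_guid = strcmp(m->guid, imp->guid) == 0;
    int same_key = strcmp(m->date, imp->date) == 0 &&
                   strcmp(m->account, imp->account) == 0 &&
                   m->amount_cents == imp->amount_cents;
    int same_desc = equal_ignoring_case(m->description, imp->description);

    if (same_key && same_desc)
        return MATCH_SAME;    /* ignore the import, leave main untouched */
    if (same_guid || same_key)
        return MATCH_MAYBE;   /* report to the user as a collision */
    return MATCH_NEW;         /* store, commit once collisions are resolved */
}
```

Loosening or tightening equal_ignoring_case() (whitespace,
abbreviations) is where the balance between too many collisions and too
much manual adjustment would be tuned.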

> What we're testing here is whether the object refers to the same semantic
> concept.  E.g. there is a semantic concept of the top-level Asset
> account.  But if I'm merging your account tree into my account tree
> the guids will differ, but semantically they are the same.  Hence,
> these accounts are "maybe the same".  You probably need user
> intervention here to properly map the "maybe the same" objects.

Yes, from our previous discussions, I was not going to put a lot of weight on 
matching guids - it's only a part of the match and the match would still
fail if other parts of the object differed.

> If objects are "the same" or "maybe the same" then you might need to
> determine whether they contain different data.  For example, if I

Definitely. Although not expressed in the summary, that will be in the detail.

> Then you also need to keep all the references correct.

:-) Ho-hum. The rule set and the references are going to take the most care.

> > I'm hoping to start the rule set this holiday weekend by setting out
> > which parts of GNCBook would have to be overwritten and which would have
> > to be appended after user confirmation. (Basically sorting settings from
> > transactions) and then creating a test program for development that
> > implements two basic GNCBooks and outputs the impact of rule changes on
> > each.
>
> What do you mean "overwritten"?  Oh, I think you mean trying to
> determine which objects are the "same", "maybe the same", and "new"?

Yes, identifying which data objects within a GNCBook object in RAM cannot be
duplicated / repeated and which therefore, in the event of user confirmation,
would need to be overwritten to reflect the imported data.
i.e. an account can only have one name but many transactions - the account
name would be overwritten if the user chose to allow that data to be
imported. Transactions may be overwritten (confirmed collisions) or appended
(new).
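
As a rough illustration of the overwrite / append distinction, using a
toy SketchAccount struct that is purely hypothetical and far simpler
than a real GnuCash account:

```c
#include <stdio.h>
#include <stddef.h>

#define NAME_LEN 64
#define MAX_TXNS 16

/* Toy account: one name (single-valued), many transactions (repeatable). */
typedef struct {
    char name[NAME_LEN];
    long txn_cents[MAX_TXNS];
    size_t n_txns;
} SketchAccount;

/* Single-valued data is overwritten when the user accepts the import. */
void account_overwrite_name(SketchAccount *acc, const char *imported_name)
{
    snprintf(acc->name, sizeof acc->name, "%s", imported_name);
}

/* Repeatable data is appended; returns 0 on success, -1 if the toy
 * fixed-size array is full. */
int account_append_txn(SketchAccount *acc, long amount_cents)
{
    if (acc->n_txns >= MAX_TXNS)
        return -1;
    acc->txn_cents[acc->n_txns++] = amount_cents;
    return 0;
}
```

The rule set would then mostly be a matter of recording, per field,
which of these two operations applies.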

> Also, I _HIGHLY_ suggest you work from CVS HEAD and _NOT_ from the 1.8
> tree.  Otherwise it's just adding work, and frankly this code won't be
> getting into 1.8 so you might as well work from "current" code.

I'll change over.

> Also, just make sure you can plug in rules per object type.  :)

Absolutely - there'll be no lock-in or lock-out!
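
Something like a small table of function pointers keyed on the object
type name would do it. This is only a sketch with invented names
(MergeRuleFn, merge_rule_register, merge_rule_lookup,
account_name_rule); it is not the qofobject interface, which I still
need to read:

```c
#include <stddef.h>
#include <string.h>

/* A rule returns non-zero when the two objects collide. Hypothetical. */
typedef int (*MergeRuleFn)(const void *main_obj, const void *import_obj);

typedef struct {
    const char *type_name;   /* e.g. "Account", "Invoice" */
    MergeRuleFn rule;
} MergeRuleEntry;

#define MAX_RULES 16
static MergeRuleEntry rule_table[MAX_RULES];
static size_t rule_count = 0;

/* Plug in a rule for one object type; returns -1 if the table is full. */
int merge_rule_register(const char *type_name, MergeRuleFn rule)
{
    if (rule_count >= MAX_RULES)
        return -1;
    rule_table[rule_count].type_name = type_name;
    rule_table[rule_count].rule = rule;
    rule_count++;
    return 0;
}

/* Find the rule for an object type, or NULL if none is registered. */
MergeRuleFn merge_rule_lookup(const char *type_name)
{
    for (size_t i = 0; i < rule_count; i++) {
        if (strcmp(rule_table[i].type_name, type_name) == 0)
            return rule_table[i].rule;
    }
    return NULL;
}

/* Example rule: two account names (C strings) collide when they differ. */
int account_name_rule(const void *main_obj, const void *import_obj)
{
    return strcmp((const char *)main_obj, (const char *)import_obj) != 0;
}
```

The merge code would only ever call merge_rule_lookup(), so adding a new
object type means registering one more function and nothing else.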

> See the qofobject code in CVS HEAD to see what I mean.
>
> Good Luck!
>
> PS: Feel free to pop into the #gnucash channel on irc.gnome.org to
> discuss it with us.  Many of the devs hang out there and answer tech
> questions for each other.

It's absolutely ages since I used IRC.
(I tend to be too verbose and I make stupid typos / brain farts when trying to 
respond in real-time.)

-- 

Neil Williams
=============
http://www.codehelp.co.uk/
http://www.dclug.org.uk/
http://www.isbn.org.uk/
http://sourceforge.net/projects/isbnsearch/

http://www.biglumber.com/x/web?qs=0x8801094A28BCB3E3