first draft of a merge rule set

Linas Vepstas linas at linas.org
Wed Apr 14 21:35:19 EDT 2004


On Wed, Apr 14, 2004 at 12:14:42PM -0400, Derek Atkins was heard to remark:
> linas at linas.org (Linas Vepstas) writes:
> 
> > yeah, I guess ... that's OK. I dunno. maybe not.   But a good baseline 
> > would be purely automated algo like so:
> 
> But wait, there are multiple questions that need to be answered.
> 
> 1) Is this semantically the same object?  For example, does this
>    Account* and that Account* point to (semantically) the same
>    "Account"?  

I'm not sure what you mean by 'semantically' ...
Are you comparing an account to something that's not an account??

> I'm not convinced a distance vector is necessarily
>    the correct measure.  

It's a measure; it's serviceable and reasonably generic.

'correct' is a mighty big word when talking about similarity between
things.

> Perhaps a weighted distance vector would
>    be appropriate (but then again the weights would be per-object,
>    a "parameter weight"?).

Yeah, well, the problem of how to weight things is a problem
of heuristics.  I think that there's a reasonable set of 
weights that will satisfy most human users most of the time.

The 'correctness' problem and the 'arbitrariness' of these 
weights are one reason to keep the concept & implementation 
outside of the engine.
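
To make the weighted-distance idea a bit more concrete, here is a
rough sketch in Python. Nothing in it is GnuCash code: the Txn
record, its fields, the per-field distance functions and the WEIGHTS
values are all assumptions made up for illustration. The one point
it tries to show is that the weights are plain data owned by the
caller, so they can be tuned without touching the engine.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Txn:
    """Hypothetical, stripped-down transaction record (not a GnuCash type)."""
    amount_cents: int
    payee: str
    posted: date


# Assumed heuristic weights; the numbers are arbitrary placeholders.
# They live outside any engine code so a user or importer can tune them.
WEIGHTS = {"amount": 1.0, "payee": 0.5, "date": 0.1}


def distance_vector(a: Txn, b: Txn) -> dict:
    """Return one non-negative distance per compared field."""
    same_payee = a.payee.strip().lower() == b.payee.strip().lower()
    return {
        # difference in currency units
        "amount": abs(a.amount_cents - b.amount_cents) / 100.0,
        # crude 0/1 match on the normalized payee string
        "payee": 0.0 if same_payee else 1.0,
        # difference in days between posting dates
        "date": float(abs((a.posted - b.posted).days)),
    }


def weighted_distance(a: Txn, b: Txn, weights: dict = WEIGHTS) -> float:
    """Collapse the distance vector to one number using the weights."""
    vec = distance_vector(a, b)
    return sum(weights[field] * vec[field] for field in vec)
```

With these made-up weights, two transactions that agree on amount
and payee but are posted one day apart come out at a distance of
0.1, while a one-dollar difference in amount alone costs 1.0.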


> 2) Do these semantically equivalent objects have the same or
>    different data in them? 

I don't understand what you're saying.  What's an example of 
'semantically equivalent objects' that would have different data?

Do you mean 'similar objects that have slightly different data?'
For example: two transactions for the same amount, same payee, 
dates that differ by one day?  These are 'similar' in my dictionary,
I don't know what you mean by 'semantically equivalent'.

>    This is more a question to determine
>    if you have any work to do once you determine that you've got
>    a duplicate in the import queue.

I don't get this either.  If you've got two transactions that 
are identical, you've got work to do.  If they are nearly
identical, you've got exactly the same work to do.  Only if 
they are significantly different do you have to stop and pop up 
a GUI to ask the user.
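
As a sketch of that rule in Python, assuming some numeric distance
between the two transactions has already been computed: identical
and nearly identical pairs take the same automatic path, and only
pairs past a cutoff go to the user. The Action names and the
ASK_THRESHOLD value are hypothetical, picked only to illustrate.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes for a candidate duplicate pair."""
    MERGE = "merge"        # handle automatically, no user involved
    ASK_USER = "ask_user"  # stop and pop up a dialog


# Assumed cutoff; the value is arbitrary and would need tuning.
ASK_THRESHOLD = 1.0


def decide(distance: float, ask_threshold: float = ASK_THRESHOLD) -> Action:
    """Pick an action from a precomputed, non-negative distance."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # Distance 0 (identical) and small distances (nearly identical)
    # deliberately fall into the same branch: the work is the same.
    if distance <= ask_threshold:
        return Action.MERGE
    return Action.ASK_USER
```

The cutoff is just another heuristic weight, so it belongs with the
rest of the tunable parameters rather than being hard-wired.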

--linas

-- 
pub  1024D/01045933 2001-02-01 Linas Vepstas (Labas!) <linas at linas.org>
PGP Key fingerprint = 8305 2521 6000 0B5E 8984  3F54 64A9 9A82 0104 5933

