Dirty entity identification.

Thu Jul 21 17:04:28 EDT 2005

On Thu, Jul 21, 2005 at 08:13:22PM +0100, Neil Williams wrote:
> On Thursday 21 July 2005 12:57 am, Chris Shoemaker wrote:
> > Ok, this was my point.  I completely understand that you can get a
> > very quick boolean answer to the question "has anything in the book
> > changed?" by checking each collection's dirty flag.  But think about
> > *how* you'd have to create a list of all dirty entities for the case
> > where the task is to commit just the 1 split that changed in my
> > example.
> 
> I had been - and it could be solved but I'd have to formalise the idea first. 
> I'm not sure what is the real-world use for such an API. I can see it as a 
> fallback for a failed write but that isn't particularly common. I can see it 
> for incremental storage systems but we don't use those yet. (SQL aside as 
> that can do this via a separate mechanism).

If by "incremental storage system" you mean something that commits
only what has changed, then we're on the same page.  (Incidentally,
even "immediate-commit" systems sometimes fallback to "delayed-commit"
systems when they're in "offline" mode.)

> 
> > ISTM (It seems to me) there are 3 options: 1) You can't do 
> > that; you must commit all 100000 Splits.  2) You can do that just
> > fine, but you must do a linear search through 100000 Splits to find
> > the 1 that changed.  or 3) You start at the dirty book, and perform
> > the tree search I described before. 
> 
> Derek's point stands: The book knows nothing about the tree. There is no tree 
> within the book, it only exists in *our* conceptualisation of the 
> relationships between objects. All the book knows about are collections and 
> collections are not linked to each other - only objects link to other 
> objects.
> 
> Now it *could* be possible for the collection to keep a GList of changed 
> entities in it's own collection. The question is, is it worth doing?

I think not.  But this isn't what is required to be able to perform
incremental updates with out the linear searches.  As you say, it's
predictive, not retrospective.  

> 
> Keep in mind that all existing mechanisms are retrospective - not much is done 
> until the question is asked. Storing a GList of modified entities would have 
> to be predictive: whether you need it or not, it would be being done. This 
> isn't just storing a single boolean value that covers tens of thousands of 
> entities, the GList would store each modified entity and could get incredibly 
> long in some cases. It may only be storing a pointer to the entity or maybe 
> it's GUID (as the type is already determined by the collection), but that 
> will mount up. It is conceivable with the SQL query dialog that I've got 
> planned for after G2, that the user could update every single instance of one 
> object type in one operation.
> 
> > The time cost difference between 
> > 2) and 3) can be arbitrarily large.
> 
> I think it would be too large to inflict on all users at all times for the odd 
> occasion that it might be useful.

I think you may misunderstand.  Both the linear search and the tree
search are retrospective, and the cost of the linear search for dirty
instances of all types will *always* be equal to or greater than the
tree search, and usually (in the cases where not everything is dirty)
it will be MUCH greater.

Proof: To find all the dirty instances of one type with a linear
search where at least one instance is dirty in a collection by type,
you must check every instance in the collection.  With a tree search
you need not check any instance whose referent hasn't been marked as
"containing something dirty".

> 
> > I can see that QSF only needs to handle lists of uniformly typed
> > entities.  However, if there's no way to ask "are there dirty
> > Transactions in this Account"
> 
> The Account is marked dirty but the entities responsible are not currently 
> identifiable.
> 
> > , then *every* selection of a subset of 
> > Splits for commiting will require a linear search through *all*
> > Splits.  Does that seem like a problem to you?
> 
> It would if I could see a need to identify only these entities. 

Right, well, obviously, if you don't mind committing 100000 splits
when only one has changed, then the cost of finding the one doesn't
really matter.  

> Currently, I can only see this as a solution in search of a problem.

Maybe you're right, but let me play devil's advocate: I don't know the
current state of the backends, but imagine this scenario: Backend is
remote server, and connection to server goes down.  What happens?  One
option is that GC prevents the user from continuing to edit the data
on the screen.  Option two is that GC alerts the user that the
connection went down and that changes will be committed to the server
when the connection comes back, if ever.  Let's say we want option
two.  The user adds/changes some splits and the connection comes back
so we want to commit what has changed.  But how?  Several options:

  1) We cached the changes as they were made (as you describe in your
"predictive" method.)  We just clear the cache.

  2) We just send the entire Split collection to the backend and let
it figure out what changed.

  3) We do a linear search through the Split collection to find the
few changes and commit those.

  4) We do a tree search that finds that only one Account is marked as
"contains dirty Splits" so our linear search through Splits is only
through that Account's Splits instead of all Splits.  We find the
changes and commit them.

Any of those options would work.  But if this is something that
happens often, 2) and 3) will probably be unacceptably expensive.

Maybe GC will never have to address this issue because it will never
support an "offline" mode with a remote backend.  If it does, 4) will
be easy to implement as long as instances store a reference to their
"parent", like Split does.  The implementation is simply to do the
same thing to the parent's "contains something dirty" flag as you
currently want to do to the Collections "dirty" flag.

-chris