KVP and data that contains a forward slash
jralls at ceridwen.us
Thu Feb 11 15:48:14 EST 2016
> On Feb 11, 2016, at 11:20 AM, Derek Atkins <warlord at MIT.EDU> wrote:
> John Ralls <jralls at ceridwen.us> writes:
>>> Originally the KVP path was supposed to be a file-system. So if there
>>> is a desire to move away from that, then we should be explicit about it.
>> I hope that's not literally true in the sense of writing it out to
>> disk, that would be incredibly stupid. Surely you mean that it was
>> meant to have paths *like* a file system, which it does. It's still an
>> inefficient design because there's no locality so every step is a (or
>> more likely several because of the hash tables) cache miss.
> No, of course it's not in the sense of writing out to disk. That would
> be silly! :-)
> Keep in mind it was designed when GnuCash ONLY had XML, and the "paths"
> in a KVP tree directly mapped to the XML object hierarchy.
>>>> Most of our existing use of nested KVP isn't necessary anyway, but
>>>> changing it will require a new file/db version and conversion
>>>> routines. No point in introducing new nesting though, and besides that
>>>> would also create a file incompatibility between 2.6 and 2.8.
>>> I would disagree; it's nice to have some nesting (certainly within the
>>> XML framework) to make it easy to remove whole KVP subtrees. When you
>>> view KVP as a file system then it makes total sense to have nesting, and
>>> all the benefits that come with it.
>> That would carry more weight if we actually used the XML DOM tree
>> inside of GnuCash, but we don't. Besides, that's not how we use
>> it. With a couple of exceptions (import matching and book properties)
>> we use it to add members to classes without changing the XML or SQL
>> Schema. Those paths are two or three deep and for the most part have
>> only one or two elements at the bottom. Getting rid of KvpFrames and
>> converting the "path" to a string name so that it takes a single hash
>> lookup instead 2 or three will be more performant with no affect on
>> our actual usage. The payoff is even higher when the KVP data isn't
>> all in memory: Having to do 3 SQL queries to retrieve an int64_t or a
>> string is ridiculous. For most of those uses having to do a separate
>> query at all is ridiculous. The members should be part of the object
>> record and retrieved when the object is instantiated.
> There are two separate issues here:
> 1) The SQL encoding of KVP. The fact that it's encoded the way it is,
> using frames that multiple queries, arguable is itself a bug. I'm not
> sure why we didn't use full-path encodings in SQL. But that is
> completely seprate from how KVPs are used/stored internally.
> 2) Not loading KVP with the object. When an object is instantiated it
> should, arguably, load all the associated KVP data, too, so that it can
> be accessed directly through the object APIs. Whether it's in core as a
> KVP or in core as a table column is mostly irrelevant.
Again, the path argument would be more convincing if we actually used XPath. I don't think it really matters what grand plan the designer (you?) had in mind, what matters is how we're using it now.
No, SQL just makes the penalty more severe. Cache misses are expensive; so much so that the current C++ doctrine (based on many simulations) is that it's much faster to sequentially search a std::vector than to use any container that relies on independently allocated nodes, even when the overhead of complete reallocation for an insert operation is accounted for. Every indirection in KVP, i.e. every step involving a KvpFrame means dereferencing the KvpFrame*, dereferencing its GHashTable*, dereferencing the hash table's hashes and getting the next KvpItem*, dereferencing it, getting its content type and ptr, and if it's a KvpFrame, repeating the process. Because each of them is a different type GSlice will have a different magazine for each of them. If not much else is going on on the system all of the magazines might be in cache after the first round and subsequent descents will be faster. Or not.
That overhead could be reduced immensely by not having KvpFrames. With the whole path as a key there'd be a single GHashtable lookup and only three derefences (the QofInstance's Kvp GHashtable, the KvpItem* it returns, and the KvpItem's contents).
All of that said, I agree that not implementing KVP in the SQL backend that way was a mistake and I regret it.
Moving almost all of the Kvp access to being through GObject properties does make all of the access via the object's API. The next step is to add members to the objects and load them from KVP in the backend. That gets rid of the Kvp performance penalty entirely for the XML backend and makes it one-time for the SQL backend.
More information about the gnucash-devel