(AUDIT?) Re: r14892 - gnucash/trunk - Add a new QOF_TYPE_NUMSTRING to add numeric sorts. (#150799).

Wed Sep 27 08:29:33 EDT 2006

Chris Shoemaker <c.shoemaker at cox.net> writes:

> Oh, I missed that "limit 10" part.  This is really conflating
> filtering with sorting.  Does _GnuCash_ really have a use for "filter
> N"?  _Even_ if we want to support remote datasets larger than RAM, you
> already have filtering by "where".  So, you're describing a case when
> you don't even want to return full query results!  I just don't see
> this being even remotely possible for "personal and small-business"
> accounting software.

The API to set the max results is certainly used.  Druid-Acct-Period
uses it.  And it's used in a few other places, too, including the
register.
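
To make the "sort, then cap" pattern concrete, here is a minimal sketch
of a register-style query that wants only the newest N entries.  It is
not the QofQuery code; Txn, run_query and the sample book are names
made up for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    date: str    # ISO date, so string order is chronological order
    amount: int  # in cents


def run_query(txns, where, max_results=None):
    """Filter with `where`, order newest first, optionally cap the result."""
    matches = (t for t in txns if where(t))
    if max_results is None:
        return sorted(matches, key=lambda t: t.date, reverse=True)
    # nlargest keeps only max_results candidates in memory at a time,
    # so the full match set is never sorted or materialised.
    return heapq.nlargest(max_results, matches, key=lambda t: t.date)


book = [
    Txn("2004-03-01", 1200),
    Txn("2006-09-26", -4500),
    Txn("2005-11-15", 9900),
    Txn("2006-01-02", 300),
]

# A register that shows only the two most recent entries.
latest_two = run_query(book, where=lambda t: True, max_results=2)
```

The cap is applied after the ordering, so it is not a second "where"
clause: which rows survive depends on the sort.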

> Well, I figure the on-disk representation is probably 2-4 times larger
> than the in memory size (totally a guess).  So I wouldn't worry unless
> his datafiles are > .5GB.

I believe his uncompressed data size is on that order of magnitude.

>> Then tell me that it's okay to have it all in QOF's RAM "cache"..
>
> I would say it's okay to have it all in RAM, and I don't think it
> needs any special "cache" at all.

Many of the internal algorithms don't scale well.  Even linear searches
over that many objects can take a noticeable amount of time.
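
A rough sketch of the scaling concern, assuming objects are looked up
by some unique key (the "guid" field and both helpers are hypothetical,
not QOF code): a linear scan touches every object in the worst case,
while an index built once answers the same lookup in one step.

```python
def find_linear(objects, guid):
    """Scan until the guid matches; also report how many objects were visited."""
    for steps, obj in enumerate(objects, start=1):
        if obj["guid"] == guid:
            return obj, steps
    return None, len(objects)


def build_index(objects):
    """One pass over the data; afterwards each lookup is a single dict access."""
    return {obj["guid"]: obj for obj in objects}


objects = [{"guid": f"g{i}"} for i in range(100_000)]

found, visited = find_linear(objects, "g99999")   # worst case: last object
index = build_index(objects)
same = index["g99999"]
```

The scan cost grows with the number of objects held in memory, which
is the part that gets worse every year the book grows.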

>> Now imagine going out to 20 years of data, hundreds of thousands of
>> transactions...
>
> 10 years, 20 years, 100 years... Datasets grow linearly.  RAM doesn't.
> To find the cross-over point when personal and small-business
> accounting data approached sizes larger than average RAM, I think we'd
> have to go back to the 1980s.

And performance degrades linearly.  That's why I call it caching.
There's no reason to have a full 20-year dataset in RAM if you're only
operating on the last year's worth of data.

>> Wouldn't you rather have a smaller working set?  I know I would.
>
> From a user's POV, smaller memory requirements traded for increased
> latency isn't a clear win.  From a developer's POV, having uniform
> access to the whole dataset is a clear benefit.

It's only increased latency for some operations, when you go outside
the existing cached dataset.  In your approach, performance keeps
decreasing... forever.  Even with data that the average
user would probably never even look at or use.  Whereas if you only
cache the data required to perform the operations that the user
wants to perform, then the full dataset can grow and performance won't
degrade over time.
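
A minimal sketch of what I mean by caching, assuming the backend can
hand back one accounting period at a time.  PeriodCache and
fake_backend are invented for this example; nothing here is existing
GnuCash or QOF code.

```python
class PeriodCache:
    """Keep only the periods the user has actually touched in memory."""

    def __init__(self, load_year):
        self._load_year = load_year  # hypothetical backend callback
        self._years = {}             # year -> list of transactions
        self.backend_loads = 0

    def transactions(self, year):
        # Pay the backend latency only the first time a year is requested.
        if year not in self._years:
            self._years[year] = list(self._load_year(year))
            self.backend_loads += 1
        return self._years[year]

    def resident_count(self):
        """Number of transactions currently held in memory."""
        return sum(len(txns) for txns in self._years.values())


def fake_backend(year):
    """Stand-in for a SQL or file backend: 1000 transactions per year."""
    return [(year, n) for n in range(1000)]


cache = PeriodCache(fake_backend)
for _ in range(50):
    cache.transactions(2006)   # everyday work stays inside the current year
```

After the loop the backend has been hit once and only one year is
resident, no matter how many years the backend holds.  Asking for an
old year costs one extra load, and only then.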

I'll also point out that we already DO have uniform access to the
whole dataset: the QofQuery.

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available