Search engine (was Wiki)

Neil Williams linux at codehelp.co.uk
Wed May 5 03:31:32 EDT 2004


On Wednesday 05 May 2004 2:28, Derek Atkins wrote:
> Neil Williams <linux at codehelp.co.uk> writes:
> > The archives are in a sub-domain of gnucash.org but how is that
> > configured?
>
> machine1 is "gnucash.org", "www.gnucash.org", and "mail.gnucash.org"
> machine2 is "cvs.gnucash.org" and "lists.gnucash.org"
> Machine2 (cvs,lists) hosts cvsweb and mailman/pipermail
> (the latter being the list archives).  Right now machine1 (www) has
> very limited disk space (at least until Linas upgrades his machine).
> Machine2 is also SIGNIFICANTLY faster!

So I'd like to have the search engine on machine 2 under lists.gnucash.org, 
using local filesystem access - read-only, since the script itself needs no 
write access; the only writing needed is uploading new versions of the script.

> So, in terms of performance and disk space, using machine2 is the much
> better option (also, it's the one I maintain ;)  However it means it's
> not at www.gnucash.org.

As with the doxygen docs, it's only a link away. Besides, I can easily give 
the search engine the look and feel of the main GnuCash site - you'll only be 
able to tell the difference from the URL. I might still be learning QOF, but 
HTML, CSS, SSI and PHP are second nature, even if I say so myself. I don't do 
Flash and I'm no artist, but if you want strictly compliant HTML, XHTML, CSS, 
CSS2, SSI, PHP, Perl, XML or WML, take a look at codehelp.co.uk and 
www.dclug.org.uk - it's all my own code, most of it hand-crafted in Vi. I'd 
be lying if I said it was hard to do.

> IMHO it would be "easier" for the docs to be generated on the same
> machine where the cvs archives are kept.  It's a local matter to
> update the tree, so there's no network usage, and that machine is
> significantly more powerful.

Same applies for the search engine - the faster the script can access the HTML 
archive, the faster the search returns the answer.
 
> But I agree with you..  Having the docs live at http://cvs.gnucash.org/...
> is just fine, so long as there's a link off the main gnucash site.
> Linas may not agree, however, if he wants to keep the "gnucash look and
> feel" of the docs.  But I don't know how hard it would be to get
> doxygen to use SSI.  It might be easier with frames.

As with all other 'machine-produced' HTML, I've always found it better to 
use a Perl script AFTER generation rather than dabble in the config. The 
machine output is very standard and very predictable - ideal territory for a 
pattern match that replaces the default doxygen <head> or <style> etc. with 
the necessary replacements, which can include SSI just as easily as any other 
tags. The Perl script can rename the files to .shtml and be run from a bash 
script under cron. I do this all the time, on machines and formats of every 
type.
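
Just to show the shape of it - a rough sketch only, and in PHP here rather 
than the Perl I'd actually use, with the doxygen path and the replacement 
header as placeholders:

<?php
// Sketch: swap the generated doxygen <head> for one that pulls in the
// site look and feel via SSI, then rename .html to .shtml.
$head = "<head><!--#include virtual=\"/header.shtml\" --></head>";

foreach (glob("/path/to/doxygen/html/*.html") as $file) {
    $page = file_get_contents($file);
    // Doxygen output is predictable, so one pattern match is enough.
    $page = preg_replace('|<head>.*?</head>|s', $head, $page);
    $new = preg_replace('/\.html$/', '.shtml', $file);
    $fh = fopen($new, 'w');
    fwrite($fh, $page);
    fclose($fh);
    unlink($file);
}
?>

The real script would handle <style>, footers and the rest the same way.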

> machine1 is in Texas, machine2 is in Massachusetts.  There is no
> special network between them, no VPN, no NFS.  So you cannot assume
> that machine1 can access data on machine2 or machine2 can access data
> on machine1 except through the same public interfaces that anyone else
> can use.

Fine, then a normal HTML link from www.gnucash.org to the search script on 
lists.

> > I can definitely search any publicly accessible HTML content from any
> > site with no local cache; it all depends on the speed of access. A simple
> > pattern match on --beginarticle-- and --endarticle-- can be used to limit
> > searches to the relevant content. All pages carry a link to the next
> > message, and you don't want that link's subject line causing a page to
> > show up in a search that really matches the linked message rather than
> > the page itself.
>
> Sure, but how does this work with a pipermail archive?  How do you
> index it?  Obviously putting it onto the archive server would be the
> best solution, IMHO, to allow easy indexing.

The same way it does with a MHonArc archive: directly against the HTML. It's 
an archive search, not a mailbox search. IMHO, indexing is a waste of 
resources. From watching the access patterns of other archive sites, people 
don't search in ways that suit indexing; they want to be able to select which 
archives to search by month, and the terms they enter don't sit well with a 
formal index. It's simpler to offer the visitor the full list of archives (as 
the index page currently does, but with checkboxes) and let them decide which 
months to search.
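
To make that concrete, the core of the search is no more than this (a rough 
PHP sketch - the archive path, URL and form field names are placeholders, and 
I'm assuming the --beginarticle--/--endarticle-- markers sit in each article 
page as HTML comments):

<?php
// Sketch: search the months the visitor ticked, straight against the
// pipermail HTML. Paths, URLs and field names are placeholders.
$base   = "/path/to/pipermail/gnucash-devel";
$term   = $_GET['q'];
$months = $_GET['months'];   // from the checkboxes, e.g. array("2004-May")

foreach ($months as $month) {
    foreach (glob("$base/$month/*.html") as $file) {
        $page = file_get_contents($file);
        // Only match between the article markers, so the link to the
        // next message can't drag unrelated pages into the results.
        if (preg_match('/<!--beginarticle-->(.*?)<!--endarticle-->/s',
                       $page, $m)
            && stristr($m[1], $term)) {
            $name = basename($file);
            echo "<a href=\"/pipermail/gnucash-devel/$month/$name\">";
            echo "$name</a><br>\n";
        }
    }
}
?>

No index, no database - just a read of the selected files each time.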

> > For an example, take a look at DCLUG:
> > http://www.dclug.org.uk/archive/
> > This is a MHonArc archive built from a mailing list of similar volume to
> > gnucash-devel where the archive is stored on the local filesystem. Speed
> > is obviously relative to the hardware available and I'd tailor the script
> > to try to make best use of the available performance. The archive script
> > is time-aware so that it automatically generates links as each new
> > archive is created.
>
> We use pipermail archives.  
> We don't have the original mbox archives 
> for most of our data (they were destroyed a year or so ago).

And? The archiving software is irrelevant; I search the archive, not the mail.

> Any 
> solution we have needs to deal with pipermail and must index the
> individual article HTML files.

Why? The script READS the HTML, but indexing is pointless. What are you going 
to index on? You've already got it sorted by date and the threads are taken 
care of, so WHY index it?

> > Is that linked from the www.gnucash.org home page?
>
> Yes.  See "FAQ" in the menu.  However there is no "wiki" link per se.
>
> > Should this page be updated?
> > http://www.gnucash.org/en/state_of_the_gnucash_project.phtml
>
> Oh, probably.

This is my bread-and-butter work and it's just so easy to do - it's the kind 
of stuff I enjoy doing. If I have access, I can sort all these things out - 
it doesn't sound like something to which you assign a high priority, so why 
not let me do it?

> > I can do this tomorrow. All I need is access - SSH or FTP, I can do the
> > scripts in PHP. If you can let me know how to access the location for the
> > script(s) by tomorrow morning GMT, I'll work on it straight away. (Feel
> > free to encrypt any access instructions).
>
> You could just email me the scripts and I can install them. 

Yuk! I may be good at PHP but I know when I need testing and speedy updates to 
my code!! Sorry, I've done this before and it simply did NOT work. I really 
do not want to get involved in that morass again.

> Let me 
> know what other information you need.  

Absolute path names, the ability to use phpinfo() as and when required 
(without leaving it there for the entire world to see), and access. Sorry, I 
really cannot work without FTP or, preferably, SSH. I'm NOT going to sit and 
download the archive page by page to create a test site, and I don't work 
without being able to do my own updates. FTP lets me download an accurate 
copy quickly and upload script updates simply; SSH lets me compress the copy 
into a tarball, download that and delete the temporary tarball. Either way, I 
really have tried to do this kind of work without access and it is a process 
WORSE than learning QOF. We really, really, truthfully do NOT want to go 
there! This is a simple job that can be done today, IF some sort of FTP or 
SSH user access is available. Just an ordinary user - but if that isn't 
available then sorry, I really can't help with the search engine.

> No offence, but right now not 
> even the developers have shell access to the server.

Pity. I've got a script that is just waiting to be adapted, but I really 
cannot proceed without some access. With the timezone differences as well, I 
will not be able to fix bugs unless I can update in real time from my own 
box. (Yes, there may well be bugs - nobody writes perfect code the first 
time!)

> > With regard to:
> > http://www.gnucash.org/en/contribute.phtml
> >  We even need someone to make sure that the mail archives are running
> > correctly, and that recent mail is getting indexed & is searchable.
> > (webmaster selected)
> >
> > Is that bit about checking the operation of the mail archives still a
> > problem? From only a casual use of the archive, recent messages seem to
> > be added very quickly. The 'searchable' I can solve for you.
>
> No, it is not a problem any more.  The lists are running fine, and the
> archives are running fine.

So the contribute script is to be updated ...?


-- 

Neil Williams
=============
http://www.codehelp.co.uk/
http://www.dclug.org.uk/
http://www.isbn.org.uk/
http://sourceforge.net/projects/isbnsearch/

http://www.biglumber.com/x/web?qs=0x8801094A28BCB3E3