IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Thu Feb 2 12:43:45 EST 2006

For the masses, we had this discussion on IRC earlier today.
I'm copying the logs here for posterity (and at cstim's request).

-derek

<cstim> btw our gnucash data file still begins with <?xml version="1.0"?> i.e. without the encoding="something" attribute.
<cstim> (doesn't it?)
<cstim> This is the root cause of http://bugzilla.gnome.org/show_bug.cgi?id=329202
<jsled> Correct.
<cstim> and we should start as soon as possible to add that tag again, but I'm not yet sure how to set it and what implications this might have as well.
<cstim> I didn't have a problem of 1.8 <-> 2.0 file compatibility, but only because my German umlauts happen to be in latin-1
<warlord> io-gncxml-v2.c:    fprintf(out, "<?xml version=\"1.0\"?>\n");
<warlord> in src/backend/file
<cstim> really?!?
<jsled> la la we suck.
<cstim> royally.
<warlord> in write_v2_header()
<jsled> We should really have 2.0 fix the upgrade path.
<warlord> yes, but 2.0 needs to figure out what locale the old datafile is in..
<warlord> (and prompt the user).. Which might be.. challenging.. based on the code path and lack of callback.
<jsled> Hmm.  There should be some way to determine the system-default character encoding.
<jsled> If we open a data-file without @encoding, then we use that, and convert to utf-8 on subsequent writes.
<jsled> If we see @encoding, then we're good to go.
<cstim> although I'm confused as to why my latin-1 characters (the non-ascii ones) were read correctly, because http://www.xmlsoft.org/encoding.html claims:
<cstim> "If there is no encoding declaration, then the input has to be in either UTF-8 or UTF-16, if it is not then at some point when processing the input, the converter/checker of UTF-8 form will raise an encoding error"
<cstim> (and non-ascii latin1 is neither UTF-8 nor UTF-16)
<warlord> it could be that we're just not "properly" using libxml..
<warlord> I mean, we're certainly writing out the data ourselves...
<cstim> jsled: the system-default encoding can be obtained by g_get_charset(), http://developer.gnome.org/doc/API/2.0/glib/glib-Character-Set-Conversion.html#g-get-charset
<cstim> And of course the original @encoding needs to be stored somewhere.
<jsled> why?
<cstim> compatibility to 1.8
<jsled> If we convert on the way in to utf-8, and we only ever write utf-8...
<jsled> Oh.  I don't think I care.
<cstim> no, you cannot "don't care", *yet*.
<warlord> if we specify utf-8 then 1.8 should "do the right thing", no?
<cstim> we even still don't write the xml namespaces because of compatibility to 1.8.0
<jsled> Yes, we do.
<warlord> cstim: actually in 1.9/svn we do write the namespace.
<cstim> we do? so we only have compatibility to >= 1.8.5
<warlord> cstim: yea, but 1.8.5 was released in late 2003, so I think that's okay to only have compatibility with the last 2.5-3 years backwards.
<cstim> obviously nobody really read http://wiki.gnucash.org/wiki/Release_Schedule because I raised that question there for some time now.
<warlord> 2003-09-11  Chris Lyttle  <chris at wilddev.net>
<warlord>         * configure.in: Release 1.8.6
<warlord>         * NEWS: Release 1.8.6
<warlord> Oh, I read it and thought "but we do that already"...  But didn't think to comment.  Sorry.  I was reading it for the schedule parts, not that.
<warlord> What would g_get_charset() do in that koir(sp?) environment?  Would it say "utf-8?" or "koir"?
<cstim> gnucash-1.8 uses libxml1, doesn't it?
<warlord> I think 1.8 can build against either xml1 or xml2
<cstim> warlord: I guess it would say "koi8-r"
<warlord> [warlord at cliodev build]$ ldd /opt/gnucash-1.8/lib/gnucash/libgncmod-backend-file.so | grep libxml
<warlord>         libxml.so.1 => /usr/lib/libxml.so.1 (0xb7eb4000)
<warlord> Sorry, I was wrong.  It builds against libxml1
<warlord> my xml fu is pretty low..  So, how do we tell the xml parser to convert the data for us?
<cstim> it's all on that xmlsoft.org page.
<warlord> rather, how do we tell libxml that it's really not utf-8?
<cstim> I *think* if @encoding exists then the xml parser will automatically switch to that.
<cstim> xmlSwitchEncoding() ?
* cstim just discovered that xmllint --encode UTF-8 test1.xml > utf8.xml would work as well
<warlord> http://mail.gnome.org/archives/xml/2001-July/msg00160.html
<warlord> The problem is that xmlParseDocument() will call xmlSwitchEncoding().
<warlord> And according to Dan Veillard, the encoding specified in the xml document is canonical:  http://mail.gnome.org/archives/xml/2001-July/msg00161.html
<warlord> So we would need to 'sed' the document to give it a locale.
<jsled> except... I thought we had evidence that that's not happening.
<cstim> http://www.xmlsoft.org/html/libxml-parserInternals.html#xmlSwitchEncoding isn't particularly verbose, either
<jsled> This is going to be great fun! :)
<cstim> I wonder whether these statements from 2001 are still valid.
<warlord> I don't know.. I think we need to look at the code history and see what the code says and when it was changed.
<warlord> Of course we're 5 years behind now...
<warlord> I'm afraid we might need to tell people to manually modify their data files...
<warlord> we should also make sure that g2 will automatically output utf8 even in a koi8-r locale.
<warlord> (and we should set the encoding)
<jsled> I don't think we need to have people manually modify their data files.
<jsled> Either libxml2 DOES convert to utf-8, in which case we're golden.
<cstim> re output: well, first we need to verify that 1.8 will correctly recognize an encoding="UTF-8" attribute
<jsled> Or it DOESN'T, and we determine the system-default charset and convert ourselves.
<cstim> jsled: convert in which step?
<jsled> 2.0 only  ever saves in utf-8, and sets @encoding.
<jsled> Well, easiest to let libxml do it by forcing the xmlSwitchEncoding call, I'd guess.
<cstim> internally, libxml2 has *only* utf-8
<warlord> jsled: except gnc-2.0 doesn't set the xml encoding.
<warlord> I'm fixing that now.
<cstim> warlord: how?
<jsled> except what?
<cstim> warlord: This area needs some discussion input from non-ascii people or otherwise we won't get it right again.
<warlord> when gnucash (svn) creates a data file, it does not specify the XML encoding.
<jsled> oh, sure.  That's a bug, you're apparently fixing.
<jsled> But there aren't any documents of relevance that have been saved with 2.0.
<warlord> correct, I just fixed that.  So now whenever we create an xml document, I specify encoding="utf-8"
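
A minimal sketch of the header change warlord describes here, assuming the XML declaration is still written by hand rather than by libxml. The function name format_xml_header and its buffer-based signature are hypothetical; this is not the actual code in io-gncxml-v2.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not GnuCash's write_v2_header): formats an XML
 * declaration that names its encoding. Returns 0 on success, -1 if the
 * buffer is too small or formatting fails. */
int
format_xml_header(char *buf, size_t size, const char *encoding)
{
    int n;

    assert(buf != NULL && encoding != NULL);

    n = snprintf(buf, size, "<?xml version=\"1.0\" encoding=\"%s\"?>\n",
                 encoding);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The result could then be passed to fputs() in place of the fixed fprintf() string quoted earlier in the log.
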
<warlord> We still, however, have the upgrade problem.
<jsled> yeah.
<warlord> (and I don't know what happens if the user is running in a non-utf8 locale)
<jsled> um.
<warlord> will their g2 output still be utf8?
<warlord> we might just printf() the data directly..
<jsled> wtf?
<cstim> what?!??
<warlord> Oh, never mind, we do build a DOM Tree on save.
<warlord> However we output the xml header ourselves.
<jsled> hmm. both, I guess; there's certainly a bunch of fprintfs for the framing.
<cstim> regarding encodings in general, did you read http://www.joelonsoftware.com/articles/Unicode.html ? I'd strongly recommend it.
<jsled> yeah ... most of the structures are xmlElemDump'ed
<warlord> Yes, I've read it.
<jsled> But the framing is fprintf'ed.
* jsled sighs
--- hampton|away is now known as hampton|slow
<warlord> that shouldn't be an issue..
<jsled> True.
<jsled> cstim: So, as per that discussion with kostik, it seems 1.8 saves in the system-default encoding, always; would you agree?
<cstim> jsled: yes
<jsled> Great.  So 2.0 just needs to branch on the presence of @encoding.
<warlord> Assuming libxml has the SetEncoding override, I wonder if we could just use the libxml encoding detection ala that 2001 email thread to detect and override our own.
<cstim> warlord: you mean instead of using the system's g_get_charset()?
<warlord> e.g., do something like:  enc = xmlDetectCharEncoding(start, 4);
<warlord>     if (enc != XML_CHAR_ENCODING_NONE) {
<warlord>         xmlSwitchEncoding(ctxt, locale-encoding);
<jsled> What's the other branch?
<jsled> We don't care if the encoding's present.
<warlord> jsled: correct.
<cstim> warlord: you mean xmlDetectCharEncoding instead of the system's g_get_charset?
<warlord> cstim: no
<warlord> I'm not up on the libxml API -- I'm trying to find the docs now.  What I mean is:
<warlord> if (xml has encoding specified) {
<warlord>    use xml-specified encoding;
<warlord> } else { /* xml does not specify encoding */
<warlord>   use "locale" encoding;
<warlord>   warn user to check their data;
<warlord> }
<jsled> Hmm.  Or, `else { warn user; return iconv(from=locale-encoding, to=utf-8); }`...
<cstim> jsled: no
<jsled> no?
<hampton|slow> instead of 'use "locale" encoding' I would say that we should request an encoding from the user and default the response to the locale encoding.
<cstim> jsled: iff we can get libxml2 to accept that the file is in a different encoding, then it will do the conversion itself
<jsled> cstim: ah, yah. true.
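
The if/else warlord outlines above can be sketched in plain C without touching libxml at all, assuming the first bytes of the data file are already in memory as a NUL-terminated string. Both function names are hypothetical, and the check is deliberately naive: it only looks for the word "encoding" inside the XML declaration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GnuCash or libxml: returns 1 if the
 * XML declaration at the start of `head` carries an encoding
 * pseudo-attribute, 0 otherwise. Only the declaration is considered. */
int
xml_decl_has_encoding(const char *head)
{
    const char *end;
    const char *p;

    assert(head != NULL);

    /* A declaration, if present, must be the very first thing in the file. */
    if (strncmp(head, "<?xml", 5) != 0)
        return 0;

    end = strstr(head, "?>");
    if (end == NULL)
        return 0;

    /* Accept "encoding" only if it occurs before the closing "?>". */
    p = strstr(head, "encoding");
    return (p != NULL && p < end) ? 1 : 0;
}

/* Mirrors the branch above: returns NULL when the file declares its own
 * encoding (let the parser use it), otherwise the caller-supplied locale
 * charset, for example the name reported by g_get_charset(). */
const char *
fallback_encoding(const char *head, const char *locale_charset)
{
    return xml_decl_has_encoding(head) ? NULL : locale_charset;
}
```

Whether the fallback name is then handed to libxml or shown to the user as the default answer, as hampton suggests, is left open here.
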
<warlord> cstim: it's POSSIBLE that xmlSetEncoding() will tell libxml "use this encoding dammit"
<cstim> hampton|slow: in principle yes, but in practice the user will get this question each time when 1.8 has used the file in the meantime
<warlord> but I'd need to see the source to verify.
<warlord> cstim: I think that's okay -- I can't imagine that many users will switch back and forth between 1.8 and 2.0 -- also -- we don't know if a koi8-r user on 1.8 can read a utf-8 encoded XML document with "koi8-r - translated" characters.
<hampton|slow> How many people are going to switch back and forth from 1.8 to g2? Besides, we could always cache the answer.
<cstim> the docs of libxml2 suck. Not that I'd be up for writing better docs, though.
* cstim switches back and forth between 1.8 and 2.0
<cstim> warlord: what was that with the koi8-r user? I hope we get feedback from Kostik about that question, because IFF libxml1 honors the encoding="utf-8" then it will also read it correctly.
<warlord> There's xmlSwitchEncoding() and xmlSwitchInputEncoding()
<warlord> That's a very big IFF
<warlord> But worse -- let's say that libxml1 honors it -- what's the charset of the data that gnucash will read out of the DOM tree?  will libxml1 convert the utf-8 to koi8-r?
<warlord> gnucash 1.8 will expect it in koi8-r..
<cstim> I'm not sure at all.
<warlord> me either.
<warlord> *grr*  
* warlord would like to strangle the original gnc xml authors for not dealing with this..
<cstim> I just know that my non-ascii latin1 file is being used back and forth in 1.8 and 1.9, where 1.8 has LANG=de_DE (i.e. *not* utf8) but 1.9 has LANG=de_DE.utf-8
<warlord> Then again, we didn't deal with it in 1.8, either.
<cstim> and for whatever reason it works fine so far
<warlord> that's because libxml2 can detect iso-latin1 and 'gets it right'
<warlord> it doesn't "get it right" for koi8-r
<warlord> this is discussed in that thread from 2001 that I posted earlier.
<warlord> let me find the actual message which is pretty good about it.
<cstim> actually I'm not too sure about the actual encoding in my file. I'll need to check, but I don't have time before the weekend.
<warlord> first, here's a code snippet that might be interesting to us:
<warlord> http://mail.gnome.org/archives/xml/2001-July/msg00164.html
<warlord> here's the good description of the problem:
<warlord> http://mail.gnome.org/archives/xml/2001-July/msg00168.html
* warlord will be back shortly.
<cstim> Because either the file is in latin1, which means that 1.9/libxml2 miraculously writes latin1 instead of utf8 and 1.8/libxml1 uses this without conversion, or the file is in utf8, which means 1.9 uses it without conversion and 1.8/libxml1 miraculously converts the utf8 to the system locale latin1.
<jsled> hmm.  are you using non-utf8 latin1 characters?
<cstim> I use äüöß
<cstim> German umlauts, in HTML &auml; &uuml; &ouml; &szlig;
<cstim> the encoding of those differs between latin1 and utf-8, yes.
* jsled nods
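
To make the byte-level difference cstim mentions concrete, here is a small sketch of a Latin-1 to UTF-8 conversion in plain C. The helper name latin1_to_utf8 is hypothetical; real code would more likely go through iconv or g_convert(). In Latin-1 the letter ä is the single byte 0xE4, while in UTF-8 it is the two bytes 0xC3 0xA4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts `len` bytes of ISO-8859-1 in `in` to
 * UTF-8 in `out`. `out` must have room for 2 * len + 1 bytes. Returns
 * the number of bytes written, not counting the terminating NUL. */
size_t
latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII is identical in both encodings. */
            out[n++] = c;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: a two-byte sequence. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

This shortcut works only because Latin-1 byte values coincide with Unicode code points; koi8-r has no such property and needs a real conversion table.
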
* warlord is back
<warlord> cstim: can you compare the encodings used in a file generated by 1.8 and one generated by svn?
<cstim> I can, but not before next week.
<cstim> Can someone copy the interesting part of our discussion somewhere? Maybe a wiki page?
* cstim needs to un
<cstim> s/un/run/
--- cstim is now known as cstim_away
-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available