locale issues with data format when upgrading 1.8 -> 2.0

Derek Atkins warlord at MIT.EDU
Fri Feb 3 18:00:09 EST 2006


Quoting Josh Sled <jsled at asynchronous.org>:

> On Fri, 2006-02-03 at 16:24 -0500, Derek Atkins wrote:
>> I think it's a major issue that someone in an ascii-like but
>> non-latin1 locale will get garbage during the default upgrade path.
>> libxml doesn't really provide a way to do proper detection, and 1.8
>> doesn't include an encoding in the data file.  Unfortunately the XML
>> spec says that the lack of an encoding parameter means the data is in
>> utf-8, but that's not the case in 1.8 -- the data is in whatever
>> locale the user was using.
>>
>> So, how do we solve this?
>
> We can look for the presence of the "encoding" attribute on the
> <?xml ...?> header.
>
> If present, then libxml will do the appropriate encoding conversion.

I'm not worried about the case where the encoding exists.  Yes, libxml will
do the right thing.  The problem is the case without the encoding, but
where the data isn't utf-8.
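
One cheap heuristic for that case is to check whether the raw bytes are
even well-formed UTF-8.  What follows is only a sketch in plain C, and
looks_like_utf8() is a hypothetical helper, not something that exists in
GnuCash or libxml.  A file that fails the check is definitely not UTF-8;
a file that passes is only probably UTF-8 (pure ASCII always passes).

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. */
int
looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;      /* continuation bytes expected after c */
        unsigned long cp;  /* code point being decoded */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return 0;     /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1)
            return 0;      /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

Typical single-byte text such as KOI8-R or latin1 with any non-ASCII
characters will almost always fail this check, which is what makes it
usable as a "was this really written in utf-8?" test.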

> If not, then we believe the file was written by 1.8.  As such, we
> should set libxml to believe that the encoding is the system default, as
> determined by g_get_charset():
> http://gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-get-charset
> It may require a re-parse of the file to get encoding-conversion done;
> I'm not sure when it's performed by libxml.
>
> This file [[[
>
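> #include <libxml/parser.h>
> #include <stdio.h>
>
> int
> main(int argc, char **argv)
> {
>  xmlDocPtr xml = (argc > 1) ? xmlReadFile(argv[1], NULL, 0) : NULL;
>  if (xml == NULL)
>    return 1;
>  /* xml->encoding is NULL when the header has no encoding attribute */
>  printf("encoding: [%s]\n",
>         xml->encoding ? (const char *) xml->encoding : "(none)");
>  xmlFreeDoc(xml);
>  return 0;
> }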
>
> ]]] compiled with [[[
> gcc `xml2-config --cflags` -o xml-test xml-test.c `xml2-config --libs`
> ]]] shows that (xmlDocPtr)->encoding contains what we want to know: it's
> set when <?xml [...] encoding="whatever"?> is set and NULL otherwise.

See http://mail.gnome.org/archives/xml/2001-July/msg00165.html for why
this is somewhat problematic.  "might be due to a confusion between locale
and encoding"...

Personally, I kinda like the approach in
http://mail.gnome.org/archives/xml/2001-July/msg00164.html

However, I wonder if we want to bring user input into the fray?  Should
we ask the user to choose a charset, or somehow notify the user to check
the data?  And if they check it and the conversion was wrong, what
do we do then?
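
If we do let the user pick a charset, the conversion step itself is
small.  A rough sketch using POSIX iconv() rather than libxml or glib
(my substitution for illustration; to_utf8() is a hypothetical helper,
and the charset name is whatever the user chose, e.g. "KOI8-R"):

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in[0..in_len) from charset `from` to
 * UTF-8, writing into out (capacity out_cap).  Returns the number of
 * bytes written, or (size_t)-1 on any failure. */
size_t
to_utf8(const char *from, const char *in, size_t in_len,
        char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t) -1)
        return (size_t) -1;    /* charset name not known to iconv */

    /* iconv() takes a non-const char **, but only reads the input. */
    char *src = (char *) in;
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t) -1)
        return (size_t) -1;    /* invalid input, or out is too small */

    return out_cap - dst_left;
}
```

A failure return here only catches bytes that are illegal in the chosen
charset.  A wrong but legal choice still converts "successfully" and
produces garbage, so the user would still have to look at the result.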

Also, we should really make sure that if a user is running g2 in a
non-utf8 locale, the data output really /IS/ utf8.  There are lots
of places where we're trusting libxml2 to do what we want, but have
we really verified and tested that it's actually doing what we want?

Any KOI8-R users willing to help us test?

-derek

-- 
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       warlord at MIT.EDU                        PGP key available
