Gnome2 UTF-8 handling
Reinke Bonte
reinke.bonte at web.de
Thu Dec 4 05:11:17 CST 2003
> >
> > > 2.) Get rid of the "wide character set" and use utf-8 for the
> > > user I/O as well as the internal calculations.
> >
> > #2 is the correct option. We should just keep everything in UTF8.
>
> Agreed, 100%.
Unless my understanding is completely wrong, you have to choose the
first option and stick with wide characters.
There is no contradiction between "wide characters" and UTF-8. In fact,
you need "wide characters" to handle UTF-8 encoded strings properly,
because a single character may occupy several bytes. Therefore #2 is
not an option.
This is what my libc documentation says:
[...]
UTF-8 is an ASCII compatible encoding where ASCII characters are
represented by ASCII bytes and non-ASCII characters by sequences of 2-6
non-ASCII bytes, and finally UTF-16 is an extension of UCS-2 in which
pairs of certain UCS-2 words can be used to encode non-BMP characters up
to 0x10ffff.
To represent wide characters the char type is not suitable. For this
reason the ISO C standard introduces a new type which is designed to
keep one character of a wide character string. To maintain the
similarity there is also a type corresponding to int for those functions
which take a single wide character.
Data type: wchar_t
This data type is used as the base type for wide character strings.
I.e., arrays of objects of this type are the equivalent of char[] for
multibyte character strings. The type is defined in `stddef.h'.
[...]
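To illustrate why bytes and characters are not the same thing in UTF-8, here is a minimal sketch in C. It is not GnuCash code; the names utf8_decode and utf8_char_count are made up for this example, and the decoder is simplified (it does not reject overlong forms or surrogates, and it uses uint32_t rather than wchar_t so that it does not depend on the platform's wchar_t width or locale).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most n bytes available. Stores the code point in *out and returns the
 * number of bytes consumed, or 0 if the sequence is malformed or truncated.
 * Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len, i;
    uint32_t cp;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */
    if (len > n)
        return 0;                   /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = cp;
    return len;
}

/* Hypothetical helper: count the characters (code points) in a
 * NUL-terminated UTF-8 string. Returns (size_t)-1 on malformed input. */
static size_t utf8_char_count(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t n = strlen(str), count = 0;

    while (n > 0) {
        uint32_t cp;
        size_t used = utf8_decode(s, n, &cp);
        if (used == 0)
            return (size_t)-1;
        s += used;
        n -= used;
        count++;
    }
    return count;
}
```

For the UTF-8 string "Gr\xC3\xBC\xC3\x9F" "e" (the German word "Grüße"), strlen() reports 7 bytes while utf8_char_count() reports 5 characters. Any code that indexes, truncates or aligns text by counting char elements runs into this difference.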
>
> > The hard part is going to be converting the existing XML and
> > database data from whatever it's currently using to UTF8.
>
> We don't currently include an "encoding" in the XML data file. That
> could be used as a trigger to ask the user for the old encoding and
> then convert the data to UTF-8. A nice touch would be to scan the
> file first looking for any characters with the high order bit set to
> see if conversion is needed in the first place.
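The scan proposed in the quoted text could be sketched roughly as follows. This is only an illustration in C; has_high_bit and file_needs_conversion are hypothetical names, not existing GnuCash or libxml functions, and the check only tells you something if the file really contains raw 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if any byte in buf has its
 * high-order bit set (i.e. is not plain ASCII), else 0. */
static int has_high_bit(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (buf[i] & 0x80)
            return 1;
    return 0;
}

/* Hypothetical helper: scan a whole file in chunks.
 * Returns 1 if a non-ASCII byte was found, 0 if the file is pure
 * ASCII, and -1 if the file could not be opened or read. */
static int file_needs_conversion(const char *path)
{
    unsigned char buf[4096];
    size_t n;
    int found = 0;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while (!found && (n = fread(buf, 1, sizeof buf, fp)) > 0)
        found = has_high_bit(buf, n);
    if (ferror(fp))
        found = -1;
    fclose(fp);
    return found;
}
```

A result of 0 means the file is pure 7-bit ASCII at the byte level. It does not by itself prove that no conversion is needed, since non-ASCII characters may be stored in escaped form.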
I don't know about the database data, but the XML file is a complete
mess. You will not find any high-order bit set in the XML file, because
libxml has converted everything into HTML entities. Unfortunately, these
are the wrong entities for every encoding other than Latin1. The XML
file therefore has to be recoded manually, as I have already described
twice on this mailing list.
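To show what such a recoding involves, here is a minimal sketch in C under two assumptions that are mine and not taken from the file format: that the entities are numeric character references such as &#164;, and that the data was originally typed in ISO-8859-15 (Latin-9). All function names are made up for the example. Other source encodings would need a different mapping table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a reference of the form "&#NNN;" or
 * "&#xHHH;". Stores the value in *cp and returns 1 on success, 0 if
 * the text is not a numeric character reference. Simplified parser. */
static int parse_char_ref(const char *s, uint32_t *cp)
{
    char *end;
    unsigned long v;
    int base = 10;

    assert(s != NULL && cp != NULL);
    if (s[0] != '&' || s[1] != '#')
        return 0;
    s += 2;
    if (*s == 'x' || *s == 'X') {
        base = 16;
        s++;
    }
    if (!isxdigit((unsigned char)*s))
        return 0;
    v = strtoul(s, &end, base);
    if (end == s || *end != ';' || v > 0x10FFFF)
        return 0;
    *cp = (uint32_t)v;
    return 1;
}

/* Map one ISO-8859-15 byte to its Unicode code point. Latin-9 differs
 * from Latin-1 in exactly eight positions; all others map to themselves. */
static uint32_t latin9_to_unicode(unsigned char b)
{
    switch (b) {
    case 0xA4: return 0x20AC; /* EURO SIGN */
    case 0xA6: return 0x0160; /* LATIN CAPITAL LETTER S WITH CARON */
    case 0xA8: return 0x0161; /* LATIN SMALL LETTER S WITH CARON */
    case 0xB4: return 0x017D; /* LATIN CAPITAL LETTER Z WITH CARON */
    case 0xB8: return 0x017E; /* LATIN SMALL LETTER Z WITH CARON */
    case 0xBC: return 0x0152; /* LATIN CAPITAL LIGATURE OE */
    case 0xBD: return 0x0153; /* LATIN SMALL LIGATURE OE */
    case 0xBE: return 0x0178; /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
    default:   return b;
    }
}

/* Repair a code point that was written under the wrong Latin1
 * assumption. Values above 0xFF cannot have come from a single
 * 8-bit byte and are returned unchanged. */
static uint32_t fix_reference(uint32_t cp)
{
    return cp <= 0xFF ? latin9_to_unicode((unsigned char)cp) : cp;
}
```

With these assumptions, &#164; (the Latin1 currency sign) is parsed as 0xA4 and repaired to U+20AC, the euro sign the user actually typed, while &#228; stays U+00E4 because Latin1 and Latin-9 agree there.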
Reinke