Gnome2 UTF-8 handling
reinke.bonte at web.de
Thu Dec 4 05:11:17 CST 2003
> > > 2.) Get rid of the "wide character set" and use utf-8 for the
> > > user I/O as well as the internal calculations.
> > #2 is the correct option. We should just keep everything in UTF8.
> Agreed, 100%.
If my understanding is not completely wrong, you have to choose the
first option and stick with wide characters.
There is no contradiction between "wide characters" and UTF-8. In fact,
you need to use "wide characters" to properly handle UTF-8 encoded
strings. Therefore #2 is not an option.
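To make the distinction between encoded bytes and characters concrete, here is a minimal sketch. Python is used only for illustration: its `str` type stands in for a wide character string and `bytes` for the UTF-8 encoded form. The sample text and variable names are assumptions, not anything from the code base.

```python
# A str is a sequence of code points (comparable to a wide character string);
# bytes holds the UTF-8 encoded form (a multibyte string).
text = "Grüße"                      # hypothetical sample text
encoded = text.encode("utf-8")

char_count = len(text)              # number of characters
byte_count = len(encoded)           # number of bytes in the UTF-8 form

# Slicing the byte string can cut a multibyte sequence in half.
truncated = encoded[:3]             # ends in the middle of the two-byte "ü"
try:
    truncated.decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

# Slicing the decoded string always falls on a character boundary.
first_three = text[:3]
```

Operations that count or split characters work on the decoded form, while the encoded form is what gets read and written.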
This is what my libc documentation says:
UTF-8 is an ASCII compatible encoding where ASCII characters are
represented by ASCII bytes and non-ASCII characters by sequences of 2-6
non-ASCII bytes, and finally UTF-16 is an extension of UCS-2 in which
pairs of certain UCS-2 words can be used to encode non-BMP characters up
To represent wide characters the char type is not suitable. For this
reason the ISO C standard introduces a new type which is designed to
keep one character of a wide character string. To maintain the
similarity there is also a type corresponding to int for those functions
which take a single wide character.
Data type: wchar_t
This data type is used as the base type for wide character strings.
I.e., arrays of objects of this type are the equivalent of char[] for
multibyte character strings. The type is defined in `stddef.h'.
> > The hard part is going to be converting the existing XML and
> > database data from whatever it's currently using to UTF8.
> We don't currently include an "encoding" in the XML data file. That
> could be used as a trigger to ask the user for the old encoding and
> then convert the data to UTF-8. A nice touch would be to scan the
> file first looking for any characters with the high order bit set to
> see if conversion is needed in the first place.
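The scan proposed in the quoted text could look roughly like the sketch below. The function names and the chunk size are hypothetical, and the check only helps for data that still contains raw 8-bit bytes.

```python
def needs_conversion(data: bytes) -> bool:
    """Return True if any byte has the high-order bit set (value >= 0x80)."""
    return any(b & 0x80 for b in data)


def file_needs_conversion(path: str, chunk_size: int = 65536) -> bool:
    """Scan a file in chunks and stop at the first non-ASCII byte."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if needs_conversion(chunk):
                return True
```

A pure ASCII file is already valid UTF-8, so a negative result would mean no conversion is required.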
I don't know about database data, but the XML file is a complete mess.
You will not find any high order bit set in the XML file, because libxml
has converted everything into HTML entities. Unfortunately, these are the
wrong entities for every encoding != Latin1. A manual recoding of the
XML file is therefore necessary, as I have described twice on this mailing list.
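One possible way to automate such a recoding is sketched below. It is not the procedure described earlier on the list, and it rests on several assumptions: the file contains numeric character references such as `&#xB3;` or `&#179;`, each number in the range 0x80 to 0xFF is really a raw byte of the original text, and the original encoding is a single-byte one such as ISO-8859-2. The function name `recode_entities` is hypothetical.

```python
import re

# Matches numeric character references in hex (&#xB3;) or decimal (&#179;) form.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def recode_entities(xml_text: str, real_encoding: str) -> str:
    """Reinterpret references 0x80..0xFF as bytes of real_encoding.

    Assumes a single-byte encoding; raises UnicodeDecodeError if a byte
    is undefined in that encoding.
    """
    def fix(match):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 0x80 <= value <= 0xFF:
            # Treat the number as a raw byte and decode it properly.
            return bytes([value]).decode(real_encoding)
        # Leave every other reference untouched.
        return match.group(0)

    return _CHAR_REF.sub(fix, xml_text)
```

For example, `recode_entities("Z&#xB3;oty", "iso-8859-2")` yields `"Złoty"`. The result would then have to be written back as UTF-8 with a matching encoding declaration in the XML header.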