Gnome2 UTF-8 handling

Thu Dec 4 05:11:17 CST 2003

> > 
> > > 2.)  Get rid of the "wide character set" and use utf-8 for the
> > > user I/O as well as the internal calculations.
> > 
> > #2 is the correct option.  We should just keep everything in UTF8.
> 
> Agreed, 100%.

If my understanding is not completely wrong, you have to choose the
first option and stick with wide characters.

There is no contradiction between "wide characters" and UTF-8. In fact,
you need to use "wide characters" to properly handle UTF-8 encoded
strings. Therefore #2 is not an option.

This is what my libc documentation says:

[...]

UTF-8 is an ASCII compatible encoding where ASCII characters are
represented by ASCII bytes and non-ASCII characters by sequences of 2-6
non-ASCII bytes, and finally UTF-16 is an extension of UCS-2 in which
pairs of certain UCS-2 words can be used to encode non-BMP characters up
to 0x10ffff.

To represent wide characters the char type is not suitable. For this
reason the ISO C standard introduces a new type which is designed to
keep one character of a wide character string. To maintain the
similarity there is also a type corresponding to int for those functions
which take a single wide character.

Data type: wchar_t
    This data type is used as the base type for wide character strings.
I.e., arrays of objects of this type are the equivalent of char[] for
multibyte character strings. The type is defined in `stddef.h'.

[...]
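
To make the relation between the two concrete, here is a minimal sketch
of decoding a UTF-8 byte string into an array of wchar_t. The helper
name utf8_to_wide is made up for this mail and is not GnuCash or libc
code. It assumes that wchar_t is wide enough to hold any Unicode code
point (true on glibc, where it is 32 bits) and it does not reject
overlong forms or surrogates. With a UTF-8 locale set, the standard
mbstowcs() function should do the same job.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into wide
 * characters. Returns the number of wide characters written, not
 * counting the terminating L'\0', or (size_t)-1 if the input is
 * malformed or dest is too small.
 * Assumption: wchar_t can hold any code point up to 0x10FFFF. */
size_t utf8_to_wide(const char *src, wchar_t *dest, size_t dest_len)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    if (dest_len == 0)
        return (size_t)-1;

    while (*p) {
        unsigned long cp;
        int extra;  /* number of continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else
            return (size_t)-1;  /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* Every continuation byte must look like 10xxxxxx. A NUL
             * byte fails this check, so we never read past the end. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }

        if (n + 1 >= dest_len)  /* keep room for the terminator */
            return (size_t)-1;
        dest[n++] = (wchar_t)cp;
    }

    dest[n] = L'\0';
    return n;
}
```

For the string "Grüße" the UTF-8 form is 7 bytes long, while the wide
string has 5 elements, one per character. That difference is why
per-character operations are easier on the wide form.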

> 
> > The hard part is going to be converting the existing XML and
> > database data from whatever it's currently using to UTF8.
> 
> We don't currently include an "encoding" in the XML data file.  That
> could be used as a trigger to ask the user for the old encoding and
> then convert the data to UTF-8.  A nice touch would be to scan the
> file first looking for any characters with the high order bit set to
> see if conversion is needed in the first place.

I don't know about the database data, but the XML file is a complete
mess. You will not find any byte with the high-order bit set in the XML
file, because libxml has converted everything into HTML entities.
Unfortunately, these are the wrong entities for every encoding other
than Latin1. In that case the XML file has to be recoded manually, as I
have already described twice on this mailing list.
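
A rough sketch of what such a pre-scan could look like, so that it does
not rely on the high-order bit alone. Both function names are made up
for this mail. I am assuming here that the entities show up as numeric
character references, either decimal (&#246;) or hexadecimal (&#xF6;);
please check an actual data file before relying on that.

```c
#include <stddef.h>

/* Return 1 if any byte in buf has its high-order bit set, else 0. */
int has_high_bit_byte(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] & 0x80)
            return 1;
    return 0;
}

/* Value of digit c in the given base (10 or 16), or -1 if c is not a digit. */
static int digit_value(char c, int base)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: count numeric character references of the form
 * &#NNN; or &#xHH; whose value is above 127, i.e. references that
 * stand for non-ASCII characters. */
size_t count_non_ascii_char_refs(const char *buf, size_t len)
{
    size_t i = 0, count = 0;

    while (i + 2 < len) {
        size_t j;
        unsigned long value = 0;
        int base = 10, digits = 0;

        if (buf[i] != '&' || buf[i + 1] != '#') {
            i++;
            continue;
        }

        j = i + 2;
        if (j < len && (buf[j] == 'x' || buf[j] == 'X')) {
            base = 16;
            j++;
        }
        while (j < len) {
            int d = digit_value(buf[j], base);
            if (d < 0)
                break;
            if (value <= 0x10FFFF)  /* stop growing to avoid overflow */
                value = value * (unsigned long)base + (unsigned long)d;
            digits++;
            j++;
        }
        if (digits > 0 && j < len && buf[j] == ';' && value > 127)
            count++;
        i = j;  /* j >= i + 2, so the loop always makes progress */
    }
    return count;
}
```

If has_high_bit_byte() returns 0 but count_non_ascii_char_refs() is
greater than zero, the file is pure ASCII on disk and still contains
non-ASCII characters, so the question of the original encoding remains.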

Reinke