Symbian Developer Library

SYMBIAN OS V6.1 EDITION FOR C++

[Index] [Glossary] [Previous] [Next]



Transformation formats

The UCS-2 form of the Unicode character set encodes each character as 2 bytes (16 bits total). However it does not specify which of the bytes is most significant. The byte order, or endian-ness, is left up to the discretion of a particular operating system.

While this is not important within a system, it does mean that text encoded as UCS-2 cannot easily be shared between systems using a different endian-ness. To overcome this problem the Unicode Consortium has defined two transformation formats for sharing Unicode text. The transformation formats explicitly specify byte order, and cannot be misinterpreted by computers using a different byte order.

The two transformation formats, UTF-7 and UTF-8, are described below. For the full definition of these formats, see The Unicode Standard published by The Unicode Consortium.


UTF-7

UTF-7 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of which only 7 bits are used. UTF-7 divides the set of Unicode characters into three subsets, which are encoded and transmitted differently.

[Top]


UTF-8

UTF-8 encodes and transmits Unicode characters as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded without change; the most significant bit being set to zero is a signal that they have not been changed. Unicode characters U0080 to U07FF are encoded in two bytes, the remaining Unicode characters — except for the surrogates — are encoded in three bytes. The Unicode surrogate characters are supported by the Character Conversion API, but are not currently supported by all EPOC components.

A variant of UTF-8 used internally by Java differs from standard UTF-8 in two ways. First, the specific case of the NULL character (0x0000) is encoded in the two-byte format, and second, only the one-, two- and three-byte formats are used, not the four-byte format which is normally used for Unicode surrogate-pairs. An argument to ConvertFromUnicodeToUtf8 controls whether the UTF-8 generated by this is the Java variant. Support for this was removed in v6.0.