The UCS-2 form of the Unicode character set encodes each character as 2 bytes (16 bits). However, it does not specify which of the two bytes is the most significant. The byte order, or endianness, is left to the discretion of each operating system.
While this does not matter within a single system, it does mean that text encoded as UCS-2 cannot easily be shared between systems of different endianness. To overcome this problem, the Unicode Consortium has defined two transformation formats for sharing Unicode text. Because these formats are defined as unambiguous sequences of bytes, they cannot be misinterpreted by computers using a different byte order.
The two transformation formats, UTF-7 and UTF-8, are described below. For the full definition of these formats, see The Unicode Standard published by The Unicode Consortium.
UTF-7 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of which only 7 bits are used. UTF-7 divides the set of Unicode characters into three subsets, which are encoded and transmitted differently.
Set D is the set of characters which are encoded as a single byte. It includes the upper and lower case letters A to Z, the numeric digits, and nine other characters.
Set O includes the characters ! " # $ % & * ; < = > @ [ ] ^ _ { | }. These characters can be encoded as a single byte, or with the modified base 64 encoding used for set B characters. When encoded as a single byte, set O characters can be misinterpreted by some applications; encoding them as modified base 64 avoids this problem.
Set B comprises the remaining characters. These are encoded using a modified form of base 64: an escape byte ('+') introduces a run of base 64 data, which encodes the characters' 16-bit values and is terminated by '-' or by any character outside the base 64 alphabet.
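As a rough illustration of the modified base 64 encoding (this is generic C++, not part of the EPOC Character Conversion API, and EncodeSetB is a hypothetical helper), the following sketch encodes a single set B character: its 16-bit value is split into three 6-bit groups, each mapped to a base 64 digit, and the result is wrapped in '+' and '-'. For example, the pound sign (U+00A3) encodes as "+AKM-".

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Modified base 64 alphabet used by UTF-7.
    static const char kBase64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical helper: encode one set B character (a single 16-bit code
    // unit) as '+', the modified base 64 of its big-endian bytes, then '-'.
    std::string EncodeSetB(uint16_t codeUnit)
    {
        char b64[3];
        b64[0] = kBase64[(codeUnit >> 10) & 0x3F]; // top 6 bits
        b64[1] = kBase64[(codeUnit >> 4) & 0x3F];  // middle 6 bits
        b64[2] = kBase64[(codeUnit << 2) & 0x3F];  // bottom 4 bits, zero padded
        return std::string("+") + b64[0] + b64[1] + b64[2] + "-";
    }

    int main()
    {
        std::cout << EncodeSetB(0x00A3) << "\n"; // POUND SIGN: prints "+AKM-"
        return 0;
    }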
UTF-8 encodes and transmits Unicode characters as a string of 8-bit bytes. The ASCII characters 0 to 127 are encoded without change; a most significant bit of zero signals that the byte is an unchanged ASCII character. Unicode characters U+0080 to U+07FF are encoded in two bytes, and the remaining Unicode characters, except for the surrogates, are encoded in three bytes. The Unicode surrogate characters are supported by the Character Conversion API, but are not currently supported by all EPOC components.
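These byte patterns can be summarised in a short sketch. This is generic C++ rather than the EPOC conversion API, and it assumes the input is a single UCS-2 code unit with surrogates excluded; it simply applies the one-, two- and three-byte formats described above.

    #include <cstdint>
    #include <vector>

    // Encode one UCS-2 code unit (U+0000 to U+FFFF, surrogates excluded)
    // as standard UTF-8.
    std::vector<uint8_t> EncodeUtf8(uint16_t codeUnit)
    {
        std::vector<uint8_t> bytes;
        if (codeUnit <= 0x007F)
        {
            // One byte: 0xxxxxxx (unchanged ASCII).
            bytes.push_back(static_cast<uint8_t>(codeUnit));
        }
        else if (codeUnit <= 0x07FF)
        {
            // Two bytes: 110xxxxx 10xxxxxx.
            bytes.push_back(static_cast<uint8_t>(0xC0 | (codeUnit >> 6)));
            bytes.push_back(static_cast<uint8_t>(0x80 | (codeUnit & 0x3F)));
        }
        else
        {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
            bytes.push_back(static_cast<uint8_t>(0xE0 | (codeUnit >> 12)));
            bytes.push_back(static_cast<uint8_t>(0x80 | ((codeUnit >> 6) & 0x3F)));
            bytes.push_back(static_cast<uint8_t>(0x80 | (codeUnit & 0x3F)));
        }
        return bytes;
    }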
A variant of UTF-8 used internally by Java differs from standard UTF-8 in two ways. First, the NULL character (0x0000) is encoded in the two-byte format, so that encoded strings contain no zero bytes. Second, only the one-, two- and three-byte formats are used, not the four-byte format normally used for characters represented by Unicode surrogate pairs. An argument to ConvertFromUnicodeToUtf8 controls whether the UTF-8 it generates is the Java variant. Support for this was removed in v6.0.
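The NULL special case can be illustrated with a short sketch. Again, this is generic C++, not the EPOC or Java APIs; it reuses EncodeUtf8() from the sketch above and shows only the first difference, the two-byte form for NULL.

    // Java-variant encoding of a single UCS-2 code unit. Reuses EncodeUtf8()
    // from the previous sketch for everything except NULL.
    std::vector<uint8_t> EncodeJavaUtf8(uint16_t codeUnit)
    {
        if (codeUnit == 0x0000)
        {
            return { 0xC0, 0x80 };   // two-byte form reserved for NULL
        }
        return EncodeUtf8(codeUnit); // all other code units as in standard UTF-8
    }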