Character sets and Unicode

Textual data in electronic devices is stored in terms of a character set. A character set is a group of characters, each of which is encoded as a different number. The appearance of each character is not a property of the character set, but rather of the font. So a character may be rendered using many different glyphs, but will always have the same numeric value within its character set. Other properties which can also be included in a character set’s definition are the direction of writing, and the way in which sets of characters are combined.
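As a minimal sketch of the point above, the following Python fragment shows that a character's number depends on the character set and not on how it is drawn. The helper name `code_in`, the sample characters and the choice of codecs (ISO 8859-5 and KOI8-R) are assumptions made for illustration; they do not come from this document.

```python
# Illustrative sketch only: code_in and the sample characters are hypothetical.

def code_in(char: str, charset: str) -> int:
    """Return the number a single-byte character set assigns to char."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{charset} does not encode {char!r} in one byte")
    return encoded[0]

# The letter A has the same number whatever font is used to draw it.
latin_a = ord("A")

# The Cyrillic capital letter YA gets a different number in each character set.
ya = "\u042f"
ya_iso = code_in(ya, "iso8859_5")   # number in ISO 8859-5
ya_koi = code_in(ya, "koi8_r")      # number in KOI8-R
ya_unicode = ord(ya)                # number in Unicode

print(latin_a, hex(ya_iso), hex(ya_koi), hex(ya_unicode))
```

The same letter therefore has to be interpreted together with the name of its character set; the number alone is ambiguous.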

Character sets, and the ways of encoding them, have proliferated as computers and communicators have gained acceptance throughout the world. This led to the development of Unicode, an international standard character set defined by the Unicode Consortium (http://www.unicode.org). Unicode encompasses all commonly used character sets, including Eastern ideograms, in a single character set.

Most Western character sets, including Cyrillic, Hebrew and Arabic, are encoded by one 8-bit byte per character. Eastern character sets often use a variable-length encoding, in which a character occupies one or more bytes. In the UTF-16 encoding of Unicode, each character is encoded in two 8-bit bytes, except for characters outside the Basic Multilingual Plane, which are encoded as a surrogate pair occupying four bytes.
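The byte counts described above can be checked with a short Python sketch. It assumes that the two-byte Unicode form is UTF-16, and it uses Shift JIS as one example of a variable-length Eastern encoding. The helper names `encoded_length` and `utf16_units`, and the sample characters, are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: helper names and sample characters are hypothetical.

def encoded_length(char: str, encoding: str) -> int:
    """Return how many 8-bit bytes the encoding uses for one character."""
    return len(char.encode(encoding))

def utf16_units(char: str) -> list[int]:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = char.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# Single-byte character sets: one byte per character.
single_byte = [
    ("A", "latin-1"),          # Latin capital letter A
    ("\u042f", "iso8859_5"),   # Cyrillic capital letter YA
    ("\u05d0", "iso8859_8"),   # Hebrew letter ALEF
    ("\u0628", "iso8859_6"),   # Arabic letter BEH
]
for char, encoding in single_byte:
    print(encoding, encoded_length(char, encoding))

# Variable-length Eastern encoding: one or two bytes in Shift JIS.
print("shift_jis", encoded_length("A", "shift_jis"),
      encoded_length("\u65e5", "shift_jis"))

# UTF-16: two bytes for most characters, four for a surrogate pair.
print("utf-16", encoded_length("\u65e5", "utf-16-be"),
      encoded_length("\U0001F600", "utf-16-be"))
print([hex(unit) for unit in utf16_units("\U0001F600")])
```

The last line prints the two surrogate code units that together represent a single character, which is why code that counts 16-bit units cannot assume one unit per character.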