One of the most fundamental topics in i18n is the concept of a character encoding or character set.[73] Computers work with numbers; people work with characters. A character encoding maps one to the other. This is simple enough. The difficulty comes, as it usually does, because of history.
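To make that mapping concrete, here is a minimal sketch (Python is used purely for illustration): a character corresponds to a number, and an encoding turns that number into concrete bytes.

    # A character is identified by a number (its code point)...
    print(ord('A'))                 # 65
    print(chr(65))                  # 'A'

    # ...and a character encoding maps characters to concrete bytes.
    print('A'.encode('ascii'))      # b'A', the single byte 0x41
    print(b'\x41'.decode('ascii'))  # 'A'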
At the time of this writing, ASCII is nearing its 45th birthday, yet we still see its legacy today. This should not surprise anyone; data is usually the longest-lived part of a computing system. Because networking protocols and storage formats are built on top of a character encoding, the encoding tends to be among the most deeply entrenched and hardest-to-change parts of a protocol stack.
ASCII, the American Standard Code for Information Interchange, was one of the first character encodings to gain widespread use; it was introduced in 1963 and first standardized in 1967. Most encodings in use today descend from ASCII.
The ASCII standard (ANSI X3.4-1986) defines 128 characters. The first 32 characters (with hex values 0 through 1F) and the last character (7F) are nonprinting control characters. The remainder (20 through 7E) are printable. The control characters have largely lost their original meaning, but the printable characters are nearly always the same. The standard ASCII table is as follows.
         x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  xA  xB  xC  xD  xE  xF
    0x  NUL SOH STX ETX EOT ENQ ACK BEL  BS  HT  LF  VT  FF  CR  SO  SI
    1x  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN  EM SUB ESC  FS  GS  RS  US
    2x   SP   !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
    3x    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
    4x    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
    5x    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
    6x    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
    7x    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ DEL
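The split between control and printable characters is easy to check in code. The following sketch (again Python, chosen only for illustration) classifies every 7-bit value using the ranges given above.

    # 0x00-0x1F and 0x7F are control codes; 0x20-0x7E are printable.
    for value in range(0x80):
        if value < 0x20 or value == 0x7F:
            print(f"0x{value:02X}  control")
        else:
            print(f"0x{value:02X}  printable  {chr(value)}")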
Although ASCII defines 128 characters and a 7-bit encoding, most computers process data in 8-bit bytes. This leaves room for 128 more characters. Of course, computer vendors each chose their own way to deal with this situation. This led to the development of numerous extended-ASCII character sets, each of which assigned its own interpretation to the upper byte values (80 through FF).
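To see how sharply these variants diverge, decode a single upper-range byte under a few of them (a Python sketch; the particular legacy code pages are arbitrary examples):

    # The same byte value means something different in each extended-ASCII set.
    raw = b'\xE9'
    for codec in ('iso-8859-1', 'cp866', 'cp437'):
        print(codec, '->', raw.decode(codec))
    # Latin-1 yields 'é'; the DOS Cyrillic and original IBM PC code pages
    # each yield a completely different character for the very same byte.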
The most widely adopted extended-ASCII standard is ISO 8859. This standard keeps the ASCII values for the first 128 characters and defines 15 different "parts," each of which gives its own meaning to the remaining 128 values. In effect, ISO 8859 defines 15 separate character sets.
The most widely used of these character sets is ISO-8859-1 (Latin-1). It provides nearly complete coverage for most Western European languages. In fact, the 256 characters defined by ISO-8859-1 correspond to the first 256 code points of Unicode. ISO-8859-1 is still in widespread use among languages that use the Latin alphabet.
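That correspondence with Unicode's first 256 code points is easy to verify (again a Python sketch): decoding any byte as ISO-8859-1 yields the Unicode code point with the same numeric value.

    # Every ISO-8859-1 byte value decodes to the code point of the same number.
    assert all(bytes([i]).decode('iso-8859-1') == chr(i) for i in range(256))
    print("bytes 0x00-0xFF line up with code points U+0000-U+00FF")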
Though the extended-ASCII character encodings were widely successful for years, they only provided a temporary fix. With so many encodings floating around, it is difficult for people to communicate. It is generally impossible to look at a sequence of bytes and determine their character encoding; that information must be carried out-of-band. The more potential character sets in use, the worse this problem becomes.
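A short sketch (Python again, with arbitrarily chosen encodings) shows why the bytes alone are not enough:

    # These bytes decode cleanly, but differently, under several single-byte
    # encodings; nothing in the data itself says which reading was intended.
    data = b'\xC0\xC1\xC2\xC3'
    for codec in ('iso-8859-1', 'iso-8859-5', 'koi8-r'):
        print(codec, '->', data.decode(codec))
    # The intended encoding must travel out-of-band: a protocol header, a
    # file format field, or a prior agreement between the two parties.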
Another problem with the use of ASCII or extended ASCII is that it has no support for bidirectional, or bidi, text. Some written languages, such as Hebrew and Arabic, are written primarily right-to-left (RTL). This causes problems in rendering systems that were designed with left-to-right (LTR) text in mind. Bidirectional text, which combines LTR and RTL within a page or paragraph, is usually impossible with ASCII or extended ASCII.
The worst limitation of the extended-ASCII model is that it still only provides support for a maximum of 256 characters. This is not nearly enough for East Asian languages (the so-called CJK or CJKV languages, for Chinese, Japanese, Korean, and Vietnamese), which rely heavily on ideographic characters and can require tens of thousands of characters for adequate coverage. There are several encodings that cover the CJKV languages specifically, but they do not solve the general problem of having too many encodings.
[73] A character set is a collection of characters (such as Unicode), while a character encoding is a mapping of a character set to a stream of bytes. For the older character sets such as ASCII, the two terms can generally be conflated.