Character Encodings

One of the most fundamental topics in i18n is the concept of a character encoding or character set.[73] Computers work with numbers; people work with characters. A character encoding maps one to the other. This is simple enough. The difficulty comes, as it usually does, because of history.
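As a minimal sketch of that mapping, the following Python snippet converts characters to numbers and back. It uses only the built-in `ord`, `chr`, and `str.encode`; the sample string and variable names are chosen here for illustration.

```python
# Map characters to numbers and back with Python's built-ins.
text = "Hi!"

# ord() gives the number assigned to each character.
code_points = [ord(ch) for ch in text]   # [72, 105, 33]

# chr() is the inverse: it turns each number back into a character.
round_trip = "".join(chr(n) for n in code_points)

# encode() applies a named encoding to produce the bytes that would be
# stored or sent over the wire. For ASCII, each byte equals the number
# assigned to the character.
ascii_bytes = text.encode("ascii")
```

For this string, the byte values in `ascii_bytes` are the same numbers as `code_points`, which is the sense in which an encoding maps characters to numbers.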

At the time of this writing, ASCII is nearing its 45th birthday, yet we still see its legacy today. This should not surprise anyone; data is usually the most long-lived part of a computing system. Because networking protocols and storage formats are built on top of a character encoding, the encoding tends to be among the most deeply entrenched and hardest-to-change parts of a protocol stack.

ASCII, the American Standard Code for Information Interchange, was one of the first character encodings to gain widespread use; it was first published as a standard in 1963 and substantially revised in 1967. Most encodings in use today descend from ASCII.

The ASCII standard (ANSI X3.4-1986) defines 128 characters. The first 32 characters (with hex values 00 through 1F) and the last character (7F) are nonprinting control characters. The remainder (20 through 7E) are printable, beginning with the space character at 20. Many of the control characters have lost their original meaning, but the printable characters are nearly always the same. The standard ASCII table is as follows.

   x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  xA  xB  xC xD xE xF
0x NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF CR SO SI
1x DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS GS RS US
2x SP  !   "   #   $   %   &   '   (   )   *   +   ,   -  .  /
3x 0   1   2   3   4   5   6   7   8   9   :   ;   <   =  >  ?
4x @   A   B   C   D   E   F   G   H   I   J   K   L   M  N  O
5x P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]  ^  _
6x `   a   b   c   d   e   f   g   h   i   j   k   l   m  n  o
7x p   q   r   s   t   u   v   w   x   y   z   {   |   }  ~ DEL
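The split between control and printable characters described above can be checked with a short Python sketch. The helper name `is_ascii_control` is hypothetical and introduced only for this illustration; the ranges come directly from the text (00 through 1F and 7F are control characters).

```python
def is_ascii_control(code: int) -> bool:
    """Return True if code is an ASCII control character (00-1F or 7F)."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not an ASCII code: %r" % code)
    return code <= 0x1F or code == 0x7F

# Partition all 128 ASCII codes into the two groups.
control = [c for c in range(128) if is_ascii_control(c)]
printable = [c for c in range(128) if not is_ascii_control(c)]

# Rebuild one row of the table: codes 40 through 4F.
row_4 = "".join(chr(c) for c in range(0x40, 0x50))   # "@ABCDEFGHIJKLMNO"
```

Here `control` has 33 entries and `printable` has 95, running from the space character (20) to the tilde (7E).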


[73] A character set is a collection of characters (such as Unicode), while a character encoding is a mapping of a character set to a stream of bytes. For the older character sets such as ASCII, the two terms can generally be conflated.