When we give a char
variable the value 'A'
, what exactly is
that value?
We’ve already alluded to the fact that there is some kind of encoding going on—remember that we mentioned the IBM-derived Latin1 scheme when we were discussing escaped character literals.
Computers work with binary values, typically made up of one or more bytes, and we clearly need some kind of mapping between the binary values in these bytes and the characters we want them to represent. We’ve all got to agree on what the binary values mean, or we can’t exchange information. To that end, the American Standards Association convened a committee in the 1960s which defined (and then redefined, tweaked, and generally improved over subsequent decades) a standard called ASCII (pronounced ass[22]-key): the American Standard Code for Information Interchange.
This defined 128 characters, represented using 7 bits of a byte. The first 32 values, 0x00–0x1F, and also the very last value, 0x7F, are called control characters, and include things like the tab character (0x09), backspace (0x08), bell (0x07), and delete (0x7F).
The rest are called the printable characters, and include space (0x20), which is not a control character, but a “blank” printable character; all the upper and lowercase letters; and most of the punctuation marks in common use in English.
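In C#, this mapping is easy to see for yourself, because for these characters the numeric value of a char is the same as its ASCII code; casting between char and int shows it. (A quick illustration, not one of this chapter's numbered examples.)

char letterA = 'A';
Console.WriteLine((int)letterA);           // 65, i.e. 0x41, the ASCII code for 'A'
Console.WriteLine((char)0x09 == '\t');     // True - 0x09 is the tab control character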
This was a start, but it rapidly became apparent that ASCII did not have enough characters to deal with a lot of the common Western (“Latin”) scripts; the accented characters in French, or Spanish punctuation marks, for example. It also lacked common characters like the international copyright symbol ©, or the registered trademark symbol ®.
Since ASCII uses only 7 bits, and most computers use 8-bit bytes, the obvious solution was to put the necessary characters into byte values not used by ASCII. Unfortunately, different mappings between byte values and characters emerged in different countries. These mappings are called code pages. If you bought a PC in, say, Norway, it would use a code page that offered all of the characters required to write in Norwegian, but if you tried to view the same file on a PC bought in Greece, the non-ASCII characters would look like gibberish because the PC would be configured with a Greek code page. IBM defined Latin-1 (much later updated and standardized as ISO-8859-1) as a single code page that provides most of what is required by most of the European languages that use Latin letters. Microsoft defined the Windows-1252 code page, which is mostly (but not entirely) compatible. Apple defined the Mac-Roman encoding, which has the same goal, but is completely different again.
All of these encodings were designed to provide a single solution for Western European scripts, but they all fall short in various different ways—Dutch, for example, is missing some of its diphthongs. This is largely because 8 bits just isn’t enough to cover all possible characters in all international languages. Chinese alone has well over 100,000 characters.
In the late 1980s and early 1990s, standardization efforts were underway to define an internationally acceptable encoding that would allow characters from all scripts to be represented in a reasonably consistent manner. This became the Unicode standard, and is the one that is in use in the .NET Framework.
Unicode is a complex standard, as might be expected from something that is designed to deal with all current (and past) human languages, and have sufficient flexibility to deal with most conceivable future changes, too. It uses numbers to define more than 1 million code points in a codespace. A code point is roughly analogous to a character in other encodings, and the standard formally defines special categories of code points, such as graphic characters, format characters, and control characters. It’s possible to represent a sequence of code points as a sequence of 16-bit values.
You might be wondering how we can handle more than 1 million characters, when there are only 65,536 different values for 16-bit numbers. The answer is that we can team up pairs of characters. The first is called a high surrogate; if this is then followed by a low surrogate, it defines a character outside the normal 16-bit range.
Unicode also defines complex ways of combining characters. Characters and their diacritical marks can appear consecutively in a string, with the intention that they become combined in their ultimate visual representation; or you can use multiple characters to define special ligatures (characters that are joined together, like Æ).
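For example, an e followed by the combining acute accent (U+0301) is a different sequence of char values from the single precomposed é (U+00E9), even though both normally display identically. The following sketch (ours, not one of the chapter's numbered examples) uses the string type's Normalize method to reconcile the two forms:

string composed = "\u00E9";       // é as one precomposed character
string decomposed = "e\u0301";    // e followed by a combining acute accent
Console.WriteLine(composed == decomposed);              // False - different sequences of chars
Console.WriteLine(decomposed.Length);                   // 2
Console.WriteLine(composed == decomposed.Normalize());  // True - normalization recomposes the pair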
The .NET Framework Char, then, is one of these 16-bit values: a single code unit, which for most characters is a whole code point (surrogate pairs being the exception).
This encoding is called UTF-16, and is the common in-memory representation for strings in most modern platforms. Throughout the Windows API, this format is referred to as “Unicode”. This is somewhat imprecise, as there are numerous different Unicode formats. But since none were in widespread use at the time Windows first introduced Unicode support, Microsoft apparently felt that “UTF-16” was an unnecessarily confusing name. But in general, when you see “Unicode” in either Windows or the .NET Framework, it means UTF-16.
From that, we can see that those IsNumber
, IsLetter
, IsHighSurrogate
, and
IsLowSurrogate
methods correspond to tests for particular Unicode categories.
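To see a surrogate pair in action, we can build a string from a code point outside the 16-bit range—the musical G clef symbol, U+1D11E, will do—using char.ConvertFromUtf32. (This snippet is just for illustration; it isn't one of the chapter's numbered examples.)

string clef = char.ConvertFromUtf32(0x1D11E);        // outside the 16-bit range
Console.WriteLine(clef.Length);                      // 2 - it takes two 16-bit values
Console.WriteLine(char.IsHighSurrogate(clef[0]));    // True
Console.WriteLine(char.IsLowSurrogate(clef[1]));     // True
Console.WriteLine(char.IsLetter('É'));               // True
Console.WriteLine(char.IsNumber('3'));               // True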
You may ask: why do we need to know about encodings when “it just works”? That’s all very well for our in-memory representation of a string, but what happens when we save some text to disk, encrypt it, or send it across the Web as HTML? We may not want the 16-bit Unicode encoding we’ve got in memory, but something else. These encodings are really information interchange standards, as much as they are internal choices about how we represent strings.
Most XML documents, for example, are encoded using the UTF-8 encoding. This is an encoding that lets us represent any character in the Unicode codespace, and is compatible with ASCII for the characters in the 7-bit set. It achieves this by using variable-length characters: a single byte for the ASCII range, and two to four bytes for the rest. It takes advantage of special marker values (with the high bit set) to indicate the start of these multibyte sequences.
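You can see the variable-length behavior by asking the UTF-8 encoding how many bytes it needs for various characters. (A quick sketch, assuming a using directive for System.Text as in the rest of this chapter's examples; the sample characters are our own choices.)

Console.WriteLine(Encoding.UTF8.GetByteCount("a"));     // 1 byte  - ASCII range
Console.WriteLine(Encoding.UTF8.GetByteCount("É"));     // 2 bytes
Console.WriteLine(Encoding.UTF8.GetByteCount("中"));    // 3 bytes
Console.WriteLine(Encoding.UTF8.GetByteCount(
    char.ConvertFromUtf32(0x1D11E)));                   // 4 bytes - outside the 16-bit range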
While UTF-8 and ASCII are compatible in the sense that any file that contains ASCII text happens to be a valid UTF-8 file (and has the same meaning whether you interpret it as ASCII or UTF-8), there are two caveats. First, a lot of people are sloppy with their terminology and will describe any old 8-bit text encoding as ASCII, which is wrong. ASCII is strictly 7-bit. Latin1 text that uses characters from the top-bit-set range is not valid UTF-8. Second, it’s possible to construct a valid UTF-8 file that only uses characters from the 7-bit range, and yet is not a valid ASCII file. (For example, if you save a file from Windows Notepad as UTF-8, it will not be valid ASCII.) That’s because UTF-8 is allowed to contain certain non-ASCII features. One is the so-called BOM (Byte Order Mark), which is a sequence of bytes at the start of the file unambiguously representing the file as UTF-8. (The bytes are 0xEF, 0xBB, 0xBF.) The BOM is optional, but Notepad always adds it if you save as UTF-8, which is likely to confuse any program that only understands how to process ASCII.
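If you do need the BOM bytes in .NET, an Encoding object will hand them to you via its GetPreamble method; GetBytes itself never includes them, so it's up to you to write the preamble first if you want one. A small sketch:

byte[] bom = Encoding.UTF8.GetPreamble();   // the UTF-8 byte order mark
foreach (byte b in bom)
{
    Console.Write("{0:X2} ", b);            // prints: EF BB BF
}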
We’re not going to look at any more details of these specific encodings. If you’re writing an encoder or decoder by hand, you’ll want to refer to the relevant specifications and vast bodies of work on their interpretation.
Fortunately, for the rest of us mortals, the .NET Framework provides us with standard implementations of most of the encodings, so we can convert between the different representations fairly easily.
Encoding is the process of turning a text string into a sequence of bytes. Conversely, decoding is the process of turning a byte sequence into a text string. The .NET APIs for encoding and decoding represent these sequences as byte arrays.
Let’s look at the code in Example 10-80 that illustrates this. First, we’ll encode some text using the UTF-8 and ASCII encodings, and write the byte values we see to the console.
Example 10-80. Encoding text
static void Main(string[] args)
{
    string listenUp = "Listen up!";
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(listenUp);
    byte[] asciiBytes = Encoding.ASCII.GetBytes(listenUp);
    Console.WriteLine("UTF-8");
    Console.WriteLine("-----");
    foreach (var encodedByte in utf8Bytes)
    {
        Console.Write(encodedByte);
        Console.Write(" ");
    }
    Console.WriteLine();
    Console.WriteLine();
    Console.WriteLine("ASCII");
    Console.WriteLine("-----");
    foreach (var encodedByte in asciiBytes)
    {
        Console.Write(encodedByte);
        Console.Write(" ");
    }
    Console.ReadKey();
}
The framework provides us with the Encoding
class. This has a set of static
properties that provide us with specific instances of an Encoding
object for a particular scheme. In
this case, we’re using UTF8
and
ASCII
, which actually return
instances of UTF8Encoding
and
ASCIIEncoding
, respectively.
Under normal circumstances, you do not need to know the actual
type of these instances; you can just talk to the object returned
through its Encoding
base
class.
GetBytes returns us a byte array containing the text of the string encoded using the relevant scheme.
If we build and run this code, we see the following output:
UTF-8
-----
76 105 115 116 101 110 32 117 112 33

ASCII
-----
76 105 115 116 101 110 32 117 112 33
Notice that our encodings are identical in this case, just as
promised. For basic Latin characters, UTF-8 and ASCII are compatible.
(Unlike Notepad, the .NET UTF8Encoding
does not choose to add a BOM by
default, so unless you use characters outside the ASCII range this will
in fact produce files that can be understood by anything that knows how
to process ASCII.)
Let’s make a quick change to the string we’re trying to encode,
and translate it into French. Replace the first line inside the Main
method with Example 10-81. Notice that we’ve got a
capital E with an acute accent at the
beginning.
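Example 10-81. Translating the string into French
string listenUp = "Écoute-moi!";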
If you don’t have a French keyboard and you’re wondering how to insert that E-acute character, there are a number of ways to do it.
If you know the decimal representation of the Unicode code point, you can hold down the Alt key and type
the number on the numeric keypad (and then release the Alt key). So
Alt-0163 will insert the symbol for the UK currency, £
, and Alt-0201 produces É
. This doesn’t work for the normal number
keys, though, so if you don’t have a numeric keypad—most laptops
don’t—this isn’t much help.
Possibly the most fun, though, is to run the charmap.exe application. The program icon for
it in the Start menu is buried pretty deeply, so it’s easier to type
charmap
into a command prompt, the
Start→Run box, or the Windows 7 Start
menu search box. This is very instructive, and allows you to explore the
various different character sets and (if you check the “Advanced view”
box) encodings. You can see an image of it in Figure 10-2.
Alternatively, you could just escape the character—the string
literal "\u00C9coutez moi"
will
produce the same result. And this has the advantage of not requiring
non-ASCII values in your source file. Visual Studio is perfectly able to
edit various file encodings, including UTF-8, so you can put non-ASCII
characters in strings without having to escape them, and you can even
use them in identifiers. But some text-oriented tools are not so
flexible, so there may be advantages in keeping your source code purely
ASCII.
Now, when we run again, we get the following output:
UTF-8
-----
195 137 99 111 117 116 101 45 109 111 105 33

ASCII
-----
63 99 111 117 116 101 45 109 111 105 33
We’ve quite clearly not got the same output in each case. The UTF-8 output starts with 195, 137, while the ASCII output starts with 63. After that first character, they’re identical again.
So, let’s try decoding those two byte arrays back into strings, and see what happens.
Insert the code in Example 10-82 before the
call to Console.ReadKey
.
Example 10-82. Decoding text
string decodedUtf8 = Encoding.UTF8.GetString(utf8Bytes);
string decodedAscii = Encoding.ASCII.GetString(asciiBytes);

Console.WriteLine();
Console.WriteLine();
Console.WriteLine("Decoded UTF-8");
Console.WriteLine("-------------");
Console.WriteLine(decodedUtf8);
Console.WriteLine();
Console.WriteLine();
Console.WriteLine("Decoded ASCII");
Console.WriteLine("-------------");
Console.WriteLine(decodedAscii);
We’re now using the GetString
method on our
Encoding
objects, to
decode the byte array back into a string. Here’s
the output:
UTF-8
-----
195 137 99 111 117 116 101 45 109 111 105 33

ASCII
-----
63 99 111 117 116 101 45 109 111 105 33

Decoded UTF-8
-------------
Écoute-moi!

Decoded ASCII
-------------
?coute-moi!
The UTF-8 bytes have decoded back to our original string. This is
because the UTF-8 encoding supports the E-acute character, and it does
so by inserting two bytes into the array: 195 137
.
On the other hand, our ASCII bytes have been decoded and we see that the first character has become a question mark.
If you look at the encoded bytes, you’ll see that the first byte is 63, which (if you look it up in an ASCII table somewhere) corresponds to the question mark character. So this isn’t the fault of the decoder. The encoder, when faced with a character it didn’t understand, inserted a question mark.
So, you need to be careful that any encoding you choose is capable of supporting the characters you are using (or be prepared for the information loss if it doesn’t).
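If silent substitution is not acceptable, you can ask for an encoding whose fallback throws an exception instead of inserting a question mark. (This is a sketch of the fallback mechanism rather than one of the chapter's numbered examples.)

// An ASCII encoding that throws rather than silently substituting '?'
Encoding strictAscii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
try
{
    byte[] bytes = strictAscii.GetBytes("Écoute-moi!");
}
catch (EncoderFallbackException x)
{
    Console.WriteLine("Cannot encode: " + x.CharUnknown);   // Cannot encode: É
}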
OK, we’ve seen an example of the one-byte-per-character ASCII representation, and the at-least-one-byte-per-character UTF-8 representation. Let’s have a look at the underlying at-least-two-bytes-per-character UTF-16 encoding that the framework uses internally—Example 10-83 uses this.
Example 10-83. Using UTF-16 encoding
static void Main(string[] args)
{
    string listenUpFR = "Écoute-moi!";
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(listenUpFR);
    Console.WriteLine("UTF-16");
    Console.WriteLine("-----");
    foreach (var encodedByte in utf16Bytes)
    {
        Console.Write(encodedByte);
        Console.Write(" ");
    }
    Console.ReadKey();
}
Notice that we’re using the Unicode
encoding this time.
If we compile and run, we see the following output:
UTF-16
-----
201 0 99 0 111 0 117 0 116 0 101 0 45 0 109 0 111 0 105 0 33 0
It is interesting to compare this with the ASCII output we had before:
ASCII
-----
63 99 111 117 116 101 45 109 111 105 33
The first character is different, because UTF-16 can encode the E-acute correctly; thereafter, every second byte in the UTF-16 array is zero, and the byte before it holds the ASCII value. As we said earlier, Unicode is highly compatible with ASCII: for characters in the ASCII range, each 16-bit value (i.e., each pair of bytes) is numerically equal to the corresponding 7-bit ASCII code.
There’s one more note to make about this byte array, which has to do with the order of the bytes. This is easier to see if we first update the program to show the values in hex, using the formatting function we learned about earlier, as Example 10-84 shows.
Example 10-84. Showing byte values of encoded text
static void Main(string[] args)
{
    string listenUpFR = "Écoute-moi!";
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(listenUpFR);
    Console.WriteLine("UTF-16");
    Console.WriteLine("-----");
    foreach (var encodedByte in utf16Bytes)
    {
        Console.Write(string.Format("{0:X2}", encodedByte));
        Console.Write(" ");
    }
    Console.ReadKey();
}
If we run again, we now see our bytes written out in hex format:
UTF-16
-----
C9 00 63 00 6F 00 75 00 74 00 65 00 2D 00 6D 00 6F 00 69 00 21 00
But remember that each UTF-16 code unit is a 16-bit value, so we need to think of each pair of bytes as a character. So, our second character is 63 00. This is the 16-bit hex value 0x0063, represented in little-endian form. That means we get the least-significant byte (LSB) first, followed by the most-significant byte (MSB).
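BitConverter, which uses the machine's native byte order, shows the same thing. This little check (ours, not part of the numbered example sequence) reads those two bytes back as a 16-bit number; on the little-endian machines Windows normally runs on, it prints 0063:

byte[] pair = { 0x63, 0x00 };                   // the second character's two bytes
ushort value = BitConverter.ToUInt16(pair, 0);  // interpreted in the machine's byte order
Console.WriteLine("{0:X4}", value);             // 0063 on a little-endian machine
Console.WriteLine(BitConverter.IsLittleEndian); // True on x86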
For good (but now largely historical) reasons of engineering efficiency, the Intel x86 family is natively a little-endian architecture. It always expects the LSB followed by the MSB, so the default Unicode encoding is little-endian. On the other hand, platforms like the 680x0 series used in “classic” Macs are big-endian—they expect the MSB, followed by the LSB. Some chip architectures (like the later versions of the ARM chip used in most phones) can even be switched between flavors!
Another historical note: one of your authors is big-endian (he used the Z80 and 68000 when he was a baby developer) and the other is little endian (he used the 6502, and early pre-endian-switching versions of the ARM when he was growing up).
Consequently, one of us has felt like every memory dump he’s looked at since about 1995 has been “backwards”. The other takes the contrarian position that it’s so-called “normal” numbers that are written backwards. So take a deep breath and count to 01.
Should you need to communicate with something that expects its
UTF-16 in a big-endian byte array, you can ask for it. Replace the line
in Example 10-84 that
initializes the utf16Bytes
variable
with the code in Example 10-85.
Example 10-85. Using big-endian UTF-16
byte[] utf16Bytes = Encoding.BigEndianUnicode.GetBytes(listenUpFR);
As you might expect, we get the following output:
UTF-16
------
00 C9 00 63 00 6F 00 75 00 74 00 65 00 2D 00 6D 00 6F 00 69 00 21
And let’s try it once more, but with Arabic text, as Example 10-86 shows.
Example 10-86. Big-endian Arabic
static void Main(string[] args)
{
    string listenUpArabic = "ّأنصت إليّ";
    byte[] utf16Bytes = Encoding.BigEndianUnicode.GetBytes(listenUpArabic);
    Console.WriteLine("UTF-16");
    Console.WriteLine("-----");
    foreach (var encodedByte in utf16Bytes)
    {
        Console.Write(string.Format("{0:X2}", encodedByte));
        Console.Write(" ");
    }
    Console.ReadKey();
}
And our output is:
UTF-16
-----
06 23 06 46 06 35 06 2A 00 20 06 25 06 44 06 4A 06 51
(Just to prove that you do get values bigger than 0xFF
in Unicode!)
In the course of the chapters on file I/O (Chapter 11) and networking (Chapter 13), we’re going to see a number of communications and storage APIs that deal with writing arrays of bytes to some kind of target device. The byte format in which those strings go down the wires is clearly very important, and, while the framework default choices are often appropriate, knowing how (and why) you might need to choose a different encoding will ensure that you’re equipped to deal with mysterious bugs—especially when wrangling text in a language other than your own, or to/from a non-Windows platform.[23]
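For example (borrowing slightly from Chapter 11's territory), File.WriteAllText lets you state the encoding explicitly when saving text. This is just a sketch, assuming using directives for System.IO and System.Text; the filenames are only for illustration:

// Write the same text with two different encodings - the resulting files differ
// in size and in how other programs will interpret the non-ASCII character.
string message = "Écoute-moi!";
File.WriteAllText("message-utf8.txt", message, Encoding.UTF8);
File.WriteAllText("message-utf16.txt", message, Encoding.Unicode);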