3.12. Detecting Illegal UTF-8 Characters

Problem

Your program accepts external input in UTF-8 encoding. You need to make sure that the UTF-8 encoding is valid.

Solution

Scan the input string for illegal UTF-8 sequences. If any illegal sequences are detected, reject the input.

Discussion

UTF-8 is an encoding that represents multibyte character sets in a way that is backward-compatible with 7-bit ASCII: every ASCII character is encoded as the same single byte. Another advantage of UTF-8 is that a zero byte never appears in the data unless it encodes an actual NULL character, so UTF-8 text can be handled as an ordinary NULL-terminated C string. Encodings such as Unicode's UCS-2 may (and often do) contain zero bytes as "padding" when they are treated as byte streams. For example, the letter "A" is the single byte 0x41 in ASCII or UTF-8, but it is the 16-bit value 0x0041 in UCS-2, whose upper byte is zero.

The first byte in a UTF-8 sequence determines the number of bytes that follow it to make up the complete sequence. If the upper bit of the first byte is clear, the byte stands alone and encodes a single-byte character. Otherwise, the number of consecutive upper bits set in the first byte, minus one, is the number of bytes that follow. A bit that is always clear immediately follows the count, and the remaining bits are used as part of the character encoding. Each byte that follows the first byte always has its upper bit set and its next bit clear (binary 10); its remaining six bits are combined with the encoding bits from the other bytes in the sequence to compute the character. Table 3-2 lists the binary encodings for the range of characters from 0x00000000 to 0x7FFFFFFF.

Table 3-2. UTF-8 encoding byte sequences

Character range	UTF-8 binary representation
`0x00000000 - 0x0000007F`	`0bbbbbbb`
`0x00000080 - 0x000007FF`	`110bbbbb 10bbbbbb`
`0x00000800 - 0x0000FFFF`	`1110bbbb 10bbbbbb 10bbbbbb`
`0x00010000 - 0x001FFFFF`	`11110bbb 10bbbbbb 10bbbbbb 10bbbbbb`
`0x00200000 - 0x03FFFFFF`	`111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb`
`0x04000000 - 0x7FFFFFFF`	`1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb`
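
To make the layout in Table 3-2 concrete, here is a minimal encoder sketch. It is not part of the original recipe: the name spc_utf8_encode_sketch and its buffer convention are hypothetical, chosen only for illustration. It assumes the caller supplies a buffer of at least six bytes, picks the shortest row of the table that can hold the value, and returns the number of bytes written (or 0 if the value is out of range).

```c
#include <stddef.h>

/* Hypothetical helper (not from the recipe): encode cp using the shortest
 * form listed in Table 3-2. out must have room for 6 bytes. Returns the
 * number of bytes written, or 0 if cp is larger than 0x7FFFFFFF. */
size_t spc_utf8_encode_sketch(unsigned long cp, unsigned char *out) {
  /* Largest value representable with nb continuation bytes. */
  static const unsigned long limit[6] = {
    0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
  };
  /* Count bits of the lead byte for nb continuation bytes. */
  static const unsigned char lead[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
  size_t nb, i;

  /* Find the first (shortest) row whose range contains cp. */
  for (nb = 0;  nb < 6 && cp > limit[nb];  nb++)
    ;
  if (nb == 6) return 0;

  /* Fill the continuation bytes from last to first, six bits at a time. */
  for (i = nb;  i > 0;  i--) {
    out[i] = (unsigned char)(0x80 | (cp & 0x3F));
    cp >>= 6;
  }
  /* Whatever bits remain go into the lead byte, below the count bits. */
  out[0] = (unsigned char)(lead[nb] | cp);
  return nb + 1;
}
```

For instance, the value 0x20AC falls in the third row of the table, so the sketch writes the three bytes 0xE2 0x82 0xAC.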

The problem with UTF-8 encoding is that invalid sequences can be embedded in the data. The UTF-8 specification states that the only legal encoding for a character is the shortest sequence of bytes that yields the correct value. A longer sequence may produce the same value as a shorter one, but it is not legal; such a longer sequence is called an overlong sequence.
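
The following sketch illustrates how an overlong sequence arises. It is a deliberately lenient decoder, written for this illustration only (the name spc_utf8_decode_naive is hypothetical): it checks the structure of one sequence but does not insist on the shortest form, so several different byte sequences decode to the same value.

```c
#include <stddef.h>

/* Hypothetical helper (not from the recipe): decode one sequence at s
 * WITHOUT rejecting overlong forms. s must be NULL-terminated. Stores the
 * decoded value in *cp and returns the sequence length in bytes, or 0 if
 * the bytes are not structurally valid. */
size_t spc_utf8_decode_naive(const unsigned char *s, unsigned long *cp) {
  size_t        nb, i;
  unsigned long v;

  if (!(*s & 0x80))             { nb = 0;  v = *s; }
  else if ((*s & 0xe0) == 0xc0) { nb = 1;  v = *s & 0x1f; }
  else if ((*s & 0xf0) == 0xe0) { nb = 2;  v = *s & 0x0f; }
  else if ((*s & 0xf8) == 0xf0) { nb = 3;  v = *s & 0x07; }
  else if ((*s & 0xfc) == 0xf8) { nb = 4;  v = *s & 0x03; }
  else if ((*s & 0xfe) == 0xfc) { nb = 5;  v = *s & 0x01; }
  else return 0;  /* continuation byte, 0xfe, or 0xff in the lead position */

  /* Append six payload bits from each continuation byte (10bbbbbb). */
  for (i = 1;  i <= nb;  i++) {
    if ((s[i] & 0xc0) != 0x80) return 0;
    v = (v << 6) | (s[i] & 0x3f);
  }
  *cp = v;
  return nb + 1;
}
```

With this decoder, the single byte 0x2F, the two-byte sequence 0xC0 0xAF, and the three-byte sequence 0xE0 0x80 0xAF all yield the value 0x2F (the "/" character). Only the first is the shortest form; the other two are overlong.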

The security issue posed by overlong sequences is that allowing them makes it significantly more difficult to analyze a UTF-8 encoded string, because multiple representations are possible for the same character. For example, a check that searches the raw bytes for "/" (0x2F) will not notice the overlong form 0xC0 0xAF, even though a lenient decoder would later turn it into the same character. It would be possible to recognize overlong sequences and convert them to the shortest sequence, but we recommend against doing that because there may be other issues involved that have not yet been discovered. We recommend that you reject any input that contains an overlong sequence.

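The following spc_utf8_isvalid() function scans a NULL-terminated string encoded in UTF-8 and checks that every sequence is well formed: each sequence must begin with a legal first byte, and that byte must be followed by exactly the number of continuation bytes it announces. It returns 1 if the string passes these checks; otherwise, it returns 0. These structural checks do not by themselves detect overlong sequences; rejecting those also requires comparing each decoded value against the smallest value that needs a sequence of that length.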

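int spc_utf8_isvalid(const unsigned char *input) {
  int                 nb, i;
  const unsigned char *c;
  
  for (c = input;  *c;  c += (nb + 1)) {
    /* Use the first byte to find the number of continuation bytes. */
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                           /* 0xfe and 0xff are never legal */
    /* Each following byte must have the form 10bbbbbb. A truncated final
     * sequence fails here at the terminating zero byte. */
    for (i = 1;  i <= nb;  i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  } 
  
  return 1;
}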
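
The function above checks the structure of each sequence. To also enforce the recommendation to reject overlong sequences, one possible approach is to decode each value while scanning and compare it against the smallest value in the matching row of Table 3-2. The following is a sketch of that idea under stated assumptions, not a definitive implementation: the name spc_utf8_isvalid_strict is hypothetical, the input is assumed to be NULL-terminated, and the code follows the six-byte range of Table 3-2 rather than the tighter limits of the later UTF-8 definition in RFC 3629 (which stops at 0x10FFFF and excludes surrogate values).

```c
/* Hypothetical stricter variant (not from the recipe): structural checks
 * plus rejection of overlong sequences. Returns 1 if valid, 0 otherwise. */
int spc_utf8_isvalid_strict(const unsigned char *input) {
  /* Smallest value that legitimately needs nb continuation bytes,
   * taken from the lower bound of each row of Table 3-2. */
  static const unsigned long min_value[6] = {
    0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
  };
  const unsigned char *c;
  unsigned long       v;
  int                 nb, i;

  for (c = input;  *c;  c += (nb + 1)) {
    /* Classify the first byte and keep its payload bits. */
    if (!(*c & 0x80))             { nb = 0;  v = *c; }
    else if ((*c & 0xe0) == 0xc0) { nb = 1;  v = *c & 0x1f; }
    else if ((*c & 0xf0) == 0xe0) { nb = 2;  v = *c & 0x0f; }
    else if ((*c & 0xf8) == 0xf0) { nb = 3;  v = *c & 0x07; }
    else if ((*c & 0xfc) == 0xf8) { nb = 4;  v = *c & 0x03; }
    else if ((*c & 0xfe) == 0xfc) { nb = 5;  v = *c & 0x01; }
    else return 0;  /* stray continuation byte, 0xfe, or 0xff */

    /* Verify each continuation byte and accumulate its six payload bits. */
    for (i = 1;  i <= nb;  i++) {
      if ((c[i] & 0xc0) != 0x80) return 0;
      v = (v << 6) | (c[i] & 0x3f);
    }

    /* A value that would fit in a shorter sequence is overlong. */
    if (v < min_value[nb]) return 0;
  }

  return 1;
}
```

Under these assumptions, the bytes 0xC3 0xA9 are accepted, while 0xC0 0xAF and 0xE0 0x80 0xAF are rejected as overlong encodings of 0x2F.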