Your program accepts external input in UTF-8 encoding. You need to make sure that the UTF-8 encoding is valid.
Scan the input string for illegal UTF-8 sequences. If any illegal sequences are detected, reject the input.
UTF-8 is an encoding that is used to represent multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that it ensures there are no NULL bytes in the data, with the exception of an actual NULL byte. Encodings such as Unicode's UCS-2 may (and often do) contain NULL bytes as "padding" if they are treated as byte streams. For example, the letter "A" is 0x41 in ASCII or UTF-8, but it is 0x0041 in UCS-2.
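As a small illustration of the padding issue, the sketch below counts zero bytes in a raw buffer. The function name count_nul_bytes and the two sample arrays are hypothetical and exist only for this example; UCS-2 is shown in big-endian byte order, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* The letter "A" in the two encodings discussed above. */
const unsigned char a_utf8[]    = { 0x41 };        /* UTF-8 (same as ASCII) */
const unsigned char a_ucs2_be[] = { 0x00, 0x41 };  /* UCS-2, big-endian     */

/* Count the zero bytes in a buffer of the given length. */
size_t count_nul_bytes(const unsigned char *buf, size_t len) {
  size_t i, n = 0;

  assert(buf != NULL || len == 0);
  for (i = 0;  i < len;  i++)
    if (buf[i] == 0x00) n++;
  return n;
}
```

Calling count_nul_bytes(a_ucs2_be, sizeof a_ucs2_be) reports one zero byte, while the UTF-8 form has none. Any code that treats the UCS-2 bytes as a C string would stop at that first byte.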
The first byte in a UTF-8 sequence determines the number of bytes that follow it to make up the complete sequence. A single-byte sequence has its upper bit clear and carries a 7-bit value. In a multibyte sequence, the number of upper bits set in the first byte minus one indicates the number of bytes that follow. A bit that is never set immediately follows the count, and the remaining bits are used as part of the character encoding. The bytes that follow the first byte always have the upper two bits set and unset, respectively (binary 10); the remaining six bits of each are combined with the encoding bits from the other bytes in the sequence to compute the character. Table 3-2 lists the binary encodings for the range of characters from 0x00000000 to 0x7FFFFFFF.
Table 3-2. UTF-8 encoding byte sequences
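Byte range | UTF-8 binary representation |
---|---|
0x00000000 - 0x0000007F | 0bbbbbbb |
0x00000080 - 0x000007FF | 110bbbbb 10bbbbbb |
0x00000800 - 0x0000FFFF | 1110bbbb 10bbbbbb 10bbbbbb |
0x00010000 - 0x001FFFFF | 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb |
0x00200000 - 0x03FFFFFF | 111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb |
0x04000000 - 0x7FFFFFFF | 1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb |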
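The layout described above can be sketched in code. The two helpers below are illustrative only: the names utf8_seq_length and utf8_encode are hypothetical, and the encoder assumes the caller supplies a buffer with room for six bytes. They follow the 1-to-6-byte scheme covering 0x00000000 to 0x7FFFFFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Total length (1..6) of the sequence introduced by `first`, or 0 if
 * `first` cannot start a sequence (a continuation byte, 0xFE, or 0xFF). */
int utf8_seq_length(unsigned char first) {
  int ones = 0;

  /* Count the run of 1 bits at the top of the byte. */
  while (ones < 8 && (first & (0x80 >> ones))) ones++;
  if (ones == 0) return 1;             /* 0bbbbbbb: single byte        */
  if (ones == 1 || ones > 6) return 0; /* 10bbbbbb, 0xFE, 0xFF         */
  return ones;                         /* ones - 1 bytes follow        */
}

/* Encode `cp` in its shortest form. `out` must have room for 6 bytes.
 * Returns the number of bytes written, or 0 if cp > 0x7FFFFFFF. */
int utf8_encode(unsigned long cp, unsigned char *out) {
  static const unsigned long limit[] = {
    0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000UL
  };
  static const unsigned char lead[] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
  int i, len;

  assert(out != NULL);
  for (len = 1;  len <= 6;  len++)
    if (cp < limit[len - 1]) break;    /* first range that can hold cp */
  if (len > 6) return 0;

  /* Fill continuation bytes from the end, six bits at a time. */
  for (i = len - 1;  i > 0;  i--) {
    out[i] = (unsigned char)(0x80 | (cp & 0x3F));
    cp >>= 6;
  }
  out[0] = (unsigned char)(lead[len - 1] | cp);
  return len;
}
```

For instance, utf8_encode(0x20AC, buf) writes the three bytes 0xE2 0x82 0xAC, and utf8_seq_length(0xE2) returns 3 for that first byte.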
The problem with UTF-8 encoding is that invalid sequences can be embedded in the data. The UTF-8 specification states that the only legal encoding for a character is the shortest sequence of bytes that yields the correct value. Longer sequences may be able to produce the same value as a shorter sequence, but they are not legal; such a longer sequence is called an overlong sequence. For example, the two-byte sequence 0xC0 0xAF yields the same value as the single byte 0x2F ("/"), so it is overlong.
The security issue posed by overlong sequences is that allowing them makes it significantly more difficult to analyze a UTF-8 encoded string because multiple representations are possible for the same character. It would be possible to recognize overlong sequences and convert them to the shortest sequence, but we recommend against doing that because there may be other issues involved that have not yet been discovered. We recommend that you reject any input that contains an overlong sequence.
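To make the risk concrete, the following sketch pairs a deliberately naive decoder (no shortest-form check) with a simple byte-level search. All names here (naive_decode, contains_byte, and the sample arrays) are hypothetical, and the decoder handles only sequences of one to three bytes to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Three encodings that a lenient decoder maps to "/" (0x2F). */
const unsigned char slash_plain[] = { 0x2F };              /* legal    */
const unsigned char slash_over2[] = { 0xC0, 0xAF };        /* overlong */
const unsigned char slash_over3[] = { 0xE0, 0x80, 0xAF };  /* overlong */

/* Decode one 1-, 2-, or 3-byte sequence WITHOUT rejecting overlong forms.
 * The caller must supply as many bytes as the first byte calls for.
 * Returns 0xFFFFFFFF for first bytes this sketch does not handle. */
unsigned long naive_decode(const unsigned char *s) {
  assert(s != NULL);
  if ((s[0] & 0x80) == 0x00) return s[0];
  if ((s[0] & 0xE0) == 0xC0)
    return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
  if ((s[0] & 0xF0) == 0xE0)
    return ((unsigned long)(s[0] & 0x0F) << 12) |
           ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
  return 0xFFFFFFFFUL;
}

/* Byte-level filter: does the raw buffer contain the byte `wanted`? */
int contains_byte(const unsigned char *s, size_t len, unsigned char wanted) {
  size_t i;

  assert(s != NULL || len == 0);
  for (i = 0;  i < len;  i++)
    if (s[i] == wanted) return 1;
  return 0;
}
```

All three arrays decode to 0x2F, yet contains_byte finds the 0x2F byte only in slash_plain. A filter that inspects raw bytes before a lenient decoder runs would therefore miss the two overlong forms, which is why rejecting them outright is the safer choice.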
The following spc_utf8_isvalid() function will scan a NULL-terminated string encoded in UTF-8 to verify that it contains only valid sequences. It will return 1 if the string contains only legitimate encoding sequences; otherwise, it will return 0.
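    int spc_utf8_isvalid(const unsigned char *input) {
      int                 i, nb;
      unsigned long       ch;
      const unsigned char *c;
      /* min[nb] is the smallest value that needs nb trailing bytes. */
      static const unsigned long min[] = { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

      for (c = input;  *c;  c += (nb + 1)) {
        if (!(*c & 0x80)) nb = 0;
        else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
        else if ((*c & 0xe0) == 0xc0) nb = 1;
        else if ((*c & 0xf0) == 0xe0) nb = 2;
        else if ((*c & 0xf8) == 0xf0) nb = 3;
        else if ((*c & 0xfc) == 0xf8) nb = 4;
        else if ((*c & 0xfe) == 0xfc) nb = 5;
        else return 0;                           /* 0xfe and 0xff are never legal */

        /* Check trailing bytes in order so a truncated sequence stops at the
         * terminator; accumulate the value to catch overlong sequences. */
        ch = *c & (0x7f >> nb);
        for (i = 1;  i <= nb;  i++) {
          if ((c[i] & 0xc0) != 0x80) return 0;
          ch = (ch << 6) | (c[i] & 0x3f);
        }
        if (ch < min[nb]) return 0;
      }
      return 1;
    }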
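Because spc_utf8_isvalid() stops at the first NULL byte, it cannot check a buffer whose length is known separately, such as data read from a socket. The following is a minimal sketch of a length-bounded variant under the same rules (sequences of one to six bytes, overlong forms rejected). The name utf8_isvalid_n is hypothetical, and treating an embedded NULL byte as a valid character is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if buf[0..len-1] holds only complete, shortest-form UTF-8
 * sequences; otherwise return 0. Never reads past buf + len. */
int utf8_isvalid_n(const unsigned char *buf, size_t len) {
  /* min[nb] is the smallest value that needs nb trailing bytes. */
  static const unsigned long min[] = {
    0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
  };
  size_t pos = 0;

  assert(buf != NULL || len == 0);
  while (pos < len) {
    unsigned char b = buf[pos];
    unsigned long ch;
    int           i, nb;

    if (!(b & 0x80)) nb = 0;
    else if ((b & 0xe0) == 0xc0) nb = 1;
    else if ((b & 0xf0) == 0xe0) nb = 2;
    else if ((b & 0xf8) == 0xf0) nb = 3;
    else if ((b & 0xfc) == 0xf8) nb = 4;
    else if ((b & 0xfe) == 0xfc) nb = 5;
    else return 0;          /* continuation byte, 0xfe, or 0xff as first byte */

    /* Reject a sequence that would run off the end of the buffer. */
    if ((size_t)nb > len - pos - 1) return 0;

    ch = b & (0x7f >> nb);  /* payload bits of the first byte */
    for (i = 1;  i <= nb;  i++) {
      if ((buf[pos + i] & 0xc0) != 0x80) return 0;
      ch = (ch << 6) | (buf[pos + i] & 0x3f);
    }
    if (ch < min[nb]) return 0;  /* overlong sequence */
    pos += (size_t)nb + 1;
  }
  return 1;
}
```

A caller would pass the byte count it actually received, for example utf8_isvalid_n(buf, nread), and reject the input when the function returns 0. An input that ends in the middle of a multibyte sequence is reported as invalid rather than read past its end.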