Character Set Handling

Document type detection is one of the more important pieces of the content-processing puzzle, but it is certainly not the only one. For all types of text-based files rendered in the browser, one more determination needs to be made: The appropriate character set transformation must be identified and applied to the input stream. The output encoding sought by the browser is typically UTF-8 or UTF-16; the input, on the other hand, is up to the author of the page.

In the simplest scenario, the appropriate encoding method will be provided by the server in a charset parameter of the Content-Type header. In the case of HTML documents, the same information may also be conveyed to some extent through the <meta> directive. (The browser will attempt to speculatively extract and interpret this directive before actually parsing the document.)

Unfortunately, the dangerous qualities of certain character encodings, as well as the actions taken by the browser when the charset parameter is not present or is not recognized, once again make life a lot more interesting than the aforementioned simple rule would imply. To understand what can go wrong, we first need to recognize three special classes of character sets that may alter the semantics of HTML or XML documents:

Character sets that permit noncanonical representations of standard 7-bit ASCII codes. Such noncanonical sequences could be used to cleverly encode HTML syntax elements, such as angle brackets or quotes, in a manner that survives a simple server-side check. For example, the famously problematic UTF-7 encoding permits the “<” character to be encoded as a five-character sequence of “+ADw-”, a string that most server-side filters will happily permit as is. In a similar vein, UTF-8 specification formally prohibits, but technically permits, “<” to be represented by unnecessarily verbose 2- to 5-byte sequences, from 0xC0 0xBC to 0xFC 0x80 0x80 0x80 0x80 0xBC.^[63]
Variable length encodings that give special meaning to one or more bytes that follow a special prefix. Such logic may result in legitimate HTML syntax elements being “consumed” as part of an unintentional multibyte literal. For example, the Shift JIS prefix code 0xE0 can cause the subsequent angle bracket or a quote to be consumed in Internet Explorer, Firefox, and Opera (but not in Chrome), possibly severely altering the meaning of the inline markup.
The opposite problem may also occur: The server may be convinced that it is outputting a multibyte literal, but this literal may be rejected by the browser and interpreted as several individual characters. In EUC-KR, the 0x8E prefix is honored only if the subsequent character has an ASCII code of 0x41 or higher. Any less and it will not have the expected effect, but not all server-side implementations may notice.
Encodings that are completely incompatible with 8-bit ASCII. These cases will simply lead to a very different view of document structure between the client and the server. Common examples include UTF-16 or UTF-32.

The bottom line is that unless the server has a perfect command of the character set it is generating and unless it is certain that the client will not apply an unexpected transformation to the payload, serious complications may arise. For example, consider a web application that removes angle brackets from the highlighted user-controlled string in the following piece of HTML:

You are currently viewing:
<span class="blog_title">
 +ADw-script+AD4-alert("Hi mom!")+ADw-/script+AD4-
</span>

If that document is interpreted as UTF-7 by the receiving party, the actual parsed markup will look as follows:

You are currently viewing:
<span class="blog_title">
 <script>alert("Hi mom!")</script>
</span>

A similar problem, this time related to byte consumption in Shift JIS encoding, is illustrated below. A multibyte prefix is permitted to consume a closing quote, and as a result, the associated HTML tag is not terminated as expected, enabling the attacker to inject an extra onerror handler into the markup:

<img src="http://fuzzybunnies.com/[0xE0]">
 ...this is still a part of the markup...
  ...but the server doesn't know...
  " onerror="alert('This will execute!')"
<div>
  ...page content continues...
</div>

It is simply imperative to prevent character set autodetection for all text-based documents that contain any type of user-controlled data. Most browsers will engage in character set detection if the charset parameter is not found in the Content-Type header or in the <meta> tag. Some marked differences exist between the implementations (for example, only Internet Explorer is keen to detect UTF-7), but you should never assume that the outcome of character set sniffing will be safe.

Character set autodetection will also be attempted if the character set is not recognized or is mistyped; this problem is compounded by the fact that charset naming can be ambiguous and that web browsers are inconsistent in how much tolerance they have for common name variations. As a single data point, consider the fact that Internet Explorer recognizes both ISO-8859-2 and ISO8859-2 (with no dash after the ISO part) as valid character set identifiers in the Content-Type header but fails to recognize UTF8 as an alias for UTF-8. The wrong choice can cause some serious pain.

Note

Fun fact: The X-Content-Type-Options header has no effect on character-sniffing logic.

Byte Order Marks

We are not done with character set detection just yet! Internet Explorer needs to be singled out for yet another dramatically misguided content-handling practice: the tendency to give precedence to the so-called byte order mark (BOM), a sequence of bytes that can be placed at the beginning of a file to identify its encoding, over the explicitly provided charset data. When such a marker is detected in the input file, the declared character set is ignored.

Table 13-1 shows several common markers. Of these, the printable UTF-7 BOM is particularly sneaky.

Table 13-1. Common Byte Order Markers (BOMs)

Encoding name	Byte order mark sequence
UTF-7	“+/v” followed by “8”, “9”, “+”, or “/”
UTF-8	0xEF 0xBB 0xBF
UTF-16 little endian	0xFF 0xFE
UTF-16 big endian	0xFE 0xFF
UTF-32 little endian	0xFF 0xFE 0x00 0x00
UTF-32 big endian	0x00 0x00 0xFE 0xFF
GB −18030	0x84 0x31 0x95 0x33

Note

Microsoft engineers acknowledge the problem with this design and, as of this writing, say that the logic may be revised, depending on the outcome of compatibility tests. If the problem is resolved by the time this book hits the shelves, kudos to them. Until then, allowing the attacker to control the first few bytes of an HTTP response that is not otherwise protected by Content-Disposition may be a bad idea—and other than padding the response, there is no way to work around this glitch.

Character Set Inheritance and Override

Two additional, little-known mechanisms should be taken into account when evaluating the potential impact on character set handling strategies in contemporary web browsers. Both of these features may permit an attacker to force undesirable character encoding upon another page, without relying on character sniffing.

The first apparatus in question, supported by all but Internet Explorer, is known as character set inheritance. Under this policy, any encoding defined for the top-level frame may be automatically applied to any framed documents that do not have their own, valid charset value set. Initially, such inheritance is extended to all framing scenarios, even across completely unrelated websites. However, when Stefan Esser, Abhishek Arya, and several other researchers demonstrated a number of plausible attacks that leveraged this feature to force UTF-7 parsing on unsuspecting targets, Firefox and WebKit developers decided to limit the behavior to same-origin frames. (Opera still permits cross-domain inheritance. Although it does not support UTF-7, other problematic encodings, such as Shift JIS, are fair game.)

The other mechanism that deserves mention is the ability to manually override the currently used character set. This feature is available through the View > Encoding menu or similar in most browsers. Using this menu to change the character set causes the page and all its subframes (including cross-domain ones!) to be reparsed using the selected encoding, regardless of any charset directives encountered earlier for that content.

Because users may be easily duped into selecting an alternative encoding for an attacker-controlled page (simply in order to view it correctly), this design should make you somewhat uncomfortable. Casual users can’t be expected to realize that their election will also apply to hidden <iframe> tags and that such a seemingly innocuous action may enable cross-site scripting attacks against unrelated web properties. In fact, let’s be real: Most of them will not know—and should not have to know—what an <iframe> is.

Markup-Controlled Charset on Subresources

We are nearing the end of the epic journey through the web of content-handling quirks, but we are not quite done yet. Astute readers may recall that in Type-Specific Content Inclusion in Frames, I mentioned that on certain types of subresources (namely, stylesheets and scripts), the embedding page can specify its own charset value in order to apply a specific transformation to the retrieved document, for example,

<script src="http://fuzzybunnies.com/get_js_data.php" charset="EUC-JP">

This parameter is honored by all browsers except for Opera. Where it is supported, it typically does not take precedence over charset in Content-Type, unless that second parameter is missing or unrecognized. But to every rule, there is an exception, and all too often, the name of this exception is Internet Explorer 6. In that still-popular browser, the encoding specified by the markup overrides HTTP data.

Does this behavior matter in practice? To fully grasp the consequences, let’s also quickly return to Chapter 6, where we debated the topic of securing server-generated, user-specific, JSON-like code against cross-domain inclusion. One example of an application that needs such a defense is a searchable address book in a webmail application: The search term is provided in the URL, and a JavaScript serialization of the matching contacts is returned to the browser but must be shielded from inclusion on unrelated sites.

Now, let’s assume that the developer came up with a simple trick to prevent third-party web pages from loading this data through <script src=...>: A single “//” prefix is used to turn the entire response into a comment. Same-origin callers that use the XMLHttpRequest API can simply examine the response, strip the prefix, and pass the data to eval(...)—but remote callers, trying to abuse the <script src=...> syntax, will be out of luck.

In this design, a request to /contact_search.php?q=smith may yield the following response:

// var result  = { "q": "smith", "r": [ "j.smith@example.com" ] };

As long as the search term is properly escaped or filtered, this scheme appears safe. But when we realize that the attacker may force the response to be interpreted as UTF-7, the picture changes dramatically. A seemingly benign search term that, as far as the server is concerned, contains no illegal characters could still unexpectedly decode to

// var result = { "q": "smith[CR][LF]
var gotcha = { "", "r": [ "j.smith@example.com" ] };

This response, when loaded via <script src=... charset=utf-7> inside the victim’s browser, gives the attacker access to a portion of the user’s address book.

This is not just a thought exercise: The “//” approach is fairly common on the Web, and Masato Kinugawa, a noted researcher, found several popular web applications affected by this bug. And a more contrived variant of the same attack is also possible against other execution-preventing prefixes, such as while (1);. In the end, the problems with cross-domain charset override on <script> tags is one of the reasons why in Chapter 6, we strongly recommend using a robust parser-stopping prefix to prevent the interpreter from ever looking at any attacker-controlled bits. Oh—and if you factor in the support for E4X, the picture becomes even more interesting,^[225] but let’s leave it at that.

Detection for Non-HTTP Files

To wrap up this chapter, let’s look at the last missing detail: character set encoding detection for documents delivered over non-HTTP protocols. As can be expected, documents saved to disk and subsequently opened over the file: protocol, or loaded by other means where the usual Content-Type metadata is absent, will usually be subjected to character set detection logic. However, unlike with document determination heuristics, there is no substantial difference among all the possible delivery methods: In all cases, the sniffing behavior is roughly the same.

There is no clean and portable way to address this problem for all text-based documents, but for HTML specifically, the impact of character set sniffing can be mitigated by embedding a <meta> directive inside the document body:

<meta http-equiv="Content-Type" content="text/html;charset=...">

You should not ditch Content-Type in favor of this indicator. Unlike <meta>, the header works for non-HTML content, and it is easier to enforce and audit on a site-wide level. That said, documents that are likely to be saved to disk and that contain attacker-controlled tidbits will benefit from a redundant <meta> tag. (Just make sure that this value actually matches Content-Type.)

^[63] Today, this problem is mitigated by most browsers: Their parsers now have additional checks to reject overlong UTF-8 encodings as a matter of principle. The same cannot be said of all possible server-side UTF-8 libraries, however.