Understanding HTML Parser Behavior

The fundamentals of HTML syntax outlined in the previous sections are usually enough to understand the meaning of well-formed HTML and XHTML documents. When the XHTML dialect is used, there is little more to the story: The minimal fault-tolerance of the parser means that anomalous syntax almost always leads simply to a parsing error. Alas, the picture is very different with traditional, laid-back HTML parsers, which aggressively second-guess the intent of the page developer even in very ambiguous or potentially harmful situations.

Since an accurate understanding of user-supplied markup is essential to designing many types of security filters, let’s have a quick look at some of these behaviors and quirks. To begin, consider the following reference snippet:

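[Figure: annotated reference markup snippet; the numbered location markers cited in the following paragraphs refer to it]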

Web developers are usually surprised to learn that this syntax can be drastically altered without changing its significance to the browser. For example, Internet Explorer will allow a NUL character (0x00) to be inserted in the location marked at , a change that is likely to throw all naïve HTML filters off the trail. It is also not widely known that the whitespace characters at and can be replaced with the uncommon vertical tab (0x0B) or form feed (0x0C) characters in all browsers, and with a Unicode nonbreaking space (U+00A0) in Opera.[26] More surprising still: In Firefox, the whitespace at can also be replaced with a single, regular slash, yet the one at can’t.
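To make the whitespace problem concrete, here is a minimal sketch of a hypothetical blocklist rule that recognizes only space, tab, CR, and LF as attribute separators. The rule, the function name, and the sample markup are assumptions made up for illustration; the sketch takes the text's word for it that browsers accept form feed and vertical tab in this position.

```python
import re

# Hypothetical blocklist rule: flag any "on...=" attribute that follows
# a space, tab, CR, or LF inside a tag. Form feed (0x0C) and vertical
# tab (0x0B) are deliberately missing, as they would be in a naive filter.
NAIVE_RULE = re.compile(rb"<[^>]*[ \t\r\n]on[a-z]+\s*=", re.IGNORECASE)


def naive_is_dangerous(markup: bytes) -> bool:
    """Return True if the naive rule spots an event handler attribute."""
    return NAIVE_RULE.search(markup) is not None


plain = b'<img src="a.jpg" onerror="alert(1)">'
form_feed = b'<img src="a.jpg"\x0conerror="alert(1)">'
vertical_tab = b'<img src="a.jpg"\x0bonerror="alert(1)">'

print(naive_is_dangerous(plain))         # True
print(naive_is_dangerous(form_feed))     # False: the filter is bypassed
print(naive_is_dangerous(vertical_tab))  # False: the filter is bypassed

# Byte-oriented whitespace tests (similar in spirit to isspace() in the
# C locale) do not treat 0xA0 as whitespace, while Unicode-aware ones do.
print(b"\xa0".isspace())  # False
print("\xa0".isspace())   # True
```

Widening the character class would fix these particular inputs, but it would not address the other substitutions described here, which is why pattern matching of this kind tends to lag behind the parsers it is trying to imitate.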

Moving on, the location marked is also of note. In this spot, NUL characters are ignored by most parsers, as are many types of whitespace. Not long ago, WebKit browsers accepted a slash in this location, but recent parser improvements have eliminated this quirk.

Quote characters are yet another topic of interest. Website developers know that single and double quotes can be used to put a string containing whitespace or angle brackets in an HTML parameter, but it usually comes as a surprise that Internet Explorer also honors backticks (`) instead of real quotes in the location marked . Similarly, few people realize that in any browser, an implicit whitespace is inserted after a quoted parameter, and that the explicit whitespace at can therefore be skipped without changing the meaning of the tag.

The security impact of these patterns is not always easy to appreciate, but consider an HTML filter tasked with scrubbing an <img> tag with an attacker-controlled title parameter. Let’s say that in the output markup, this parameter is left unquoted if it contains no whitespace or angle brackets, a design that can be seen on a popular blogging site. This practice may appear safe at first, but in the following two cases, a malicious, injected onerror parameter will materialize inside a tag:

<img ... title=""onerror="alert(1)">

and

<img ... title=``onerror=`alert(1)`>
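The first of these two cases can be reproduced with a few lines of code. The sketch below is a hypothetical reconstruction of the flawed design described above, not the code of any real site: render_title_naive() quotes the title only when it contains whitespace or angle brackets, while render_title_quoted() always quotes and escapes. Both function names and the a.jpg URL are made up for illustration.

```python
import re
from html import escape

# Characters that the hypothetical filter believes make quoting necessary.
NEEDS_QUOTES = re.compile(r"[\s<>]")


def render_title_naive(title: str) -> str:
    """Flawed serializer: leaves 'simple' values unquoted."""
    if NEEDS_QUOTES.search(title):
        return '<img src="a.jpg" title="%s">' % title.replace('"', "&quot;")
    return '<img src="a.jpg" title=%s>' % title


def render_title_quoted(title: str) -> str:
    """Safer serializer: always double-quotes and escapes the value."""
    return '<img src="a.jpg" title="%s">' % escape(title, quote=True)


payload = '""onerror="alert(1)"'

print(render_title_naive(payload))
# <img src="a.jpg" title=""onerror="alert(1)">

print(render_title_quoted(payload))
# <img src="a.jpg" title="&quot;&quot;onerror=&quot;alert(1)&quot;">
```

The payload contains no whitespace and no angle brackets, so the naive serializer emits it verbatim and the browser sees an empty title followed by a separate onerror parameter. Unconditional quoting also neutralizes the backtick variant, because backticks inside a double-quoted value are ordinary characters.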

Yet another wonderful quote-related quirk in Internet Explorer makes this job even more complicated. While most browsers recognize quoting only when it is used at the beginning of a parameter value, Internet Explorer simply checks for any occurrence of an equal sign (=) followed by a quote and will parse this syntax in a rather unexpected way:

<img src=test.jpg?value=">Yes, we are still inside a tag!">

Parsing a single tag can be a daunting task, but as you might imagine, anomalous arrangements of multiple HTML tags will be even less predictable. Consider the following trivial example:

<i <b>

When presented with such syntax, most browsers interpret only <i> and treat the “<b” string as an invalid tag parameter. Firefox versions before 4, however, would automatically close the <i> tag first when encountering an angle bracket and, in the end, would interpret both <i> and <b>. In the spirit of fault tolerance, WebKit followed that model until recently, too.

A similar behavior can be observed in previous versions of Firefox when dealing with tag names that contain invalid characters (in this case, the equal sign). Instead of doing its best to ignore the entire block, the parser would simply reset and interpret the quoted tag:

<i="<b>">
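Server-side parsers are parties to this disagreement, too. As a rough illustration, the sketch below feeds both snippets to html.parser from the Python standard library and logs the start tags it reports. The TagLogger class and the start_tags() helper are names invented for this example, and no output is shown for the second snippet because the result may differ between Python releases; the point is only that whatever the filter's parser sees is not necessarily what a given browser sees.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag, with its attributes, as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((tag, attrs))


def start_tags(markup):
    """Parse markup and return the list of (tag, attrs) start-tag events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


for sample in ("<i <b>", '<i="<b>">'):
    print(repr(sample), "->", start_tags(sample))
```

For the first snippet, this parser sides with the majority of browsers and reports a single <i> tag carrying an odd attribute named "<b". A filter built on it would therefore never notice the <b> tag that older Firefox and WebKit versions went on to interpret.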

The handling of tags that are not closed before the end of the file is equally fascinating. For example, the following snippet will prompt most browsers to interpret the <i> tag or ignore the entire string, but Internet Explorer and Opera use a different backtracking approach and will see <b> instead:

<i foo="<b>" [EOF]

In fact, Firefox versions prior to version 4 engaged in far-fetched reparsing whenever particular special tags, such as <title>, were not closed before the end of the document:

<title>This text will be interpreted as a title
<i>This text will be shown as document body!
[EOF]

The last two parsing quirks have interesting security consequences in any scenario where the attacker may be able to interrupt page load prematurely. Even if the markup is otherwise fairly well sanitized, the meaning of the document may change in a very unexpected way.

To further complicate the job of HTML parsing, some browsers exhibit behaviors that can be used to conditionally skip some of the markup in a document. For example, in an attempt to help novice users of Microsoft’s Active Server Pages development platform, Internet Explorer treats <% ... %> blocks as a completely nonstandard comment, hiding any markup between these two character sequences. Another Internet Explorer-specific feature is explicit conditional expressions interpreted by the parser and smuggled inside standard HTML comment blocks:

<!--[if IE 6]>
  Markup that will be parsed only for Internet Explorer 6
<![endif]-->

Many other quirks of this type are related to the idiosyncrasies of SGML and XML. For example, due to the comment-handling behavior mentioned earlier in an aside, browsers disagree on how to parse !- and ?-directives (such as <!DOCTYPE> or <?xml>), whether to allow XML-style CDATA blocks in non-XHTML modes, and on what precedence to give to overlapping special parsing mode tags (such as “<style><!-- </style> -->”).

The set of parsing behaviors discussed in the previous sections is by no means exhaustive. In fact, an entire book has been written on this topic: Inquisitive readers are advised to grab Web Application Obfuscation (Syngress, 2011) by Mario Heiderich, Eduardo Alberto Vela Nava, Gareth Heyes, and David Lindsay—and then weep about the fate of humanity. The bottom line is that building HTML filters that try to block known dangerous patterns, and allow the remaining markup as is, is simply not feasible.

The only reasonable approach to tag sanitization is to employ a realistic parser to translate the input document into a hierarchical in-memory document tree, and then scrub this representation for all unrecognized tags and parameters, as well as any undesirable tag/parameter/value configurations. At that point, the tree can be carefully reserialized into well-formed, well-escaped HTML that will not flex any of the error correction muscles in the browser itself. Many developers think that a simpler design should be possible, but eventually they discover the reality the hard way.
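The overall shape of such a sanitizer can be sketched in a few dozen lines. The example below is a simplified illustration under several assumptions: it uses html.parser from the Python standard library as a stand-in for a realistic parser (a production design would more likely rely on a parser that follows the HTML5 parsing algorithm), it rebuilds the output from parser events rather than from a full tree, and the ALLOWED table, the URL prefix check, and all names are hypothetical policy choices.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> set of permitted parameter names.
ALLOWED = {"b": set(), "i": set(), "p": set(), "img": {"src", "title", "alt"}}
VOID_TAGS = {"img"}                           # tags that never get a closing tag
SAFE_URL_PREFIXES = ("http://", "https://")   # assumed URL policy for src


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of reserialized output
        self.open_tags = []  # stack of allowed tags still awaiting a close

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unrecognized tag: drop it (comments are dropped by default)
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED[tag]:
                continue  # unrecognized parameter
            if name == "src" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue  # undesirable tag/parameter/value combination
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or disallowed closing tag
        while self.open_tags:  # close anything opened after this tag
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close tags left open at end of input
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    return parser.result()


print(sanitize("<b>bold"))                       # <b>bold</b>
print(sanitize('<img src="javascript:alert(1)">'))  # <img>
```

Every parameter value is emitted in double quotes and every text node is escaped again, so the output never depends on how lenient the browser is with quoting or unclosed tags. Note that this sketch keeps the text inside dropped tags as escaped text; a stricter policy might discard the contents of tags such as <script> and <style> entirely.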



[26] The behavior exhibited by Opera is particularly sneaky: The Unicode whitespace is not recognized by many standard library functions used in server-side HTML sanitizers, such as isspace(...) in libc. This increases the risk of implementation glitches.