Entity Encoding

Let’s talk about character encoding again. As noted on the first pages of this chapter, certain reserved characters are generally unsafe inside text nodes and tag parameter values, and they will often lead to outright syntax errors in XHTML. In order to allow such characters to be used safely (and to allow a convenient way to embed high-bit text), a simple ampersand-prefixed, semicolon-terminated encoding scheme, known as entity encoding, is available to developers.

The most familiar use of this encoding method is the inclusion of certain predefined, named entities. Only a handful of these are specified for XML, but several hundred more are scattered in HTML specifications and supported by all modern browsers. In this approach, &lt; is used to insert a left angle bracket; &gt; substitutes a right angle bracket; &amp; replaces the ampersand itself; while, say, &rarr; is a nice Unicode arrow.

Note

In XHTML documents, additional named entities can be defined using the <!ENTITY> directive and made to resolve to internally defined strings or to the contents of an external file URL. (This last option is obviously unsafe if allowed when processing untrusted content; the resulting attack is usually called XML External Entity, or XXE for short.)
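As a small illustration of the internally defined case, the sketch below feeds an XML document with a custom entity to Python's standard xml.etree.ElementTree module. The entity name greeting and the document itself are invented for this example, and the parser is only a stand-in for a browser's XML engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset defines a custom entity.
DOC = """<!DOCTYPE p [
  <!ENTITY greeting "Hello, world">
]>
<p>&greeting;!</p>"""

# The parser expands &greeting; while building the tree.
root = ET.fromstring(DOC)
print(root.text)
```

The consumer of the tree never sees the entity reference, only its expansion, which is why external definitions become dangerous when the document is untrusted.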

In addition to the named entities, it is also possible to insert an arbitrary ASCII or Unicode character using a decimal &#number; notation. In this case, &#60; maps to a left angle bracket; &#62; substitutes a right one; and &#128569; is, I kid you not, a Unicode 6.0 character named “cat face with tears of joy.” Hexadecimal notation can also be used if the number is prefixed with “x”. In this variant, the left angle bracket becomes &#x3c;, etc.
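The equivalence of the named, decimal, and hexadecimal forms can be checked with a few lines of Python. This is only a sketch that relies on html.unescape from the standard library, which follows HTML5 decoding rules and may differ from any particular browser in edge cases.

```python
import html

# Named, decimal, and hexadecimal forms of the same character.
forms = ["&lt;", "&#60;", "&#x3c;"]
decoded = [html.unescape(f) for f in forms]
print(decoded)  # ['<', '<', '<']

# A named entity and a numeric one that map to non-ASCII code points.
arrow = html.unescape("&rarr;")
cat = html.unescape("&#128569;")
print(hex(ord(arrow)), hex(ord(cat)))  # 0x2192 0x1f639
```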

The HTML parser recognizes entity encoding inside text nodes and parameter values and decodes it transparently when building an in-memory representation of the document tree. Therefore, the following two cases are functionally identical:

<img src="http://www.example.com">

and

<img src="ht&#x74;p&#x3a;//www.example.com">

The following two examples, on the other hand, will not work as expected, as the encoding interferes with the structure of the tag itself:

<img src&#x3d;"http://www.example.com">

and

<img s&#x72;c="http://www.example.com">
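The difference between encoding a parameter value and encoding part of the tag structure can be observed with Python's html.parser module. This is a rough sketch: the AttrCollector class and the attrs_of helper are hypothetical names introduced here, and Python's parser only approximates what a browser does.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Records the attributes of every start tag it sees."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))

def attrs_of(markup):
    """Return the attribute dictionary of the first tag in markup."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen[0][1]

plain = attrs_of('<img src="http://www.example.com">')
encoded_value = attrs_of('<img src="ht&#x74;p&#x3a;//www.example.com">')
encoded_name = attrs_of('<img s&#x72;c="http://www.example.com">')

# Entities inside the value are decoded, so both tags carry the same URL.
print(plain == encoded_value)
# Entities inside the parameter name are not decoded, so no "src" exists.
print("src" in encoded_name)
```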

The largely transparent behavior of entity encoding makes it important to correctly resolve it prior to making any security decisions about the contents of a document and, if applicable, to properly restore it in the sanitized output later on. To illustrate, the following syntax must be recognized as an absolute reference to a javascript: pseudo-URL and not to a cryptic fragment ID inside a relative resource named “./javascript&”:

<a href="javascript&#x3a;alert(1)">
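A minimal sketch of the decode-before-deciding rule follows, using html.unescape and urllib.parse.urlsplit from the Python standard library. The scheme_of helper is a hypothetical name, and a real sanitizer would need to handle far more than this (embedded whitespace, control characters, and browser-specific quirks among them).

```python
import html
from urllib.parse import urlsplit

def scheme_of(attr_value):
    """Decode entities first, then extract the URL scheme (hypothetical helper)."""
    value = html.unescape(attr_value)
    return urlsplit(value.strip()).scheme.lower()

raw = "javascript&#x3a;alert(1)"

# Without decoding, the value looks like a relative path plus a fragment ID.
naive = urlsplit(raw)
print(repr(naive.scheme), naive.path, naive.fragment)

# With decoding, the javascript: scheme becomes visible.
print(scheme_of(raw))
```

The first print shows an empty scheme, a path of javascript& and a fragment of x3a;alert(1), which is precisely the misreading a filter must avoid.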

Unfortunately, even the simple task of recognizing and parsing HTML entities can be tricky. In traditional parsing, for example, entities may often be accepted even if the trailing semicolon is omitted, as long as the next character is not an alphanumeric. (In Firefox, dashes and periods are also accepted in entity names.) Numeric entities are even more problematic, as they may have an overlong notation with an arbitrary number of leading zeros. Moreover, if the numerical value is higher than 2^32, the standard size of an integer on many computer architectures, the corresponding character may be computed incorrectly.
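Some of these quirks can be reproduced with Python's html.unescape, which implements HTML5 decoding rules. The sketch below is illustrative only: the 32-bit wraparound is simulated with a bit mask to show what a hypothetical decoder with fixed-width integers might compute, and it does not describe any specific browser.

```python
import html

# A missing semicolon is tolerated for named and numeric entities.
print(html.unescape("&lt") == "<")
print(html.unescape("&#60") == "<")

# Overlong notation: padding zeros do not change the decoded character.
print(html.unescape("&#0000000060;") == "<")

# A value above 2^32 wraps to 60 (a left angle bracket) if it is
# truncated to 32 bits, as a naive fixed-width decoder might do.
big = 2**32 + 60
wrapped = big & 0xFFFFFFFF
print(wrapped)  # 60

# Python's decoder does not wrap; it substitutes U+FFFD instead.
print(html.unescape(f"&#{big};") == "\ufffd")
```

A filter and a browser that disagree on any one of these cases may end up seeing two different documents.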

Developers working with XHTML should be aware of a potential pitfall in that dialect, too. Although HTML entities are not recognized in most of the special parsing modes, XHTML differs from traditional HTML in that tags such as <script> and <style> do not automatically toggle a special parsing mode on their own. Instead, an explicit <![CDATA[...]]> block around any scripts or stylesheets is required to achieve a comparable effect. Therefore, the following snippet with an attacker-controlled string (otherwise scrubbed for angle brackets, quotes, backslashes, and newlines) is perfectly safe in HTML, but not in XHTML:

<script>
  var tmp = 'I am harmless! &#x27;+alert(1);// Or am I?';
  ...
</script>
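The gap between the two dialects can be approximated in Python by parsing a shortened version of the snippet twice: once with html.parser, which treats script contents as raw text, and once with xml.etree.ElementTree, which stands in for an XML-based XHTML parser. The ScriptText class and the variable names are hypothetical, and neither parser is a faithful model of a browser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<script>var tmp = 'I am harmless! &#x27;+alert(1);// Or am I?';</script>"

class ScriptText(HTMLParser):
    """Collects the text found inside the document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# HTML view: entities inside <script> are left untouched.
p = ScriptText()
p.feed(SNIPPET)
p.close()
html_view = "".join(p.chunks)

# XML view: the entity is decoded, closing the string literal early.
xml_view = ET.fromstring(SNIPPET).text

# XML view with an explicit CDATA block: the entity stays encoded again.
wrapped = SNIPPET.replace("<script>", "<script><![CDATA[")
wrapped = wrapped.replace("</script>", "]]></script>")
cdata_view = ET.fromstring(wrapped).text

print(html_view)
print(xml_view)
print(cdata_view == html_view)
```

In the XML view the script text contains a real apostrophe followed by +alert(1), so the attacker-supplied string escapes the quoted literal.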