The Hypertext Markup Language (HTML) is the primary method of authoring online documents. One of the earliest written accounts of this language is a brief summary posted on the Internet by Tim Berners-Lee in 1991.[136] His proposal outlines an SGML-derived syntax that allows text documents to be annotated with inline hyperlinks and several types of layout aids. In the following years, this specification evolved gradually under the direction of Berners-Lee and Dan Connolly, but it wasn’t until 1995, at the onset of the First Browser Wars, that a reasonably serious and exhaustive specification of the language (HTML 2.0) made it to RFC 1866.[137]
From that point on, all hell broke loose: For the next few years, competing browser vendors kept introducing all sorts of flashy, presentation-oriented features and tweaked the language to their liking. Several attempts to amend the original RFC were undertaken, but ultimately the IETF-managed standardization approach proved too inflexible. The newly formed World Wide Web Consortium took over the maintenance of the language and eventually published the HTML 3.2 specification in 1997.[138]
The new specification tried to reconcile the differences in browser implementations while embracing many of the bells and whistles that appealed to the public, such as customizable text colors and variable typefaces. Ultimately, though, HTML 3.2 proved to be a step back for the clarity of the language and had only limited success in catching up with the reality of browser behavior.
In the following years, the work on HTML 4 and 4.01[139] focused on pruning HTML of all accumulated excess and on better explaining how document elements should be interpreted and rendered. It also defined an alternative, strict XHTML syntax derived from XML, which was much easier to consistently parse but more punishing to write. Despite all this work, however, only a small fraction of all websites on the Internet could genuinely claim compliance with any of these standards, and little or no consistency in parsing modes and error recovery could be seen on the client end. Consequently, some of the work on improving the core language fizzled out, and the W3C turned its attention to stylesheets, the Document Object Model, and other more abstract or forward-looking challenges.
In the late 2000s, some of the low-level work was revived under the banner of HTML5,[140] an ambitious project to normalize almost every aspect of the language’s syntax and parsing, define all the related APIs, and more closely police browser behavior in general. Time will tell if it will be successful; until then, the language itself, as well as each of the four leading parsing engines,[25] comes with its own set of frustrating quirks.
From a purely theoretical standpoint, HTML relies on a fairly simple syntax: a hierarchical structure of tags, name=value tag parameters, and text nodes (forming the actual document body) in between. For example, a simple document with a title, a heading, and a hyperlink may look like this:
<html>
  <head>
    <title>Hello world</title>
  </head>
  <body>
    <h1>Welcome to our example page</h1>
    <a href="http://www.example.com/">Click me!</a>
  </body>
</html>
This syntax puts some constraints on what may appear inside a parameter value or inside the document body. Five characters—angle brackets, single and double quotes, and an ampersand—are reserved as the building blocks of the HTML markup, and these need to be avoided or escaped in some way when used outside of their intended function. The most important rules are:
Stray ampersands (&) should never appear in most sections of an HTML document.
Both types of angle brackets are obviously problematic inside a tag, unless properly quoted.
The left angle bracket (<) is a hazard inside a text node.
Quote characters appearing inside a tag can have undesirable effects, depending on their exact location, but are harmless in text nodes.
To allow these characters to appear in problematic locations without causing side effects, an ampersand-based encoding scheme, discussed in “Entity Encoding” under “HTML Parsing Survival Tips,” is provided.
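As a quick illustration of that scheme, every reserved character meant literally in the snippet below is replaced with its named entity: &amp; for the ampersand, &lt; for the left angle bracket, and &gt; for the right one:

<a href="http://www.example.com/?a=1&amp;b=2">
  Tom &amp; Jerry agree that 2 &lt; 3 and 3 &gt; 2
</a>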
Of course, the availability of such an encoding scheme is not a guarantee of its use. The failure to properly filter out or escape reserved characters when displaying user-controlled data is the cause of a range of extremely common and deadly web application security flaws. A particularly well-known example of this is cross-site scripting (XSS), an attack in which malicious, attacker-provided JavaScript code is unintentionally echoed back somewhere in the HTML markup, effectively giving the attacker full control over the appearance and operation of the targeted site.
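As a minimal sketch of the problem (the page and query term here are made up for illustration), consider a search page that echoes the user-supplied query back into its results without escaping it:

<!-- Intended markup for a benign query, "kittens": -->
<p>Results for: kittens</p>

<!-- Markup produced when an attacker submits a query of
     "<script>alert(document.cookie)</script>" and the site fails to
     escape it; the attacker-supplied code now runs on the page: -->
<p>Results for: <script>alert(document.cookie)</script></p>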
For any HTML document, a top-level <!DOCTYPE> directive may be used to instruct the browser to parse the file in a manner that at least superficially conforms to one of the officially defined standards; to a more limited extent, the same signal can be conveyed by the Content-Type header, too. Of all the available parsing modes, the most striking difference exists between XHTML and traditional HTML. In the traditional mode, parsers will attempt to recover from most types of syntax errors, including unmatched opening and closing tags. In addition, tag and parameter names will be considered case insensitive, parameter values will not always need to be quoted, and certain types of tags, such as <img>, will be closed implicitly. In other words, the following input will be grudgingly tolerated:
<hTmL>
<BODY>
<IMG src="/hello_world.jpg">
<a HREF=http://www.example.com/>
Click me!
</oops>
</html>
The XML mode, on the other hand, is strict: All tags need to be balanced carefully, named using the proper case, and closed explicitly. (The XML-specific self-closing tag syntax, such as <img />, is permitted.) In addition, most syntax mistakes, even trivial ones, will result in an error and prevent the document from being displayed at all.
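For comparison, one possible well-formed XHTML rendition of the earlier example is shown below. The <!DOCTYPE> declaration and xmlns value are the standard XHTML 1.0 Strict identifiers; note the lowercase tag names, quoted attributes, and the explicitly closed <img /> tag. (As noted earlier, the Content-Type header, here presumably application/xhtml+xml, also plays a role in mode selection.)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hello world</title></head>
  <body>
    <p><img src="/hello_world.jpg" alt="Hello world" /></p>
    <p><a href="http://www.example.com/">Click me!</a></p>
  </body>
</html>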
Unlike the regular flavor of HTML, XML-based documents may also elegantly incorporate sections using other XML-compliant markup formats, such as MathML, a mathematical formula markup language. This is done by specifying a different xmlns namespace setting for a particular tag, with no need for one-off, language-level hacks.
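For instance, a simple formula such as x² could be embedded in an XHTML document by switching a single subtree to the standard MathML namespace:

<p xmlns="http://www.w3.org/1999/xhtml">
  The area grows as
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <msup><mi>x</mi><mn>2</mn></msup>
  </math>
</p>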
The last important difference worth mentioning here is that traditional HTML parsing strategies feature a selection of special modes, entered into after certain tags are encountered and exited only when a specific terminator string is seen; everything in between is interpreted as non-HTML text. Examples of such special tags include <style>, <script>, <textarea>, and <xmp>. In practical implementations, these modes are exited only upon a literal, case-insensitive match on </style, </script, or a similar appropriate value; any other markup inside such a block will not be interpreted as HTML. (Interestingly, there is one officially obsolete tag, <plaintext>, that cannot be exited at all; it stays in effect for the remainder of the document.)
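One counterintuitive consequence of this literal matching is that the terminator is honored even when it appears inside a JavaScript string. In the following hypothetical snippet, the parser leaves the script-parsing mode at the quoted </script>, and everything after that point is treated as ordinary markup rather than code:

<script>
  // The parser stops at the first literal "</script>" below, even
  // though, to JavaScript, it is merely part of a string:
  var markup = "</script>";
  alert(markup);  // never executed as script; rendered as page text
</script>

A common workaround is to break up the offending literal, for example by writing "</scr" + "ipt>" instead.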
In comparison, the XML mode is more predictable. It generally forbids stray “<” and “&” characters inside the document, but it provides a special syntax, starting with “<![CDATA[” and ending with “]]>”, as a way to encapsulate any raw text inside an arbitrary tag. For example:
<script>
  <![CDATA[
    alert('>>> Hello world! <<<');
  ]]>
</script>
The other notable special parsing mode available in both XHTML and normal HTML is a comment block. In XML, it quite simply begins with “<!--” and ends with “-->”. In the traditional HTML parser in Firefox versions prior to 4, any occurrence of “--”, later followed by “>”, is also considered good enough to terminate the comment.
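To illustrate (with made-up content), a strict XML parser reads the following line as a single comment that ends only at the final “-->”; the legacy Firefox behavior described above would instead end it at the first “>” that follows a “--”, causing the tail of the comment to be rendered as visible text:

<!-- one comment, or two? -- surprise! > This trailing text may show up on the page. -->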
The low-level syntax of the language aside, HTML is also the subject of a fascinating conceptual struggle: a clash between the ideology and the reality of the online world. Tim Berners-Lee always championed the vision of a semantic web, an interconnected system of documents in which every functional block, such as a citation, a snippet of code, a mailing address, or a heading, has its meaning explained by an appropriate machine-readable tag (say, <cite>, <code>, <address>, or <h1> to <h6>).
This approach, he and other proponents argued, would make it easier for machines to crawl, analyze, and index the content in a meaningful way, and in the near future, it would enable computers to reason using the sum of human knowledge. According to this philosophy, the markup language should provide a way to stylize the appearance of a document, but only as an afterthought.
Berners-Lee has never given up on this dream, but in this one regard, the actual use of HTML proved to be very different from what he wished for. Web developers were quick to pragmatically distill the essence of HTML 3.2 into a handful of presentation-altering but semantically neutral tags, such as <font>, <b>, and <pre>, and saw no reason to explain the structure of their documents to the browser any further. The W3C attempted to combat this trend, but with limited success. Although tags such as <font> have been successfully obsoleted and largely abandoned in favor of CSS, this is only because stylesheets offered more powerful and consistent visual controls. With the help of CSS, developers simply started relying on a soup of semantically agnostic <span> and <div> tags to build everything from headings to user-clickable buttons, all in a manner completely opaque to any automated content extraction tools.
Despite its lasting impact on the design of the language, the idea of a semantic web may in some ways be becoming obsolete: Online content maps less and less frequently to the concept of a single, viewable document, and HTML is often reduced to providing a convenient drawing surface and graphic primitives for JavaScript applications to build their interfaces with.
[25] To process HTML documents, Internet Explorer uses the Trident engine (aka MSHTML); Firefox and some derived products use Gecko; Safari, Chrome, and several other browsers use WebKit; and Opera relies on Presto. With the exception of WebKit, a collaborative open source effort maintained by several vendors, these engines are developed largely in-house by their respective browser teams.