Elements, Tags, and Character Data

The document in Example 2-1 is composed of a single element named person. The element is delimited by the start-tag <person> and the end-tag </person>. Everything between the start-tag and the end-tag of the element (exclusive) is called the element's content. The content of this element is the text:

  Alan Turing

The whitespace is part of the content, although many applications will choose to ignore it. <person> and </person> are markup. The string "Alan Turing" and its surrounding whitespace are character data. The tag is the most common form of markup in an XML document, but there are other kinds we'll discuss later.
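
To see the distinction between markup and character data in practice, here is a minimal Python sketch using the standard library's xml.etree.ElementTree parser. The document string is an assumed reconstruction of a one-element document like the one described above, not a copy of Example 2-1, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for a single-element document (not the book's listing).
doc = "<person>\n  Alan Turing\n</person>"

root = ET.fromstring(doc)

# The tags are markup; the parser reports only the element name.
element_name = root.tag

# The content is character data, and the surrounding whitespace is part of it.
raw_content = root.text

# An application that chooses to ignore the whitespace can strip it itself.
trimmed_content = raw_content.strip()

print(element_name)
print(repr(raw_content))
print(trimmed_content)
```

The parser hands back the whitespace untouched; discarding it is a decision the application makes, which matches the point that whitespace is part of the content.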

Superficially, XML tags look like HTML tags. Start-tags begin with < and end-tags begin with </. Both of these are followed by the name of the element and are closed by >. However, unlike HTML tags, you are allowed to make up new XML tags as you go along. To describe a person, use <person> and </person> tags. To describe a calendar, use <calendar> and </calendar> tags. The names of the tags generally reflect the type of content inside the element, not how that content will be formatted.
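
The tag syntax just described is mechanical enough to sketch in a few lines of Python. The helper make_element below is hypothetical, written only for illustration; it builds a start-tag and an end-tag around some text and then checks that a standard XML parser accepts whichever element name was chosen.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def make_element(name, text):
    """Hypothetical helper: wrap text in a start-tag and a matching end-tag."""
    # A start-tag is <name>; an end-tag is </name>.
    # escape() replaces &, <, and > so the text cannot be mistaken for markup.
    return "<" + name + ">" + escape(text) + "</" + name + ">"

# Any well-formed name works; the parser has no fixed list of tags.
person = make_element("person", "Alan Turing")
calendar = make_element("calendar", "June 1912")

parsed_names = [ET.fromstring(xml).tag for xml in (person, calendar)]
print(parsed_names)
```

The sketch assumes the chosen name is a legal XML name; it does not validate that, and the sample text values are placeholders.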

Let's look at a slightly more complicated XML document. Example 2-2 is a person element that contains more information suitably marked up to show its meaning.

In Example 2-2, the contents of the first_name, last_name, and profession elements are character data; that is, text that does not contain any tags. The contents of the person and name elements are child elements and some whitespace that most applications will ignore. This dichotomy between elements that contain only character data and elements that contain only child elements (and possibly a little whitespace) is common in record-like documents. However, XML can also be used for more free-form, narrative documents, such as business reports, magazine articles, student essays, short stories, web pages, and so forth, as shown by Example 2-3.
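
The dichotomy in a record-like document can be checked programmatically. The following Python sketch uses an assumed document built from the element names mentioned above (it is not the text of Example 2-2), and a hypothetical classify function that reports what kind of content each element has.

```python
import xml.etree.ElementTree as ET

# Assumed record-like document using the element names from the text.
doc = """<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>mathematician</profession>
</person>"""

def classify(elem):
    """Hypothetical helper: describe the kind of content an element has."""
    has_children = len(elem) > 0
    # Text directly inside elem is its .text plus the .tail of each child.
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    has_text = any(piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed content"
    if has_children:
        return "child elements only"
    if has_text:
        return "character data only"
    return "empty"

kinds = {elem.tag: classify(elem) for elem in ET.fromstring(doc).iter()}
for tag, kind in kinds.items():
    print(tag, "->", kind)
```

Here whitespace-only text between child elements is treated as ignorable, which is the choice the text says most applications make.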

The root element of this document is biography. The biography contains paragraph and definition child elements. It also contains some whitespace. The paragraph and definition elements contain still other elements, including term, emphasize, name, and profession. They also contain some unmarked-up character data. Elements like paragraph and definition that contain child elements and non-whitespace character data are said to have mixed content. Mixed content is common in XML documents containing articles, essays, stories, books, novels, reports, web pages, and anything else that's organized as a written narrative. Mixed content is less common, and harder to work with, in computer-generated and computer-processed XML documents used for purposes such as database exchange, object serialization, persistent file formats, and so on. One of the strengths of XML is the ease with which it can be adapted to the very different requirements of human-authored and computer-generated documents.
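
A short Python sketch may help show why mixed content takes more care to process. The paragraph below is an invented sentence that reuses the element names from the text; it is not taken from Example 2-3. In xml.etree.ElementTree, character data that follows a child element is stored on that child's tail attribute rather than on the parent, so recovering the narrative means stitching the pieces together.

```python
import xml.etree.ElementTree as ET

# Assumed mixed-content paragraph (illustrative, not the book's listing).
doc = ("<paragraph><name>Alan Turing</name> was a "
       "<profession>mathematician</profession> and wrote "
       "<emphasize>many</emphasize> papers.</paragraph>")

para = ET.fromstring(doc)

# Character data that sits between and after the child elements.
tails = [child.tail for child in para]

# itertext() walks the element in document order and yields every text piece.
full_text = "".join(para.itertext())

print(tails)
print(full_text)
```

In a record-like document the tail values would be nothing but whitespace; here they carry part of the sentence, which is what makes the content mixed.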