Element Declarations

Every element used in a valid document must be declared in the document's DTD with an element declaration. Element declarations have this basic form:

<!ELEMENT name content_specification>

The name of the element can be any legal XML name. The content specification indicates what children the element may or must have and in what order. Content specifications can be quite complex. They can say, for example, that an element must have three child elements of a given type, or two children of one type followed by another element of a second type, or any elements chosen from seven different types interspersed with text.

#PCDATA

The simplest content specification is one that says an element may only contain parsed character data, but may not contain any child elements of any type. In this case the content specification consists of the keyword #PCDATA inside parentheses. For example, this declaration says that a phone_number element may contain text but may not contain elements:

<!ELEMENT phone_number (#PCDATA)>

Such an element may also contain character references and CDATA sections (both of which are always parsed into pure text), as well as comments and processing instructions (which don't really count in validation). It may contain entity references only if those entity references resolve to plain text without any child elements.
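
As a quick illustration, here is one phone_number element that would pass validation against that declaration and one that would not. The phone number itself and the extension element are invented for this sketch and are not part of any real vocabulary:

<!-- valid: plain text, a character reference, and a comment -->
<phone_number>212&#x2D;555-1234<!-- main line --></phone_number>

<!-- invalid: extension is a child element, which (#PCDATA) does not allow -->
<phone_number>212-555-1234 <extension>17</extension></phone_number>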

Child Elements

Another simple content specification is one that says the element must have exactly one child of a given type. In this case, the content specification consists of the name of the child element inside parentheses. For example, this declaration says that a fax element must contain exactly one phone_number element:

<!ELEMENT fax (phone_number)>

A fax element may not contain anything except the phone_number element, and it may not contain more or fewer than one of those.
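
For example, assuming phone_number is declared as (#PCDATA) as in the previous section, only the first of these fax elements would be valid. The phone numbers are made up for the illustration:

<!-- valid: exactly one phone_number child -->
<fax>
  <phone_number>212-555-1234</phone_number>
</fax>

<!-- invalid: two phone_number children -->
<fax>
  <phone_number>212-555-1234</phone_number>
  <phone_number>212-555-5678</phone_number>
</fax>

<!-- invalid: character data directly inside fax, and no phone_number child -->
<fax>212-555-1234</fax>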

Sequences

In practice, a content specification that lists exactly one child element is rare. Most elements contain either parsed character data or (at least potentially) multiple child elements. The simplest way to indicate multiple child elements is to separate them with commas. This is called a sequence. It indicates that the named elements must appear in the specified order. For example, this element declaration says that a name element must contain exactly one first_name child element followed by exactly one last_name child element:

<!ELEMENT name (first_name, last_name)>

Given this declaration, this name element is valid:

<name>
  <first_name>Madonna</first_name>
  <last_name>Ciccone</last_name>
</name>

However, this one is not valid because it flips the order of two elements:

<name>
  <last_name>Ciccone</last_name>
  <first_name>Madonna</first_name>
</name>

This element is invalid because it omits the last_name element:

<name>
  <first_name>Madonna</first_name>
</name>

This one is invalid because it adds a middle_name element:

<name>
  <first_name>Madonna</first_name>
  <middle_name>Louise</middle_name>
  <last_name>Ciccone</last_name>
</name>

The Number of Children

As the previous examples indicate, not all instances of a given element necessarily have exactly the same children. You can affix one of three suffixes to an element name in a content specification to indicate how many of that element are expected at that position. These suffixes are:

? Zero or one instance of the element is allowed.

* Zero or more instances of the element are allowed.

+ One or more instances of the element are required.

For example, this declaration says that a name element must contain exactly one first_name, may or may not contain a middle_name, and may or may not contain a last_name:

<!ELEMENT name (first_name, middle_name?, last_name?)>

Given this declaration, all these name elements are valid:

<name>
  <first_name>Madonna</first_name>
  <last_name>Ciccone</last_name>
</name>
<name>
  <first_name>Madonna</first_name>
  <middle_name>Louise</middle_name>
  <last_name>Ciccone</last_name>
</name>
<name>
  <first_name>Madonna</first_name>
</name>

However, these are not valid:

<name>
  <first_name>George</first_name>
  <!-- only one middle name is allowed -->
  <middle_name>Herbert</middle_name>
  <middle_name>Walker</middle_name>
  <last_name>Bush</last_name>
</name>
<name>
  <!-- first name must precede last name -->
  <last_name>Ciccone</last_name>
  <first_name>Madonna</first_name>
</name>

You can allow for multiple middle names by placing an asterisk after the middle_name:

<!ELEMENT name (first_name, middle_name*, last_name?)>

If you wanted to require a middle_name to be included, but still allow for multiple middle names, you'd use a plus sign instead, like this:

<!ELEMENT name (first_name, middle_name+, last_name?)>
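
To make the difference concrete, here is a sketch of how the plus-sign version of the declaration would treat a few name elements:

<!-- valid: one middle_name -->
<name>
  <first_name>George</first_name>
  <middle_name>Herbert</middle_name>
  <last_name>Bush</last_name>
</name>

<!-- valid: two middle_name elements, and last_name is optional -->
<name>
  <first_name>George</first_name>
  <middle_name>Herbert</middle_name>
  <middle_name>Walker</middle_name>
</name>

<!-- invalid: at least one middle_name is required -->
<name>
  <first_name>George</first_name>
  <last_name>Bush</last_name>
</name>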

Choices

Sometimes one instance of an element may contain one kind of child, and another instance may contain a different child. This can be indicated with a choice. A choice is a list of element names separated by vertical bars. For example, this declaration says that a methodResponse element contains either a params child or a fault child:

<!ELEMENT methodResponse (params | fault)>

However, it cannot contain both at once. Each methodResponse element must contain one or the other.
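
For instance, assuming params and fault are declared elsewhere in the DTD (their content is elided here with ...), the first two of these elements would be valid and the last two would not:

<!-- valid: a params child -->
<methodResponse><params>...</params></methodResponse>

<!-- valid: a fault child -->
<methodResponse><fault>...</fault></methodResponse>

<!-- invalid: both children at once -->
<methodResponse><params>...</params><fault>...</fault></methodResponse>

<!-- invalid: neither child -->
<methodResponse></methodResponse>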

Choices can be extended to an indefinite number of possible elements. For example, this declaration says that each digit element can contain exactly one of the child elements named zero, one, two, three, four, five, six, seven, eight, or nine:

<!ELEMENT digit
 (zero | one | two | three | four | five | six | seven | eight | nine)
>

Parentheses

Individually, choices, sequences, and suffixes are fairly limited. However, they can be combined in arbitrarily complex fashions to describe most reasonable content models. Either a choice or a sequence can be enclosed in parentheses. When so enclosed, the choice or sequence can be suffixed with a ?, *, or +. Furthermore, the parenthesized item can be nested inside other choices or sequences.

For example, let's suppose you want to say that a circle element contains a center element and either a radius or a diameter element, but not both. This declaration does that:

<!ELEMENT circle (center, (radius | diameter))>
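
As a sketch of what this allows, the first two circle elements below would be valid and the third would not. The content of center is elided with ..., and the assumption that radius and diameter simply hold a number as text is made only for this illustration:

<!-- valid: center followed by radius -->
<circle>
  <center>...</center>
  <radius>5</radius>
</circle>

<!-- valid: center followed by diameter -->
<circle>
  <center>...</center>
  <diameter>10</diameter>
</circle>

<!-- invalid: both radius and diameter -->
<circle>
  <center>...</center>
  <radius>5</radius>
  <diameter>10</diameter>
</circle>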

To continue with a geometry example, suppose a center element can either be defined in terms of Cartesian or polar coordinates. Then each center contains either an x and a y or an r and a θ. We would declare this using two small sequences, each of which is parenthesized and combined in a choice:

 <!ELEMENT center ((x, y) | (r, θ))>

Suppose you don't really care whether the x element comes before the y element or vice versa, nor do you care whether r comes before θ. Then you can expand the choice to cover all four possibilities:

 <!ELEMENT center ((x, y) | (y, x) | (r, θ) | (θ, r) )>

As the number of elements in the sequence grows, the number of permutations grows factorially, which is faster than exponentially: n elements can be arranged in n! different orders. Thus, this technique really isn't practical past two or three child elements. DTDs are not very good at saying you want n instances of A and m instances of B, but you don't really care which order they come in.
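
To see how quickly this gets out of hand, consider a hypothetical point element (invented for this sketch, not part of the earlier examples) that accepts one x, one y, and one z child in any order. Three children already require 3! = 6 orderings. They are grouped here by their first element so that the content model stays deterministic, which XML expects of content models: a parser should be able to tell which branch it is in without looking ahead.

<!ELEMENT point ( (x, ((y, z) | (z, y)))
                | (y, ((x, z) | (z, x)))
                | (z, ((x, y) | (y, x))) )>

A fourth child would raise the count to 24 orderings, and a fifth to 120.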

Suffixes can be applied to parenthesized elements, too. For instance, let's suppose that a polygon is defined by individual coordinates for each vertex, given in order. For example, this is a right triangle:

<polygon>
  <r>0</r>  <θ>0</θ>
  <x>0</x>  <y>10</y>
  <x>10</x> <y>0</y>
</polygon>

What we want to say is that a polygon is composed of three or more pairs of x-y or r-θ coordinates. An x is always followed by a y, and an r is always followed by a θ. This declaration does that:

<!ELEMENT polygon 
  (((x, y) | (r, θ)), ((x, y) | (r, θ)), ((x, y) | (r, θ))+)>

The plus sign is applied only to the third ((x, y) | (r, θ)) group.

To return to the name example, suppose you want to say that a name can contain just a first name, just a last name, or a first name and a last name with an indefinite number of middle names. This declaration achieves that:

<!ELEMENT name (last_name
               | (first_name, ( (middle_name+, last_name) | (last_name?) ))
               ) >
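
As a rough check of that content model, here are a few name elements it is meant to accept and one it is meant to reject:

<!-- valid: just a last name -->
<name>
  <last_name>Bush</last_name>
</name>

<!-- valid: just a first name -->
<name>
  <first_name>George</first_name>
</name>

<!-- valid: first name, two middle names, last name -->
<name>
  <first_name>George</first_name>
  <middle_name>Herbert</middle_name>
  <middle_name>Walker</middle_name>
  <last_name>Bush</last_name>
</name>

<!-- invalid: middle names must be followed by a last name -->
<name>
  <first_name>George</first_name>
  <middle_name>Herbert</middle_name>
</name>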

Mixed Content

In narrative documents, it's common for a single element to contain both child elements and un-marked up, nonwhitespace character data. For example, recall this definition element from Chapter 2:

<definition>A <term>Turing Machine</term> refers to an abstract finite 
state automaton with infinite memory that can be proven equivalent 
to any other finite state automaton with arbitrarily large memory. 
Thus what is true for one Turing machine is true for all Turing 
machines no matter how implemented.
</definition>

The definition element contains some nonwhitespace text and a term child. This is called mixed content. An element that contains mixed content is declared like this:

<!ELEMENT definition (#PCDATA | term)*>

This says that a definition element may contain parsed character data and term children. It does not specify in which order they appear, nor how many instances of each appear. This declaration allows a definition to have 1 term child, 0 term children, or 23 term children.
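
For example, all of the first three definition elements below would be valid against that declaration. The fourth would not, because emphasize is not in the list of permitted children. The wording of the definitions is invented for this sketch:

<!-- valid: text only, no term children -->
<definition>A Turing machine is an abstract automaton.</definition>

<!-- valid: a term child only, no text -->
<definition><term>Turing Machine</term></definition>

<!-- valid: text and two term children in any order -->
<definition>A <term>Turing Machine</term> is not the same thing as
a <term>Turing test</term>.</definition>

<!-- invalid: emphasize is not listed in the declaration -->
<definition>A <emphasize>Turing Machine</emphasize> is abstract.</definition>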

You can add any number of other child elements to the list of mixed content, although #PCDATA must always be the first child in the list. For example, this declaration says that a paragraph element may contain any number of name, profession, footnote, emphasize, and date elements in any order, interspersed with parsed character data:

<!ELEMENT paragraph
  (#PCDATA | name | profession | footnote | emphasize | date )*
>

This is the only way to indicate that an element contains mixed content. You cannot say, for example, that there must be exactly one term child of the definition element, as well as parsed character data. You cannot say that the parsed character data must all come after the term child. You cannot use parentheses around a mixed-content declaration to make it part of a larger grouping. You can only say that the element contains any number of any elements from a particular list in any order, as well as undifferentiated parsed character data.
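
To illustrate these restrictions, none of the following attempts is a legal element declaration. The title element is hypothetical and appears only to show the nesting problem:

<!-- not legal: #PCDATA must come first in the list -->
<!ELEMENT definition (term | #PCDATA)*>

<!-- not legal: #PCDATA cannot be part of a sequence -->
<!ELEMENT definition (term, #PCDATA)>

<!-- not legal: a mixed-content group cannot be nested in a larger group -->
<!ELEMENT definition (title, (#PCDATA | term)*)>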

Empty Elements

Some elements do not have any content at all. These are called empty elements and are sometimes written with a closing />. For example:

<image source="bus.jpg" width="152" height="345"
       alt="Alan Turing standing in front of a bus"
/>

These elements are declared by using the keyword EMPTY for the content specification. For example:

<!ELEMENT image EMPTY>

This merely says that the image element must be empty, not that it must be written with an empty-element tag. Given this declaration, this is also a valid image element:

<image source="bus.jpg" width="152" height="345"
       alt="Alan Turing standing in front of a bus"></image>

If an element is empty, then it can contain nothing, not even whitespace. For instance, this is an invalid image element:

<image source="bus.jpg" width="152" height="345"
       alt="Alan Turing standing in front of a bus">
</image>

ANY

Very loose DTDs occasionally want to say that an element exists without making any assertions about what it may or may not contain. In this case, you can specify the keyword ANY as the content specification. For example, this declaration says that a page element can contain any content, including mixed content, child elements, and even other page elements:

<!ELEMENT page ANY>

The children that actually appear in the page elements' content in the document must still be declared in element declarations of their own. ANY does not allow you to use undeclared elements.
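
As a sketch, suppose the DTD contains only the following two declarations. The heading and sidebar elements are invented for this illustration:

<!ELEMENT page ANY>
<!ELEMENT heading (#PCDATA)>

Then the first page element below would be valid, but the second would not, because sidebar has no declaration of its own:

<!-- valid: text, a declared child, and a nested page -->
<page>Some text, a <heading>Contents</heading> child,
and <page>a nested page</page>.</page>

<!-- invalid: sidebar is not declared anywhere in the DTD -->
<page><sidebar>Not declared</sidebar></page>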

ANY is sometimes useful when you're just beginning to design the DTD and document structure and you don't yet have a clear picture of how everything fits together. However, it's extremely bad form to use ANY in finished DTDs. About the only time you'll see it used is when external DTD subsets and entities may change in uncontrollable ways. However, this is actually quite rare. You'd really only need this if you were writing a DTD for an application like XSLT or RDF that wraps content from arbitrary, unknown XML applications.