Every element used in a valid document must be declared in the document's DTD with an element declaration. Element declarations have this basic form:
<!ELEMENT name content_specification>
The name of the element can be any legal XML name. The content specification indicates what children the element may or must have and in what order. Content specifications can be quite complex. They can say, for example, that an element must have three child elements of a given type, or two children of one type followed by another element of a second type, or any elements chosen from seven different types interspersed with text.
The simplest content specification is one that says an
element may only contain parsed character data, but may not contain any child
elements of any type. In this case the content specification
consists of the keyword #PCDATA
inside parentheses. For example, this declaration says that a
phone_number
element may contain
text but may not contain elements:
<!ELEMENT phone_number (#PCDATA)>
Such an element may also contain character references and CDATA
sections (which are always parsed into pure text), as well as
comments and processing instructions (which don't really count in
validation). It may contain entity references only if those entity
references resolve to plain text without any child elements.
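As a minimal sketch of what this allows, the following complete document should validate against its internal DTD subset. The area_code entity is a hypothetical name introduced only for illustration; it resolves to plain text, so it is permitted inside phone_number.

```xml
<?xml version="1.0"?>
<!DOCTYPE phone_number [
  <!ELEMENT phone_number (#PCDATA)>
  <!-- Hypothetical internal entity that resolves to plain text -->
  <!ENTITY area_code "212">
]>
<!-- An entity reference, a character reference, a CDATA section, and a comment,
     but no child elements -->
<phone_number>&area_code;&#x2D;555<![CDATA[-1234]]><!-- comments are allowed too --></phone_number>
```

After parsing, the element's text content is simply 212-555-1234.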
Another simple content specification is one that says
the element must have exactly one child of a given type. In this
case, the content specification consists of the name of the child
element inside parentheses. For example, this declaration says that
a fax
element must contain
exactly one phone_number
element:
<!ELEMENT fax (phone_number)>
A fax element may not contain anything except the phone_number
element, and it may not contain more or fewer than one of those.
In practice, a content specification that lists exactly
one child element is rare. Most elements contain either parsed
character data or (at least potentially) multiple child elements.
The simplest way to indicate multiple child elements is to separate
them with commas. This is called a sequence. It indicates that the
named elements must appear in the specified order. For example, this
element declaration says that a name
element must contain exactly one
first_name
child element followed
by exactly one last_name
child
element:
<!ELEMENT name (first_name, last_name)>
Given this declaration, this name
element is valid:
<name>
  <first_name>Madonna</first_name>
  <last_name>Ciccone</last_name>
</name>
However, this one is not valid because it flips the order of two elements:
<name>
  <last_name>Ciccone</last_name>
  <first_name>Madonna</first_name>
</name>
This element is invalid because it omits the last_name
element:
<name>
  <first_name>Madonna</first_name>
</name>
This one is invalid because it adds a middle_name
element:
<name>
  <first_name>Madonna</first_name>
  <middle_name>Louise</middle_name>
  <last_name>Ciccone</last_name>
</name>
As the previous examples indicate, not all instances of a given element necessarily have exactly the same children. You can affix one of three suffixes to an element name in a content specification to indicate how many of that element are expected at that position. These suffixes are:
?   Zero or one of the element is allowed.
*   Zero or more of the element are allowed.
+   One or more of the element are required.
For example, this declaration says that a name element must contain
exactly one first_name, may or may not contain a middle_name, and may
or may not contain a last_name:
<!ELEMENT name (first_name, middle_name?, last_name?)>
Given this declaration, all these name
elements are valid:
<name>
  <first_name>Madonna</first_name>
  <last_name>Ciccone</last_name>
</name>
<name>
  <first_name>Madonna</first_name>
  <middle_name>Louise</middle_name>
  <last_name>Ciccone</last_name>
</name>
<name>
  <first_name>Madonna</first_name>
</name>
However, these are not valid:
<name>
  <first_name>George</first_name>
  <!-- only one middle name is allowed -->
  <middle_name>Herbert</middle_name>
  <middle_name>Walker</middle_name>
  <last_name>Bush</last_name>
</name>
<name>
  <!-- first name must precede last name -->
  <last_name>Ciccone</last_name>
  <first_name>Madonna</first_name>
</name>
You can allow for multiple middle names by placing an asterisk
after the middle_name:
<!ELEMENT name (first_name, middle_name*, last_name?)>
If you wanted to require a middle_name
to be included, but still
allow for multiple middle names, you'd use a plus sign instead, like
this:
<!ELEMENT name (first_name, middle_name+, last_name?)>
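To see the middle_name+ declaration in context, here is a sketch of a complete document with an internal DTD subset. It assumes the three child elements hold plain text, so each is declared as (#PCDATA); every element used in a valid document needs a declaration of its own.

```xml
<?xml version="1.0"?>
<!DOCTYPE name [
  <!ELEMENT name (first_name, middle_name+, last_name?)>
  <!-- Assumption: the leaf elements contain only text -->
  <!ELEMENT first_name (#PCDATA)>
  <!ELEMENT middle_name (#PCDATA)>
  <!ELEMENT last_name (#PCDATA)>
]>
<name>
  <first_name>George</first_name>
  <middle_name>Herbert</middle_name>
  <middle_name>Walker</middle_name>
  <last_name>Bush</last_name>
</name>
```

The same instance is invalid under the middle_name? version of the declaration, because that version permits at most one middle name.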
Sometimes one instance of an element may contain one kind of
child, and another instance may contain a different child. This can
be indicated with a choice. A choice is a list
of element names separated by vertical bars. For example, this
declaration says that a methodResponse
element contains either a
params
child or a fault
child:
<!ELEMENT methodResponse (params | fault)>
However, it cannot contain both at once. Each methodResponse
element must contain one or
the other.
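A minimal sketch of a document that uses this choice follows. The content models for params and fault are an assumption made to keep the example short; they are reduced to plain text here, whereas a real application would likely give them more structure.

```xml
<?xml version="1.0"?>
<!DOCTYPE methodResponse [
  <!ELEMENT methodResponse (params | fault)>
  <!-- Assumption: simplified to text content for illustration only -->
  <!ELEMENT params (#PCDATA)>
  <!ELEMENT fault (#PCDATA)>
]>
<methodResponse>
  <fault>Too many parameters</fault>
</methodResponse>
```

Replacing the fault child with a single params child keeps the document valid; including both, or neither, does not.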
Choices can be extended to an indefinite number of possible
elements. For example, this declaration says that each digit element
can contain exactly one of the child elements named zero, one, two,
three, four, five, six, seven, eight, or nine:
<!ELEMENT digit (zero | one | two | three | four | five | six | seven | eight | nine) >
Individually, choices, sequences, and suffixes are fairly limited. However,
they can be combined in arbitrarily complex fashions to describe
most reasonable content models. Either a choice or a sequence can be
enclosed in parentheses. When so enclosed, the choice or sequence
can be suffixed with a ?, *, or +. Furthermore, the parenthesized
item can be nested inside other choices or sequences.
For example, let's suppose you want to say that a circle
element contains a center
element and either a radius
or a diameter
element, but not both. This
declaration does that:
<!ELEMENT circle (center, (radius | diameter))>
To continue with a geometry example, suppose a center
element can either be defined in
terms of Cartesian or polar coordinates. Then each center contains
either an x
and a y
or an r
and a θ. We would declare this using two small
sequences, each of which is parenthesized and combined in a
choice:
<!ELEMENT center ((x, y) | (r, θ))>
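Putting the circle and center declarations together, a complete document might look like the sketch below. It assumes the leaf elements (x, y, r, θ, radius, diameter) hold plain text, and the coordinate values are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE circle [
  <!ELEMENT circle (center, (radius | diameter))>
  <!ELEMENT center ((x, y) | (r, θ))>
  <!-- Assumption: all leaf elements contain only text -->
  <!ELEMENT x (#PCDATA)>
  <!ELEMENT y (#PCDATA)>
  <!ELEMENT r (#PCDATA)>
  <!ELEMENT θ (#PCDATA)>
  <!ELEMENT radius (#PCDATA)>
  <!ELEMENT diameter (#PCDATA)>
]>
<circle>
  <center>
    <r>5</r>
    <θ>0.785</θ>
  </center>
  <diameter>12</diameter>
</circle>
```

This instance takes the polar branch of center and the diameter branch of circle. Mixing branches, such as an x followed by a θ, would be invalid.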
Suppose you don't really care whether the x element comes before
the y element or vice versa, nor do you care whether r comes before
θ. Then you can expand the choice to cover all four possibilities:
<!ELEMENT center ((x, y) | (y, x) | (r, θ) | (θ, r) )>
As the number of elements in the sequence grows, the number of permutations grows more than exponentially. Thus, this technique really isn't practical past two or three child elements. DTDs are not very good at saying you want n instances of A and m instances of B, but you don't really care which order they come in.
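To make the growth concrete, here is a sketch for a hypothetical point element whose x, y, and z children may appear in any order. Three children already require 3! = 6 orderings. XML 1.0 also expects content models to be deterministic, so rather than listing six flat sequences (several of which would begin with the same element), the alternatives are factored by their first child.

```xml
<?xml version="1.0"?>
<!DOCTYPE point [
  <!-- Hypothetical element: x, y, and z exactly once each, in any order -->
  <!ELEMENT point ( (x, ((y, z) | (z, y)))
                  | (y, ((x, z) | (z, x)))
                  | (z, ((x, y) | (y, x))) )>
  <!ELEMENT x (#PCDATA)>
  <!ELEMENT y (#PCDATA)>
  <!ELEMENT z (#PCDATA)>
]>
<point>
  <z>3</z>
  <x>1</x>
  <y>2</y>
</point>
```

A common but looser alternative is (x | y | z)*, which accepts any order but gives up the guarantee that each child appears exactly once.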
Suffixes can be applied to parenthesized elements, too. For instance, let's suppose that a polygon is defined by individual coordinates for each vertex, given in order. For example, this is a right triangle:
<polygon>
  <r>0</r> <θ>0</θ>
  <x>0</x> <y>10</y>
  <x>10</x> <y>0</y>
</polygon>
What we want to say is that a polygon is composed of three or
more pairs of x-y or r-θ coordinates. An x is always followed by a y,
and an r is always followed by a θ. This declaration does that:
<!ELEMENT polygon (((x, y) | (r, θ)), ((x, y) | (r, θ)), ((x, y) | (r, θ))+)>
The plus sign is applied to the third ((x, y) | (r, θ)) group as a
whole, so the first two pairs are required and the third may repeat.
To return to the name example, suppose you want to say that a name can contain just a first name, just a last name, or a first name and a last name with an indefinite number of middle names. This declaration achieves that:
<!ELEMENT name (last_name | (first_name, ((middle_name+, last_name) | (last_name?))))>
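The sketch below walks through the shapes this content model accepts. The names wrapper element is hypothetical, added only so that several name elements fit in one document, and the leaf elements are assumed to hold plain text.

```xml
<?xml version="1.0"?>
<!DOCTYPE names [
  <!-- Hypothetical wrapper element, not part of the original example -->
  <!ELEMENT names (name+)>
  <!ELEMENT name (last_name | (first_name, ((middle_name+, last_name) | (last_name?))))>
  <!ELEMENT first_name (#PCDATA)>
  <!ELEMENT middle_name (#PCDATA)>
  <!ELEMENT last_name (#PCDATA)>
]>
<names>
  <!-- just a last name -->
  <name><last_name>Bush</last_name></name>
  <!-- just a first name -->
  <name><first_name>George</first_name></name>
  <!-- first and last name -->
  <name><first_name>George</first_name><last_name>Bush</last_name></name>
  <!-- first name, one or more middle names, then a required last name -->
  <name>
    <first_name>George</first_name>
    <middle_name>Herbert</middle_name>
    <middle_name>Walker</middle_name>
    <last_name>Bush</last_name>
  </name>
</names>
```

Note that a first name followed by middle names but no last name is rejected: once a middle_name appears, the last_name becomes mandatory.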
In narrative documents, it's common for a single element to contain
both child elements and un-marked up, nonwhitespace character data.
For example, recall this definition
element from Chapter 2:
<definition>A <term>Turing Machine</term> refers to an abstract finite state automaton with infinite memory that can be proven equivalent to any other finite state automaton with arbitrarily large memory. Thus what is true for one Turing machine is true for all Turing machines no matter how implemented.</definition>
The definition element contains some nonwhitespace text and a term
child. This is called mixed content. An element that contains mixed
content is declared like this:
<!ELEMENT definition (#PCDATA | term)*>
This says that a definition
element may contain parsed character data and term
children. It does not specify in
which order they appear, nor how many instances of each appear. This
declaration allows a definition
to have 1 term
child, 0 term
children, or 23 term
children.
You can add any number of other child elements to the list of
mixed content, although #PCDATA must always be the first child in the
list. For example, this declaration says that a paragraph element may
contain any number of name, profession, footnote, emphasize, and date
elements in any order, interspersed with parsed character data:
<!ELEMENT paragraph (#PCDATA | name | profession | footnote | emphasize | date )* >
This is the only way to indicate that an
element contains mixed content. You cannot say, for example, that
there must be exactly one term
child of the definition
element,
as well as parsed character data. You cannot say that the parsed
character data must all come after the term
child. You cannot use parentheses
around a mixed-content declaration to make it part of a larger
grouping. You can only say that the element contains any number of
any elements from a particular list in any order, as well as
undifferentiated parsed character data.
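As a sketch, the document below uses the one legal mixed-content form; the comment inside the DTD lists two tempting forms that are not legal DTD syntax. The term element is assumed to hold plain text, and the definition text is shortened for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE definition [
  <!ELEMENT definition (#PCDATA | term)*>
  <!-- Assumption: term contains only text -->
  <!ELEMENT term (#PCDATA)>
  <!-- Not legal, shown only for contrast:
       <!ELEMENT definition (term, #PCDATA)>
       <!ELEMENT definition ((#PCDATA | term)*, footnote)>
  -->
]>
<definition>A <term>Turing Machine</term> is an abstract machine.
What is true for one <term>Turing Machine</term> is true for all.</definition>
```

The instance contains two term children, and the DTD has no way to restrict that number while still allowing text.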
Some elements do not have any content at all. These are
called empty elements and are sometimes written
with a closing />. For example:
<image source="bus.jpg" width="152" height="345" alt="Alan Turing standing in front of a bus" />
These elements are declared by using the keyword EMPTY
for the content specification. For
example:
<!ELEMENT image EMPTY>
This merely says that the image
element must be empty, not that it
must be written with an empty-element tag. Given this declaration,
this is also a valid image
element:
<image source="bus.jpg" width="152" height="345" alt="Alan Turing standing in front of a bus"></image>
If an element is empty, then it can contain nothing, not even
whitespace. For instance, this is an invalid image
element:
<image source="bus.jpg" width="152" height="345" alt="Alan Turing standing in front of a bus"> </image>
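A sketch of a complete document containing both valid spellings of an empty element follows. Two things here are assumptions that go beyond this section: the gallery wrapper element is hypothetical, and the attribute-list declaration is included only because a valid document must also declare the attributes it uses.

```xml
<?xml version="1.0"?>
<!DOCTYPE gallery [
  <!-- Hypothetical wrapper so two image elements fit in one document -->
  <!ELEMENT gallery (image+)>
  <!ELEMENT image EMPTY>
  <!-- Assumed attribute declarations, needed for validity -->
  <!ATTLIST image source CDATA #REQUIRED
                  alt    CDATA #IMPLIED>
]>
<gallery>
  <!-- empty-element tag -->
  <image source="bus.jpg" alt="Alan Turing standing in front of a bus"/>
  <!-- start-tag immediately followed by end-tag: equally valid -->
  <image source="bus.jpg" alt="Alan Turing standing in front of a bus"></image>
</gallery>
```

The whitespace between the image elements belongs to gallery, which has element content, so it does not affect validity. Whitespace between an image start-tag and its end-tag would.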
Very loose DTDs occasionally want to say that an element exists
without making any assertions about what it may or may not contain.
In this case, you can specify the keyword ANY
as the content specification. For
example, this declaration says that a page
element can contain any content,
including mixed content, child elements, and even other page
elements:
<!ELEMENT page ANY>
The children that actually appear in the page
elements' content in the document
must still be declared in element declarations of their own.
ANY
does not allow you to use
undeclared elements.
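A minimal sketch of this rule: the heading element below is a hypothetical child, and the document is valid only because heading has a declaration of its own.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!ELEMENT page ANY>
  <!-- Hypothetical child element; it must still be declared -->
  <!ELEMENT heading (#PCDATA)>
]>
<page>
  Some loose text, followed by a declared child and a nested page.
  <heading>Element Declarations</heading>
  <page>A nested page</page>
</page>
```

Deleting the heading declaration would make the document invalid, even though page itself is declared as ANY.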
ANY
is sometimes useful
when you're just beginning to design the DTD and document structure
and you don't yet have a clear picture of how everything fits
together. However, it's extremely bad form to use ANY
in finished DTDs. About the only time
you'll see it used is when external DTD subsets and entities may
change in uncontrollable ways. However, this is actually quite rare.
You'd really only need this if you were writing a DTD for an
application like XSLT or RDF that wraps content from arbitrary,
unknown XML applications.