XML documents should (but do not have to) begin with an
XML declaration. The XML declaration looks like a
processing instruction with the name xml and with version,
standalone, and encoding pseudo-attributes. Technically,
it's not a processing instruction, though; it's just the XML
declaration, nothing more, nothing less. Example 2-7 demonstrates.
Example 2-7. A very simple XML document with an XML declaration
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<person>
  Alan Turing
</person>
XML documents do not have to have an XML declaration. However,
if an XML document does have an XML declaration, then that declaration
must be the first thing in the document. It must not be preceded by
any comments, whitespace, processing instructions, and so forth. The
reason is that an XML parser uses the first five characters (<?xml)
to make some reasonable guesses
about the encoding, such as whether the document uses a single-byte or
multibyte character set. The only thing that may precede the XML
declaration is an invisible Unicode byte-order mark. We'll discuss
this further in Chapter 5.
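The placement rule is easy to check with a real parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which is built on the Expat parser. The helper name is_well_formed and the four sample documents are our own, invented for illustration; other parsers should reject the same inputs but may word their errors differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts data as a well-formed document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, assembled from the same two pieces.
DECLARATION = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
BODY = b"<person>Alan Turing</person>"

at_start = DECLARATION + BODY                           # declaration comes first
after_whitespace = b"\n" + DECLARATION + BODY           # a newline precedes it
after_comment = b"<!-- note -->" + DECLARATION + BODY   # a comment precedes it
after_bom = b"\xef\xbb\xbf" + DECLARATION + BODY        # only a UTF-8 byte-order mark precedes it

if __name__ == "__main__":
    samples = [
        ("at start", at_start),
        ("after whitespace", after_whitespace),
        ("after comment", after_comment),
        ("after byte-order mark", after_bom),
    ]
    for label, doc in samples:
        print(label, is_well_formed(doc))
```

Only the first and last samples are accepted: anything other than a byte-order mark in front of the declaration makes the document malformed.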
The version attribute should have the value 1.0. Under very
unusual circumstances, it may also have the value 1.1. Since
specifying version="1.1" limits
the document to the most recent versions of only a couple of
parsers, and since all XML 1.1 parsers must also support XML 1.0,
you don't want to casually set the version to 1.1.
Don't believe us? First answer a couple of questions:
Do you speak Cambodian, Burmese, Amharic, Mongolian, or Divehi?
Does your data contain obsolete, nontext C0 control characters such as vertical tab, form feed, or bell?
If you answered no to both of these questions, you have
absolutely nothing to gain by using XML 1.1. If you answered yes to either one, then you
may have cause to use XML 1.1. XML 1.0 allows Cambodian, Burmese,
Amharic, etc. to be used in character data and attribute values. XML
1.1 also allows these scripts to be used in element and attribute
names, which XML 1.0 does not. XML 1.1 also allows C0 control
characters (except null) to be used in character data and attribute
values (provided they're escaped as numeric character references
like &#x07;), which XML 1.0
does not. If either of these conditions applies to you, then you
might want to use XML 1.1 (although realize you're limiting your
audience by doing so). Otherwise, you really should use XML 1.0
exclusively.
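The second question can be answered mechanically. The sketch below transcribes the character ranges of the Char productions in the XML 1.0 and XML 1.1 recommendations and reports whether a string contains any character that only XML 1.1 permits. The function names are our own, and the check covers character data only; it says nothing about which scripts are allowed in element and attribute names.

```python
def in_xml10_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def in_xml11_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.1 (null is still excluded)."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def requires_xml11(text: str) -> bool:
    """True if text contains a character XML 1.1 allows but XML 1.0 forbids."""
    return any(in_xml11_char(c) and not in_xml10_char(c) for c in text)


if __name__ == "__main__":
    print(requires_xml11("ring the bell\x07"))   # bell is a C0 control
    print(requires_xml11("plain text\twith a tab\n"))
```

A result of True only means XML 1.0 cannot carry the data as is. In an XML 1.1 document the offending characters must still be written as numeric character references.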
So far, we've been a little cavalier about character sets and character encodings. We've said that XML documents are composed of pure text, but we haven't said what encoding that text uses. Is it ASCII? Latin-1? Unicode? Something else?
The short answer to this question is "Yes." The long answer is that, by default, XML documents are assumed to be encoded in the UTF-8 variable-length encoding of the Unicode character set. This is a strict superset of ASCII, so pure ASCII text files are also UTF-8 documents. However, most XML processors, especially those written in Java, can handle a much broader range of character sets. All you have to do is tell the parser which character encoding the document uses. Preferably, this is done through metainformation, stored in the filesystem or provided by the server. However, not all systems provide character-set metadata, so XML also allows documents to specify their own character set with an encoding declaration inside the XML declaration. Example 2-8 shows how you'd indicate that a document was written in the ISO-8859-1 (Latin-1) character set that includes letters like ö and ç needed for many non-English Western European languages.
Example 2-8. An XML document encoded in Latin-1
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<person>
  Erwin Schrödinger
</person>
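To see what the encoding declaration buys you, here is a minimal sketch that stores the document from Example 2-8 as Latin-1 bytes and hands it to Python's standard xml.etree.ElementTree parser, once with the declaration and once without. The helper name try_person_name is our own; the behavior shown is that of the Expat parser bundled with Python.

```python
import xml.etree.ElementTree as ET
from typing import Optional

BODY = "<person>\n  Erwin Schr\u00f6dinger\n</person>"
DECLARATION = '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n'

# Both byte strings use Latin-1, where the letter o-umlaut is the single byte 0xF6.
declared = (DECLARATION + BODY).encode("iso-8859-1")
undeclared = BODY.encode("iso-8859-1")


def try_person_name(data: bytes) -> Optional[str]:
    """Return the trimmed text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(data).text.strip()
    except ET.ParseError:
        return None


if __name__ == "__main__":
    print(try_person_name(declared))
    print(try_person_name(undeclared))
```

With the declaration, the parser decodes the name correctly. Without it, the parser assumes UTF-8, finds that the byte 0xF6 is not legal there, and rejects the document.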
The encoding attribute is optional in an XML declaration. If it is
omitted and no metadata is
available, the Unicode character set is assumed. The parser may use
the first several bytes of the file to try to guess which encoding
of Unicode is in use. If metadata is available and it conflicts with
the encoding declaration, then the encoding specified by the
metadata wins. For example, if an HTTP header says a document is
encoded in ASCII but the encoding declaration says it's encoded in
UTF-8, then the parser will pick ASCII.
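The precedence rule can be sketched with the encoding argument of Python's xml.etree.ElementTree.XMLParser, which overrides whatever encoding the document declares. The function parse_with_metadata and the mislabeled sample document are our own inventions; whether a given HTTP library actually passes the header's charset on to the parser is an assumption you would need to verify for your own toolkit.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# A hypothetical mislabeled document: the bytes are UTF-8,
# but the declaration claims ISO-8859-1.
mislabeled = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<person>Erwin Schr\u00f6dinger</person>"
).encode("utf-8")


def parse_with_metadata(data: bytes, charset: Optional[str] = None) -> ET.Element:
    """Parse data, letting external metadata (if any) override the declaration."""
    parser = ET.XMLParser(encoding=charset)  # None means: trust the document
    parser.feed(data)
    return parser.close()


if __name__ == "__main__":
    # Trusting the declaration decodes the two UTF-8 bytes as two Latin-1 letters.
    print(parse_with_metadata(mislabeled).text)
    # Supplying the charset from metadata recovers the intended name.
    print(parse_with_metadata(mislabeled, "UTF-8").text)
```

Here the metadata happens to be right and the declaration wrong. The rule works the same way in the opposite case, which is why inaccurate server metadata can garble an otherwise correct document.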
The different encodings and the proper handling of non-English XML documents will be discussed in greater detail in Chapter 5.
If the standalone attribute has the value no, then an application
may be required to
read an external DTD (that is, a DTD in a file other than the one it's
reading now) to determine the proper values for parts of the
document. For instance, a DTD may provide default values for
attributes that a parser is required to report, even though they
aren't actually present in the document.
Documents that do not have DTDs, like all the documents in
this chapter, can have the value yes for the standalone
attribute. Documents that do have DTDs can also have the value yes
for the standalone attribute if the DTD doesn't
change the content of the document in any way or if the DTD is
purely internal. Details for documents with DTDs are covered in
Chapter 3.
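As a small preview, the sketch below shows a purely internal DTD supplying a default attribute value in a document that is still marked standalone="yes". The born attribute and its declaration are hypothetical, made up for this illustration, and the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# The ATTLIST declaration lives inside the document itself (an internal DTD
# subset), so no external file is needed to compute the default value.
DOC = b"""<?xml version="1.0" standalone="yes"?>
<!DOCTYPE person [
  <!ATTLIST person born CDATA "1912">
]>
<person>Alan Turing</person>"""

person = ET.fromstring(DOC)

if __name__ == "__main__":
    # The start-tag carries no attributes, yet the parser reports the default.
    print(person.attrib)
```

Had the same ATTLIST declaration been stored in a separate file, the parser would have had to read that file to report the attribute, and the document could no longer honestly claim to be standalone.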
The standalone attribute is optional in an XML declaration. If it
is omitted, then the value no is assumed.
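If you want to see what a parser actually read from the declaration, Python's standard xml.parsers.expat module exposes an XmlDeclHandler callback that receives the three pseudo-attributes. The sketch below is one way to use it; the function name read_declaration and the sample documents are our own.

```python
import xml.parsers.expat


def read_declaration(data: bytes) -> dict:
    """Return the pseudo-attributes of the XML declaration, or {} if there is none."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # Expat reports standalone as 1 for yes, 0 for no, and -1 when omitted.
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return found


if __name__ == "__main__":
    full = b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><person/>'
    minimal = b'<?xml version="1.0"?><person/>'
    print(read_declaration(full))
    print(read_declaration(minimal))
    print(read_declaration(b"<person/>"))
```

Expat keeps an omitted standalone pseudo-attribute distinct from an explicit no by reporting -1. An application following the rule above would treat both cases as no.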