XML (the eXtensible Markup Language) provides an industry-standard method for encoding structured information. It defines syntactic and structural rules that enable software applications to process XML files even when they don’t understand all of the data.
XML specifications are defined and maintained by the World Wide Web Consortium (W3C). The latest version is XML 1.1 (Second Edition). However, XML 1.0 (currently in its fifth edition) is the most popular version, and is supported by all XML parsers. W3C states that:
“You are encouraged to create or generate XML 1.0 documents if you do not need the new features in XML 1.1; XML Parsers are expected to understand both XML 1.0 and XML 1.1” (see http://www.w3.org/xml/core/#publications/).
This chapter will introduce XML 1.0 only, and in fact, will focus on just the most commonly used XML features. We’ll introduce you to the XDocument and XElement classes first, and you’ll learn how to create and manipulate XML documents.
Of course, once you have a large document, you’ll want to be able to find substrings, and we’ll show you two different ways to do that, using LINQ. The .NET Framework also allows you to serialize your objects as XML, and deserialize them at their destination. We’ll cover those methods at the end of the chapter.
XML is a markup language, not unlike HTML, except that it is extensible—that is, applications that use XML can (and do) create new kinds of elements and attributes.
In XML, a document is a hierarchy of elements. An element is typically defined by a pair of tags, called the start and end tags. In the following example, FirstName is an element:
<FirstName>Orlando</FirstName>
A start tag contains the element name surrounded by a pair of angle brackets:
<FirstName>
An end tag is similar, except that the element name is preceded by a forward slash:
</FirstName>
An element may contain content between its start and end tags. In this example, the element contains text, but content can also contain child elements. For example, this Customer element has three child elements:
<Customer>
    <FirstName>Orlando</FirstName>
    <LastName>Gee</LastName>
    <EmailAddress>orlando0@hotmail.com</EmailAddress>
</Customer>
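As a preview of the classes covered later in this chapter, here is a minimal sketch of how the same Customer element could be built in code with XElement from System.Xml.Linq. The class name CustomerElementSketch and the method BuildCustomer are hypothetical names chosen for illustration only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical class name; only XElement comes from the .NET Framework.
public static class CustomerElementSketch
{
    // Builds a Customer element with three child elements,
    // each of which contains only text.
    public static XElement BuildCustomer()
    {
        return new XElement("Customer",
            new XElement("FirstName", "Orlando"),
            new XElement("LastName", "Gee"),
            new XElement("EmailAddress", "orlando0@hotmail.com"));
    }

    public static void Main()
    {
        XElement customer = BuildCustomer();

        // Writing an XElement to the console serializes it as XML text,
        // with a start tag and an end tag for each element.
        Console.WriteLine(customer);

        // Walk the child elements and show each name with its text content.
        foreach (XElement child in customer.Elements())
        {
            Console.WriteLine("{0}: {1}", child.Name, child.Value);
        }
    }
}
```

Nesting the constructor calls mirrors the nesting of the tags, which is why the code has the same shape as the XML it produces.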
The top-level element in an XML document is called its root element. Every document has exactly one root element.
An element does not have to contain content, but every element (except for the root element) has exactly one parent element. Elements with the same parent element are called sibling elements.
In this example, Customers (plural) is the root. The children of the root element, Customers, are the five Customer (singular) elements:
<Customers>
    <Customer>
        <FirstName>Orlando</FirstName>
        <LastName>Gee</LastName>
        <EmailAddress>orlando0@hotmail.com</EmailAddress>
    </Customer>
    <Customer>
        <FirstName>Keith</FirstName>
        <LastName>Harris</LastName>
        <EmailAddress>keith0@hotmail.com</EmailAddress>
    </Customer>
    <Customer>
        <FirstName>Donna</FirstName>
        <LastName>Carreras</LastName>
        <EmailAddress>donna0@hotmail.com</EmailAddress>
    </Customer>
    <Customer>
        <FirstName>Janet</FirstName>
        <LastName>Gates</LastName>
        <EmailAddress>janet1@hotmail.com</EmailAddress>
    </Customer>
    <Customer>
        <FirstName>Lucy</FirstName>
        <LastName>Harrington</LastName>
        <EmailAddress>lucy0@hotmail.com</EmailAddress>
    </Customer>
</Customers>
Each Customer has one parent (Customers) and three children (FirstName, LastName, and EmailAddress). Each of these, in turn, has one parent (Customer) and no child elements, only text content.
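The parent, child, and sibling relationships described above can also be explored in code. The following is a rough sketch, assuming the LINQ to XML classes in System.Xml.Linq; the class name HierarchySketch is hypothetical, and the document is shortened to two customers to keep the listing small.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical class name, used only for this illustration.
public static class HierarchySketch
{
    // A shortened Customers document: one root, two Customer children.
    public const string Xml =
        "<Customers>" +
        "<Customer>" +
        "<FirstName>Orlando</FirstName>" +
        "<LastName>Gee</LastName>" +
        "<EmailAddress>orlando0@hotmail.com</EmailAddress>" +
        "</Customer>" +
        "<Customer>" +
        "<FirstName>Keith</FirstName>" +
        "<LastName>Harris</LastName>" +
        "<EmailAddress>keith0@hotmail.com</EmailAddress>" +
        "</Customer>" +
        "</Customers>";

    public static void Main()
    {
        XDocument doc = XDocument.Parse(Xml);

        // The root element has no parent element.
        XElement root = doc.Root;
        Console.WriteLine("Root: {0}", root.Name);
        Console.WriteLine("Root has a parent element: {0}", root.Parent != null);

        // Each Customer has one parent and three child elements.
        XElement customer = root.Elements("Customer").First();
        Console.WriteLine("Parent of Customer: {0}", customer.Parent.Name);
        Console.WriteLine("Children of Customer: {0}", customer.Elements().Count());

        // FirstName has a parent, no child elements, and two later siblings.
        XElement firstName = customer.Element("FirstName");
        Console.WriteLine("Parent of FirstName: {0}", firstName.Parent.Name);
        Console.WriteLine("FirstName has child elements: {0}", firstName.HasElements);
        foreach (XElement sibling in firstName.ElementsAfterSelf())
        {
            Console.WriteLine("Sibling: {0}", sibling.Name);
        }
    }
}
```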
When an element has no content—no child elements and no text—you can optionally use a more compact representation, where you write just a single tag, with a slash just before the closing angle bracket. For example, this:
<Customers/>
means exactly the same as this:
<Customers></Customers>
This empty element tag syntax is the only syntax in which an element is represented by just a single tag. Unless you are using this form, it is illegal to omit the closing tag.
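To see that the two forms carry the same information, you can parse both and compare what they contain. This is a small sketch, assuming XElement.Parse from System.Xml.Linq; the class EmptyElementSketch and the helper HasNoContent are hypothetical names.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical class name, used only for this illustration.
public static class EmptyElementSketch
{
    // Returns true when the parsed element has no child elements and no text.
    public static bool HasNoContent(string xml)
    {
        XElement element = XElement.Parse(xml);
        return !element.HasElements && element.Value.Length == 0;
    }

    public static void Main()
    {
        // Both spellings describe an element with no content.
        Console.WriteLine(HasNoContent("<Customers/>"));
        Console.WriteLine(HasNoContent("<Customers></Customers>"));

        // An element with text content is reported as non-empty.
        Console.WriteLine(HasNoContent("<Customers>text</Customers>"));

        // An element built in code with no content is written out
        // using the compact empty element tag form.
        Console.WriteLine(new XElement("Customers"));
    }
}
```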
XHTML is a reformulation of HTML that follows the stricter well-formedness rules of XML. The two most important XML rules that make XHTML different from plain HTML follow:
No elements may overlap, though they may nest. So this is legal, because the elements are nested:
<element1> <element2> ... </element2> </element1>
You may not write:
<element1> <element2> ... </element1> </element2>
because in the latter case, element2 overlaps element1 rather than being neatly nested within it. (Browsers generally tolerate this in ordinary HTML, but an XML parser rejects it.)
Every element must be closed, which means that for each opened element, you must have a closing tag (or the element tag must be self-closing). So while plain old HTML permits:
<br>
in XHTML we must either write this:
<br></br>
or use the empty element tag form:
<br />
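Both rules can be checked by handing the markup to an XML parser, which refuses anything that is not well formed. The sketch below assumes XElement.Parse from System.Xml.Linq, which throws an XmlException on malformed input; the class WellFormedSketch and the helper IsWellFormed are hypothetical names, not part of the framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical class name, used only for this illustration.
public static class WellFormedSketch
{
    // Returns true if the text parses as a single well-formed XML element.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XElement.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            // The parser reports overlapping or unclosed elements as errors.
            return false;
        }
    }

    public static void Main()
    {
        // Rule 1: elements may nest, but they may not overlap.
        Console.WriteLine(IsWellFormed("<element1><element2>...</element2></element1>"));
        Console.WriteLine(IsWellFormed("<element1><element2>...</element1></element2>"));

        // Rule 2: every element must be closed.
        Console.WriteLine(IsWellFormed("<p>one<br>two</p>"));
        Console.WriteLine(IsWellFormed("<p>one<br></br>two</p>"));
        Console.WriteLine(IsWellFormed("<p>one<br />two</p>"));
    }
}
```

Catching the exception is acceptable for a quick check like this one; code that needs to report where the problem is could inspect the LineNumber and LinePosition properties of the XmlException instead of discarding it.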