Chapter 12. XML

XML (the eXtensible Markup Language) provides an industry-standard method for encoding structured information. It defines syntactic and structural rules that enable software applications to process XML files even when they don’t understand all of the data.

XML specifications are defined and maintained by the World Wide Web Consortium (W3C). The latest version is XML 1.1 (Second Edition). However, XML 1.0 (currently in its fifth edition) is the most popular version, and is supported by all XML parsers. W3C states that:

You are encouraged to create or generate XML 1.0 documents if you do not need the new features in XML 1.1; XML Parsers are expected to understand both XML 1.0 and XML 1.1 (see http://www.w3.org/xml/core/#publications/).

This chapter will introduce XML 1.0 only, and in fact, will focus on just the most commonly used XML features. We’ll introduce you to the XDocument and XElement classes first, and you’ll learn how to create and manipulate XML documents.

Of course, once you have a large document, you’ll want to be able to find substrings, and we’ll show you two different ways to do that, using LINQ. The .NET Framework also allows you to serialize your objects as XML, and deserialize them at their destination. We’ll cover those methods at the end of the chapter.

XML Basics (A Quick Review)

XML is a markup language, not unlike HTML, except that it is extensible—that is, applications that use XML can (and do) create new kinds of elements and attributes.

Elements

In XML, a document is a hierarchy of elements. An element is typically defined by a pair of tags, called the start and end tags. In the following example, FirstName is an element:

<FirstName>Orlando</FirstName>

A start tag contains the element name surrounded by a pair of angle brackets:

<FirstName>

An end tag is similar, except that the element name is preceded by a forward slash:

</FirstName>

An element may contain content between its start and end tags. In this example, the element contains text, but content can also contain child elements. For example, this Customer element has three child elements:

  <Customer>
    <FirstName>Orlando</FirstName>
    <LastName>Gee</LastName>
    <EmailAddress>orlando0@hotmail.com</EmailAddress>
  </Customer>
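
As a rough sketch of how this nesting maps to code, the following C# builds the same Customer element with the XElement class (from System.Xml.Linq) mentioned at the start of the chapter. The class name ElementDemo and the method BuildCustomer are illustrative names chosen for this example, not part of any library.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ElementDemo
{
    // Hypothetical helper: builds the Customer element shown above.
    // Each nested XElement becomes a child element of Customer.
    public static XElement BuildCustomer()
    {
        return new XElement("Customer",
            new XElement("FirstName", "Orlando"),
            new XElement("LastName", "Gee"),
            new XElement("EmailAddress", "orlando0@hotmail.com"));
    }

    public static void Main()
    {
        XElement customer = BuildCustomer();
        Console.WriteLine(customer);                    // writes the element as indented XML
        Console.WriteLine(customer.Elements().Count()); // 3 child elements
    }
}
```

The shape of the constructor calls mirrors the shape of the XML: content passed to an XElement constructor becomes that element's content.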

The top-level element in an XML document is called its root element. Every document has exactly one root element.

An element does not have to contain content, but every element (except for the root element) has exactly one parent element. Elements with the same parent element are called sibling elements.

In this example, Customers (plural) is the root. The children of the root element, Customers, are the five Customer (singular) elements:

<Customers>
  <Customer>
    <FirstName>Orlando</FirstName>
    <LastName>Gee</LastName>
    <EmailAddress>orlando0@hotmail.com</EmailAddress>
  </Customer>
  <Customer>
    <FirstName>Keith</FirstName>
    <LastName>Harris</LastName>
    <EmailAddress>keith0@hotmail.com</EmailAddress>
  </Customer>
  <Customer>
    <FirstName>Donna</FirstName>
    <LastName>Carreras</LastName>
    <EmailAddress>donna0@hotmail.com</EmailAddress>
  </Customer>
  <Customer>
    <FirstName>Janet</FirstName>
    <LastName>Gates</LastName>
    <EmailAddress>janet1@hotmail.com</EmailAddress>
  </Customer>
  <Customer>
    <FirstName>Lucy</FirstName>
    <LastName>Harrington</LastName>
    <EmailAddress>lucy0@hotmail.com</EmailAddress>
  </Customer>
</Customers>

Each Customer has one parent (Customers) and three children (FirstName, LastName, and EmailAddress). Each of these, in turn, has one parent (Customer) and zero children.
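
The parent, child, and sibling relationships described here can be explored in code. The sketch below assumes a shortened copy of the Customers document (two customers rather than five) held in a string. HierarchyDemo is an illustrative name, and the members used (Root, Parent, Elements, Element, HasElements, ElementsAfterSelf) come from System.Xml.Linq.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HierarchyDemo
{
    // A shortened copy of the Customers document (assumption: two customers only).
    public const string Xml =
        @"<Customers>
            <Customer>
              <FirstName>Orlando</FirstName>
              <LastName>Gee</LastName>
              <EmailAddress>orlando0@hotmail.com</EmailAddress>
            </Customer>
            <Customer>
              <FirstName>Keith</FirstName>
              <LastName>Harris</LastName>
              <EmailAddress>keith0@hotmail.com</EmailAddress>
            </Customer>
          </Customers>";

    public static void Main()
    {
        XDocument doc = XDocument.Parse(Xml);
        XElement root = doc.Root;                            // the single root element
        XElement first = root.Elements("Customer").First();
        XElement firstName = first.Element("FirstName");

        Console.WriteLine(root.Name);                        // Customers
        Console.WriteLine(root.Parent == null);              // True: the root has no parent element
        Console.WriteLine(first.Parent.Name);                // Customers
        Console.WriteLine(first.Elements().Count());         // 3
        Console.WriteLine(firstName.HasElements);            // False: it holds text, not child elements

        // Siblings that follow FirstName under the same parent.
        foreach (XElement sibling in firstName.ElementsAfterSelf())
        {
            Console.WriteLine(sibling.Name);                 // LastName, then EmailAddress
        }
    }
}
```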

When an element has no content—no child elements and no text—you can optionally use a more compact representation, where you write just a single tag, with a slash just before the closing angle bracket. For example, this:

<Customers/>

means exactly the same as this:

<Customers></Customers>

This empty element tag syntax is the only syntax in which an element is represented by just a single tag. Unless you are using this form, it is illegal to omit the closing tag.
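
To see this rule in action, the following sketch parses both forms of the empty element and then tries a start tag with no end tag. It assumes the XElement.Parse method from System.Xml.Linq, which throws an XmlException for text that is not well formed. EmptyElementDemo and IsWellFormed are illustrative names.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class EmptyElementDemo
{
    // Hypothetical helper: true if the text parses as a well-formed XML element.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XElement.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        XElement compact = XElement.Parse("<Customers/>");
        XElement paired = XElement.Parse("<Customers></Customers>");

        // Both forms describe an element with no child elements and no text.
        Console.WriteLine(compact.HasElements || paired.HasElements); // False
        Console.WriteLine(compact.Value == paired.Value);             // True: both are ""

        // A start tag with no matching end tag is rejected by the parser.
        Console.WriteLine(IsWellFormed("<Customers>"));               // False
    }
}
```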

XHTML

XHTML is a reformulation of HTML that follows the stricter well-formedness rules of XML. The two most important XML rules that make XHTML different from plain HTML follow:

No elements may overlap, though they may nest. So this is legal, because the elements are nested:
```
<element1>
   <element2>
      ...
   </element2>
</element1>
```
You may not write:
```
<element1>
   <element2>
      ...
   </element1>
</element2>
```
because in the latter case, element2 overlaps element1 rather than being neatly nested within it. (Browsers typically tolerate this in ordinary HTML.)

Every element must be closed, which means that for each opened element, you must have a closing tag (or the element tag must be self-closing). So while plain old HTML permits:
```
<br>
```
in XHTML we must either write this:
```
<br></br>
```
or use the empty element tag form:
```
<br />
```
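
Both rules can be checked with any XML parser. The sketch below uses XElement.Parse from System.Xml.Linq as a general-purpose well-formedness check; it is not an XHTML validator, and the small HTML-like fragments are made up for illustration. XhtmlRulesDemo and IsWellFormed are illustrative names.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlRulesDemo
{
    // Hypothetical helper: true if the fragment parses as a well-formed XML element.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XElement.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Rule 1: elements may nest but not overlap.
        Console.WriteLine(IsWellFormed("<p><b>text</b></p>"));   // True: b is nested inside p
        Console.WriteLine(IsWellFormed("<p><b>text</p></b>"));   // False: b overlaps p

        // Rule 2: every element must be closed.
        Console.WriteLine(IsWellFormed("<p>line<br></p>"));      // False: br is never closed
        Console.WriteLine(IsWellFormed("<p>line<br></br></p>")); // True: explicit end tag
        Console.WriteLine(IsWellFormed("<p>line<br /></p>"));    // True: empty element tag
    }
}
```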