Chapter 17. XML Schemas

Although document type definitions can enforce basic structural rules on documents, many applications need a more powerful and expressive validation method. The W3C developed the XML Schema Recommendation to address these needs. Schemas can describe complex restrictions on elements and attributes. Multiple schemas can be combined to validate documents that use multiple XML vocabularies. This chapter provides a rapid introduction to key W3C XML Schema concepts and usage, starting with the fundamental structures that are common to all schemas. We begin with a very simple schema and proceed to add more functionality to it until every major feature of XML Schemas has been introduced.

Overview

An XML schema is an XML document containing a formal description of what constitutes a valid XML document. A W3C XML Schema Language schema is an XML schema written in the particular syntax recommended by the W3C.
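To make the definition concrete, here is a minimal sketch of a W3C XML Schema Language schema. The element name fullName is chosen only for illustration; the schema simply states that a valid document consists of one fullName element containing a string.

```xml
<?xml version="1.0"?>
<!-- Hypothetical minimal schema: a single element that holds a string. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="fullName" type="xs:string"/>
</xs:schema>
```

Note that the schema is itself a well-formed XML document, so it can be read and edited with ordinary XML tools.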

Tip

In this chapter, when we use the word "schema" without further qualification, we are referring specifically to a schema written in the W3C XML Schema language. However, there are numerous other XML schema languages, including RELAX NG and Schematron, each with its own strengths and weaknesses.

An XML document described by a schema is called an instance document. If a document satisfies all the constraints specified by the schema, it is considered to be schema-valid. The schema document is associated with an instance document through one of the following methods:

An xsi:schemaLocation attribute on an element contains a list of namespaces used within that element and the URLs of the schemas with which to validate elements and attributes in those namespaces.
An xsi:noNamespaceSchemaLocation attribute contains a URL for the schema used to validate elements that are not in any namespace.
A validating parser may be instructed to validate a given document against an explicitly provided schema, ignoring any hints that might be provided within the document itself.
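The first method in the list above might look like the following sketch. The namespace URI http://example.com/ns/person and the schema URL are placeholders invented for this example; only the xsi namespace URI is fixed by the specification.

```xml
<?xml version="1.0"?>
<!-- The person namespace URI and the schema URL are placeholders. -->
<p:person xmlns:p="http://example.com/ns/person"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://example.com/ns/person
                              http://example.com/schemas/person.xsd">
  <p:fullName>Zoe</p:fullName>
</p:person>
```

The value of xsi:schemaLocation is a whitespace-separated list of pairs: a namespace URI followed by the location of a schema for that namespace. A document whose elements are in no namespace would instead carry an xsi:noNamespaceSchemaLocation attribute whose value is a single schema URL.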

Schemas Versus DTDs

DTDs provide the capability to do basic validation of the following items in XML documents:

Element nesting
Element occurrence constraints
Permitted attributes
Attribute types and default values

However, DTDs do not provide fine control over the format and data types of element and attribute values. Other than the various special attribute types (ID, IDREF, ENTITY, NMTOKEN, and so forth), once an element or attribute has been declared to contain character data, no limits may be placed on the length, type, or format of that content. For narrative documents (such as web pages, book chapters, and newsletters), this level of control is probably good enough.
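The following sketch illustrates the limitation. The person, fullName, and age elements are hypothetical; the point is that a DTD can say only that age contains character data, not that the data must be a number.

```xml
<?xml version="1.0"?>
<!DOCTYPE person [
  <!ELEMENT person (fullName, age)>
  <!ELEMENT fullName (#PCDATA)>
  <!ELEMENT age (#PCDATA)>
]>
<person>
  <fullName>Zoe</fullName>
  <!-- Valid against the DTD, although the content is not a number. -->
  <age>not a number</age>
</person>
```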

But as XML makes inroads into more record-like applications, such as remote procedure calls and object serialization, more precise control over the text content of elements and attributes becomes important. The W3C XML Schema standard includes the following features:

Simple and complex data types
Type derivation and inheritance
Element occurrence constraints
Namespace-aware element and attribute declarations

The most important of these features is the addition of simple data types for parsed character data and attribute values. Schemas can enforce much more specific rules about the contents of elements and attributes than DTDs can. In addition to a wide range of built-in simple types (such as string, integer, decimal, and dateTime), the schema language provides a framework for declaring new data types, deriving new types from old types, and reusing types from other schemas.
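As a sketch of what this looks like in practice, the schema below derives a new simple type from the built-in string type and uses a built-in type directly. The names shortName, fullName, and arrival, and the 40-character limit, are assumptions made for this example.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- A new type derived by restriction: a string of 1 to 40 characters. -->
  <xs:simpleType name="shortName">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
      <xs:maxLength value="40"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- An element that uses the derived type. -->
  <xs:element name="fullName" type="shortName"/>

  <!-- An element that uses a built-in type as is. -->
  <xs:element name="arrival" type="xs:dateTime"/>

</xs:schema>
```

With this schema, an empty fullName element or an arrival element containing "next Tuesday" would be rejected, a check that a DTD cannot express.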

Besides simple data types, schemas can place more explicit restrictions on the number and sequence of child elements that can appear in a given location. This is true even when elements are mixed with character data; a DTD mixed-content model, by contrast, cannot constrain the order or number of its child elements.
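A brief sketch of both ideas follows. The letter vocabulary is hypothetical. The mixed attribute allows character data between the child elements, while the sequence still fixes their order and the occurrence attributes limit how many item elements may appear.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="letter">
    <!-- mixed="true" permits text between the child elements. -->
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="salutation" type="xs:string"/>
        <!-- Between one and three item elements, in this position only. -->
        <xs:element name="item" type="xs:string"
                    minOccurs="1" maxOccurs="3"/>
        <xs:element name="signature" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```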

Warning

There are a few things that DTDs do that XML Schema can't do, such as defining general entities. XML Inclusions (XInclude) may be able to replace some uses of general entities, but DTDs remain extremely convenient for short entities.
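For comparison, this is the kind of short general entity that a DTD handles easily and that has no counterpart in a schema. The entity name co and its replacement text are invented for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE notice [
  <!-- A general entity declared in the internal DTD subset. -->
  <!ENTITY co "Example Shipping Company">
]>
<notice>&co; reports that the ship has docked.</notice>
```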

Namespace Issues

As XML documents are exchanged between different people and organizations around the world, proper use of namespaces becomes critical to prevent misunderstandings. Depending on what type of document is being viewed, a simple element like <fullName>Zoe</fullName> could have widely different meanings. It could be a person's name, a pet's name, or the name of a ship that recently docked. By associating every element with a namespace URI, it is possible to distinguish between two elements with the same local name.
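The sketch below shows two fullName elements that share a local name but belong to different namespaces. Both namespace URIs and the harbor wrapper element are placeholders chosen for this example.

```xml
<?xml version="1.0"?>
<harbor xmlns:crew="http://example.com/ns/crew"
        xmlns:ship="http://example.com/ns/ship">
  <!-- The name of a person. -->
  <crew:fullName>Zoe</crew:fullName>
  <!-- The name of a ship that recently docked. -->
  <ship:fullName>Zoe</ship:fullName>
</harbor>
```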

Because the "Namespaces in XML" recommendation was released after the XML 1.0 recommendation, DTDs do not provide explicit support for namespaces. In a DTD, element and attribute declarations must spell out the exact prefixed name that appears in the document. Schemas, by contrast, validate against the combination of the namespace URI and local name, rather than the prefixed name.
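To see why the prefix does not matter to a schema processor, consider the two fullName elements in this sketch (the namespace URI and the names wrapper are placeholders). One uses a prefix and the other uses a default namespace declaration, yet both have the same namespace URI and local name, so a schema treats them as the same element. A DTD would need a separate declaration for each spelling.

```xml
<?xml version="1.0"?>
<names>
  <!-- Prefixed form. -->
  <p:fullName xmlns:p="http://example.com/ns/person">Zoe</p:fullName>
  <!-- Default namespace form: same namespace URI, same local name. -->
  <fullName xmlns="http://example.com/ns/person">Zoe</fullName>
</names>
```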

XML Schema uses namespaces internally for several purposes. The XML Schema vocabulary is in its own namespace, the vocabulary being defined is in its namespace, and components used within the schema (groups, attribute groups, and types) may also have namespaces. XML Schema processing also uses namespaces within instance documents to include directives to the schema processor. For example, the special attributes used to associate an element with a schema (schemaLocation and noNamespaceSchemaLocation) must be associated with the official XML Schema instance namespace URI (http://www.w3.org/2001/XMLSchema-instance) in order for the schema processor to recognize them as instructions to itself.
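The following sketch shows the namespaces at work inside a single schema document. The target namespace URI http://example.com/ns/person, the prefix p, and the type name nameType are assumptions made for this example; only the xs namespace URI is fixed by the specification.

```xml
<?xml version="1.0"?>
<!-- xs: is bound to the XML Schema vocabulary itself.
     targetNamespace names the vocabulary being defined.
     p: lets the schema refer to its own components. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:p="http://example.com/ns/person"
           targetNamespace="http://example.com/ns/person"
           elementFormDefault="qualified">

  <!-- A named type; it belongs to the target namespace. -->
  <xs:simpleType name="nameType">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- The type is referenced by its namespace-qualified name. -->
  <xs:element name="fullName" type="p:nameType"/>

</xs:schema>
```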