Chapter 13. Creating Extensible Schemas

The X from XML stands for “extensible.” The goal of any schema language is to control and limit this extensibility to help the applications deal with it. Extensibility and schemas pursue two opposite goals. Carelessly written schemas may significantly reduce extensibility, and we need to keep this in mind when we design our own schemas.

Here again, we find the duality between the schema and the instance documents, and we need to distinguish between two different forms of extensibility. The extensibility of the schema is the ability to reuse its components to create other schemas, while the extensibility of the vocabulary is the ability to add to or modify the content models with minimal impact on the applications; it is, in fact, the openness of the schema.

The extensibility of a schema is essentially determined by three factors: its style, that is, the choice of which components (elements and attributes, element and attribute groups, and simple and complex types) have been made global; the use of the final and fixed attributes; and the optional division of these components over different schema documents. We need to have a look at these three factors.

A simple example is often better than a long explanation, so to illustrate the differences between the different schema styles, we will take some examples out of our library and study complex and simple type elements and attributes.

Let’s consider the definition of the book element in the context of our library. We have four different basic ways of defining this element. They all validate the same set of instance elements, but not the same set of instance documents, since exposing an element as global allows its use as a document element. We can use a Russian doll design and define the book element and its type locally within the library element (I have kept references to global elements for the book’s child elements to keep the schema concise, as we will focus on the definition of book for this example):

<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="book" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="isbn"/>
            <xs:element ref="title"/> 
            <xs:element ref="author" minOccurs="0"
              maxOccurs="unbounded"/> 
            <xs:element ref="character" minOccurs="0"
              maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:attribute ref="id"/>
          <xs:attribute ref="available"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

We can also define a global book element and reference it in the content model of our library:

<xs:element name="book">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="isbn"/>
      <xs:element ref="title"/>
      <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/> 
      <xs:element ref="character" minOccurs="0"
        maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute ref="id"/>
    <xs:attribute ref="available"/>
  </xs:complexType>
</xs:element>
             
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="book" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The third classical way is to define a complex type for the content model of our book element (note that I could have called the type book, but I feel bookType is less confusing):

<xs:complexType name="bookType">
  <xs:sequence>
    <xs:element ref="isbn"/>
    <xs:element ref="title"/>
    <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="character" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute ref="id"/>
  <xs:attribute ref="available"/>
</xs:complexType>
             
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="book" type="bookType" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Finally, we can define a group containing our book element:

<xs:group name="bookGroup">
  <xs:sequence>
    <xs:element name="book">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="isbn"/>
          <xs:element ref="title"/> 
          <xs:element ref="author" minOccurs="0"
            maxOccurs="unbounded"/> 
          <xs:element ref="character" minOccurs="0"
            maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute ref="id"/>
        <xs:attribute ref="available"/>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:group>
             
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:group ref="bookGroup" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

These four basic styles can, of course, be combined. The most extreme example is as follows:

<xs:complexType name="bookType">
  <xs:sequence>
    <xs:element ref="isbn"/>
    <xs:element ref="title"/>
    <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="character" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute ref="id"/>
  <xs:attribute ref="available"/>
</xs:complexType>
             
<xs:element name="book" type="bookType"/>
             
<xs:group name="bookGroup">
  <xs:sequence>
    <xs:element ref="book"/>
  </xs:sequence>
</xs:group>
             
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:group ref="bookGroup" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Although this example may seem excessive, we must acknowledge that it is also the most extensible, since it lets you use all the “reuse and derive” methods of our three components (complex type, element, and group). Now that we’ve seen these four basic styles, let’s see how they compare for reusability and derivation.

The Russian doll is obviously the least extensible style: both the definition of the book element and that of its content model are local. They cannot be referenced for reuse in another part of a schema, cannot be used as a document element, and cannot be modified by derivation, through xs:redefine, or through substitution groups. Using a Russian doll style here is thus a more efficient “blocking” feature than any blocking attribute. Changing or reusing the book element or content model requires attaching a totally different schema to the instance document or using an xsi:type attribute in the instance document.

The flat model, which uses global element definitions, gives a basic level of flexibility, since the element can now be reused in any location within any schema, can be used as a document element in an instance document, and can be used as the head of a substitution group. Among these three features, use as the head of a substitution group is the only one that can be blocked (using a block attribute); the element can be used without restriction as a document element in an instance document or be referenced anywhere in a schema. We also need to note that elements cannot be redefined and that, when the element is used with a local complex type definition as in our example, its content model cannot be changed, except through a substitution by means of xsi:type in the instance document.
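
As a minimal sketch of the substitution group feature, assume a hypothetical novel element that is not part of our library vocabulary. Declared as follows, it may appear wherever the global book element is referenced. The second declaration is an alternative form of the head that forbids such substitutions in instance documents; for brevity it is typed with the bookType complex type shown earlier rather than with a local type:

<!-- "novel" is a hypothetical element; having no type of its own,
     it takes the type of the head of its substitution group -->
<xs:element name="novel" substitutionGroup="book"/>

<!-- Alternative declaration of the head: members of the
     substitution group may no longer replace book in instances -->
<xs:element name="book" type="bookType" block="substitution"/>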

The definition of a global complex type to describe the content model of the book element opens two different doors. The content model of the book element can now be reused to derive extended or restricted content models that may be used elsewhere, and the complex type can be redefined through xs:redefine. As seen in the previous chapter, the derivation can be blocked through the final attribute, but the redefinition cannot be controlled.
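
A derivation by extension from the global type might look like the following sketch. The names annotatedBookType and note are hypothetical and serve only as an illustration:

<!-- Hypothetical type: bookType plus an optional trailing note -->
<xs:complexType name="annotatedBookType">
  <xs:complexContent>
    <xs:extension base="bookType">
      <xs:sequence>
        <xs:element name="note" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

Adding final="extension" (or final="#all") to the definition of bookType would make a schema processor reject this derivation.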

Last but not least, embedding the definition of the book element in a group allows the group to be reused elsewhere (for example, in our flat model) but can hide the definition of the book element, if needed, to avoid its usage as a document element in instance documents. (Incidentally, it also blocks its usage as the head of a substitution group.) Defining a group also opens the possibility of redefining it through xs:redefine to change the number of occurrences of the element, to add new elements, or even to change its content model if a global complex type has been used. Using an element group this way is very similar to the approach of RELAX NG and gives a bit of its flexibility. We need to note, though, that element groups cannot be recursive; this limits their use for defining recursive content models, since a global element still needs to be defined for use in a reference. This can be a problem when we can’t, or don’t want to, use a global element, for instance, when we have two different recursive content models using the same element name with different contents.
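
The redefinition of a group can be sketched as follows, assuming that the schema containing bookGroup is stored in a document named library.xsd and that we want to allow a hypothetical review element after each book (both names are assumptions made for this illustration). Within xs:redefine, the group refers to itself to stand for its original content:

<xs:redefine schemaLocation="library.xsd">
  <xs:group name="bookGroup">
    <xs:sequence>
      <!-- the original content of bookGroup -->
      <xs:group ref="bookGroup"/>
      <!-- hypothetical addition -->
      <xs:element name="review" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:group>
</xs:redefine>

Every content model that references bookGroup, such as the one of our library element, then accepts the new element without being edited itself.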

Which approach is appropriate? There is no single definite answer to this question, but we know that each of these styles has a different set of extensibility features. The choice between them or a combination of them has a major impact on the reusability and derivability of the definitions present in a schema. Table 13-1 may help with visualizing the differences between these styles, but keep in mind that combinations of all of them are allowed!

Simple type elements behave much like complex type elements, except that the complex type definitions are, of course, replaced by simple type definitions similar to those for attributes, discussed in the next section.

As seen in Chapter 10, attributes behave differently from elements in that most of the time they are unqualified. Since globally defined attributes are always qualified when the schema has a target namespace, this means that such attributes cannot be globally defined there. Otherwise, we have a similar situation with attributes, simple types, and attribute groups as we had with elements and complex types (the other exception is that there is no equivalent in attribute land to substitution groups or xsi:type). If we take the definition of a lang attribute restricted to en or fr in the title element, we can have a Russian doll design in which the attribute and its type are locally defined:

<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:token">
        <xs:attribute name="lang">
          <xs:simpleType>
            <xs:restriction base="xs:language">
              <xs:enumeration value="en"/>
              <xs:enumeration value="fr"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:attribute>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

We can also take a flat design in which the attribute is globally defined:

<xs:attribute name="lang">
  <xs:simpleType>
    <xs:restriction base="xs:language">
      <xs:enumeration value="en"/>
      <xs:enumeration value="fr"/>
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>
             
<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:token">
        <xs:attribute ref="lang"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

A global simple type can also be defined:

<xs:simpleType name="langType">
  <xs:restriction base="xs:language">
    <xs:enumeration value="en"/>
    <xs:enumeration value="fr"/>
  </xs:restriction>
</xs:simpleType>
             
<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:token">
        <xs:attribute name="lang" type="langType"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

The attribute may be “hidden” in an attribute group:

<xs:attributeGroup name="langGroup">
  <xs:attribute name="lang">
    <xs:simpleType>
      <xs:restriction base="xs:language">
        <xs:enumeration value="en"/>
        <xs:enumeration value="fr"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
</xs:attributeGroup>
             
<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:token">
        <xs:attributeGroup ref="langGroup"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

All of this can be used together:

<xs:simpleType name="langType">
  <xs:restriction base="xs:language">
    <xs:enumeration value="en"/>
    <xs:enumeration value="fr"/>
  </xs:restriction>
</xs:simpleType>
             
<xs:attribute name="lang" type="langType"/>
             
<xs:attributeGroup name="langGroup">
  <xs:attribute ref="lang"/>
</xs:attributeGroup>
             
<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:token">
        <xs:attributeGroup ref="langGroup"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

The impact of these design decisions is pretty much the same as what we’ve seen for complex type elements, except, of course, for substitution groups and usability as a document element. Table 13-2 explains the options these varying approaches provide.

The final and fixed attributes were already covered in Chapter 12. They have an obvious impact on the reusability of simple and complex type definitions, since they can block some or all further derivations. This category of features affects the flexibility of the schema itself and has no impact on the set of valid instance documents. Their friends block and abstract, by contrast, are features that affect the openness of the schema, that is, what instance documents are allowed to do.
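
As a small illustration of final and fixed, the following sketch closes the langType simple type to any further derivation and freezes a facet of a hypothetical titleType (the name titleType and the value 255 are assumptions chosen for the example):

<!-- No restriction, list, or union may be derived from langType -->
<xs:simpleType name="langType" final="#all">
  <xs:restriction base="xs:language">
    <xs:enumeration value="en"/>
    <xs:enumeration value="fr"/>
  </xs:restriction>
</xs:simpleType>

<!-- Types derived from titleType cannot change maxLength -->
<xs:simpleType name="titleType">
  <xs:restriction base="xs:token">
    <xs:maxLength value="255" fixed="true"/>
  </xs:restriction>
</xs:simpleType>

Neither declaration changes which title or lang values are valid today; they only constrain what other schemas may build on top of these types.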

The last factor that acts on the flexibility and reusability of our schema (and schema libraries) is the split of the components among different documents. Some schema designers have gone as far as possible in this direction and advise placing each class or component in its own schema document, then using includes and imports to assemble the components needed for a full schema. This may seem excessive, but it provides a very fine granularity and allows a workaround for the limitations of xs:redefine. (If a component needs to be redefined, just leave out the old definition and write a new one.)

The biggest issue with such a design is probably the management of a number of documents that can grow rapidly, together with the many dependencies between them. These dependencies must be considered when designing libraries of schemas, since they can be tough to track: the links between the included and including documents are multidirectional, and a component within an included schema can reference components defined in any other schema processed by the schema processor.

We need to reexamine how a schema processor will build a global schema using all the import, include, and redefine instructions it will find. The schema processor initially builds a big consolidated schema with all the components defined in all the schema documents it has processed. It then resolves the references between components after building this consolidated schema. Although this simple and powerful mechanism applies to inclusions without restriction, we will see that things can get nastier with imports and redefinitions. Let’s start with the simplest case, the processing of xs:include.

The semantics of xs:include are slightly different from those of the include statements used in languages such as C; it should be considered a conditional include. An xs:include is actually a request to read a schema if it has not already been read, to add all the component declarations found in this schema to the consolidated schema if they have not already been defined, to ignore the components found in the new schema that are already defined in the global schema if they are identical, and to raise an error if they are different. This means it is perfectly legitimate to create loops (schema A includes schema B, which includes schema A) and multiple inclusions (schema A includes schema B and schema C, which itself includes schema B), and we can create inclusion paths as complex as we wish.
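
A multiple inclusion of this kind can be sketched with three small schema documents. The file names a.xsd, b.xsd, and c.xsd are hypothetical, and the content models are reduced to the bare minimum:

<!-- a.xsd: includes b.xsd directly, and again through c.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="b.xsd"/>
  <xs:include schemaLocation="c.xsd"/>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<!-- b.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="xs:token"/>
</xs:schema>

<!-- c.xsd: includes b.xsd too -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="b.xsd"/>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

The title element reaches the consolidated schema by two paths but is added only once, so no conflict arises.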

The meaning of xs:redefine is similar, except that some components can be redefined. When used, this difference is enough to break the possibility of creating loops in which a schema A redefines components of a schema B, which redefines or includes schema A. This restriction actually means that while we can speak of inclusion graphs, the redefinitions would instead form a tree. The process of including or redefining is recursive, however, and when we include (or redefine) a schema, we include the consolidated schema resulting from the included document rather than the document itself. We can still create inclusion loops within the branches of the redefinition tree (schema A can redefine schema B, which includes schema C, which includes schema B).

Some designers rely on the fact that when a schema without a target namespace is included (or redefined) in a schema with a target namespace, the included schema “borrows” the target namespace of its “includer.” This feature, already mentioned in Chapter 10, can be used to build “neutral” components with no namespaces that can be included and used as building blocks. Since these components take the namespace of the including schema like a chameleon takes the color of its environment, these schemas are called “chameleon schemas.” Although this technique is simple and may be convenient in some cases, it can be confusing to define similar components (and, therefore, similar types and content models) in different namespaces instead of creating a common namespace for them, which would immediately identify these types and content models as identical.
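
A chameleon inclusion might look like the following sketch, in which the file names and the namespace URI http://example.org/ns/library are assumptions made for the illustration:

<!-- common.xsd: no target namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="langType">
    <xs:restriction base="xs:language">
      <xs:enumeration value="en"/>
      <xs:enumeration value="fr"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

<!-- library.xsd: langType is referenced in the includer's namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/ns/library"
  xmlns:lib="http://example.org/ns/library">
  <xs:include schemaLocation="common.xsd"/>
  <xs:attribute name="lang" type="lib:langType"/>
</xs:schema>

Included from a second schema with another target namespace, the same common.xsd would yield a second, distinct langType.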

xs:import behaves somewhat like xs:include: no redefinitions occur, which means that loops can be created where schema A (for namespace A) imports schema B (for namespace B), which itself imports schema A. It is important to note that xs:import serves two different purposes: it is an instruction to import a schema and a declaration that components from a namespace can be referenced. If schema A for namespace A imports schema B for namespace B, and if schema B needs to reference components from the namespace A, an xs:import statement must be included in schema B to declare that namespace A can be used (the schemaLocation attribute is optional and can be omitted in such cases).
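
Such a pair of mutually importing schemas can be sketched as follows. The file names, the namespace URIs, and the components codeType, root, and item are all hypothetical:

<!-- a.xsd, for namespace A -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/ns/a"
  xmlns:b="http://example.org/ns/b">
  <xs:import namespace="http://example.org/ns/b"
    schemaLocation="b.xsd"/>
  <xs:simpleType name="codeType">
    <xs:restriction base="xs:token"/>
  </xs:simpleType>
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="b:item" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<!-- b.xsd, for namespace B: the import only declares that
     namespace A may be referenced, so schemaLocation is omitted -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/ns/b"
  xmlns:a="http://example.org/ns/a">
  <xs:import namespace="http://example.org/ns/a"/>
  <xs:element name="item" type="a:codeType"/>
</xs:schema>

This arrangement assumes that validation starts from a.xsd; a processor handed b.xsd alone would have no hint about where to find the components of namespace A.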

After working through the three mechanisms (include, redefine, and import), we can mix all of them together and note that chameleon schemas can be used together with imports. In this case, the same chameleon can contribute several times to a global schema under different namespaces. If schema A for namespace A includes schema B with no namespace, and imports schema C for namespace C, which also includes schema B, the two inclusions of schema B belong to different namespaces and are considered different.

We now have all the elements to find innovative ways to mix inclusion and import graphs with redefinition trees. Keep in mind that simple is beautiful, and if we don’t restrict ourselves, we humans might get lost well before our favorite schema processor!