Appendix B. Work in Progress

Although W3C XML Schema was approved as a W3C Recommendation in May 2001, it is still just getting started. This appendix identifies a short list of schema-related projects that seem promising, useful, or simply on the way.

W3C XML Schema is considered by the W3C to belong to the very foundation of XML, together with XML 1.0 and Namespaces in XML, and it affects virtually all the other XML specifications. The most heavily affected seem to be the XPath/XSLT/XQuery triumvirate (and, as a consequence, XPointer and XPointer-based specifications), DOM, and RDF.

One of the most amazing things about XPath and XSLT 1.0 is that queries and transformations can be executed by applications with no prior knowledge of the structure of the documents on which they work. This is a major difference from previous information systems, such as RDBMS, in which the layout of the tables needs to be defined before any query can be run. Even though this works just fine in many circumstances, there are two main areas in which improvements can be obtained if the structure of the instance documents is known.
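As a minimal sketch of this point, the same path expression can be run against two documents whose layouts were never declared anywhere. The example uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath; the two sample documents and the authors helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two documents with different, undeclared structures (illustrative samples).
doc_a = ET.fromstring(
    "<library><book><author>Charles M Schulz</author></book></library>"
)
doc_b = ET.fromstring(
    "<catalog><shelf><item><author>Charles M Schulz</author></item></shelf></catalog>"
)


def authors(root):
    """Return the text of every author element, wherever it sits in the tree."""
    return [node.text for node in root.findall(".//author")]


print(authors(doc_a))
print(authors(doc_b))
```

No schema or table layout had to be defined before the query was run; the price is that the processor learns nothing about the documents until it walks them.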

The first of these areas is optimization. This is not crucial for small documents, but as soon as the size of the documents grows (which is typically the case in an XML database), any optimizer will need food for thought to do its job, and the most basic information it requires is the structure of the documents. The second area is type-aware comparisons and sorts. In XPath and XSLT 1.0, the sort order (numerical or string) is indicated in the XSLT style sheet itself, and values are compared without any knowledge of richer types such as dates. Sorting or comparing dates with different time zones is practically impossible in these conditions, and some type information coming out of a schema can help a lot.
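The difference between untyped and type-aware ordering can be sketched in a few lines of Python. This is only an illustration of the problem, not of how an XSLT or XQuery processor works internally; the sample values are invented, and datetime.fromisoformat stands in for the type information a schema would supply.

```python
from datetime import datetime

# Two timestamps in different time zones (illustrative values).
stamps = [
    "2001-05-02T10:00:00+09:00",  # 01:00 UTC
    "2001-05-02T03:00:00+00:00",  # 03:00 UTC
]

# Untyped: character-by-character ordering puts "T03" before "T10".
by_string = sorted(stamps)

# Type-aware: parse each value as a date-time and order by the actual instant.
by_instant = sorted(stamps, key=datetime.fromisoformat)

# The same issue for numbers: the sort order must be known from somewhere.
numbers = ["9", "10", "100"]
numbers_as_text = sorted(numbers)
numbers_as_numbers = sorted(numbers, key=int)

print(by_string)
print(by_instant)
print(numbers_as_text)
print(numbers_as_numbers)
```

The two orderings of the timestamps disagree, and only the type-aware one reflects which instant actually comes first.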

For these reasons, XSLT 2.0 (which will use XPath 2.0 just as XSLT 1.0 uses XPath 1.0) and XQuery 1.0 (which is a superset of XPath 2.0) both rely on W3C XML Schema and use the information coming out of the PSVI.

This will indirectly impact XPointer, a specification that relies on XPath and defines how fragments of XML documents can be addressed. It will also affect specifications using XPointer, such as XLink (definition of links between document fragments) and XInclude (inclusion of XML fragments). The case of XInclude is a good illustration of the need to define an overall processing model: XInclude relies on XPointer, XPointer relies on XPath, and XPath relies on W3C XML Schema. This means that an XInclude processor will need the PSVI of the document containing the fragment to include, but this document may be only a container for the fragments and may be invalid or even have no W3C XML Schema at all. Conversely, from the schema viewpoint, the schema should be applied after the inclusion, when the document is complete. Do we need to apply the schema processing before the inclusion, after it, or both? This question is open.
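The ordering problem can be made concrete with a small Python sketch built on the standard xml.etree.ElementInclude module. Everything specific here is an assumption made for illustration: the in-memory FRAGMENTS table and loader replace real file access, and is_valid is a toy check standing in for genuine W3C XML Schema validation.

```python
import copy
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical fragment store, used instead of reading files from disk.
FRAGMENTS = {"peanuts.xml": '<book isbn="0836217462"/>'}


def loader(href, parse, encoding=None):
    """In-memory loader for ElementInclude (XML fragments only)."""
    if parse != "xml":
        raise ValueError("this sketch only handles parse='xml'")
    return ET.fromstring(FRAGMENTS[href])


def is_valid(root):
    """Toy stand-in for schema validation: a library holds only book elements."""
    return root.tag == "library" and all(child.tag == "book" for child in root)


CONTAINER = """<library xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="peanuts.xml"/>
</library>"""

container = ET.fromstring(CONTAINER)

# Validating before inclusion: the xi:include element is not a book.
valid_before = is_valid(container)

# Validating after inclusion: the fragment has replaced xi:include.
expanded = copy.deepcopy(container)
ElementInclude.include(expanded, loader=loader)
valid_after = is_valid(expanded)

print(valid_before, valid_after)
```

With this toy rule the container fails before inclusion and passes after it, which is exactly why the order of the two processing steps has to be defined somewhere.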

RDF (Resource Description Framework) can be seen as a way to express graphs in XML by splitting these graphs into elementary pieces of information named “statements” or “triples.” Each triple is a logical assertion associating a subject, a verb (or predicate), and an object, as in the phrase “The book 0836217462 (subject) has been written by (verb) Charles M Schulz (object).” RDF has its own schema language (RDF Schema) to model and constrain the relations themselves and define the inheritance between them. Since RDF is defined as a layer on top of XML, an XML schema language does not act at the right level to model a set of RDF triples. However, RDF recognizes two kinds of objects, resources identified by a URI and literals (i.e., raw values), and needs a simple datatype system to define constraints on those literals.
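The triple model described above can be sketched as a small Python data structure. This is not an RDF library, only an illustration of the subject/verb/object shape and of the distinction between resources and literals; the subject and predicate URIs below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """Something identified by a URI."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A raw value; this is where a simple datatype would be attached."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: Resource
    predicate: Resource
    obj: Union[Resource, Literal]


# "The book 0836217462 has been written by Charles M Schulz."
statement = Triple(
    subject=Resource("urn:isbn:0836217462"),                   # assumed URI form
    predicate=Resource("http://example.org/terms#writtenBy"),  # hypothetical
    obj=Literal("Charles M Schulz"),
)


def literal_values(triples):
    """Collect the raw values of all triples whose object is a literal."""
    return [t.obj.value for t in triples if isinstance(t.obj, Literal)]


print(literal_values([statement]))
```

Only the Literal objects are candidates for simple datatype constraints; subjects and predicates are always identified by URI.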

Although the idea of associating a W3C XML Schema simple datatype with RDF literals looks simple, it raises several issues. One is the lack of a way to identify W3C XML Schema simple datatypes that would be acceptable for RDF. As we’ve seen, RDF identifies any resource by a URI. To be coherent with the RDF data model, the simple types associated with the literals should be identified by URIs. W3C XML Schema, on the other hand, does not use URIs to identify its datatypes but rather uses qualified names (QNames). Furthermore, it decided that elements, attributes, types (simple and complex types share one set of names), element groups, and attribute groups have independent sets of QNames. The QName bib:book can thus refer to an element, an attribute, a type, an element group, and an attribute group of the same schema. The simple approach of identifying the simple type book through its expanded QName (replacing the prefix by the namespace URI) isn’t yet implemented.
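A short Python sketch shows both the expansion of a QName and why the expanded name alone does not pin down a single schema component. The namespace URI bound to the bib prefix and the COMPONENTS table are hypothetical; they model a schema in which the name book is used for several kinds of components at once.

```python
# Hypothetical prefix binding; the real URI would come from the schema document.
NAMESPACES = {"bib": "http://example.org/ns/bib"}


def expand(qname, namespaces):
    """Replace the prefix of a QName by its namespace URI."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        raise ValueError("this sketch expects a prefixed QName")
    return (namespaces[prefix], local)


BOOK = expand("bib:book", NAMESPACES)

# Each kind of component has its own set of names, so the same expanded
# QName can legitimately appear once per kind (illustrative schema).
COMPONENTS = {
    ("element", BOOK): "element declaration",
    ("attribute", BOOK): "attribute declaration",
    ("type", BOOK): "simple type definition",
}


def kinds_named(qname):
    """List every kind of component that carries the given QName."""
    name = expand(qname, NAMESPACES)
    return sorted(kind for (kind, n) in COMPONENTS if n == name)


print(BOOK)
print(kinds_named("bib:book"))
```

Any URI scheme for datatypes would therefore have to encode the kind of component as well as the expanded name.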