A Short History of XML Schema Languages

The list of schema languages is long and needs to include languages developed for SGML (the language used before XML was born) to be complete. The list that I propose is far from exhaustive, and includes only the major proposals that have influenced the schema languages I see as the most promising.

DTDs were mandatory for any SGML application, and a simplified version of the SGML DTD was introduced in the XML 1.0 Recommendation. Even though a DTD is not required for an application to read and understand an XML document, many developers highly recommend writing DTDs for any XML application.

The W3C XML Schema Working Group received many proposals that were contributed as notes:

The RELAX NG family is a more traditional marriage between grammar-based XML schema languages that have chosen to unite their strengths:

  • First published in March 2000 as a Japanese Industrial Standard (JIS) Technical Report written by Murata Makoto, Regular Language description for XML Core (RELAX; see http://www.xml.gr.jp/relax) is both simple (“Tired of complicated specifications? You just RELAX!”) and built on a solid mathematical foundation (the adaptation of hedge automata theory to XML trees). It was approved as an ISO/IEC Technical Report in May 2001.

  • XDuce (http://xduce.sourceforge.net) was first announced in March 2000. “XDuce (‘transduce’) is a typed programming language that is specifically designed for processing XML data. One can read an XML document as an XDuce value, extract information from it or convert it to another format, and write out the result value as an XML document.” Although it is not meant to be a schema language, its type system has influenced schema languages.

  • Published by James Clark in January 2001, TREX (Tree Regular Expressions for XML; see http://thaiopensource.com/trex) is “basically the type system of XDuce with an XML syntax and with a bunch of additional features.” The names and content models of the elements used to define the tree patterns of a TREX schema have been carefully chosen, and TREX schemas are usually as easy to read as a plain text description. The simplicity of the language's structure also restores a consistent treatment of elements and attributes, a feature lost since DCD.

  • Announced in May 2001, RELAX NG (RELAX New Generation) is a merger of RELAX and TREX, developed by an OASIS TC (http://www.oasis-open.org/committees/relax-ng) and coedited by James Clark and Murata Makoto. “The key features of RELAX NG are that it is simple, easy to learn, uses XML syntax, does not change the information set of an XML document, supports XML namespaces, treats attributes uniformly with elements so far as possible, has unrestricted support for unordered content, has unrestricted support for mixed content, has a solid theoretical basis, and can partner with a separate datatyping language (such as W3C XML Schema Datatypes).” RELAX NG is now an official specification of the OASIS RELAX NG Technical Committee and will probably progress to become an ISO/IEC International Standard as part of DSDL.

Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html), which was first proposed in September 1999 by Rick Jelliffe of the Academia Sinica Computing Centre, is an unusual schema language. It defines validation rules using XPath expressions. Schematron is also described in the ISO DSDL project.
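To give a feel for what rule-based validation with path expressions looks like, here is a minimal sketch in Python. It is not Schematron and does not use Schematron syntax: the RULES table, the validate function, and the library/book vocabulary are all hypothetical names chosen for illustration, and the standard xml.etree.ElementTree module supports only a small subset of XPath. The idea it tries to convey is simply that each rule selects context nodes with one path and then asserts that a second path, evaluated relative to each context node, selects something.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules, loosely modeled on the idea of a context plus an assertion:
# (context path, test path relative to each context node, failure message)
RULES = [
    (".//book", "title", "A book must have a title."),
    (".//book", "author", "A book must have at least one author."),
]


def validate(xml_text):
    """Return the failure message of every assertion that does not hold."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, test, message in RULES:
        for node in root.findall(context):
            # The assertion holds when the test path selects at least one node.
            if node.find(test) is None:
                failures.append(message)
    return failures


if __name__ == "__main__":
    sample = "<library><book><title>Being a Dog</title></book></library>"
    for message in validate(sample):
        print(message)
```

Unlike a grammar-based schema, nothing here describes the document as a whole: anything the rules do not mention is accepted, which is one reason this style of validation is often used alongside a grammar rather than instead of one.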

Starting from the observations that instance documents are usually much easier to understand than the schemas that describe them, and that schema languages often need to give examples of instance documents to help human readers to understand their syntax, I proposed Examplotron (http://examplotron.org) in March 2001, to define “schemas by example” using sample instance documents as actual schemas.