XML is first and foremost a document format. It was always intended for web pages, books, scholarly articles, poems, short stories, reference manuals, tutorials, textbooks, legal pleadings, contracts, instruction sheets, and other documents that human beings would read. Its use as a syntax for computer data in applications such as order processing, object serialization, database exchange and backup, and electronic data interchange is mostly a happy accident.
Most computer programmers are better trained in working with the rigid structures one encounters in record-like applications than in the more free-form environment of an article or story. Most writers are more accustomed to the more free-form format of a book, story, or article. XML is perhaps unique in addressing the needs of both communities equally well. This chapter describes by both elucidation and example the structures encountered in narrative documents that are meant to be read by people instead of computers. Subsequent chapters will look at web pages in particular, then address technologies—such as XSLT, XLinks, and stylesheets—that are primarily intended for use with documents that will be read by human beings. Once we've done that, we'll look at XML as a format for more or less transitory data meant to be read by computers, rather than semipermanent documents intended for human consumption.
XML is a simplified form of the Standardized General Markup Language (SGML). The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by many people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves in much the same way as XML solves them. It was and is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.
SGML's biggest success was HTML, which was and is an SGML application. However, HTML is just one SGML application. It does not have anything close to the full power of SGML itself. SGML has also been used to define many other document formats, including DocBook and TEI, both of which we'll discuss shortly.
However, SGML is complicated—very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implement or rely on different subsets of SGML are often incompatible. The special feature that one program considers essential is all too often considered extraneous fluff and omitted by the next program. Nonetheless, experience with SGML taught developers a lot about the proper design, implementation, and use of markup languages for a wide variety of documents. Much of that general knowledge applies equally well to XML.
One thing all this should make clear is that XML documents aren't just used on the Web. XML can easily handle the needs of publishing in a variety of media, including books, magazines, journals, newspapers, and pamphlets. XML is particularly useful when you need to publish the same information in several of these formats. By applying different stylesheets to the same source document, you can produce web pages, speaker's notes, camera-ready copy for printing, and more.