The Text Encoding Initiative (TEI, http://www.tei-c.org/) is an XML (originally SGML) application designed for the markup of classic literature, such as Vergil's Aeneid or the collected works of Thomas Jefferson. It's a prime example of a narrative-oriented DTD. Since TEI is designed for scholarly analysis of text rather than more casual reading or publishing, it includes elements not only for common document structures (chapter, scene, stanza, etc.) but also for typographical elements, grammatical structure, the position of illustrations on the page, and so forth. These aren't important to most readers, but they are important to TEI's intended audience of humanities scholars. For many academic purposes, one manuscript of the Aeneid is not necessarily the same as the next. Transcription errors and emendations made by various monks in the Middle Ages can be crucial.
Example 6-1 shows a fairly simple TEI document that uses the "Lite" version of TEI, a subset of full TEI that includes only the most commonly needed tags. The content comes from the book you're reading now. Although a complete TEI-encoded copy of this manuscript would be much longer, this simple example demonstrates the basic features of most TEI documents that represent books. (In addition to prose, TEI can also be used for plays, poems, missals, and essentially any written form of literature.)
Example 6-1. A TEI document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE TEI.2 SYSTEM "xteilite.dtd">
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
      <publicationStmt><p></p></publicationStmt>
      <sourceDesc><p>Early manuscript draft</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text id="HarXMLi">
    <front>
      <div type='toc'>
        <head>Table Of Contents</head>
        <list>
          <item>Introducing XML</item>
          <item>XML as a Document Format</item>
          <item>XML on the Web</item>
        </list>
      </div>
    </front>
    <body>
      <div1 type="chapter">
        <head>Introducing XML</head>
        <p></p>
      </div1>
      <div1 type="chapter">
        <head>XML as a Document Format</head>
        <p>
          XML is first and foremost a document format. It was always
          intended for web pages, books, scholarly articles, poems,
          short stories, reference manuals, tutorials, texts, legal
          pleadings, contracts, instruction sheets, and other documents
          that human beings would read. Its use as a syntax for computer
          data in applications like syndication, order processing,
          object serialization, database exchange and backup, electronic
          data interchange, and so forth is mostly a happy accident.
        </p>
        <div2 type="section">
          <head>SGML's Legacy</head>
          <p></p>
        </div2>
        <div2 type="section">
          <head>TEI</head>
          <p></p>
        </div2>
        <div2 type="section">
          <head>DocBook</head>
          <p>
            DocBook (<hi>http://www.docbook.org/</hi>) is an SGML
            application designed for new documents, not old ones. It's
            especially common in computer documentation. Several
            O'Reilly books have been written in DocBook including
            <bibl><author>Norm Walsh</author>'s <title>DocBook: The
            Definitive Guide</title></bibl>. Much of the <abbr
            expan='Linux Documentation Project'>LDP</abbr>
            (<hi>http://www.linuxdoc.org/</hi>) corpus is written in
            DocBook.
          </p>
        </div2>
      </div1>
      <div1 type="chapter">
        <head>XML on the Web</head>
        <p></p>
      </div1>
    </body>
    <back>
      <div1 type="index">
        <list>
          <head>INDEX</head>
          <item>SGML, 8, 89</item>
          <item>DocBook, 95-98</item>
          <item>TEI (Text Encoding Initiative), 92-95</item>
          <item>Text Encoding Initiative, See TEI</item>
        </list>
      </div1>
    </back>
  </text>
</TEI.2>
The root element of this and all TEI documents is TEI.2. This root element is always divided into two parts: a header represented by a teiHeader element and the main content of the document represented by a text element. The header contains information about the source document (for instance, exactly which medieval manuscript the text was copied from), the encoding of the document, some keywords describing the document, and so forth.
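As a rough sketch of what a fuller header might hold, the fragment below adds an encoding description and keywords to the minimal header used in Example 6-1. The element names are those of the TEI header; the text content, the keyword items, and the scheme value are placeholders invented for illustration, and whether the fragment validates exactly as written depends on the DTD in use.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>XML in a Nutshell</title>
    </titleStmt>
    <publicationStmt><p>Unpublished draft (placeholder text)</p></publicationStmt>
    <!-- Which source the text was transcribed from -->
    <sourceDesc><p>Early manuscript draft</p></sourceDesc>
  </fileDesc>
  <!-- How the source was encoded; the wording here is hypothetical -->
  <encodingDesc>
    <p>Headings and paragraphs transcribed as written.</p>
  </encodingDesc>
  <profileDesc>
    <textClass>
      <!-- Keywords describing the document; "local" is a placeholder scheme -->
      <keywords scheme="local">
        <list>
          <item>XML</item>
          <item>TEI</item>
        </list>
      </keywords>
    </textClass>
  </profileDesc>
</teiHeader>
```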
The text element is itself divided into three parts:
front element: The preface, table of contents, dedication page, pictures of the cover, and so forth. Each of these is represented by a div element with a type attribute whose value identifies the division as a table of contents, preface, title page, and so forth. Each of these divisions contains other elements laying out the content of that division.
body element: The individual chapters, acts, and so forth that make up the document. Each of these is represented by a div1 element with a type attribute that identifies this particular division as a volume, book, part, chapter, poem, act, and so forth. Each div1 element has a head child giving the title of the volume, book, part, chapter, etc.
back element: The index, glossary, etc.
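Stripped of its content, the three-part layout described above looks roughly like the following skeleton. It mirrors the element choices of Example 6-1; the type values and headings are illustrative placeholders, not a fixed vocabulary.

```xml
<text>
  <front>
    <!-- Front matter: one div per preface, table of contents, etc. -->
    <div type="preface">
      <head>Preface</head>
      <p>...</p>
    </div>
  </front>
  <body>
    <!-- Main content: one div1 per top-level division -->
    <div1 type="chapter">
      <head>Chapter title</head>
      <p>...</p>
    </div1>
  </body>
  <back>
    <!-- Back matter: index, glossary, and so forth -->
    <div1 type="glossary">
      <head>Glossary</head>
      <p>...</p>
    </div1>
  </back>
</text>
```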
The divisions may be further subdivided; div1s can contain div2s, div2s can contain div3s, div3s can contain div4s, and so on up to div7. However, for any given work, there is a smallest division. This division contains paragraphs represented by p elements for prose or stanzas represented by lg elements for poetry. Stanzas are further broken up into individual lines represented by l elements.
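The fragment below sketches both cases: a prose division nested two levels deep, and a verse division whose lines are grouped in an lg element. The headings, the type and n values, and the choice to group the opening lines of the Aeneid in a single lg are assumptions made for illustration; the Aeneid is not divided into stanzas, and punctuation of the Latin varies by edition.

```xml
<body>
  <!-- Prose: the smallest division (here a div2) holds the paragraphs -->
  <div1 type="part">
    <head>Part title</head>
    <div2 type="chapter">
      <head>Chapter title</head>
      <p>First paragraph of the chapter.</p>
    </div2>
  </div1>
  <!-- Verse: a line group (lg) holds individual lines (l) -->
  <div1 type="book" n="1">
    <head>Liber I</head>
    <lg>
      <l>Arma virumque cano, Troiae qui primus ab oris</l>
      <l>Italiam, fato profugus, Laviniaque venit</l>
      <l>litora, multum ille et terris iactatus et alto</l>
    </lg>
  </div1>
</body>
```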
Both lines and paragraphs contain mixed content; that is, they contain plain text interspersed with child elements. Parts of the text may be marked up further by elements indicating that particular words or characters are people's names (name), corrections (corr), illegible (unclear), misspellings (sic), and so on.
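A short sketch of such mixed content follows. The sentences are invented for illustration and are not a transcription of any real document. The attribute names (type, corr, sic, reason) are assumed to be those defined by the TEI Lite DTD used in Example 6-1.

```xml
<!-- Keep the misspelling from the source and record the correction -->
<p>
  A letter from <name type="person">Thomas Jefferson</name> was
  <sic corr="received">recieved</sic> on the
  <unclear reason="faded ink">third</unclear> of the month.
</p>
<!-- Alternatively, print the corrected reading and record the original -->
<p>The letter was <corr sic="recieved">received</corr> late.</p>
```

Which of the two forms to use is an editorial decision: sic preserves the source's reading in the running text, while corr puts the editor's reading there.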
This structure fairly closely reflects the structure of the actual documents that are being encoded in TEI. This is true of most narrative-oriented XML applications that need to handle fairly generic documents. TEI is a very representative example of typical XML document structure.