WordprocessingML

Beginning with Microsoft Office 2003 for Windows (but not Office 2004 for the Mac), Microsoft gave Word and other Office components the ability to save documents in XML, although by default they still use a binary format. The XML application saved by Microsoft Word is named WordprocessingML. Unlike DocBook, TEI, and OpenOffice, all of which were designed from scratch without any legacy issues, WordprocessingML was designed more as an XML representation of an existing binary file format. This makes it a rather unusual example of a narrative document format. We would not recommend that you emulate its design in your own applications. Nonetheless, it can be educational to compare it to the other three formats.

Example 6-4 shows the same document as in the previous three examples, this time encoded in WordprocessingML. The WordprocessingML version seems the most opaque and cryptic of the four formats discussed in this chapter. This example makes it pretty obvious that XML is not magic pixie dust you can sprinkle on an existing format to create clean, legible, maintainable data.

The root element of a WordprocessingML document is w:wordDocument. Here, the w prefix is mapped to the namespace URI http://schemas.microsoft.com/office/word/2003/wordml. Several other namespaces are declared for different content that can be embedded in a Word file.
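As with any namespaced XML, it is the namespace URI that identifies the root element, not the w prefix itself. The following is a minimal sketch in Python, using only the standard library's ElementTree module; the function name is_wordml and the two tiny sample documents are made up for illustration and are not part of any Microsoft API.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the text above.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def is_wordml(xml_text):
    """Return True if the root element is wordDocument in the WordprocessingML namespace.

    is_wordml is a hypothetical helper name used only for this sketch.
    """
    root = ET.fromstring(xml_text)
    # ElementTree expands prefixed names to "{namespace-uri}local-name".
    return root.tag == "{%s}wordDocument" % W_NS


# Two illustrative documents: same namespace, different prefixes.
sample = '<w:wordDocument xmlns:w="%s"><w:body/></w:wordDocument>' % W_NS
renamed = '<doc:wordDocument xmlns:doc="%s"><doc:body/></doc:wordDocument>' % W_NS

print(is_wordml(sample))
print(is_wordml(renamed))
```

Both documents are recognized, because a namespace-aware parser compares URIs and ignores the prefix chosen by the author.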

This root element can contain several different chunks of metadata. Here I've used three: o:DocumentProperties for basic metadata like author and title, a w:fonts element that lists the fonts used in the document and their metrics, and a w:styles element that lists the styles referenced in the document. All of these are optional. However, a document saved by Microsoft Word itself would include all of these and several more.
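Because the metadata lives in ordinary child elements, it can be read with any XML parser. The sketch below assumes Python's standard ElementTree module and the namespace URIs used in Example 6-4; the helper name document_properties and the abbreviated sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URIs are the ones declared in Example 6-4.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "o": "urn:schemas-microsoft-com:office:office",
}


def document_properties(xml_text):
    """Return the o:DocumentProperties children as a {local-name: text} dict.

    document_properties is a hypothetical helper name. An empty dict is
    returned when the optional o:DocumentProperties element is absent.
    """
    root = ET.fromstring(xml_text)
    props = root.find("o:DocumentProperties", NS)
    if props is None:
        return {}
    # Drop the "{namespace-uri}" part of each tag to get the local name.
    return {child.tag.split("}", 1)[-1]: (child.text or "") for child in props}


# Abbreviated, hand-written sample in the style of Example 6-4.
sample = (
    '<w:wordDocument'
    ' xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' xmlns:o="urn:schemas-microsoft-com:office:office">'
    '<o:DocumentProperties>'
    '<o:Title>XML in a Nutshell</o:Title>'
    '<o:Author>W. Scott Means</o:Author>'
    '</o:DocumentProperties>'
    '<w:body/>'
    '</w:wordDocument>'
)

print(document_properties(sample))
```

Since the metadata elements are optional, the sketch treats a missing o:DocumentProperties as an empty result instead of an error.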

The actual content of the document is stored in a w:body element. The body is divided into sections (wx:sect elements), which can be further divided into subsections (wx:subsection elements). Unusually, these are completely optional; removing them would have no effect. They're mainly present for the convenience of humans. The real structure of the document is inferred not from the sections and subsections but from paragraphs with outline levels.

There are three basic text elements in WordprocessingML that you'll find inside the body: w:t, w:r, and w:p. w:t is for text; w:r is for a run of text, like a span in HTML; and w:p is for a paragraph. A w:p contains w:r elements, each of which contains one w:t element. Neither a w:r nor a w:p can contain text directly. Whitespace is significant within w:t elements, although not within most other elements. However, a line break inside a w:t element is treated the same as a space; the actual line breaks in the document are indicated by the paragraph boundaries. This matches the typical word-wrapping behavior of Word and most other word processors.
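The nesting of paragraphs, runs, and text described above suggests a simple way to recover plain text: concatenate the w:t content of each w:p and turn embedded line breaks into spaces. The following Python sketch does this with the standard ElementTree module. The function name paragraph_texts and the sample document are assumptions for illustration, and the sketch ignores complications such as nested paragraphs.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
W = "{%s}" % W_NS  # ElementTree's expanded-name prefix for w: elements


def paragraph_texts(xml_text):
    """Return one plain-text string per w:p element, in document order.

    paragraph_texts is a hypothetical helper name used for this sketch.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(W + "p"):
        # Text lives only in w:t elements, which sit inside w:r runs
        # (possibly wrapped in other elements such as w:hlink).
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        # A line break inside w:t counts as a space. The XML parser has
        # already normalized all line endings to "\n".
        paragraphs.append(text.replace("\n", " "))
    return paragraphs


# Hand-written sample: one paragraph split across two runs, one empty paragraph.
sample = (
    '<w:wordDocument xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>XML is first\nand foremost </w:t></w:r>'
    '<w:r><w:t>a document format.</w:t></w:r></w:p>'
    '<w:p></w:p>'
    '</w:body></w:wordDocument>' % W_NS
)

print(paragraph_texts(sample))
```

Note that the trailing space in the first run is preserved, because whitespace inside w:t is significant; without it the two runs would be joined with no space between them.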

Beyond these and a few other elements, there are almost no semantics in WordprocessingML markup. Instead, many characters are expended on precisely reproducing the appearance of the page, including fonts, font metrics, styles, line breaks, and so forth. In a document that's saved from Word (as opposed to being written by hand as this one was), the style information can easily occupy several dozen times the amount of space the content itself does. Headings are identified not by a separate heading element of some kind, but by setting the outline level property: the paragraph's w:pPr element, which is a preceding sibling of the w:r elements that hold the heading text, has a w:outlineLvl child. The use of sibling elements to set properties (instead of attributes or parent elements) is a very unusual pattern, one that's not well-supported by most XML processing tools.
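To make the heading mechanism concrete, here is a rough Python sketch that rebuilds a document outline purely from w:outlineLvl values, ignoring any wx:sect or wx:subsection wrappers. It assumes the standard ElementTree module; the function name outline and the abbreviated sample document are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
WX_NS = "http://schemas.microsoft.com/office/word/2003/auxHint"
W = "{%s}" % W_NS  # ElementTree's expanded-name prefix for w: names


def outline(xml_text):
    """Return (level, text) pairs for every paragraph with an outline level.

    outline is a hypothetical helper name used for this sketch.
    """
    root = ET.fromstring(xml_text)
    headings = []
    # iter() walks the whole tree, so wx:sect and wx:subsection wrappers
    # are simply passed through.
    for p in root.iter(W + "p"):
        lvl = p.find(W + "pPr/" + W + "outlineLvl")
        if lvl is None:
            continue  # no outline level: an ordinary body paragraph
        value = lvl.get(W + "val")  # the attribute is namespaced too
        if value is None:
            continue
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        headings.append((int(value), text))
    return headings


# Abbreviated, hand-written sample in the style of Example 6-4.
sample = (
    '<w:wordDocument xmlns:w="{w}" xmlns:wx="{wx}"><w:body><wx:sect>'
    '<w:p><w:pPr><w:outlineLvl w:val="0"/></w:pPr>'
    '<w:r><w:t>XML as a Document Format</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Body text.</w:t></w:r></w:p>'
    '<wx:subsection>'
    '<w:p><w:pPr><w:outlineLvl w:val="1"/></w:pPr>'
    '<w:r><w:t>DocBook</w:t></w:r></w:p>'
    '</wx:subsection>'
    '</wx:sect></w:body></w:wordDocument>'
).format(w=W_NS, wx=WX_NS)

print(outline(sample))
```

The code has to look sideways from the text to the w:pPr sibling in every paragraph, which is the awkwardness the sibling-property pattern imposes on processing tools.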

Example 6-4. A WordprocessingML document

<?xml version="1.0" encoding="UTF-8"?>
<w:wordDocument 
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" 
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" 
  xmlns:o="urn:schemas-microsoft-com:office:office" 
  xml:space="preserve">
  <o:DocumentProperties>
    <o:Title>XML in a Nutshell</o:Title>
    <o:Author>W. Scott Means</o:Author>
    <o:LastAuthor>Elliotte Rusty Harold</o:LastAuthor>
    <o:Revision>2</o:Revision>
    <o:TotalTime>0</o:TotalTime>
    <o:LastPrinted>1601-01-01T04:00:00Z</o:LastPrinted>
    <o:Created>2004-05-25T00:40:00Z</o:Created>
    <o:LastSaved>2004-05-25T00:40:00Z</o:LastSaved>
    <o:Pages>1</o:Pages>
    <o:Words>162</o:Words>
    <o:Characters>925</o:Characters>
    <o:Company>Cafe au Lait</o:Company>
    <o:Lines>7</o:Lines>
    <o:Paragraphs>2</o:Paragraphs>
    <o:CharactersWithSpaces>1085</o:CharactersWithSpaces>
    <o:Version>11.4920</o:Version>
  </o:DocumentProperties>
  <w:fonts>
    <w:defaultFonts w:ascii="Times New Roman" 
                     w:fareast="Times New Roman" 
                     w:h-ansi="Times New Roman" w:cs="Times New Roman"/>
    <w:font w:name="Helvetica">
      <w:panose-1 w:val="020B0604020202030204"/>
      <w:charset w:val="00"/>
      <w:family w:val="Swiss"/>
      <w:pitch w:val="variable"/>
      <w:sig w:usb-0="20003A87" w:usb-1="00000000" w:usb-2="00000000" 
             w:usb-3="00000000" w:csb-0="000001FF" w:csb-1="00000000"/>
    </w:font>
  </w:fonts>
  <w:styles>
    <w:style w:type="character"  w:styleId="emphasis"  w:default="off"/>
  </w:styles>
  <w:body>
  <wx:sect>
     <w:p>
       <w:pPr>
          <w:outlineLvl w:val="0" />
       </w:pPr>
       <w:r>
         <w:t>Introducing XML</w:t>
       </w:r>
     </w:p>
    <w:p></w:p>
  </wx:sect>
     
  <wx:sect>
     <w:p>
       <w:pPr>
          <w:outlineLvl w:val="0" />
       </w:pPr>
       <w:r>
         <w:t>XML as a Document Format</w:t>
       </w:r>
     </w:p>
     
   <w:p>
     <w:r>
       <w:t>XML is first and foremost a document format. It was always intended
for web pages, books, scholarly articles, poems, short stories,
reference manuals, tutorials, texts, legal pleadings, contracts,
instruction sheets, and other documents that human beings would
read. Its use as a syntax for computer data in applications like
syndication, order processing, object serialization, database
exchange and backup, electronic data interchange, and so forth is
mostly a happy accident.</w:t>
     </w:r>
   </w:p>
     
   <wx:subsection>
     <w:p>
       <w:pPr>
          <w:outlineLvl w:val="1" />
       </w:pPr>
       <w:r>
         <w:t>SGML's Legacy</w:t>
       </w:r>
     </w:p>
     <w:p></w:p>
   </wx:subsection>
   <wx:subsection>
     <w:p>
       <w:pPr>
          <w:outlineLvl w:val="1" />
       </w:pPr>
       <w:r>
         <w:t>TEI</w:t>
       </w:r>
     </w:p>
     <w:p></w:p>
   </wx:subsection>
     
   <wx:subsection>
     <w:p>
       <w:pPr>
          <w:outlineLvl w:val="1" />
       </w:pPr>
       <w:r>
         <w:t>DocBook</w:t>
       </w:r>
     </w:p>
     <w:p>
       <w:hlink w:dest="http://www.docbook.org/">
         <w:r>
            <w:rPr>
               <w:rStyle w:val="Hyperlink" />
            </w:rPr>
            <w:t>DocBook</w:t>
         </w:r>
       </w:hlink>
       <w:r>
       <w:t>
is an SGML application designed for new documents, not old ones.
It's especially common in computer documentation. Several
O'Reilly books have been written in DocBook including </w:t>
       </w:r>
       <w:r>
         <w:rPr>
            <w:rStyle w:val="emphasis"/>
         </w:rPr>
         <w:t>Norm Walsh and Leonard Muellner's DocBook: The 
Definitive Guide</w:t>
      </w:r> 
     <w:r>
       <w:t>. Much of the </w:t>
       </w:r>
       <w:hlink w:dest="http://www.linuxdoc.org/">
         <w:r>
            <w:rPr>
               <w:rStyle w:val="Hyperlink" />
            </w:rPr>
            <w:t>Linux Documentation Project (LDP)</w:t>
         </w:r>
         </w:hlink>  
         <w:r>
       <w:t> corpus is written in DocBook. </w:t>
       </w:r>
     </w:p>
   </wx:subsection>
     
  </wx:sect>
     
  <wx:sect>
     <w:p>
       <w:pPr>
          <w:outlineLvl w:val="0" />
       </w:pPr>
       <w:r>
         <w:t>XML on the Web</w:t>
       </w:r>
     </w:p>
    <w:p></w:p>
  </wx:sect>
  </w:body>
</w:wordDocument>

Two things strike me about this format. The first is the cryptic nature of the short tag names such as t, r, p, and the positively verbose rPr. The second is the large number of tags needed to mark up this fairly simple document. The problem seems to be that Word bundles all style definitions into the document, and then repeats styles for each paragraph, even if they're reused across the entire document. XML doesn't have to be verbose, but this example certainly is, and even so it is far less verbose than what I actually saw saved by Word 2003. DocBook and TEI are human-legible, even in plain text form. OpenOffice.org and WordprocessingML really aren't, especially in their natural states.