Beginning with Microsoft Office 2003 for Windows (but not Office 2004 for the Mac), Microsoft gave Word and the other Office components the ability to save all documents in XML, although by default it still picks a binary format. The XML application saved by Microsoft Word is named WordprocessingML. Unlike DocBook, TEI, and OpenOffice.org, all of which were designed from scratch without any legacy issues, WordprocessingML was designed more as an XML representation of an existing binary file format. This makes it a rather unusual example of a narrative document format. I would not recommend that you emulate its design in your own applications. Nonetheless, it can be educational to compare it to the other three formats.
Example 6-4 shows the same document as in the previous three examples, this time encoded in WordprocessingML. The WordprocessingML version seems the most opaque and cryptic of the four formats discussed in this chapter. This example makes it pretty obvious that XML is not magic pixie dust you can sprinkle on an existing format to create clean, legible, maintainable data.
The root element of a WordprocessingML document is w:wordDocument. Here, the w prefix is mapped to the namespace URI http://schemas.microsoft.com/office/word/2003/wordml. Several other namespaces are declared for different content that can be embedded in a Word file.
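As a rough illustration of what the namespace mapping means in practice, the following Python sketch checks whether a document's root element is w:wordDocument in that namespace. The SAMPLE string is a hypothetical, stripped-down document written for this sketch, not the chapter's example, and is_wordprocessingml is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the text for the w prefix.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical minimal document; files saved by Word itself are far larger.
SAMPLE = f"""<w:wordDocument xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def is_wordprocessingml(xml_text: str) -> bool:
    """Return True if the root element is wordDocument in the WordML namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace-uri}local-name",
    # so the prefix chosen in the document (w or anything else) is irrelevant.
    return root.tag == f"{{{W_NS}}}wordDocument"


if __name__ == "__main__":
    print(is_wordprocessingml(SAMPLE))
```

Because the comparison uses the namespace URI rather than the prefix, the check would still succeed if a document bound the same URI to a different prefix.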
This root element can contain several different chunks of metadata. Here I've used three: o:DocumentProperties for basic metadata like author and title, a w:fonts element that lists the fonts used in the document and their metrics, and a w:styles element that lists the styles referenced in the document. All of these are optional. However, a document saved by Microsoft Word itself would include all of these and several more.
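To make the optional metadata concrete, here is a small Python sketch that reads the children of o:DocumentProperties into a dictionary and returns an empty dictionary when the element is absent. It assumes the o prefix is bound to urn:schemas-microsoft-com:office:office and that the title and author are stored in o:Title and o:Author children; the sample document and the document_properties helper are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    # Assumed URI for the o prefix; check it against a real file.
    "o": "urn:schemas-microsoft-com:office:office",
}

# Hypothetical document carrying only one of the optional metadata blocks.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:o="urn:schemas-microsoft-com:office:office">
  <o:DocumentProperties>
    <o:Title>Sample Title</o:Title>
    <o:Author>A. Writer</o:Author>
  </o:DocumentProperties>
  <w:body/>
</w:wordDocument>"""


def document_properties(xml_text: str) -> dict:
    """Map each child of o:DocumentProperties to its text, or {} if absent."""
    root = ET.fromstring(xml_text)
    props = root.find("o:DocumentProperties", NS)
    if props is None:
        # The metadata is optional, so a missing block is not an error.
        return {}
    # Strip the "{namespace-uri}" part so keys are plain local names.
    return {child.tag.rsplit("}", 1)[-1]: (child.text or "") for child in props}


if __name__ == "__main__":
    print(document_properties(SAMPLE))
```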
The actual content of the document is stored in a w:body element. The body is divided into sections (wx:sect elements), which can be further divided into subsections (wx:subsection elements). Unusually, these are completely optional; removing them would have no effect. They're mainly present for the convenience of humans. The real structure of the document is inferred not from the sections and subsections but from paragraphs with outline levels.
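One way to see that the section wrappers carry no information of their own is to collect the paragraphs from two otherwise identical documents, one with a wx:sect wrapper and one without. The Python sketch below does that. It assumes the wx prefix maps to http://schemas.microsoft.com/office/word/2003/auxHint, both sample documents are hypothetical, and it does not attempt to read outline levels.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
# Assumed URI for the wx prefix; verify against a file saved by Word.
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"

# Hypothetical body whose paragraphs are wrapped in a section.
WITH_SECTIONS = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body>
    <wx:sect>
      <w:p><w:r><w:t>A Heading</w:t></w:r></w:p>
      <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
    </wx:sect>
  </w:body>
</w:wordDocument>"""

# The same paragraphs placed directly inside w:body.
WITHOUT_SECTIONS = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>A Heading</w:t></w:r></w:p>
    <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraph_texts(xml_text: str) -> list:
    """Return the text of every w:p, however deeply it is nested."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for p in root.iter(f"{{{W}}}p")  # iter() walks all descendants
    ]


if __name__ == "__main__":
    print(paragraph_texts(WITH_SECTIONS) == paragraph_texts(WITHOUT_SECTIONS))
```

A reader that searches for w:p descendants rather than direct children of w:body sees the same paragraph sequence in both cases.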
There are three basic text elements in WordprocessingML that you'll find inside the body: w:t, w:r, and w:p. w:t is for text; w:r is for a run of text, like a span in HTML; and w:p is for a paragraph. A w:p contains w:r elements, each of which contains one w:t element. Neither a w:r nor a w:p can contain text directly. Whitespace is significant within w:t elements, although not within most other elements. However, line breaks are treated the same as spaces. The actual line breaks are indicated by the paragraph boundaries. This matches the typical word-wrapping behavior of Word and most other word processors.
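The paragraph, run, and text nesting can be sketched as a small plain-text extractor in Python. This is a minimal illustration under the rules just described, not a complete reader: it joins the w:t text of each run in a paragraph, keeps whitespace inside w:t, and turns line breaks inside w:t into spaces. The sample document and the paragraphs helper are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical document: two runs in the first paragraph, the second run
# bold and containing a line break inside its w:t element.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>Hello, </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>bold
world</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs(xml_text: str) -> list:
    """Return one string per w:p, built from the w:t text of its w:r runs."""
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        pieces = []
        for r in p.findall(f"{{{W}}}r"):
            for t in r.findall(f"{{{W}}}t"):
                # Whitespace inside w:t is significant, so do not strip it.
                pieces.append(t.text or "")
        # A line break inside w:t counts as a space; real line breaks
        # come only from paragraph boundaries.
        result.append("".join(pieces).replace("\n", " "))
    return result


if __name__ == "__main__":
    for line in paragraphs(SAMPLE):
        print(line)
```

The indentation between elements inside w:p never reaches the output because only w:t content is read, which mirrors the rule that whitespace matters only within w:t.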
Two things strike me about this format. The first is the cryptic nature of the short tag names such as t, r, p, and the positively verbose rPr. The second is the large number of tags needed to mark up this fairly simple document. The problem seems to be that Word bundles all style definitions into the document, and then repeats styles for each paragraph, even if they're reused across the entire document. XML doesn't have to be verbose, but this example certainly is, and even so it is far less verbose than what I actually saw saved by Word 2003. DocBook and TEI are human-legible, even in plain text form. OpenOffice.org and WordprocessingML really aren't, especially in their natural states.