Chapter 18. Programming Models

This chapter briefly explains the most popular programming techniques for parsing, manipulating, and generating XML data. XML support is available for virtually every programming platform in use today, from supercomputer to cell phone. If you can't find XML support built into your programming environment, a quick Google search will likely locate a library.

XML's structured and tagged text can be processed by developers in several ways. Programs can look at XML as plain text, as a stream of events, as a tree, or as a serialization of some other structure. Tools supporting all of these options are widely available.

At their foundation, XML documents are text. The content and markup are both represented as text, and text-editing tools can be extremely useful for XML document inspection, creation, and modification. XML's textual foundations make it possible for developers to work with XML directly, using XML-specific tools only when they choose to.

One of the original design goals of XML was for documents to be easy to parse. For very simple documents that do not depend on features such as attribute defaulting and validation, it is possible to parse tags, attributes, and text data using standard programming tools such as regular expressions and tokenizers, but the complexity of processing grows rapidly as documents use more features. Unless the application can completely control the content of incoming documents, it is almost always preferable to use one of the many high-quality XML parsers that are freely available for most programming languages.

Textual tools are a key part of the XML toolset, however. Many developers use text editors such as vi, Emacs, Notepad, WordPad, BBEdit, and UltraEdit to create or modify XML documents. Regular expressions (in environments such as sed, grep, Perl, and Python) can be used for search and replace or for tweaking documents prior to XML parsing or XSLT processing. Various standards are beginning to take advantage of regular expression matching after a particular document has been parsed. The W3C's XML Schema recommendation, for instance, includes regular-expression matching as one mechanism for validating data types, as discussed in Chapter 17.

Text-based processing can be performed in conjunction with other XML processing. Parsing and then serializing XML documents after other processing has taken place doesn't always produce the desired results. XSLT, for instance, will remove entity references and replace them with entity content. Preserving entities requires replacing them in the original document with unique placeholders before processing, and then swapping each placeholder back to its entity reference wherever it appears in the result. With regular expressions, this is quite easy to do.
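The placeholder technique can be sketched in a few lines of Python. This is only an illustration: the `[[ENT:name]]` placeholder syntax and the function names are assumptions made for this example, and a real application would need a placeholder that cannot occur in its own data.

```python
import re

# Matches general entity references such as &company; but leaves alone
# character references (&#169;) and the five predefined entities.
ENTITY_REF = re.compile(r"&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][\w.-]*);")

# Hypothetical placeholder syntax chosen for this sketch.
PLACEHOLDER = re.compile(r"\[\[ENT:([A-Za-z_][\w.-]*)\]\]")


def protect_entities(text):
    """Swap each entity reference for a placeholder before parsing."""
    return ENTITY_REF.sub(r"[[ENT:\1]]", text)


def restore_entities(text):
    """Swap each placeholder back into an entity reference afterward."""
    return PLACEHOLDER.sub(r"&\1;", text)


source = "<p>&copyright; 2004 &company;, all rights reserved &#169;</p>"
protected = protect_entities(source)   # safe to hand to a parser or XSLT
restored = restore_entities(protected)  # entity references are back
```

The protected text contains no entity references that a parser would expand, so it can pass through other processing untouched before the references are restored.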

As an XML parser reads a document, it moves from the beginning of the document to the end. It may pause to retrieve external resources—for a DTD or an external entity, for instance—but it builds an understanding of the document as it moves along. Tree-based XML technologies (such as the DOM) combine these incremental parsing events into a monolithic image of an XML document once parsing has been completed successfully.

Event-based parsers, on the other hand, report these interim events to their client applications as they happen. Some common parsing events are element start-tag read, element content read, and element end-tag read. For example, consider the document in Example 18-1.

An event-based parser might report events such as this:

startElement:name
startElement:given
content:Keith
endElement:given
startElement:family
content:Johnson
endElement:family
endElement:name

The list and structure of events can become much more complex as features such as namespaces, attributes, whitespace between elements, comments, processing instructions, and entities are added, but the basic mechanism is quite simple and generally very efficient.
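A handler that produces a listing like the one above might look like the following sketch, which uses the SAX support in Python's standard library. The document string is an assumed stand-in consistent with the events listed; it is not taken from Example 18-1 itself.

```python
import xml.sax

# Assumed stand-in document, inferred from the events listed above.
DOCUMENT = "<name><given>Keith</given><family>Johnson</family></name>"


class EventLogger(xml.sax.ContentHandler):
    """Records each parsing event as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)

    def characters(self, content):
        # A parser is free to deliver one run of text in several calls.
        self.events.append("content:" + content)

    def endElement(self, name):
        self.events.append("endElement:" + name)


logger = EventLogger()
xml.sax.parseString(DOCUMENT.encode("utf-8"), logger)
print("\n".join(logger.events))
```

The parser drives the handler here: the application only supplies callbacks and never asks for the next event itself.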

Event-based applications are generally more complex than tree-based applications. Processing events typically means the creation of a state machine, code that understands the current context and can route the information in the events to the proper consumer. Because events occur as the document is read, applications must be prepared to discard results should a fatal error occur partway through the document. Also, accessing a wide variety of data scattered throughout a document is much more involved than it would be if the entire document were parsed into a tree structure.
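The sketch below shows, under the same assumed sample document, what that context tracking can look like in practice. The handler keeps a stack of open elements as its state, routes text to a field based on that state, and throws away partial results if the parse fails. The class and function names are invented for this illustration.

```python
import xml.sax

# Assumed sample input for illustration.
DOCUMENT = b"<name><given>Keith</given><family>Johnson</family></name>"


class NameHandler(xml.sax.ContentHandler):
    """Tracks the current element path and routes text to the right field."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of open element names: the parser state
        self.fields = {}  # collected results, keyed by element name

    def startElement(self, name, attrs):
        self.path.append(name)

    def characters(self, content):
        # Only text directly inside name/given or name/family is of interest.
        if self.path in (["name", "given"], ["name", "family"]):
            key = self.path[-1]
            # Text may arrive in several chunks, so append rather than assign.
            self.fields[key] = self.fields.get(key, "") + content

    def endElement(self, name):
        self.path.pop()


def parse_name(data):
    """Return the collected fields, or None if the document is not well-formed."""
    handler = NameHandler()
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException:
        # A fatal error partway through: discard the partial results.
        return None
    return handler.fields


print(parse_name(DOCUMENT))
```

Even for a document this small, the handler needs a stack and a routing rule; a document with more varied structure needs correspondingly more state.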

The upside to an event-based API is speed and efficiency. Because event-based APIs stream the document to the client application, your program can begin working with the data from the beginning of the document before the end of the document is seen. It doesn't have to wait for the entire document to be read before commencing. For instance, a brokerage program receiving a long list of requests to buy individual stocks could execute the first trade before the parser reads the second trade, execute the second trade before the parser reads the third trade, and so forth. This could save crucial seconds on the initial trades if the document includes many separate orders.

Even more important than speed is size. XML documents can be quite large, sometimes ranging into the gigabytes. An event-based API does not need to store all this data in memory at one time. It can process the document in small, easily handled chunks, then reclaim that storage. In practice, even on large servers with gigabytes of RAM, XML documents larger than a couple of hundred megabytes often can't be processed with a tree-based API, because the in-memory tree is several times the size of the document. In an embedded environment (such as a cell phone), memory limitations mandate streaming APIs.
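As a rough sketch of the brokerage scenario, the handler below passes each order to a callback the moment its start-tag is read and keeps nothing afterward. The order vocabulary (the `orders` and `order` elements and their `symbol` and `shares` attributes) is hypothetical, as is the callback.

```python
import io
import xml.sax

# Hypothetical order list; element and attribute names are made up.
ORDERS = b"""<orders>
  <order symbol="AAA" shares="100"/>
  <order symbol="BBB" shares="250"/>
  <order symbol="CCC" shares="75"/>
</orders>"""


class OrderHandler(xml.sax.ContentHandler):
    """Hands each order to a callback as soon as its start-tag is read."""

    def __init__(self, execute):
        super().__init__()
        self.execute = execute

    def startElement(self, name, attrs):
        if name == "order":
            # Nothing is retained once the callback returns, so memory use
            # does not grow with the number of orders in the document.
            self.execute(attrs["symbol"], int(attrs["shares"]))


executed = []
parser = xml.sax.make_parser()
parser.setContentHandler(OrderHandler(lambda sym, n: executed.append((sym, n))))
parser.parse(io.BytesIO(ORDERS))  # any file-like byte stream would do
print(executed)
```

Because the input is read as a stream, the same code would work unchanged on a file far too large to hold in memory.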

Event-based parsers also more naturally fit certain tasks, such as content filtering. Filters can process and modify events before passing them to another processor, efficiently performing a wide range of transformations. Filters can be chained, providing a relatively simple means of building XML processing pipelines, where the information from one processor flows directly into another. Applications that want to feed information directly from XML documents into their own internal structures may find events to be the most efficient means of doing that. Even parsers that report XML documents as complete trees, as described in the next section, typically build those trees from a stream of events.
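A small pipeline of this kind can be sketched with the `XMLFilterBase` and `XMLGenerator` classes from Python's `xml.sax.saxutils` module. The two filter classes and the sample document are assumptions made for this example: one filter drops a named element and its content, and the other uppercases text.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElements(XMLFilterBase):
    """Removes the named elements and everything inside them."""

    def __init__(self, parent, names):
        super().__init__(parent)
        self.names = set(names)
        self.depth = 0  # greater than zero while inside a dropped element

    def startElement(self, name, attrs):
        if self.depth or name in self.names:
            self.depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self.depth:
            super().characters(content)


class UppercaseText(XMLFilterBase):
    """Passes every event through unchanged except text, which it uppercases."""

    def characters(self, content):
        super().characters(content.upper())


source = (b"<name><given>Keith</given><middle>Q</middle>"
          b"<family>Johnson</family></name>")
out = io.StringIO()

# Chain: parser -> DropElements -> UppercaseText -> XMLGenerator
pipeline = UppercaseText(DropElements(xml.sax.make_parser(), ["middle"]))
pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))
pipeline.parse(io.BytesIO(source))
result = out.getvalue()
print(result)
```

Each filter sees only the events its predecessor chose to pass along, so stages can be added, removed, or reordered without touching the others.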

Tip

The Simple API for XML (SAX), described in Chapter 20 and Chapter 26, is the most commonly used event-based API. SAX2, the current version, is hosted at http://sax.sourceforge.net/. Expat, a widely used XML parser written in C, also uses an event-based API. For information on the expat parser and its API, see http://expat.sourceforge.net.

XML documents, because of the requirements for well-formedness, can be readily described using tree structures. Elements are inherently hierarchical, as they may contain other elements, text content, comments, and so forth.

There is a wide variety of tree models for XML documents. XPath (described in Chapter 9), used in XSLT transformations, has a slightly different set of expectations than does the Document Object Model (DOM) API, which is also different from the XML Information Set (Infoset), another W3C project. XML Schema (described in Chapter 17 and Chapter 22) defines a Post-Schema Validation Infoset (PSVI), which has more information in it (derived from the XML Schema) than any of the others.

Developers who want to manipulate documents from their programs typically use APIs that provide access to an object model representing the XML document. Tree-based APIs typically present a model of an entire document to an application once parsing has successfully concluded. Applications don't have to worry about manually maintaining parsing context or partial processing when a parse error is encountered, as the tree-based parser generally handles errors on its own. Rather than following a stream of events, an application can just navigate through the tree to find the desired pieces of a document.

Working with a tree model has substantial advantages. The entire document is always available, and moving well-balanced portions of a document from one place to another or modifying them is fairly easy. The complete context for any given part of the document is always available. When using APIs that support it, developers can use XPath expressions to locate content and make decisions based on content anywhere in the document. (DOM Level 3 adds formal support for XPath, and various implementations already provide their own nonstandard support.)
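The following sketch illustrates both points using Python's `xml.etree.ElementTree` module, a lightweight tree API that is not the DOM but supports a subset of XPath. The document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are made up for illustration.
doc = ET.fromstring(
    "<people>"
    "<staff><name><given>Keith</given><family>Johnson</family></name></staff>"
    "<alumni/>"
    "</people>"
)

# Locate content anywhere in the tree with a path expression.
family = doc.find("./staff/name/family").text

# Move a complete subtree from one parent to another.
staff = doc.find("staff")
alumni = doc.find("alumni")
name = staff.find("name")
staff.remove(name)
alumni.append(name)

print(family)
print(ET.tostring(doc, encoding="unicode"))
```

Neither operation requires the application to track parsing state; the whole document is simply there to be queried and rearranged.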

Tree models of documents have a few drawbacks. They can take up large amounts of memory, typically three to ten times the original document's file size. Navigating documents can require additional processing after the parse, as developers have more options available to them. (Tree models don't impose the same kinds of discipline as event-based processing.) These issues can make it difficult to scale and share applications that rely on tree models, although they may still be appropriate where small numbers of documents or small documents are being used.

Tip

The Document Object Model (DOM), described in Chapter 19 and Chapter 25, is the most common tree-based API. JDOM (http://jdom.org/), DOM4J (http://dom4j.org/), and XOM (http://www.cafeconleche.org/XOM) are Java-only alternatives. (XOM is an object model promoted by Elliotte Rusty Harold, one of the authors.)

The most recent entrant into the XML processing arena is the so-called pull processing model. One of the most widely used pull processors is the Microsoft .NET XmlReader class. The pull model is most similar to the event-based model in that it makes the contents of the XML document available progressively as the document is parsed.

Unlike the event model, the pull approach relies on the client application to request content from the parser at its own pace. For example, a pull client might include the following code to parse the simple document shown in Example 18-1:

reader.ReadStartElement("name")
reader.ReadStartElement("given")
givenName = reader.ReadString()
reader.ReadEndElement()
reader.ReadStartElement("family")
familyName = reader.ReadString()
reader.ReadEndElement()
reader.ReadEndElement()

The pull client requests the XML content it expects to see from the pull parser. In practice, this makes pull client code easier to read and understand than the corresponding event-based code would be. It also tends to reduce the need to create stacks and structures to contain document information, as the code itself can be written to mirror recursive descent parsing.
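For comparison, here is a rough Python counterpart built on the standard library's `xml.dom.pulldom` module. It is a sketch rather than a translation of the .NET code above, and the document string is an assumed stand-in for the sample document.

```python
from xml.dom import pulldom

# Assumed stand-in for the simple sample document.
DOCUMENT = "<name><given>Keith</given><family>Johnson</family></name>"

fields = {}
events = pulldom.parseString(DOCUMENT)

# The loop below is the client pulling: each iteration asks the parser
# for the next event, and nothing is parsed until the client asks.
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName in ("given", "family"):
        # Ask the parser to read ahead to the matching end-tag and
        # attach the element's children to this node.
        events.expandNode(node)
        fields[node.tagName] = "".join(
            child.data
            for child in node.childNodes
            if child.nodeType == node.TEXT_NODE
        )

print(fields)
```

The control flow lives in the application's own loop rather than in callbacks, which is the main reason pull code tends to read more naturally.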

In the Java world, BEA, Sun, and several individual developers have collaborated to create the Streaming API for XML (StAX). StAX and other pull parsers share SAX's streaming advantages, such as speed, parallelism, and memory efficiency, while offering an API that many developers find more comfortable. In essence, SAX and other push parsers are based on the Observer design pattern. StAX, XmlReader, and other pull parsers are based on the Iterator design pattern.

Another facility available to the XML programmer is document transformation. The Extensible Stylesheet Language Transformations (XSLT) language, covered in Chapter 8, is the most popular tool currently available for transforming XML to HTML, to other XML vocabularies, or to any other text format that an XSLT stylesheet can produce. In some cases, using a transformation to perform pre- or post-processing on XML data can reduce the complexity of a DOM or SAX application. For instance, XSLT could be used as a preprocessor for a screen-scraping application that starts from XHTML documents. The complex XHTML document could be transformed into a smaller, more accessible application-specific XML format that could then be read by a script.

Transformations may be used by themselves, in browsers, or at the command line, but many XSLT implementations and other transformation tools offer SAX or DOM interfaces, simplifying the task of using them to build document processing pipelines.

Developers who want to take advantage of XML's cross-platform benefits but have no patience for the details of markup can use various tools that rely on XML but don't require direct exposure to XML's structures. Web Services, mentioned in Chapter 16, can be seen as a move in this direction. You can still touch the XML directly if you need to, but toolkits make it easier to avoid doing so.

These kinds of applications are generally built as a layer on top of event- or tree-based processing, presenting their own API to the underlying information. We feel that in most cases, the underlying XML data is as clear and accessible as it can be. Additional layers of abstraction above the XML simply add to the overall complexity and rigidity of the application.