The Simple API for XML (SAX) is an event-based API for reading XML documents. Many different XML parsers implement the SAX API, including Xerces, Crimson, the Oracle XML Parser for Java, and Ælfred. SAX was originally defined as a Java API and is primarily intended for parsers written in Java. Therefore, this chapter focuses on the Java version of the API. However, SAX has been ported to most other major object-oriented languages, including C++, Python, Perl, and Eiffel. The translation from Java is usually fairly obvious.
The SAX API is unusual among XML APIs because it's an event-based push model rather than a tree-based pull model. As the XML parser reads an XML document, it sends the program information from the document in real time. Each time the parser sees a start-tag, an end-tag, character data, or a processing instruction, it tells your program. The document is presented to your program one piece at a time from beginning to end. You can either save the pieces you're interested in until the entire document has been read, or process the information as soon as you receive it. You do not have to wait for the entire document to be read before acting on the data at the beginning of the document. Most importantly, the entire document does not have to reside in memory. This feature makes SAX the API of choice for very large documents that do not fit into available memory.
This chapter covers SAX2 exclusively. In 2004, all major parsers that support SAX also support SAX2. The major change in SAX2 from SAX1 is the addition of namespace support, which necessitated changing the names and signatures of almost every method and class in SAX. The old SAX1 methods and classes are still available, but they're now deprecated, and you shouldn't use them.
SAX is primarily a collection of interfaces in the org.xml.sax package. One such interface is XMLReader. This interface represents the XML parser. It declares methods to parse a document and configure the parsing process, for instance, by turning validation on or off. To parse a document with SAX, first create an instance of XMLReader with the XMLReaderFactory class in the org.xml.sax.helpers package. This class has a static createXMLReader( ) factory method that produces the parser-specific implementation of the XMLReader interface. The Java system property org.xml.sax.driver specifies the concrete class to instantiate:
try {
  XMLReader parser = XMLReaderFactory.createXMLReader( );
  // parse the document...
}
catch (SAXException ex) {
  // couldn't create the XMLReader
}
The call to XMLReaderFactory.createXMLReader( ) is wrapped in a try-catch block that catches SAXException. This is the generic checked exception superclass for almost anything that can go wrong while parsing an XML document. In this case, it means either that the org.xml.sax.driver system property wasn't set, or that it was set to the name of a class that Java couldn't find in the class path.
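Configuring the parsing process, such as turning validation on or off, is done through the setFeature( ) method of XMLReader. The following is a minimal sketch rather than a definitive recipe. The class name ReaderConfiguration and the helper newReader( ) are hypothetical names chosen for illustration; the two feature URIs are the standard SAX2 names for validation and namespace processing. A parser that does not recognize or support a feature reports this with a SAXNotRecognizedException or SAXNotSupportedException, both of which are subclasses of SAXException.

```java
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class name, used only for this illustration
class ReaderConfiguration {

    // Standard SAX2 feature names
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Hypothetical helper: creates a reader and asks it to turn validation on or off
    static XMLReader newReader(boolean validate) throws SAXException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        try {
            parser.setFeature(VALIDATION, validate);
        } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
            // This particular parser cannot change the setting; keep its default
            System.err.println("Validation setting left unchanged: " + ex.getMessage());
        }
        return parser;
    }

    public static void main(String[] args) throws SAXException {
        XMLReader parser = newReader(false);
        System.out.println("validating: " + parser.getFeature(VALIDATION));
        System.out.println("namespace aware: " + parser.getFeature(NAMESPACES));
    }
}
```

Catching the two specific exceptions separately from the general SAXException lets a program continue with the parser's default behavior when a setting is merely unavailable.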
Do not use the SAXParserFactory and SAXParser classes included with JAXP. These classes were designed by Sun to fill a gap in SAX1. They are unnecessary and indeed actively harmful in SAX2. For instance, they are not namespace aware by default. SAX2 applications should use XMLReaderFactory and XMLReader instead.
You can choose which concrete class to instantiate by passing its name as a string to the createXMLReader( ) method. This code fragment instantiates the Xerces parser by name:
try {
  XMLReader parser = XMLReaderFactory.createXMLReader(
   "org.apache.xerces.parsers.SAXParser");
  // parse the document...
}
catch (SAXException ex) {
  // couldn't create the XMLReader
}
Now that you've created a parser, you're ready to parse some documents with it. Pass the system ID of the document you want to read to the parse( ) method. The system ID is either an absolute or a relative URL encoded in a string. For example, this code fragment parses the document at http://www.slashdot.org/slashdot.xml:
try {
  XMLReader parser = XMLReaderFactory.createXMLReader( );
  parser.parse("http://www.slashdot.org/slashdot.xml");
}
catch (SAXParseException ex) {
  // Well-formedness error
}
catch (SAXException ex) {
  // Could not find an XMLReader implementation class
}
catch (IOException ex) {
  // Some sort of I/O error prevented the document from being completely
  // downloaded from the server
}
The parse( ) method throws a SAXParseException if the document is malformed, an IOException if an I/O error such as a broken socket occurs while the document is being read, and a SAXException if anything else goes wrong. Otherwise, it returns normally. Because parse( ) returns void, it hands no data back to the caller. To receive information from the parser as it reads the document, you must configure it with a ContentHandler.
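The three outcomes of parse( ) can be seen in a small, self-contained sketch. To avoid depending on a network connection, it uses the parse(InputSource) overload of XMLReader to read from a string in memory instead of passing a system ID; that substitution, the class name WellFormednessChecker, and the helper check( ) are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class name, used only for this illustration
class WellFormednessChecker {

    // Hypothetical helper: parses the text and describes how the parse ended
    static String check(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.parse(new InputSource(new StringReader(xml)));
            return "well-formed";
        } catch (SAXParseException ex) {
            // Must be caught before SAXException, its superclass
            return "malformed at line " + ex.getLineNumber();
        } catch (SAXException ex) {
            return "SAX error: " + ex.getMessage();
        } catch (IOException ex) {
            return "I/O error: " + ex.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<greeting>Hello</greeting>"));
        System.out.println(check("<greeting>Hello</salutation>"));
    }
}
```

Since SAXParseException is a subclass of SAXException, the order of the catch blocks matters: the more specific exception has to come first.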
ContentHandler, shown in stripped-down form in Example 20-1, is an interface in the org.xml.sax package. You implement this interface in a class of your own devising. Next, you configure an XMLReader with an instance of your implementation. As the XMLReader reads the document, it invokes the methods in this object to tell your program what's in the XML document. You can respond to these method invocations in any way you see fit.
The org.xml.sax.ContentHandler interface has no relation to the moribund java.net.ContentHandler class. However, you may encounter a name conflict if you import both java.net.* and org.xml.sax.* in the same class. It's better to import just the java.net classes you actually need, rather than the entire package.
Example 20-1. The org.xml.sax.ContentHandler interface
package org.xml.sax;

public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument( ) throws SAXException;
  public void endDocument( ) throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix) throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[ ] text, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char[ ] text, int start, int length)
   throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name) throws SAXException;

}
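One way to get a feel for when each of these methods is invoked is to record the calls in order. The sketch below is only an illustration under a few assumptions: the class name EventTrace and the helper trace( ) are hypothetical, the input is read from an in-memory string through an InputSource, and the handler extends DefaultHandler, a convenience class in org.xml.sax.helpers that supplies empty implementations of the ContentHandler methods, so that only the callbacks of interest need to be written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class name, used only for this illustration
class EventTrace extends DefaultHandler {

    // One entry per callback, in the order the parser made the calls
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override public void startPrefixMapping(String prefix, String uri) {
        events.add("startPrefixMapping " + prefix + "=" + uri);
    }

    @Override public void endPrefixMapping(String prefix) {
        events.add("endPrefixMapping " + prefix);
    }

    @Override public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        events.add("startElement " + localName);
    }

    @Override public void endElement(String namespaceURI, String localName,
            String qualifiedName) {
        events.add("endElement " + localName);
    }

    @Override public void characters(char[] text, int start, int length) {
        events.add("characters \"" + new String(text, start, length) + "\"");
    }

    // Hypothetical helper: parses the text and returns the recorded calls
    static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<name:name xmlns:name=\"http://xml.oreilly.com/name\">"
                   + "<name:first>Sydney</name:first></name:name>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The recorded list begins with startDocument and ends with endDocument, and each prefix mapping is announced before the start of the element that declares it and closed after that element ends.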
Every time the XMLReader reads a piece of the document, it calls a method in its ContentHandler. Suppose a parser reads the simple document shown in Example 20-2.
Example 20-2. A simple XML document
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
The parser will call these methods in its ContentHandler with these arguments in this order. The values of the arguments passed to each method are given after each method name:
setDocumentLocator(Locator locator) locator: org.apache.xerces.readers.DefaultEntityHandler@1f953d
startDocument( )
processingInstruction(String target, String data) target: "xml-stylesheet" data: "type='text/css' href='person.css'"
startPrefixMapping(String prefix, String namespaceURI) prefix: "" namespaceURI: "http://xml.oreilly.com/person"
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) namespaceURI: "http://xml.oreilly.com/person" localName: "person" qualifiedName: "person" atts: {} (no attributes, an empty list)
ignorableWhitespace(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 181 length: 3
startPrefixMapping(String prefix, String uri) prefix: "name" uri: "http://xml.oreilly.com/name"
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) namespaceURI: "http://xml.oreilly.com/name" localName: "name" qualifiedName: "name:name" atts: {} (no attributes, an empty list)
ignorableWhitespace(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 236 length: 5
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) namespaceURI: "http://xml.oreilly.com/name" localName: "first" qualifiedName: "name:first" atts: {} (no attributes, an empty list)
characters(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 253 length: 6
endElement(String namespaceURI, String localName, String qualifiedName) namespaceURI: "http://xml.oreilly.com/name" localName: "first" qualifiedName: "name:first"
ignorableWhitespace(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 272 length: 5
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) namespaceURI: "http://xml.oreilly.com/name" localName: "last" qualifiedName: "name:last" atts: {} (no attributes, an empty list)
characters(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 288 length: 3
endElement(String namespaceURI, String localName, String qualifiedName) namespaceURI: "http://xml.oreilly.com/name" localName: "last" qualifiedName: "name:last"
ignorableWhitespace(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 303 length: 3
endElement(String namespaceURI, String localName, String qualifiedName) namespaceURI: "http://xml.oreilly.com/name" localName: "name" qualifiedName: "name:name"
endPrefixMapping(String prefix) prefix: "name"
ignorableWhitespace(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 318 length: 3
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) namespaceURI: "http://xml.oreilly.com/person" localName: "assignment" qualifiedName: "assignment" atts: {project_id="p2"}
endElement(String namespaceURI, String localName, String qualifiedName) namespaceURI: "http://xml.oreilly.com/person" localName: "assignment" qualifiedName: "assignment"
ignorableWhitespace(char[] text, int start, int length) text: <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet type='text/css' href='person.css'?> <!DOCTYPE person SYSTEM "person.dtd"> <person xmlns="http://xml.oreilly.com/person"> <name:name xmlns:name="http://xml.oreilly.com/name"> <name:first>Sydney</name:first> <name:last>Lee</name:last> </name:name> <assignment project_id="p2"/> </person> start: 350 length: 1
endElement(String namespaceURI, String localName, String qualifiedName) namespaceURI: "http://xml.oreilly.com/person" localName: "person" qualifiedName: "person"
endPrefixMapping(String prefix) prefix: ""
endDocument( )
Some details of this sequence are not deterministic. Note that the char array passed to each call to characters( ) and ignorableWhitespace( ) actually contains the entire document! The specific text block that the parser is really reporting is indicated by the last two arguments, start and length. This is an optimization that Xerces-J performs. Other parsers are free to pass different char arrays as long as they set the start and length arguments to match. Indeed, the parser is also free to split a long run of plain text across multiple calls to characters( ) or ignorableWhitespace( ), so you cannot assume that a single call delivers the longest possible contiguous run of plain text. Other details that may change from parser to parser include attribute order within a tag and whether a Locator object is provided by calling setDocumentLocator( ).
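Because the array may hold far more than the reported text, and because one run of text may arrive in several pieces, a handler normally copies only the range from start to start + length and appends it to a buffer until the enclosing element ends. The following sketch shows one way to do this. The class name TextCollector, the helper textOf( ), and the sample document are hypothetical; the handler extends DefaultHandler from org.xml.sax.helpers so that unused callbacks can be omitted, and it does not attempt to handle a target element nested inside another element of the same name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class name, used only for this illustration
class TextCollector extends DefaultHandler {

    private final String targetLocalName;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();
    private boolean inTarget = false;

    TextCollector(String targetLocalName) {
        this.targetLocalName = targetLocalName;
    }

    @Override public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        if (localName.equals(targetLocalName)) {
            buffer.setLength(0);   // start a fresh buffer for this element
            inTarget = true;
        }
    }

    @Override public void characters(char[] text, int start, int length) {
        // Copy only the reported range; the rest of the array is not ours to read
        if (inTarget) buffer.append(text, start, length);
    }

    @Override public void endElement(String namespaceURI, String localName,
            String qualifiedName) {
        if (localName.equals(targetLocalName)) {
            values.add(buffer.toString());   // the text is complete only now
            inTarget = false;
        }
    }

    // Hypothetical helper: returns the text of every element with the given local name
    static List<String> textOf(String xml, String localName) throws Exception {
        TextCollector handler = new TextCollector(localName);
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<name><first>Sydney</first><last>Lee &amp; Lee</last></name>";
        System.out.println(textOf(xml, "last"));   // prints [Lee & Lee]
    }
}
```

The entity reference in the sample text is a typical place for a parser to split the text across several characters( ) calls; buffering until endElement( ) makes the result independent of how the parser chooses to split it.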
Suppose you want to count the number of elements, attributes, processing instructions, and characters of plain text that exist in a given XML document. To do so, first write a class that implements the ContentHandler interface. The current count of each of the four items of interest is stored in a field. The field values are initialized to zero in the startDocument( ) method, which is called exactly once for each document parsed. Each callback method in the class increments the relevant field. The endDocument( ) method reports the total for that document. Example 20-3 is such a class.
Example 20-3. The XMLCounter ContentHandler
import org.xml.sax.*;

public class XMLCounter implements ContentHandler {

  private int numberOfElements;
  private int numberOfAttributes;
  private int numberOfProcessingInstructions;
  private int numberOfCharacters;

  public void startDocument( ) {
    numberOfElements = 0;
    numberOfAttributes = 0;
    numberOfProcessingInstructions = 0;
    numberOfCharacters = 0;
  }

  // We should count either the start-tag of the element or the end-tag,
  // but not both. Empty elements are reported by each of these methods.
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    numberOfElements++;
    numberOfAttributes += atts.getLength( );
  }

  public void endElement(String namespaceURI, String localName,
   String qualifiedName) { }

  public void characters(char[ ] text, int start, int length) {
    numberOfCharacters += length;
  }

  public void ignorableWhitespace(char[ ] text, int start, int length) {
    numberOfCharacters += length;
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    numberOfProcessingInstructions++;
  }

  // Now that the document is done, we can print out the final results
  public void endDocument( ) {
    System.out.println("Number of elements: " + numberOfElements);
    System.out.println("Number of attributes: " + numberOfAttributes);
    System.out.println("Number of processing instructions: "
     + numberOfProcessingInstructions);
    System.out.println("Number of characters of plain text: "
     + numberOfCharacters);
  }

  // Do-nothing methods we have to implement only to fulfill
  // the interface requirements:
  public void setDocumentLocator(Locator locator) { }
  public void startPrefixMapping(String prefix, String uri) { }
  public void endPrefixMapping(String prefix) { }
  public void skippedEntity(String name) { }

}
Because this class implements ContentHandler directly, it must implement every method the interface declares, including those it has no use for. However, if you really want to provide only one or two ContentHandler methods, you may want to subclass the DefaultHandler class in the org.xml.sax.helpers package instead. This adapter class implements all methods in the ContentHandler interface with do-nothing methods, so you only have to override methods you're genuinely interested in.
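As a rough sketch of the adapter approach, the following handler counts elements and nothing else, so it overrides only two methods. The class name ElementCounter, the helper countElements( ), and the use of an in-memory string as input are assumptions made for this illustration; the sample text is Example 20-2 without its processing instruction and document type declaration.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class name, used only for this illustration
class ElementCounter extends DefaultHandler {

    private int numberOfElements;

    // Reset the count at the start of each document
    @Override public void startDocument() {
        numberOfElements = 0;
    }

    // Count start-tags only, so each element is counted exactly once
    @Override public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        numberOfElements++;
    }

    int getNumberOfElements() {
        return numberOfElements;
    }

    // Hypothetical helper: parses the text and returns its element count
    static int countElements(String xml) throws Exception {
        ElementCounter counter = new ElementCounter();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(counter);
        parser.parse(new InputSource(new StringReader(xml)));
        return counter.getNumberOfElements();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<person xmlns=\"http://xml.oreilly.com/person\">"
                   + "<name:name xmlns:name=\"http://xml.oreilly.com/name\">"
                   + "<name:first>Sydney</name:first>"
                   + "<name:last>Lee</name:last>"
                   + "</name:name>"
                   + "<assignment project_id=\"p2\"/>"
                   + "</person>";
        System.out.println("Number of elements: " + countElements(xml));   // prints 5
    }
}
```

All the other callbacks fall through to the empty implementations inherited from DefaultHandler.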
Next, build an XMLReader, and configure it with an instance of this class. Finally, parse the documents you want to count, as in Example 20-4.
Example 20-4. The DocumentStatistics driver class
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class DocumentStatistics {

  public static void main(String[ ] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader( );
    }
    catch (SAXException e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException eex) {
        System.err.println("Couldn't locate a SAX parser");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println(
       "Usage: java DocumentStatistics URL1 URL2...");
      return;
    }

    // Install the Content Handler
    parser.setContentHandler(new XMLCounter( ));

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException ex) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(ex.getMessage( )
         + " at line " + ex.getLineNumber( )
         + ", column " + ex.getColumnNumber( ));
      }
      catch (SAXException ex) { // some other kind of error
        System.out.println(ex.getMessage( ));
      }
      catch (IOException ex) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + ex);
      }

    }

  }

}
Running the program in Example 20-4 across the document in Example 20-2 results in the following output:
D:\books\xian\examples\18>java DocumentStatistics 20-2.xml
Number of elements: 5
Number of attributes: 1
Number of processing instructions: 1
Number of characters of plain text: 29
The generic program in Example 20-4 works on any well-formed XML document. Most SAX programs are more specific and only work with certain XML applications. They look for particular elements or attributes in particular places and respond to them accordingly. They may rely on patterns that are enforced by a validating parser. Still, this example demonstrates the fundamentals of SAX.
The complicated part of most SAX programs is the data structure you build to store information returned by the parser until you're ready to use it. Sometimes, this information can be as complicated as the XML document itself, in which case you may be better off using DOM, which at least provides a ready-made data structure for an XML document. You usually want only some information, though, and the data structure you construct should be less complex than the document itself.
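As a small illustration of a data structure that is simpler than the document, the sketch below keeps only the project_id values of assignment elements in the http://xml.oreilly.com/person namespace and ignores everything else. The class name ProjectIdCollector, the helper collect( ), and the second assignment (p7) in the sample text are hypothetical; the element and attribute names come from Example 20-2. The handler extends DefaultHandler from org.xml.sax.helpers and reads its input from an in-memory string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class name, used only for this illustration
class ProjectIdCollector extends DefaultHandler {

    private static final String PERSON_NS = "http://xml.oreilly.com/person";

    // The whole "data structure": a flat list of IDs in document order
    private final List<String> projectIds = new ArrayList<>();

    @Override public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        // Match on namespace URI and local name, never on the prefix
        if (PERSON_NS.equals(namespaceURI) && "assignment".equals(localName)) {
            // Unprefixed attributes are in no namespace, hence the empty URI
            String id = atts.getValue("", "project_id");
            if (id != null) projectIds.add(id);
        }
    }

    // Hypothetical helper: parses the text and returns the collected IDs
    static List<String> collect(String xml) throws Exception {
        ProjectIdCollector handler = new ProjectIdCollector();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.projectIds;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<person xmlns=\"http://xml.oreilly.com/person\">"
                   + "<assignment project_id=\"p2\"/>"
                   + "<assignment project_id=\"p7\"/>"   // p7 is made up for this sketch
                   + "</person>";
        System.out.println(collect(xml));   // prints [p2, p7]
    }
}
```

Nothing about the names, the nesting, or the whitespace of the document is retained, which is what keeps the memory footprint small no matter how large the input is.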