Although it is possible to access the data from the original XML
document using only the Node
interface, the DOM Core provides a number of specific node-type
interfaces that simplify common programming tasks. These specific node
types fall into two broad categories: structural nodes and
content nodes.
Within an XML document, a number of syntax structures exist that are not formally part of the content. The following interfaces provide access to the portions of the document that are not related to element data.
The DocumentType
interface provides access to the XML document type
definition's notations, entities, internal subset, public ID, and
system ID. Since a document can have only one DOCTYPE
declaration, only one DocumentType
node can exist for a given
document. It is accessed via the doctype
attribute of the Document
interface. The definition of
the DocumentType
interface is
shown in Table
19-6.
Using additional fields available since DOM Level 2, it is
now possible to fully reconstruct a parsed document using only the
information provided within the DOM framework. No programmatic way
to modify DocumentType
node
contents currently exists.
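As a rough illustration of this read-only access, the following sketch uses Python's standard xml.dom.minidom module. That module is only one possible DOM implementation and does not cover every DOM Level 3 feature; the catalog document, its DTD file name, and the declarations in its internal subset are invented for the example.

```python
from xml.dom import minidom

# Hypothetical document with an internal DTD subset. The external
# catalog.dtd is only named here; this sketch does not load it.
SOURCE = """<?xml version="1.0"?>
<!DOCTYPE catalog SYSTEM "catalog.dtd" [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>
]>
<catalog/>"""

doc = minidom.parseString(SOURCE)
doctype = doc.doctype            # the single DocumentType node

print("name:     ", doctype.name)
print("public ID:", doctype.publicId)
print("system ID:", doctype.systemId)
print("entities: ", len(doctype.entities))
print("notations:", len(doctype.notations))
print("internal subset:")
print(doctype.internalSubset)
```

Everything printed here comes from attributes of the DocumentType node; nothing in the sketch attempts to change them.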
The ProcessingInstruction
node type provides direct access to a processing
instruction's contents. Though processing instructions appear in
the document's text, they may also appear before or after the root
element, as well as in DTDs. Table 19-7 describes the
ProcessingInstruction
node's
attributes.
Remember that the only syntactically defined part is the
target name, which is an XML name token. The remaining data (up to
the terminating ?>) is
free-form. See Chapter 18
for more information about uses (and potential misuses) of XML
processing instructions.
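The split between target and free-form data can be seen in a small sketch, again assuming Python's xml.dom.minidom as the DOM implementation. The processing instructions and the collect_pis helper are made up for illustration; the sample places one instruction before the root element, one inside it, and one after it.

```python
from xml.dom import minidom
from xml.dom import Node

SOURCE = (
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<doc><?app-hint cache=no?>text</doc>"
    "<?trailer done?>"
)

def collect_pis(node):
    """Return every ProcessingInstruction below node, in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append(child)
        found.extend(collect_pis(child))
    return found

doc = minidom.parseString(SOURCE)
pis = collect_pis(doc)
for pi in pis:
    # target is the XML name; data is the uninterpreted remainder
    print(pi.target, "->", pi.data)
```

Note that the pseudo-attributes in the xml-stylesheet instruction arrive as one undivided string; any further parsing of the data is up to the application.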
XML notations formally declare the format for external
unparsed entities and processing instruction targets. The list of
all available notations is stored in a NamedNodeMap
within the document's
DOCTYPE
node, which is accessed
from the Document
interface.
The definition of the Notation
interface is shown in Table
19-8.
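A minimal sketch of notation lookup, assuming Python's xml.dom.minidom and a hypothetical DTD with two notation declarations, might look like this:

```python
from xml.dom import minidom

# Both notation declarations are invented for this example.
SOURCE = """<!DOCTYPE catalog [
  <!NOTATION gif SYSTEM "image/gif">
  <!NOTATION jpeg PUBLIC "-//Example//NOTATION JPEG//EN">
]>
<catalog/>"""

doc = minidom.parseString(SOURCE)
notations = doc.doctype.notations   # NamedNodeMap-style collection

# Look up one notation by name.
gif = notations.getNamedItem("gif")
print(gif.nodeName, gif.publicId, gif.systemId)

# Or enumerate all of them by position.
for index in range(len(notations)):
    notation = notations.item(index)
    print(notation.nodeName, notation.publicId, notation.systemId)
```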
The name of the Entity
interface is somewhat ambiguous,
but its meaning becomes clear when it is connected with the
EntityReference
interface, which is also part of the DOM Core. The
Entity
interface provides
access to the entity declaration's notation name, public ID, and
system ID. Parsed entity nodes have childNodes, while unparsed entities have
a notationName. The definition
of this interface is shown in Table 19-9.
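The difference between parsed and unparsed entities can be sketched as follows, assuming Python's xml.dom.minidom and two hypothetical entity declarations (one internal parsed entity, one unparsed entity):

```python
from xml.dom import minidom

# Entity names and values are invented for this example.
SOURCE = """<!DOCTYPE catalog [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY publisher "Example Press">
  <!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>
]>
<catalog/>"""

doc = minidom.parseString(SOURCE)
entities = doc.doctype.entities

parsed = entities.getNamedItem("publisher")
unparsed = entities.getNamedItem("bookcase_pic")

# Internal parsed entity: the replacement text appears as child nodes.
print([child.data for child in parsed.childNodes])
print(parsed.notationName)

# Unparsed entity: no children, but a notation name and a system ID.
print(len(unparsed.childNodes), unparsed.notationName, unparsed.systemId)
```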
DOM Level 3 introduces three new attributes that apply to
external parsed entities: inputEncoding, xmlEncoding, and xmlVersion.
This additional information makes it possible to properly enforce XML
well-formedness constraints for external parsed entities based on
the value of the xmlVersion attribute. The two encoding-related
attributes make it possible to precisely reconstruct external parsed
entity files from their DOM tree representation.
All members of this interface are read-only and cannot be modified at runtime.
The actual data conveyed by an XML document is contained completely within the document element. The following node types map directly to the XML document's nonstructural parts, such as character data, elements, and attribute values.
Each parsed document causes the creation of a single
Document
node in memory. (Empty
Document
nodes can be created
through the DOMImplementation
interface.) This interface provides access to the document type
information and the single, top-level Element
node that contains the entire
body of the parsed document (the documentElement). It also provides
access to the class factory methods that allow an application to
create new content nodes that were not created by parsing a
document. Table
19-10 shows all attributes and methods of the Document
interface.
Table 19-10. The Document interface, derived from Node
The various create...( )
methods are important for applications that wish to modify the
structure of a document that was previously parsed. Note that
nodes created using one Document
instance may only be inserted
into the document tree belonging to the Document
that created them. DOM Level 2
provided a new importNode( )
method that allows a node, and possibly its children, to be
essentially copied from one document to another. DOM Level 3
introduced the adoptNode( )
method that actually moves an entire node subtree from one
document to another.
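The factory methods and importNode( ) can be sketched with Python's xml.dom.minidom, used here as an assumed DOM implementation; the element names are invented. The sketch does not use adoptNode( ).

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

# Two independent, initially empty documents (element names are made up).
source_doc = impl.createDocument(None, "library", None)
target_doc = impl.createDocument(None, "archive", None)

# Build <book><title>DOM Basics</title></book> with the factory methods.
book = source_doc.createElement("book")
title = source_doc.createElement("title")
title.appendChild(source_doc.createTextNode("DOM Basics"))
book.appendChild(title)
source_doc.documentElement.appendChild(book)

# importNode() returns a deep copy owned by target_doc; the original
# subtree stays where it was in source_doc.
copy = target_doc.importNode(book, True)
target_doc.documentElement.appendChild(copy)

print(source_doc.documentElement.toxml())
print(target_doc.documentElement.toxml())
```

Because importNode( ) copies rather than moves, both documents end up containing their own book subtree.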
Besides the various node-creation methods, some methods can
locate specific XML elements or lists of elements. The methods
getElementsByTagName( ) and getElementsByTagNameNS( ) return
a list of all XML elements with the specified name and, for the
NS variant, the specified namespace. The getElementById( )
method returns the single element whose ID-typed attribute has the
given value.
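A short sketch of the three lookup methods follows, assuming Python's xml.dom.minidom. The shelf document and its namespace URI are hypothetical. Because the sample has no DTD declaring an ID-typed attribute, the sketch marks the id attributes as IDs by hand with the Level 3 setIdAttribute( ) method, which is an assumption of this example rather than something the lookup requires in general.

```python
from xml.dom import minidom

SOURCE = """<shelf xmlns:m="urn:example:media">
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
  <m:disc id="d1"><title>Third</title></m:disc>
</shelf>"""

doc = minidom.parseString(SOURCE)

books = doc.getElementsByTagName("book")
titles = doc.getElementsByTagName("title")
discs = doc.getElementsByTagNameNS("urn:example:media", "disc")
print(len(books), len(titles), len(discs))

# Assumption: no DTD declares an ID-typed attribute here, so each "id"
# attribute is flagged as an ID by hand before calling getElementById().
for element in list(books) + list(discs):
    element.setIdAttribute("id")

match = doc.getElementById("b2")
print(match.toxml())
```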
DOM Level 3 also introduced several attributes that are
useful when an application wishes to reconstruct an XML document
to its original, pre-parsing format. The inputEncoding, xmlEncoding,
and xmlStandalone attributes preserve
information about the values of the XML declaration from the
original document as well as the character encoding of the
document before it was parsed (and converted to Unicode).
One of the major additions to DOM in Level 3 was the
inclusion of document validation support within the DOM tree
itself. The normalizeDocument( ) method provides the developer with a mechanism for
essentially "re-parsing" the XML document from the DOM tree in
memory. Various parameters available through the domConfig
attribute control how this
normalization will occur. It is also possible to change the target
version of XML by modifying the xmlVersion
attribute before
normalization. This will cause the DOM to enforce the XML name
construction rules associated with the selected XML version. See
Chapter 21 for more
information about the differences between XML Versions 1.0 and
1.1.
Applications that allow real-time editing of XML documents
sometimes need to temporarily park document nodes outside the
hierarchy of the parsed document. A visual editor that wants to
provide clipboard functionality is one example. When the time
comes to implement the cut function, it is possible to move the
cut nodes temporarily to a DocumentFragment
node without deleting
them, rather than having to leave them in place within the live
document. Then, when they need to be pasted back into the
document, they can be reinserted using a method such as Node.appendChild( ).
The DocumentFragment interface, derived from Node, has no interface-specific
attributes or methods.
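The cut-and-paste scenario can be sketched as follows, assuming Python's xml.dom.minidom; the list document and the clipboard variable are invented for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<list><item>a</item><item>b</item><item>c</item></list>"
)
root = doc.documentElement

# "Cut": move the middle item into a fragment that acts as a clipboard.
clipboard = doc.createDocumentFragment()
cut_item = root.childNodes[1]
clipboard.appendChild(cut_item)     # appendChild detaches it from <list>
after_cut = root.toxml()

# "Paste": appending the fragment moves its children, not the fragment.
root.appendChild(clipboard)
after_paste = root.toxml()

print(after_cut)
print(after_paste)
print(len(clipboard.childNodes))
```

After the paste, the fragment is empty again and can be reused for the next cut operation.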
Element nodes are the most frequently encountered node type
in a typical XML document. These nodes are parents for the
Text, Comment, EntityReference, ProcessingInstruction, CDATASection,
and child Element nodes that comprise the
document's body. They also allow access to the Attr
objects that contain the element's
attributes. Table
19-11 shows all attributes and methods supported by the
Element interface.
Table 19-11. The Element interface, derived from Node
Since XML attributes may contain either text values or
entity references, the DOM stores element attribute values as
Node
subtrees. The following
XML fragment shows an element with two attributes:
<!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>
<!ELEMENT picture EMPTY>
<!ATTLIST picture
  src ENTITY #REQUIRED
  alt CDATA  #IMPLIED>
. . .
<picture src="bookcase_pic" alt="3/4 view of bookcase"/>
The first attribute contains a reference to an unparsed
entity; the second contains a simple string. Since the DOM
framework stores element attributes as instances of the Attr
interface, a few parsers make the
contents of attributes available as actual subtrees of Node
objects. In this example, the
src
attribute would contain an
EntityReference
object
instance. Note that the nodeValue
of the Attr
node gives the flattened text value
from the Attr
node's children.
Table 19-12 shows
the attributes and methods supported by the Attr
interface.
Besides the attribute name and value, the Attr
interface exposes the specified
flag that indicates whether
this particular attribute instance was included explicitly in the
XML document or supplied as a default value by the <!ATTLIST>
declaration of the DTD. There
is also a back pointer (ownerElement) to the Element
node that owns this attribute
object.
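The picture element from the earlier fragment can be used to sketch Attr access, assuming Python's xml.dom.minidom. The DOCTYPE wrapper and the NOTATION declaration are additions made for this example so that the fragment parses as a complete document. The sketch leaves the specified flag aside and shows only the name, the value, and the back pointer.

```python
from xml.dom import minidom

# The DOCTYPE wrapper and NOTATION declaration are assumptions added
# so the fragment forms a complete, parseable document.
SOURCE = """<!DOCTYPE picture [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>
  <!ELEMENT picture EMPTY>
  <!ATTLIST picture src ENTITY #REQUIRED
                    alt CDATA  #IMPLIED>
]>
<picture src="bookcase_pic" alt="3/4 view of bookcase"/>"""

doc = minidom.parseString(SOURCE)
picture = doc.documentElement

alt = picture.getAttributeNode("alt")       # an Attr node, not a string
print(alt.name, "=", alt.nodeValue)
print(alt.ownerElement is picture)          # back pointer to the element

# The src value names an unparsed entity; resolve it through the DOCTYPE.
entity = doc.doctype.entities.getNamedItem(picture.getAttribute("src"))
print(entity.systemId, entity.notationName)
```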
Several types of data within a DOM node tree represent
blocks of character data that do not include markup. CharacterData
is an abstract interface
that supports common text-manipulation methods, which are used by
the concrete interfaces Comment, Text, and CDATASection.
Table 19-13 shows the
attributes and methods supported by the CharacterData
interface.
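The common text-manipulation methods can be sketched on a Text node and a Comment node, assuming Python's xml.dom.minidom; the note document and the strings are arbitrary.

```python
from xml.dom import minidom

doc = minidom.parseString("<note/>")
text = doc.createTextNode("Hello world")
doc.documentElement.appendChild(text)

text.appendData("!")             # Hello world!
text.insertData(5, ",")          # Hello, world!
text.replaceData(7, 5, "DOM")    # Hello, DOM!
text.deleteData(5, 1)            # Hello DOM!

print(text.data, text.length)
print(text.substringData(0, 5))

# Comment nodes share the same methods through CharacterData.
comment = doc.createComment("draft")
comment.appendData(" 2")
print(comment.data)
```

All offsets are zero-based positions within the node's data.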
DOM parsers are not required to make the contents of XML
comments available after parsing, and relying on comment data in
your application is poor programming practice at best. If your
application requires access to metadata that should not be part of
the basic XML document, consider using processing instructions
instead. The Comment interface,
derived from CharacterData, has
no interface-specific attributes or methods, only those it
inherits from its superinterfaces.
If an XML document contains references to general entities
within the body of its elements, the DOM-compliant parser may pass
these references along as EntityReference
nodes. This behavior is
not guaranteed because the parser is free to replace any entity or
character reference with the actual Unicode character
sequence it represents. The EntityReference interface, derived from
Node, has no interface-specific
attributes or methods.
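A parser that expands references instead of preserving them can be observed in a small sketch. Python's xml.dom.minidom, assumed here, behaves this way for the sample below; other DOM implementations may keep EntityReference nodes. The publisher entity and the paragraph text are invented.

```python
from xml.dom import minidom
from xml.dom import Node

SOURCE = """<!DOCTYPE p [
  <!ENTITY publisher "Example Press">
]>
<p>Published by &publisher; &#169; 2024</p>"""

doc = minidom.parseString(SOURCE)
paragraph = doc.documentElement

node_types = [child.nodeType for child in paragraph.childNodes]
has_entity_refs = Node.ENTITY_REFERENCE_NODE in node_types

# Join whatever Text children the parser produced into one string.
content = "".join(
    child.data for child in paragraph.childNodes
    if child.nodeType == Node.TEXT_NODE
)
print(has_entity_refs)
print(content)
```

Code that must work across parsers should therefore be prepared for either representation.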
The character data of an XML document is stored within
Text nodes. Text nodes are children of either Element or Attr nodes.
After parsing, every contiguous block of character data from the
original XML document is translated directly into a single Text node.
Once the document has been parsed, however, the client application may
insert, delete, and split Text nodes, so adjacent Text nodes may end up
side by side within the document tree. Table 19-14 describes the
Text interface.
The splitText
method
provides a way to split a single Text
node into two nodes at a given
point. This split would be useful if an editing application wished
to insert additional markup nodes into an existing island of
character data. After the split, it is possible to insert
additional nodes into the resulting gap.
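The split-and-insert technique can be sketched as follows, assuming Python's xml.dom.minidom. The paragraph text and the em element are illustrative only.

```python
from xml.dom import minidom

doc = minidom.parseString("<p>XML is fun</p>")
paragraph = doc.documentElement
text = paragraph.firstChild              # Text node "XML is fun"

# Split after "XML is "; the new node holding "fun" becomes the
# next sibling of the original Text node.
tail = text.splitText(7)

# Insert an <em> element into the gap and move "fun" inside it.
emphasis = doc.createElement("em")
paragraph.insertBefore(emphasis, tail)
emphasis.appendChild(tail)

print(paragraph.toxml())
```

The original Text node keeps the characters before the split offset, and the returned node holds the rest.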
Another useful addition (introduced in Level 3) is the
wholeText
attribute. This
attribute returns all of the text contained in the selected
Text
node, as well as any
adjacent Text
nodes, in
document order. Prior to Level 3, it was necessary to enumerate
all children of a given node and concatenate them manually to get
the entire text contained within a node.
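The two approaches can be compared in a brief sketch, assuming Python's xml.dom.minidom, which provides the wholeText attribute. The three text pieces are arbitrary.

```python
from xml.dom import minidom

doc = minidom.parseString("<p/>")
paragraph = doc.documentElement

# Adjacent Text nodes, as an editing application might leave them.
for piece in ("Hello", ", ", "world"):
    paragraph.appendChild(doc.createTextNode(piece))

# wholeText gathers the node's own data plus that of adjacent Text nodes.
middle = paragraph.childNodes[1]
whole = middle.wholeText

# The pre-Level 3 approach: walk the children and concatenate by hand.
manual = "".join(child.data for child in paragraph.childNodes)

print(len(paragraph.childNodes), repr(whole), repr(manual))
```

The manual version works here only because every child is a Text node; with mixed content it would need a node-type check, which wholeText makes unnecessary for runs of adjacent text.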