Chapter 11. Tagging and Structure

Structured PDF

As you’ve seen in all the previous chapters, PDF provides the ability to draw text, vectors, raster images, and even video and 3D onto a page that can be displayed or printed. However, the content is just that: a series of drawing instructions. It has no semantic or structural context. There is nothing that delineates one paragraph from another or one image from another. In fact, there isn’t even a concept of a paragraph or a word—just a bunch of glyphs and their associated encoding.

This limitation is addressed by a feature of PDF called logical structure. It enables associating a hierarchical grouping of objects, called structure elements, with the various graphic objects on the page and any additional attributes needed to sufficiently describe those objects. This is quite similar in concept to markup languages such as HTML or XML, but in PDF the structure and the content live in separate logical areas of the file rather than being intermixed (as they are in HTML, for example). This separation allows the ordering and nesting of logical elements to be entirely independent of the order and location of graphic objects on the document’s pages.

While there is a series of predefined types of structure elements that enable the organization of a document into chapters and sections or the identification of special elements such as figures, tables, and footnotes, the facilities provided by PDF are quite extensible. This extensibility allows writers to choose what structural information to include and how to represent it, while enabling processors to navigate the file without knowing the specific structural conventions.

As previously mentioned, the structural elements are arranged in a hierarchical structure called the StructTree, or structure tree. At the root of the tree is the structure tree root, a dictionary whose Type key has a value of StructTreeRoot (see Example 11-1). There are two other things that are required to be present in the root: the first of the children in the tree and a grouping of structure elements by page (see Figure 11-1 and its result, Figure 11-2).

The K key in the root points to the first structural element in the structure tree. Its value can either be a single structure element dictionary or an array of structure element dictionaries. Most tagged PDFs will have a single entry, which is a structure element of type Document.

The ParentTree key is a number tree that groups all structural elements on a page together with an associated number/index. While it is more logical to have the ordinal page number represent the number/index in the number tree, that is not required, as we will see when we learn how to associate structure with a page (in Associating Structure to Content).
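To make the shape of the root concrete, here is a minimal sketch that renders a StructTreeRoot dictionary as PDF syntax. The object numbers (10, 11, 12), the helper names `pdf_ref` and `struct_tree_root`, and the in-memory form of the parent tree are all assumptions made for illustration; a real writer would take them from its own object table.

```python
def pdf_ref(obj_num, gen=0):
    """Format an indirect reference such as '12 0 R'."""
    return f"{obj_num} {gen} R"


def struct_tree_root(first_kid, parent_tree):
    """Render a StructTreeRoot dictionary as PDF syntax.

    first_kid   -- object number of the top-level structure element (hypothetical)
    parent_tree -- {index: [object numbers of structure elements]} (hypothetical)
    """
    # A number tree keeps its entries in a flat /Nums array of
    # key/value pairs, sorted by key.
    nums = []
    for index in sorted(parent_tree):
        elems = " ".join(pdf_ref(n) for n in parent_tree[index])
        nums.append(f"{index} [{elems}]")
    return (
        "<< /Type /StructTreeRoot\n"
        f"   /K {pdf_ref(first_kid)}\n"
        f"   /ParentTree << /Nums [{' '.join(nums)}] >>\n"
        ">>"
    )


# One Document element (object 10) and two elements grouped under index 0.
root = struct_tree_root(first_kid=10, parent_tree={0: [11, 12]})
print(root)
```

The index 0 used here happens to match the first page, but as noted above nothing requires the index to be the page's ordinal number.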

Each structural element is represented by a dictionary whose Type key has a value of StructElem. The specific type of structural element that it represents is specified as the value of the S key. That value is a name object and can be anything, though it is recommended to stick to the values discussed in Standard structure types.

The P key in the structure element dictionary has as its value the parent element in the tree, so that it is possible for a processor to navigate the tree in all directions. In the case of the first child, the parent will be the StructTreeRoot.

As with the StructTreeRoot, the children of each element can be found as the value of the K key. The value of K can be a structure element, an array of structure elements, or an integer that represents the marked content ID (MCID) on the target page for the content. In addition, it is possible to have a reference to an annotation or an XObject if you are referring to the entire object as the content of that particular structure element.
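The two-way navigation that the P and K keys allow can be sketched with a small in-memory model. The `StructElem` class below is hypothetical and covers only element and MCID children (not annotation or XObject references); a parent of `None` stands in for the StructTreeRoot.

```python
class StructElem:
    """A toy structure element holding the /S, /P, and /K values."""

    def __init__(self, s):
        self.s = s          # structure type, the /S value
        self.parent = None  # the /P value; None stands in for the root
        self.kids = []      # the /K value: StructElem objects or integer MCIDs

    def add(self, kid):
        """Append a child element or an integer MCID, setting /P on elements."""
        if isinstance(kid, StructElem):
            kid.parent = self
        self.kids.append(kid)
        return kid

    def path(self):
        """Walk /P links upward and return the types from the top down."""
        node, types = self, []
        while node is not None:
            types.append(node.s)
            node = node.parent
        return "/".join(reversed(types))

    def mcids(self):
        """Walk /K links downward and collect every MCID in tree order."""
        found = []
        for kid in self.kids:
            if isinstance(kid, int):
                found.append(kid)
            else:
                found.extend(kid.mcids())
        return found


doc = StructElem("Document")
para = doc.add(StructElem("P"))
para.add(0)   # marked content with MCID 0
fig = doc.add(StructElem("Figure"))
fig.add(1)    # marked content with MCID 1

print(fig.path())   # Document/Figure
print(doc.mcids())  # [0, 1]
```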

Although it’s not required, it is common to have a Pg key present in the structure element’s dictionary whose value is the page dictionary where the content representing the element is displayed.

One other common key in the structure element’s dictionary is the Lang key, which identifies the natural language applicable to a given structure element (and to its children, unless they override it). The value of this key is a language identifier as defined in RFC 3066, such as en-US. Example 11-2 demonstrates a few sample structure elements.
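Because Lang applies to an element's children unless overridden, a processor has to walk the P links to find the language in effect. The following sketch assumes structure elements have already been parsed into plain Python dictionaries; that representation, the `effective_lang` name, and the `default` fallback are illustrative choices rather than anything PDF prescribes.

```python
def effective_lang(elem, default=None):
    """Return the nearest /Lang found by walking /P links toward the root."""
    node = elem
    while node is not None:
        if "Lang" in node:
            return node["Lang"]
        node = node.get("P")
    return default


# A hypothetical tree: an English document containing a French quotation.
document = {"S": "Document", "Lang": "en-US", "P": None}
paragraph = {"S": "P", "P": document}
quote = {"S": "Quote", "Lang": "fr-FR", "P": paragraph}
span = {"S": "Span", "P": quote}

print(effective_lang(paragraph))  # en-US
print(effective_lang(span))       # fr-FR
```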

A block-level structure element (BLSE) is any region of text or other content that is laid out in the block progression direction, such as a paragraph, heading, list item, or footnote. Table 11-1 lists some of these types of content and their related structure elements.

Table 11-1. BLSEs and related structure elements

| Structure type | Description |
| --- | --- |
| H | (Heading) A label for a subdivision of a document’s content. |
| H1–H6 | Headings with specific levels. |
| P | (Paragraph) A low-level division of text. |
| L | (List) A sequence of items of like meaning and importance. Its immediate children will be list items (LI). |
| LI | (List item) An individual member of a list. |
| Lbl | (Label) A name or number that distinguishes a given item from others in the same list or other group of like items. For example, in a dictionary list, it contains the term being defined; in a bulleted or numbered list, it contains the bullet character or the number of the list item and any associated punctuation. |
| LBody | (List body) The descriptive content of a list item. For example, in a dictionary list, it contains the definition of the term. |
| Table | (Table) A two-dimensional layout of rectangular data cells, possibly having a complex substructure. It contains either one or more table rows (TR) or an optional table head (THead) followed by one or more table body elements (TBody) and an optional table footer (TFoot). |
| TR | (Table row) A row of headings or data in a table. |
| TH | (Table header cell) A table cell containing header text describing one or more rows or columns of the table. |
| TD | (Table data cell) A table cell containing data that is part of the table’s content. |
| THead | (Table header row group) A group of rows that constitute the header of a table. |
| TBody | (Table body row group) A group of rows that constitute the main body portion of a table. |
| TFoot | (Table footer row group) A group of rows that constitute the footer of a table. |

All other standard structure types will either be treated as inline-level structure elements (ILSEs) or appear as artifacts (see Artifacts).
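The nesting described in Table 11-1 lends itself to a mechanical check. The sketch below is a deliberately simplified reading of that table, not the full rule set of the PDF specification: the `ALLOWED_KIDS` map, the `nesting_errors` name, and the `(type, children)` tuple form are all assumptions made for this example.

```python
# Which child types each container accepts, as read from Table 11-1.
ALLOWED_KIDS = {
    "L": {"LI"},
    "LI": {"Lbl", "LBody"},
    "Table": {"TR", "THead", "TBody", "TFoot"},
    "THead": {"TR"},
    "TBody": {"TR"},
    "TFoot": {"TR"},
    "TR": {"TH", "TD"},
}


def nesting_errors(tree, path=""):
    """Yield 'parent > child' strings for children the map does not allow.

    tree is a (structure_type, [children]) tuple, a hypothetical in-memory form.
    Types missing from ALLOWED_KIDS are treated as unrestricted.
    """
    s_type, kids = tree
    here = f"{path}/{s_type}"
    allowed = ALLOWED_KIDS.get(s_type)
    for kid in kids:
        if allowed is not None and kid[0] not in allowed:
            yield f"{here} > {kid[0]}"
        yield from nesting_errors(kid, here)


good = ("Table", [("TBody", [("TR", [("TH", []), ("TD", [])])])])
bad = ("L", [("LI", [("Lbl", []), ("LBody", [])]), ("P", [])])

print(list(nesting_errors(good)))  # []
print(list(nesting_errors(bad)))   # ['/L > P']
```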

Identifying which graphics operators in a content stream are associated with a specific structure element is done by enclosing those operators in a pair of marked content operators, specifically BDC and EMC, together with an associated property list. A simple example is presented in Example 11-5.

This content refers to the structure elements from Example 11-2. Those elements are numbered 0 and 1, and the MCID keys in the property lists reference them by those numbers.

Note

Although the name used in this example for the tag around the image is Figure, it could have been Foo or any other string. It is the value of the S key in the structure element dictionary that actually determines the structure type. Using the same name is a very good idea and is highly recommended!
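As a rough illustration of the BDC/EMC wrapping described above, the following sketch generates a marked content sequence for each piece of page content. The `marked` helper, the font name /F1, the image name /Im1, and the coordinates are all hypothetical; only the BDC and EMC operators and the MCID property come from the text.

```python
def marked(tag, mcid, operators):
    """Wrap content stream operators in a BDC/EMC pair carrying an MCID."""
    return f"/{tag} <</MCID {mcid}>> BDC\n{operators}\nEMC"


# Hypothetical page content: one line of text and one image XObject.
text_ops = "BT /F1 12 Tf 72 720 Td (Hello, structure) Tj ET"
image_ops = "q 200 0 0 150 72 540 cm /Im1 Do Q"

content = "\n".join([
    marked("P", 0, text_ops),        # matches the element whose K is 0
    marked("Figure", 1, image_ops),  # matches the element whose K is 1
])
print(content)
```

The tags passed to `marked` mirror the S values of the corresponding structure elements, following the recommendation in the note above, even though the tag itself does not determine the structure type.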

Although applying structure to the graphics operators in the page’s content stream is the most common approach, it is also possible to apply structure inside other types of content streams, such as the one associated with a form XObject. In most cases, the entire form XObject represents a complete structure element and you can just enclose the Do operator inside the marked content, as in the preceding example. However, it is also possible to apply the same type of marked content operators to individual graphics operators inside the XObject’s content stream.

Although adding structure to a PDF can be quite useful, there are additional rules that can be applied during the writing of the PDF content to enable an even richer set of semantics in the final PDF. When these rules are applied, the PDF is called a tagged PDF.

A tagged PDF document conforms to the following rules:

A tagged PDF document will also contain a mark information dictionary with a value of true for the Marked key. The mark information dictionary is the value of the MarkInfo key in the document catalog dictionary.
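Checking for the Marked flag is straightforward once the catalog has been parsed. The sketch below assumes the catalog is available as a nested Python dictionary, as some hypothetical parser might return it; the `is_tagged` name is likewise an illustrative choice. Note that it tests only the flag and does not verify that the document follows the tagged PDF rules.

```python
def is_tagged(catalog):
    """Return True if the catalog's MarkInfo dictionary sets Marked to true."""
    mark_info = catalog.get("MarkInfo", {})
    return mark_info.get("Marked", False) is True


tagged = {"Type": "Catalog", "MarkInfo": {"Marked": True}}
untagged = {"Type": "Catalog"}

print(is_tagged(tagged))    # True
print(is_tagged(untagged))  # False
```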

In this chapter, you learned about how to add semantic richness to your PDF content through tagging and structure. Next you will see how to incorporate metadata into a PDF at the document as well as the object level.