Chapter 8. Embedded Files

This chapter explains how a PDF can be used as a container for other files, much as a ZIP file can, while still providing rich page content to accompany them.

In most cases, files in other formats (such as .docx or .xlsx) will be converted into PDF for distribution. However, sometimes it can be useful to have the original file as well. If the two are distributed as separate files, there is a good chance that they will become disconnected, so having a way to embed or attach the original inside of the PDF is a useful capability. Additionally, you might choose to embed other files related to the PDF that aren’t the actual content, such as XML data.

For these reasons and more, PDF supports the ability to embed other files inside of itself and then have them presented in the UI of the PDF viewer.

At the heart of embedding files is the file specification dictionary. This dictionary actually supports both embedded and referenced files, but we will focus strictly on the embedded form (see Figure 8-1). In order to ensure that the dictionary can be identified, it must contain a Type key whose value is Filespec. Additionally, there must be three other keys present in the dictionary: F, UF, and EF (see Example 8-1 for a sample).

The F key contains the name of the file as a file specification string (ISO 32000-1:2008, 7.11.2), whose encoding is the “standard encoding for the platform on which the document is being viewed.” For most modern operating systems that is UTF-8, but it isn’t required to be. The UF key, by contrast, contains the same name encoded as standard 16-bit Unicode, so it does not depend on the viewing platform. The EF key refers to the embedded file dictionary, a simple dictionary with a single key, F, whose value is an embedded file stream. That stream holds the actual data for the embedded file, along with some additional metadata about it.
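To make the shape of a file specification dictionary concrete, here is a minimal Python sketch that writes one out as PDF syntax. The helper names (pdf_literal_string, pdf_utf16_hex_string, file_spec_dict) and the object number 12 are made up for illustration, and the sketch assumes an ASCII filename for the F key so that it can sidestep the platform-encoding question.

```python
def pdf_literal_string(text: str) -> str:
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def pdf_utf16_hex_string(text: str) -> str:
    """Write text as a hex string holding UTF-16BE bytes after a byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def file_spec_dict(filename: str, stream_obj_num: int) -> str:
    """Build a file specification dictionary for an embedded file.

    stream_obj_num is the (hypothetical) object number of the embedded
    file stream that the EF dictionary points to.
    Assumption: filename is plain ASCII, so F needs no special encoding.
    """
    return (
        "<< /Type /Filespec\n"
        f"   /F {pdf_literal_string(filename)}\n"
        f"   /UF {pdf_utf16_hex_string(filename)}\n"
        f"   /EF << /F {stream_obj_num} 0 R >>\n"
        ">>"
    )


print(file_spec_dict("report.xlsx", 12))
```

The UF value is written as a hex string here only so that the output stays printable; a real PDF writer could store the same bytes in a literal string instead.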

An embedded file stream is simply a stream object that contains the data for an embedded file. As such, it can be stored and compressed using filters (see Stream Objects) such as Flate—the same technology used in a ZIP file. A variety of additional information can be present in the embedded file stream’s dictionary, such as the file’s Internet media type (aka MIME type), as the value of the Subtype key. Other information, such as the date and time at which the file was created or last modified, can be included in the embedded file parameter dictionary (which is the value of the Params key). Example 8-2 shows an example of an embedded file stream.
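As a rough illustration of how such a stream could be assembled, the following Python sketch compresses the file data with Flate (via the standard zlib module) and writes the stream dictionary in front of it. The function names are hypothetical, the Size entry is an extra detail assumed from the PDF specification rather than from the text above, and the date is expected to be a timezone-aware datetime.

```python
import zlib
from datetime import datetime, timezone


def pdf_name(text: str) -> str:
    """Write a PDF name, replacing delimiters and non-printable bytes with #xx."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if 33 <= byte <= 126 and ch not in "()<>[]{}/%#":
            out.append(ch)
        else:
            out.append(f"#{byte:02X}")
    return "/" + "".join(out)


def pdf_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as a PDF date string in UTC."""
    return moment.astimezone(timezone.utc).strftime("(D:%Y%m%d%H%M%SZ)")


def embedded_file_stream(data: bytes, mime_type: str, modified: datetime) -> bytes:
    """Build the bytes of an embedded file stream (without the obj/endobj wrapper)."""
    compressed = zlib.compress(data)  # Flate, the same algorithm ZIP uses
    header = (
        "<< /Type /EmbeddedFile\n"
        f"   /Subtype {pdf_name(mime_type)}\n"  # MIME type, with "/" written as #2F
        "   /Filter /FlateDecode\n"
        f"   /Length {len(compressed)}\n"
        f"   /Params << /Size {len(data)} /ModDate {pdf_date(modified)} >>\n"
        ">>\nstream\n"
    ).encode("ascii")
    return header + compressed + b"\nendstream"


stamp = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(embedded_file_stream(b"hello world", "text/plain", stamp)[:120])
```

Note that the MIME type has to be written as a PDF name, which is why the slash in text/plain is escaped.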

A file specification and its associated embedded file stream are only one piece of the puzzle; they still need to be connected to something in the PDF structure so that the PDF viewer can find them. If the file is associated with some specific content on a specific page, a FileAttachment annotation would be appropriate (see FileAttachment Annotations). However, if the file is more global to the document, the EmbeddedFiles name tree would be the place (see The EmbeddedFiles Name Tree).

Files can be connected to a PDF in two ways, depending on whether they are to be associated with specific content in a particular location or globally with the PDF as a whole. In the former case, we’ll use file attachment annotations. In the latter case, the approach will be to add an EmbeddedFiles key to the document’s name dictionary.
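The two attachment points can be sketched in Python as follows. Both functions are illustrative only: the names and object numbers are invented, the name tree is written as a single leaf node, and the keys are assumed to be ASCII strings free of parentheses and backslashes. The annotation uses the FS key, which is where a file attachment annotation holds its file specification.

```python
def embedded_files_names(entries: dict[str, int]) -> str:
    """Build a document name dictionary with an EmbeddedFiles name tree.

    entries maps each name tree key to the (hypothetical) object number
    of its file specification dictionary. Name tree keys must appear in
    sorted order, so the entries are sorted before being written.
    """
    pairs = " ".join(f"({key}) {num} 0 R" for key, num in sorted(entries.items()))
    return f"<< /EmbeddedFiles << /Names [ {pairs} ] >> >>"


def file_attachment_annotation(rect: tuple[int, int, int, int], filespec_obj_num: int) -> str:
    """Build a file attachment annotation that ties a file to a spot on a page."""
    x1, y1, x2, y2 = rect
    return (
        "<< /Type /Annot /Subtype /FileAttachment "
        f"/Rect [ {x1} {y1} {x2} {y2} ] /FS {filespec_obj_num} 0 R >>"
    )


# Document-wide attachment: the result is the value of the catalog's Names key.
print(embedded_files_names({"b.txt": 7, "a.txt": 5}))

# Location-specific attachment: the result goes in a page's Annots array.
print(file_attachment_annotation((100, 700, 120, 720), 5))
```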

A PDF with embedded files is useful where the page content is the primary focus for the person who will read the document. However, sometimes you have a collection of documents that need to be grouped together, but none of them have any higher priority than another. Thus, the embedded files themselves are the focus. For example, it might be all the materials for a legal case or for bidding on an engineering job. In those cases, you want the PDF viewer to present the list of files and any associated metadata about them, rather than the normal view of a primary document’s page content. It is for this purpose that the portable collections (or just “collections”) feature of PDF is used. Figure 8-2 shows an example.

The contents of a collection are the files listed in the EmbeddedFiles name tree. Any file in the name tree will be part of the collection, while any embedded files that are not in the tree will not. To turn these files into a collection instead of just a loose set of embedded files, the PDF needs a collection dictionary, which is the value of the Collection key in the document’s catalog dictionary (see Example 8-6 for a simple example). Although none of the keys in the collection dictionary are required, a useful collection dictionary would contain at least two keys: D and View.

D
The D key has a string value that is the name of a PDF in the EmbeddedFiles name tree that you want the PDF viewer to show initially. It is recommended that this either be the key document in the collection or instructions about how to navigate the collection.
View
The View key has a value (of type name) that will tell the PDF viewer whether to present the list of files from the collection in details mode (D), tile mode (T), or initially hidden (H).
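A collection dictionary with just these two keys is small enough to sketch in a few lines of Python. The function name, the friendly mode names in VIEW_MODES, and the filename readme.pdf are all invented for this illustration; only the D, T, and H values come from the description above.

```python
# Map of readable mode names (an invention of this sketch) to View values.
VIEW_MODES = {"details": "D", "tile": "T", "hidden": "H"}


def collection_dict(initial_document: str, view: str = "details") -> str:
    """Build a collection dictionary with the D and View keys.

    initial_document must match a key in the EmbeddedFiles name tree.
    Assumption: the key is ASCII and has no parentheses or backslashes.
    """
    if view not in VIEW_MODES:
        raise ValueError(f"view must be one of {sorted(VIEW_MODES)}")
    return f"<< /Type /Collection /D ({initial_document}) /View /{VIEW_MODES[view]} >>"


print(collection_dict("readme.pdf", "tile"))
```

The resulting dictionary would then be referenced from the Collection key of the catalog dictionary.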

While a simple list can be useful, it is more likely that there is additional information about each file that could be displayed as part of the collection interface presented by the PDF viewer. For example, if the files represented a movie catalog, displaying the movies’ release dates and durations, as in Figure 8-3, might be useful.

To create a set of fields such as those in the example image, a collection schema dictionary is included in the collection dictionary as the value of the Schema key, with each key in the dictionary having a value that is a collection field dictionary. It would look something like Example 8-7.

The example schema defines four fields (YEAR, DURATION, TITLE, and DVD), giving each one a display name and a type. Values for these fields are then associated with each of the files specified in the EmbeddedFiles name tree by adding a CI key, whose value is a collection item dictionary, to each file specification dictionary.
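Here is one way the schema and a matching collection item could be generated, as a Python sketch. The field types are guesses made for this illustration (S for text and N for number, two of the subtypes defined for collection fields), as are the display names, the sample values, and the function names.

```python
def literal(text: str) -> str:
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def collection_schema(fields: list[tuple[str, str, str]]) -> str:
    """Build a collection schema from (key, subtype, display name) tuples.

    The O entry gives each field its column position, taken here from
    the position of the tuple in the list.
    """
    lines = ["<< /Type /CollectionSchema"]
    for order, (key, subtype, label) in enumerate(fields, start=1):
        lines.append(
            f"   /{key} << /Type /CollectionField /Subtype /{subtype} "
            f"/N {literal(label)} /O {order} >>"
        )
    lines.append(">>")
    return "\n".join(lines)


def collection_item(values: dict[str, object]) -> str:
    """Build the collection item dictionary stored under a file spec's CI key."""
    parts = []
    for key, value in values.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        parts.append(f"/{key} {value if is_number else literal(str(value))}")
    return "<< /Type /CollectionItem " + " ".join(parts) + " >>"


# Assumed types: text for TITLE and DVD, numbers for YEAR and DURATION.
schema = [("TITLE", "S", "Title"), ("YEAR", "N", "Year"),
          ("DURATION", "N", "Duration"), ("DVD", "S", "DVD")]
print(collection_schema(schema))
print(collection_item({"TITLE": "Sample Film", "YEAR": 1999, "DURATION": 120, "DVD": "Yes"}))
```

The keys of each collection item must match the keys used in the schema; that shared key is what lets the viewer place a value in the right column.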

With all that data at our disposal, we can also choose to have the file list sorted based on any of the elements of the schema rather than the default order of the EmbeddedFiles name tree. This is done by including a Sort key in the collection dictionary whose value is its associated collection sort dictionary, as shown in Example 8-9.
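The sketch below writes a simple collection sort dictionary and then mimics, in plain Python, the ordering a viewer would be expected to produce from it. The function names and the sample items are invented; the S key names the schema field to sort by, and the A key says whether the order is ascending.

```python
def collection_sort(field: str, ascending: bool = True) -> str:
    """Build a collection sort dictionary that sorts on a single schema field."""
    return f"<< /Type /CollectionSort /S /{field} /A {'true' if ascending else 'false'} >>"


def sort_items(items: list[dict], field: str, ascending: bool = True) -> list[dict]:
    """Mimic the viewer: order collection items by the value of one field."""
    return sorted(items, key=lambda item: item[field], reverse=not ascending)


# Hypothetical collection items, keyed by schema field name.
films = [
    {"TITLE": "Film B", "YEAR": 2005},
    {"TITLE": "Film A", "YEAR": 1999},
]

print(collection_sort("YEAR"))
print([film["TITLE"] for film in sort_items(films, "YEAR")])
```

Without a Sort key, the viewer would fall back to the order of the EmbeddedFiles name tree, which is the order of its keys rather than of any schema field.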

Previously, in Actions, you learned about actions that allowed a user to navigate within the existing document (GoTo), or to an external document (GoToR). Now that you’ve seen how to embed documents inside of a PDF, let’s see how to navigate to an embedded document.

The GoToE (or “embedded go-to”) action is quite similar to a remote go-to action, but it allows jumping to an embedded PDF file. Both file attachment annotations and entries in the EmbeddedFiles name tree are supported. These embedded files may in turn contain embedded files, and the GoToE action can point through one or more parent PDFs to the final destination PDF (also called the target PDF) via the target dictionary.

The action dictionary for a GoToE action includes the same three keys found in both the GoTo and GoToR actions: Type (with a value of Action), S (with a value of GoToE), and D (whose value is the destination in the target PDF). It also includes a T key that identifies the target PDF.

The value of the T key in the action dictionary is a target dictionary that locates the target in relation to the source, in much the same way that a relative path describes the physical relationship between two files in a filesystem. Target dictionaries may be nested recursively to specify one or more intermediate targets before reaching the final one.

The “relative path” described by the target dictionary need not go only down the hierarchy; it may also go up, just as the “..” entry would signify in a DOS or Unix path. The “direction” is specified by the R (relationship) key, which has a value of either P (parent) or C (child). Example 8-10 shows a few sample GoToE actions.
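The path analogy maps naturally onto a small recursive function. In this Python sketch, a list such as ["..", "sibling.pdf"] is turned into nested target dictionaries, with ".." becoming a parent step and any other entry becoming a child step. The function names and filenames are hypothetical, and child steps are assumed to be entries in the EmbeddedFiles name tree (identified with the target dictionary's N key) rather than file attachment annotations.

```python
def literal(text: str) -> str:
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def target_dict(path: list[str]) -> str:
    """Turn a list of path steps into nested target dictionaries.

    ".." means "go up to the parent PDF" (R is P); anything else names a
    child in the current PDF's EmbeddedFiles name tree (R is C, N is the name).
    """
    if not path:
        raise ValueError("path must contain at least one step")
    head, rest = path[0], path[1:]
    parts = ["/R /P"] if head == ".." else ["/R /C", f"/N {literal(head)}"]
    if rest:
        parts.append(f"/T {target_dict(rest)}")  # remaining steps nest inside T
    return "<< " + " ".join(parts) + " >>"


def goto_embedded(destination: str, path: list[str]) -> str:
    """Build a GoToE action that jumps to a named destination in the target PDF."""
    return f"<< /Type /Action /S /GoToE /D {literal(destination)} /T {target_dict(path)} >>"


# Down one level, into a child PDF.
print(goto_embedded("Chapter 1", ["child.pdf"]))

# Up to the parent, then down into a sibling PDF.
print(goto_embedded("Chapter 1", ["..", "sibling.pdf"]))
```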

In this chapter, you learned how to embed a file into a PDF (connecting it either to the document as a whole or to a specific place on a page) using a file specification dictionary and its associated embedded file stream. You also learned how to instruct a PDF viewer to show your embedded files as a rich collection of documents.

Next, you will learn how to work with multimedia objects in PDF, such as videos and sounds.