Chapter 6. Dynamic PDF

Introduction: Hacks #74-92

PDF doesn’t have to be stuck in a file, created once and then published. PDF can be dynamic in multiple ways, ranging from interactivity through forms to custom generation of PDFs that meet particular user needs. While PDF seems static to a lot of people, that’s more a matter of the way it’s typically been used rather than an aspect of the technology itself. If you want to do more with PDF than distribute documents, you can.

Collect Data with Online PDF Forms

Turn your electronic document into a user interface and collect information from readers.

Traditional paper forms use page layout to show how information is structured. Sometimes, as on tax forms, these relationships get pretty complicated. PDF preserves page layout, so it is a natural way to publish forms on the Web. The next decision is, how many PDF form features should you add?

If you add no features, your users must print the form and fill it out as they would any other paper form. Then they must mail it back to you for processing. Sometimes this is all you need, but PDF is capable of more.

If you add fillable form fields to the PDF, your users can fill in the form using Acrobat or Reader. When they are done, they still must print it out and mail it to you. Acrobat users can save filled-in PDFs, but Reader users can’t, which can be frustrating.

If you add fillable form fields and a Submit button that posts field data to your web server, you have joined the information revolution. Your web server can interactively validate the user’s data, provide helpful feedback, record the completed data in your database, and supply the user with a savable PDF copy. Olé!

We have gotten ahead of ourselves, though. First, let’s create a form that submits data to your web server, such as the one in Figure 6-1. Subsequent hacks will build on this. To see online examples of interactive PDF forms, visit http://www.pdfhacks.com/form_session/. You can download our example PDF forms and PHP code from this site, too.

Open the form’s source document and print to PDF [Hack #39] or scan a paper copy and create a PDF using OCR. Open the PDF in Acrobat to add form fields.

PDF form fields correspond closely to HTML form fields, as shown in Table 6-1. Add them to your PDF using one or more Acrobat tools.

In Acrobat 6, as shown in Figure 6-2, you have one tool for each form field type. Open this toolbar by selecting Tools Advanced Editing Forms Show Forms Toolbar. Select a tool (e.g., Text Field tool), click, and drag out a rectangle where the field goes. Release the rectangle and a Field Properties dialog opens. Select the General tab and enter the field Name. This name will identify the field’s data when it is submitted to your web server. Set the field’s appearance and behavior using the other tabs. Click Close and the field is done.

In Acrobat 5, use the Form tool shown in Figure 6-2 to create any form field. Click, and drag out a rectangle where the field goes. Release the rectangle and a Field Properties dialog opens. Select the desired field Type (e.g., Text) and enter the field Name. This name will identify the field’s data when it is submitted to your web server. Set the field’s appearance and behavior using the other tabs. Click OK and the field is done. Using the Form tool, double-click a field at any time to change its properties.

To upload form data to your web server, the PDF must have a Submit Form button. Create a PDF button, open the Actions tab, and then add the Submit a Form (Acrobat 6) or Submit Form (Acrobat 5) action to the Mouse Up event, as shown in Figure 6-3.

Edit the action’s properties to include your script’s URL; this would be an HTML form’s action attribute. Append #FDF to the end of this URL, like this:

http://localhost/pdf_hacks/echo.php#FDF

Set the Field Selection to include the fields you want this button to submit; All Fields is safest, to start. Set the Export Format to HTML and the PDF form will submit the form data using HTTP’s post method.

When you are done, save your PDF form and test it.

To test your interactive PDF form, you must have access to a web server. Many of these hacks use server-side PHP scripts, so your web server should also run PHP (http://www.php.net). Windows users can download an Apache (http://www.apache.org) web server installer called IndigoPerl from IndigoSTAR (http://www.indigostar.com). This installer includes PHP (and Perl) modules, so you can run our hacks right out of the box. Apache and PHP are free software.

Visit http://www.indigostar.com/indigoperl.htm and download indigoperl-2004.02.zip. Unzip this file into a temporary directory and then double-click setup.bat to run the installer. When the installer asks for an installation directory, press Enter to choose the default: C:\indigoperl\. In our discussions, we’ll assume IndigoPerl is installed in this location.

After installing IndigoPerl, open a web browser and point it at http://localhost/. This is the URL of your local web server, and your browser should display a Web Server Test Page with links to documentation. When you request http://localhost/, Apache serves you index.html from C:\indigoperl\apache\htdocs\. Create a pdf_hacks directory in the htdocs directory, and use this location for our PHP scripts. Access this location from your browser with the URL: http://localhost/pdf_hacks/.

Populate online PDF forms with known data.

To maintain form data, you must display the current state of the data to the user. This enables the user to review the data, update a single field, and submit this change back to the server. With HTML forms, you can set field values as the form is served to the user. With PDF forms, you can use the Forms Data Format (FDF) to populate a form’s fields with data, as shown in Figure 6-5.

The PDF Reference [Hack #98] describes the FDF file format. Its syntax uses PDF objects Section 6.8[Hack #80] to organize data. To see an example, open your PDF form in Acrobat and fill in some fields. Export this data as FDF by selecting Advanced Forms Export Forms Data . . . (Acrobat 6) or File Export Form Data . . . (Acrobat 5). Our basic PDF form [Hack #74] yields an FDF file that lists fields in name/value pairs and then references the PDF form by filename:

%FDF-1.2
1 0 obj
<< /FDF 
  << /Fields 
    [ << /T (text_field_1) /V (Here is some text) >>
      << /T (text_field_2) /V (More nice text) >> ]
     /F (http://localhost/fine_form.pdf)
  >>
>>
endobj
trailer
<< /Root 1 0 R >>
%%EOF

[Hack #77] offers a PHP script for easily creating FDF from your data.

Users can store and manage PDF form data using FDF files. Visit http://segraves.tripod.com/index3.htm for some examples. For our purpose of serving filled-out PDF forms, the user never sees or handles the FDF file directly.

You have two options for automatically filling an online PDF form with data. You can serve FDF data that references the PDF form, or you can create a URL that references both the PDF form and the FDF data together.

Serve FDF to Fill Forms

One way to automatically fill an online PDF form is to serve a data-packed FDF file (with MIME type application/vnd.fdf). The user’s browser will open Acrobat/Reader and pass it the FDF data. Acrobat/Reader will read the FDF data to locate the PDF form. It will load and display this PDF and then populate its fields from the FDF. The PDF form in question should be available from your web server and the FDF data should reference it by URL using the /F key, as we do in our preceding example.

This technique is simple, but it has limitations. First, not all browsers know how to handle FDF data. Second, this technique does not always work inside of HTML frames. The next technique overcomes both of these problems.

Hacking the Hack

FDF can also contain PDF annotation (e.g., sticky note) information. Use the preceding techniques to dynamically add annotations to online PDF. Create example FDF or XFDF files by opening a PDF in Acrobat and adding some annotations. Then, select Document Export Comments . . . (Acrobat 6) or File Export Comments . . . (Acrobat 5).

Convert your data into FDF so that Acrobat or Reader can merge it with a PDF form.

As discussed in [Hack #75] , you can deliver filled-out PDF forms on the Web by serving FDF data. FDF data contains the URL for your PDF form and the data with which to fill the form. Upon receiving FDF data, Acrobat (or Reader) will open the referenced PDF form and then populate it with the given information. The next problem is how to easily create FDF data on your web server.

The PDF Reference [Hack #98] describes FDF in dizzying detail, and Adobe offers a free-of-charge FDF Toolkit with a dizzying license (http://partners.adobe.com/asn/acrobat/forms.jsp). But what you need is usually easy to create from the comfort of your favorite web programming language. We provide such a script written in PHP. It converts form data into an FDF file suitable for filling our basic PDF form [Hack #74] .

Elaborate forms might require the high-caliber Adobe FDF Toolkit to create suitable FDF. Most forms merely require their field data cast into the FDF syntax. We offer the forge_fdf script for this purpose. The FDF example from [Hack #75] shows the pattern evident in simple FDF files.

Instead of using a general-purpose FDF function or library, you can also consider exporting an FDF file from your form and then converting it into a template. Replace the form values with variables or other placeholders. Serve this template back to your form after filling the variables with user data. forge_fdf includes functions for encoding PDF strings and names [Hack #80] , which you might find useful when filling in your template.

Create FDF with forge_fdf

Pass form data and a PDF’s URL into forge_fdf, and it returns the corresponding FDF as a string. Create an FDF file with this string or serve it directly to the client browser with Content-type: application/vnd.fdf. We offer an example a little later.

You must remember some FDF peculiarities when passing arguments to forge_fdf.

For example, the following script uses forge_fdf to serve FDF data that should cause the user’s browser to open http://localhost/form.pdf and set its fields to match our values:

<?php
require_once('forge_fdf.php');

$pdf_form_url= "http://localhost/form.pdf";

$fdf_data_strings= array( 'text1' => $_GET['t'], 'text2' => 'Egads!' );
$fdf_data_names= array( 'check1' => 'Off', 'check2' => 'Yes' );

$fields_hidden= array( 'text2', 'check1' );
$fields_readonly= array( 'text1' );

header( 'content-type: application/vnd.fdf' );

echo forge_fdf( $pdf_form_url,
                $fdf_data_strings, 
                $fdf_data_names,
                $fields_hidden,
                $fields_readonly );
?>

To see a more elaborate example of forge_fdf in action, visit http://www.pdfhacks.com/form_session/. Tinker with the online example or download PHP source code from this web page.

Copy this code into a file named forge_fdf.php and include it in your PHP scripts. Or, adapt this algorithm to your favorite language. Visit http://www.pdfhacks.com/forge_fdf/ to download the latest version.

<?php
/* forge_fdf, by Sid Steward
   version 1.0
   visit: http://www.pdfhacks.com/forge_fdf/

  For text fields, combo boxes, and list boxes, add
  field values as a name => value pair to $fdf_data_strings.

  For checkboxes and radio buttons, add field values
  as a name => value pair to $fdf_data_names.  Typically,
  true and false correspond to the (case-sensitive)
  names "Yes" and "Off".

  Any field added to the $fields_hidden or $fields_readonly
  array also must be a key in $fdf_data_strings or
  $fdf_data_names; this might be changed in the future

  Any field listed in $fdf_data_strings or $fdf_data_names
  that you want hidden or read-only must have its field
  name added to $fields_hidden or $fields_readonly; do this
  even if your form has these bits set already

  PDF can be particular about CR and LF characters, so I
  spelled them out in hex: CR == \x0d : LF == \x0a 
*/

function escape_pdf_string( $ss )
{
  $ss_esc= '';
  $ss_len= strlen( $ss );
  for( $ii= 0; $ii< $ss_len; ++$ii ) {
    if( ord($ss{$ii})== 0x28 ||  // open paren
        ord($ss{$ii})== 0x29 ||  // close paren
        ord($ss{$ii})== 0x5c )   // backslash
      {
        $ss_esc.= chr(0x5c).$ss{$ii}; // escape the character w/ backslash
      }
    else if( ord($ss{$ii}) < 32 || 126 < ord($ss{$ii}) ) {
      $ss_esc.= sprintf( "\\%03o", ord($ss{$ii}) ); // use an octal code
    }
    else {
      $ss_esc.= $ss{$ii};
    }
  }
  return $ss_esc;
}

function escape_pdf_name( $ss )
{
  $ss_esc= '';
  $ss_len= strlen( $ss );
  for( $ii= 0; $ii< $ss_len; ++$ii ) {
    if( ord($ss{$ii}) < 33 || 126 < ord($ss{$ii}) || 
        ord($ss{$ii})== 0x23 ) // hash mark
      {
        $ss_esc.= sprintf( "#%02x", ord($ss{$ii}) ); // use a hex code
      }
    else {
      $ss_esc.= $ss{$ii};
    }
  }
  return $ss_esc;
}

function forge_fdf( $pdf_form_url, 
                    $fdf_data_strings,
                    $fdf_data_names,
                    $fields_hidden,
                    $fields_readonly )
{
  $fdf = "%FDF-1.2\x0d%\xe2\xe3\xcf\xd3\x0d\x0a"; // header
  $fdf.= "1 0 obj\x0d<< "; // open the Root dictionary
  $fdf.= "\x0d/FDF << "; // open the FDF dictionary
  $fdf.= "/Fields [ "; // open the form Fields array

  // string data, used for text fields, combo boxes, and list boxes
  foreach( $fdf_data_strings as $key => $value ) {
    $fdf.= "<< /V (".escape_pdf_string($value).")".
      "/T (".escape_pdf_string($key).") ";
    if( in_array( $key, $fields_hidden ) )
      $fdf.= "/SetF 2 ";
    else
      $fdf.= "/ClrF 2 ";

    if( in_array( $key, $fields_readonly ) )
      $fdf.= "/SetFf 1 ";
    else
      $fdf.= "/ClrFf 1 ";

    $fdf.= ">> \x0d";
  }

  // name data, used for checkboxes and radio buttons
  // (e.g., /Yes and /Off for true and false)
  foreach( $fdf_data_names as $key => $value ) {
    $fdf.= "<< /V /".escape_pdf_name($value).
      " /T (".escape_pdf_string($key).") ";
    if( in_array( $key, $fields_hidden ) )
      $fdf.= "/SetF 2 ";
    else
      $fdf.= "/ClrF 2 ";

    if( in_array( $key, $fields_readonly ) )
      $fdf.= "/SetFf 1 ";
    else
      $fdf.= "/ClrFf 1 ";
    $fdf.= ">> \x0d";
  }
  
  $fdf.= "] \x0d"; // close the Fields array

  // the PDF form filename or URL, if given
  if( $pdf_form_url ) {
    $fdf.= "/F (".escape_pdf_string($pdf_form_url).") \x0d";
  }
  
  $fdf.= ">> \x0d"; // close the FDF dictionary
  $fdf.= ">> \x0dendobj\x0d"; // close the Root dictionary

  // trailer; note the "1 0 R" reference to "1 0 obj" above
  $fdf.= "trailer\x0d<<\x0d/Root 1 0 R \x0d\x0d>>\x0d";
  $fdf.= "%%EOF\x0d\x0a";

  return $fdf;
}
?>

Walk your users through the form-filling process.

Collecting information with online forms is an interactive process. We have discussed how to collect form data [Hack #2] and how to drive forms [Hack #75] . Now, let’s use what we know to program an interactive, online form-filling session, such as the one in Figure 6-6 (visit http://www.pdfhacks.com/form_session/ to see this example and download PHP source code).

To submit data to your server, the PDF form must be displayed inside a web browser. I recommend displaying it inside an HTML frameset. This enables you to bracket the form with (HTML) instructions and a hyperlinked escape route, so the user won’t feel abandoned or trapped. It also adjusts the user to the idea that this isn’t any old PDF. Most people experience PDF as something to download and print out. Not only will this look different, but also the frameset conceals the PDF’s URL and prevents reflexive downloading.

Here is the HTML frameset code used in our example at http://www.pdfhacks.com/form_session/. Note that it uses a PDF+FDF URL to reference the form in order to prevent the PDF from breaking out of the frameset [Hack #75] .

<?php
// This is part of form_session
// visit www.pdfhacks.com/form_session/
//

$our_dir= 'http://'.$_SERVER['HTTP_HOST'].dirname($_SERVER['PHP_SELF']);

// The PDF+FDF URL notation respects frames
// and triggers the browser's 'PDF' association
// instead of its 'FDF' association (some browsers
// don't have an FDF association).
//
$form_frame_url= '"'.
$our_dir.'/form_session.pdf#FDF='.
$our_dir.'/update_state.php?reset=1"';

?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
   "http://www.w3.org/TR/html4/frameset.dtd">
<html>
  <head>
    <title>PDF Form Filling Session Demo</title>
  </head>
  <frameset cols="*,200">
    <frame src=<?php echo $form_frame_url ?> name="form" scrolling="no" 
marginwidth=0 marginheight=0 frameborder=1>
    <frame src="sidebar_session.html" name="sidebar" scrolling="auto" 
marginwidth=5 marginheight=5 frameborder=1>
  </frameset>
</html>

Here are some ideas for your interactive PDF form design. Keep in mind that a form can have any number of PDF fields, and you can hide fields from the user or set them as read-only at any time. Our forge_fdf script [Hack #77] enables you to set these flags as needed. Just add the field’s name to the $fields_hidden or $fields_readonly arrays.

  • Use hidden text fields to store the form’s session and state information.

  • If your form has multiple sections that require separate, server-side computation, add a Submit Form button for each section. Show only one button at a time by hiding all the others.

    Don’t forget to append an #FDF to form submission actions—e.g., update_state.php#FDF.

  • Highlight specific sections of your form with borders. Create a border using an empty, read-only text field that shows a colored border and a transparent background.

    You must draw these decorative form fields first, so they won’t interfere with other, interactive fields. Or, move a decorative field behind the others by giving it a lower tab order.

We employ some of these techniques in our online example at http://www.pdfhacks.com/form_session/.

Visit http://www.pdfhacks.com/form_session/ to see a live example of this model. Download form_session-1.1.zip from this page to examine the PHP scripts and PDF forms. IndigoPerl [Hack #74] users can unpack form_session-1.1.zip into C:\indigoperl\apache\htdocs\pdf_hacks\form_session-1.1\ and then run the example locally by pointing their browsers at http://localhost/pdf_hacks/form_session-1.1/start.html.

Provide online users with a copy of their completed form to save.

Adobe Reader enables a user to add, change, view, and print form data, but it does not enable a user to save the filled-in PDF form to disk. Saving the file produces a lovely copy of an empty form. How annoying!

Correct this problem server-side by merging the PDF form and its data. Then, offer this filled-in form as a download for the user’s records. After merging, the form fields remain interactive, even though they display the user’s data. Go a step further and flatten this form so that field data becomes a permanent part of the PDF pages. After flattening, filled-in fields are no longer interactive. You can merge and flatten forms using the iText library or our command-line pdftk. Both are free software.

The iText library (http://www.lowagie.com/iText/ or http://itextpdf.sf.net) is a remarkable tool for manipulating PDF documents. The following Java program, merge_pdf_fdf, demonstrates how to merge or flatten a PDF and its form data FDF [Hack #77] using iText. Run this code from your command line, or integrate it into your web application.

/*
  merge_pdf_fdf, version 1.0
  merge an input PDF file with an input FDF file
  to create a filled-in PDF; optionally flatten the
  FDF data so that it becomes part of the page
  http://www.pdfhacks.com/merge_pdf_fdf/

  invoke from the command line like this:

    java -classpath ./itext-paulo.jar:. \
    merge_pdf_fdf input.pdf input.fdf output.pdf

  or:

    java -classpath ./itext-paulo.jar:. \
    merge_pdf_fdf input.pdf input.fdf output.pdf flatten

  adjust the classpath to the location of your iText jar
 */

import java.io.*;
import com.lowagie.text.pdf.*;

public class merge_pdf_fdf extends java.lang.Object {
    
  public static void main(String args[]) {
    if ( args.length == 3 || args.length == 4 ) {
      try {
        // the input PDF
        PdfReader reader = 
          new PdfReader( args[0] );
        reader.consolidateNamedDestinations( );
        reader.removeUnusedObjects( );

        // the input FDF
        FdfReader fdf_reader= 
          new FdfReader( args[1] );

        // PdfStamper acts like a PdfWriter
        PdfStamper pdf_stamper= 
          new PdfStamper( reader,
                          new FileOutputStream( args[2] ) );

        if( args.length == 4 ) { // "flatten"
          // filled-in data becomes a permanent part of the page
          pdf_stamper.setFormFlattening( true );
        }
        else {
          // filled-in data will 'stick' to the form fields,
          // but it will remain interactive
          pdf_stamper.setFormFlattening( false );
        }

        // sets the form fields from the input FDF
        AcroFields fields=
          pdf_stamper.getAcroFields( );
        fields.setFields( fdf_reader );

        // closing the stamper closes the underlying
        // PdfWriter; the PDF document is written
        pdf_stamper.close( );
      }
      catch( Exception ee ) {
        ee.printStackTrace( );
      }
    }
    else { // input error
      System.err.println("arguments: file1.pdf file2.fdf destfile [flatten]");
    }
  }
}

To create a command-line Java program, copy the preceding code into a file named merge_pdf_fdf.java. Then, compile merge_pdf_fdf.java using javac, setting the classpath to the name and location of your iText jar:

               javac -classpath   
                merge_pdf_fdf.java

Finally, invoke merge_pdf_fdf like so:

               java -classpath   
               :. \
               merge_pdf_fdf 
               
                  input.pdf input.fdf output.pdf
               

Use pdftk [Hack #79] to merge a form with an FDF datafile [Hack #77] and create a new PDF. The fields will display the given data, but they also remain interactive. pdftk’s fill_form operation takes the filename of an FDF file as its argument. For example:

               pdftk   
                fill_form   
                output 

You can’t combine the fill_form operation with any other operation (e.g., cat), but you can supply additional output options for encryption [Hack #52] .

Flatten form data permanently into the page by adding the flatten_form output option. The resulting PDF data will no longer be interactive.

               pdftk   
                fill_form   
                output   
                flatten_form

Or, if your PDF form already has field data, just flatten it:

               pdftk   
                output   
                flatten_form

Take control of your PDF with pdftk.

If PDF is electronic paper, pdftk is an electronic staple-remover, hole punch, binder, secret-decoder ring, and X-ray glasses. pdftk is a simple, free tool for doing everyday things with PDF documents. It can:

The pdftk web site (http://www.AccessPDF.com/pdftk/) has links to software downloads and instructions for installation and usage. pdftk currently runs on Windows, Linux, Solaris, FreeBSD, and Mac OS X. Some users can download precompiled binaries, while others must download the source code and build pdftk using gcc, gcj, and libgcj (as described on the web site). pdftk is free software.

On Windows, download pdftk_1.0.exe.zip to a convenient directory. Unzip with your favorite archiving tool, and move the resulting pdftk.exe program to a directory in your PATH, such as C:\windows\system32\ or C:\winnt\system32\. Test it by opening a command-line DOS prompt and typing pdftk --help. It should respond with pdftk version information and usage instructions.

Handy Command Line for Windows

Command prompts aren’t well suited for quickly navigating large, complex filesystems. Let’s configure the Windows File Explorer to open a command prompt in the working directory we select. This will make it easier to use pdftk in a specific directory.

Windows XP and Windows 2000:

Windows 98:

Test your configuration by right-clicking a folder in the File Explorer. The context menu should list your new Command action, as shown in Figure 6-7. Choose this action and a command prompt will appear with its working directory set to the folder you selected. Olé!

Turn obfuscated PDF code into transparent data so you can work with it directly.

PDF uses an element framework for organizing data. When editing PDFs at the text level, it helps to know how to navigate these nodes. The data itself usually is compressed and unreadable. pdftk [Hack #79] can uncompress these streams, making the PDF more interesting to read and much more hackable.

First, uncompress your PDF document using pdftk:

            pdftk   
            output   
            uncompress

Next, fire up your text editor. A good text editor enables you to inspect any document at its lowest level by reading its bytes right off of the disk. Not all text editors can handle the mix of human-readable text and machine-readable binary data that PDF contains. Other editors can read and display this data, but they can’t write it properly. I recommend using gVim [Hack #82] .

Open a PDF in your text editor and you will find some plain-text data and some unreadable binary data. All of this data is organized using a few basic objects. The PDF Reference 1.5 section 3.2 describes these in detail. Here is a quick key to get you started.

Names: / ...

A slash indicates the beginning of a name. Examples include /Type and /Page. Most names have very specific meanings prescribed by the PDF Reference. They are never compressed or encrypted.

Strings: ( ... )

Strings are enclosed by parentheses. An example is (Now is the time). You use them for holding plain-text data in annotations and bookmarks. You can encrypt them but you can’t compress them. Mind escaped characters—e.g., \), \(, or \\.

Dictionaries: << key1 value1 key2 value2 ... >>

Dictionaries map keys to values. Keys must be names and values can be anything, even dictionaries or arrays.

Arrays: [ object1 object2 ... ]

Arrays represent a list of objects. All PDF objects are part of one big tree, interconnected by arrays and dictionaries.

Streams: << ... >> stream ... endstream

Most PDF data is stored in streams. Dictionary data precedes the stream data and holds information about the stream, such as its length and encoding. stream and endstream bracket the actual stream data. Streams are used to hold bitmap images and page-drawing instructions, among other things. Use pdftk to make compressed page streams readable. Some streams use PDF objects (dictionaries, strings, arrays, etc.) to represent information.

Indirect object references: m n R

Indirect object references allow an object to be referenced in one place (or many places) and described in another. The reference is a pair of numbers followed by the letter R, such as: 3528 0 R. You find them in dictionaries and arrays. To locate the object referenced by m n R, search for m n obj.

Indirect object identifiers: m n obj ... endobj

An indirect object is any object that is preceded by the identifying m n obj, where m and n are numbers that uniquely identify the object. Another object can then reference the indirect object by simply invoking m n R, described earlier.

Dictionaries tend to be the most interesting objects. They represent things such as pages and annotations. You can tell what a dictionary describes by checking its /Type and /Subtype keys. Conversely, you can find something in a PDF by searching on its type. For example, you can find each page in a PDF by searching for the text /Page. For annotations, search for /Annot, and for images, /Image.

At the end of the PDF file is the XREF lookup table. It gives the byte offset for every indirect object in the PDF file. This allows rapid random access to PDF pages and other data. Text-level PDF editing can corrupt the XREF table, which breaks the PDF. [Hack #81] solves this problem.

Take control of PDF code by mastering its XREF table.

[Hack #80] revealed the hackable plain text behind PDF. Here we edit this PDF text and then use pdftk [Hack #79] to cover our tracks. pdftk can also compress the page streams when we’re done.

First, uncompress your PDF’s page streams [Hack #80] :

            pdftk   
             output   
             uncompress

Then, open this new PDF in your text editor. Locate your page of interest by searching for the text /pdftk_pageNum N, where N is the number of your page (the first page is 1, not 0). This text was added to the page dictionaries by pdftk.

Find the /Contents key in your page’s dictionary. It is probably mapped to an indirect object reference: m n R. Locate this indirect object by searching for the text m n obj. This will take you to a stream or to an array of streams. If it is an array, look up any of its referenced streams the same way.

Now you should be looking at a stream of PDF drawing operations that describe your page. These operations and their interactions are best understood by studying the PDF Reference [Hack #98] . However, if your page has a lot of text on it, you can probably make it out. An example of a legal change in page text is changing [(gr)17.7(oup)] to [(grip)], or (storey) to (story). Anything inside parentheses this way is fair game. So, change something and save your work.

Editing PDF at the text level typically corrupts the XREF lookup table at the end of the file. Repair your edited PDF using pdftk like so:

            pdftk   
             output 

Or, if you want to compress the output and remove the /pdftk_pageNum entries, add compress to the end like so:

            pdftk   
             output   
             compress

Open your new PDF in Reader and view your page. Do you see the change you made? If it was in the middle of a paragraph, you might be surprised to find that the paragraph hasn’t rewrapped to fit your altered word. Most PDFs have no concept of a paragraph, so how could it?

This procedure is an unlikely way to fix typos. We put it to better use in [Hack #82] .

Integrate pdftk with gVim for Seamless PDF Editing

Turn gVim into a PDF editor.

gVim is an excellent text editor that can also be handy for viewing and editing PDF code. It handles binary data nicely, it is mature, and it is free. Also, you can extend it with plug-ins, which is what we’ll do. First, let’s download and install gVim.

Visit http://www.vim.org. The download page offers links and instructions for numerous platforms. Windows users can download the installer from http://www.vim.org/download.php. As of this writing, it is called gvim63.exe. During installation, the default settings should suit most needs. Click through to the end and it will create a Programs menu from which you can launch gVim.

The first-time gVim user should run gVim in Easy mode, which will make it behave like most other text editors. On Windows, do this by running gVim Easy from the Programs menu. Or, you can activate Easy mode from inside gVim by typing (the initial colon is essential) :source $VIMRUNTIME/evim.vim into your gVim session.

gVim comes with an interactive tutorial and a good online help system. Learn about it by invoking :help.

The pdftk plug-in turns gVim into a PDF editor. When opening a PDF it automatically uncompresses page streams so that you can read and modify them [Hack #80] . When closing a PDF, it automatically compresses page streams. If any changes were made to the PDF, it also fixes internal PDF byte offsets [Hack #81] .

Visit http://www.AccessPDF.com/pdftk/ and download pdftk.vim.zip. Unzip and then move the resulting file, pdftk.vim, to the gVim plug-ins directory. This usually is located someplace such as C:\Vim\vim63\plugin\. To help find this directory, try searching for the file gzip.vim, which should be there already. The pdftk plug-in will be sourced the next time you run gVim, so restart gVim if necessary.

Use care when testing the plug-in for the first time. Copy a PDF to create a test file named test1.pdf. Launch gVim and use it to open test1.pdf. There will be a delay while pdftk uncompresses it. Data should then appear in the editor as readable text, as shown in Figure 6-8. Any graphic bitmaps still will appear as unreadable gibberish.

Without making any changes to the file, save it as test2.pdf and close it. gVim will pause again while it compresses test2.pdf. Now, open test2.pdf in Acrobat or Reader. If everything is in place, it should open just fine. If Acrobat or Reader complains about the file being damaged, double-check the installation.

With our PDF extensions, gVim enables you to conveniently edit PDF code. You can bring power and beauty together by configuring Acrobat’s TouchUp tool to use gVim for editing PDF objects.

In Acrobat, select Edit Preferences General TouchUp. Click Choose Page/Object Editor . . . and a file selector will open. Select gvim.exe, which usually lives somewhere such as C:\Vim\vim63\. Click OK and you are done.

Test out your new configuration on a disposable document. Open the PDF in Acrobat and select the TouchUp Object tool (it might be hidden by the TouchUp Text tool), as shown in Figure 6-9.

Click a paragraph and a box appears, outlining the selection. Right-click inside this box and choose Edit Object . . . . gVim will open, displaying the PDF code used to describe this selection, as shown in Figure 6-10. It will be a full PDF document, with fonts and an XREF table.

Find some paragraph text and make some small changes [Hack #80] . When you save the gVim file, Acrobat should promptly update the visible page to reflect your change. Sometimes this update looks imperfect, temporarily. You can make many successive PDF edits this way.

You might notice occasional warnings from gVim about the data having been modified on the disk by another program. You can safely ignore these.

Add live session data to your PDF on its way down the chute.

After publishing your PDF online, it can be hard to gauge what impact it had on readers. Get a clearer picture of reader response by modifying the PDF’s hyperlinks so that they pass document information to your web server.

For example, if your July newsletter’s PDF edition has hyperlinks to:

http://www.pdfhacks.com/index.html

you can append the newsletter’s edition to the PDF hyperlinks using a question mark:

http://www.pdfhacks.com/index.html?edition=0407

When somebody reading your PDF newsletter follows this link into your site, your web logs record exactly which newsletter they were reading.

Take this reader response idea a step further by adding data to PDF hyperlinks that identifies the user who originally downloaded the PDF. With a little preparation, this is easy to do as the PDF is being served.

Add Hyperlinks to Your PDF Using Links or Buttons

A PDF page can include hyperlinks to web content. You can create them using the Link tool, the Button tool (Acrobat 6), or the Form tool (Acrobat 5). Use the Link tool shown in Figure 6-11 if you want to add a hyperlink to existing text or graphics. Use the Button/Form tool if you want to add a hyperlink and add text/graphics to the page, as shown in Figure 6-12. For example, you would use the Button/Form tool to create a web-style navigation bar [Hack #65] .

To create a hyperlink button in Acrobat 6, select the Button tool (Tools Advanced Editing Forms Button Tool). Click the PDF page and drag out a rectangle. Release the rectangle and a Field Properties dialog opens. Set the button’s appearance using the General, Appearance, and Options tabs.

Open the Actions tab. Set the Trigger to Mouse Up, set the Action to Open a Web Link, and then click Add . . . . A dialog will open where you can enter the hyperlink URL.

To create a hyperlink button in Acrobat 5, select the Form tool. Click the PDF page and drag out a rectangle. Release the rectangle and a Field Properties dialog opens. Set the field type to Button and enter a unique Name. Set the button’s appearance using the Appearance and Options tabs.

Open the Actions tab. Select Mouse Up and click Add . . . . Set the Action Type to World Wide Web Link. Click Edit URL . . . and enter the hyperlink URL.

This example PHP script, serve_newsletter.php, opens a pdfsrc file, reads the offset data we added, then serves the PDF. As it serves the PDF, it replaces the placeholders with hyperlinks. It uses the input GET query string’s edition and user values to tailor the PDF hyperlinks.

For example, when invoked like this:

http://www.pdfhacks.com/serve_newsletter.php?edition=0307&user=84

it opens the PDF file newsletter.0307.pdfsrc and serves it, replacing all userhome hyperlink placeholders with:

http://www.pdfhacks.com/user_home.php?user=84

and replacing all newsletters placeholders with:

http://www.pdfhacks.com/newsletter_home.php?user=84&edition=0307

Tailor serve_newsletter.php to your purpose:

<?php
// serve_newsletter.php, version 1.0
// http://www.pdfhacks.com/dynamic_links/

$fp= @fopen( "./newsletter.{$_GET['edition']}.pdfsrc", 'r' );
if( $fp ) {

  if( $_GET['debug'] ) {
    header("Content-Type: text/plain"); // debug
  }
  else {
    header('Content-Type: application/pdf');
  }

  $pdf_offset= 0;
  $url_offsets= array( );

  // iterate over first lines of pdfsrc file to load $url_offsets
  while( $cc= fgets($fp, 1024) ) {
    if( $cc{0}== '#' ) { // one of our comments
      list($comment, $name, $offset)= explode( '-', $cc );

      if( $name== 'userhome' ) {
        $url_offsets[(int)$offset]= 
          'http://www.pdfhacks.com/user_home.php?user=' . $_GET['user'];
      }
      else if( $name== 'newsletters' ) {
        $url_offsets[(int)$offset]= 
          'http://www.pdfhacks.com/newsletter_home.php?user=' . 
          $_GET['user'] . '&edition=' . $_GET['edition'];
      }
      else { // default
        $url_offsets[(int)$offset]= 'http://www.pdfhacks.com';
      }
    }
    else { // finished with our comments
      echo $cc;
      $pdf_offset= strlen($cc)+ 1;

      break;
    }
  }

  // sort by increasing offsets
  ksort( $url_offsets, SORT_NUMERIC );
  reset( $url_offsets );

  $output_url_line_b= false;
  $output_url_b= false;
  $closed_string_b= false;

  list( $offset, $url )= each( $url_offsets );
  $url_ii= 0;
  $url_len= strlen($url);

  // iterate over rest of file
  while( ($cc= fgetc($fp))!= "" ) {

    if( $output_url_line_b && $cc== '(' ) {
      // we have reached the beginning of our URL
      $output_url_line_b= false;
      $output_url_b= true;

      echo '(';
    }
    else if( $output_url_b ) {
      if( $cc== ')' ) { // finished with this URL
        if( $closed_string_b ) {
          // string has already been capped; pad
          echo ' ';
        }
        else {
          echo ')';
        }

        // get next offset/URL pair
        list( $offset, $url )= each( $url_offsets );
        $url_ii= 0;
        $url_len= strlen($url);

        // reset
        $output_url_b= false;
        $closed_string_b= false;
      }
      else if( $url_ii< $url_len ) {
        // output one character of $url
        echo $url{$url_ii++};
      }
      else if( $url_ii== $url_len ) {
        // done with $url, so cap this string
        echo ')';
        $closed_string_b= true;
        $url_ii++;
      }
      else {
        echo ' '; // replace padding with space
      }
    }
    else {
      // output this character
      echo $cc;

      if( $offset== $pdf_offset ) {
        // we have reached a line in pdfsrc where
        // our URL should be; begin a lookout for '('
        $output_url_line_b= true;
      }
    }

    ++$pdf_offset;
  }

  fclose( $fp );
}
else { // file open failure
  echo 'Error: failed to open: '."./newsletter.{$_GET['edition']}.pdfsrc";
}
?>

Create a PDF template that you can populate as it is served.

Sometimes a PDF needs to include dynamic information. For example, you could fashion the cover of your personalized PDF sales brochure [Hack #89] to include the customer’s name: “Created for Mary Jane Doe on March 15, 2004.” To do this, let’s use what we know about modifying PDF text in a plain-text editor [Hack #80] to create a PDF template. Then we’ll fill in this template using a web server script.

The overall process resembles [Hack #83] . Instead of PDF links, you will add placeholders to the PDF’s page streams. As it is served, these placeholders can be replaced with your data.

Create the PDF

Design the document using your favorite authoring application. Add placeholder text where you want the dynamic data to appear. Placeholders should have a common prefix, such as textbeg_customer. Style this text to taste, but align it to the left (not the center). Before creating a PDF, be careful with the placeholder fonts to avoid results such as the one in Figure 6-14.

Whichever font you choose for your placeholder, you must make sure the font gets adequately embedded into the PDF [Hack #43] . An embedded font is often subset, which means it includes only the characters that are used in your document. If your placeholder text uses a Type 1 font, you can configure Distiller to not subset this font [Hack #43] . If your placeholder text uses a TrueType or OpenType font, you must be sure that every character you might need occurs in your document. To be safe, create a separate page that includes every letter in the alphabet, every number, and every punctuation mark you’ll need. Set this alphabet soup to the font of your placeholder.

Print to PDF and delete this alphabet page.

Convert the PDF into a Template

Prepare the PDF for text editing with pdftk [Hack #79] like this (if you use gVim and our plug-in [Hack #82] to edit PDF, this step isn’t necessary):

               pdftk   
               .pdf output   
               .pdf uncompress

Open the results in your editor and search for your placeholder text. If you can’t find it, search on its page number—e.g., pageNum 5 — and then dig down [Hack #81] to find the page stream that has your placeholder. Distiller probably split it into pieces—e.g., textbeg_customer might end up as [(text)5(b)-1.7(eg_cust)5(o)-1.7(mer)].

Make a few changes to this page stream. First, repair your placeholder text so that grep can find it. So:

[(text)5(b)-1.7(eg_cust)5(o)-1.7(mer)]TJ

becomes:

[(textbeg_customer)]TJ

Or, if your string ends in Tj, such as this:

(Created for textbeg_customer on textbeg_date)Tj

rewrite it like this, adding square brackets and changing the Tj at the end to TJ:

[(Created for textbeg_customer on textbeg_date)]TJ

Next, isolate each placeholder on its own line, if necessary. So, the previous example becomes:

[(Created for )
(textbeg_customer)
( on )
(textbeg_date)]JT

Finally, pad the placeholders with asterisks (*). Add enough asterisks so that the placeholder is longer than any possible data you might write there. Padding the previous example would look like this:

[(Created for )
(textbeg_customer***********************************)
( on )
(textbeg_date**********************)]JT

Save and close your altered PDF.

This example PHP script, alter_pdf_text_example.php, opens a pdfsrc file, reads the offset data we added, and then serves the PDF. As it serves the PDF, it replaces the placeholders with the given text. Note how the replacement text is escaped using escape_pdf_string.

<?php
// alter_pdf_text_example.php, version 1.0
// http://www.pdfhacks.com/dynamic_text/

// the filename of the source PDF file, which 
// contains placeholders for our dynamic text
$pdfsrc_fn= './cover.pdfsrc';

// the data we will place into the PDF text;
$customer_text= "Mary Jane Doe";
$date_text= "March 15, 2004";

function escape_pdf_string( $ss )
{
  $ss_esc= '';
  $ss_len= strlen( $ss );
  for( $ii= 0; $ii< $ss_len; ++$ii ) {
    if( ord($ss{$ii})== 0x28 ||  // open paren
        ord($ss{$ii})== 0x29 ||  // close paren
        ord($ss{$ii})== 0x5c )   // backslash
      {
        $ss_esc.= chr(0x5c).$ss{$ii}; // escape the character w/ backslash
      }
    else if( ord($ss{$ii}) < 32 || 126 < ord($ss{$ii}) ) {
      $ss_esc.= sprintf( "\\%03o", ord($ss{$ii}) ); // use an octal code
    }
    else {
      $ss_esc.= $ss{$ii};
    }
  }
  return $ss_esc;
}

// open the source PDF file, which contains placeholders
$fp= @fopen( $pdfsrc_fn, 'r' );
if( $fp ) {

  if( $_GET['debug'] ) {
    header("Content-Type: text/plain"); // debug
  }
  else {
    header('Content-Type: application/pdf');
  }

  $pdf_offset= 0;
  $text_offsets= array( );

  // iterate over first lines of pdfsrc file to load $text_offsets;
  while( $cc= fgets($fp, 1024) ) {
    if( $cc{0}== '#' ) { // one of our comments
      list($comment, $name, $offset)= explode( '-', $cc );

      if( $name== 'customer' ) {
        $text_offsets[(int)$offset]= 
          escape_pdf_string( $customer_text );
      }
      else if( $name== 'date' ) {
        $text_offsets[(int)$offset]= 
          escape_pdf_string( $date_text );
      }
      else { // default
        $text_offsets[(int)$offset]= 
          escape_pdf_string( '[ERROR]' );
      }
    }
    else { // finished with our comments
      echo $cc;
      $pdf_offset= strlen($cc)+ 1;

      break;
    }
  }

  // sort by increasing offsets
  ksort( $text_offsets, SORT_NUMERIC );
  reset( $text_offsets );

  $output_text_line_b= false;
  $output_text_b= false;
  $closed_string_b= false;

  list( $offset, $text )= each( $text_offsets );
  $text_ii= 0;
  $text_len= strlen($text);

  // iterate over rest of file
  while( ($cc= fgetc($fp))!= "" ) {

    if( $output_text_line_b && $cc== '(' ) {
      // we have reached the beginning of our TEXT
      $output_text_line_b= false;
      $output_text_b= true;

      echo '(';
    }
    else if( $output_text_b ) {
      if( $cc== ')' ) { // finished with this TEXT
        if( $closed_string_b ) {
          // string has already been capped; pad
          echo ' ';
        }
        else {
          echo ')';
        }

        // get next offset/TEXT pair
        list( $offset, $text )= each( $text_offsets );
        $text_ii= 0;
        $text_len= strlen($text);

        // reset
        $output_text_b= false;
        $closed_string_b= false;
      }
      else if( $text_ii< $text_len ) {
        // output one character of $text
        echo $text{$text_ii++};
      }
      else if( $text_ii== $text_len ) {
        // done with $text, so cap this string
        echo ')';
        $closed_string_b= true;
        $text_ii++;
      }
      else {
        echo ' '; // replace padding with space
      }
    }
    else {
      // output this character
      echo $cc;

      if( $offset== $pdf_offset ) {
        // we have reached a line in pdfsrc where
        // our TEXT should be; begin a lookout for '('
        $output_text_line_b= true;
      }
    }

    ++$pdf_offset;
  }

  fclose( $fp );
}
else { // file open failure
  echo 'Error: failed to open: '.$pdfsrc_fn;
}
?>

IndigoPerl users (see Section 6.2.2 in [Hack #74] ) can copy alter_pdf_text_example.php into C:\indigoperl\apache\htdocsdf_hacks along with a PDF template named cover.pdfsrc. Point your browser to http://localhost/pdf_hacks/alter_pdf_text_example.php, and a PDF should appear. All instances of textbeg_customer should be replaced with “Mary Jane Doe,” and all instances of textbeg_date should be replaced with “March 15, 2004.” Naturally, you will need to adapt this script to your own purposes.

Use HTML to Create PDF

Format your content in HTML and then transform it into PDF.

HTML pages are easy to create on the fly. PDF pages are hard. One simple way to create dynamic PDF is to first create the document in HTML and then use HTMLDOC to transform it into PDF, as shown in Figure 6-15. This works for single pages and long documents.

HTMLDOC creates PDF documents from HTML 3.2 data. It provides document layout options, such as running headers and footers. It can add PDF features, such as bookmarks, links, metadata, and encryption. Invoke HTMLDOC from the command line or use its GUI. Visit http://www.easysw.com/htmldoc/software.php to download Windows binaries or source that can be compiled on Linux, Mac OS X, or a variety of other operating systems.

The detailed documentation that comes with HTMLDOC also is available online at http://www.easysw.com/htmldoc/documentation.php.

In Perl, you can automate PDF generation with HTMLDOC by using the HTML::HTMLDoc module to interface with HTMLDOC.

Use Perl to Create PDF

Create or modify PDF with a Perl script.

Many web sites use Perl for creating dynamic content. You can also use Perl to script Acrobat on your local machine [Hack #95] . Given the great number of packages that extend Perl, it is no surprise that packages exist for creating and manipulating PDF. Let’s take a look.

Install Perl and the PDF::API2 Package on Windows

[Hack #95] explains how to install Perl on Windows. After installing Perl, use the Perl Package Manager to easily install the PDF::API2 package.

Launch the Programmer’s Package Manager (PPM, formerly called Perl Package Manager) by selecting Start Programs ActiveState ActivePerl 5.8 Perl Package Manager. A command prompt will open with its ppm> prompt awaiting your command. Type help to see a list of commands. Type search pdf to see a list of available packages. To install PDF::API2, enter install pdf-api2. The Package Manager will fetch the package from the Internet and install it on your machine. The entire session looks something like this:

PPM - Programmer's Package Manager version 3.1.
Copyright (c) 2001 ActiveState SRL. All Rights Reserved.

Entering interactive shell. Using Term::ReadLine::Stub as readline library.

Type 'help' to get started.

ppm> install pdf-api2
====================
Install 'pdf-api2' version 0.3r77 in ActivePerl 5.8.3.809.
====================
Transferring data: 74162/1028845 bytes.

...

Installing C:\Perl\site\lib\PDF\API2\CoreFont\verdanaitalic.pm
Installing C:\Perl\site\lib\PDF\API2\CoreFont\webdings.pm
Installing C:\Perl\site\lib\PDF\API2\CoreFont\wingdings.pm
Installing C:\Perl\site\lib\PDF\API2\CoreFont\zapfdingbats.pm
Installing C:\Perl\site\lib\PDF\API2\Chart\Pie.pm
Successfully installed pdf-api2 version 0.3r77 in ActivePerl 5.8.3.809.
ppm> quit

The PDF::API2 package is used widely to create and manipulate PDF. You can download documentation and examples from http://pdfapi2.sourceforge.net/dl/.

CPAN (http://www.cpan.org) is the Comprehensive Perl Archive Network, where you will find “All Things Perl.” Visit http://search.cpan.org to discover several other PDF packages. Drill down to find details, documentation, and downloads. For example, PDF::Extract (http://search.cpan.org/~nsharrock/) creates a new PDF from the pages of a larger, input PDF.

Generate PDF from within your PHP script.

A number of libraries enable you to create PDF using PHP. The standard PHP documentation includes a PDF Functions section that describes the popular PDFlib module (http://www.pdflib.com). However, this PDF extension is not free software. Typically, you must purchase a license and then recompile PHP to take advantage of these functions.

Consider some of these free alternatives. They are native PHP, so they are easy to install; just include one in your script.

R&OS PDF-PHP

With the R&OS PDF-PHP library (http://www.ros.co.nz/pdf/), you can add text, bitmaps, and drawings to new PDF pages. Formatting includes running headers and footers, multicolumn layout, and tables. PDF features include page labels, links, and encryption. Programming features include callbacks and transactions.

<?php // hello world with R&OS, from readme.pdf
include ('class.ezpdf.php');

$pdf =& new Cezpdf( );
$pdf->selectFont('./fonts/Helvetica.afm');
$pdf->ezText('Hello World!', 50);
$pdf->ezStream( );
?>

FPDF (http://www.fpdf.org) enables you to add text, bitmaps, lines, and rectangles to new PDF pages. Formatting includes running headers and footers, multicolumn layout, and tables. PDF features include metadata and links. The home page provides an active user forum and user-contributed scripts. The following PHP script produces the PDF document shown in Figure 6-17.

<?php // hello world with FPDF, adapted from the tutorial for IndigoPerl users
define('FPDF_FONTPATH','C:\\indigoperl\\apache\\htdocs\\pdf_hacks\\fpdf\\font\\');
require('C:\\indigoperl\\apache\\htdocs\\pdf_hacks\\fpdf\\fpdf.php');

$pdf=new FPDF( );
$pdf->AddPage( );
$pdf->SetFont('Arial','B',16);
$pdf->Cell(40,10,'Hello World!');
$pdf->Output( );
?>

Generate PDF from within your Java program.

When you’re programming in Java, the free iText library should serve most of your dynamic PDF needs. Not only can you create PDF, but you can also read existing PDFs and incorporate their pages into your new document. Visit http://www.lowagie.com/iText/ for documentation and downloads. You can download an alternative, development branch from http://itextpdf.sourceforge.net.

Tip

Another free PDF library written in Java is Etymon’s PJX (http://etymon.com/epub.html). It supports reading, combining, manipulating, and writing PDF documents.

When creating a new PDF, iText can add text, bitmaps, and drawings. Formatting includes running headers and footers, lists, and tables. PDF features include page labels, links, encryption, metadata, bookmarks, and annotations. Programming features include callbacks and templates.

You can also use iText to manipulate existing PDF pages. For example, you can combine pages to create a new document [Hack #89] or use a PDF page as the background for your new PDF [Hack #90] .

Here is Hello World! using iText:

// Hello World in Java using iText, adapted from the iText tutorial
import java.io.FileOutputStream;
import java.io.IOException;

// the iText imports
import com.lowagie.txt.*;
import com.lowagie.text.pdf.PdfWriter;

public class HelloWorld {
  public static void main(String[] args) {
    Document pdf= new Document( );
    PdfWriter.getInstance(pdf, new FileOutputStream("HelloWorld.pdf"));
    pdf.open( );
    pdf.add(new Paragraph("Hello World!"));
    pdf.close( );
  }
}

Collate an online document at serve-time.

Imagine that you have a travel web site. A user visits to learn what packages you offer. She enters her preferences and tastes into an online form and your site returns several suggestions. Now, take this scenario to the next level. Create a custom PDF report based on these suggestions by assembling your literature into a single document. She can download and print this report, and the full impact of your literature is preserved. She can share it with her friends, read it in a comfortable chair, and leave it on her desk as a reminder to follow up—a personal touch with professional execution.

Assembling PDFs into a single document should be easy, and it is. In Java use iText. Elsewhere, use our command-line pdftk [Hack #79] .

Assemble Pages in Java with iText

If your web site runs Java, consider using the iText library (http://www.lowagie.com/iText/) to assemble PDF documents. The following code demonstrates how to use iText to combine PDF pages. Compile and run this Java program from the command-line, or use its code in your Java application:

/*
  concat_pdf, version 1.0, adapted from the iText tools
  concatenate input PDF files and write the results into a new PDF
  http://www.pdfhacks.com/concat/

  This code is free software. It may only be copied or modified
  if you include the following copyright notice:

  This class by Mark Thompson. Copyright (c) 2002 Mark Thompson.

  This code is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 */

import java.io.*;

import com.lowagie.text.*;
import com.lowagie.text.pdf.*;

public class concat_pdf extends java.lang.Object {
    
  public static void main( String args[] ) {
    if( 2<= args.length ) {
      try {
        int input_pdf_ii= 0;
        String outFile= args[ args.length-1 ];
        Document document= null;
        PdfCopy writer= null;

        while( input_pdf_ii < args.length- 1 ) {
          // we create a reader for a certain document
          PdfReader reader= new PdfReader( args[input_pdf_ii] );
          reader.consolidateNamedDestinations( );

          // we retrieve the total number of pages
          int num_pages= reader.getNumberOfPages( );
          System.out.println( "There are "+ num_pages+ 
                              " pages in "+ args[input_pdf_ii] );
                    
          if( input_pdf_ii== 0 ) {
            // step 1: creation of a document-object
            document= new Document( reader.getPageSizeWithRotation(1) );

            // step 2: we create a writer that listens to the document
            writer= new PdfCopy( document, new FileOutputStream(outFile) );

            // step 3: we open the document
            document.open( );
          }

          // step 4: we add content
          PdfImportedPage page;
          for( int ii= 0; ii< num_pages; ) {
            ++ii;
            page= writer.getImportedPage( reader, ii );
            writer.addPage( page );
            System.out.println( "Processed page "+ ii );
          }

          PRAcroForm form= reader.getAcroForm( );
          if( form!= null ) {
            writer.copyAcroForm( reader );
          }

          ++input_pdf_ii;
        }

        // step 5: we close the document
        document.close( );
      }
      catch( Exception ee ) {
        ee.printStackTrace( );
      }
    }
    else { // input error
      System.err.println("arguments: file1 [file2 ...] destfile");
    }
  }
}

To create a command-line Java program, copy the preceding code into a file named concat_pdf.java. Then, compile concat_pdf.java using javac, setting the classpath to the name and location of your iText jar:

               javac -classpath   
                concat_pdf.java

Finally, invoke concat_pdf to combine PDF documents, like so:

               java -classpath   
               :. \
               concat_pdf 

Consider some of these other free tools for assembling PDF:

Merge your PDF pages with a background letterhead, form, or watermark.

Sometimes it makes sense to divide document creation into layers. For example, you need to create an invoice’s background form, with its logo and rules, only once. You can create the invoice data dynamically as needed and then superimpose it on this form to yield the final invoice.

Perform this final merge in Java with iText, or elsewhere with our command-line pdftk [Hack #79] , producing the results in Figure 6-18.

iText (http://www.lowagie.com/iText/) is a powerful library for creating and manipulating PDF. The following Java program uses iText to apply one watermark PDF page to every page in a document. This watermark page can be any PDF page, such as a company letterhead design or an invoice form. The watermark will appear as though it is behind each page’s content. Compile and run this program, or use its code in your Java application.

/*
  watermark_pdf, version 1.0
  http://www.pdfhacks.com/watermark/

  place a single "watermark" PDF page "underneath" all
  pages of the input document

  after compiling, invoke from the command line like this:

    java -classpath ./itext-paulo.jar:. \
    watermark_pdf doc.pdf watermark.pdf output.pdf

  only the first page of watermark.pdf is used
*/

import java.io.*;
import com.lowagie.text.*;
import com.lowagie.text.pdf.*;

public class watermark_pdf {

  public static void main( String[] args ) {
    if( args.length== 3 ) {
      try {
        // the document we're watermarking
        PdfReader document= new PdfReader( args[0] );
        int num_pages= document.getNumberOfPages( );

        // the watermark (or letterhead, etc.)
        PdfReader mark= new PdfReader( args[1] );
        Rectangle mark_page_size= mark.getPageSize( 1 );

        // the output document
        PdfStamper writer= 
          new PdfStamper( document, 
                          new FileOutputStream( args[2] ) );

        // create a PdfTemplate from the first page of mark
        // (PdfImportedPage is derived from PdfTemplate)
        PdfImportedPage mark_page=
          writer.getImportedPage( mark, 1 );

        for( int ii= 0; ii< num_pages; ) {
          // iterate over document's pages, adding mark_page as
          // a layer 'underneath' the page content; scale mark_page
          // and move it so that it fits within the document's page;
          // if document's page is cropped, this scale might
          // not be small enough

          ++ii;
          Rectangle doc_page_size= document.getPageSize( ii );
          float h_scale= doc_page_size.width( )/mark_page_size.width( );
          float v_scale= doc_page_size.height( )/mark_page_size.height( );
          float mark_scale= (h_scale< v_scale) ? h_scale :  v_scale;

          float h_trans= (float)((doc_page_size.width( )- 
                                  mark_page_size.width( )* mark_scale)/2.0);
          float v_trans= (float)((doc_page_size.height( )- 
                                  mark_page_size.height( )* mark_scale)/2.0);
          
          PdfContentByte contentByte= writer.getUnderContent( ii );
          contentByte.addTemplate( mark_page, 
                                   mark_scale, 0, 
                                   0, mark_scale, 
                                   h_trans, v_trans );
        }

        writer.close( );
      }
      catch( Exception ee ) {
        ee.printStackTrace( );
      }
    }
    else { // input error
      System.err.println("arguments: in_document in_watermark out_pdf_fn");
    }
  }
}

To create a command-line Java program, copy the preceding code into a file named watermark_pdf.java. Then, compile watermark_pdf.java using javac, setting the classpath to the name and location of your iText jar:

               javac -classpath   
                watermark_pdf.java

Finally, invoke watermark_pdf to apply the first page of form.pdf to every page of invoice.pdf to create watermarked.pdf, like so:

               java -classpath   
               :. \
               watermark_pdf 

Produce PDF documents for XML documents styled with CSS using YesLogic Prince.

YesLogic (http://www.yeslogic.com) of Melbourne, Australia, offers an extremely simple little tool for converting XML documents styled with CSS into PDF or PostScript. It’s called Prince and it’s currently at Version 3.0. It runs on Windows or Red Hat Linux (Versions 7.3 and 8.0). Prince comes with a set of examples, default stylesheets, and DTDs.

You can download a free demo version from http://yeslogic.com/prince/demo/. This demo is fully featured but outputs the word Demo in an outline font across every page it creates. If you like it, you can purchase a copy from http://yeslogic.com/prince/purchasing/.

This simple XML document represents a time:

<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true" symbol="&#x25D1;"/>
</time>

The CSS stylesheet provides detailed formatting information for it:

time {font-size:40pt; text-align: center }
time:before {content: "The time is now: "}
hour {font-family: sans-serif; color: gray}
hour:after {content: ":"; color: black}
minute {font-family: sans-serif; color: gray}
minute:after {content: ":"; color: black}
second {font-family: sans-serif; color: gray}
second:after {content: " "; color: black}
meridiem {font-variant: small-caps}

After downloading and installing Prince, open the application and follow these steps:

Figure 6-20 shows you time.pdf in Adobe Reader Version 6.0.

Michael Fitzgerald

Use Apache’s FOP engine together with XSL-FO to generate PDF output.

Apache’s Formatting Objects Processor (FOP, available at http://xml.apache.org/fop/) is an open source Java application that reads an XSL-FO (http://www.w3.org/TR/xsl/) tree and renders the result primarily as PDF, although other formats are possible, such as Printer Control Language (PCL), PostScript (PS), Scalable Vector Graphics (SVG), an area tree representation of XML, Java Abstract Windows Toolkit (AWT), FrameMaker’s Maker Interchange Format (MIF), and text.

XSL-FO defines formatting objects that help describe blocks, paragraphs, pages, tables, and such. These formatting objects are aided by a large set of formatting properties that control things such as fonts, text alignment, spacing, and the like, many of which match the properties used in CSS (http://www.w3.org/Style/CSS/). XSL-FO’s formatting objects and properties provide a framework for creating attractive, printable pages.

XSL-FO is a huge, richly detailed XML vocabulary for formatting documents for presentation. XSL-FO is the common name for the XSL specification produced by the W3C. The spec is nearly 400 pages long. At one time, XSL-FO and XSLT (finished spec is less than 100 pages) were part of the same specification, but they split into two specs in April 1999. XSLT became a recommendation in November 1999, but XSL-FO did not achieve recommendation status until October 2001.

To get you started, we’ll go over a few simple examples. The first example, time.fo, is a XSL-FO document that formats the contents of the elements in time.xml:

<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
 <fo:layout-master-set>
  <fo:simple-page-master master-reference="Time" page-height="11in"
      page-width="8.5in" margin-top="1in" margin-bottom="1in"
      margin-left="1in" margin-right="1in">
   <fo:region-body margin-top=".5in"/>
   <fo:region-before extent="1.5in"/>
   <fo:region-after extent="1.5in"/>
  </fo:simple-page-master>
 </fo:layout-master-set>
 <fo:page-sequence master-name="Time">
  <fo:flow flow-name="xsl-region-body">

 <!-- Heading -->
 <fo:block font-size="24px" font-family="sans-serif" line-height="26px"
   space-after.optimum="20px" text-align="center" font-weight="bold"
  color="#0050B2">Time</fo:block>

 <!-- Blocks for hour/minute/second/atomic status -->
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
   space-after.optimum="10px" text-align="start">Hour: 11 </fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
   space-after.optimum="10px" text-align="start">Minute: 59</fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
   space-after.optimum="10px" text-align="start">Second: 59</fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
   space-after.optimum="10px" text-align="start">Meridiem: p. m.</fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
   space-after.optimum="10px" text-align="start">Atomic? true</fo:block>

  </fo:flow>
 </fo:page-sequence>
</fo:root>

FOP is pretty easy to use. To generate a PDF from this XSL-FO file, download and install FOP from http://xml.apache.org/fop/download.html. At the time of this writing, FOP is at Version 20.5. In the main directory, you’ll find a fop.bat file for Windows or a fop.sh file for Unix. You can run FOP using these scripts.

To create a PDF from time.fo, enter this command:

               fop time.fo time-fo.pdf

time.fo is the input file, and time-fo.pdf is the output file. FOP will let you know of its progress with a report such as this:

[INFO] Using org.apache.xerces.parsers.SAXParser as SAX2 Parser
[INFO] FOP 0.20.5
[INFO] Using org.apache.xerces.parsers.SAXParser as SAX2 Parser
[INFO] building formatting object tree
[INFO] setting up fonts
[INFO] [1]
[INFO] Parsing of document complete, stopping renderer

Figure 6-21 shows the result of formatting time.fo with FOP in Adobe Acrobat.

You also can incorporate XSL-FO markup into an XSLT stylesheet, then transform and format a document with just one FOP command. Here is a stylesheet (time-fo.xsl) that incorporates XSL-FO:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="xml" encoding="utf-8" indent="yes"/>

<xsl:template match="/">
<fo:root>
 <fo:layout-master-set>
  <fo:simple-page-master master-reference="Time" page-height="11in"
      page-width="8.5in" margin-top="1in" margin-bottom="1in"
      margin-left="1in" margin-right="1in">
   <fo:region-body margin-top=".5in"/>
   <fo:region-before extent="1.5in"/>
   <fo:region-after extent="1.5in"/>
  </fo:simple-page-master>
 </fo:layout-master-set>
 <fo:page-sequence master-name="Time">
  <fo:flow flow-name="xsl-region-body">
   <xsl:apply-templates select="time"/>
  </fo:flow>
 </fo:page-sequence>
</fo:root>
</xsl:template>

<xsl:template match="time">
 <!-- Heading -->
 <fo:block font-size="24px" font-family="sans-serif" line-height="26px"
     space-after.optimum="20px" text-align="center" font-weight="bold"
     color="#0050B2">
     Time
 </fo:block>

 <!-- Blocks for hour/minute/second/atomic status -->
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
     space-after.optimum="10px" text-align="start">
     Hour: <xsl:value-of select="hour"/>
 </fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
     space-after.optimum="10px" text-align="start">
     Minute: <xsl:value-of select="minute"/>
 </fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
     space-after.optimum="10px" text-align="start">
     Second: <xsl:value-of select="second"/>
 </fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
     space-after.optimum="10px" text-align="start">
     Meridiem: <xsl:value-of select="meridiem"/>
 </fo:block>
 <fo:block font-size="12px" font-family="sans-serif" line-height="16px"
     space-after.optimum="10px" text-align="start">
     Atomic? <xsl:value-of select="atomic/@signal"/>
 </fo:block>
</xsl:template>

</xsl:stylesheet>

The same XSL-FO markup you saw in time.fo is interspersed with templates and instructions that transform time.xml. Now with this command, you can generate a PDF like the one you generated with time.fo:

               Fop -xsl time-fo.xsl -xml time.xml -pdf time-fo.pdf

For more information on these technologies, see Michael Fitzgerald’s Learning XSLT (O’Reilly) or Dave Pawson’s XSL-FO (O’Reilly).

Michael Fitzgerald