Chapter 5. Manipulating PDF Files

Introduction: Hacks #51-73

A lot of people think of PDFs as frozen files, printed once and then impossible to modify. That isn’t the case, however! Whether you have Adobe Acrobat or not, there are lots of ways to manipulate PDF files: breaking them up, making their file sizes smaller, encrypting and decrypting them, and presenting them to users in different ways.

Split and Merge PDF Documents (Even Without Acrobat)

You can create new documents from existing PDF files by breaking the PDFs into smaller pieces or combining them with information from other PDFs.

As a document proceeds through its lifecycle, it can undergo many changes. It might be assembled from individual sections and then compiled into a larger report. Individual pages might be copied into a personal reference document. Sections might be replaced as new information becomes available. Some documents are agglomerations of smaller pieces, like an expense report with all of its lovely and easily lost receipts.

While it’s easy to manipulate paper pages by hand, you must use a program to manipulate PDF pages. Adobe Acrobat can do this for you, but it is expensive. Other commercial products, such as pdfmeld from FyTek (http://www.fytek.com), also provide this basic functionality. The pdftk PDF toolkit [Hack #79] is a free software alternative.

Quickly Combine Pages in Acrobat

In Acrobat 6, select File → Create PDF → From Multiple Files . . . . Click the Browse . . . button (Choose . . . on a Macintosh) to open a file selector. You can select multiple files at once. On Windows, you can select a variety of file types, including Microsoft Office documents. Arrange the files into the desired order and click OK.

To quickly combine two PDF documents using Acrobat 5, begin by opening the first PDF in Acrobat. In the Windows File Explorer, select the PDF you want to append, drag it over the PDF open in Acrobat, and then drop it. A dialog will open, asking where you want to insert the PDF. Select After Last Page and it will be appended to the first PDF.

If you have a folder of PDF files to combine and their order in the Windows File Explorer is the order you want in the final document, begin by opening the first PDF in Acrobat 5. Next, in the File Explorer, select the remaining PDFs to merge. Finally, click the first PDF in this selection, as shown in Figure 5-1, drag the selection over the PDF currently open in Acrobat, and then drop it. A dialog will open, asking where you want to insert these PDFs. Select After Last Page and they will be appended to the first PDF. Review the document to ensure the PDFs were appended in the correct order.

pdftk is a command-line tool for doing everyday things with PDF documents. It can combine PDF documents into a single document or split individual pages out into a new PDF document. Read [Hack #79] to install pdftk and our handy command-line shortcut. pdftk is free software.

Open a command prompt and then change the working directory to the folder that holds the input PDF files. Or, you can open a handy command line by right-clicking the folder that holds your input PDF files and selecting Command from the context menu.

To combine pages into one document, invoke pdftk like so:

pdftk <input PDF files> cat [<input PDF pages>] output <output PDF filename>

A couple of quick examples give you the flavor of it. Here is an example of combining the first page of in2.pdf, the even pages in in1.pdf, and then the odd pages of in1.pdf to create a new PDF named out.pdf:

pdftk A=in1.pdf B=in2.pdf cat B1 A1-endeven A1-endodd output out.pdf

Here is an example of combining a folder of documents to create a new PDF named combined.pdf. The documents will be ordered alphabetically:

pdftk *.pdf cat output combined.pdf

Now, let’s dig into the parameters:

<input PDF pages>

You can see from these examples that page ranges also specify the output page order. Notice the keyword end, which refers to the final page in a PDF.

Specify a sequence of page ranges like so:

A1 B1-end C5

When combining all the input PDF documents in their given order, you can omit the <input PDF pages> section.

<output PDF filename>

The output PDF filename must be different from any of the input filenames.

If any of the input files are encrypted, you will need to supply their owner passwords [Hack #52].
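If you drive pdftk from a script, it can help to assemble the command line in one place and check the rules above before running anything. The following Python sketch does that. The function name pdftk_cat_command is ours, not part of pdftk, and the single-uppercase-letter rule for handles is an assumption based on the examples in this hack.

```python
import string


def pdftk_cat_command(inputs, page_ranges, output):
    """Build the argument list for a pdftk "cat" call.

    inputs      -- dict mapping a handle (A, B, ...) to an input PDF filename
    page_ranges -- list of page-range strings such as "B1" or "A1-endeven";
                   an empty list means "all inputs, in their given order"
    output      -- output PDF filename
    """
    # The output filename must differ from every input filename.
    if output in inputs.values():
        raise ValueError("output must not be one of the input files")
    for handle in inputs:
        # Assumption: a handle is one uppercase letter, as in the examples.
        if len(handle) != 1 or handle not in string.ascii_uppercase:
            raise ValueError(f"bad handle: {handle!r}")
    for page_range in page_ranges:
        # Every range must begin with a handle that names an input file.
        if page_range[:1] not in inputs:
            raise ValueError(f"unknown handle in range: {page_range!r}")

    args = ["pdftk"]
    args += [f"{handle}={name}" for handle, name in inputs.items()]
    args.append("cat")
    args += page_ranges
    args += ["output", output]
    return args


command = pdftk_cat_command(
    {"A": "in1.pdf", "B": "in2.pdf"},
    ["B1", "A1-endeven", "A1-endodd"],
    "out.pdf",
)
print(" ".join(command))
# prints: pdftk A=in1.pdf B=in2.pdf cat B1 A1-endeven A1-endodd output out.pdf
```

Hand the resulting list to subprocess.run(command, check=True) to execute it. Passing a list rather than a single string keeps filenames with spaces intact.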

Encrypt and Decrypt PDF (Even Without Acrobat)

Restrict who can view your PDF and how they can use it.

You can use PDF encryption to lock a file’s content behind a password, but more often it is used to enforce lighter restrictions imposed by the author. For example, the author might permit printing pages but prohibit making changes to the document. Here, we continue from [Hack #2] and explain how pdftk [Hack #79] can encrypt and decrypt PDF documents. We’ll begin by describing the Acrobat Standard Security model (called Password Security in Acrobat 6) and the permissions you can grant or revoke.

Set the user password if you don’t want people to see your PDF. If they don’t have the user password, it simply won’t open.

You also have some control over what people can do with your document once they have it open. The permissions associated with 128-bit security (Acrobat 5 and 6) are more precise than those associated with 40-bit security (Acrobat 3 and 4). Table 5-1 and Table 5-2 list all available permissions for each security model, and Figure 5-2 shows the permissions as seen through Acrobat. The tables also show the corresponding pdftk flags to use.

Comparing these two tables, you can see that Assembly is a weaker version of ModifyContents and FillIn is a weaker version of ModifyAnnotations.

DegradedPrinting sends pages to the printer as rasterized images, whereas Printing sends pages as PostScript. A PostScript stream can be intercepted and turned back into (unsecured) PDF, so the Printing permission is a security risk. However, DegradedPrinting reduces the clarity of printed pages, so you should test your document to make sure DegradedPrinting yields acceptable, printed pages.

After setting these permissions and/or a user password, changing them requires the owner password, if it is set.
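To make the pieces concrete, here is a small Python sketch that assembles a pdftk encryption command from a list of permissions and optional passwords. Treat it as an illustration rather than a reference: the function name is ours, and the set of permission names is taken from pdftk's documentation, so check it against the flags listed for your pdftk version.

```python
# Permission names as we understand them from pdftk's documentation
# (an assumption; verify against your own pdftk).
PERMISSIONS = {
    "Printing", "DegradedPrinting", "ModifyContents", "Assembly",
    "CopyContents", "ScreenReaders", "ModifyAnnotations", "FillIn",
    "AllFeatures",
}


def pdftk_encrypt_command(src, dst, allow=(), owner_pw=None, user_pw=None,
                          bits=128):
    """Build the argument list for encrypting src into dst with pdftk."""
    if src == dst:
        raise ValueError("output must differ from the input file")
    if bits not in (40, 128):
        raise ValueError("bits must be 40 or 128")
    unknown = sorted(set(allow) - PERMISSIONS)
    if unknown:
        raise ValueError(f"unknown permissions: {unknown}")

    args = ["pdftk", src, "output", dst, f"encrypt_{bits}bit"]
    if allow:
        # Anything not listed after "allow" stays forbidden.
        args += ["allow", *allow]
    if owner_pw is not None:
        args += ["owner_pw", owner_pw]
    if user_pw is not None:
        args += ["user_pw", user_pw]
    return args


command = pdftk_encrypt_command(
    "report.pdf", "report.secured.pdf",
    allow=["DegradedPrinting"], owner_pw="PROMPT",
)
print(" ".join(command))
# prints: pdftk report.pdf output report.secured.pdf encrypt_128bit allow DegradedPrinting owner_pw PROMPT
```

Using PROMPT in place of a password makes pdftk ask for it interactively, which keeps the password out of your shell history.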

Apply or remove encryption from a given PDF with a quick right-click.

[Hack #52] discussed how to apply or remove PDF encryption with pdftk [Hack #79]. Let’s streamline these security operations by adding handy Encrypt and Decrypt items to the PDF context menu. The encryption example simply applies a user password to the selected PDF, so nobody can open it without the password. The decryption example, when it succeeds, removes all (Standard) security.

Add the Encrypt PDF Context Menu Item

Windows XP and Windows 2000:

  1. In the Windows File Explorer menu, select Tools → Folder Options . . . and click the File Types tab. Select the PDF file type and click the Advanced button.

  2. Click the New . . . button and a New Action dialog appears. Give the new action the name Encrypt.

  3. Give the action an application to open by clicking the Browse . . . button and selecting cmd.exe, which lives somewhere such as C:\windows\system32\ (Windows XP) or C:\winnt\system32\ (Windows 2000).

  4. Add these arguments after cmd.exe, changing the path to suit, like so:

    C:\windows\system32\cmd.exe
    /C C:\windows\system32\pdftk.exe "%1" output "%1.encrypted.pdf" 
                         encrypt_128bit user_pw PROMPT

  5. Click OK, OK, OK and you should be done with the configuration.

Include live data that your readers can unpack and use.

PDF provides a convenient package for your document. A typical PDF contains fonts, images, page streams, annotations, and metadata. It turns out that you can pack anything into a PDF file, even the source document used to create the PDF! These attachments enjoy the benefits of PDF features such as compression, encryption, and digital signatures. Attachments also enable you to provide your readers with document data, such as tables, in a native file format that they can easily use. People often ask how to extract table data from a PDF [Hack #7]. Attach your document data as HTML or Excel files and give your readers exactly what they need.

This hack explains how to attach files to your PDF. [Hack #55] goes on to describe how to quickly extract your document’s tables for PDF attachment.

Page Attachments Versus Document Attachments

You can attach a file to a particular PDF page, where it is visible as an icon. Or, you can attach a file to the PDF document so that it keeps a lower profile. After encrypting your PDF, document attachments can’t be unpacked without the ModifyAnnotations permission [Hack #52]. Page attachments, on the other hand, can be unpacked at any time, regardless of the security permissions you imposed. Of course, the PDF must be opened first, which could require a user password.

When you encrypt a PDF, you also encrypt its attachments. The permissions you apply can affect whether users can unpack these attachments. See [Hack #52] for details on how to apply encryption using pdftk.

Once the PDF is open in Acrobat/Reader (which might require a password), any files attached to PDF pages can be unpacked, regardless of the PDF’s permissions. This enables you to disable copy/paste features, yet still make select data available to your readers.

Document attachments are more restricted than page attachments. You must grant the ModifyAnnotations permission if you want your readers to be able to unpack and view document attachments.

Pack your document’s essential information into its PDF edition.

Readers copy data from PDF documents to use in their own documents or spreadsheets. Tables usually contain the most valuable data, yet they are the most difficult to extract from a PDF [Hack #7]. Give readers what they need, as shown in Figure 5-4, by automatically extracting tables from your source document, converting them into an Excel spreadsheet, and then attaching them to your PDF.

In Microsoft Word, use the following macro to copy a document’s tables into a new document. Create the macro like so.

Open the Macros dialog box (Tools → Macro → Macros . . . ). Type CopyTablesIntoNewDocument into the “Macro name:” field, set “Macros in:” to Normal.dot, and click Create.

A window will open where you can enter the macro’s code. It will already contain two lines of code: Sub CopyTablesIntoNewDocument() and End Sub. You don’t need to duplicate these lines.

You can download the following code from http://www.pdfhacks.com/copytables/:

Sub CopyTablesIntoNewDocument( )
' version 1.0
' http://www.pdfhacks.com/copytables/

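' declare every variable with an explicit type
' (a bare "Dim SrcDoc, NewDoc As Document" leaves SrcDoc as a Variant)
Dim SrcDoc As Document, NewDoc As Document
Dim SrcDocTable As Table
Dim SrcDocTableRange As Range, NewDocRange As Range
Dim PPWords As Words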

Set SrcDoc = ActiveDocument
If SrcDoc.Tables.Count <> 0 Then
    
    Set NewDoc = Documents.Add(DocumentType:=wdNewBlankDocument)
    Set NewDocRange = NewDoc.Range
    Dim PrevPara As Range
    Dim NextPara As Range
    Dim NextEnd As Long
    NextEnd = 0
        
    For Each SrcDocTable In SrcDoc.Tables
        Set SrcDocTableRange = SrcDocTable.Range
       
        'output the preceding paragraph?
        Set PrevPara = SrcDocTableRange.Previous(wdParagraph, 1)
        'VBA's Or does not short-circuit, so test for Nothing separately
        If Not PrevPara Is Nothing Then
            If PrevPara.Start >= NextEnd Then 'not already copied
                Set PPWords = PrevPara.Words
                If PPWords.Count > 1 Then 'yes
                    NewDocRange.Start = NewDocRange.End
                    NewDocRange.InsertParagraphBefore

                    NewDocRange.Start = NewDocRange.End
                    NewDocRange.InsertParagraphBefore
                    NewDocRange.FormattedText = PrevPara.FormattedText
                End If
            End If
        End If
            
        'output the table
        NewDocRange.Start = NewDocRange.End
        NewDocRange.FormattedText = SrcDocTableRange.FormattedText
        
        'output the following paragraph?
        Set NextPara = SrcDocTableRange.Next(wdParagraph, 1)
        If NextPara Is Nothing Then
        Else
            Set PPWords = NextPara.Words
            NextEnd = NextPara.End
            If PPWords.Count > 1 Then 'yes
                NewDocRange.Start = NewDocRange.End
                NewDocRange.InsertParagraphBefore
                NewDocRange.FormattedText = NextPara.FormattedText
            End If
        End If
     
     Next SrcDocTable
End If
End Sub

Run this macro from Word by selecting Tools → Macro → Macros . . . , selecting CopyTablesIntoNewDocument, and clicking Run. A new document will open that contains all the tables from your current document. It will also include the paragraphs immediately before and after each table, which helps readers find the table they want. Modify the macro code to suit your requirements.

See [Hack #54] for the detailed procedure. Speed up attachments with quick attachment actions [Hack #56].

Pack or unpack PDF attachments from the Windows File Explorer with a quick right-click.

It’s best to perform simple tasks in a simple manner, especially when you must perform them often. Wire pdftk [Hack #79] into Windows Explorer so that you can pack or unpack attachments using PDF’s right-click context menu.

Create the Attach File Context Menu Item

In Windows XP and Windows 2000:

  1. In the Windows File Explorer menu, open Tools → Folder Options . . . and click the File Types tab. Select the PDF file type and click the Advanced button.

  2. Click the New . . . button and a New Action dialog appears. Give the new action the name Attach File.

  3. Give the action an application to open by clicking the Browse . . . button and selecting cmd.exe, which lives somewhere such as C:\windows\system32\ (Windows XP) or C:\winnt\system32\ (Windows 2000).

  4. Add these arguments after cmd.exe, changing the path to suit, like so:

    C:\windows\system32\cmd.exe
    /C C:\windows\system32\pdftk.exe "%1" attach_file PROMPT output PROMPT

  5. Click OK, OK, OK and you should be done with the configuration.

Add a search feature to your print edition.

Creating a good document Index section is a difficult job performed by professionals. However, an automatically generated index still can be very helpful. Use automatic keywords [Hack #19] or select your own keywords. This hack will locate their pages, build a reference, and then create PDF pages that you can append to your document, as shown in Figure 5-5. It even uses your PDF’s page labels (also known as logical page numbering) to ensure trouble-free lookup.

Download and install pdftotext [Hack #19], our kw_index [Hack #19], and pdftk [Hack #79]. You must also have enscript (Windows users visit http://gnuwin32.sf.net/packages/enscript.htm) and ps2pdf. ps2pdf comes with Ghostscript [Hack #39]. Our kw_index package includes the kw_catcher and page_refs programs (and source code) that we use in the following sections.

Pass the name of your PDF document and the kw_catcher window size to make_index.sh like so:

make_index.sh mydoc.pdf <window size>

The script will create a document index named mydoc.index.pdf. Review this index and append it to your PDF document [Hack #51] if you desire. The script also creates two intermediate files: mydoc.data.txt and mydoc.txt. If the PDF index is faulty, review these intermediate files for problems. Delete them when you are satisfied with the PDF index.

The second argument to make_index.sh controls the keyword detection sensitivity. Smaller numbers yield fewer keywords at the risk of omitting some keywords; larger numbers admit more keywords and also more noise. [Hack #19] discusses this parameter and the kw_catcher program that uses it.

When distributing a PDF online, some vector drawings outweigh their usefulness.

Vector drawings yield the highest possible quality across all media. For simple illustrations such as charts and graphs, they are also more efficient than bitmaps. However, when preparing a PDF for online distribution, you will sometimes find an intricate vector drawing that has tripled your PDF’s file size. With Acrobat and Illustrator (or Photoshop), you can rasterize this detailed drawing in-place and reduce your PDF’s file size.

First, make a backup copy of your PDF so that you can go back to where you started at any time.

Open your PDF in Acrobat and locate the drawing you want to rasterize. Activate the TouchUp Object tool (Tools → Advanced Editing → TouchUp Object, in Acrobat 6) and try to select the drawing. This usually requires patience and experimentation because one illustration might use dozens of separate drawing objects, and it is usually tangled with other items on the page that you don’t want to rasterize.

First, try dragging out a selection rectangle that encloses the artwork. If other, unwanted items get caught in your dragnet, try dropping them from your selection by holding down the Shift key and clicking them. If you missed items that you wanted to select, you can add them the same way: Shift-click. The Shift key is a useful way to incrementally add or remove items from your current selection. You can even hold down the Shift key while dragging out a selection rectangle. Items in the rectangle will be toggled in or out of the current selection, depending on their previous state.

If you accidentally move an item, immediately press Ctrl-Z (Edit → Undo) to restore it. If things ever get messed up, close the PDF without saving it and reopen it to start again.

Aggressive page cropping ensures maximum on-screen clarity.

When viewing a PDF in Reader or Acrobat, the page is often scaled to fit its width or its height into the viewer window. This means you can make page content appear larger on-screen by cropping away excess page margins, as shown in Figure 5-6. Cropping has no effect on the printed page’s scale, but it might alter the content’s position on the printed page.

Acrobat’s cropping tool can remove excess page margins. Use it in combination with our freely available BBOX Acrobat plug-in. These two tools make it easy to find the best cropping for a page and then apply this cropping to the entire document.
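How much larger does the content get? When the viewer fits the page width to the window, the magnification is inversely proportional to the visible page width, so the gain is simply the old width divided by the cropped width. Here is a quick Python sketch of that arithmetic. The function name and the example margins are ours, chosen for illustration.

```python
def fit_width_gain(page_width_pt, crop_left_pt, crop_right_pt):
    """Return how much larger content appears in Fit Width mode after
    cropping the given amounts (in points) from the left and right."""
    visible = page_width_pt - crop_left_pt - crop_right_pt
    if visible <= 0:
        raise ValueError("cropping removes the entire page width")
    return page_width_pt / visible


# A US Letter page is 612 points wide; crop one inch (72 points) per side.
gain = fit_width_gain(612, 72, 72)
print(round(gain, 2))  # prints 1.31
```

In this example, trimming one-inch side margins makes the text about 31% larger on-screen at the same window size. The same ratio applies to page height when the viewer fits the whole page.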

Run your assembled PDF through Acrobat Distiller to reduce its file size. In Acrobat 6, try PDF Optimizer.

You started with two or three PDFs, combined them, and then cropped them. Before going any further, consider running your assembled PDF through Distiller. This refrying can eliminate duplicate resources and ensure that your PDF is optimized for online reading. It also gives you a chance to improve your PDF’s compatibility with older versions of Acrobat and Reader. In Acrobat 6, you can conveniently refry a PDF without Distiller by using the PDF Optimizer feature. Even so, distilling a PDF can yield better results than the PDF Optimizer can.

Refrying traditionally has been done with a simple hack: reprinting the PDF out to Distiller, which creates a new PDF file.

The best time to refry a PDF using Distiller (as opposed to the PDF Optimizer) is after you have assembled it, but before you have added any PDF features. Here is the sequence I typically use when preparing a PDF for online distribution:

  1. Assemble the original PDF pages and Save As . . . to a new PDF.

  2. If page sizes are wildly irregular, crop them [Hack #59].

  3. Refry the original PDF document and compare the resulting refried PDF to the original. Adjust Distiller settings [Hack #42] as necessary and choose the best results.

  4. Crop and rotate the refried PDF pages as needed.

  5. If the original document had bookmarks or other PDF features, copy them back to the refried PDF [Hack #61].

  6. Add PDF features [Hack #63] or finishing touches [Hack #62].

  7. Save again using Save As . . . to compact the PDF. In Acrobat 6, save the final PDF by selecting File → Reduce File Size . . . and set the compatibility to Acrobat 5.

Restore bookmarks, annotations, and forms after refrying your PDF.

You just refried your high-fidelity PDF [Hack #60] to create a lightweight, online edition and Distiller burned off the nifty PDF navigation features and forms. Let’s combine the old PDF’s navigation features and forms with the new PDF’s pages to get a lightweight, interactive PDF. Here’s how:

Using the Replace Pages feature like this to separate the visible page from its interactive features can be pretty handy. If you ever need to transfer a page and its interactive features, use the Extract Pages, Insert Pages, and Delete Pages functions.

Little things can make a big difference to your readers.

Most creators don’t use these basic PDF features, yet they improve the reading experience and they are easy to add. Read [Hack #67] to learn about the additional features required for serving individual PDF pages on demand.

Open your PDF in Acrobat and go to page 1. If your PDF doesn’t have logical page numbering, Acrobat thinks its first page is “page 1.” Yet your document’s “page 1” might actually fall on Acrobat’s page 6, as shown in Figure 5-10. Imagine your readers trying to make sense of this, especially when your document refers them to page 52 or when they decide to print pages 42-47.

Synchronize your document’s page numbering with Acrobat/Reader by adding logical page numbers to your PDF. In Acrobat 6, select the Pages navigation tab (View → Navigation Tabs → Pages), click Options, and select Number Pages . . . from the drop-down menu. In Acrobat 5, access the Page Numbering dialog box by selecting Document → Number Pages . . . .

Start from the beginning of your document and work to the end, to minimize confusion. If numbering gets tangled up, reset the page numbers by selecting All Pages, Begin New Section, Style: 1, 2, 3, . . . , Prefix: (blank), Start: 1, and clicking OK.

If your document has a front cover, you can give it the logical page number Cover by setting its Style to None and giving it a Prefix of Cover.

Does your document have front matter with Roman-numeral page numbers? Advance through your document until you reach page 1. Go back to the page before page 1. Open the Page Numbering dialog. Set the page range From: field to 1. Set the Style to “i, ii, iii, . . . ,” and make sure Prefix is empty. Click OK. Now, the pages preceding page 1 should be numbered “i, ii, . . . ,” and page 1 should be numbered 1.

Go to the final page in your PDF and make sure the numbering still matches. Sometimes people remove blank pages from a PDF, which causes the document page numbers to skip. If you plan to remove blank pages from your PDF, apply logical page numbering beforehand.
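The numbering rules above boil down to a list of sections, each with a starting page, a style, a prefix, and a start number. Here is a rough Python sketch of how such sections turn into the labels your readers see. The function names, the tuple layout, and the style names ("decimal", "roman", "none") are ours, chosen for illustration; they are not Acrobat's own identifiers.

```python
def to_roman(number):
    """Return the lowercase Roman numeral for a positive integer."""
    if number <= 0:
        raise ValueError("Roman numerals start at 1")
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_labels(page_count, sections):
    """Compute one label per physical page.

    sections -- list of (first_page, style, prefix, start) tuples, sorted
                by first_page; first_page is the 1-based physical page
                where the section begins, and the first section must
                begin on page 1.
    """
    if not sections or sections[0][0] != 1:
        raise ValueError("the first section must begin on page 1")
    labels = []
    for index, (first_page, style, prefix, start) in enumerate(sections):
        # A section runs until the next one begins (or the document ends).
        if index + 1 < len(sections):
            last_page = sections[index + 1][0] - 1
        else:
            last_page = page_count
        for offset in range(last_page - first_page + 1):
            number = start + offset
            if style == "decimal":
                text = str(number)
            elif style == "roman":
                text = to_roman(number)
            elif style == "none":
                text = ""
            else:
                raise ValueError(f"unknown style: {style!r}")
            labels.append(prefix + text)
    return labels


# A cover, four pages of front matter, then the body starting at page 1.
labels = page_labels(8, [
    (1, "none", "Cover", 1),
    (2, "roman", "", 1),
    (6, "decimal", "", 1),
])
print(labels)
# prints: ['Cover', 'i', 'ii', 'iii', 'iv', '1', '2', '3']
```

Note how the document's page 1 lands on the sixth physical page, just as in the example above. Removing a page from the middle of a section shifts every later label in that section, which is why you should apply the numbering before deleting blank pages.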

Page Orientation and Cropping

Quickly page through your document from beginning to end, making sure that your page cropping [Hack #59] didn’t chop off any data. Also check for rotated pages. Adjust rotated pages to a natural reading orientation by selecting Document → Pages → Rotate (Acrobat 6) or Document → Rotate Pages (Acrobat 5).

Rotating and cropping PDF pages can affect how they print. The user should select “Auto-rotate and center pages” from the Print dialog box when printing PDF to minimize surprises.

Bookmarks greatly improve document navigation. Adding them is pretty easy.

Ideally, your document’s headings would have been turned into PDF bookmarks when it was created [Hack #32]. If you ended up with no bookmarks or the wrong bookmarks, you can add or change them using Acrobat. Here are a few tricks to speed things up.

Add Bookmarks

Create a bookmark to the current view using the Ctrl-B shortcut (Command-B on the Macintosh). Then, type a label into the new bookmark and press Enter. Note that current view means the current page together with the current viewing mode (e.g., Actual Size, Fit Width, or Fit Page) and the current zoom. For example, if you want a bookmark to fill the page with a specific table, zoom in to that table before creating the bookmark. When quickly creating bookmarks to a document’s headings, I simply use the Fit Page viewing mode.

Every bookmark needs a text label, and this label usually corresponds to a document heading. Instead of typing in the label, use the Text Select tool to select the heading text on the PDF page. When you create the bookmark (Ctrl-B or Command-B), the selected text appears in the label. Review this text for errors.

Add document information to your PDF, even without using Acrobat.

Traditional metadata includes things such as your document’s title, authors, and ISBN. But you can add anything you want, such as the document’s revision number, category, internal ID, or expiration date. PDF can store this information in two different ways: using the PDF’s Info dictionary [Hack #80] or using an embedded Extensible Metadata Platform (XMP) stream. When you change the PDF’s title, authors, subject, or keywords using Acrobat, as shown in Figure 5-13, it updates both of these resources. Acrobat 6 also enables you to export or import PDF XMP datafiles. Visit http://www.adobe.com/products/xmp/ to learn about Adobe’s XMP.

In Acrobat 6, view and update metadata by selecting File → Document Properties . . . → Description or Advanced → Document Metadata . . . . In Acrobat 5, select File → Document Properties → Summary. Save your PDF after making changes to the metadata.

Our pdftk [Hack #79] currently reads and writes only the metadata in a PDF’s Info dictionary. However, it does not restrict you to just the title, authors, subject, and keywords. This solves the basic problem of embedding information into a PDF document; pdftk allows you to add custom metadata fields to PDF as needed. pdftk is free software.

Xpdf’s (http://www.foolabs.com/xpdf/) pdfinfo reports a PDF’s Info dictionary contents, its XMP stream, and other document data. pdfinfo is free software.
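As a rough illustration of adding custom fields with pdftk, the Python sketch below writes metadata as InfoKey/InfoValue line pairs and builds a pdftk update_info command that reads them. The helper names are ours, and the exact text layout is an assumption modeled on what pdftk's dump_data operation prints. Some later pdftk releases also print an InfoBegin line before each pair, so compare the result with your own dump_data output before relying on it.

```python
def info_records(fields):
    """Render a dict of Info dictionary entries as InfoKey/InfoValue lines.

    Assumption: this layout mirrors the output of pdftk's dump_data.
    """
    lines = []
    for key, value in fields.items():
        if "\n" in key or "\n" in value:
            raise ValueError("keys and values must be single-line strings")
        lines.append(f"InfoKey: {key}")
        lines.append(f"InfoValue: {value}")
    return "\n".join(lines) + "\n"


def pdftk_update_info_command(src, data_file, dst):
    """Build the argument list that applies data_file to src, writing dst."""
    if src == dst:
        raise ValueError("output must differ from the input file")
    return ["pdftk", src, "update_info", data_file, "output", dst]


# Title is a standard field; Revision is a custom one.
text = info_records({"Title": "Quarterly Report", "Revision": "7"})
print(text, end="")
print(" ".join(pdftk_update_info_command(
    "mydoc.pdf", "mydoc.info.txt", "mydoc.updated.pdf")))
```

Save the generated text to the data file (mydoc.info.txt here, a made-up name), run the command, and then inspect the result with pdfinfo or pdftk's dump_data to confirm the fields arrived.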

Ensure that readers see your essential links.

Styles used on the Web suggest that readers love navigation bars, such as the one shown in Figure 5-14.

Creating a PDF navigation bar in Acrobat and then duplicating it across several (or all) document pages is easy. Links can open external web pages or internal PDF pages. Add graphics and other styling elements to make it stand out. Disable printing to prevent it from cluttering printed pages. All of this is possible with PDF form buttons.

Create and manage buttons using the appropriate Acrobat tool. In Acrobat 6, activate the Button Tool (Tools → Advanced Editing → Forms → Button Tool), then click and drag out a rectangle. Release the mouse and the Button Properties dialog opens. Select the Actions tab and click the Select Action drop-down box. Choose Go to a Page in this Document or Open a Web Link and click Add . . . . Another dialog opens where you can enter destination details. Click OK and the action should be added to the button’s Mouse Up event. Click Close and your button should be functional. Test it by selecting the Hand Tool and clicking the button.

In Acrobat 5, select the Form Tool, then click and drag out a rectangle. Release the mouse and a Field Properties dialog opens. Enter the button’s name (it can be anything, but it must be unique) and change the Type: to Button. Click the Actions tab, select Mouse Up, and click Add . . . . Curiously, you do not have the choice to go to a page within the document. Instead, select JavaScript and enter this simple code; JavaScript page numbers are zero-based, so this example goes to page 6, not page 5:

this.pageNum = 5;

Or, select World Wide Web Link if you want the button to open a web page. Click Set Action and then OK, and your button should be functional. Test it by selecting the Hand Tool and clicking the button.

By default, buttons are plain, gray rectangles. Change a button’s background and border using the Appearance tab in its properties dialog; you can even make a button transparent. Change a button’s text label using the Options tab.

The Options tab also enables you to select a graphical icon. Change the Layout: to include an Icon and then click Choose Icon. You can use a bitmap, a PostScript drawing (Acrobat 6), or even a PDF page as a button icon.

Consider creating a no-op graphical button as a background to your other (transparent) buttons. This can be easier than trying to split a single navigation bar graphic into several pieces. Just make sure the active buttons end up on top of the graphic layer; otherwise, they won’t work. Do this by creating the graphic layer before creating any of the active buttons, or by giving the graphic layer a lower position in the tabbing order.

To prevent a button from printing, open its properties and select the General tab (Acrobat 6) or the Appearance tab (Acrobat 5). Change Form Field: to Visible but Doesn’t Print.

Control how far your document can wander by making it difficult to copy.

A large document represents a great deal of work, and PDF is a good way to distribute large documents. Sometimes, it is too good. Perhaps your readers are paying customers, and you don’t want them to make copies for their friends. Perhaps you want people to read your work only from your web site, not from a downloaded copy. These kinds of controls go beyond standard PDF security [Hack #52]. This hack discusses some solutions.

Low Tech: Print Editions

Copying and sharing print editions of your document would be too much trouble for most readers. Your price for this security is the cost and trouble of production and shipping. However, readers might prefer a print edition, such as the one shown in Figure 5-15, in which case you are also adding value to your work. [Hack #29] discusses how to create print-on-demand (POD) books. Print editions are vulnerable to being converted to unsecured PDF by scanning and OCR.

Serve PDF pages on demand and spare readers a long download.

Sometimes readers want to download the entire document; sometimes they want to read just a few pages. If a reader desires to read a single page from your PDF, she shouldn’t be stuck downloading the entire document. A large document download will turn her away. The easiest solution is to configure your PDF and your web server for serving individual pages on request. An alternative is to use our PDF skins [Hack #71].

Prepare the PDF

To permit page-at-a-time delivery over the Web, a PDF must be linearized. Linearization organizes a PDF’s internal structure so that a client can request just the byte ranges it needs instead of the whole file. If the reader wants to see page 12, then the client requests only the data it needs to display page 12.

Test whether a PDF is linearized by opening it in Acrobat/Reader and viewing its document properties. Open File → Document Properties . . . → Description (Acrobat 6) or File → Document Properties → Summary (Acrobat 5). A linearized PDF shows Fast Web View: Yes.

The Xpdf project (http://www.foolabs.com/xpdf/) includes a command-line tool called pdfinfo that can tell you if a PDF is linearized. Pass your PDF to pdfinfo like so:

pdfinfo <PDF filename>

pdfinfo will create a text report on-screen that says Optimized: Yes if your PDF is linearized. pdfinfo is free software.
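If you want a quick check without any PDF tools at hand, you can peek at the start of the file yourself. The PDF specification places a linearized file's linearization parameter dictionary, which carries a /Linearized key, within the first 1024 bytes. The Python sketch below looks for that key. It is only a heuristic, and the function name is ours: a file that was linearized and later changed with a plain Save can still carry the key even though it no longer behaves as a linearized file.

```python
import sys


def looks_linearized(path):
    """Heuristic: True if the file starts like a PDF and has a
    /Linearized key within its first 1024 bytes."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    return head.startswith(b"%PDF-") and b"/Linearized" in head


if __name__ == "__main__":
    # Report on each filename given on the command line.
    for name in sys.argv[1:]:
        verdict = "yes" if looks_linearized(name) else "no"
        print(f"{name}: linearized marker present: {verdict}")
```

For a definitive answer, trust pdfinfo or Acrobat's document properties over this shortcut.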

To create a linearized PDF using Acrobat, first inspect your preferences. Select Edit → Preferences → General . . . and choose the General category (Acrobat 6) or the Options category (Acrobat 5). Place a checkmark next to Save As Optimizes for Fast Web View and click OK.

Open the PDF you want to linearize and then Save As... to the same filename. In Acrobat 6, you can change the PDF’s compatibility level at the same time by selecting File → Reduce File Size instead of Save As.... Open the document properties to check that it worked.

If you ever make changes to the PDF in Acrobat and then simply File → Save your PDF, it will no longer be linearized. You must use Save As... to ensure that your PDF remains linearized.

Ghostscript [Hack #39] includes a command-line tool called pdfopt that can linearize PDF. To create a linearized PDF using pdfopt, invoke it from the command-line like so:

pdfopt input.pdf output.linearized.pdf

Prevent your online PDF from appearing inside the browser.

Some PDF documents on the Web are intended for online reading, but most are intended for download and then offline reading or printing. You can prevent confusion by ensuring your readers get the Save As . . . dialog when they click your Download Now PDF link. Here are a few ways to do this.

Zip It Up

The quickest solution for a single PDF is to compress it into a zip file, which gives you a file that simply cannot be read online. This has the added benefit of reducing the download file size a little. The downside is that your readers must have a program to unzip the file. You should include a hyperlink to where they can download such a program (e.g., http://www.info-zip.org/pub/infozip/). Stay away from self-extracting executables, because they work on only a single platform.

You can also apply zip compression on the fly with your web server. Here is an example in PHP. Adjust the passthru argument so that it points to your local copy of zip:

<?php
// pdfzip.php, zip PDF at serve-time
// version 1.0
// http://www.pdfhacks.com/serving_pdf/
//
// WARNING:
// This script might compromise your server's security.

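// read the requested filename; fall back to '' if fn is missing or not a string
$fn= ( isset($_GET['fn']) && is_string($_GET['fn']) ) ? $_GET['fn'] : '';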
// as a security measure, only serve files located in our directory
if( $fn && $fn=== basename($fn) ) {

  // make sure we're zipping up a PDF (and not some system file)
  if( strtolower( strrchr( $fn, '.' ) )== '.pdf' ) {

    if( file_exists( $fn ) ) {
      header('Content-Type: application/zip');
      header('Content-Disposition: attachment; filename="'.$fn.'.zip"');
      header('Accept-Ranges: none'); // we don't support byte serving
      // escape the filename so that the shell can't interpret any of it
      passthru('/usr/bin/zip - '.escapeshellarg($fn));
    }
  }
}
?>

If you have a PDF located at http://www.pdfhacks.com/docs/mydoc.pdf and you copied the preceding script to http://www.pdfhacks.com/docs/pdfzip.php, you could serve mydoc.pdf.zip with the URL http://www.pdfhacks.com/docs/pdfzip.php?fn=mydoc.pdf.
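The same two ideas, checking the requested filename and zipping at serve time, can be sketched in Python without calling an external zip program. This is an illustration under our own assumptions, not a port of pdfzip.php: the function names are hypothetical, the PDF is handed in as bytes, and sending the result to the browser is left to whatever web framework you use.

```python
import io
import os
import zipfile


def is_safe_pdf_name(fn: str) -> bool:
    """Accept only a bare filename (no directory parts) that ends in .pdf."""
    if not fn or "\\" in fn:
        return False
    return fn == os.path.basename(fn) and fn.lower().endswith(".pdf")


def zip_pdf(fn: str, pdf_bytes: bytes) -> bytes:
    """Return a zip archive, built in memory, that holds one PDF."""
    if not is_safe_pdf_name(fn):
        raise ValueError("refusing to zip: %r" % fn)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(fn, pdf_bytes)
    return buf.getvalue()
```

Building the archive in memory keeps the sketch short, but it might not suit very large PDFs, where streaming the output would be the better choice.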

This next script enables you to serve PDF downloads. It is handy when you want to make a single PDF available for both online reading and downloading. You can use its technique of setting the Content-Type and Content-Disposition headers in any script that serves download-only PDFs.

<?php
// pdfdownload.php
// version 1.0
// http://www.pdfhacks.com/serving_pdf/
//
// WARNING:
// This script might compromise your server's security.

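// read the requested filename; fall back to '' if fn is missing or not a string
$fn= ( isset($_GET['fn']) && is_string($_GET['fn']) ) ? $_GET['fn'] : '';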
// as a security measure, only serve files located in our directory
if( $fn && $fn=== basename($fn) ) {

  // make sure we're serving a PDF (and not some system file)
  if( strtolower( strrchr( $fn, '.' ) )== '.pdf' ) {

    if( ($num_bytes= @filesize( $fn )) ) {
      // use file pointers instead of readfile( )
      // for better performance, esp. with large PDFs
      if( ($fp= @fopen( $fn, 'rb' )) ) { // open binary read success

        // try to conceal our content type
        header('Content-Type: application/octet-stream');

        // cue the client that this shouldn't be displayed inline
        header('Content-Disposition: attachment; filename="'.$fn.'"');

        // we don't support byte serving
        header('Accept-Ranges: none');

        header('Content-Length: '.$num_bytes);
        fpassthru( $fp ); // stream the file to the client
        fclose( $fp );    // current PHP versions leave $fp open after fpassthru( )
      }
    }
  }
}
?>

If you have a PDF located at http://www.pdfhacks.com/docs/mydoc.pdf and you copied the preceding script to http://www.pdfhacks.com/docs/pdfdownload.php, the URL http://www.pdfhacks.com/docs/pdfdownload.php?fn=mydoc.pdf would prompt users to download mydoc.pdf to their computers.
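The heart of this technique is the set of response headers, so here they are on their own, as a small Python sketch. The function name download_headers is hypothetical, and how you attach the headers to a response depends on your web framework.

```python
from typing import List, Tuple


def download_headers(filename: str, num_bytes: int) -> List[Tuple[str, str]]:
    """Build headers that cue a browser to save the file, not display it."""
    # Escape backslashes and double quotes so the quoted filename stays well formed.
    safe_name = filename.replace("\\", "\\\\").replace('"', '\\"')
    return [
        # A generic type conceals that the payload is a PDF.
        ("Content-Type", "application/octet-stream"),
        # "attachment" asks the browser for a Save As dialog.
        ("Content-Disposition", 'attachment; filename="%s"' % safe_name),
        # This sketch, like the script above, does not support byte serving.
        ("Accept-Ranges", "none"),
        ("Content-Length", str(num_bytes)),
    ]
```

For example, download_headers("mydoc.pdf", 1234) yields a Content-Disposition value of attachment; filename="mydoc.pdf".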

Take readers directly to the information they seek.

You can use HTML hyperlinks, those famous filaments of the Web, to integrate PDF documents with HTML documents. A simple link to a PDF document is not enough, though, because a single PDF might hold hundreds of pages. It is like handing a haystack to somebody searching for a needle. The solution is to modify the HTML link so that it takes the reader directly to the PDF page of interest. This kind of seamless integration of HTML and PDF pages requires some groundwork. See [Hack #67] for details.

To tailor a hyperlink’s PDF destination, just add one or more of the suffixes listed in Table 5-3 to the href path.

These are glued together and appended to the href path using a special notation. The first suffix follows a hash mark. Each additional suffix follows an ampersand. These options are fully documented in PDF Open Parameters, located at http://partners.adobe.com/asn/acrobat/sdk/public/docs/PDFOpenParams.pdf.

For example, to open mydoc.pdf to page 17 and display its document bookmarks, the hyperlink href would look like this:

http://pdfhacks.com/mydoc.pdf#page=17&pagemode=bookmarks
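If you generate many such links, a tiny helper can do the gluing. The following Python sketch only joins name=value pairs in the way described above. The function name pdf_link is our own, and it does not check that a parameter is one that Acrobat or Reader understands.

```python
def pdf_link(url: str, **params) -> str:
    """Append PDF open parameters to a URL: '#' first, then '&' between pairs."""
    if not params:
        return url
    suffix = "&".join("%s=%s" % (name, value) for name, value in params.items())
    return url + "#" + suffix
```

Calling pdf_link("http://pdfhacks.com/mydoc.pdf", page=17, pagemode="bookmarks") reproduces the example href shown above. Keyword arguments keep their order in Python 3.7 and later.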

Save Display Settings in the PDF

You can also save these display settings in the PDF file. Whenever and however the PDF is opened, it will be displayed according to your settings. See [Hack #62] for details.

Create an HTML Table of Contents from PDF Bookmarks

Give web surfers an inviting HTML gateway into your PDF.

When browsing the Web, I usually groan at the sight of a PDF link. You have probably experienced it, too. My research has brought me to this point where I must now download a large PDF before I can proceed. The problem isn’t so much with the PDF file, but with my inability to gauge just how much this PDF might help me before I commit to downloading it.

The PDF author might have even gone to great lengths to ensure a good, online read, with nice, clear fonts, navigational bookmarks, and page-at-a-time byte serving for quick, random access. But I can’t tell that from looking at this PDF link. Chances are that I’ll click and wait, and wait. When it finally opens, I’ll probably need to flip, page by page, through illegible text looking for a clue that this tome will help me somehow. I might never find out, especially because I have a dozen other possible lines of inquiry I am pursuing at the same time.

Don’t let this happen to your online PDF. If your PDF has bookmarks, use this hack to create an HTML table of contents that hyperlinks every heading directly to its PDF page (see Figure 5-16).
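To give a feel for the idea, here is a minimal Python sketch that turns a list of bookmarks into an HTML list of page links. It is not the tool this hack uses. It assumes the bookmarks have already been extracted as (level, title, page) tuples, and the function name toc_html is hypothetical.

```python
from html import escape


def toc_html(pdf_url, bookmarks):
    """Render (level, title, page) bookmarks as an HTML list of page links."""
    items = []
    for level, title, page in bookmarks:
        indent = 2 * (level - 1)  # indent deeper headings by 2em per level
        items.append(
            '  <li style="margin-left: %dem"><a href="%s#page=%d">%s</a></li>'
            % (indent, escape(pdf_url), page, escape(title))
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Each link uses the #page= suffix, so a click takes the reader straight to the heading's page. Titles are HTML-escaped because bookmark text can contain characters such as & or <.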

Split a PDF into pages and frame them in HTML, where the fun begins.

In general, HTML files are called pages, while PDF files are called documents. By splitting a PDF document into PDF pages we shift it into HTML’s paradigm where we now can program the document like a web site. Let’s start with a basic document skin, shown in Figure 5-17, which gives us a cool look and handy document navigation.

Our Classic skin has a number of nice built-in features:

Test-drive our online version at http://www.pdfhacks.com/eno/. The HTML, JavaScript, and user interface icons are freely distributable under the GPL, so feel free to use them in your own templates.

Skinning PDF

First, install pdftk [Hack #79]. Next, visit http://www.pdfhacks.com/skins/ and download pdfskins-1.1.zip. Unzip, and move pdfskins.exe to a convenient location, such as C:\Windows\system32\. On other platforms, compile pdfskins from the included source code. Just cd pdfskins-1.1 and run make.

Download a skin template from http://www.pdfhacks.com/skins/. The template pdfskins_classic_js uses client-side JavaScript to create the dynamic pieces. pdfskins_classic_php uses server-side PHP instead. Pick one and unzip it into a new directory:

               unzip pdfskins_classic_js-1.1.zip

Copy your PDF document into this new directory and burst it into pages with pdftk. This also creates doc_data.txt, which reports on the document’s title, metadata, and bookmarks:

               pdftk mydoc.pdf burst
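If you want to use the bookmark data in your own scripts, doc_data.txt is easy to read. The Python sketch below assumes that each bookmark is reported on three lines keyed BookmarkTitle, BookmarkLevel, and BookmarkPageNumber, and that burst pages are named like pg_0025.pdf. Check both assumptions against the output of your pdftk version. The function names are our own.

```python
def parse_bookmarks(doc_data: str):
    """Collect (level, title, page) tuples from pdftk's data report text."""
    bookmarks = []
    for line in doc_data.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "BookmarkTitle":
            # A title line starts a new bookmark record.
            bookmarks.append({"title": value, "level": 1, "page": 0})
        elif key == "BookmarkLevel" and bookmarks:
            bookmarks[-1]["level"] = int(value)
        elif key == "BookmarkPageNumber" and bookmarks:
            bookmarks[-1]["page"] = int(value)
    return [(b["level"], b["title"], b["page"]) for b in bookmarks]


def page_filename(page: int) -> str:
    """Name of the burst page file for a page number, e.g. pg_0025.pdf."""
    return "pg_%04d.pdf" % page
```

Splitting each line at its first colon keeps titles that contain colons intact.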

Finally, in this same directory, spin skins using pdfskins. It reads doc_data.txt, created earlier, for the document title and other data. Pass the PDF filename as the first argument, if you plan to make the full PDF document available for download. This first argument is used only for constructing the Download Full Document hyperlink. It can be a full or relative URL. Omit this filename, and this hyperlink will not be displayed.

               pdfskins 

Fire up your web browser and point it at index.html, located in the directory where you’ve been working. The portal should appear, showing the table of contents and graphic placeholders for your logo (logo.gif) and document cover thumbnail (thumb.gif). If you used the php or comments templates, the pages must be served to you by a PHP-enabled web server.

Tip

The PDF pages that make up our skinned PDF do not need to be linearized; nor does the web server require byte serving configuration [Hack #67]. The only requirement is that the user has Adobe Reader configured to display PDF inside the browser, which is the default Reader configuration.

Hacking the Hack

Now, you control the document. You can take it in any direction you choose. See [Hack #21] for some ideas on how to add full-text document search. See [Hack #72] to learn how to add online page commenting.

Use our PDF skins to add commenting features to PDF pages.

Using Acrobat, you can add various comments and annotations to PDF pages. You can also share these comments via email or by configuring Acrobat’s Online Comments. These collaboration tools require all contributors to have Acrobat; they do not work with Reader. And, in general, all contributors must have the same version of Acrobat.

Instead, add online commenting features to PDF pages with our PDF skins [Hack #71] and a couple of PHP scripts. Users don’t need Acrobat, so it works on Mac and Linux as well as Windows. And, you can integrate PDF comments with your site’s current commenting system. Our Comments skin, shown in Figure 5-18, will get you up and running. View our online example at http://www.pdfhacks.com/eno/skinned_comments/.

See [Hack #71] to learn how to skin a PDF. Instead of using the template pdfskins_classic_php-1.0.zip, download pdfskins_classic_comments-1.0.zip. This Comments skin is the same as the php skin except it adds showannot.php and saveannot.php.

Skin a PDF with our comments template and move the results into a directory on your web server. Your server must have permission to write in this directory so that it can create and maintain comments. Point your web browser to this URL and the commenting frame should be visible on the right. Enter a comment into the field and click Add Comment. Your comment should appear above.

Hacking the Hack

Our commenting script saves page comments in text files. To reduce the chance of a file access collision, it copies the current comments to a temporary file before appending a new comment. When it is done, it replaces the original comments file with the updated temporary file. Even so, if two users submit comments simultaneously, they still might collide. Consider adapting the script so that it stores comments in a database instead of a text file.
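The copy-then-replace pattern described above looks roughly like this in Python. It is a sketch of the general technique, not a translation of saveannot.php, and the function name append_comment and the one-comment-per-line format are our own assumptions.

```python
import os
import tempfile


def append_comment(path: str, comment: str) -> None:
    """Append one comment by writing a temporary copy, then swapping it in."""
    directory = os.path.dirname(os.path.abspath(path))
    existing = ""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = f.read()
    # Create the temporary file in the same directory so the swap stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(existing)
            tmp.write(comment.rstrip("\n") + "\n")
        os.replace(tmp_path, path)  # readers see the old or the new file, never half
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The swap protects readers from a half-written file, but two writers can still both copy the same old contents, and then one comment is lost. That is the collision mentioned above, and it is the reason a database is the sturdier choice.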

Tally Topic Popularity

Organize PDF page hits by document headings to get a sense of what readers like best.

A single long document can cover dozens of topics. Which topics do readers find most useful? Use our PDF skins [Hack #71] to track hits to individual pages. Then, use our hit_report script to map these page hits back into your document’s headings, as shown in Figure 5-19. You’ll see topic hits, not ambiguous page hits. Visit http://www.pdfhacks.com/eno/ for an example.

Page hit logging is built into the pdfskins_classic_php template [Hack #71]. After unpacking this template, activate hit logging by editing script.include and setting $log_hits=true. You can do this at any time, before or after skinning your PDF. Page hits get logged into text files located in the same directory as the skinned PDF, so the web server must have write access to that directory.

If a page is named pg_0025.pdf, its hit log is named pg_0025.pdf.hits. Each hit adds one line to the file. Each line includes the IP address that requested the page, so you can identify unique visitors if you desire.
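Counting unique visitors from such a log takes only a few lines. The following Python sketch assumes that the IP address is the first whitespace-separated field on each line. The exact line layout is not spelled out here, so inspect one of your .hits files and adjust accordingly. The function name count_hits is hypothetical.

```python
def count_hits(log_text: str):
    """Return (total_hits, unique_visitors) for the text of one .hits file."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    # Assumption: the requesting IP address is the first field on each line.
    visitors = {line.split()[0] for line in lines}
    return len(lines), len(visitors)
```

Read a hit log into a string, pass it to count_hits, and compare the two numbers to see how many hits come from repeat visitors.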

After skinning your PDF and making sure hit logging works, visit http://www.pdfhacks.com/skins/ and download hit_report.php-1.0.zip. Unpack this single PHP file and copy it onto your web server.

If your skinned PDF is located in the directory:

http://pdfhacks.com/eno/skinned_php/

pass its location to hit_report like so:

http://pdfhacks.com/hit_report.php?pdf=/eno/skinned_php

The document outline should appear in your browser, just as it does on the skinned PDF’s title page. On the right side of the page, a column of numbers shows the number of hits on each outline topic.

For sections that span multiple pages, page hits are summed to create the section-hit count. However, section-hit counts do not include subsection-hit counts. If multiple sections have headings that appear on the same document page, those sections will also share the same hit count. hit_report identifies these by giving these sections the same background color. For example, the report in Figure 5-19 shows that the section headings “CHAPTER SIX: THE COMPOSITIONAL PROCESS” and “Equipment” are both on the same page.
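These counting rules can be sketched in Python as follows. This is one plausible reading of the behavior described above, not the code in hit_report.php. It assumes that a section runs from its heading's page up to the page before the next heading, and that headings on the same page share one span. The function name section_hits and the input shapes are our own.

```python
def section_hits(headings, page_hits, last_page):
    """Map page hits onto headings.

    headings:  list of (title, start_page) in document order
    page_hits: dict of page number -> hit count
    last_page: number of the final page in the document
    """
    starts = sorted({page for _, page in headings})
    span_total = {}
    for i, start in enumerate(starts):
        # A span ends just before the next heading's page, so the pages
        # of later sections and subsections are not counted here.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else last_page
        span_total[start] = sum(page_hits.get(p, 0) for p in range(start, end + 1))
    # Headings that start on the same page share the same total.
    return [(title, span_total[page]) for title, page in headings]
```

With headings on pages 10, 10, and 13 of a 14-page document, the first two headings both receive the summed hits for pages 10 through 12, and the third receives the hits for pages 13 and 14.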

Hacking the Hack

Provide this hit information to your readers by merging hit_report features with the current skin templates index.html and index.toc.html.