The best OCR software in the world can still produce lousy results if you don’t set it up just so and give it the best possible input material to work with. You’re looking for a combination of settings that gives you the best balance of OCR accuracy, processing speed, image quality, and file size. I help you figure out what those are in this chapter. I also show you several ways to automate scanning so that it takes as little manual effort as possible, and provide guidance about how to file your scanned documents so you can find and use them quickly in the future.
The fact that your scanner includes OCR software, or that you’ve purchased such software separately, doesn’t necessarily mean that the process of creating a searchable PDF from a scanned document will be straightforward. It might be, but more often than not, it’s necessary to think through a multi-stage process, which may involve configuring the settings in two or more pieces of software.
Every scanner comes with customized software that handles the low-level communication between the scanner and your computer. For example, if you have a Fujitsu ScanSnap, the scanner-specific software is called ScanSnap Manager; with a Canon imageFORMULA scanner you’d use Canon CaptureOnTouch; with an Epson scanner it would be Epson Scan; and so on. This software is responsible for taking the raw data your scanner produces and turning it into a bitmap image stored on your hard disk. As a result, this software always provides some means of setting preferences such as resolution, destination, and file format. The scanner’s software may include many other capabilities, too, but for the moment, assume that its only purpose is to spit out a bitmap image, as shown in the top row of Figure 1.
If you were scanning photos, then the bitmap image would be all that you’d need. But for scanned documents, an additional step is generally necessary (the bottom row in Figure 1)—another piece of software opens the bitmap image, performs OCR on it, and generates a searchable PDF file.
Since you want to avoid manual effort whenever you scan, you may feel some concern about the fact that two or more applications may be involved. Fortunately, the process has several potential shortcuts:
Now that you know where we’re going, your first step is to consult the documentation that came with your scanner and open whatever application is responsible for managing those low-level settings such as resolution and destination. In the next couple of topics, I give you some guidance as to how you should configure that software to end up with a bitmap image suitable for OCR, and to send that image (if possible) directly to the software that’ll handle the OCR process.
Because space doesn’t permit me to give detailed instructions for every application, I illustrate most of the settings that follow with examples from the most recent version of Fujitsu’s ScanSnap Manager software. If you’re using software from another manufacturer, the options and wording will vary, but you should find roughly comparable settings.
Of the many settings you may want to consider, three are of particular importance because they affect file size and OCR accuracy, among other things: Resolution, Color Mode, and Compression. I look at each of those in turn, and then discuss how they fit together.
The first decision to make is the resolution at which your scanner will save images of scanned documents. Almost every scanner mentioned in this book has an optical resolution of 600 dpi, so that’s the maximum—but they can be set to scan at lower resolutions, too. The choice of this single number has significant implications:
You may have to experiment with resolution settings (in combination with color mode and compression) to find what works best for you. My experiments suggest that a good starting point is 300 dpi for grayscale and color scans and 600 dpi for black-and-white scans.
By color mode, I’m referring to whether the scanned image is black-and-white, grayscale, or color.
Black-and-white bitmaps take up very little disk space, while grayscale images take up more space, and color images still more. However, I’ve found that compared to black-and-white scans, grayscale or color images tend to have significantly better OCR accuracy, as well as superior reprint fidelity—even at a lower resolution (for example, a 300-dpi grayscale image produces better OCR accuracy than a 600-dpi black-and-white image). And, with careful attention to compression (discussed next), the file sizes need not be unreasonably large. In fact, because of the way some OCR software alters scanned images, you may paradoxically end up with far larger files with a black-and-white scan than with a grayscale scan at the same resolution!
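To make the size relationship concrete, here is a minimal sketch that computes the uncompressed size of a single scanned page. The function name, the U.S. Letter page dimensions, and the bit depths (1, 8, and 24 bits per pixel) are assumptions chosen for illustration; they aren’t taken from any particular scanner’s software.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Round each row up to a whole number of bytes (this matters for 1-bit data).
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed bit depths for the three color modes.
BITS = {"black-and-white": 1, "grayscale": 8, "color": 24}

for mode, dpi in [("black-and-white", 600), ("grayscale", 300), ("color", 300)]:
    size = raw_scan_bytes(8.5, 11, dpi, BITS[mode])
    print(f"{mode:>15} at {dpi} dpi: {size / 1_000_000:.1f} MB")
```

Before compression, a 600-dpi black-and-white page comes to roughly half the size of a 300-dpi grayscale page, and a 300-dpi color page to three times the grayscale figure. Compression, and whatever your OCR software does to the image afterward, can change the final numbers considerably.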
OCR accuracy and file size aside, if colors or gray shades are essential for the documents you’re scanning (for example, ones that include photographs or artwork), you’ll want to set the appropriate color mode. Your scanner may have an Auto setting that enables it to figure out the proper color mode as it scans, which may be even better (but do a few spot checks to confirm that it’s making wise choices).
The bitmap files produced by your scanner can be compressed using a variety of methods in order to reduce the file size. With black-and-white images, the compression is normally lossless (no information is discarded), whereas with grayscale and color scans, lossy compression is normally used—it shrinks the files much more, but decreases clarity at higher compression settings. Excessive compression can reduce OCR accuracy, not to mention making images less attractive, while using little or no compression results in unreasonably large files. So, in my experience, a medium compression setting is usually the best compromise.
However, let me qualify that in two ways.
First, although some OCR apps (such as PDFpen) leave the bitmap image alone and simply add the recognized text to the PDF, others try to recompress the image after recognizing the text. Usually this results in smaller files, but other times—particularly with black-and-white scans—this process uses a less-effective compression method that actually increases the file size, sometimes dramatically. (The stand-alone version of ABBYY FineReader Express, which is excellent in most other respects, happens to be a culprit in the latter category.) All that to say: no matter what you choose in terms of initial compression for your images, there’s no guarantee that your choice will be honored by software that processes the file later on. File sizes may get much better—or much worse.
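One way to find out how your own OCR software treats file sizes is to compare each scan with the searchable PDF made from it. The following is a minimal sketch; the function names and the 1.5× threshold are assumptions for illustration, not recommendations from any OCR vendor.

```python
import os


def size_change(before_path, after_path):
    """Return the output size as a multiple of the input size (2.0 means doubled)."""
    before = os.path.getsize(before_path)
    after = os.path.getsize(after_path)
    if before == 0:
        raise ValueError(f"{before_path} is empty")
    return after / before


def grew_too_much(before_path, after_path, limit=1.5):
    """Return True if the OCR output is more than `limit` times the original scan."""
    return size_change(before_path, after_path) > limit
```

Running a check like this on a handful of test scans in each color mode should show quickly whether a given application shrinks your files or inflates them.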
Second, one OCR app—Readiris Pro—includes a proprietary compression algorithm called iHQC, which the developer claims can reduce storage space “up to 50 times,” with no loss of visual fidelity. Even if you don’t use Readiris Pro for OCR, you can use a separate tool called IRISCompressor to shrink PDFs after the fact, if small file size is crucial. (I have not tested iHQC, so I can’t comment on its effectiveness from personal experience.)
Juggling three different variables to get the best combination of file size, OCR accuracy, and other benefits is no mean feat. I’ve performed hundreds of experiments with many combinations of settings and software in an attempt to find the sweet spot for my own needs, but there’s no single answer that’s ideal for every situation.
In general, I’ve found that 300 dpi grayscale scans, with medium compression, yield the most favorable tradeoff between file size and OCR accuracy. They also look very good and don’t tax my computer’s CPU excessively. Of course, if you work with different sorts of documents, different hardware, or different software than I do, you may reach other conclusions.
In any case, all these settings apply to the initial scanned image, before any OCR takes place. That means you must configure them in the scanner’s software package.
To illustrate with one version of ScanSnap Manager, the Scanning tab (Figure 2, slightly ahead) offers the following choices:
The “Excellent” option results in much slower scans as well as huge files, and I’ve never found that extra-high resolution to be helpful.
Although I had Image Quality set to Automatic Resolution and Color Mode set to Auto Color Detection for years without any apparent problems, my latest experiments have led me to prefer an Image Quality setting of Best (for improved OCR accuracy) and a Color Mode of Gray (to avoid the possibility of getting black-and-white images, which not only have lower OCR accuracy but which can grow precipitously when fed through certain third-party applications).
The next choice you must make is the destination for the bitmap file. In general, you can choose either of the following:
~/Pictures or a subfolder that you create there, such as ~/Pictures/Scans, where all the raw bitmap files will be stored as your scanner generates them. In some cases, even if you choose to send scans directly to an application, your scanner saves a copy of the bitmap file somewhere for future reference, so you’ll still need to set a location on your disk. In ScanSnap Manager, you set the destination on the Save tab.
Along with selecting a destination folder, you usually have the option to specify a pattern for naming files as they’re scanned. For example, in ScanSnap Manager, click the File Name Format button on the Save tab (or, in earlier versions of the software, the Options button) to display the dialog shown in Figure 4, where you can choose a predefined format (based on the date and time) or create a pattern of your own. Once the document has gone through the OCR step, you’ll have an opportunity to name the searchable PDF, so I don’t place much importance on the names of the bitmap files.
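For readers who script any part of their workflow, a date-and-time file name of this kind is easy to reproduce. This is a minimal sketch; the function name is hypothetical, and the underscore-separated pattern is an assumption based on the sort of name scanner software typically generates, not a description of ScanSnap Manager’s internals.

```python
from datetime import datetime


def timestamp_name(moment, extension="pdf"):
    """Build a file name such as 2014_04_26_11_27_00.pdf from a date and time."""
    return moment.strftime("%Y_%m_%d_%H_%M_%S") + "." + extension


print(timestamp_name(datetime.now()))
```

Names built this way sort chronologically in the Finder, which is about the only thing to recommend them.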
Your scanner software may offer more than just these two options (send scans to an application or save them to a file). For example, ScanSnap Manager lets you choose a feature called Quick Menu by selecting the Use Quick Menu checkbox at the top of the window. When you do this, every time you scan a document, a special window (the eponymous Quick Menu) pops up to let you choose what to do with that particular scan. This approach is useful if you do a variety of activities with scanned documents (for example, you print some, you email some, and you send still others to iPhoto). But since I almost always want to save my own scans as searchable PDFs, I leave this feature turned off most of the time.
Your document will eventually end up as a PDF, but the bitmap image that your scanner produces initially could in theory be in any of several common formats, such as TIFF, JPEG, PNG, and (naturally) PDF. Most scanning software lets you choose the bitmap file format, although some offer more choices than others.
Here are my recommendations, in order from most to least preferable:
The list of other scanning options you may be able to set is long and highly variable. And, in general, whether or how to change any of these things is up to you—you might want to experiment and see what works best with your combination of scanner, software, and documents. (Many of these settings are also available in certain OCR applications; it’s up to you to decide where it makes the most sense to use them.)
Examples of other commonly seen settings:
The preceding instructions should do it for configuring the software to create the initial bitmap files themselves. But whether you’re using OCR capabilities built into your scanner’s software or a separate OCR application, you should next take a quick spin through the OCR-related preferences. They’re likely to be less involved (and may include some of the same options described just previously), but at minimum, be sure you configure the following:
~/Documents) or on a network volume. (If you need help deciding where to store the documents, read Choose a Naming and Filing Strategy, next.)
The settings for those last two items (destination and file name) may be obvious to you, or they may require more thought. And, if you’ll be sharing scanned files with others, some additional questions may arise. So before you finalize your settings, read the next section for advice on naming and filing your scans.
Naming your searchable PDFs and filing them (that is, storing them in some particular location) may be entirely separate activities, but it usually makes sense to do them together. And, your OCR software may expect you to make decisions up front about how these tasks will be handled. So it’s a good idea to think through your options carefully.
Fundamentally, you have four questions to answer:
As you perform OCR on your scanned documents, you have three basic choices as to what happens next:
I can think of good reasons for choosing any of these approaches, but the important thing is to weigh the pros and cons, decide how you want to handle the process, and stick with it.
The appeal of the entirely hands-off approach is obvious: it requires no effort other than pressing a button. So, if you’re concerned that you’ll never get around to scanning otherwise, that’s a significant positive.
On the negative side, if you let your software name scanned documents, the names (usually a string of numbers based on the date and time) won’t be meaningful to you, and when it comes time to find a file, you won’t be able to distinguish one from another by name; you’ll have to examine their contents, too.
And, if you let your software file the documents too, they’ll almost certainly end up in a single big folder somewhere, which again makes it harder to find what you’re looking for later on.
For me, since I’m scanning documents in order to make my life easier, the negatives of the hands-off approach outweigh the positives.
You can choose to name every document as soon as your OCR software turns it into a searchable PDF. At the same time, you can optionally file it in an appropriate folder and, if you’re using a document manager that supports tags (or the Finder, starting in 10.9 Mavericks), apply tags that will help you identify the document later. (Read the sidebar Tags vs. Folders.)
The big advantage to doing this is that you’ll make it much quicker to find the document later—and by doing this work right away, the subject matter of the document will be fresh in your mind, making naming easier.
The disadvantage is that it’s not merely more work; it turns scanning into a task that demands your ongoing attention, because you have to stop after every document goes through the scanner, think about it, and perform one or more extra steps.
As a compromise between the first two options, you can let your computer process everything automatically at first, but then later (say, once a week) go back and review recently scanned documents, name them, file them in the proper locations, tag them, and so on.
Although this approach has the benefits of both of the other alternatives, it has a downside, too: it’s more time-consuming to identify and name documents after the fact than right away. And, if you let it go too long, the task might become so overwhelming that you never do it.
Nevertheless, this is what I do most of the time. I’m disciplined enough to avoid letting my unnamed scans pile up for months, and sometimes I’m even inspired enough to name files as I go.
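If you take this route, a short script can show you which scans are still waiting for a real name. The sketch below assumes that unnamed files still carry a date-and-time name in the form 2014_04_26_11_27_00.pdf; the function name and that pattern are illustrative assumptions, so adjust the pattern to whatever your software produces.

```python
import re
from pathlib import Path

# Assumed pattern: four-digit year followed by five two-digit fields.
TIMESTAMP_NAME = re.compile(r"\d{4}(_\d{2}){5}\.pdf", re.IGNORECASE)


def unnamed_scans(folder):
    """Return PDFs in `folder` that still have a timestamp name, oldest first."""
    found = [
        path
        for path in Path(folder).iterdir()
        if path.is_file() and TIMESTAMP_NAME.fullmatch(path.name)
    ]
    return sorted(found, key=lambda path: path.stat().st_mtime)
```

Listing the oldest files first keeps the backlog visible, which may help you deal with it before it becomes overwhelming.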
After OCR is complete (whether or not you’ve taken the time to choose a file name) and you have a searchable PDF, you can leave it in the folder where it started—the one where the bitmap images straight from the scanner live—or you can move it somewhere else. I’m a proponent of the “somewhere else” approach, but before you can decide where, exactly, to store your files, you need to know what technique you’ll use to find and view your PDFs later. In particular, you need to decide whether you will store the PDF as an ordinary file in the Finder—and if so, where? Or, will you store everything in a document manager?
The default way to retrieve documents is through the Finder, possibly with the help of Spotlight. Storing PDFs in regular Finder folders is easy—it happens automatically if you don’t take any other action, and Spotlight automatically indexes the documents. Because the PDFs are now ordinary, searchable files, you can organize them just like all your other documents—for example, if you have scanned documents relating to a specific project, they might go in that project’s folder in the Finder; or if your scans are of utility bills, they might go in a folder with other financial documents. Or, you may keep all your searchable PDFs, regardless of contents, together in one place.
Before you choose Spotlight and the Finder as your retrieval tools, spend a moment pondering these questions:
If Spotlight works well for you, then the Finder is a great destination. If it doesn’t, consider using a Spotlight enhancement such as HoudahSpot. Or, try searching with an alternative search engine such as Google Quick Search Box or FoxTrot. Or, you could use a document manager (covered next).
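A quick way to test whether Spotlight finds what you need is to query it from a script, using the mdfind command-line tool that ships with Mac OS X. This is a minimal sketch; the function names are hypothetical, and the folder and search phrase you pass in are up to you.

```python
import subprocess
import sys


def spotlight_command(folder, phrase):
    """Build an mdfind command that searches only within `folder`."""
    return ["mdfind", "-onlyin", str(folder), phrase]


def spotlight_search(folder, phrase):
    """Run the search and return the matching file paths (Mac only)."""
    if sys.platform != "darwin":
        raise RuntimeError("mdfind is available only on a Mac")
    result = subprocess.run(
        spotlight_command(folder, phrase),
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

If a phrase you can plainly see in a scanned document doesn’t turn up, the cause may be the OCR step rather than Spotlight itself.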
I mentioned several document managers in the discussion of OCR software (Pick a Mac OCR Package). Essentially, they’re applications that provide their own storage, categorization, display, and search methods for files and other snippets of data. You might prefer one of these over storing files only in the Finder for any number of reasons, such as a more pleasant user interface, more-flexible (or faster) searching, support for tags in pre-Mavericks versions of Mac OS X, or other database features.
Neat for Mac and Paperless, both of which I covered earlier in Pick a Mac OCR Package, are OCR tools with built-in document managers (or vice-versa) that are specially designed to work with structured data such as receipts. And, I’ve mentioned that I’m personally a fan of DEVONthink Pro Office. However, a few other options are also worth considering, as long as you have some independent way to perform OCR:
If you’re considering one of these applications, I suggest downloading a demo version and making sure you can find a way (such as using an AppleScript folder action) to store your searchable PDFs directly in the document manager of your choice.
Depending on your needs, you may want to look for a few features in particular:
I don’t want to belabor this point, nor can I provide any universal solutions, but it’s worth giving some thought to what you’ll name your searchable PDFs so you can more easily find them later, and so anyone else who needs to use the files can clue into their contents. (If you’re content with file names like 2014_04_26_11_27_00.pdf then feel free to skip ahead to the next section!)
Suppose you’re scanning a stack of invoices. Naming them all “invoice.pdf” may be a bit better than nothing, but then, when you search for one of these and the result is a list of 100 files named “invoice.pdf,” you won’t know which is which without examining each one individually. On the other hand, although nothing prevents you from naming a file “Invoice #416, dated April 22, 2014, from ABC Supply Corp for $432.19.pdf,” that’s cumbersome to type and equally awkward to read. So, let me offer a few suggestions:
Banks/Citibank/statements/2013/August.pdf. As long as you’re looking at a file in a hierarchical view in the Finder, that’s fine, but if you do a search that lists 20 files all named “August.pdf,” it’ll require extra work to figure out which one you want. It’s better to include extra detail in the file name, even if it seems redundant.
When you’re naming files that other people will be using (for example, invoices you’re sending to another company), think about what sort of name would be most useful on the other end. For example, if I send TidBITS an invoice named “TidBITS invoice—2013-10.pdf,” that’s meaningful to me but not to them. They already know who they are, but they won’t be able to tell, just by looking at the file name, that it came from me. Something that includes both names, like “Kissell invoice—TidBITS—2013-10.pdf” meets both of our needs.
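If you name files with a script, a naming convention like this can be reduced to a small function. The following is a minimal sketch; the function name and parameters are hypothetical, and it joins the parts with a spaced hyphen rather than a dash, purely as an assumption to keep the names easy to type.

```python
def descriptive_name(sender, kind, recipient, year, month, extension="pdf"):
    """Build a name such as 'Kissell invoice - TidBITS - 2013-10.pdf'."""
    parts = [f"{sender} {kind}", recipient, f"{year:04d}-{month:02d}"]
    # Slashes and colons cause trouble in Mac file names, so swap them out.
    cleaned = [part.replace("/", "-").replace(":", "-") for part in parts]
    return " - ".join(cleaned) + "." + extension


print(descriptive_name("Kissell", "invoice", "TidBITS", 2013, 10))
```

Putting the year and month last, in that order, means that files from the same sender sort chronologically within a folder.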
If you’re storing searchable PDFs as ordinary, Finder-accessible files rather than using a document manager, make sure you put those files in a location that makes sense for your needs. Here are your options:
~/Documents folder) is logical. If you want to use Mac OS X’s built-in file sharing feature to make the documents available to someone else, put them in a folder whose entire contents you’re willing to share.
Earlier in this chapter, I described routing incoming scans to an OCR program or feature, which (if you’re lucky) then creates searchable PDFs automatically. If that’s what happens on your Mac, congratulations: you can skip this section. But if your scanner’s software doesn’t support that configuration, or if you want to use OCR software that doesn’t automatically generate a searchable PDF when it opens a document, read on for help with automating the process.
Any scanner can save bitmap files into a folder somewhere on your disk, so that’s our starting point. Fundamentally, you need to make both of the following tasks happen automatically:
Luckily, you can often use the same tool to accomplish both tasks: an AppleScript folder action. A folder action is an AppleScript that runs automatically when something happens to a specified folder—for example, you open or close it, or add files to it. So, the basic idea is this: create a folder action script and attach it to the folder where incoming scans are stored so that it watches for new files being added; have the script open those files in your OCR program and then instruct your OCR program to go ahead and process the files.
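To show the logic of watching a folder without tying it to any one OCR program, here is a minimal sketch in Python. It is not an AppleScript folder action and not one of the scripts provided with this book; it substitutes a simple polling loop for the system’s folder-action mechanism, and the function names, the five-second interval, and the `handle` callback are all assumptions for illustration.

```python
import time
from pathlib import Path


def new_files(folder, seen):
    """Return files in `folder` whose names aren't in `seen`, and record them."""
    fresh = []
    for path in sorted(Path(folder).iterdir()):
        hidden = path.name.startswith(".")
        if path.is_file() and not hidden and path.name not in seen:
            seen.add(path.name)
            fresh.append(path)
    return fresh


def watch(folder, handle, interval=5.0, rounds=None):
    """Call `handle(path)` for each file added to `folder` after watching starts."""
    # Files already present when watching starts are ignored.
    seen = {path.name for path in Path(folder).iterdir()}
    count = 0
    while rounds is None or count < rounds:
        for path in new_files(folder, seen):
            handle(path)
        count += 1
        time.sleep(interval)
```

On a Mac, the `handle` function might hand each file to an OCR application by running the open command with its -a option. Be aware that a polling loop like this can notice a file before the scanner has finished writing it, so a real version would need to wait until the file’s size stops changing.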
What’s particularly cool about this method is that sometimes it can even automate OCR in applications that don’t inherently support AppleScript—or don’t support it robustly. This is possible due to a feature called GUI Scripting, which means that instead of AppleScript issuing a direct command to an application to perform some action, it instead simulates the user actions of choosing menu commands, clicking buttons, filling in fields, and suchlike. Unfortunately, this means the application must be in the foreground—if you were to switch to another window while this was going on, the AppleScript would no longer be able to “see” and operate the necessary controls. Still, it’s way better than going through the entire process manually every time.
I’d like to offer you prewritten AppleScripts for every OCR program and with every possible combination of settings and behaviors, but life is too short. So, instead, I’m providing four scripts that can drive a few popular OCR tools, and serve as examples on which you can base your own scripts for other applications. You can also, of course, modify any of my scripts to make it work differently according to your needs.
Download the scripts. After unzipping them (if that doesn’t happen automatically), move the files into either /Library/Scripts/Folder Action Scripts (which requires authenticating with your username and password) or ~/Library/Scripts/Folder Action Scripts (see Basics for help accessing it); in either case, if the folder doesn’t already exist, create it. (Be sure not to put the scripts in the similarly named Folder Actions folder, which may appear in the same location.) Then proceed with the instructions that follow to configure the scripts on your Mac.
Before you can use folder action scripts, you must enable the system-wide Folder Actions capability if you haven’t previously done so, and attach a particular script to the folder where your incoming scans are stored. These steps should work in 10.6 Snow Leopard and later:
You’re almost ready to go, but it’s best to tweak a setting or two for optimal behavior in your OCR application of choice, and to understand exactly what to expect of the scripts.
As I lamented earlier, Acrobat XI Pro is immune to this sort of scripting; however, if you’re still running an earlier version of Acrobat Pro (7, 8, 9, or X) you can use a folder action—but only after you’ve performed a slightly odd one-time procedure to prepare it for OCR:
Now you’re ready to try out the script—either by scanning a document or by dragging an existing scanned image into the folder to which the AppleScript is attached. I provide two different Acrobat scripts. Both work with Acrobat Standard version 7 and Acrobat Pro versions 7, 8, 9, and X, but they have slightly different behaviors:
The OCR This (PDFpen & PDFpenPro) script works with either version of the PDFpen software (version 5 or later), without requiring any modification. However, for best results I suggest making one small change in PDFpen’s settings before using the script:
This last step may seem counterintuitive, but if you leave that box checked, then whenever the script runs automatically on a newly scanned document, PDFpen will display a dialog asking if you want to perform OCR on it. That isn’t actually a problem—the script still works—but there’ll be a delay of a few seconds, and that dialog (and the beep that sounds when it appears) may be confusing and distracting.
Once PDFpen is configured, scan a document (or drop an already-scanned document into your designated scans folder) to try the script.
The Readiris Pro script was created with version 14 of the app. I can’t guarantee how well the script or setup instructions will work with older or newer versions.
Before using the OCR This (Readiris) script, open Readiris Pro and set it up as follows:
When a new PDF file appears in the folder to which you’ve attached the script, Readiris opens the file, recognizes the text in it, and saves it as a PDF; it prompts you to enter a name and select a location. (Unfortunately, because of Readiris Pro’s poor AppleScript support, I was unable to find a good way to avoid this need for interaction.) After you save the file, Readiris creates a new document (which clears all the existing scanned pages from its list).
The scripts I’ve provided are all fairly simple, but depending on your needs, preferences, and willingness to tinker with AppleScript, you could enhance them to do other things.
Here are a few ideas: