Choose OCR Software
One way or another, you need software to turn your raw scans into searchable PDFs. OCR software of some sort most likely came with your scanner, but if it didn’t—or if you’re not happy with its features or accuracy—you have oodles of other choices. This chapter provides an overview of major factors to consider when choosing Mac-compatible OCR software, along with a few specific suggestions for software to try (or avoid).
Determine Your Needs
I haven’t tried every scanner and every OCR application out there, but I’m going to go out on a limb and suggest that almost any combination of scanner and software can be made to yield acceptable results for most users. If you don’t want to agonize over the decision, the path of least resistance is to use whatever software came with your scanner.
However, you may be the sort of person who should look more deeply into the capabilities of OCR tools before jumping in if any of the following statements apply to you:
- You have (or plan to get) a Doxie. As of publication time, the original Doxie is the sole scanner, of those mentioned in this ebook, that doesn’t include any OCR software of its own. (Newer models, including the Doxie Go and Doxie One, do include OCR software.)
- You need to scan in multiple languages. All the OCR programs I discuss here support English text, and most support at least a few other languages too. If you have documents in more than one language (and especially if your documents mix more than one language on a page), you’ll want OCR software that supports those languages, as I discuss in the next topic.
- You want capabilities your existing OCR software lacks. Perhaps you’ve tried the software that came with your scanner and found it to be too slow, too cumbersome to use, or missing features you wish it had. If so, by all means look for a replacement!
- You need more-advanced PDF processing features. Some of the OCR programs here do nothing but spit out a searchable PDF file, whereas others let you manipulate PDFs to your heart’s content. If you want fine-grained control over your searchable PDFs, look for such a program.
If you have none of those needs, feel free to skip to the next chapter, Configure Your Software. Otherwise, continue reading to learn about features to consider when evaluating OCR software.
Consider Important OCR Features
Comparing OCR apps for Mac OS X is less of a science than an art—and a messy one at that. The information available on developers’ Web sites varies tremendously in scope and detail. Some have elaborate user manuals, while others have only a brief how-to guide. Many offer downloadable demo versions, but some don’t. Developers use different terms to describe the same features, and have wildly divergent ideas about what constitutes a nicely usable interface. A feature that one developer considers too obvious to mention may be a main selling point for another. And although most of these applications claim to have outstanding OCR accuracy, objective measurements are notoriously difficult to come by.
In short, it’s harder than one might expect to evaluate OCR software without trying it out (and even then, results may be ambiguous). However, a few factors are worth looking for:
- Accuracy: No OCR software is 100 percent accurate, but it’s been a long time since I used OCR software that didn’t come close enough to meet my basic searching and archiving needs. (Remember, if all you need to do with your PDFs is search them, occasional OCR mistakes won’t affect your results much.) Nevertheless, because so many factors influence OCR accuracy (not the least of which is the quality of the raw scans that your scanner produces), it’s possible for two people to have dramatically different results with the same application and even the same document. So, my advice is to take developers’ claims of accuracy with a teaspoon of salt.
The best way to determine whether results are good enough for your needs is to try an OCR tool on freshly scanned documents from your scanner of choice. Then, select all the text in the PDF, paste it into an empty document in your favorite word processor, and run a spelling check (or skim for errors manually). If the terms you’re likely to search for are often incorrect, you might want to look for a different program.
- Languages: If not all the documents that you’ll scan are entirely in English, pay attention to the software’s multilingual support. The first task of OCR software is to recognize individual characters. If a document contains characters that don’t appear in English (such as Ω or ø), your OCR software must know about those other character sets in order to interpret them properly; otherwise, they’ll be represented by the nearest equivalents in the English alphabet. Beyond that, nearly all modern OCR software uses language-specific dictionaries and algorithms to improve its accuracy dramatically. If a certain group of pixels might be either “torn” or “tcm,” a dictionary can compare those two strings and conclude that, for English, “torn” is the much more likely interpretation.
The same goes for other languages—even if they use the same alphabet as English, you’ll get much better results if the program knows about the rules of that particular language.
Even with support for multiple languages, though, OCR software may need help to narrow things down. If you specify up front which language a document is in (something usually done in the software’s preferences), you’ll get better results than if the software has to guess. Often, you can specify both a primary language and one or more secondary languages, to improve the odds that the software will use the right rules as needed. Even so, some OCR software that does perfectly well with pages that are entirely in a single language has trouble determining when languages change within a page. Once again, this is best judged by selecting some representative documents from your collection, scanning them using the settings I cover in the next chapter, and then checking to see what results the OCR software provides. Be sure to verify that the resulting PDFs are, in fact, searchable, too. OCR software sometimes handles non-Latin text in ways that make searching (and even copying text) problematic.
- Automation support: Nearly all the OCR applications that I’ve tested support at least a bare minimum level of automation. That is, there’s some way to configure them, perhaps in conjunction with your scanner’s included driver, such that newly scanned images are converted automatically into searchable PDFs without any need for manual intervention. (Sometimes this requires a bit of fiddling to set up, but I discuss that in the next chapter, Configure Your Software.) In many cases, though, it’s even better to be able to automate PDF processing in more elaborate ways using AppleScript or Automator, a topic that I discuss in Automate OCR, later.
- Handwriting support: Recognizing text that was produced by a (good-quality) printer is relatively easy for a computer; recognizing handwritten text is considerably more challenging. A minority of OCR programs for the Mac claim to be able to recognize (neat, printed) handwriting to one degree or another, so if you need to scan lots of handwritten notes, be sure to look for that feature.
- Business card support: Any OCR program can recognize the raw text on a business card, but some have additional intelligence that enables them to infer (with mixed results) which string of text is a name, which is a title, which is a phone number, and so on—and then to put all those pieces into their appropriate fields in a database record that you can then export to, or sync with, Contacts or another contact manager. If you scan lots of business cards, this capability can save you lots of manual effort.
- Receipt processing: In much the same way as some OCR software can recognize the contents of a business card, other programs are designed to make sense of receipts—specifically looking for information such as date, merchant name, sales tax, and total, and storing that information in a database you can use for tracking expenses, preparing tax returns, and similar tasks.
- PDF editing: A few applications are designed mainly for creating, editing, annotating, optimizing, and otherwise transforming PDF files—with OCR merely being one of their many tricks. If you need such advanced features, then choosing one of these multifunctional applications may make your life easier.
- Layout retention: Although the focus of this book is on creating searchable PDFs, sometimes you may need to convert a scanned image into an editable document you can alter in, say, Word or Excel (see the sidebar Converting Scans to Microsoft Word Format). Several OCR programs can do just that—creating editable documents whose layouts closely resemble those of the originals, including graphics, tables, and even similar fonts. Although the end result won’t be as faithful to the look and feel of the scanned original as a searchable PDF, it will be much easier to work with if you need to do anything other than search, read, and copy the text.
- Document management: Once you have a searchable PDF, how do you search it? The answer may be Mac OS X’s system-wide Spotlight feature, in which case you can put the file in any convenient folder in the Finder—or use tags in Mavericks or later (see the sidebar just ahead). However, you may prefer an app that lets you catalog, cross-reference, and search files with far greater flexibility and precision. If so, look for an application that includes not only OCR but also document management capabilities—or opt for a stand-alone document manager such as Yojimbo.
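If you’d rather not eyeball a spelling check, the manual test described under Accuracy can be roughed out in a few lines of Python. This is only a sketch under some assumptions: you have already copied the text out of the PDF into a string, and you supply your own word list. The function name estimate_ocr_accuracy and the tiny SAMPLE_WORDS set are made up for illustration. The function reports the share of words found in the list, plus the words it didn’t recognize.

```python
import re


def estimate_ocr_accuracy(text, known_words):
    """Return (share of words found in known_words, sorted unknown words)."""
    # Pull out runs of letters and apostrophes, ignoring case and digits.
    words = [w.strip("'") for w in re.findall(r"[A-Za-z']+", text.lower())]
    words = [w for w in words if w]
    if not words:
        return 0.0, []
    unknown = sorted({w for w in words if w not in known_words})
    hits = sum(1 for w in words if w in known_words)
    return hits / len(words), unknown


# Hypothetical stand-in for a real dictionary file.
SAMPLE_WORDS = {"the", "page", "was", "torn", "in", "two"}

ratio, unknown = estimate_ocr_accuracy("The page was tcm in two", SAMPLE_WORDS)
print(f"{ratio:.0%} of words recognized; check these: {unknown}")
```

With a real document, you would swap in a fuller word list, such as the one macOS keeps at /usr/share/dict/words. Proper names and jargon will show up as unknown, so treat the number as a rough guide rather than a true accuracy score.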
With those thoughts in mind, let’s look at the range of OCR programs that you can choose from. (I offer my recommendations after the list of applications, in Joe’s OCR Software Recommendations.)
Pick a Mac OCR Package
The number and variety of Mac apps that can produce a searchable PDF are growing constantly. As I said earlier, if whatever software was bundled with your scanner yields results you find acceptable, there’s no need to look further. But if you’re looking for a better OCR package than what you have now, you should find no shortage of choices.
In the first edition of this book, I described 21 OCR apps. That list went out of date almost immediately. And frankly, I consider only a fraction of those apps to be noteworthy. So, just as I did with scanners, I’ve relocated the information on OCR software to the online appendixes, where you can peruse features and prices at your leisure, and where I can more easily keep them up to date.
Here, I want to call your attention to just a few of those choices that I find particularly interesting for one reason or another. If an OCR tool you’re wondering about isn’t listed here, check the online appendixes.
Notable OCR tools for Mac include:
- ABBYY FineReader Pro for Mac: Customized versions of this software are often bundled with scanners or embedded in other apps, but it’s available as a stand-alone product too. (FineReader Pro replaces an earlier version of the product, FineReader Express.) Although FineReader (in whichever form) is the most accurate OCR app I’ve tested, the stand-alone version offers little in the way of configurability, whereas some bundled versions (for example, the version included with DEVONthink Pro Office) give you more control over things like OCR accuracy and compression.
- Acrobat XI Pro: I mentioned this before, but it bears repeating: whatever other virtues Acrobat XI Pro may have, it’s a nonstarter for anyone who wants to automate OCR on newly scanned documents—and its accuracy isn’t great either. Those issues, plus the price, prevent me from recommending it anymore. However, I do want to mention one notable feature: an OCR option called ClearScan. With ClearScan selected, Acrobat creates a custom font for the recognized text that closely approximates the fonts, styles, and layout of the original (although not in a form that can be edited in other applications) and then replaces the bitmap image with one that has a much lower resolution. This can save a tremendous amount of disk space, and for many documents, the fidelity is quite good, but if you plan to print the document again, it may or may not be close enough to the original for your needs.
- DEVONthink Pro Office: Primarily a document manager—and a fantastic one, at that—DEVONthink Pro Office lets you categorize, tag, sort, link, and search documents of all kinds with ease. And, not only does it come with an integrated version of ABBYY FineReader (see the description earlier in this list), it also has special hooks that let it receive scans directly and seamlessly from any Fujitsu ScanSnap scanner, with no special configuration required.
- Evernote Premium: Evernote is the name of an application (available for Mac OS X, Windows, and iOS, among others) and an accompanying cloud-based service that lets you save, search, and share documents of many kinds—from notes to photos to PDF files—on nearly any device. The applications and the basic service are free, but a paid service called Evernote Premium ($5 per month or $45 per year) adds several options, including support for searching in PDFs created from scanned images or digital photos. The OCR conversion itself happens in the cloud—and you can search PDFs only within Evernote; you can’t save searchable PDFs to be used elsewhere. Currently, only English, Japanese, and Russian are supported.
Evernote has lots of fans, including many who swear by it for their paperless office use—they upload their scanned files, let Evernote Premium make them searchable, and then automatically have access to them on all their devices. That’s handy, but every time I’ve tried Evernote, I’ve been disappointed. In particular, Evernote’s document management features are largely limited to tagging—you can have at most two levels of hierarchy (that is, “stacks” at the top level containing notebooks, which in turn contain notes), but you can’t create an arbitrary hierarchy with folders inside folders—and searching is far more limited than in, say, DEVONthink. If you prefer tags to folders (see the sidebar Tags vs. Folders), don’t need complex searches, and don’t mind the fact that you can search your PDFs only in Evernote, it might work better for you than for me.
- Neat for Mac: This software, previously called NeatWorks, comes bundled with the developer’s scanners (both NeatDesk and NeatReceipts), but it is also available separately for those with other scanners. Its main claim to fame is that it can recognize the contents of business cards and receipts and store them in a searchable database, from which they can also be exported or synced in various ways. The software also provides document management for run-of-the-mill searchable PDFs. One downside, however: Neat for Mac currently supports only English.
- Paperless: Somewhat along the lines of Neat for Mac, Mariner Software’s Paperless is a document management application that performs OCR on scanned documents, with special treatment for receipts and other financial records, which can be decoded and entered into a database—and exported for use with Quicken or spreadsheet applications if you wish. Like DEVONthink Pro Office, it integrates directly with Fujitsu’s ScanSnap scanners. And, it includes an application called ScanHelper, which lets you easily route scanned documents to specific applications. Unfortunately, Paperless currently supports OCR only in English.
- PDFpen and PDFpenPro: When I need to make a quick edit to a PDF (such as superimposing a copy of my handwritten signature; see Sign Documents without Paper), I immediately turn to PDFpen, which makes this sort of activity simple. Among many other features, it offers OCR in 12 languages, solid AppleScript support, and the option to save files in Word format. PDFpenPro has extra features (it can convert Web pages into multi-page PDFs, create PDF forms, and add a table of contents for easier navigation), but its OCR capabilities are the same as those of the less-expensive PDFpen.
- Readiris Pro for Mac: Readiris is one of the more advanced OCR programs. It recognizes text in over 130 languages, and it supports mixed character sets in a single sentence. The documentation emphasizes that although Readiris recognizes handprinting, it doesn’t recognize handwriting. Readiris can also save documents in an editable form (such as Microsoft Word format) that preserves the original layout. And it has a proprietary compression system that claims to produce the smallest possible PDF files.
Joe’s OCR Software Recommendations
Of the applications listed in the online appendixes, I have experience with about half. All the Mac OCR tools I’ve tried have had adequate (if not always great) accuracy, but some are easier to use than others.
My preference is for a tool that works more or less invisibly behind the scenes. I like to configure things so that images from my scanner get the OCR treatment without interrupting my work or taking over my screen. A few OCR applications—Presto PageManager, OmniPage, and Readiris Pro—are what I think of as “old school,” in that their design assumes you’ll open the application, initiate a single-page scan from within it (typically, on a flatbed scanner), watch the OCR as it progresses, edit the final document, and then save it. There’s nothing wrong with any of that—and all three of those applications can be used in a much more automated, hands-off way—but I tend to gravitate toward apps with a more modern, minimalist approach.
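To make the hands-off idea concrete, here is a minimal Python sketch of a watched-folder routine. It is not how any of the applications mentioned here works internally. The folder layout, the function names, and the run_ocr callback are all assumptions made for illustration, and the callback stands in for whatever your OCR software actually provides (an AppleScript command, an Automator action, or a command-line tool).

```python
from pathlib import Path


def pending_scans(inbox, archive):
    """Return PDFs in inbox that have no same-named file in archive yet."""
    done = {p.name for p in Path(archive).glob("*.pdf")}
    return sorted(p for p in Path(inbox).glob("*.pdf") if p.name not in done)


def process_new_scans(inbox, archive, run_ocr):
    """Call run_ocr(source, destination) for each pending scan.

    run_ocr is a placeholder for the real OCR step; it is expected to
    write a searchable PDF to destination. Returns the files created.
    """
    archive = Path(archive)
    archive.mkdir(parents=True, exist_ok=True)
    created = []
    for source in pending_scans(inbox, archive):
        destination = archive / source.name
        run_ocr(source, destination)
        created.append(destination)
    return created
```

Run on a schedule, a routine like this touches only scans that haven’t been processed yet, so nothing takes over the screen and nothing is converted twice.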
I’ve been pleased with the results, interface, and flexibility of ABBYY FineReader—especially the versions of it included with Fujitsu’s ScanSnap scanners and DEVONthink Pro Office. It’s reasonably fast, recognizes text in multiple languages without any fuss (for several years I scanned about equal amounts of English and French text), and requires no interaction under normal circumstances.
If you need an uncommon feature, you should go for one of the tools that offers it. Otherwise, if you’re looking for strictly OCR, I’d lean toward ABBYY FineReader, either an embedded version or the stand-alone FineReader Pro. For PDF editing, I’d choose PDFpen over the much-more-expensive and automation-unfriendly Acrobat XI Pro; if document management is your focus, I’d go with DEVONthink Pro Office; and if you particularly need to deal with receipts, either Neat for Mac or Paperless is a fine choice.