The Design and Construction of eBooks, by Steve Thomas : appendix8

When OCR goes bad

OCR does a remarkable job of converting page images into text, provided your images are of good quality. Nothing will save you when the original pages are of dismal quality: you’re going to have lots of errors, and it may not be worth attempting to create a text version from them.

More often, though, I find that most pages are good and the OCR result is acceptable, with just a few pages where the entire page is corrupted. This usually happens because the original scan skewed the page beyond what the OCR software can tolerate. In such a case, you may be able to rescue the text by de-skewing that page's image and re-running the OCR on it:

1. Download the original page scan (e.g. from archive.org).
2. Straighten the page in an image-editing program.
3. Run these commands (Linux command line):
     convert page.jpg page.tif
     tesseract page.tif page
4. Insert page.txt into the ebook, replacing the corrupted text.
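
If you have several bad pages, the conversion and OCR steps above can be wrapped in a small shell function. The following is only a sketch, assuming ImageMagick (convert) and Tesseract are installed; the function name reocr_page is made up for illustration, and the -deskew 40% option (ImageMagick's automatic straightening) is an optional addition that may not cope with badly skewed pages, in which case straightening by hand as described above is still needed.

```shell
# Sketch: convert one page scan to TIFF, de-skew it, and re-run OCR.
# reocr_page is a hypothetical helper name, not part of any tool.
reocr_page() {
    scan="$1"              # input scan, e.g. page.jpg
    base="${scan%.*}"      # strip the extension: page.jpg -> page

    # Convert to TIFF; -deskew 40% asks ImageMagick to straighten the
    # image automatically (an assumption added here, not in the steps above).
    convert "$scan" -deskew 40% "$base.tif" || return 1

    # Tesseract adds the .txt extension to the output base name itself.
    tesseract "$base.tif" "$base" || return 1

    # Report the name of the text file that was written.
    echo "$base.txt"
}
```

Calling reocr_page page.jpg should leave page.tif and page.txt next to the scan and print page.txt, ready to be pasted into the ebook.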
