The Design and Construction of eBooks, by Steve Thomas

Techniques

When OCR goes bad

OCR does a remarkable job of converting page images into text, provided the images are of good quality. Nothing will save you when the original pages are of dismal quality: you’re going to have lots of errors, and it may not be worth attempting to create a text version from them.

More often, though, I find that most pages are good and the OCR result is acceptable, with just a few pages where the entire page is corrupted. This is usually because the page was skewed in the original scan beyond what the OCR software can cope with. In such a case, you may be able to rescue the text by de-skewing the image for that page and re-doing the OCR for it.

  1. Download the original page scan (e.g. from archive.org).
  2. Straighten the page in an image-editing program and save it as page.jpg.
  3. Run these commands (Linux command line; convert is part of ImageMagick, and tesseract writes its output to page.txt):
    convert page.jpg page.tif
    tesseract page.tif page
  4. Insert the contents of page.txt into the ebook, replacing the corrupted text.
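
When several pages need rescuing, the two commands in step 3 can be wrapped in a small shell function. This is only a minimal sketch: the function name reocr_page and the DRY_RUN switch are made up for illustration and are not part of ImageMagick or Tesseract, and it assumes the scan has already been straightened by hand as in step 2.

```shell
#!/bin/sh
# Re-run OCR on one page scan that has already been straightened.
# Hypothetical helper: reocr_page and DRY_RUN are illustrative names,
# not features of ImageMagick or Tesseract.
reocr_page() {
    src=$1             # straightened scan, e.g. page.jpg
    base=${src%.*}     # strip the extension: page.jpg -> page

    if [ -n "$DRY_RUN" ]; then
        # Only print the commands that would be run.
        echo "convert $src $base.tif"
        echo "tesseract $base.tif $base"
        return 0
    fi

    # ImageMagick converts the JPEG to TIFF, then Tesseract
    # reads the TIFF and writes the recognised text to $base.txt.
    convert "$src" "$base.tif" &&
    tesseract "$base.tif" "$base"
}
```

After loading the function into your shell, reocr_page page.jpg should leave the recognised text in page.txt, ready for step 4. Setting DRY_RUN=1 first prints the two commands without running them, which is a cheap way to check the file names before processing a batch of pages.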

https://ebooks.adelaide.edu.au/about/appendix8.html

Last updated Tuesday, January 26, 2016 at 23:27