Identifying text in an image is a very popular application of computer vision. This process is commonly called optical character recognition (OCR), and is divided as follows:
- Text preprocessing and segmentation: During this step, the software must deal with image noise and rotation (skew), and identify which areas are candidate text regions.
- Text identification: This is the process of identifying each letter in the text; it will be covered in later chapters.
The preprocessing and segmentation phase can vary greatly depending on the source of the text. Let's take a look at common situations where preprocessing is done:
- Production OCR applications with a scanner: This is a very reliable source of text. In this scenario, the background of the image is usually white and the document is almost aligned with the scanner margins. The content being scanned consists mostly of text, with almost no noise. This kind of application relies on simple preprocessing techniques that can adjust the text quickly and maintain a fast scanning pace. When writing production OCR software, it is common to delegate the identification of important text regions to the user, and to create a quality pipeline for text verification and indexing.
- Scanning text in a casually taken picture or in a video: This is a much more complex scenario, since there's no indication of where the text may be. This scenario is called scene text recognition, and OpenCV 4.0 contains a contrib library to deal with it. We will cover this in Chapter 11, Text Recognition with Tesseract. Usually, the preprocessor will use texture analysis techniques to identify the text patterns.
- Creating a production-quality OCR for historical texts: Historical texts are also scanned, but they pose several additional problems, such as noise created by the aged paper color and the use of ink. Other common problems are decorated letters, unusual text fonts, and low-contrast content caused by ink fading over time. It's not uncommon to write specific OCR software for the documents at hand.
- Scanning maps, diagrams, and charts: Maps, diagrams, and charts pose an especially difficult scenario, since the text can appear in any orientation and in the middle of image content. For example, city names are often clustered, and ocean names often follow a country's shoreline contours. Some charts are heavily colored, with text appearing in both light and dark tones.
OCR application strategies also vary according to the objective of the identification. Will it be used for a full text search? Or should the text be separated into logical fields to index a database with information for a structured search?
In this chapter, we will focus on preprocessing scanned text, or text that's been photographed by a camera. We'll assume that the text is the main subject of the image, such as in a photographed piece of paper or card, for example, in this parking ticket:
We'll try to remove common noise, deal with text rotation (if any), and crop the possible text regions. While most OCR APIs already do these things automatically (and probably with state-of-the-art algorithms), it is still worth knowing how things happen under the hood. This will allow you to better understand most OCR API parameters and will give you better knowledge about the potential OCR problems you may face.