Detecting text that appears arbitrarily in a scene is a harder problem than it looks. Compared with recognizing scanned text, there are several new variables you need to take into account, such as the following:
- Three-dimensionality: The text may appear at any scale, orientation, or perspective, and it may be partially occluded or interrupted. There are literally thousands of possible regions of the image where it may appear.
- Variety: Text can appear in many different fonts and colors. The font may have outlined borders, and the background can be dark, light, or a complex image.
- Illumination and shadows: The sun's position and apparent color change over time, and weather conditions such as fog or rain can generate noise. Illumination may be a problem even in closed spaces, since light reflects off colored objects and hits the text.
- Blurring: Text may appear in a region that the lens autofocus does not prioritize. Blurring is also common with moving cameras, with text in perspective, or in the presence of fog.
The following picture, taken from Google Street View, illustrates these problems. Note how several of these situations occur simultaneously in a single image:

Performing text detection that deals with all of these situations can prove computationally expensive, since there are 2^n possible pixel subsets, where n is the number of pixels in the image. Even a tiny 100 x 100 image, for instance, yields 2^10,000 candidate subsets.
To reduce complexity, two strategies are commonly applied:
- Use a sliding window to search just a subset of image rectangles: This strategy reduces the number of candidate regions to a tractable amount. How many regions must be scanned varies with the complexity of the text being considered: algorithms that deal only with translated, axis-aligned text can use far fewer windows than those that must also handle rotation, skewing, perspective, and so on. The advantage of this approach is its simplicity, but such algorithms are usually limited to a narrow range of fonts and, often, to a lexicon of specific words (see the sliding-window sketch after this list).
- Use of connected component analysis: This approach assumes that pixels can be grouped into regions with similar properties, and that these regions have a higher chance of being identified as characters. Its advantage is that it does not depend on several text properties (orientation, scale, font, and so on), and it also yields a segmentation region that can be used to crop the text for the OCR stage. This was the approach we used in Chapter 10, Developing Segmentation Algorithms for Text Recognition. Lighting can also affect the result, for example, when a shadow cast over the letters splits a single character into two distinct regions. However, since scene text detection is commonly used with moving vehicles (for example, drones or cars) and with video, the text will eventually be detected, because these lighting conditions differ from frame to frame.
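To make the first strategy concrete, here is a minimal sliding-window sketch in OpenCV C++. The window size, stride, and the isTextWindow() function are hypothetical placeholders, not part of any OpenCV API; a real detector would plug a trained text/no-text classifier into that function and usually scan several window scales as well:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical stand-in for a trained text/no-text classifier.
static bool isTextWindow(const cv::Mat& patch) {
    // A real detector would score the patch with a classifier here;
    // this dummy brightness test only keeps the sketch self-contained.
    return cv::mean(patch)[0] > 200;
}

// Scan the image with a fixed-size window at a fixed stride,
// collecting every window the classifier accepts.
std::vector<cv::Rect> slidingWindowDetect(const cv::Mat& gray,
                                          cv::Size win = cv::Size(64, 32),
                                          int stride = 16) {
    std::vector<cv::Rect> hits;
    for (int y = 0; y + win.height <= gray.rows; y += stride)
        for (int x = 0; x + win.width <= gray.cols; x += stride) {
            cv::Rect roi(x, y, win.width, win.height);
            if (isTextWindow(gray(roi)))
                hits.push_back(roi);
        }
    return hits;
}
```

Note how every extra degree of freedom (scale, rotation, perspective) multiplies the number of windows to evaluate, which is exactly why simple sliding-window detectors tend to stay restricted to narrow font ranges and fixed lexicons.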
The OpenCV 4.0 algorithm uses the second strategy: it performs connected component analysis by searching for extremal regions.
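As an illustration, the following is a sketch of how extremal-region-based detection can be invoked through the text module of opencv_contrib. The input filename is hypothetical; the two classifier XML files are the ones distributed with the module's samples, and the numeric parameters follow the values used there:

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/text.hpp> // opencv_contrib text module
#include <vector>

int main() {
    cv::Mat image = cv::imread("scene.jpg"); // hypothetical input image
    if (image.empty()) return 1;

    // Two-stage extremal region filters (Neumann and Matas); the XML
    // classifiers ship with the opencv_contrib text module samples.
    cv::Ptr<cv::text::ERFilter> er1 = cv::text::createERFilterNM1(
        cv::text::loadClassifierNM1("trained_classifierNM1.xml"),
        16, 0.00015f, 0.13f, 0.2f, true, 0.1f);
    cv::Ptr<cv::text::ERFilter> er2 = cv::text::createERFilterNM2(
        cv::text::loadClassifierNM2("trained_classifierNM2.xml"), 0.5f);

    // detectRegions runs both stages over several channels of the input
    // and groups the surviving extremal regions into candidate text boxes.
    std::vector<cv::Rect> boxes;
    cv::text::detectRegions(image, er1, er2, boxes);

    for (const cv::Rect& box : boxes)
        cv::rectangle(image, box, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detected.jpg", image);
    return 0;
}
```

Because each extremal region is a connected component with a well-defined extent, the resulting boxes can be cropped directly and handed to the OCR stage, as discussed above.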