This section presents different computational techniques used in implementing the automated recognition of objects in digital images. Let's start by giving a definition of object recognition. In a nutshell, it is the task of finding and labeling parts of a 2D image of a scene that correspond to objects inside that scene. The following screenshot shows an example of object recognition performed manually by a human using a pencil:
The image has been marked and labeled to show fruits recognizable as a banana and a pumpkin. This is exactly what happens in automated object recognition: it can simply be thought of as the process of drawing lines and outlining areas of an image, and finally attaching to each structure a label corresponding to the model that best represents it.
Object recognition must use a combination of factors, such as the information present in the image and the semantics of the scene's context. Context is particularly important when interpreting images. Let's first have a look at the following screenshot:
It is nearly impossible to identify the object in the center of that image in isolation. Let's now have a look at the following screenshot, where the same object appears in the same position as in the original image:
With no further information provided, it is still difficult to identify that object, though not as difficult as in Figure 13.2. Given the context information that the image in the preceding screenshot shows a circuit board, the initial object is more easily recognized as a polarized capacitor. Cultural context plays a key role in enabling the proper interpretation of a scene.
Let's now consider a second example (shown in the following screenshot), a consistent 3D image of a stairwell:
By changing the lighting in that image, the result can make it harder for the eye (and for a computer, too) to see a consistent 3D image (as shown in the following screenshot):
In the next variant, the brightness and contrast of the original image (Figure 13.3) have been modified (as shown in the following screenshot):
The eye can still recognize the three-dimensional steps. However, applying a different set of brightness and contrast values to the original image produces the result shown in the following screenshot:
Here, it is almost impossible to recognize the same image. What we have learned is that, although the first retouched image retains a significant part of the important visual information of the original (Figure 13.3), the images in Figure 13.4 and the preceding screenshot became less interpretable because the retouching removed the 3D details. The examples presented provide evidence that computers (like human eyes) need appropriate context models in order to successfully complete object recognition and scene interpretation.
Computational strategies for object recognition can be classified based on their suitability for complex image data or for complex models. Data complexity in a digital image corresponds to its signal-to-noise ratio: data consisting of perfect outlines of model instances throughout an image is called simple, while image data with poor resolution, noise, or other kinds of anomalies, or with easily confused false model instances, is referred to as complex (an image with semantic ambiguity also corresponds to complex, or noisy, data). Model complexity is indicated by the level of detail in the data structures in an image, and in the techniques required to determine the form of the data. If a model is defined by a simple criterion (such as a single shape template, or the optimization of a single function implicitly containing a shape model), then no other context may be needed to attach model labels to a given scene. But in cases where many atomic model components must be assembled, or are in some way hierarchically related, to establish the existence of the desired model instance, complex data structures and non-trivial techniques are required.
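As an example of a model defined by a single shape template, the following sketch performs brute-force template matching scored with zero-normalized cross-correlation. The image, the "+"-shaped template, and the noise level are synthetic toy data assumed purely for illustration:

```python
# A minimal sketch of matching "a single shape template" against an image
# by exhaustive scanning, scored with zero-normalized cross-correlation.
import numpy as np

def match_template(image: np.ndarray, template: np.ndarray):
    """Return the (row, col) offset where the template best matches."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom == 0:
                continue                      # flat patch, no correlation
            score = float((p * t).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Toy data: a "+"-shaped template embedded in a noisy 10x10 image.
template = np.zeros((3, 3))
template[1, :] = 1.0
template[:, 1] = 1.0
rng = np.random.default_rng(0)
image = rng.normal(0.0, 0.1, size=(10, 10))
image[4:7, 5:8] += template              # place the object at row 4, col 5
print(match_template(image, template))   # best match expected at (4, 5)
```

Because the model is a single template, no additional context is needed: the best-scoring position directly yields the label.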
Based on the previous definitions, object recognition strategies can then be classified into four main categories, as follows:
- Feature vector classification: This relies on a trivial model of an object's image characteristics. Typically, it is applied only to simple data (see the sketch after this list).
- Fitting model to photometry: This is applied when simple models are sufficient but the photometric data of an image is noisy and ambiguous.
- Fitting model to symbolic structures: This is applied when complex models are required, but reliable symbolic structures can be accurately inferred from simple data. These approaches look for instances of objects by matching data structures that represent global relationships between object parts.
- Combined strategies: Applied when both data and desired model instances are complex.
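To make the first of these strategies concrete, here is a minimal sketch of feature vector classification using a nearest-centroid rule. The feature values, class names, and the choice of a nearest-centroid classifier are all illustrative assumptions, not part of any particular framework:

```python
# A minimal sketch of feature vector classification: each image is reduced
# to a small feature vector, and an unseen vector is labeled with the class
# whose centroid is nearest in feature space.
import numpy as np

# Hypothetical features: (mean intensity, edge density) per training image.
train_features = np.array([
    [0.8, 0.1], [0.9, 0.2],   # class "banana"
    [0.3, 0.7], [0.2, 0.8],   # class "pumpkin"
])
train_labels = np.array(["banana", "banana", "pumpkin", "pumpkin"])

def classify(feature_vector: np.ndarray) -> str:
    """Nearest-centroid classification of a single feature vector."""
    classes = np.unique(train_labels)
    centroids = np.array(
        [train_features[train_labels == c].mean(axis=0) for c in classes]
    )
    distances = np.linalg.norm(centroids - feature_vector, axis=1)
    return classes[distances.argmin()]

print(classify(np.array([0.85, 0.15])))  # -> banana
```

The point is that the model of each class is trivial (a single point in feature space), which is why this strategy suits only simple data.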
The object recognition APIs for building and training CNNs that are provided by the major open source frameworks detailed in this book have been implemented with these considerations and strategies in mind. While those APIs are very high-level, the same mindset should be adopted when choosing the proper combination of hidden layers for a model.
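To give an idea of just how high-level such APIs are, the following is a minimal sketch of a small CNN defined with Keras, used here purely as one illustrative example of a high-level API; the layer stack, input shape, and the assumption of 10 object classes are arbitrary choices for demonstration, not a recommended architecture:

```python
# A minimal, illustrative CNN for object recognition (hypothetical
# hyperparameters; not a recommended architecture).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),          # 64x64 RGB input
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # local feature maps
    tf.keras.layers.MaxPooling2D(),                    # spatial downsampling
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # higher-level features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),   # assumed 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Each convolution-pooling pair trades spatial resolution for richer features; deciding how many such pairs to stack for given data and model complexity is exactly the kind of choice the strategies above should inform.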