Cookie Monster: Hey, you know what? A round cookie with one bite out of it looks like a "C". A round donut with one bite out of it also looks like a "C" but it is not as good as a cookie. Oh, and the moon sometimes looks like a "C" but you can't eat that.
-- "C is for Cookie", Sesame Street
Think about cloud watching. If you lie on the ground and look up at the clouds, maybe you imagine that one cloud is shaped like a mound of mashed potatoes on a plate. If you board an airplane and fly to this cloud, you will still see some resemblance between the cloud's surface and the fluffy, lumpy texture of hearty mashed potatoes. However, if you could slice off a piece of cloud and examine it under a microscope, you might see ice crystals that do not resemble the microscopic structure of mashed potatoes at all.
Similarly, in an image made up of pixels, a person or a computer vision algorithm can see many distinctive shapes or patterns, partly depending on the level of magnification. During the creation of a Haar cascade, various parts of the image are cropped and/or scaled so that we consider only a few pixels at a time (though these pixels might represent any level of magnification). This sample of the image is called a window. We will subtract some of the grayscale pixel values from others in order to measure the window's similarity to certain common shapes where a dark region meets a light region. Examples include an edge, a corner, or a thin line, as seen in the following diagram. If a window is very similar to one of these archetypes, it can be selected as a feature. We expect to find similar features, at similar positions and magnifications relative to each other, across all images of the same subject.
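As a rough illustration (a minimal sketch; the filename, window position, and window size are all hypothetical), the following Python snippet measures a window's similarity to a vertical edge by subtracting the pixel sums of the window's two halves:

```python
import cv2
import numpy as np

image = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical filename

# Crop a small window from the image (hypothetical position and size).
x, y, w, h = 100, 100, 24, 24
window = image[y:y+h, x:x+w].astype(np.float32)

# A vertical-edge feature: the sum of the left half's pixel values minus
# the sum of the right half's. A large absolute value means the window
# resembles a place where a dark region meets a light region.
feature_value = window[:, :w//2].sum() - window[:, w//2:].sum()
print('Vertical edge feature value:', feature_value)
```

In practice, OpenCV computes such rectangular sums efficiently using integral images, but the arithmetic is the same.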
Not all features are equally significant. Across a set of images, we can notice whether a feature is truly typical of images that include our subject (the positive training set) and atypical of images that exclude our subject (the negative training set). We will give the features a different rank or stage depending on how well they distinguish subjects from non-subjects. Together, a set of stages forms a cascade or a series of comparison criteria. Every stage must be passed in order to reach a positive detection result. Conversely, a negative detection result can be reached in fewer stages, perhaps only a single stage (an important optimization). Like the training images, scenes are examined through various windows and we might end up detecting multiple subjects in one scene.
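Once a cascade has been trained, running it on a scene takes just a few lines. The following sketch assumes a modern opencv-python installation, where cv2.data.haarcascades points to the bundled cascade files, and a hypothetical scene filename:

```python
import cv2

# Load one of OpenCV's pretrained frontal face cascades.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

scene = cv2.imread('scene.jpg')  # hypothetical filename
gray = cv2.cvtColor(scene, cv2.COLOR_BGR2GRAY)

# detectMultiScale examines the scene through windows at various
# positions and scales. A window counts as a face only if it passes
# every stage of the cascade, so one scene may yield several faces.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(scene, (x, y), (x + w, y + h), (0, 255, 0), 2)
```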
For more information about Haar and LBP cascades in OpenCV, refer to the official documentation at http://docs.opencv.org/trunk/doc/py_tutorials/py_objdetect/py_face_detection/py_face_detection.html.
An LBPH model, as the name suggests, is based on a kind of histogram. The histogram has three dimensions: a height (a count) and localized x and y coordinates. For each pixel in a window, we will note whether each neighboring pixel within a certain radius is brighter or darker than the central pixel. Our histogram's height at a given position is a count of how often the neighbor at that position is darker. For example, suppose a window contains the following two neighborhoods of 1-pixel radius:
Neighborhood 1:

Black  White  Black
White  White  White
Black  White  Black

Neighborhood 2:

Black  Black  Black
White  White  White
White  White  White
Counting these two neighborhoods (and not yet counting other neighborhoods in the window), our histogram has the following height values:
2  1  2
0  0  0
1  0  1
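To make the counting concrete, here is a minimal Python sketch (using NumPy, with the two example neighborhoods hardcoded) that reproduces the histogram above:

```python
import numpy as np

BLACK, WHITE = 0, 255  # grayscale values for the example

# The two example neighborhoods of 1-pixel radius (3x3 grids).
neighborhoods = [
    np.array([[BLACK, WHITE, BLACK],
              [WHITE, WHITE, WHITE],
              [BLACK, WHITE, BLACK]]),
    np.array([[BLACK, BLACK, BLACK],
              [WHITE, WHITE, WHITE],
              [WHITE, WHITE, WHITE]]),
]

# One histogram bin per position in the 3x3 grid. Each bin counts the
# neighborhoods in which that position holds a darker (black) pixel.
histogram = np.zeros((3, 3), dtype=int)
for neighborhood in neighborhoods:
    histogram += (neighborhood == BLACK).astype(int)

print(histogram)
# [[2 1 2]
#  [0 0 0]
#  [1 0 1]]
```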
If we compute the LBPH of multiple sets of reference images for multiple subjects, we can determine which set of LBPH references is the least distant from the LBPH of a piece of a scene, such as a detected face. Based on the least distant set of references, we can predict the identity of the face (or other object) in the scene.
An LBPH model is good at capturing fine texture detail in any subject, not just faces. Moreover, it is good for applications where the model needs to be updated, such as the Interactive Recognizer. The histograms for any two images are computed independently, so a new reference image can be added without recomputing the rest of the model.
OpenCV also implements other models that are popular for face recognition, namely Eigenfaces and Fisherfaces. We will use LBPH because it supports updates in real time, whereas Eigenfaces and Fisherfaces do not. For more information on these three recognition models, refer to the official documentation at http://docs.opencv.org/modules/contrib/doc/facerec/facerec_tutorial.html.
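As a minimal sketch of this workflow, assuming a recent OpenCV build with the contrib modules (the opencv-contrib-python package; older versions name the factory function cv2.createLBPHFaceRecognizer instead), we could train, update, and query an LBPH recognizer as follows. The filenames and labels are hypothetical:

```python
import cv2
import numpy as np

recognizer = cv2.face.LBPHFaceRecognizer_create()

# Train on grayscale reference images, with one integer label per subject.
images = [cv2.imread('josephine_1.png', cv2.IMREAD_GRAYSCALE),
          cv2.imread('josephine_2.png', cv2.IMREAD_GRAYSCALE)]
labels = np.array([0, 0])
recognizer.train(images, labels)

# Because each image's histogram is computed independently, we can add
# a new reference image without recomputing the rest of the model.
new_image = cv2.imread('josephine_3.png', cv2.IMREAD_GRAYSCALE)
recognizer.update([new_image], np.array([0]))

# Predict the identity of a detected face; a lower distance means the
# face is closer to that subject's set of reference histograms.
label, distance = recognizer.predict(
    cv2.imread('unknown_face.png', cv2.IMREAD_GRAYSCALE))
print('Predicted label %d at distance %f' % (label, distance))
```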
Alternatively, for detection rather than recognition, we can organize LBPH models into a cascade of multiple tests, much like a Haar cascade. Unlike an LBPH recognition model, an LBP cascade cannot be updated in real time.
Haar cascades, LBP cascades, and LBPH recognition models are not robust with respect to rotation or flipping. For example, if we look at a face upside down, it will not be detected by a Haar cascade that was trained only for upright faces. Similarly, if we had an LBPH recognition model trained for a cat whose face is black on the cat's left-paw side and orange on the cat's right-paw side, the model might not recognize the same cat in a mirror. As a workaround, we could include mirror images in the training set, but then we might get a false positive recognition for a different cat whose face is orange on the cat's left-paw side and black on the cat's right-paw side.
Unless otherwise noted, we can assume that a Haar cascade or LBPH model is trained for an upright subject. That is, the subject is not tilted or upside down in the image's coordinate space. If a man is standing on his head, we can take an upright photo of his face by turning the camera upside down or, equivalently, by applying a 180-degree rotation in software.
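For instance, assuming a hypothetical filename, OpenCV (version 3 and later) can apply such a rotation in a single call:

```python
import cv2

# Rotate the whole image by 180 degrees so that an upside-down face
# becomes upright for an upright-face detector.
image = cv2.imread('upside_down_face.jpg')  # hypothetical filename
upright = cv2.rotate(image, cv2.ROTATE_180)
```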
Some other directional terms are worth noting. A frontal, rear, or profile subject has its front, rear, or profile visible in the image. Most computer vision people, including the authors of OpenCV, express left and right in the image's coordinate space. Thus, when we say "left eye" for an upright, frontal, nonmirrored face, we mean the subject's right eye, since left and right in image space are the opposite of an upright, frontal, nonmirrored subject's own left-hand and right-hand directions. Refer to the following image, which shows how we would label the "left eye" and "right eye" in a nonmirrored image (on the left-hand side) and a mirrored image (on the right-hand side).
Our human and feline detectors deal with upright, frontal faces.
Of course, in a real-world photo, we cannot expect a face to be perfectly upright. The person's head or the camera might have been slightly tilted. Moreover, we cannot expect boundary regions, where a face meets the background, to be similar across images. We must take great care to preprocess the training images so that the face is rotated to a nearly perfect upright pose and the boundary regions are cropped off. While cropping, we should place the major features of the face, such as eyes, in a consistent position. These considerations are addressed further in the Planning the cat detection model section later in this chapter.
If we must detect faces in various rotations, one option is to rotate the scene before sending it to the detector. For example, we can try to detect faces in the original scene, then in a version of the scene that has been rotated 15 degrees, then a version rotated -15 degrees (345 degrees), then a version rotated 30 degrees, and so on. Similarly, we can send mirrored versions of the scene to the detector. Depending on how many variations of the scene are tested, such an approach might be too slow for real-time use and thus, we do not use it in this chapter.
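For reference, a minimal sketch of this multi-rotation search follows, assuming OpenCV's bundled frontal face cascade and a hypothetical scene filename:

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
gray = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical filename
h, w = gray.shape

detections = []
for angle in (0, 15, -15, 30, -30):
    # Rotate the scene about its center by the given angle.
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(gray, matrix, (w, h))
    # Note: the detected coordinates are in the rotated scene's space;
    # mapping them back to the original would need the inverse rotation.
    faces = face_cascade.detectMultiScale(rotated, 1.1, 5)
    detections.append((angle, faces))
```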