Understanding image tracking

Imagine the following conversation:

Person A: I can't find my print of The Starry Night. Do you know where it is?

Person B: What does it look like?

For a computer, or for someone who is unfamiliar with Western art, Person B's question is quite reasonable. Before we can use our sense of sight (or other senses) to track something, we need to have sensed that thing before. (Failing that, we at least need a good description of what we will sense.) For computer vision, we must provide a reference image that will be compared with the live camera image or scene. If the target has complex geometry or moving parts, we might need to provide many reference images to account for different perspectives and poses. However, for our examples using famous paintings, we will assume that the target is rectangular and rigid.

For this chapter's purposes, let's say that the goal of tracking is to determine how our rectangular target is posed in 3D. With this information, we can draw an outline around our target. In the final 2D image, the outline will be a quadrilateral (not necessarily a rectangle), since the target could be skewed away from the camera.
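To see why the outline is a quadrilateral in general, consider what a perspective transformation does to a rectangle's corners. The following sketch is plain Java (no OpenCV); the 3x3 matrix is hand-picked for illustration, not computed from a real scene. Each corner, treated as the homogeneous vector (x, y, 1), is multiplied by the matrix and then divided by its third coordinate (the perspective divide):

```java
// A minimal sketch: push the four corners of a rectangle through an
// illustrative 3x3 perspective matrix. The nonzero entry in the bottom
// row is what skews the rectangle into a general quadrilateral.
public class HomographyOutline {

    // Map one 2D point (x, y) through the 3x3 matrix h.
    public static double[] apply(double[][] h, double x, double y) {
        double xh = h[0][0] * x + h[0][1] * y + h[0][2];
        double yh = h[1][0] * x + h[1][1] * y + h[1][2];
        double w  = h[2][0] * x + h[2][1] * y + h[2][2];
        return new double[] { xh / w, yh / w }; // perspective divide
    }

    public static void main(String[] args) {
        // Hand-picked matrix for illustration; a real tracker would
        // compute this from matched feature points.
        double[][] h = {
            { 1.0, 0.2,   10.0 },
            { 0.0, 1.1,    5.0 },
            { 0.0, 0.001,  1.0 }
        };
        // Corners of a 100x50 reference rectangle.
        double[][] corners = { { 0, 0 }, { 100, 0 }, { 100, 50 }, { 0, 50 } };
        for (double[] c : corners) {
            double[] p = apply(h, c[0], c[1]);
            // e.g. (100.0, 0.0) maps to (100.00, 4.55)
            System.out.printf("(%.1f, %.1f) -> (%.2f, %.2f)%n",
                    c[0], c[1], p[0], p[1]);
        }
    }
}
```

Because parallel lines need not stay parallel under this transformation, the projected corners form a quadrilateral rather than a rectangle.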

There are four major steps in this type of tracking:

  1. Find features in the reference image and scene. A feature is a point that is likely to maintain a similar appearance when viewed from different distances or angles. For example, corners often have this characteristic.
  2. Find descriptors for each set of features. A descriptor is a vector of data that characterizes a feature. Some features are not suitable for generating a descriptor, so an image typically has fewer descriptors than features.
  3. Find matches between the two sets of descriptors. If we imagine the descriptors as points in a multidimensional space, a match is defined in terms of some measure of distance between points. Descriptors that are close enough to each other are considered a match.
  4. Find the homography between the reference image and the matching region of the scene. A homography is a projective transformation, represented by a 3x3 matrix, that would line up the two projected 2D images (or come as close as possible to lining them up). It is calculated based on the two images' matching feature points. By applying the homography to the corners of the reference rectangle, we can get an outline of the tracked object.
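To illustrate step 3, here is a tiny brute-force matcher in plain Java (no OpenCV). Binary descriptors, such as FREAK's, are compared by Hamming distance — the number of differing bits — which we can compute with XOR and a bit count. The 8-bit descriptors and the distance threshold below are made up for the example; real FREAK descriptors are much longer (512 bits):

```java
public class BruteForceHamming {

    // Hamming distance between two binary descriptors stored as ints.
    public static int hamming(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    // For each query descriptor, find the index of the closest training
    // descriptor; report it as a match only if within maxDistance.
    public static int[] match(int[] query, int[] train, int maxDistance) {
        int[] matches = new int[query.length];
        for (int i = 0; i < query.length; i++) {
            int best = -1, bestDist = Integer.MAX_VALUE;
            for (int j = 0; j < train.length; j++) {
                int d = hamming(query[i], train[j]);
                if (d < bestDist) { bestDist = d; best = j; }
            }
            matches[i] = (bestDist <= maxDistance) ? best : -1; // -1: no match
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] reference = { 0b10110010, 0b01001101 };          // toy descriptors
        int[] scene     = { 0b01001100, 0b10110011, 0b11111111 };
        int[] m = match(scene, reference, 2);
        // scene[0] is 1 bit away from reference[1]; scene[1] is 1 bit
        // away from reference[0]; scene[2] is far from both.
        System.out.println(java.util.Arrays.toString(m)); // prints [1, 0, -1]
    }
}
```

OpenCV's BRUTEFORCE_HAMMING matcher works on the same principle, comparing every query descriptor against every training descriptor.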

There are many different techniques for performing each of the first three steps. OpenCV provides relevant classes called FeatureDetector, DescriptorExtractor, and DescriptorMatcher, each supporting several techniques. We will use a combination of techniques that OpenCV calls FeatureDetector.STAR, DescriptorExtractor.FREAK, and DescriptorMatcher.BRUTEFORCE_HAMMING. This combination is relatively fast and robust. Unlike some alternatives, it is scale-invariant and rotation-invariant, meaning that the target can be tracked from various distances and perspectives. Also, unlike some alternatives, it is not patented, so it is free to use even in commercial applications.
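As a sketch of how these classes fit together, the method below runs the first three steps using OpenCV's 2.x Java bindings. It assumes the OpenCV native library has already been loaded and that the caller supplies the reference and scene images as grayscale Mats; homography estimation (step 4, via Calib3d.findHomography) would follow once good matches are selected:

```java
import org.opencv.core.Mat;
import org.opencv.core.MatOfDMatch;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.features2d.DescriptorExtractor;
import org.opencv.features2d.DescriptorMatcher;
import org.opencv.features2d.FeatureDetector;

public class TrackingPipeline {

    public static MatOfDMatch matchDescriptors(Mat referenceImage,
                                               Mat sceneImage) {
        // Step 1: a STAR feature detector.
        FeatureDetector detector =
                FeatureDetector.create(FeatureDetector.STAR);
        // Step 2: a FREAK descriptor extractor.
        DescriptorExtractor extractor =
                DescriptorExtractor.create(DescriptorExtractor.FREAK);
        // Step 3: a brute-force matcher using Hamming distance.
        DescriptorMatcher matcher =
                DescriptorMatcher.create(DescriptorMatcher.BRUTEFORCE_HAMMING);

        MatOfKeyPoint refKeypoints = new MatOfKeyPoint();
        MatOfKeyPoint sceneKeypoints = new MatOfKeyPoint();
        detector.detect(referenceImage, refKeypoints);
        detector.detect(sceneImage, sceneKeypoints);

        Mat refDescriptors = new Mat();
        Mat sceneDescriptors = new Mat();
        extractor.compute(referenceImage, refKeypoints, refDescriptors);
        extractor.compute(sceneImage, sceneKeypoints, sceneDescriptors);

        MatOfDMatch matches = new MatOfDMatch();
        matcher.match(sceneDescriptors, refDescriptors, matches);
        return matches;
    }
}
```

In a real tracker, we would filter the matches (for example, by distance) before estimating the homography, since bad matches would skew the result.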