As already mentioned, you may often want to assess motion between two frames (or a sequence of frames) without any other prior knowledge about the content of those frames. Typically, the motion itself is what indicates that something interesting is going on. Optical flow is illustrated in Figure 10-3.
Figure 10-3. Optical flow: target features (upper left) are tracked over time and their movement is converted into velocity vectors (upper right); lower panels show a single image of the hallway (left) and flow vectors (right) as the camera moves down the hall (original images courtesy of Jean-Yves Bouguet)
We can associate some kind of velocity with each pixel in the frame or, equivalently, some displacement that represents the distance a pixel has moved between the previous frame and the current frame. Such a construction is usually referred to as a dense optical flow, which associates a velocity with every pixel in an image. The Horn-Schunck method [Horn81] attempts to compute just such a velocity field. One seemingly straightforward method—simply attempting to match windows around each pixel from one frame to the next—is also implemented in OpenCV; this is known as block matching. Both of these routines will be discussed in the "Dense Tracking Techniques" section.
In practice, calculating dense optical flow is not easy. Consider the motion of a white sheet of paper. Many of the white pixels in the previous frame will simply remain white in the next. Only the edges may change, and even then only those perpendicular to the direction of motion. The result is that dense methods must have some means of interpolating between points that are more easily tracked so as to resolve those points that are more ambiguous. These difficulties manifest themselves most clearly in the high computational costs of dense optical flow.
This leads us to the alternative option, sparse optical flow. Algorithms of this nature rely on some means of specifying beforehand the subset of points that are to be tracked. If these points have certain desirable properties, such as the "corners" discussed earlier, then the tracking will be relatively robust and reliable. We know that OpenCV can help us by providing routines for identifying the best features to track. For many practical applications, the computational cost of sparse tracking is so much less than dense tracking that the latter is relegated to only academic interest. [144]
The next few sections present some different methods of tracking. We begin by considering the most popular sparse tracking technique, Lucas-Kanade (LK) optical flow; this method also has an implementation that works with image pyramids, allowing us to track faster motions. We'll then move on to two dense techniques, the Horn-Schunck method and the block matching method.
The Lucas-Kanade (LK) algorithm [Lucas81], as originally proposed in 1981, was an attempt to produce dense results. Yet because the method is easily applied to a subset of the points in the input image, it has become an important sparse technique. The LK algorithm can be applied in a sparse context because it relies only on local information that is derived from some small window surrounding each of the points of interest. This is in contrast to the intrinsically global nature of the Horn and Schunck algorithm (more on this shortly). The disadvantage of using small local windows in Lucas-Kanade is that large motions can move points outside of the local window and thus become impossible for the algorithm to find. This problem led to the development of the "pyramidal" LK algorithm, which begins tracking at the highest level of an image pyramid (lowest detail) and works its way down to lower levels (finer detail). Tracking over image pyramids allows large motions to be caught by local windows.
Because this is an important and effective technique, we shall go into some mathematical detail; readers who prefer to forgo such details can skip to the function description and code. However, it is recommended that you at least scan the intervening text and figures, which describe the assumptions behind Lucas-Kanade optical flow, so that you'll have some intuition about what to do if tracking isn't working well.
The basic idea of the LK algorithm rests on three assumptions.
Brightness constancy. A pixel from the image of an object in the scene does not change in appearance as it (possibly) moves from frame to frame. For grayscale images (LK can also be done in color), this means we assume that the brightness of a pixel does not change as it is tracked from frame to frame.
Temporal persistence or "small movements". The image motion of a surface patch changes slowly in time. In practice, this means the temporal increments are fast enough relative to the scale of motion in the image that the object does not move much from frame to frame.
Spatial coherence. Neighboring points in a scene belong to the same surface, have similar motion, and project to nearby points on the image plane.
We now look at how these assumptions, which are illustrated in Figure 10-4, lead us to an effective tracking algorithm. The first requirement, brightness constancy, is just the requirement that pixels in one tracked patch look the same over time:
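Writing the brightness of the tracked point as f(x, t), and recalling that x itself depends on t, a standard way to state this requirement (the notation here is ours) is:

$$f(x, t) \equiv I\big(x(t), t\big) = I\big(x(t+dt),\, t+dt\big)$$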
Figure 10-4. Assumptions behind Lucas-Kanade optical flow: for a patch being tracked on an object in a scene, the patch's brightness doesn't change (top); motion is slow relative to the frame rate (lower left); and neighboring points stay neighbors (lower right) (component images courtesy of Michael Black [Black92])
That's simple enough, and it means that our tracked pixel intensity exhibits no change over time:
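Expressed as a derivative (again in standard notation):

$$\frac{\partial f(x)}{\partial t} = 0$$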
The second assumption, temporal persistence, essentially means that motions are small from frame to frame. In other words, we can view this change as approximating a derivative of the intensity with respect to time (i.e., we assert that the change between one frame and the next in a sequence is differentially small). To understand the implications of this assumption, first consider the case of a single spatial dimension.
In this case we can start with our brightness constancy equation, substitute the definition of the brightness f(x, t) while taking into account the implicit dependence of x on t, I(x(t), t), and then apply the chain rule for partial differentiation. This yields:
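One standard way of writing the result of that substitution and chain rule (subscripts denote partial derivatives) is:

$$\frac{\partial I}{\partial x}\bigg|_{t}\,\frac{\partial x}{\partial t} + \frac{\partial I}{\partial t}\bigg|_{x(t)} = 0 \qquad\text{or, more compactly,}\qquad I_x v + I_t = 0$$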
where I_x is the spatial derivative across the first image, I_t is the derivative between images over time, and v is the velocity we are looking for. We thus arrive at the simple equation for optical flow velocity in the simple one-dimensional case:
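Solving the compact form above for the velocity gives:

$$v = -\frac{I_t}{I_x}$$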
Let's now try to develop some intuition for the one-dimensional tracking problem. Consider Figure 10-5, which shows an "edge"—consisting of a high value on the left and a low value on the right—that is moving to the right along the x-axis. Our goal is to identify the velocity v at which the edge is moving, as plotted in the upper part of Figure 10-5. In the lower part of the figure we can see that our measurement of this velocity is just "rise over run," where the rise is over time and the run is the slope (spatial derivative). The negative sign corrects for the slope of x.
Figure 10-5 reveals another aspect to our optical flow formulation: our assumptions are probably not quite true. That is, image brightness is not really stable; and our time steps (which are set by the camera) are often not as fast relative to the motion as we'd like. Thus, our solution for the velocity is not exact. However, if we are "close enough" then we can iterate to a solution. Iteration is shown in Figure 10-6, where we use our first (inaccurate) estimate of velocity as the starting point for our next iteration and then repeat. Note that we can keep the same spatial derivative in x as computed on the first frame because of the brightness constancy assumption—pixels moving in x do not change. This reuse of the spatial derivative already calculated yields significant computational savings. The time derivative must still be recomputed each iteration and each frame, but if we are close enough to start with then these iterations will converge to near exactitude within about five iterations. This is known as Newton's method. If our first estimate was not close enough, then Newton's method will actually diverge.
Figure 10-5. Lucas-Kanade optical flow in one dimension: we can estimate the velocity of the moving edge (upper panel) by measuring the ratio of the derivative of the intensity over time divided by the derivative of the intensity over space
Figure 10-6. Iterating to refine the optical flow solution (Newton's method): using the same two images and the same spatial derivative (slope) we solve again for the time derivative; convergence to a stable solution usually occurs within a few iterations
Now that we've seen the one-dimensional solution, let's generalize it to images in two dimensions. At first glance, this seems simple: just add in the y coordinate. Slightly changing notation, we'll call the y component of velocity v and the x component of velocity u; then we have:
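With subscripts again denoting partial derivatives, the two-dimensional brightness constancy constraint becomes:

$$I_x u + I_y v + I_t = 0$$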
Unfortunately, for this single equation there are two unknowns for any given pixel. This means that measurements at the single-pixel level are underconstrained and cannot be used to obtain a unique solution for the two-dimensional motion at that point. Instead, we can only solve for the motion component that is perpendicular or "normal" to the line described by our flow equation. Figure 10-7 presents the mathematical and geometric details.
Figure 10-7. Two-dimensional optical flow at a single pixel: optical flow at one pixel is underdetermined and so can yield at most the component of motion that is perpendicular ("normal") to the line described by the flow equation (figure courtesy of Michael Black)
Normal optical flow results from the aperture problem, which arises when you have a small aperture or window in which to measure motion. When motion is detected with a small aperture, you often see only an edge, not a corner. But an edge alone is insufficient to determine exactly how (i.e., in what direction) the entire object is moving; see Figure 10-8.
So then how do we get around this problem that, at one pixel, we cannot resolve the full motion? We turn to the last optical flow assumption for help. If a local patch of pixels moves coherently, then we can easily solve for the motion of the central pixel by using the surrounding pixels to set up a system of equations. For example, if we use a 5-by-5 [145] window of brightness values (you can simply triple this for color-based optical flow) around the current pixel to compute its motion, we can then set up 25 equations as follows.
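A sketch of that system in matrix form (labeling the window's pixels p_1 through p_25 is our own convention):

$$\begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_{25}) & I_y(p_{25}) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_{25}) \end{bmatrix}$$

which we abbreviate as $A\,d = b$, with $d = [u\ v]^T$.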
Figure 10-8. Aperture problem: through the aperture window (upper row) we see an edge moving to the right but cannot detect the downward part of the motion (lower row)
We now have an overconstrained system that we can solve, provided the 5-by-5 window contains more than just an edge. To solve it, we set up a least-squares minimization of A d = b; in standard form this reduces to the 2-by-2 normal equations, from which we obtain our u and v motion components.
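Written out (standard normal-equations algebra, with the summations running over the pixels of the window), the chain of steps is:

$$\min_d \|A d - b\|^2 \qquad\Longrightarrow\qquad (A^T A)\,d = A^T b$$

In terms of the image derivatives the normal equations are

$$\begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}$$

and the solution is

$$d = (A^T A)^{-1} A^T b$$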
When can this be solved? It can be solved when (A^T A) is invertible, and (A^T A) is invertible when it has full rank (2), which occurs when it has two large eigenvalues. This will happen in image regions that include texture running in at least two directions. In this case, (A^T A) will have the best properties when the tracking window is centered over a corner region in an image. This ties us back to our earlier discussion of the Harris corner detector. In fact, those corners were "good features to track" (see our previous remarks concerning cvGoodFeaturesToTrack()) for precisely the reason that (A^T A) has two large eigenvalues there! We'll see shortly how all this computation is done for us by the cvCalcOpticalFlowLK() function.
The reader who understands the implications of our assuming small and coherent motions will now be bothered by the fact that, for most video cameras running at 30 Hz, large and noncoherent motions are commonplace. In fact, Lucas-Kanade optical flow by itself does not work very well for exactly this reason: we want a large window to catch large motions, but a large window too often breaks the coherent motion assumption! To circumvent this problem, we can track first over larger spatial scales using an image pyramid and then refine the initial motion velocity assumptions by working our way down the levels of the image pyramid until we arrive at the raw image pixels.
Hence, the recommended technique is first to solve for optical flow at the top layer and then to use the resulting motion estimates as the starting point for the next layer down. We continue down the pyramid in this manner until we reach the lowest level. Thus we minimize the violations of our motion assumptions and so can track faster and longer motions. This more elaborate technique is known as pyramid Lucas-Kanade optical flow and is illustrated in Figure 10-9. The OpenCV function that implements pyramid Lucas-Kanade optical flow is cvCalcOpticalFlowPyrLK(), which we examine next.
The routine that implements the nonpyramidal Lucas-Kanade dense optical flow algorithm is:
void cvCalcOpticalFlowLK(
    const CvArr*  imgA,
    const CvArr*  imgB,
    CvSize        winSize,
    CvArr*        velx,
    CvArr*        vely
);
The result arrays for this OpenCV routine are populated only for those pixels for which it is able to compute the minimum error. For pixels where this error (and thus the displacement) cannot be reliably computed, the associated velocity is set to 0. In most cases you will not want to use this routine; the pyramid-based method described next gives better results in most situations.
We come now to OpenCV's algorithm that computes Lucas-Kanade optical flow in a pyramid, cvCalcOpticalFlowPyrLK(). As we will see, this optical flow function makes use of "good features to track" and also returns indications of how well the tracking of each point is proceeding.
Figure 10-9. Pyramid Lucas-Kanade optical flow: running optical flow at the top of the pyramid first mitigates the problems caused by violating our assumptions of small and coherent motion; the motion estimate from the preceding level is taken as the starting point for estimating motion at the next layer down
void cvCalcOpticalFlowPyrLK(
    const CvArr*     imgA,
    const CvArr*     imgB,
    CvArr*           pyrA,
    CvArr*           pyrB,
    CvPoint2D32f*    featuresA,
    CvPoint2D32f*    featuresB,
    int              count,
    CvSize           winSize,
    int              level,
    char*            status,
    float*           track_error,
    CvTermCriteria   criteria,
    int              flags
);
This function has a lot of inputs, so let's take a moment to figure out what they all do. Once we have a handle on this routine, we can move on to the problem of which points to track and how to compute them.
The first two arguments of cvCalcOpticalFlowPyrLK() are the initial and final images; both should be single-channel, 8-bit images. The next two arguments are buffers allocated to store the pyramid images. The size of these buffers should be at least (img.width + 8)*img.height/3 bytes, [146] with one such buffer for each of the two input images (pyrA and pyrB). (If these two pointers are set to NULL then the routine will allocate, use, and free the appropriate memory when called, but this is not so good for performance.) The array featuresA contains the points for which the motion is to be found, and featuresB is a similar array into which the computed new locations of the points from featuresA are to be placed; count is the number of points in the featuresA list. The window used for computing the local coherent motion is given by winSize. Because we are constructing an image pyramid, the argument level is used to set the depth of the stack of images. If level is set to 0 then the pyramids are not used.

The array status is of length count; on completion of the routine, each entry in status will be either 1 (if the corresponding point was found in the second image) or 0 (if it was not). The track_error parameter is optional and can be turned off by setting it to NULL. If track_error is active then it is an array of numbers, one for each tracked point, equal to the difference between the patch around a tracked point in the first image and the patch around the location to which that point was tracked in the second image. You can use track_error to prune away points whose local appearance patch changes too much as the points move.
The next thing we need is the termination criteria. This is a structure used by many OpenCV algorithms that iterate to a solution:
cvTermCriteria(
    int    type,      // CV_TERMCRIT_ITER, CV_TERMCRIT_EPS, or both
    int    max_iter,
    double epsilon
);
Typically we use the cvTermCriteria() function to generate the structure we need. The first argument of this function is either CV_TERMCRIT_ITER or CV_TERMCRIT_EPS, which tells the algorithm that we want to terminate either after some number of iterations or when the convergence metric reaches some small value (respectively). The next two arguments set the values at which one, the other, or both of these criteria should terminate the algorithm. The reason we have both options is so we can set the type to CV_TERMCRIT_ITER | CV_TERMCRIT_EPS and thus stop when either limit is reached (this is what is done in most real code).
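For instance, a criterion like the one used in Example 10-1 later in this chapter (the values 20 and 0.03 are just reasonable defaults, not required settings) would be constructed as:

CvTermCriteria criteria = cvTermCriteria(
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,  // stop on whichever limit is hit first
    20,                                  // at most 20 iterations...
    0.03                                 // ...or an update smaller than 0.03
);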
Finally, flags allows for some fine control of the routine's internal bookkeeping; it may be set to any or all (using bitwise OR) of the following.
CV_LKFLOW_PYR_A_READY
The image pyramid for the first frame is calculated before the call and stored in pyrA.

CV_LKFLOW_PYR_B_READY
The image pyramid for the second frame is calculated before the call and stored in pyrB.

CV_LKFLOW_INITIAL_GUESSES
The array featuresB already contains an initial guess for the features' coordinates when the routine is called.
These flags are particularly useful when handling sequential video. The image pyramids are somewhat costly to compute, so recomputing them should be avoided whenever possible. The final frame for the frame pair you just computed will be the initial frame for the pair that you will compute next. If you allocated those buffers yourself (instead of asking the routine to do it for you), then the pyramids for each image will be sitting in those buffers when the routine returns. If you tell the routine that this information is already computed then it will not be recomputed. Similarly, if you computed the motion of points from the previous frame then you are in a good position to make good initial guesses for where they will be in the next frame.
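As a minimal sketch of this bookkeeping (the helper function and its argument names are our own, not part of OpenCV; it assumes a previous call to cvCalcOpticalFlowPyrLK() tracked points into curr and left curr's pyramid sitting in pyr_curr):

#include <cv.h>
#include <string.h>

// Track the features found in 'curr' into the newly arrived frame 'next',
// reusing the pyramid that the previous cvCalcOpticalFlowPyrLK() call already
// built for 'curr' and seeding the search with initial position guesses.
void track_next_frame(
    IplImage*      curr,      // second image of the previous frame pair
    IplImage*      next,      // new frame (8-bit, single-channel, same size)
    IplImage*      pyr_curr,  // pyramid for 'curr', filled by the last call
    IplImage*      pyr_next,  // scratch buffer; receives the pyramid for 'next'
    CvPoint2D32f*  features,  // point locations found in 'curr'
    CvPoint2D32f*  guesses,   // output array; also holds the initial guesses
    int            count,
    char*          status,
    CvSize         win_size,
    int            levels
) {
    // Simplest initial guess: assume each point has not moved since 'curr'.
    memcpy( guesses, features, count * sizeof(features[0]) );

    cvCalcOpticalFlowPyrLK(
        curr, next,
        pyr_curr, pyr_next,
        features, guesses, count,
        win_size, levels,
        status,
        NULL,                                  // skip per-point tracking error
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.03 ),
        CV_LKFLOW_PYR_A_READY | CV_LKFLOW_INITIAL_GUESSES
    );
    // On return, pyr_next holds the pyramid for 'next'; on the following frame,
    // pass it in as pyr_curr (i.e., swap the roles of the two buffers).
}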
So the basic plan is simple: you supply the images, list the points you want to track in featuresA, and call the routine. When the routine returns, you check the status array to see which points were successfully tracked and then check featuresB to find the new locations of those points.
This leads us back to that issue we put aside earlier: how to decide which features are good ones to track. Earlier we encountered the OpenCV routine cvGoodFeaturesToTrack(), which uses the method originally proposed by Shi and Tomasi to solve this problem in a reliable way. In most cases, good results are obtained by using the combination of cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(). Of course, you can also use your own criteria to determine which points to track.
Let's now look at a simple example (Example 10-1) that uses both cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(); see also Figure 10-10.
Example 10-1. Pyramid Lucas-Kanade optical flow code
// Pyramid L-K optical flow example
//
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>
#include <stdio.h>

const int MAX_CORNERS = 500;

int main(int argc, char** argv) {

  // Initialize, load two images from the file system, and
  // allocate the images and other structures we will need for
  // results.
  //
  IplImage* imgA = cvLoadImage("image0.jpg",CV_LOAD_IMAGE_GRAYSCALE);
  IplImage* imgB = cvLoadImage("image1.jpg",CV_LOAD_IMAGE_GRAYSCALE);

  CvSize img_sz   = cvGetSize( imgA );
  int    win_size = 10;

  IplImage* imgC = cvLoadImage(
    "../Data/OpticalFlow1.jpg",
    CV_LOAD_IMAGE_UNCHANGED
  );

  // The first thing we need to do is get the features
  // we want to track.
  //
  IplImage* eig_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  IplImage* tmp_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );

  int           corner_count = MAX_CORNERS;
  CvPoint2D32f* cornersA     = new CvPoint2D32f[ MAX_CORNERS ];

  cvGoodFeaturesToTrack(
    imgA,
    eig_image,
    tmp_image,
    cornersA,
    &corner_count,
    0.01,
    5.0,
    0,
    3,
    0,
    0.04
  );

  cvFindCornerSubPix(
    imgA,
    cornersA,
    corner_count,
    cvSize(win_size,win_size),
    cvSize(-1,-1),
    cvTermCriteria(CV_TERMCRIT_ITER|CV_TERMCRIT_EPS,20,0.03)
  );

  // Call the Lucas Kanade algorithm
  //
  char  features_found[ MAX_CORNERS ];
  float feature_errors[ MAX_CORNERS ];

  CvSize pyr_sz = cvSize( imgA->width+8, imgB->height/3 );

  IplImage* pyrA = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  IplImage* pyrB = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );

  CvPoint2D32f* cornersB = new CvPoint2D32f[ MAX_CORNERS ];

  cvCalcOpticalFlowPyrLK(
    imgA,
    imgB,
    pyrA,
    pyrB,
    cornersA,
    cornersB,
    corner_count,
    cvSize( win_size,win_size ),
    5,
    features_found,
    feature_errors,
    cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, .3 ),
    0
  );

  // Now make some image of what we are looking at:
  //
  for( int i=0; i<corner_count; i++ ) {
    if( features_found[i]==0 || feature_errors[i]>550 ) {
      printf("Error is %f\n",feature_errors[i]);
      continue;
    }
    printf("Got it\n");
    CvPoint p0 = cvPoint(
      cvRound( cornersA[i].x ),
      cvRound( cornersA[i].y )
    );
    CvPoint p1 = cvPoint(
      cvRound( cornersB[i].x ),
      cvRound( cornersB[i].y )
    );
    cvLine( imgC, p0, p1, CV_RGB(255,0,0), 2 );
  }

  cvNamedWindow("ImageA",0);
  cvNamedWindow("ImageB",0);
  cvNamedWindow("LKpyr_OpticalFlow",0);
  cvShowImage("ImageA",imgA);
  cvShowImage("ImageB",imgB);
  cvShowImage("LKpyr_OpticalFlow",imgC);
  cvWaitKey(0);

  return 0;
}
OpenCV contains two other optical flow techniques that are now seldom used. These routines are typically much slower than Lucas-Kanade; moreover, they do not support matching within an image scale pyramid (though in principle they could) and so cannot track large motions. We will discuss them briefly in this section.
Figure 10-10. Sparse optical flow from pyramid Lucas-Kanade: the center image is one video frame after the left image; the right image illustrates the computed motion of the "good features to track" (lower right shows flow vectors against a dark background for increased visibility)
The method of Horn and Schunck was developed in 1981 [Horn81]. This technique was one of the first to make use of the brightness constancy assumption and to derive the basic brightness constancy equations. Horn and Schunck solved those equations by hypothesizing a smoothness constraint on the velocities v_x and v_y, derived by minimizing the regularized Laplacian of the optical flow velocity components.
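In one standard formulation (conventions differ between authors; some weight the smoothness term by α² rather than α), the quantity being minimized and the resulting Euler-Lagrange equations are:

$$E(v_x, v_y) = \iint \Big[ (I_x v_x + I_y v_y + I_t)^2 + \alpha \big( \|\nabla v_x\|^2 + \|\nabla v_y\|^2 \big) \Big]\, dx\, dy$$

$$I_x (I_x v_x + I_y v_y + I_t) = \alpha\, \nabla^2 v_x, \qquad I_y (I_x v_x + I_y v_y + I_t) = \alpha\, \nabla^2 v_y$$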
Here α is a constant weighting coefficient known as the regularization constant. Larger values of α lead to smoother (i.e., more locally consistent) vectors of motion flow. This is a relatively simple constraint for enforcing smoothness, and its effect is to penalize regions in which the flow is changing in magnitude. As with Lucas-Kanade, the Horn-Schunck technique relies on iterations to solve the differential equations. The function that computes this is:
void cvCalcOpticalFlowHS(
    const CvArr*    imgA,
    const CvArr*    imgB,
    int             usePrevious,
    CvArr*          velx,
    CvArr*          vely,
    double          lambda,
    CvTermCriteria  criteria
);
Here imgA and imgB must be 8-bit, single-channel images. The x and y velocity results will be stored in velx and vely, which must be 32-bit, floating-point, single-channel images. The usePrevious parameter tells the algorithm to use the velx and vely velocities computed from a previous frame as the initial starting point for computing the new velocities. The parameter lambda is a weight related to the Lagrange multiplier. You are probably asking yourself: "What Lagrange multiplier?" [147] The Lagrange multiplier arises when we attempt to minimize (simultaneously) both the motion-brightness equation and the smoothness equations; it represents the relative weight given to the errors in each as we minimize.
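A minimal usage sketch follows (the file names and termination values are placeholders, and lambda is simply set to 1 as suggested in the footnote):

#include <cv.h>
#include <highgui.h>

int main() {
    // Two consecutive frames, loaded as 8-bit grayscale.
    IplImage* imgA = cvLoadImage( "frame0.jpg", CV_LOAD_IMAGE_GRAYSCALE );
    IplImage* imgB = cvLoadImage( "frame1.jpg", CV_LOAD_IMAGE_GRAYSCALE );

    // Dense velocity fields: one 32-bit float per pixel for each component.
    IplImage* velx = cvCreateImage( cvGetSize(imgA), IPL_DEPTH_32F, 1 );
    IplImage* vely = cvCreateImage( cvGetSize(imgA), IPL_DEPTH_32F, 1 );

    cvCalcOpticalFlowHS(
        imgA, imgB,
        0,            // usePrevious: no velocities carried over from a prior frame
        velx, vely,
        1.0,          // lambda
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 64, 0.01 )
    );

    // velx and vely now hold the x and y flow components at every pixel.
    cvReleaseImage( &velx ); cvReleaseImage( &vely );
    cvReleaseImage( &imgA ); cvReleaseImage( &imgB );
    return 0;
}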
You might be thinking: "What's the big deal with optical flow? Just match where pixels in one frame went to in the next frame." This is exactly what others have done. The term "block matching" is a catchall for a whole class of similar algorithms in which the image is divided into small regions called blocks [Huang95; Beauchemin95]. Blocks are typically square and contain some number of pixels. These blocks may overlap and, in practice, often do. Block-matching algorithms attempt to divide both the previous and current images into such blocks and then compute the motion of these blocks. Algorithms of this kind play an important role in many video compression algorithms as well as in optical flow for computer vision.
Because block-matching algorithms operate on aggregates of pixels, not on individual pixels, the returned "velocity images" are typically of lower resolution than the input images. This is not always the case; it depends on the severity of the overlap between the blocks. The size of the result images is given by the following formula:
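The exact rounding convention depends on the OpenCV release, so verify against the documentation for your version; a commonly quoted form for the width and height of velx and vely is:

$$\left\lfloor \frac{\texttt{prev->width} - \texttt{block\_size.width}}{\texttt{shift\_size.width}} \right\rfloor \;\times\; \left\lfloor \frac{\texttt{prev->height} - \texttt{block\_size.height}}{\texttt{shift\_size.height}} \right\rfloor$$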
The implementation in OpenCV uses a spiral search that works out from the location of the original block (in the previous frame) and compares the candidate new blocks with the original. This comparison is a sum of absolute differences of the pixels (i.e., an L1 distance). If a good enough match is found, the search is terminated. Here's the function prototype:
void cvCalcOpticalFlowBM(
    const CvArr*  prev,
    const CvArr*  curr,
    CvSize        block_size,
    CvSize        shift_size,
    CvSize        max_range,
    int           use_previous,
    CvArr*        velx,
    CvArr*        vely
);
The arguments are straightforward. The prev and curr parameters are the previous and current images; both should be 8-bit, single-channel images. The block_size is the size of the block to be used, and shift_size is the step size between blocks (this parameter controls whether, and if so by how much, the blocks will overlap). The max_range parameter is the size of the region around a given block that will be searched for a corresponding block in the subsequent frame. If set, use_previous indicates that the values in velx and vely should be taken as starting points for the block searches. [148] Finally, velx and vely are themselves 32-bit single-channel images that will store the computed motions of the blocks. As mentioned previously, motion is computed at a block-by-block level and so the coordinates of the result images are for the blocks (i.e., aggregates of pixels), not for the individual pixels of the original image.
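Again a rough sketch (file names and parameter values are placeholders; the result-image size follows the formula quoted above and should be double-checked against your OpenCV version):

#include <cv.h>
#include <highgui.h>

int main() {
    IplImage* prev = cvLoadImage( "frame0.jpg", CV_LOAD_IMAGE_GRAYSCALE );
    IplImage* curr = cvLoadImage( "frame1.jpg", CV_LOAD_IMAGE_GRAYSCALE );

    CvSize block_size = cvSize( 16, 16 );
    CvSize shift_size = cvSize( 16, 16 );   // non-overlapping blocks
    CvSize max_range  = cvSize( 8, 8 );

    // One velocity value per block (see the size formula above).
    CvSize vel_sz = cvSize(
        ( prev->width  - block_size.width  ) / shift_size.width,
        ( prev->height - block_size.height ) / shift_size.height
    );
    IplImage* velx = cvCreateImage( vel_sz, IPL_DEPTH_32F, 1 );
    IplImage* vely = cvCreateImage( vel_sz, IPL_DEPTH_32F, 1 );

    cvCalcOpticalFlowBM(
        prev, curr,
        block_size, shift_size, max_range,
        0,                 // use_previous: start each block search from scratch
        velx, vely
    );
    return 0;
}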
[144] Black and Anandan have created dense optical flow techniques [Black93; Black96] that are often used in movie production, where, for the sake of visual quality, the movie studio is willing to spend the time necessary to obtain detailed flow information. These techniques are slated for inclusion in later versions of OpenCV (see Chapter 14).
[145] Of course, the window could be 3-by-3, 7-by-7, or anything you choose. If the window is too large then you will end up violating the coherent motion assumption and will not be able to track well. If the window is too small, you will encounter the aperture problem again.
[146] If you are wondering why the funny size, it's because these scratch spaces need to accommodate not just the image itself but the entire pyramid.
[147] You might even be asking yourself: "What is a Lagrange multiplier?" In that case, it may be best to ignore this part of the paragraph and just set lambda equal to 1.
[148] If use_previous==0, then the search for a block will be conducted over a region of max_range distance from the location of the original block. If use_previous!=0, then the center of that search is first displaced by the velocities velx(x, y) and vely(x, y) computed for that block on the previous call.