Optical Flow

As already mentioned, you may often want to assess motion between two frames (or a sequence of frames) without any other prior knowledge about the content of those frames. Typically, the motion itself is what indicates that something interesting is going on. Optical flow is illustrated in Figure 10-3.

Figure 10-3. Optical flow: target features (upper left) are tracked over time and their movement is converted into velocity vectors (upper right); lower panels show a single image of the hallway (left) and flow vectors (right) as the camera moves down the hall (original images courtesy of Jean-Yves Bouguet)

We can associate some kind of velocity with each pixel in the frame or, equivalently, some displacement that represents the distance a pixel has moved between the previous frame and the current frame. Such a construction is usually referred to as a dense optical flow, which associates a velocity with every pixel in an image. The Horn-Schunck method [Horn81] attempts to compute just such a velocity field. One seemingly straightforward method—simply attempting to match windows around each pixel from one frame to the next—is also implemented in OpenCV; this is known as block matching. Both of these routines will be discussed in the "Dense Tracking Techniques" section.

In practice, calculating dense optical flow is not easy. Consider the motion of a white sheet of paper. Many of the white pixels in the previous frame will simply remain white in the next. Only the edges may change, and even then only those perpendicular to the direction of motion. The result is that dense methods must have some method of interpolating between points that are more easily tracked so as to solve for those points that are more ambiguous. These difficulties manifest themselves most clearly in the high computational costs of dense optical flow.

This leads us to the alternative option, sparse optical flow. Algorithms of this nature rely on some means of specifying beforehand the subset of points that are to be tracked. If these points have certain desirable properties, such as the "corners" discussed earlier, then the tracking will be relatively robust and reliable. We know that OpenCV can help us by providing routines for identifying the best features to track. For many practical applications, the computational cost of sparse tracking is so much less than dense tracking that the latter is relegated to only academic interest. [144]

The next few sections present some different methods of tracking. We begin by considering the most popular sparse tracking technique, Lucas-Kanade (LK) optical flow; this method also has an implementation that works with image pyramids, allowing us to track faster motions. We'll then move on to two dense techniques, the Horn-Schunck method and the block matching method.

The Lucas-Kanade (LK) algorithm [Lucas81], as originally proposed in 1981, was an attempt to produce dense results. Yet because the method is easily applied to a subset of the points in the input image, it has become an important sparse technique. The LK algorithm can be applied in a sparse context because it relies only on local information that is derived from some small window surrounding each of the points of interest. This is in contrast to the intrinsically global nature of the Horn and Schunck algorithm (more on this shortly). The disadvantage of using small local windows in Lucas-Kanade is that large motions can move points outside of the local window, where the algorithm can no longer find them. This problem led to development of the "pyramidal" LK algorithm, which starts tracking from the highest level of an image pyramid (lowest detail) and works down to lower levels (finer detail). Tracking over image pyramids allows large motions to be caught by local windows.

Because this is an important and effective technique, we shall go into some mathematical detail; readers who prefer to forgo such details can skip to the function description and code. However, it is recommended that you at least scan the intervening text and figures, which describe the assumptions behind Lucas-Kanade optical flow, so that you'll have some intuition about what to do if tracking isn't working well.

The basic idea of the LK algorithm rests on three assumptions:

1. Brightness constancy. A pixel from the image of an object in the scene does not change in appearance as it (possibly) moves from frame to frame. For grayscale images (LK can also be done in color), this means we assume that the brightness of a pixel does not change as it is tracked from frame to frame.

2. Temporal persistence or "small movements". The image motion of a surface patch changes slowly in time. In practice, this means the temporal increments are fast enough relative to the scale of motion in the image that the object does not move much from frame to frame.

3. Spatial coherence. Neighboring points in a scene belong to the same surface, have similar motion, and project to nearby points on the image plane.

We now look at how these assumptions, which are illustrated in Figure 10-4, lead us to an effective tracking algorithm. The first requirement, brightness constancy, is just the requirement that pixels in one tracked patch look the same over time:

$$f(x, t) \;\equiv\; I\big(x(t),\, t\big) \;=\; I\big(x(t+dt),\, t+dt\big)$$

That's simple enough, and it means that our tracked pixel intensity exhibits no change over time:

$$\frac{\partial f(x)}{\partial t} = 0$$

The second assumption, temporal persistence, essentially means that motions are small from frame to frame. In other words, we can view this change as approximating a derivative of the intensity with respect to time (i.e., we assert that the change between one frame and the next in a sequence is differentially small). To understand the implications of this assumption, first consider the case of a single spatial dimension.

In this case we can start with our brightness constancy equation, substitute the definition of the brightness f(x, t) while taking into account the implicit dependence of x on t, I(x(t), t), and then apply the chain rule for partial differentiation. This yields:

$$\frac{\partial I}{\partial x}\,\frac{dx}{dt} \;+\; \frac{\partial I}{\partial t} \;=\; I_x\,v + I_t \;=\; 0$$

where $I_x$ is the spatial derivative across the first image, $I_t$ is the derivative between images over time, and $v$ is the velocity we are looking for. We thus arrive at the simple equation for optical flow velocity in the simple one-dimensional case:

$$v = -\frac{I_t}{I_x}$$

Let's now try to develop some intuition for the one-dimensional tracking problem. Consider Figure 10-5, which shows an "edge"—consisting of a high value on the left and a low value on the right—that is moving to the right along the x-axis. Our goal is to identify the velocity v at which the edge is moving, as plotted in the upper part of Figure 10-5. In the lower part of the figure we can see that our measurement of this velocity is just "rise over run," where the rise is over time and the run is the slope (spatial derivative). The negative sign corrects for the slope of x.
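To see how the sign works out, consider some purely illustrative numbers (made up for this example): suppose the falling edge has spatial slope $I_x = -10$ intensity units per pixel, and that at our pixel the brightness rises by $I_t = +20$ units from one frame to the next as the bright side of the edge sweeps in. Then

$$v = -\frac{I_t}{I_x} = -\frac{+20}{-10} = 2 \ \text{pixels per frame},$$

which is motion to the right, as expected.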

Figure 10-5 reveals another aspect to our optical flow formulation: our assumptions are probably not quite true. That is, image brightness is not really stable; and our time steps (which are set by the camera) are often not as fast relative to the motion as we'd like. Thus, our solution for the velocity is not exact. However, if we are "close enough" then we can iterate to a solution. Iteration is shown in Figure 10-6, where we use our first (inaccurate) estimate of velocity as the starting point for our next iteration and then repeat. Note that we can keep the same spatial derivative in x as computed on the first frame because of the brightness constancy assumption—pixels moving in x do not change. This reuse of the spatial derivative already calculated yields significant computational savings. The time derivative must still be recomputed each iteration and each frame, but if we are close enough to start with then these iterations will converge to near exactitude within about five iterations. This is known as Newton's method. If our first estimate was not close enough, then Newton's method will actually diverge.
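To make the iteration concrete, here is a minimal sketch in plain C of the one-dimensional refinement just described. This is not an OpenCV routine; the function name refine_velocity_1d, the simple central difference, and the nearest-sample lookup are our own illustrative choices, and bounds checking and sub-pixel interpolation are omitted for brevity.

#include <math.h>

// Estimate how far the signal curr[] has shifted relative to prev[] in the
// neighborhood of position x, using Newton iterations. The spatial derivative
// is computed once on prev[] (brightness constancy lets us reuse it), while
// the temporal difference is re-evaluated at the current velocity estimate.
double refine_velocity_1d( const double* prev, const double* curr,
                           int x, int iterations ) {
    double Ix = 0.5*( prev[x+1] - prev[x-1] );  // spatial derivative, computed once
    double v  = 0.0;                            // running velocity estimate
    for( int k = 0; k < iterations; k++ ) {
        int    xs = x + (int)floor( v + 0.5 );  // nearest sample at the current shift
                                                // (real code would interpolate)
        double It = curr[xs] - prev[x];         // temporal difference at that shift
        if( fabs( Ix ) < 1e-6 ) break;          // flat signal: nothing to track
        v -= It/Ix;                             // Newton update: v <- v - It/Ix
    }
    return v;
}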

Now that we've seen the one-dimensional solution, let's generalize it to images in two dimensions. At first glance, this seems simple: just add in the y coordinate. Slightly changing notation, we'll call the y component of velocity v and the x component of velocity u; then we have:

$$I_x u + I_y v + I_t = 0$$

Unfortunately, for this single equation there are two unknowns for any given pixel. This means that measurements at the single-pixel level are underconstrained and cannot be used to obtain a unique solution for the two-dimensional motion at that point. Instead, we can only solve for the motion component that is perpendicular or "normal" to the line described by our flow equation. Figure 10-7 presents the mathematical and geometric details.
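For reference, the result that Figure 10-7 presents can be written compactly. From the single constraint $I_x u + I_y v + I_t = 0$, only the component of the flow along the image gradient $\nabla I = (I_x, I_y)$ can be recovered:

$$\mathbf{v}_\perp = -\frac{I_t}{\|\nabla I\|}\,\frac{\nabla I}{\|\nabla I\|}, \qquad \|\mathbf{v}_\perp\| = \frac{|I_t|}{\sqrt{I_x^2 + I_y^2}}.$$

The component of motion perpendicular to the gradient (i.e., along the edge itself) is left completely undetermined.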

Normal optical flow results from the aperture problem, which arises when you have a small aperture or window in which to measure motion. When motion is detected with a small aperture, you often see only an edge, not a corner. But an edge alone is insufficient to determine exactly how (i.e., in what direction) the entire object is moving; see Figure 10-8.

So then how do we get around this problem that, at one pixel, we cannot resolve the full motion? We turn to the last optical flow assumption, spatial coherence, for help. If a local patch of pixels moves coherently, then we can easily solve for the motion of the central pixel by using the surrounding pixels to set up a system of equations. For example, if we use a 5-by-5 [145] window of brightness values (you can simply triple this for color-based optical flow) around the current pixel to compute its motion, we can then set up 25 equations as follows.

$$\begin{bmatrix} I_x(\mathbf{p}_1) & I_y(\mathbf{p}_1) \\ I_x(\mathbf{p}_2) & I_y(\mathbf{p}_2) \\ \vdots & \vdots \\ I_x(\mathbf{p}_{25}) & I_y(\mathbf{p}_{25}) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} I_t(\mathbf{p}_1) \\ I_t(\mathbf{p}_2) \\ \vdots \\ I_t(\mathbf{p}_{25}) \end{bmatrix}$$

which we can write compactly as $A\,d = b$, where $A$ is the 25-by-2 matrix of spatial derivatives, $d = (u, v)^{T}$ is the flow we seek, and $b$ collects the (negated) temporal derivatives.

We now have an overconstrained system that we can solve, provided it contains more than just an edge in that 5-by-5 window. To solve this system, we set up a least-squares minimization of the equation, whereby

$$\min_{d}\; \| A\,d - b \|^{2}$$

is solved in standard form as:

$$\left(A^{T}A\right) d = A^{T} b$$

From this relation we obtain our u and v motion components. Writing this out in more detail yields:

$$\begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}$$

The solution to this equation is then:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \left(A^{T}A\right)^{-1} A^{T} b$$

When can this be solved? When $A^{T}A$ is invertible. And $A^{T}A$ is invertible when it has full rank (2), which occurs when it has two large eigenvalues. This will happen in image regions that include texture running in at least two directions. In this case, $A^{T}A$ has the best properties when the tracking window is centered over a corner region in an image. This ties us back to our earlier discussion of the Harris corner detector. In fact, those corners were "good features to track" (see our previous remarks concerning cvGoodFeaturesToTrack()) for precisely the reason that $A^{T}A$ has two large eigenvalues there! We'll see shortly how all this computation is done for us by the cvCalcOpticalFlowLK() function.
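To make the bookkeeping concrete, the following plain-C fragment sketches how the 2-by-2 system could be assembled and solved for a single window. The function and variable names here are our own illustrative choices; cvCalcOpticalFlowLK() performs this work, with appropriate windowing and weighting, internally.

// Accumulate the entries of (A^T A) and A^T b over a (2r+1)x(2r+1) window
// centered at (cx, cy), then solve the 2x2 system for the flow (u, v).
// Ix, Iy are spatial derivative images of the first frame; It is the temporal
// difference image (second frame minus first); all have 'width' columns.
// Returns 0 on success, -1 if the matrix is (nearly) singular -- i.e., the
// window does not contain corner-like texture.
int solve_lk_window( const float* Ix, const float* Iy, const float* It,
                     int width, int cx, int cy, int r,
                     float* u, float* v ) {
    double sxx = 0, sxy = 0, syy = 0, sxt = 0, syt = 0;
    for( int y = cy - r; y <= cy + r; y++ ) {
        for( int x = cx - r; x <= cx + r; x++ ) {
            double ix = Ix[y*width + x], iy = Iy[y*width + x], it = It[y*width + x];
            sxx += ix*ix;  sxy += ix*iy;  syy += iy*iy;
            sxt += ix*it;  syt += iy*it;
        }
    }
    double det = sxx*syy - sxy*sxy;              // determinant of A^T A
    if( det < 1e-6 ) return -1;                  // not invertible: edge or flat region
    *u = (float)(( -syy*sxt + sxy*syt ) / det);  // [u v]^T = -(A^T A)^{-1} A^T [It]
    *v = (float)((  sxy*sxt - sxx*syt ) / det);
    return 0;
}

The determinant test is just the invertibility condition discussed above; in practice one would compare the smaller eigenvalue of $A^{T}A$ against a threshold, which is essentially what the "good features to track" criterion does.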

The reader who understands the implications of our assuming small and coherent motions will now be bothered by the fact that, for most video cameras running at 30 Hz, large and noncoherent motions are commonplace. In fact, Lucas-Kanade optical flow by itself does not work very well for exactly this reason: we want a large window to catch large motions, but a large window too often breaks the coherent motion assumption! To circumvent this problem, we can track first over larger spatial scales using an image pyramid and then refine the initial motion velocity assumptions by working our way down the levels of the image pyramid until we arrive at the raw image pixels.

Hence, the recommended technique is first to solve for optical flow at the top layer and then to use the resulting motion estimates as the starting point for the next layer down. We continue going down the pyramid in this manner until we reach the lowest level. Thus we minimize the violations of our motion assumptions and so can track faster and longer motions. This more elaborate technique is known as pyramidal Lucas-Kanade optical flow and is illustrated in Figure 10-9. The OpenCV function that implements it is cvCalcOpticalFlowPyrLK(), which we examine next.

We come now to OpenCV's algorithm that computes Lucas-Kanade optical flow in a pyramid, cvCalcOpticalFlowPyrLK(). As we will see, this optical flow function makes use of "good features to track" and also returns indications of how well the tracking of each point is proceeding.

void cvCalcOpticalFlowPyrLK(
  const CvArr*    imgA,
  const CvArr*    imgB,
  CvArr*          pyrA,
  CvArr*          pyrB,
  CvPoint2D32f*   featuresA,
  CvPoint2D32f*   featuresB,
  int             count,
  CvSize          winSize,
  int             level,
  char*           status,
  float*          track_error,
  CvTermCriteria  criteria,
  int             flags
);

This function has a lot of inputs, so let's take a moment to figure out what they all do. Once we have a handle on this routine, we can move on to the problem of which points to track and how to compute them.

The first two arguments of cvCalcOpticalFlowPyrLK() are the initial and final images; both should be single-channel, 8-bit images. The next two arguments are buffers allocated to store the pyramid images. The size of these buffers should be at least (img.width + 8)*img.height/3 bytes, [146] with one such buffer for each of the two input images (pyrA and pyrB). (If these two pointers are set to NULL then the routine will allocate, use, and free the appropriate memory when called, but this is not so good for performance.) The array featuresA contains the points for which the motion is to be found, and featuresB is a similar array into which the computed new locations of the points from featuresA are to be placed; count is the number of points in the featuresA list. The window used for computing the local coherent motion is given by winSize. Because we are constructing an image pyramid, the argument level is used to set the depth of the stack of images. If level is set to 0 then the pyramids are not used. The array status is of length count; on completion of the routine, each entry in status will be either 1 (if the corresponding point was found in the second image) or 0 (if it was not). The track_error parameter is optional and can be turned off by setting it to NULL. If track_error is active then it is an array of numbers, one for each tracked point, equal to the difference between the patch around a tracked point in the first image and the patch around the location to which that point was tracked in the second image. You can use track_error to prune away points whose local appearance patch changes too much as the points move.

The next thing we need is the termination criteria. This is a structure used by many OpenCV algorithms that iterate to a solution:

cvTermCriteria(
    int    type,      // CV_TERMCRIT_ITER, CV_TERMCRIT_EPS, or both
    int    max_iter,
    double epsilon
);

Typically we use the cvTermCriteria() function to generate the structure we need. The first argument of this function is either CV_TERMCRIT_ITER or CV_TERMCRIT_EPS, which tells the algorithm that we want to terminate either after some number of iterations or when the convergence metric reaches some small value (respectively). The next two arguments set the values at which one, the other, or both of these criteria should terminate the algorithm. The reason we have both options is so we can set the type to CV_TERMCRIT_ITER | CV_TERMCRIT_EPS and thus stop when either limit is reached (this is what is done in most real code).
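For example, a termination criterion that stops after at most 20 iterations or once the update falls below 0.3, whichever comes first (the same values used later in Example 10-1), is created like this:

CvTermCriteria criteria = cvTermCriteria(
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,   // stop on whichever limit is hit first
    20,                                   // max_iter: at most 20 iterations
    0.3                                   // epsilon: convergence threshold
);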

Finally, flags allows for some fine control of the routine's internal bookkeeping; it may be set to any or all (using bitwise OR) of the following.

CV_LKFLOW_PYR_A_READY: The image pyramid for the first frame was computed before this call and is stored in pyrA.

CV_LKFLOW_PYR_B_READY: The image pyramid for the second frame was computed before this call and is stored in pyrB.

CV_LKFLOW_INITIAL_GUESSES: The array featuresB already contains initial guesses for the feature locations when the routine is called.

These flags are particularly useful when handling sequential video. The image pyramids are somewhat costly to compute, so recomputing them should be avoided whenever possible. The final frame for the frame pair you just computed will be the initial frame for the pair that you will compute next. If you allocated those buffers yourself (instead of asking the routine to do it for you), then the pyramids for each image will be sitting in those buffers when the routine returns. If you tell the routine that this information is already computed then it will not be recomputed. Similarly, if you computed the motion of points from the previous frame then you are in a good position to make good initial guesses for where they will be in the next frame.
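Below is a sketch of how these flags might be used when processing a video stream. It reuses the variable names from Example 10-1 and assumes that the capture source, the grayscale images, the pyramid buffers, and the initial features in cornersA have already been set up, so treat it as an outline rather than complete code.

// Assumed already set up (as in Example 10-1): capture, imgA, imgB, pyrA,
// pyrB, cornersA, cornersB, corner_count, win_size, features_found,
// feature_errors. cornersA holds the features found in imgA.
IplImage*     swap_img;
IplImage*     swap_pyr;
CvPoint2D32f* swap_corners;
int           flags = 0;                    // nothing precomputed before the first call

for(;;) {
    IplImage* frame = cvQueryFrame( capture );
    if( !frame ) break;
    cvCvtColor( frame, imgB, CV_BGR2GRAY ); // current frame, converted to grayscale

    cvCalcOpticalFlowPyrLK(
        imgA, imgB, pyrA, pyrB,
        cornersA, cornersB, corner_count,
        cvSize( win_size, win_size ), 5,
        features_found, feature_errors,
        cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, .3 ),
        flags
    );

    // ... draw or otherwise use cornersB here ...

    // The pyramid just built for imgB becomes the pyramid for imgA on the
    // next pass, so it need not be recomputed, and the freshly tracked
    // positions make reasonable initial guesses for the next frame.
    CV_SWAP( imgA,     imgB,     swap_img     );
    CV_SWAP( pyrA,     pyrB,     swap_pyr     );
    CV_SWAP( cornersA, cornersB, swap_corners );
    for( int i = 0; i < corner_count; i++ )
        cornersB[i] = cornersA[i];          // initial guess: same place as before
    flags = CV_LKFLOW_PYR_A_READY | CV_LKFLOW_INITIAL_GUESSES;

    // In a real application you would also periodically re-run
    // cvGoodFeaturesToTrack() to replace features that have been lost.
}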

So the basic plan is simple: you supply the images, list the points you want to track in featuresA, and call the routine. When the routine returns, you check the status array to see which points were successfully tracked and then check featuresB to find the new locations of those points.

This leads us back to that issue we put aside earlier: how to decide which features are good ones to track. Earlier we encountered the OpenCV routine cvGoodFeaturesToTrack(), which uses the method originally proposed by Shi and Tomasi to solve this problem in a reliable way. In most cases, good results are obtained by using the combination of cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(). Of course, you can also use your own criteria to determine which points to track.

Let's now look at a simple example (Example 10-1) that uses both cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(); see also Figure 10-10.

Example 10-1. Pyramid Lucas-Kanade optical flow code

// Pyramid L-K optical flow example
//
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>

const int MAX_CORNERS = 500;

int main(int argc, char** argv) {

  // Initialize, load two images from the file system, and
  // allocate the images and other structures we will need for
  // results.
  //
  IplImage* imgA = cvLoadImage("image0.jpg",CV_LOAD_IMAGE_GRAYSCALE);
  IplImage* imgB = cvLoadImage("image1.jpg",CV_LOAD_IMAGE_GRAYSCALE);

  CvSize    img_sz = cvGetSize( imgA );
  int       win_size = 10;

  IplImage* imgC = cvLoadImage(
    "../Data/OpticalFlow1.jpg",
    CV_LOAD_IMAGE_UNCHANGED
  );

  // The first thing we need to do is get the features
  // we want to track.
  //
  IplImage* eig_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  IplImage* tmp_image = cvCreateImage( img_sz, IPL_DEPTH_32F, 1 );
  
  int           corner_count = MAX_CORNERS;
  CvPoint2D32f* cornersA     = new CvPoint2D32f[ MAX_CORNERS ];
  
  cvGoodFeaturesToTrack(
    imgA,
    eig_image,
    tmp_image,
    cornersA,
    &corner_count,
    0.01,
    5.0,
    0,
    3,
    0,
    0.04
  );
  
  cvFindCornerSubPix(
    imgA,
    cornersA,
    corner_count,
    cvSize(win_size,win_size),
    cvSize(-1,-1),
    cvTermCriteria(CV_TERMCRIT_ITER|CV_TERMCRIT_EPS,20,0.03)
  );
  
  // Call the Lucas Kanade algorithm
  //
  char features_found[ MAX_CORNERS ];
  float feature_errors[ MAX_CORNERS ];
  
  CvSize pyr_sz = cvSize( imgA->width+8, imgB->height/3 );
  
  IplImage* pyrA = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  IplImage* pyrB = cvCreateImage( pyr_sz, IPL_DEPTH_32F, 1 );
  
  CvPoint2D32f* cornersB = new CvPoint2D32f[ MAX_CORNERS ];
  
  cvCalcOpticalFlowPyrLK(
    imgA,
    imgB,
    pyrA,
    pyrB,
    cornersA,
    cornersB,
    corner_count,
    cvSize( win_size,win_size ),
    5,
    features_found,
    feature_errors,
    cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, .3 ),
    0
  );
  
  // Now make some image of what we are looking at:
  //
  for( int i=0; i<corner_count; i++ ) {
    if( features_found[i]==0 || feature_errors[i]>550 ) {
      printf("Error is %f\n",feature_errors[i]);
      continue;
    }
    printf("Got it\n");
    CvPoint p0 = cvPoint(
      cvRound( cornersA[i].x ),
      cvRound( cornersA[i].y )
    );
    CvPoint p1 = cvPoint(
      cvRound( cornersB[i].x ),
      cvRound( cornersB[i].y )
    );
    cvLine( imgC, p0, p1, CV_RGB(255,0,0),2 );
  }
  
  cvNamedWindow("ImageA",0);
  cvNamedWindow("ImageB",0);
  cvNamedWindow("LKpyr_OpticalFlow",0);
  
  cvShowImage("ImageA",imgA);
  cvShowImage("ImageB",imgB);
  cvShowImage("LKpyr_OpticalFlow",imgC);
  
  cvWaitKey(0);
  
  return 0;
}

OpenCV contains two other optical flow techniques that are now seldom used. These routines are typically much slower than Lucas-Kanade; moreover, they do not (though they could) support matching within an image scale pyramid and so cannot track large motions. We will discuss them briefly in this section.

The method of Horn and Schunck was developed in 1981 [Horn81]. This technique was one of the first to make use of the brightness constancy assumption and to derive the basic brightness constancy equations. Horn and Schunck solved these equations by hypothesizing a smoothness constraint on the velocities $v_x$ and $v_y$. This constraint was derived by minimizing the regularized Laplacian of the optical flow velocity components:

$$\frac{\partial^2 v_x}{\partial x^2} + \frac{\partial^2 v_x}{\partial y^2} \;-\; \frac{1}{\alpha}\, I_x \left( I_x v_x + I_y v_y + I_t \right) = 0$$
$$\frac{\partial^2 v_y}{\partial x^2} + \frac{\partial^2 v_y}{\partial y^2} \;-\; \frac{1}{\alpha}\, I_y \left( I_x v_x + I_y v_y + I_t \right) = 0$$

Here α is a constant weighting coefficient known as the regularization constant. Larger values of α lead to smoother (i.e., more locally consistent) vectors of motion flow. This is a relatively simple constraint for enforcing smoothness, and its effect is to penalize regions in which the flow is changing in magnitude. As with Lucas-Kanade, the Horn-Schunck technique relies on iterations to solve the differential equations. The function that computes this is:

void cvCalcOpticalFlowHS(
    const CvArr*      imgA,
    const CvArr*      imgB,
    int               usePrevious,
    CvArr*            velx,
    CvArr*            vely,
    double            lambda,
    CvTermCriteria    criteria
);

Here imgA and imgB must be 8-bit, single-channel images. The x and y velocity results will be stored in velx and vely, which must be 32-bit, floating-point, single-channel images. The usePrevious parameter tells the algorithm to use the velx and vely velocities computed from a previous frame as the initial starting point for computing the new velocities. The parameter lambda is a weight related to the Lagrange multiplier. You are probably asking yourself: "What Lagrange multiplier?" [147] The Lagrange multiplier arises when we attempt to minimize (simultaneously) both the motion-brightness equation and the smoothness equations; it represents the relative weight given to the errors in each as we minimize.
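A minimal usage sketch follows; it assumes prev and curr are already-loaded 8-bit, single-channel frames of the same size, and the values chosen for lambda and the termination criteria are only illustrative:

// prev and curr: 8-bit, single-channel IplImage frames of the same size.
CvSize    sz   = cvGetSize( prev );
IplImage* velx = cvCreateImage( sz, IPL_DEPTH_32F, 1 );  // x component of flow
IplImage* vely = cvCreateImage( sz, IPL_DEPTH_32F, 1 );  // y component of flow

cvCalcOpticalFlowHS(
    prev,
    curr,
    0,            // usePrevious: no earlier velx/vely to start from
    velx,
    vely,
    0.1,          // lambda: smoothness-related weight (illustrative value)
    cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 100, 0.01 )
);

// ... use velx and vely, then release them ...
cvReleaseImage( &velx );
cvReleaseImage( &vely );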

You might be thinking: "What's the big deal with optical flow? Just match where pixels in one frame went to in the next frame." This is exactly what others have done. The term "block matching" is a catchall for a whole class of similar algorithms in which the image is divided into small regions called blocks [Huang95; Beauchemin95]. Blocks are typically square and contain some number of pixels. These blocks may overlap and, in practice, often do. Block-matching algorithms attempt to divide both the previous and current images into such blocks and then compute the motion of these blocks. Algorithms of this kind play an important role in many video compression algorithms as well as in optical flow for computer vision.

Because block-matching algorithms operate on aggregates of pixels, not on individual pixels, the returned "velocity images" are typically of lower resolution than the input images. This is not always the case; it depends on the severity of the overlap between the blocks. The size of the result images is given by the following formula:

$$W_{\mathrm{result}} = \left\lfloor \frac{W_{\mathrm{prev}} - W_{\mathrm{block}} + W_{\mathrm{shift}}}{W_{\mathrm{shift}}} \right\rfloor
\qquad
H_{\mathrm{result}} = \left\lfloor \frac{H_{\mathrm{prev}} - H_{\mathrm{block}} + H_{\mathrm{shift}}}{H_{\mathrm{shift}}} \right\rfloor$$

where the block and shift dimensions are those passed to the function (described next) as block_size and shift_size.

The implementation in OpenCV uses a spiral search that works out from the location of the original block (in the previous frame) and compares the candidate new blocks with the original. This comparison is a sum of absolute differences of the pixels (i.e., an L1 distance). If a good enough match is found, the search is terminated. Here's the function prototype:

void cvCalcOpticalFlowBM(
    const CvArr* prev,
    const CvArr* curr,
    CvSize       block_size,
    CvSize       shift_size,
    CvSize       max_range,
    int          use_previous,
    CvArr*       velx,
    CvArr*       vely
);

The arguments are straightforward. The prev and curr parameters are the previous and current images; both should be 8-bit, single-channel images. The block_size is the size of the block to be used, and shift_size is the step size between blocks (this parameter controls whether—and, if so, by how much—the blocks will overlap). The max_range parameter is the size of the region around a given block that will be searched for a corresponding block in the subsequent frame. If set, use_previous indicates that the values in velx and vely should be taken as starting points for the block searches. [148] Finally, velx and vely are themselves 32-bit single-channel images that will store the computed motions of the blocks. As mentioned previously, motion is computed at a block-by-block level and so the coordinates of the result images are for the blocks (i.e., aggregates of pixels), not for the individual pixels of the original image.
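Here is a short sketch of such a call; prev and curr are assumed to be 8-bit, single-channel IplImage frames, the block, shift, and search sizes are only illustrative, and the velocity-image size is computed per the formula given earlier:

// prev and curr: 8-bit, single-channel IplImage frames of the same size.
CvSize block_size = cvSize( 16, 16 );   // blocks to compare
CvSize shift_size = cvSize(  8,  8 );   // step between blocks (here, 50% overlap)
CvSize max_range  = cvSize( 16, 16 );   // how far to search around each block

// One velocity per block, so the result images are smaller than the inputs.
CvSize vel_sz = cvSize(
    ( prev->width  - block_size.width  + shift_size.width  ) / shift_size.width,
    ( prev->height - block_size.height + shift_size.height ) / shift_size.height
);
IplImage* velx = cvCreateImage( vel_sz, IPL_DEPTH_32F, 1 );
IplImage* vely = cvCreateImage( vel_sz, IPL_DEPTH_32F, 1 );
cvZero( velx );
cvZero( vely );

cvCalcOpticalFlowBM(
    prev, curr,
    block_size, shift_size, max_range,
    0,                  // use_previous: velx/vely hold no starting guesses
    velx, vely
);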



[144] Black and Anandan have created dense optical flow techniques [Black93; Black96] that are often used in movie production, where, for the sake of visual quality, the movie studio is willing to spend the time necessary to obtain detailed flow information. These techniques are slated for inclusion in later versions of OpenCV (see Chapter 14).

[145] Of course, the window could be 3-by-3, 7-by-7, or anything you choose. If the window is too large then you will end up violating the coherent motion assumption and will not be able to track well. If the window is too small, you will encounter the aperture problem again.

[146] If you are wondering why the funny size, it's because these scratch spaces need to accommodate not just the image itself but the entire pyramid.

[147] You might even be asking yourself: "What is a Lagrange multiplier?". In that case, it may be best to ignore this part of the paragraph and just set lambda equal to 1.

[148] If use_previous==0, then the search for a block will be conducted over a region of max_range distance from the location of the original block. If use_previous!=0, then the center of that search is first displaced by the velocity values already stored in velx and vely for that block.