Motion Templates

Motion templates were invented at the MIT Media Lab by Bobick and Davis [Bobick96; Davis97] and were further developed jointly with one of the authors [Davis99; Bradski00]. This more recent work forms the basis for the implementation in OpenCV.

Motion templates are an effective way to track general movement and are especially applicable to gesture recognition. Using motion templates requires a silhouette (or part of a silhouette) of an object. Object silhouettes can be obtained in a number of ways.

  1. The simplest method of obtaining object silhouettes is to use a reasonably stationary camera and then employ frame-to-frame differencing (as discussed in Chapter 9). This will give you the moving edges of objects, which is enough to make motion templates work.

  2. You can use chroma keying. For example, if you have a known background color such as bright green, you can simply take as foreground anything that is not bright green (a minimal sketch of this appears after this list).

  3. Another way (also discussed in Chapter 9) is to learn a background model from which you can isolate new foreground objects/people as silhouettes.

  4. You can use active silhouetting techniques—for example, creating a wall of near-infrared light and having a near-infrared-sensitive camera look at the wall. Any intervening object will show up as a silhouette.

  5. You can use thermal imagers; then any hot object (such as a face) can be taken as foreground.

  6. Finally, you can generate silhouettes by using the segmentation techniques (e.g., pyramid segmentation or mean-shift segmentation) described in Chapter 9.
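
As a concrete illustration of the chroma-keying option, here is a minimal sketch using the C API. The function name, image names, and the particular BGR range taken to mean "bright green" are assumptions made for this example, not values from the original text.

#include <cv.h>

// Sketch: build a silhouette by chroma keying against a bright-green
// backdrop. 'frame' is a BGR input image; 'silhouette' is a preallocated
// single-channel 8-bit image of the same size. The green range below is
// an assumption; tune it to your actual backdrop.
void chromaKeySilhouette( const IplImage* frame, IplImage* silhouette ) {
   // Pixels whose BGR values fall inside this range count as background.
   cvInRangeS( frame,
               cvScalar(   0, 180,   0, 0 ),   // lower BGR bound (assumed)
               cvScalar( 120, 256, 120, 0 ),   // upper BGR bound (assumed)
               silhouette );
   // Invert: foreground is anything that is not bright green.
   cvNot( silhouette, silhouette );
}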

For now, assume that we have a good, segmented object silhouette as represented by the white rectangle of Figure 10-13(A). Here we use white to indicate that all the pixels are set to the floating-point value of the most recent system time stamp. As the rectangle moves, new silhouettes are captured and overlaid with the (new) current time stamp; the new silhouette is the white rectangle of Figure 10-13(B) and Figure 10-13(C). Older motions are shown in Figure 10-13 as successively darker rectangles. These sequentially fading silhouettes record the history of previous movement and thus are referred to as the "motion history image".

Figure 10-13. Motion template diagram: (A) a segmented object at the current time stamp (white); (B) at the next time step, the object moves and is marked with the (new) current time stamp, leaving the older segmentation boundary behind; (C) at the next time step, the object moves further, leaving older segmentations as successively darker rectangles whose sequence of encoded motion yields the motion history image

Silhouettes whose time stamp is more than a specified duration older than the current system time stamp are set to 0, as shown in Figure 10-14. The OpenCV function that accomplishes this motion template construction is cvUpdateMotionHistory():

void cvUpdateMotionHistory(
   const CvArr* silhouette,
   CvArr*       mhi,
   double       timestamp,
   double       duration
);

Figure 10-14. Motion template silhouettes for two moving objects (left); silhouettes older than a specified duration are set to 0 (right)

In cvUpdateMotionHistory(), all image arrays are single-channel. The silhouette image is a byte image in which nonzero pixels represent the most recent segmentation silhouette of the foreground object. The mhi image is a floating-point image that represents the motion template (also known as the motion history image). Here timestamp is the current system time (typically a millisecond count) and duration, as just described, determines how long motion history pixels are allowed to remain in the mhi. In other words, any mhi pixels whose value is older (i.e., less) than timestamp minus duration are set to 0.
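
Putting this function into context, here is a hedged sketch of the per-frame update. The names, the use of clock() for time stamps, and the one-second duration are assumptions for illustration only.

#include <cv.h>
#include <time.h>

// Sketch: per-frame motion-history update. 'silh' is the current
// single-channel 8-bit silhouette; 'mhi' is a single-channel 32-bit
// floating-point image of the same size, created once with
// cvCreateImage( size, IPL_DEPTH_32F, 1 ) and cleared with cvZero().
void updateHistory( IplImage* silh, IplImage* mhi ) {
   const double MHI_DURATION = 1.0;   // seconds of history to keep (assumed)
   // Any consistent time unit works, provided timestamp and duration
   // use the same one; here we use seconds.
   double timestamp = (double)clock() / CLOCKS_PER_SEC;
   cvUpdateMotionHistory( silh, mhi, timestamp, MHI_DURATION );
}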

Once the motion template has a collection of object silhouettes overlaid in time, we can derive an indication of overall motion by taking the gradient of the mhi image. When we take these gradients (e.g., by using the Scharr or Sobel gradient functions discussed in Chapter 6), some gradients will be large and invalid. Gradients are invalid when older or inactive parts of the mhi image are set to 0, which produces artificially large gradients around the outer edges of the silhouettes; see Figure 10-15(A). Because we know the time-step duration with which we've been introducing new silhouettes into the mhi via cvUpdateMotionHistory(), we know how large our gradients (which are just dx and dy step derivatives) should be. We can therefore use the gradient magnitude to eliminate gradients that are too large, as in Figure 10-15(B). Finally, we can collect a measure of global motion; see Figure 10-15(C). The function that effects parts (A) and (B) of the figure is cvCalcMotionGradient():

void cvCalcMotionGradient(
   const CvArr* mhi,
   CvArr* mask,
   CvArr* orientation,
   double delta1,
   double delta2,
   int aperture_size=3
);

Figure 10-15. Motion gradients of the mhi image: (A) gradient magnitudes and directions; (B) large gradients are eliminated; (C) overall direction of motion is found

In cvCalcMotionGradient(), all image arrays are single-channel. The function input mhi is a floating-point motion history image, and the input variables delta1 and delta2 are (respectively) the minimal and maximal gradient magnitudes allowed. Here, the expected gradient magnitude is just the average number of milliseconds between silhouette time stamps in successive calls to cvUpdateMotionHistory(); setting delta1 halfway below and delta2 halfway above this average value should work well. The variable aperture_size sets the width and height of the gradient operator. It can be set to -1 (the 3-by-3 CV_SCHARR gradient filter), 3 (the default 3-by-3 Sobel filter), 5 (the 5-by-5 Sobel filter), or 7 (the 7-by-7 Sobel filter). The function outputs are mask, a single-channel 8-bit image in which nonzero entries indicate where valid gradients were found, and orientation, a floating-point image that gives the gradient direction's angle at each point.
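
For instance, if new silhouettes arrive roughly every 33 ms (a 30 fps assumption made up for this sketch, with mhi time stamps kept in milliseconds), the deltas might be chosen as follows:

#include <cv.h>

// Sketch: compute valid motion gradients from the mhi. 'mask' is a
// single-channel 8-bit image and 'orient' a single-channel 32-bit
// floating-point image, both the same size as 'mhi'.
void motionGradients( IplImage* mhi, IplImage* mask, IplImage* orient ) {
   const double avg_step = 33.0;       // avg ms between silhouettes (assumed)
   cvCalcMotionGradient( mhi, mask, orient,
                         avg_step * 0.5,   // delta1: halfway below average
                         avg_step * 1.5,   // delta2: halfway above average
                         3 );              // 3-by-3 Sobel aperture
}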

The function cvCalcGlobalOrientation() finds the overall direction of motion as the vector sum of the valid gradient directions.

double cvCalcGlobalOrientation(
   const CvArr* orientation,
   const CvArr* mask,
   const CvArr* mhi,
   double       timestamp,
   double       duration
);

When using cvCalcGlobalOrientation(), we pass in the orientation and mask image computed in cvCalcMotionGradient() along with the timestamp, duration, and resulting mhi from cvUpdateMotionHistory(); what's returned is the vector-sum global orientation, as in Figure 10-15(C). The timestamp together with duration tells the routine how much motion to consider from the mhi and motion orientation images. One could compute the global motion from the center of mass of each of the mhi silhouettes, but summing up the precomputed motion vectors is much faster.
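
A minimal sketch of that call, reusing the (assumed) variable names from the sketches above; timestamp and duration must match the values passed to cvUpdateMotionHistory():

#include <cv.h>

// Sketch: overall direction of motion, in degrees.
double globalAngle( IplImage* orient, IplImage* mask, IplImage* mhi,
                    double timestamp, double duration ) {
   double angle = cvCalcGlobalOrientation( orient, mask, mhi,
                                           timestamp, duration );
   // 'angle' is the vector-sum direction of all valid motion gradients.
   return angle;
}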

We can also isolate regions of the motion template mhi image and determine the local motion within that region, as shown in Figure 10-16. In the figure, the mhi image is scanned for current silhouette regions. When a region marked with the most current time stamp is found, the region's perimeter is searched for sufficiently recent motion (recent silhouettes) just outside its perimeter. When such motion is found, a downward-stepping flood fill is performed to isolate the local region of motion that "spilled off" the current location of the object of interest. Once found, we can calculate local motion gradient direction in the spill-off region, then remove that region, and repeat the process until all regions are found (as diagrammed in Figure 10-16).

Figure 10-16. Segmenting local regions of motion in the mhi image: (A) scan the mhi image for current silhouettes (a) and, when found, go around the perimeter looking for other recent silhouettes (b); when a recent silhouette is found, perform downward-stepping flood fills (c) to isolate local motion; (B) use the gradients found within the isolated local motion region to compute local motion; (C) remove the previously found region and search for the next current silhouette region (d), scan along it (e), and perform downward-stepping flood fill on it (f); (D) compute motion within the newly isolated region and continue the process (A)-(C) until no current silhouette remains

The function that isolates and computes local motion is cvSegmentMotion():

CvSeq* cvSegmentMotion(
   const CvArr*  mhi,
   CvArr*        seg_mask,
   CvMemStorage* storage,
   double        timestamp,
   double        seg_thresh
);

In cvSegmentMotion(), the mhi is the single-channel floating-point input. We also pass in storage, a CvMemStorage structure allocated via cvCreateMemStorage(). Another input is timestamp, the value of the most current silhouettes in the mhi from which you want to segment local motions. Finally, you must pass in seg_thresh, which is the maximum downward step (from current time to previous motion) that you'll accept as attached motion. This parameter is provided because there might be overlapping silhouettes from recent and much older motion that you don't want to connect together.

It's generally best to set seg_thresh to something like 1.5 times the average difference in silhouette time stamps. This function returns a CvSeq of CvConnectedComp structures, one for each separate motion found, which delineate the local motion regions; it also returns seg_mask, a single-channel, floating-point image in which each region of isolated motion is marked with a distinct nonzero number (a zero pixel in seg_mask indicates no motion). To compute these local motions one at a time, we call cvCalcGlobalOrientation(), using the appropriate mask region selected from the corresponding CvConnectedComp or from a particular value in the seg_mask; for example,

cvCmpS(
   seg_mask,
   [value_wanted_in_seg_mask],
   [your_destination_mask],
   CV_CMP_EQ
);
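
Putting the pieces together, here is a hedged sketch of looping over the returned components and computing each local orientation via an exact mask. The helper name is hypothetical, and we assume that each CvConnectedComp's value field holds the label that cvSegmentMotion() wrote into seg_mask for that component; verify this against your OpenCV version.

#include <cv.h>

// Sketch: per-component local orientation. 'dst_mask' is a preallocated
// single-channel 8-bit scratch image the same size as 'seg_mask'.
void localOrientations( CvSeq* seq, IplImage* seg_mask, IplImage* dst_mask,
                        IplImage* orient, IplImage* mhi,
                        double timestamp, double duration ) {
   int i;
   for( i = 0; i < seq->total; i++ ) {
      CvConnectedComp* comp = (CvConnectedComp*)cvGetSeqElem( seq, i );
      double angle;
      // Select only the pixels labeled with this component's value
      // (assumed to be stored in comp->value.val[0]).
      cvCmpS( seg_mask, comp->value.val[0], dst_mask, CV_CMP_EQ );
      angle = cvCalcGlobalOrientation( orient, dst_mask, mhi,
                                       timestamp, duration );
      // ...use angle together with comp->rect for this local motion...
   }
}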

Given the discussion so far, you should now be able to understand the motempl.c example that ships with OpenCV in the …/opencv/samples/c/ directory. We will now extract and explain some key points of its update_mhi() function, which extracts templates by thresholding frame differences and then passes the resulting silhouette to cvUpdateMotionHistory():

...
cvAbsDiff( buf[idx1], buf[idx2], silh );
cvThreshold( silh, silh, diff_threshold, 1, CV_THRESH_BINARY );
cvUpdateMotionHistory( silh, mhi, timestamp, MHI_DURATION );
...

The gradients of the resulting mhi image are then taken, and a mask of valid gradients is produced using cvCalcMotionGradient(). Then CvMemStorage is allocated (or, if it already exists, cleared), and the resulting local motions are segmented into CvConnectedComp structures stored in the CvSeq seq:

...
cvCalcMotionGradient(
  mhi,
  mask,
  orient,
  MAX_TIME_DELTA,
  MIN_TIME_DELTA,
  3
);

if( !storage )
  storage = cvCreateMemStorage(0);
else
  cvClearMemStorage(storage);

seq = cvSegmentMotion(
  mhi,
  segmask,
  storage,
  timestamp,
  MAX_TIME_DELTA
);

A "for" loop then iterates through the seq->total CvConnectedComp structures extracting bounding rectangles for each motion. The iteration starts at -1, which has been designated as a special case for finding the global motion of the whole image. For the local motion segments, small segmentation areas are first rejected and then the orientation is calculated using cvCalcGlobalOrientation(). Instead of using exact masks, this routine restricts motion calculations to regions of interest (ROIs) that bound the local motions; it then calculates where valid motion within the local ROIs was actually found. Any such motion area that is too small is rejected. Finally, the routine draws the motion. Examples of the output for a person flapping their arms is shown in Figure 10-17, where the output is drawn above the raw image for four sequential frames going across in two rows. (For the full code, see …/opencv/samples/c/motempl.c.) In the same sequence, "Y" postures were recognized by the shape descriptors (Hu moments) discussed in Chapter 8, although the shape recognition is not included in the samples code.

for( i = -1; i < seq->total; i++ ) {
    if( i < 0 ) { // case of the whole image
        ...[does the whole image]...
    }
    else { // i-th motion component
        comp_rect = ((CvConnectedComp*)cvGetSeqElem( seq, i ))->rect;
        ...[reject very small components]...
    }
    ...[set component ROI regions]...
    angle  = cvCalcGlobalOrientation( orient, mask, mhi,
                                      timestamp, MHI_DURATION);
    ...[find regions of valid motion]...
    ...[reset ROI regions]...
    ...[skip small valid motion regions]...
    ...[draw the motions]...
}

Figure 10-17. Results of motion template routine: going across and top to bottom, a person moving and the resulting global motions indicated in large octagons and local motions indicated in small octagons; also, the "Y" pose can be recognized via shape descriptors (Hu moments)