An Android app is a state machine in which each state is called an activity. An activity has a lifecycle. For example, it can be created, paused, resumed, and finished. During a transition between activities, the paused or finished activity can send data to the created or resumed activity. An app can define many activities and transition to them in any order. It can even transition to activities defined by the Android SDK or by other apps.
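To make these ideas concrete, here is a minimal, hypothetical activity (not part of Goldgesture; ExampleActivity, OtherActivity, and the extra key exist only for illustration). It overrides a few lifecycle callbacks and, when transitioning, sends a piece of data to the next activity via an Intent:

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;

public final class ExampleActivity extends Activity {

    @Override
    protected void onCreate(final Bundle savedInstanceState) {
        // The activity is being created.
        super.onCreate(savedInstanceState);
    }

    @Override
    public void onPause() {
        // The activity is leaving the foreground; release
        // resources here.
        super.onPause();
    }

    @Override
    public void onResume() {
        // The activity is entering (or returning to) the
        // foreground.
        super.onResume();
    }

    private void openNextActivity() {
        // Transition to another activity and send it some data.
        final Intent intent = new Intent(this, OtherActivity.class);
        intent.putExtra("com.example.SCORE", 42);
        startActivity(intent);
        finish(); // Optionally, finish this activity.
    }
}

The receiving activity could read the extra in its own onCreate method via getIntent().getIntExtra("com.example.SCORE", 0).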
For more information about Android activities and their lifecycle, refer to the official documentation at http://developer.android.com/guide/components/activities.html.
For more information about OpenCV's Android and Java APIs (used throughout our activity class), refer to the official Javadocs at http://docs.opencv.org/java/. Unfortunately, at the time of writing this book, many of the comments in OpenCV's Javadocs are too closely based on the comments for C++. The class and method signatures in the Javadocs are correct, but in some cases, the comments are incorrect due to differences between the Java and C++ APIs.
OpenCV provides classes and interfaces that can be considered as add-ons to an activity's lifecycle. Specifically, we can use OpenCV callback methods to handle the following events: the OpenCV library finishing its asynchronous loading, the camera preview starting, the camera preview stopping, and the camera capturing a new frame.
Goldgesture uses just one activity called CameraActivity. As we saw earlier while implementing its layout in XML, CameraActivity uses a CameraBridgeViewBase object (more specifically, a JavaCameraView object) as its camera preview. CameraActivity implements an interface called CvCameraViewListener2, which provides callbacks for this camera preview. (Alternatively, an interface called CvCameraViewListener can serve this purpose. The difference between the two interfaces is that CvCameraViewListener2 allows us to specify a format for the captured image, whereas CvCameraViewListener does not.) The implementation of our class begins as follows:
package com.nummist.goldgesture;

// ...
// See code bundle for imports.
// ...

public final class CameraActivity extends Activity
        implements CvCameraViewListener2 {

    // A tag for log output.
    private static final String TAG = "CameraActivity";
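Before we continue with the member variables, it is worth seeing why we chose CvCameraViewListener2. The following is a simplified sketch of the two listener interfaces, paraphrasing the nested interfaces that OpenCV declares in CameraBridgeViewBase (not a verbatim copy of the library's source):

// Simplified sketch of the two listener interfaces (paraphrased).
public interface CvCameraViewListener {
    void onCameraViewStarted(int width, int height);
    void onCameraViewStopped();
    // The frame arrives as a Mat in a single, fixed format.
    Mat onCameraFrame(Mat inputFrame);
}

public interface CvCameraViewListener2 {
    void onCameraViewStarted(int width, int height);
    void onCameraViewStopped();
    // The frame arrives wrapped in a CvCameraViewFrame, so the
    // listener can request inputFrame.rgba() or inputFrame.gray().
    Mat onCameraFrame(CvCameraViewFrame inputFrame);
}

Because CameraActivity needs both the color and grayscale versions of each frame, the second interface is the better fit.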
For readability and easy editing, we will use static final variables to store many parameters for our computer vision functions. You might wish to adjust these values based on experimentation. First, we have face detection parameters that should be familiar to you from the last chapter's project:
    // Parameters for face detection.
    private static final double SCALE_FACTOR = 1.2;
    private static final int MIN_NEIGHBORS = 3;
    private static final int FLAGS = Objdetect.CASCADE_SCALE_IMAGE;
    private static final double MIN_SIZE_PROPORTIONAL = 0.25;
    private static final double MAX_SIZE_PROPORTIONAL = 1.0;
For the purpose of selecting features, we do not use the entire detected face. Rather, we use an inner portion that is less likely to contain any non-face background. Thus, we will define a proportion of the face that should be excluded from feature selection on each side:
    // The portion of the face that is excluded from feature
    // selection on each side.
    // (We want to exclude boundary regions containing background.)
    private static final double MASK_PADDING_PROPORTIONAL = 0.15;
For face tracking using optical flow, we will define a minimum and maximum number of features. If we fail to track at least the minimum number of features, we deem that the face has been lost. We will also define a minimum feature quality (relative to the quality of the best feature found), a minimum pixel distance between features, and a maximum acceptable error value while trying to match a new feature to an old feature. As we will see in a later section, these parameters pertain to OpenCV's goodFeaturesToTrack and calcOpticalFlowPyrLK functions and their return values. The declarations are given in the following code:
    // Parameters for face tracking.
    private static final int MIN_FEATURES = 10;
    private static final int MAX_FEATURES = 80;
    private static final double MIN_FEATURE_QUALITY = 0.05;
    private static final double MIN_FEATURE_DISTANCE = 4.0;
    private static final float MAX_FEATURE_ERROR = 200f;
We will also define how much movement (as a proportion of the image size) and how many back and forth cycles are required before we deem that a nod or shake has occurred:
    // Parameters for gesture detection.
    private static final double MIN_SHAKE_DIST_PROPORTIONAL = 0.04;
    private static final double MIN_NOD_DIST_PROPORTIONAL = 0.005;
    private static final double MIN_BACK_AND_FORTH_COUNT = 2;
Our member variables include the camera view, the dimensions of captured images, and the images at various stages of processing. The images are stored in OpenCV Mat objects, which are analogous to the NumPy arrays that we have seen in the Python bindings. OpenCV always captures images in landscape format, but we reorient them to portrait format, which is a more usual orientation for a picture of one's own face on a smartphone. Here are the relevant variable declarations:
    // The camera view.
    private CameraBridgeViewBase mCameraView;

    // The dimensions of the image before orientation.
    private double mImageWidth;
    private double mImageHeight;

    // The current gray image before orientation.
    private Mat mGrayUnoriented;

    // The current and previous equalized gray images.
    private Mat mEqualizedGray;
    private Mat mLastEqualizedGray;
As seen in the following code and comments, we also declare several member variables related to face detection and tracking:
    // The mask, in which the face region is white and the
    // background is black.
    private Mat mMask;
    private Scalar mMaskForegroundColor;
    private Scalar mMaskBackgroundColor;

    // The face detector, more detection parameters, and
    // detected faces.
    private CascadeClassifier mFaceDetector;
    private Size mMinSize;
    private Size mMaxSize;
    private MatOfRect mFaces;

    // The initial features before tracking.
    private MatOfPoint mInitialFeatures;

    // The current and previous features being tracked.
    private MatOfPoint2f mFeatures;
    private MatOfPoint2f mLastFeatures;

    // The status codes and errors for the tracking.
    private MatOfByte mFeatureStatuses;
    private MatOfFloat mFeatureErrors;

    // Whether a face was being tracked last frame.
    private boolean mWasTrackingFace;

    // Colors for drawing.
    private Scalar mFaceRectColor;
    private Scalar mFeatureColor;
We store instances of the classes that we defined earlier, namely BackAndForthGesture and YesNoAudioTree:
    // Gesture detectors.
    private BackAndForthGesture mNodHeadGesture;
    private BackAndForthGesture mShakeHeadGesture;

    // The audio tree for the 20 questions game.
    private YesNoAudioTree mAudioTree;
Our last member variable is an instance of an OpenCV class called BaseLoaderCallback. It is responsible for loading the OpenCV library. We will initialize an anonymous (inline) subclass with a custom callback method that enables the camera preview (provided that the library loaded successfully). The code for its implementation is as follows:
    // The OpenCV loader callback.
    private BaseLoaderCallback mLoaderCallback =
            new BaseLoaderCallback(this) {
        @Override
        public void onManagerConnected(final int status) {
            switch (status) {
                case LoaderCallbackInterface.SUCCESS:
                    Log.d(TAG, "OpenCV loaded successfully");
                    mCameraView.enableView();
                    break;
                default:
                    super.onManagerConnected(status);
                    break;
            }
        }
    };
Now, let's implement the standard lifecycle callbacks of an Android activity. First, when the activity is created, we will specify that we want to keep the screen on even when there is no touch interaction (since all interaction is via the camera). Moreover, we need to load the layout from the XML file, get a reference to the camera preview, and set this activity as the handler for the camera preview's events. This implementation is given in the following code:
    @Override
    protected void onCreate(final Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        final Window window = getWindow();
        window.addFlags(
                WindowManager.LayoutParams.FLAG_KEEP_SCREEN_ON);

        setContentView(R.layout.activity_camera);

        mCameraView = (CameraBridgeViewBase)
                findViewById(R.id.camera_view);
        //mCameraView.enableFpsMeter();
        mCameraView.setCvCameraViewListener(this);
    }
Note that we have not yet initialized most of our member variables. Instead, we do so once the camera preview has started.
When the activity is paused, we will disable the camera preview, stop the audio, and reset the gesture recognition data, as seen in the following code:
    @Override
    public void onPause() {
        if (mCameraView != null) {
            mCameraView.disableView();
        }
        if (mAudioTree != null) {
            mAudioTree.stop();
        }
        resetGestures();
        super.onPause();
    }
When the activity resumes (including the first time it comes to the foreground after being created), we will start loading the OpenCV library:
    @Override
    public void onResume() {
        super.onResume();
        OpenCVLoader.initAsync(OpenCVLoader.OPENCV_VERSION_2_4_5,
                this, mLoaderCallback);
    }
When the activity is destroyed, we will clean things up in the same way as when the activity is paused:
    @Override
    public void onDestroy() {
        super.onDestroy();
        if (mCameraView != null) {
            mCameraView.disableView();
        }
        if (mAudioTree != null) {
            mAudioTree.stop();
        }
        resetGestures();
    }
Now, let's turn our attention to the camera callbacks. When the camera preview starts (after the OpenCV library is loaded), we will initialize our remaining member variables. Many of these variables cannot be initialized earlier because they are OpenCV types and their constructors are implemented in the library that we just loaded. To begin, we will store the pixel dimensions that the camera is using:
    @Override
    public void onCameraViewStarted(final int width,
            final int height) {

        mImageWidth = width;
        mImageHeight = height;
Next, we will initialize our face detection variables, mostly via a helper method called initFaceDetector. The role of initFaceDetector includes loading the detector's cascade file, res/raw/lbpcascade_frontalface.xml. A lot of boilerplate code for file handling and error handling is involved in this task, so separating it into another function improves readability. We will examine the helper function's implementation later, but here is the call:
        initFaceDetector();
        mFaces = new MatOfRect();
As we did in the last chapter's project, we will determine the smaller of the two image dimensions and use it in proportional size calculations:
        final int smallerSide;
        if (height < width) {
            smallerSide = height;
        } else {
            smallerSide = width;
        }

        final double minSizeSide =
                MIN_SIZE_PROPORTIONAL * smallerSide;
        mMinSize = new Size(minSizeSide, minSizeSide);
        final double maxSizeSide =
                MAX_SIZE_PROPORTIONAL * smallerSide;
        mMaxSize = new Size(maxSizeSide, maxSizeSide);
We will initialize the matrices related to the features:
        mInitialFeatures = new MatOfPoint();
        mFeatures = new MatOfPoint2f(new Point());
        mLastFeatures = new MatOfPoint2f(new Point());
        mFeatureStatuses = new MatOfByte();
        mFeatureErrors = new MatOfFloat();
We will specify colors (in RGB format, not BGR) for drawing a rectangle around the face and circles around the features:
        mFaceRectColor = new Scalar(0.0, 0.0, 255.0);
        mFeatureColor = new Scalar(0.0, 255.0, 0.0);
We will initialize variables related to nod and shake recognition:
        final double minShakeDist =
                smallerSide * MIN_SHAKE_DIST_PROPORTIONAL;
        mShakeHeadGesture = new BackAndForthGesture(minShakeDist);

        final double minNodDist =
                smallerSide * MIN_NOD_DIST_PROPORTIONAL;
        mNodHeadGesture = new BackAndForthGesture(minNodDist);
We will initialize and start the audio sequence:
        mAudioTree = new YesNoAudioTree(this);
        mAudioTree.start();
Finally, we will initialize the image matrices, most of which are transposed to be in portrait format:
        mGrayUnoriented = new Mat(height, width, CvType.CV_8UC1);

        // The rest of the matrices are transposed.
        mEqualizedGray = new Mat(width, height, CvType.CV_8UC1);
        mLastEqualizedGray = new Mat(width, height, CvType.CV_8UC1);
        mMask = new Mat(width, height, CvType.CV_8UC1);
        mMaskForegroundColor = new Scalar(255.0);
        mMaskBackgroundColor = new Scalar(0.0);
    }
When the camera view stops, we do not do anything. Here is the empty callback method:
    @Override
    public void onCameraViewStopped() {
    }
When the camera captures a frame, we do all the real computer vision work. We will start by getting the color image (in RGBA format, not BGR), converting it to grayscale, reorienting it to portrait format, and equalizing it. Thus, the callback's implementation begins as follows:
    @Override
    public Mat onCameraFrame(final CvCameraViewFrame inputFrame) {

        final Mat rgba = inputFrame.rgba();

        // For processing, orient the image to portrait and
        // equalize it.
        Imgproc.cvtColor(rgba, mGrayUnoriented,
                Imgproc.COLOR_RGBA2GRAY);
        Core.transpose(mGrayUnoriented, mEqualizedGray);
        Core.flip(mEqualizedGray, mEqualizedGray, -1);
        Imgproc.equalizeHist(mEqualizedGray, mEqualizedGray);
Note that we get the RGBA image by calling inputFrame.rgba() and then we convert that image to grayscale. Alternatively, we could get the grayscale image directly by calling inputFrame.gray(). In our case, we want both the RGBA and grayscale images because we use the RGBA image for display and the grayscale image for detection and tracking.
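If we only needed the grayscale data, a minimal variant (not used in Goldgesture, and shown here only as a sketch) could skip the color conversion entirely:

        // Hypothetical variant: take the grayscale frame directly
        // and skip the cvtColor call. (Goldgesture does not do
        // this, because it still needs the RGBA image for display.)
        final Mat gray = inputFrame.gray();
        Core.transpose(gray, mEqualizedGray);
        Core.flip(mEqualizedGray, mEqualizedGray, -1);
        Imgproc.equalizeHist(mEqualizedGray, mEqualizedGray);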
Next, we will declare a list of features. A standard Java List is resizable, whereas an OpenCV Mat is not, so we will need a List when we filter out features that did not track well. The declaration is as follows:
        final List<Point> featuresList;
We will detect faces—a familiar task from the last chapter. Unlike in OpenCV's Python bindings, the structure to store the face rectangles is provided as an argument:
        mFaceDetector.detectMultiScale(
                mEqualizedGray, mFaces, SCALE_FACTOR, MIN_NEIGHBORS,
                FLAGS, mMinSize, mMaxSize);
When at least one face is detected, we take the first detected face and draw a rectangle around it. We are detecting a face in portrait orientation but drawing to the original image in landscape orientation, so some conversion of coordinates is necessary. The code for this is as follows:
        if (mFaces.rows() > 0) {

            // Get the first detected face.
            final double[] face = mFaces.get(0, 0);
            double minX = face[0];
            double minY = face[1];
            double width = face[2];
            double height = face[3];
            double maxX = minX + width;
            double maxY = minY + height;

            // Draw the face.
            Core.rectangle(
                    rgba,
                    new Point(mImageWidth - minY,
                            mImageHeight - minX),
                    new Point(mImageWidth - maxY,
                            mImageHeight - maxX),
                    mFaceRectColor);
Next, we will select the features within the inner part of the detected face. We will specify the region of interest by passing a mask to OpenCV's goodFeaturesToTrack function. A mask is an image that is white in the foreground (the inner part of the face) and black in the background. The following code finds the region of interest, creates the mask, and calls goodFeaturesToTrack with all relevant parameters:
            // Create a mask for the face region.
            double smallerSide;
            if (height < width) {
                smallerSide = height;
            } else {
                smallerSide = width;
            }
            double maskPadding =
                    smallerSide * MASK_PADDING_PROPORTIONAL;
            mMask.setTo(mMaskBackgroundColor);
            Core.rectangle(
                    mMask,
                    new Point(minX + maskPadding,
                            minY + maskPadding),
                    new Point(maxX - maskPadding,
                            maxY - maskPadding),
                    mMaskForegroundColor, -1);

            // Find features in the face region.
            Imgproc.goodFeaturesToTrack(
                    mEqualizedGray, mInitialFeatures, MAX_FEATURES,
                    MIN_FEATURE_QUALITY, MIN_FEATURE_DISTANCE,
                    mMask, 3, false, 0.04);
            mFeatures.fromArray(mInitialFeatures.toArray());
            featuresList = mFeatures.toList();
Note that we will copy the features into several variables: a matrix of initial features, a matrix of current features, and a mutable list of features that we will filter later.
Depending on whether we were already tracking a face, we will call a helper function to either initialize our data on gestures or update our data on gestures. We will also record that we are now tracking a face:
            if (mWasTrackingFace) {
                updateGestureDetection();
            } else {
                startGestureDetection();
            }
            mWasTrackingFace = true;
Alternatively, we might not have detected any face in this frame. Then, we will update any previously selected features using OpenCV's calcOpticalFlowPyrLK function, which gives us a matrix of new features, a matrix of status values (0 for an invalid feature, 1 for a valid feature), and a matrix of error values. Being invalid typically means that the new feature is estimated to be outside the frame and thus can no longer be tracked by optical flow. We will convert the new features to a list and filter out the ones that are invalid or have a high error value, as seen in the following code:
        } else {

            Video.calcOpticalFlowPyrLK(
                    mLastEqualizedGray, mEqualizedGray,
                    mLastFeatures, mFeatures, mFeatureStatuses,
                    mFeatureErrors);

            // Filter out features that are not found or have high
            // error.
            featuresList =
                    new LinkedList<Point>(mFeatures.toList());
            final LinkedList<Byte> featureStatusesList =
                    new LinkedList<Byte>(mFeatureStatuses.toList());
            final LinkedList<Float> featureErrorsList =
                    new LinkedList<Float>(mFeatureErrors.toList());
            for (int i = 0; i < featuresList.size();) {
                if (featureStatusesList.get(i) == 0 ||
                        featureErrorsList.get(i) >
                                MAX_FEATURE_ERROR) {
                    featuresList.remove(i);
                    featureStatusesList.remove(i);
                    featureErrorsList.remove(i);
                } else {
                    i++;
                }
            }
If too few features remain after filtering, we will deem that the face is no longer tracked and we will discard all features. Otherwise, we will put the accepted features back in the matrix of current features and update our data on gestures:
            if (featuresList.size() < MIN_FEATURES) {
                // The number of remaining features is too low; we
                // have probably lost the target completely.

                // Discard the remaining features.
                featuresList.clear();
                mFeatures.fromList(featuresList);

                mWasTrackingFace = false;
            } else {
                mFeatures.fromList(featuresList);
                updateGestureDetection();
            }
        }
We will draw green circles around the current features. Again, we must convert coordinates from portrait format back to landscape format in order to draw on the original image:
        // Draw the current features.
        for (int i = 0; i < featuresList.size(); i++) {
            final Point p = featuresList.get(i);
            final Point pTrans = new Point(
                    mImageWidth - p.y,
                    mImageHeight - p.x);
            Core.circle(rgba, pTrans, 8, mFeatureColor);
        }
At the end of the frame, the current equalized gray image and current features become the previous equalized gray image and previous features. Rather than copying these matrices, we swap the references:
        // Swap the references to the current and previous images.
        final Mat swapEqualizedGray = mLastEqualizedGray;
        mLastEqualizedGray = mEqualizedGray;
        mEqualizedGray = swapEqualizedGray;

        // Swap the references to the current and previous
        // features.
        final MatOfPoint2f swapFeatures = mLastFeatures;
        mLastFeatures = mFeatures;
        mFeatures = swapFeatures;
We will horizontally flip the preview image to make it look like a mirror. Then, we will return it so that OpenCV can display it:
        // Mirror (horizontally flip) the preview.
        Core.flip(rgba, rgba, 1);

        return rgba;
    }
We have mentioned several helper functions, which we will examine now. When we start analyzing face motion, we will find the mean (centroid) of the features and use the mean's x and y coordinates, respectively, as the starting coordinates for shake and nod gestures, as done in the following code:
    private void startGestureDetection() {

        double[] featuresCenter = Core.mean(mFeatures).val;

        // Motion in x may indicate a shake of the head.
        mShakeHeadGesture.start(featuresCenter[0]);

        // Motion in y may indicate a nod of the head.
        mNodHeadGesture.start(featuresCenter[1]);
    }
Similarly, as we continue to analyze face motion, we will find the features' new mean and use its coordinates to update the shake and nod data. Based on the number of back-and-forth shaking or nodding motions, we can take a "yes" branch or a "no" branch in the question-and-answer tree. Alternatively, we can decide that the user's current gesture is ambiguous (it registers as both "yes" and "no"), in which case we will reset the data as shown in the following code:
    private void updateGestureDetection() {

        final double[] featuresCenter = Core.mean(mFeatures).val;

        // Motion in x may indicate a shake of the head.
        mShakeHeadGesture.update(featuresCenter[0]);
        final int shakeBackAndForthCount =
                mShakeHeadGesture.getBackAndForthCount();
        //Log.i(TAG, "shakeBackAndForthCount=" +
        //        shakeBackAndForthCount);
        final boolean shakingHead =
                (shakeBackAndForthCount >=
                        MIN_BACK_AND_FORTH_COUNT);

        // Motion in y may indicate a nod of the head.
        mNodHeadGesture.update(featuresCenter[1]);
        final int nodBackAndForthCount =
                mNodHeadGesture.getBackAndForthCount();
        //Log.i(TAG, "nodBackAndForthCount=" +
        //        nodBackAndForthCount);
        final boolean noddingHead =
                (nodBackAndForthCount >= MIN_BACK_AND_FORTH_COUNT);

        if (shakingHead && noddingHead) {
            // The gesture is ambiguous. Ignore it.
            resetGestures();
        } else if (shakingHead) {
            mAudioTree.takeNoBranch();
            resetGestures();
        } else if (noddingHead) {
            mAudioTree.takeYesBranch();
            resetGestures();
        }
    }
We will always reset the nod gesture data and the shake gesture data at the same time:
    private void resetGestures() {
        if (mNodHeadGesture != null) {
            mNodHeadGesture.resetCounts();
        }
        if (mShakeHeadGesture != null) {
            mShakeHeadGesture.resetCounts();
        }
    }
Our helper method to initialize the face detector is very similar to the method found in an official OpenCV sample project that performs face detection on Android. We will copy the cascade's raw data from the app bundle to a new file that is more accessible. Then, we will initialize a CascadeClassifier object using this file's path. If an error is encountered at any point, we will log it and close the app. Here is the method's implementation:
    private void initFaceDetector() {
        try {
            // Load cascade file from application resources.
            InputStream is = getResources().openRawResource(
                    R.raw.lbpcascade_frontalface);
            File cascadeDir = getDir(
                    "cascade", Context.MODE_PRIVATE);
            File cascadeFile = new File(
                    cascadeDir, "lbpcascade_frontalface.xml");
            FileOutputStream os = new FileOutputStream(cascadeFile);

            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                os.write(buffer, 0, bytesRead);
            }
            is.close();
            os.close();

            mFaceDetector = new CascadeClassifier(
                    cascadeFile.getAbsolutePath());
            if (mFaceDetector.empty()) {
                Log.e(TAG, "Failed to load cascade");
                finish();
            } else {
                Log.i(TAG, "Loaded cascade from " +
                        cascadeFile.getAbsolutePath());
            }
            cascadeDir.delete();
        } catch (IOException e) {
            e.printStackTrace();
            Log.e(TAG, "Failed to load cascade. Exception thrown: "
                    + e);
            finish();
        }
    }
}
That's all the code! We are ready to test. Make sure your Android device has its sound turned on. Plug the device into a USB port and press the Run button (the play icon in green). The first time you run the project, you might see the Run As window, as shown in the following screenshot:
If you see this window, select Android Application and click on the OK button. Then, you might see another window, Android Device Chooser, as shown in the following screenshot:
If you see this window, select your device and click on the OK button.
Soon, you should see the app's camera preview appear on your device as shown in the following screenshot. Nod or shake your head knowingly as the questions are asked.