Mahalanobis Distance

The Mahalanobis distance is a distance measure that accounts for the covariance or "stretch" of the space in which the data lies. If you know what a Z-score is, then you can think of the Mahalanobis distance as a multidimensional analogue of the Z-score. Figure 13-4(a) shows three sets of data in their original space, where the vertically separated sets look closer together. When we normalize the space by the covariance of the data, we see in Figure 13-4(b) that the horizontally separated data sets are actually closer together. This sort of thing occurs frequently; for instance, if we are comparing people's height in meters with their age in days, we'd see very little variance in height relative to the large variance in age. By normalizing for the variance we can obtain a more realistic comparison of variables. Some classifiers, such as K-nearest neighbors, deal poorly with large differences in variance, whereas other algorithms (such as decision trees) don't mind them.

We can already get a hint of what the Mahalanobis distance must be by looking at Figure 13-4;[239] we must somehow divide out the covariance of the data while measuring distance. First, let us review what covariance is. Given a list X of N data points, where each data point may be of dimension (vector length) K with mean vector μ_x (consisting of the individual means μ_1, …, μ_K), the covariance is a K-by-K matrix given by:

$$\Sigma_x = E\big[(\mathbf{x} - \mu_x)(\mathbf{x} - \mu_x)^T\big]$$

where E[·] is the expectation operator. OpenCV makes computing the covariance matrix easy, using

void cvCalcCovarMatrix(
  const CvArr** vects,
  int           count,
  CvArr*        cov_mat,
  CvArr*        avg,
  int           flags
);

This function is a little bit tricky. Note that vects is a pointer to a pointer of CvArr. This suggests that we have vects[0] through vects[count-1], but the actual interpretation depends on the flags settings, as described in what follows. Basically, there are two cases.

  1. vects is a 1D array of pointers to 1D vectors or 2D matrices (the two dimensions are to accommodate images). That is, each vects[i] can point to a 1D or a 2D vector; this is the case if neither CV_COVAR_ROWS nor CV_COVAR_COLS is set. The accumulated covariance is divided by the number of data points, given by count, if CV_COVAR_SCALE is set.

  2. Often there is only one input array, so only vects[0] is used; this is the case if either CV_COVAR_ROWS or CV_COVAR_COLS is set. If one of these is set, then scaling by the value given by count is ignored in favor of the number of data vectors actually contained in vects[0]. All the data points are then in:

     a. the rows of vects[0] if CV_COVAR_ROWS is set; or

     b. the columns of vects[0] if CV_COVAR_COLS is set.

The arrays in vects can be of type CV_8UC1, CV_16UC1, CV_32FC1, or CV_64FC1. In any case, vects contains a list of K-dimensional data points. To reiterate: count is the number of vectors in vects[] for case 1 (CV_COVAR_ROWS and CV_COVAR_COLS not set); for cases 2a and 2b (CV_COVAR_ROWS or CV_COVAR_COLS set), count is ignored and the actual number of vectors in vects[0] is used instead. The resulting K-by-K covariance matrix is returned in cov_mat, which can be of type CV_32FC1 or CV_64FC1. Whether or not the vector avg is used depends on the settings of flags (see the flag descriptions that follow). If avg is used, then it has the same type as vects and contains the K per-feature averages across vects. The parameter flags can take many combinations of settings formed by OR-ing values together (for more complicated applications, refer to the …/opencv/docs/ref/opencvref_cxcore.htm documentation). In general, you will set flags to one of the following.

CV_COVAR_NORMAL

Do the regular type of covariance calculation as in the previously displayed equation. The result is averaged only if CV_COVAR_SCALE is also set; the divisor is then count in case 1, or the number of data vectors in vects[0] in case 2.

CV_COVAR_SCALE

Scale (divide) the computed covariance matrix by the number of data points.

CV_COVAR_USE_AVG

Use the avg matrix instead of automatically calculating the average of each feature. Setting this saves on computation time if you already have the averages (e.g., by having called cvAvg() yourself); otherwise, the routine will compute these averages for you.[240]

Most often you will combine your data into one big matrix, say with one data point per row; in that case flags would be set as flags = CV_COVAR_NORMAL | CV_COVAR_SCALE | CV_COVAR_ROWS, as in the sketch below.

We now have the covariance matrix. For Mahalanobis distance, however, we'll need to divide out the variance of the space and so will need the inverse covariance matrix. This is easily done by using:

double cvInvert(
  const CvArr* src,
  CvArr*       dst,
  int          method = CV_LU
);

In cvInvert(), the src matrix should be the covariance matrix calculated previously and dst should be a matrix of the same size, which will be filled with the inverse on return. You could leave the method at its default value, CV_LU, but it is better to set the method to CV_SVD_SYM.[241]

With the inverse covariance matrix $\Sigma_x^{-1}$ finally in hand, we can move on to the Mahalanobis distance measure. This measure is much like the Euclidean distance measure, which is the square root of the sum of squared differences between two vectors x and y, but it divides out the covariance of the space:

$$D_{\mathrm{Mahalanobis}}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \, \Sigma_x^{-1} \, (\mathbf{x} - \mathbf{y})}$$

This distance is just a number. Note that if the covariance matrix is the identity matrix then the Mahalanobis distance is equal to the Euclidean distance. We finally arrive at the actual function that computes the Mahalanobis distance. It takes two input vectors (vec1 and vec2) and the inverse covariance in mat, and it returns the distance as a double:

double cvMahalanobis(
  const CvArr* vec1,
  const CvArr* vec2,
  const CvArr* mat
);

The Mahalanobis distance is an important measure of similarity between two data points in a multidimensional space, but it is not a clustering algorithm or classifier itself. Let us now move on to such algorithms, starting with the most frequently used clustering algorithm: K-means.



[239] Note that Figure 13-4 has a diagonal covariance matrix, which entails independent X and Y variances rather than any actual cross-covariance. This was done to keep the explanation simple. In reality, data is often "stretched" in much more interesting ways.

[240] A precomputed average data vector should be passed if the user has a more statistically justified value of the average or if the covariance matrix is computed by blocks.

[241] CV_SVD could also be used in this case, but it is somewhat slower and less accurate than CV_SVD_SYM. Even though CV_SVD_SYM is slower than CV_LU, it should still be used if the dimensionality of the space is much smaller than the number of data points; in that case the overall computing time will be dominated by cvCalcCovarMatrix() anyway, so it may be wise to spend a little more time computing the inverse covariance matrix more accurately (much more accurately, if the set of points is concentrated in a subspace of smaller dimensionality). Thus, CV_SVD_SYM is usually the best choice for this task.