K-means is a clustering algorithm implemented in the cxcore library because it was written long before the ML library. K-means attempts to find the natural clusters or "clumps" in the data. The user sets the desired number of clusters and then K-means rapidly finds a good placement for those cluster centers, where "good" means that the cluster centers tend to end up located in the middle of the natural clumps of data. It is one of the most used clustering techniques and has strong similarities to the expectation maximization algorithm for Gaussian mixtures (implemented as CvEM() in the ML library) as well as some similarities to the mean-shift algorithm discussed in Chapter 9 (implemented as cvMeanShift() in the CV library). K-means is an iterative algorithm and, as implemented in OpenCV, is also known as Lloyd's algorithm[242] or (equivalently) "Voronoi iteration". The algorithm runs as follows (a minimal C sketch of the loop appears after the steps).
1. Take as input (a) a data set and (b) the desired number of clusters K (chosen by the user).
2. Randomly assign cluster center locations.
3. Associate each data point with its nearest cluster center.
4. Move the cluster centers to the centroid of their data points.
5. Return to step 3 until convergence (the centroids no longer move).
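What follows is a minimal sketch of Lloyd's iteration for 2D points in plain C, for illustration only; it is not the OpenCV implementation. The function name lloyd_2d and the arrays x, y, cx, cy, and label are made-up names, and the cluster centers are assumed to have already been initialized (step 2) before the call.

#include <float.h>
#include <math.h>

/* Illustrative sketch of Lloyd's iteration for 2D points (not the OpenCV code).
   x[], y[]   : the n data points
   cx[], cy[] : the K cluster centers, already initialized (step 2)
   label[]    : output, the cluster index assigned to each point          */
void lloyd_2d( const float* x, const float* y, int n,
               float* cx, float* cy, int* label, int K, int max_iter )
{
    int i, k, iter;
    for( iter = 0; iter < max_iter; iter++ ) {
        /* Step 3: associate each data point with its nearest cluster center. */
        for( i = 0; i < n; i++ ) {
            float best = FLT_MAX;
            for( k = 0; k < K; k++ ) {
                float dx = x[i] - cx[k], dy = y[i] - cy[k];
                float d  = dx*dx + dy*dy;
                if( d < best ) { best = d; label[i] = k; }
            }
        }
        /* Step 4: move each cluster center to the centroid of its points. */
        float moved = 0.f;
        for( k = 0; k < K; k++ ) {
            float sx = 0.f, sy = 0.f;
            int   cnt = 0;
            for( i = 0; i < n; i++ )
                if( label[i] == k ) { sx += x[i]; sy += y[i]; cnt++; }
            if( cnt > 0 ) {
                float nx = sx/cnt, ny = sy/cnt;
                moved += fabsf(nx - cx[k]) + fabsf(ny - cy[k]);
                cx[k] = nx; cy[k] = ny;
            }
        }
        /* Step 5: stop when the centers no longer move. */
        if( moved == 0.f )
            break;
    }
}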
Figure 13-5 diagrams K-means in action; in this case, it takes just two iterations to converge. In real cases the algorithm often converges rapidly, but it can sometimes require a large number of iterations.
K-means is an extremely effective clustering algorithm, but it does have three problems.
1. K-means isn't guaranteed to find the best possible solution for locating the cluster centers. However, it is guaranteed to converge to some solution (i.e., the iterations won't continue indefinitely).
2. K-means doesn't tell you how many cluster centers you should use. If we had chosen two or four clusters for the example of Figure 13-5, then the results would be different and perhaps nonintuitive.
3. K-means presumes that the covariance in the space either doesn't matter or has already been normalized (cf. our discussion of the Mahalanobis distance).
Each one of these problems has a "solution", or at least an approach that helps. The first two of these solutions depend on "explaining the variance of the data". In K-means, each cluster center "owns" its data points and we compute the variance of those points.
Figure 13-5. K-means in action for two iterations: (a) cluster centers are placed randomly and each data point is then assigned to its nearest cluster center; (b) cluster centers are moved to the centroid of their points; (c) data points are again assigned to their nearest cluster centers; (d) cluster centers are again moved to the centroid of their points
The best clustering minimizes the variance without causing too much complexity (too many clusters). With that in mind, the listed problems can be ameliorated as follows.
1. Run K-means several times, each with a different placement of the cluster centers (easy to do, since OpenCV places the centers at random); then choose the run whose results exhibit the least variance.
2. Start with one cluster and try an increasing number of clusters (up to some limit), each time employing the method of #1 as well. Usually the total variance will shrink quite rapidly, after which an "elbow" will appear in the variance curve; this indicates that a new cluster center does not significantly reduce the total variance. Stop at the elbow and keep that many cluster centers. (A sketch of this procedure follows the list.)
3. Multiply the data by the inverse covariance matrix (as described in the "Mahalanobis Distance" section). For example, if the input data vectors D are organized as rows with one data point per row, then normalize the "stretch" in the space by computing a new data vector D* = D·Σ^(-1/2), where Σ = cov(D) is the covariance matrix of the data.
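With the old C interface described below, cvKMeans2() does not report the total within-cluster variance, so for the first two remedies it has to be computed from the returned labels. The following is a rough sketch under that assumption, not library code: points is taken to be an N-by-1 CV_32FC2 matrix of samples (like the one built in Example 13-1), and the helper names compactness_of() and sweep_k() are invented for illustration.

#include <float.h>
#include <stdio.h>
#include "cxcore.h"

/* Total within-cluster variance ("compactness") of one K-means labeling. */
static double compactness_of( const CvMat* points, const CvMat* labels, int K )
{
    const CvPoint2D32f* pts = (const CvPoint2D32f*)points->data.fl;
    int    i, k, n = points->rows;
    double sum = 0;
    for( k = 0; k < K; k++ ) {
        double cx = 0, cy = 0;
        int    cnt = 0;
        for( i = 0; i < n; i++ )
            if( labels->data.i[i] == k ) { cx += pts[i].x; cy += pts[i].y; cnt++; }
        if( cnt == 0 ) continue;
        cx /= cnt; cy /= cnt;
        for( i = 0; i < n; i++ )
            if( labels->data.i[i] == k ) {
                double dx = pts[i].x - cx, dy = pts[i].y - cy;
                sum += dx*dx + dy*dy;
            }
    }
    return sum;
}

/* Remedies #1 and #2: for each candidate K, keep the best of several random
   restarts and print its variance; look for the "elbow" in the printed curve. */
static void sweep_k( const CvMat* points, int max_K, int attempts )
{
    CvMat* labels = cvCreateMat( points->rows, 1, CV_32SC1 );
    int    K, a;
    for( K = 1; K <= max_K; K++ ) {
        double best = DBL_MAX;
        for( a = 0; a < attempts; a++ ) {
            cvKMeans2( points, K, labels,
                       cvTermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0 ));
            double c = compactness_of( points, labels, K );
            if( c < best ) best = c;
        }
        printf( "K = %2d   total within-cluster variance = %g\n", K, best );
    }
    cvReleaseMat( &labels );
}

In practice you would also keep the labels of the best run (e.g., with cvCopy()) rather than just its variance.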
The call for K-means is simple:
void cvKMeans2(
    const CvArr*    samples,
    int             cluster_count,
    CvArr*          labels,
    CvTermCriteria  termcrit
);
The samples array is a matrix of multidimensional data points, one per row. There is a little subtlety here in that each element of the data point may be either a regular floating-point vector of CV_32FC1 numbers or a multidimensional point of type CV_32FC2 or CV_32FC3 or even CV_32FC(K).[243] The parameter cluster_count is simply how many clusters you want, and the return vector labels contains the final cluster index for each data point. We encountered termcrit in the section "Common Routines in the ML Library" and in the "Controlling Training Iterations" subsection.
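To make the layout subtlety concrete, here is a small hedged sketch showing two equivalent ways of packing N two-dimensional samples for cvKMeans2(); the function and variable names and the value of N are placeholders, not part of the library.

#include "cxcore.h"

/* Illustrative sketch only; layout_example and the names below are made up. */
void layout_example( void )
{
    int    N   = 100;
    CvRNG  rng = cvRNG(-1);

    /* (a) one two-channel point per row ... */
    CvMat* samples_2c = cvCreateMat( N, 1, CV_32FC2 );

    /* (b) ... or one row per point with one single-channel column per
       coordinate; the memory layout is identical (see the footnote below). */
    CvMat* samples_1c = cvCreateMat( N, 2, CV_32FC1 );

    /* Fill one of them with something to cluster. */
    cvRandArr( &rng, samples_2c, CV_RAND_UNI,
               cvScalar(0,0,0,0), cvScalar(100,100,0,0) );

    /* Either matrix may be passed as samples; labels is always N x 1, CV_32SC1. */
    CvMat* labels = cvCreateMat( N, 1, CV_32SC1 );
    cvKMeans2( samples_2c, 3, labels,
               cvTermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0 ));

    cvReleaseMat( &samples_2c );
    cvReleaseMat( &samples_1c );
    cvReleaseMat( &labels );
}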
It's instructive to see a complete example of K-means in code (Example 13-1), because the data generation sections can be used to test other machine learning routines.
Example 13-1. Using K-means
#include "cxcore.h" #include "highgui.h" void main( int argc, char** argv ) { #define MAX_CLUSTERS 5 CvScalar color_tab[MAX_CLUSTERS]; IplImage* img = cvCreateImage( cvSize( 500, 500 ), 8, 3 ); CvRNG rng = cvRNG(0xffffffff); color_tab[0] = CV_RGB(255,0,0); color_tab[1] = CV_RGB(0,255,0); color_tab[2] = CV_RGB(100,100,255); color_tab[3] = CV_RGB(255,0,255); color_tab[4] = CV_RGB(255,255,0); cvNamedWindow( "clusters", 1 ); for(;;) { int k, cluster_count = cvRandInt(&rng)%MAX_CLUSTERS + 1; int i, sample_count = cvRandInt(&rng)%1000 + 1; CvMat* points = cvCreateMat( sample_count, 1, CV_32FC2 ); CvMat* clusters = cvCreateMat( sample_count, 1, CV_32SC1 ); /* generate random sample from multivariate Gaussian distribution */ for( k = 0; k < cluster_count; k++ ) { CvPoint center; CvMat point_chunk; center.x = cvRandInt(&rng)%img->width; center.y = cvRandInt(&rng)%img->height; cvGetRows( points, &point_chunk, k*sample_count/cluster_count, k == cluster_count - 1 ? sample_count : (k+1)*sample_count/cluster_count ); cvRandArr( &rng, &point_chunk, CV_RAND_NORMAL, cvScalar(center.x,center.y,0,0), cvScalar(img->width/6, img->height/6,0,0) ); } /* shuffle samples */ for( i = 0; i < sample_count/2; i++ ) { CvPoint2D32f* pt1 = (CvPoint2D32f*)points->data.fl + cvRandInt(&rng)%sample_count; CvPoint2D32f* pt2 = (CvPoint2D32f*)points->data.fl + cvRandInt(&rng)%sample_count; CvPoint2D32f temp; CV_SWAP( *pt1, *pt2, temp ); } cvKMeans2( points, cluster_count, clusters, cvTermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0 )); cvZero( img ); for( i = 0; i < sample_count; i++ ) { CvPoint2D32f pt = ((CvPoint2D32f*)points->data.fl)[i]; int cluster_idx = clusters->data.i[i]; cvCircle( img, cvPointFrom32f(pt), 2, color_tab[cluster_idx], CV_FILLED ); } cvReleaseMat( &points ); cvReleaseMat( &clusters ); cvShowImage( "clusters", img ); int key = cvWaitKey(0); if( key == 27 ) // 'ESC' break; } }
In this code we included highgui.h in order to use a window output interface and cxcore.h because it contains cvKMeans2(). In main(), we set up the coloring of the returned clusters for display, set the upper limit on how many cluster centers can be chosen at random to MAX_CLUSTERS (here 5) in cluster_count, and allow up to 1,000 data points, where the random value for this is kept in sample_count. In the outer for{} loop, which repeats until the Esc key is hit, we allocate a floating-point matrix points to contain sample_count data points (in this case, a single column of 2D data points, CV_32FC2) and allocate an integer matrix clusters to contain their resulting cluster labels, 0 through cluster_count - 1.
We next enter a data generation for{} loop that can be reused for testing other algorithms. For each cluster, we fill in the points array in successive chunks of size sample_count/cluster_count. Each chunk is filled with a normal distribution, CV_RAND_NORMAL, of 2D (CV_32FC2) data points centered on a randomly chosen 2D center.
The next for{} loop merely shuffles the resulting total "pack" of points. We then call cvKMeans2(), which runs until the largest movement of a cluster center is less than 1.0 (but allowing no more than ten iterations).
The final for{} loop just draws the results. This is followed by de-allocating the allocated arrays and displaying the results in the "clusters" image. Finally, we wait indefinitely (cvWaitKey(0)) to allow the user another run or to quit via the Esc key.
[242] S. P. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory 28 (1982), 129–137.
[243] This is exactly equivalent to an N-by-K matrix in which the N rows are the data points, the K columns are the individual components of each point's location, and the underlying data type is CV_32FC1. Recall that, owing to the memory layout used for arrays, there is no distinction between these representations.