K-means is a clustering algorithm implemented in the cxcore library because it was written long before the ML library. K-means attempts to find the natural clusters or "clumps" in the data. The user sets the desired number of clusters and then K-means rapidly finds a good placement for those cluster centers, where "good" means that the cluster centers tend to end up located in the middle of the natural clumps of data. It is one of the most used clustering techniques and has strong similarities to the expectation maximization algorithm for Gaussian mixtures (implemented as CvEM() in the ML library) as well as some similarities to the mean-shift algorithm discussed in Chapter 9 (implemented as cvMeanShift() in the CV library). K-means is an iterative algorithm and, as implemented in OpenCV, is also known as Lloyd's algorithm[242] or (equivalently) "Voronoi iteration". The algorithm runs as follows (a minimal C sketch of the loop appears after the steps).
1. Take as input (a) a data set and (b) the desired number of clusters K (chosen by the user).
2. Randomly assign cluster center locations.
3. Associate each data point with its nearest cluster center.
4. Move the cluster centers to the centroid of their data points.
5. Return to step 3 until convergence (the centroids no longer move).
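What follows is a minimal sketch of Lloyd's iteration for 2D points in plain C, for illustration only; it is not the OpenCV implementation. The function name lloyd_2d and the arrays x, y, cx, cy, and label are made-up names, and the cluster centers are assumed to have already been initialized (step 2) before the call.

#include <float.h>
#include <math.h>

/* Illustrative sketch of Lloyd's iteration for 2D points (not the OpenCV code).
   x[], y[]   : the n data points
   cx[], cy[] : the K cluster centers, already initialized (step 2)
   label[]    : output, the cluster index assigned to each point          */
void lloyd_2d( const float* x, const float* y, int n,
               float* cx, float* cy, int* label, int K, int max_iter )
{
    int i, k, iter;
    for( iter = 0; iter < max_iter; iter++ ) {
        /* Step 3: associate each data point with its nearest cluster center. */
        for( i = 0; i < n; i++ ) {
            float best = FLT_MAX;
            for( k = 0; k < K; k++ ) {
                float dx = x[i] - cx[k], dy = y[i] - cy[k];
                float d  = dx*dx + dy*dy;
                if( d < best ) { best = d; label[i] = k; }
            }
        }
        /* Step 4: move each cluster center to the centroid of its points. */
        float moved = 0.f;
        for( k = 0; k < K; k++ ) {
            float sx = 0.f, sy = 0.f;
            int   cnt = 0;
            for( i = 0; i < n; i++ )
                if( label[i] == k ) { sx += x[i]; sy += y[i]; cnt++; }
            if( cnt > 0 ) {
                float nx = sx/cnt, ny = sy/cnt;
                moved += fabsf(nx - cx[k]) + fabsf(ny - cy[k]);
                cx[k] = nx; cy[k] = ny;
            }
        }
        /* Step 5: stop when the centers no longer move. */
        if( moved == 0.f )
            break;
    }
}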
Figure 13-5 diagrams K-means in action; in this case, it takes just two iterations to converge. In real cases the algorithm often converges rapidly, but it can sometimes require a large number of iterations.
K-means is an extremely effective clustering algorithm, but it does have three problems.
1. K-means isn't guaranteed to find the best possible solution for locating the cluster centers. However, it is guaranteed to converge to some solution (i.e., the iterations won't continue indefinitely).
2. K-means doesn't tell you how many cluster centers you should use. If we had chosen two or four clusters for the example of Figure 13-5, then the results would be different and perhaps nonintuitive.
3. K-means presumes that the covariance in the space either doesn't matter or has already been normalized (cf. our discussion of the Mahalanobis distance).
Each one of these problems has a "solution", or at least an approach that helps. The first two of these solutions depend on "explaining the variance of the data". In K-means, each cluster center "owns" its data points and we compute the variance of those points.
Figure 13-5. K-means in action for two iterations: (a) cluster centers are placed randomly and each data point is then assigned to its nearest cluster center; (b) cluster centers are moved to the centroid of their points; (c) data points are again assigned to their nearest cluster centers; (d) cluster centers are again moved to the centroid of their points
The best clustering minimizes the variance without causing too much complexity (too many clusters). With that in mind, the listed problems can be ameliorated as follows.
1. Run K-means several times, each with a different placement of the cluster centers (easy to do, since OpenCV places the centers at random); then choose the run whose results exhibit the least variance.
2. Start with one cluster and try an increasing number of clusters (up to some limit), each time employing the method of #1 as well. Usually the total variance will shrink quite rapidly, after which an "elbow" will appear in the variance curve; this indicates that a new cluster center does not significantly reduce the total variance. Stop at the elbow and keep that many cluster centers. (A sketch of this procedure follows the list.)
3. Multiply the data by the inverse covariance matrix (as described in the "Mahalanobis Distance" section). For example, if the input data vectors D are organized as rows with one data point per row, then normalize the "stretch" in the space by computing a new data vector D* = D·Σ^(-1/2), where Σ = cov(D) is the covariance matrix of the data.
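With the old C interface described below, cvKMeans2() does not report the total within-cluster variance, so for the first two remedies it has to be computed from the returned labels. The following is a rough sketch under that assumption, not library code: points is taken to be an N-by-1 CV_32FC2 matrix of samples (like the one built in Example 13-1), and the helper names compactness_of() and sweep_k() are invented for illustration.

#include <float.h>
#include <stdio.h>
#include "cxcore.h"

/* Total within-cluster variance ("compactness") of one K-means labeling. */
static double compactness_of( const CvMat* points, const CvMat* labels, int K )
{
    const CvPoint2D32f* pts = (const CvPoint2D32f*)points->data.fl;
    int    i, k, n = points->rows;
    double sum = 0;
    for( k = 0; k < K; k++ ) {
        double cx = 0, cy = 0;
        int    cnt = 0;
        for( i = 0; i < n; i++ )
            if( labels->data.i[i] == k ) { cx += pts[i].x; cy += pts[i].y; cnt++; }
        if( cnt == 0 ) continue;
        cx /= cnt; cy /= cnt;
        for( i = 0; i < n; i++ )
            if( labels->data.i[i] == k ) {
                double dx = pts[i].x - cx, dy = pts[i].y - cy;
                sum += dx*dx + dy*dy;
            }
    }
    return sum;
}

/* Remedies #1 and #2: for each candidate K, keep the best of several random
   restarts and print its variance; look for the "elbow" in the printed curve. */
static void sweep_k( const CvMat* points, int max_K, int attempts )
{
    CvMat* labels = cvCreateMat( points->rows, 1, CV_32SC1 );
    int    K, a;
    for( K = 1; K <= max_K; K++ ) {
        double best = DBL_MAX;
        for( a = 0; a < attempts; a++ ) {
            cvKMeans2( points, K, labels,
                       cvTermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0 ));
            double c = compactness_of( points, labels, K );
            if( c < best ) best = c;
        }
        printf( "K = %2d   total within-cluster variance = %g\n", K, best );
    }
    cvReleaseMat( &labels );
}

In practice you would also keep the labels of the best run (e.g., with cvCopy()) rather than just its variance.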
The call for K-means is simple:
void cvKMeans2(
    const CvArr*    samples,
    int             cluster_count,
    CvArr*          labels,
    CvTermCriteria  termcrit
);
The samples array is a matrix of multidimensional data points, one per row. There is a little subtlety here in that each element of the data point may be either a regular floating-point vector of CV_32FC1 numbers or a multidimensional point of type CV_32FC2 or CV_32FC3 or even CV_32FC(K).[243] The parameter cluster_count is simply how many clusters you want, and the return vector labels contains the final cluster index for each data point. We encountered termcrit in the section "Common Routines in the ML Library" and in the "Controlling Training Iterations" subsection.
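To make the layout subtlety concrete, here is a small hedged sketch showing two equivalent ways of packing N two-dimensional samples for cvKMeans2(); the function and variable names and the value of N are placeholders, not part of the library.

#include "cxcore.h"

/* Illustrative sketch only; layout_example and the names below are made up. */
void layout_example( void )
{
    int    N   = 100;
    CvRNG  rng = cvRNG(-1);

    /* (a) one two-channel point per row ... */
    CvMat* samples_2c = cvCreateMat( N, 1, CV_32FC2 );

    /* (b) ... or one row per point with one single-channel column per
       coordinate; the memory layout is identical (see the footnote below). */
    CvMat* samples_1c = cvCreateMat( N, 2, CV_32FC1 );

    /* Fill one of them with something to cluster. */
    cvRandArr( &rng, samples_2c, CV_RAND_UNI,
               cvScalar(0,0,0,0), cvScalar(100,100,0,0) );

    /* Either matrix may be passed as samples; labels is always N x 1, CV_32SC1. */
    CvMat* labels = cvCreateMat( N, 1, CV_32SC1 );
    cvKMeans2( samples_2c, 3, labels,
               cvTermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0 ));

    cvReleaseMat( &samples_2c );
    cvReleaseMat( &samples_1c );
    cvReleaseMat( &labels );
}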
It's instructive to see a complete example of K-means in code (Example 13-1), because the data generation sections can be used to test other machine learning routines.
Example 13-1. Using K-means
#include "cxcore.h" #include "highgui.h" void main( int argc, char** argv ) { #define MAX_CLUSTERS 5 CvScalar color_tab[MAX_CLUSTERS]; IplImage* img = cvCreateImage( cvSize( 500, 500 ), 8, 3 ); CvRNG rng = cvRNG(0xffffffff); color_tab[0] = CV_RGB(255,0,0); color_tab[1] = CV_RGB(0,255,0); color_tab[2] = CV_RGB(100,100,255); color_tab[3] = CV_RGB(255,0,255); color_tab[4] = CV_RGB(255,255,0); cvNamedWindow( "clusters", 1 ); for(;;) { int k, cluster_count = cvRandInt(&rng)%MAX_CLUSTERS + 1; int i, sample_count = cvRandInt(&rng)%1000 + 1; CvMat* points = cvCreateMat( sample_count, 1, CV_32FC2 ); CvMat* clusters = cvCreateMat( sample_count, 1, CV_32SC1 ); /* generate random sample from multivariate Gaussian distribution */ for( k = 0; k < cluster_count; k++ ) { CvPoint center; CvMat point_chunk; center.x = cvRandInt(&rng)%img->width; center.y = cvRandInt(&rng)%img->height; cvGetRows( points, &point_chunk, k*sample_count/cluster_count, k == cluster_count - 1 ? sample_count : (k+1)*sample_count/cluster_count ); cvRandArr( &rng, &point_chunk, CV_RAND_NORMAL, cvScalar(center.x,center.y,0,0), cvScalar(img->width/6, img->height/6,0,0) ); } /* shuffle samples */ for( i = 0; i < sample_count/2; i++ ) { CvPoint2D32f* pt1 = (CvPoint2D32f*)points->data.fl + cvRandInt(&rng)%sample_count; CvPoint2D32f* pt2 = (CvPoint2D32f*)points->data.fl + cvRandInt(&rng)%sample_count; CvPoint2D32f temp; CV_SWAP( *pt1, *pt2, temp ); } cvKMeans2( points, cluster_count, clusters, cvTermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0 )); cvZero( img ); for( i = 0; i < sample_count; i++ ) { CvPoint2D32f pt = ((CvPoint2D32f*)points->data.fl)[i]; int cluster_idx = clusters->data.i[i]; cvCircle( img, cvPointFrom32f(pt), 2, color_tab[cluster_idx], CV_FILLED ); } cvReleaseMat( &points ); cvReleaseMat( &clusters ); cvShowImage( "clusters", img ); int key = cvWaitKey(0); if( key == 27 ) // 'ESC' break; } }
In this code we included highgui.h in order to use a window output interface and cxcore.h because it contains cvKMeans2(). In main(), we set up the coloring of the returned clusters for display, set the upper limit on how many cluster centers can be chosen at random to MAX_CLUSTERS (here 5) in cluster_count, and allow up to 1,000 data points, where the random value for this is kept in sample_count. In the outer for{} loop, which repeats until the Esc key is hit, we allocate a floating-point matrix points to contain sample_count data points (in this case, a single column of 2D data points, CV_32FC2) and allocate an integer matrix clusters to contain their resulting cluster labels, 0 through cluster_count - 1.
We next enter a data generation for{} loop that can be reused for testing other algorithms. For each cluster, we fill in the points array in successive chunks of size sample_count/cluster_count. Each chunk is filled with a normal distribution, CV_RAND_NORMAL, of 2D (CV_32FC2) data points centered on a randomly chosen 2D center.
The next for{} loop merely shuffles the resulting total "pack" of points. We then call cvKMeans2(), which runs until the largest movement of a cluster center is less than 1.0 (but allowing no more than ten iterations).
The final for{} loop just draws the results. This is followed by de-allocating the allocated arrays and displaying the results in the "clusters" image. Finally, we wait indefinitely (cvWaitKey(0)) to allow the user another run or to quit via the Esc key.
[242] S. P. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory 28 (1982), 129–137.
[243] This is exactly equivalent to an N-by-K matrix in which the N rows are the data points, the K columns are the individual components of each point's location, and the underlying data type is CV_32FC1. Recall that, owing to the memory layout used for arrays, there is no distinction between these representations.