Hierarchical clustering

Of the several clustering algorithms that we will examine in this chapter, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space.

The general setup is that we have a dataset S of m points in Euclidean space, which we want to partition into a given number k of clusters C1, C2,..., Ck, where within each cluster the points are relatively close together.

Here is the algorithm:

  1. Create a singleton cluster for each of the m data points.
  2. Repeat m – k times:
    • Find the two clusters whose centroids are closest
    • Replace those two clusters with a new cluster that contains their points

The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point (5, 4), because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in Figure 8.6:
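
To make the arithmetic concrete, here is a minimal Java sketch (independent of the Point and Cluster classes in Listings 8.3 and 8.4) that computes the centroid of that example cluster:

```java
import java.util.Arrays;
import java.util.List;

public class CentroidDemo {
    public static void main(String[] args) {
        // The example cluster C = {(2,4), (3,5), (6,6), (9,1)}
        List<double[]> cluster = Arrays.asList(
                new double[]{2, 4}, new double[]{3, 5},
                new double[]{6, 6}, new double[]{9, 1});

        double sumX = 0, sumY = 0;
        for (double[] p : cluster) {
            sumX += p[0];
            sumY += p[1];
        }
        int m = cluster.size();
        // Prints: Centroid: (5.0, 4.0)
        System.out.printf("Centroid: (%.1f, %.1f)%n", sumX / m, sumY / m);
    }
}
```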

Figure 8.6: The centroid of a cluster

A Java implementation of this algorithm is shown in Listing 8.2, with partial output in Figure 8.7. It uses a Point class and a Cluster class, which are shown in Listing 8.3 and Listing 8.4, respectively.

The dataset is defined at lines 11-12 in Listing 8.2:

Listing 8.2: An implementation of hierarchical clustering

These 13 data points are loaded in the clusters set at line 17. Then the loop at lines 18-22 iterates m – k times, as specified in the algorithm.

A cluster, as defined by the Cluster class in Listing 8.4, contains two objects: a set of points and a centroid, which is a single point. The distance between two clusters is defined as the Euclidean distance between their centroids. Note that, for simplicity, some of the code in Listing 8.4 has been folded.
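
To give a feel for that structure, here is a rough, hypothetical sketch of such a cluster class; it is not the book's Cluster class from Listing 8.4, and the names (SimpleCluster, recompute()) are my own:

```java
import java.util.HashSet;
import java.util.Set;

// A minimal stand-in for the idea in Listing 8.4: a set of points
// together with their centroid.
class SimpleCluster {
    private final Set<double[]> points = new HashSet<>();
    private double[] centroid;               // recomputed whenever the points change

    SimpleCluster(double... point) {
        points.add(point);
        centroid = point.clone();
    }

    void addAll(SimpleCluster that) {        // union of two clusters
        points.addAll(that.points);
        recompute();
    }

    private void recompute() {               // centroid = coordinate-wise mean
        double sumX = 0, sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        centroid = new double[]{sumX / points.size(), sumY / points.size()};
    }

    // Distance between clusters = Euclidean distance between their centroids
    double distanceTo(SimpleCluster that) {
        double dx = centroid[0] - that.centroid[0];
        double dy = centroid[1] - that.centroid[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    Set<double[]> getPoints() { return points; }
    double[] getCentroid()    { return centroid; }
}
```

Storing the centroid in a field and recomputing it only when the cluster changes avoids recomputing it for every distance calculation.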

The program uses the HashSet<Cluster> class to implement the set of clusters. That is why the Cluster class overrides the hashCode() and equals() methods (at lines 52-68 in Listing 8.4). That, in turn, requires the Point class to override its corresponding methods (at lines 27-43 in Listing 8.3).

Note that the Point class defines private fields xb and yb of type long. These hold the 64-bit representations of the double values of x and y, providing a more reliable way to determine when they are equal.

The output shown in Figure 8.7 is generated by the code at lines 19 and 21 of the program. The call to println() at line 21 implicitly invokes the overridden toString() method at lines 70-74 of the Cluster class.

The coalesce() method at lines 33-49 implements the two parts of step 2 of the algorithm. The double loop at lines 36-44 finds the two clusters that are closest to each other (step 2 first part). These are removed from the clusters set and their union is added to it at lines 45-47 (step 2, second part).
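
In the same spirit, the coalescing step might look like the following sketch, written against the hypothetical SimpleCluster class above; the book's actual coalesce() method in Listing 8.2 differs in its details:

```java
import java.util.Set;

class Coalescer {
    // Find the two closest clusters, remove them, and add their union
    // (step 2 of the algorithm).
    static void coalesce(Set<SimpleCluster> clusters) {
        SimpleCluster bestA = null, bestB = null;
        double minDist = Double.POSITIVE_INFINITY;
        for (SimpleCluster a : clusters) {          // double loop over all
            for (SimpleCluster b : clusters) {      // pairs of clusters
                if (a != b && a.distanceTo(b) < minDist) {
                    minDist = a.distanceTo(b);
                    bestA = a;
                    bestB = b;
                }
            }
        }
        if (bestA == null) return;                  // fewer than two clusters
        clusters.remove(bestA);
        clusters.remove(bestB);
        bestA.addAll(bestB);                        // union of the two clusters
        clusters.add(bestA);
    }
}
```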

The output in Figure 8.7 shows the results of two iterations of the main loop: six clusters coalescing into five, and then into four.

Figure 8.7: Partial output from hierarchical clustering program

These three stages are illustrated in Figure 8.8, Figure 8.9, and Figure 8.10.

Figure 8.8: Output with six clusters

Figure 8.9: Output with five clusters

Figure 8.10: Output with four clusters

Among the six clusters shown in Figure 8.8, you can see that the two closest are the ones whose centroids are (6.00, 3.50) and (6.33, 5.67). They get replaced by their union, the five-element cluster whose centroid is (6.20, 4.80).

Figure 8.11: Output with three clusters

We can set the number K to any value from 1 to M (at line 14). Although setting K = 1 has no practical value, it can be used to generate the dendrogram for the original dataset, shown in Figure 8.12. It graphically displays the hierarchical structure of the entire clustering process, identifying each coalesce step.

Note that the dendrogram is not a complete transcript of the process. It shows, for example, that (3,4) and (4,3) get united before they unite with (3,2), but it does not show whether (1,5) and (2,6) get united before or after (1,1) and (1,3) do.

Although easy to understand and implement, hierarchical clustering is not very efficient. The basic version, shown in Listing 8.2, runs in O(n³) time, which means that the number of operations performed is roughly proportional to the cube of the number of points in the dataset. So, if you double the number of points, the program will take about eight times as long to run. That is not practical for large datasets.

You can see where the O(n³) comes from by looking at the code in Listing 8.2. There, n = 13. The main loop (lines 18-22) iterates nearly n times. Each iteration calls the coalesce() method, which contains a double loop (lines 36-44) over the current clusters; if there are c clusters, each of the two loops iterates c times. The number of clusters decreases from n down to k, averaging about n/2, so each call to the coalesce() method performs about (n/2)² operations, which is proportional to n². Being called nearly n times gives us the O(n³) runtime.

Listing 8.3: A Point class

This kind of complexity analysis is standard for computer algorithms. The main idea is to find some simple function f(n) that classifies the algorithm's running time this way. The O(n³) classification means slow: the running time grows in proportion to the cube of the size of the input. (The letter O stands for "order of", so O(n³) means "on the order of n³".)

We can improve the runtime classification of the hierarchical clustering algorithm by using a more elaborate data structure. The idea is to maintain a priority queue (a balanced binary tree structure) of objects, where each object consists of a pair of points and the distance between them. Objects can be inserted into and removed from a priority queue in O(log n) time. Consequently, the whole algorithm can be run in O(n² log n) time, which is almost as good as O(n²).
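
For instance, java.util.PriorityQueue (a binary heap) can keep such pair objects ordered by distance, as in this small sketch; the PairEntry class is hypothetical, not part of the book's listings:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A hypothetical pair-of-points entry, ordered by the distance between them.
class PairEntry {
    final double x1, y1, x2, y2, distance;

    PairEntry(double x1, double y1, double x2, double y2) {
        this.x1 = x1; this.y1 = y1;
        this.x2 = x2; this.y2 = y2;
        this.distance = Math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
    }
}

class PriorityQueueDemo {
    public static void main(String[] args) {
        // The entry with the smallest distance is always at the head of the
        // queue; offer() and poll() each run in O(log n) time.
        PriorityQueue<PairEntry> queue =
                new PriorityQueue<>(Comparator.comparingDouble((PairEntry e) -> e.distance));

        queue.offer(new PairEntry(2, 4, 3, 5));
        queue.offer(new PairEntry(6, 6, 9, 1));

        PairEntry closest = queue.poll();    // the closest pair seen so far
        System.out.println("Closest pair distance: " + closest.distance);
    }
}
```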

This improvement is a good example of the classic trade-off between speed and simplicity in computing: we can often make an algorithm faster (more efficient) at the cost of making it more complicated. Of course, the same thing can be said about cars and airplanes.

The Point class in Listing 8.3 encapsulates the idea of a two-dimensional point in Euclidean space. The hashCode() and equals() methods must be included (overriding the default versions defined in the Object class) because we intend to use this class as the element type in a HashSet (in Listing 8.4).

The code at lines 26-29 defines a typical implementation; the expression new Double(x).hashCode() at line 26 returns the hashCode of the Double object that represents the value of x.

The code at lines 33-42 similarly defines a typical implementation of an equals() method. The first statement (lines 33-39) checks whether the explicit object is null, equals the implicit object (this), and is itself an instance of the Point class, taking the appropriate action in each case. If it passes those three tests, then it is cast as a Point object at line 40 so that we can access its x and y fields.

To check whether they match the corresponding double fields of the implicit object, we use an auxiliary bits() method, which simply returns a long integer containing all the bits that represent the specified double value.
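
The following sketch shows a typical Point class written along those lines. It is not a reproduction of Listing 8.3; in particular, it assumes that the book's bits() method behaves like Double.doubleToLongBits(), and the hashCode() formula is just one reasonable choice:

```java
import java.util.Objects;

// A sketch of a two-dimensional point with value-based equality, suitable
// for use as the element type of a HashSet.
public class Point2D {
    private final double x, y;

    public Point2D(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Returns the 64-bit representation of a double value.
    private static long bits(double value) {
        return Double.doubleToLongBits(value);
    }

    @Override
    public int hashCode() {
        return Objects.hash(bits(x), bits(y));
    }

    @Override
    public boolean equals(Object object) {
        if (object == this) {
            return true;                      // same object
        }
        if (!(object instanceof Point2D)) {
            return false;                     // null or wrong type
        }
        Point2D that = (Point2D) object;      // cast so we can access x and y
        return bits(this.x) == bits(that.x) && bits(this.y) == bits(that.y);
    }
}
```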

Listing 8.4: A Cluster class

The program in Listing 8.5, which uses the Weka library's HierarchicalClusterer class, is equivalent to the one in Listing 8.2:

You can see that the results are the same as illustrated in Figure 8.11: The first seven points are in cluster number 0, and all the others except (7,1) are in cluster number 1.

The load() method at lines 40-52 uses an ArrayList to specify the two attributes, x and y. Then it creates the dataset as an Instances object at line 44, loads the 13 data points as Instance objects in the loop at lines 45-50, and returns the dataset to the call at line 24.

The code at lines 25-26 specifies that the centroids of the clusters are to be used for computing the distances between the clusters.

The algorithm itself is run by the buildClusterer() method at line 29. Then, the loop at lines 30-34 prints the results. The clusterInstance() method returns the number of the cluster to which the specified instance belongs.
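
Listing 8.5 itself is not reproduced here, but a sketch along the lines just described might look like the following. It assumes Weka's Attribute, DenseInstance, Instances, and HierarchicalClusterer classes (Weka 3.7 or later), and it assumes that the "-L CENTROID" option is what selects centroid-based distances at lines 25-26; only a few of the 13 data points are shown:

```java
import java.util.ArrayList;

import weka.clusterers.HierarchicalClusterer;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class WekaHierarchicalSketch {
    public static void main(String[] args) throws Exception {
        Instances dataset = load();

        HierarchicalClusterer clusterer = new HierarchicalClusterer();
        // Assumed options: 3 clusters, centroid link type (distances measured
        // between cluster centroids).
        clusterer.setOptions(new String[]{"-N", "3", "-L", "CENTROID"});

        clusterer.buildClusterer(dataset);                 // run the algorithm

        for (int i = 0; i < dataset.numInstances(); i++) { // print the results
            System.out.printf("%s -> cluster %d%n",
                    dataset.instance(i), clusterer.clusterInstance(dataset.instance(i)));
        }
    }

    // Builds the dataset as an Instances object with two numeric attributes.
    private static Instances load() {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("x"));
        attributes.add(new Attribute("y"));
        Instances dataset = new Instances("points", attributes, 13);
        // A few of the points mentioned in the text; the book's dataset has 13.
        double[][] data = {{1, 1}, {1, 3}, {3, 2}, {3, 4}, {4, 3}, {7, 1}, {6, 4}, {7, 5}};
        for (double[] p : data) {
            dataset.add(new DenseInstance(1.0, p));
        }
        return dataset;
    }
}
```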

The dendrogram shown in Figure 8.12 can also be generated programmatically, by making three changes to the program in Listing 8.5.

The result is shown in Figure 8.13:

This is topologically the same as the tree in Figure 8.12.

The key code here is at line 65. The HierarchyVisualizer() constructor creates, from the clusterer's graph string, an object that can be displayed by adding it to the frame's content pane.
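
One plausible way to wire up that display is sketched below. It assumes Weka's HierarchyVisualizer class (in the weka.gui.hierarchyvisualizer package) and the clusterer's graph() method; the frame setup is mine, not the book's:

```java
import javax.swing.JFrame;

import weka.clusterers.HierarchicalClusterer;
import weka.gui.hierarchyvisualizer.HierarchyVisualizer;

public class DendrogramSketch {
    // Displays the clusterer's dendrogram in a Swing frame.  The clusterer
    // must already have been built (buildClusterer) on the dataset.
    public static void show(HierarchicalClusterer clusterer) throws Exception {
        String graph = clusterer.graph();    // Newick-format description of the tree

        JFrame frame = new JFrame("Dendrogram");
        frame.setSize(600, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        // Adding the visualizer to the frame's content pane displays the tree.
        frame.getContentPane().add(new HierarchyVisualizer(graph));
        frame.setVisible(true);
    }
}
```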

K-means clustering

A popular alternative to hierarchical clustering is the K-means algorithm. It is related to the K-Nearest Neighbor (KNN) classification algorithm that we saw in Chapter 7, Classification Analysis.

As with hierarchical clustering, the K-means clustering algorithm requires the number of clusters, k, as input. (This version is also called the K-Means++ algorithm.) It also requires k points, one for each cluster, to initialize the algorithm. These initial points can be selected at random, or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of the resulting clusters. Here is the algorithm:

  1. Select k points, one for each cluster, to serve as the initial centroids.
  2. Assign each data point to the cluster whose centroid is closest to it.
  3. Recompute each cluster's centroid, and repeat steps 2-3 until the clusters no longer change.
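
As a concrete illustration of steps 2 and 3, here is a minimal, self-contained sketch of one K-means pass in plain Java; it is not the book's Listing 8.7, and the method names are my own:

```java
import java.util.ArrayList;
import java.util.List;

public class KMeansSketch {
    // One pass of steps 2 and 3: assign each point to its nearest centroid,
    // then recompute each centroid as the mean of its cluster's points.
    // Returns true if any centroid moved, so the caller knows to iterate again.
    static boolean kMeansPass(double[][] points, double[][] centroids) {
        int k = centroids.length;
        List<List<double[]>> clusters = new ArrayList<>();
        for (int j = 0; j < k; j++) {
            clusters.add(new ArrayList<>());
        }

        for (double[] p : points) {                    // step 2: assignment
            int nearest = 0;
            for (int j = 1; j < k; j++) {
                if (sqDist(p, centroids[j]) < sqDist(p, centroids[nearest])) {
                    nearest = j;
                }
            }
            clusters.get(nearest).add(p);
        }

        boolean changed = false;
        for (int j = 0; j < k; j++) {                  // step 3: recompute centroids
            List<double[]> cluster = clusters.get(j);
            if (cluster.isEmpty()) continue;           // keep an empty cluster's centroid
            double sumX = 0, sumY = 0;
            for (double[] p : cluster) {
                sumX += p[0];
                sumY += p[1];
            }
            double[] mean = {sumX / cluster.size(), sumY / cluster.size()};
            if (sqDist(mean, centroids[j]) > 0) changed = true;
            centroids[j] = mean;
        }
        return changed;
    }

    static double sqDist(double[] p, double[] q) {
        double dx = p[0] - q[0], dy = p[1] - q[1];
        return dx * dx + dy * dy;
    }
}
```

The caller would repeat kMeansPass() until it returns false, that is, until the centroids stop moving.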

This algorithm is implemented in Listing 8.7:

Its loadData() method is shown in the preceding Listing 8.2. Its other four methods are shown in Listing 8.8:

Output from a run of the program is shown in Figure 8.14.

The program loads the data into the points set at line 22. Then it selects a point at random and removes it from that set. At lines 29-31, it creates a new point set named initSet and adds that random point to it. Then, at lines 33-38, it repeats that process for K–1 more points, each one selected as the point farthest from those already in the initSet. This completes step 1 of the algorithm. Step 2 is implemented at lines 40-44, and step 3 at lines 46-51.
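
The farthest-point selection described here can be sketched as follows, interpreting "farthest from the initSet" as maximizing the minimum distance to the points already chosen; the method and variable names are hypothetical, not the book's:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class FarthestPointInit {
    // Step 1: pick one point at random, then repeatedly add the point whose
    // nearest distance to the points already chosen is largest.
    // Assumes points.size() >= k; the chosen points are removed from the set.
    static Set<double[]> initialPoints(Set<double[]> points, int k, Random random) {
        Set<double[]> initSet = new HashSet<>();
        double[] first = points.stream()
                .skip(random.nextInt(points.size()))
                .findFirst().get();
        points.remove(first);
        initSet.add(first);

        for (int i = 1; i < k; i++) {
            double[] farthest = null;
            double best = -1;
            for (double[] p : points) {
                double nearest = Double.POSITIVE_INFINITY;  // distance to initSet
                for (double[] q : initSet) {
                    nearest = Math.min(nearest, dist(p, q));
                }
                if (nearest > best) {
                    best = nearest;
                    farthest = p;
                }
            }
            points.remove(farthest);
            initSet.add(farthest);
        }
        return initSet;
    }

    static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }
}
```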

This implementation of step 1 begins by selecting a point at random. Consequently, different results are likely from separate runs of the program. Note that this output is quite different from the results that we got from hierarchical clustering in Figure 8.11 above.

The Apache Commons Math library implements this algorithm with its KMeansPlusPlusClusterer class, illustrated in Listing 8.9:
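
Listing 8.9 is not reproduced here, but a minimal use of that class, assuming the org.apache.commons.math3.ml.clustering package of Commons Math 3.x, looks roughly like this:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

public class CommonsKMeansSketch {
    public static void main(String[] args) {
        List<DoublePoint> points = new ArrayList<>();
        // A few of the points mentioned in the text; the book uses its full
        // 13-point dataset here.
        double[][] data = {{1, 1}, {1, 3}, {3, 2}, {3, 4}, {4, 3}, {7, 1}, {6, 4}, {7, 5}};
        for (double[] p : data) {
            points.add(new DoublePoint(p));
        }

        // k = 3 clusters, at most 100 iterations
        KMeansPlusPlusClusterer<DoublePoint> clusterer =
                new KMeansPlusPlusClusterer<>(3, 100);

        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);
        for (CentroidCluster<DoublePoint> cluster : clusters) {
            System.out.println(cluster.getCenter() + " : " + cluster.getPoints());
        }
    }
}
```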

The output is similar to what we got with our other clustering programs.

A different, more deterministic implementation of step 1 would be to apply hierarchical clustering first and then select the point in each cluster that is closest to its centroid. For our dataset, we can see from Figure 8.11 that this would give us the initial set {(3,2), (7,1), (6,4)} or {(3,2), (7,1), (7,5)}, since (6,4) and (7,5) tie for being closest to the centroid (6.2, 4.8).

The simplest version of K-means picks all k of the initial points at random. This runs faster than the other two methods described here, but the results are usually not as satisfactory. Weka implements this version with its SimpleKMeans class, illustrated in Listing 8.10:
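
Listing 8.10 is likewise not shown here; a minimal SimpleKMeans sketch, reusing the kind of Instances dataset built earlier, might look like this:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;

public class SimpleKMeansSketch {
    // Clusters the dataset with Weka's SimpleKMeans, which chooses the
    // initial centroids at random (subject to its seed).
    static void run(Instances dataset, int k) throws Exception {
        SimpleKMeans clusterer = new SimpleKMeans();
        clusterer.setNumClusters(k);
        clusterer.setSeed(42);                       // change the seed for a different run
        clusterer.buildClusterer(dataset);

        for (int i = 0; i < dataset.numInstances(); i++) {
            System.out.printf("%s -> cluster %d%n",
                    dataset.instance(i), clusterer.clusterInstance(dataset.instance(i)));
        }
        System.out.println(clusterer.getClusterCentroids());
    }
}
```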

The output is shown in Figure 8.15.

This program is very similar to that in Listing 8.5, where we applied the HierarchicalClusterer from the same weka.clusterers package. But the results seem less satisfactory. It clusters (7,1) with the four points above it, and it clusters (3,2) with (1,1) and (1,3), but not with (3,4), which is closer.

K-medoids clustering

The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points rather than the mean of its points. The idea is to minimize the average distance from each medoid to the points in its cluster; the Manhattan metric is usually used for these distances. Since those averages are minimal if and only if the corresponding sums are, the algorithm reduces to minimizing the sum of all distances from the points to their medoids. This sum is called the cost of the configuration.

Here is the algorithm:

  1. Select k of the data points to serve as the initial medoids y1, y2,..., yk.
  2. Assign each data point to the cluster of its nearest medoid.
  3. Repeat until the cost stops decreasing:
    • Replace a cluster's medoid with a point that lowers the total cost
    • Reassign each data point to the cluster of its nearest medoid
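
A compact plain-Java sketch of the cost computation and the medoid-replacement step is given below; it is an illustration, not the book's code, and it simply tries every data point as a candidate replacement for a given medoid. The cluster reassignment (the second part of step 3) is implicit, because cost() always measures each point against its nearest medoid:

```java
import java.util.Arrays;

public class PamSketch {
    // Total cost: sum of Manhattan distances from each point to its nearest medoid.
    static double cost(double[][] points, double[][] medoids) {
        double total = 0;
        for (double[] p : points) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] m : medoids) {
                nearest = Math.min(nearest, manhattan(p, m));
            }
            total += nearest;
        }
        return total;
    }

    // Step 3, first part: try replacing medoid j with each data point and
    // keep the replacement that lowers the total cost the most.
    static boolean improveMedoid(double[][] points, double[][] medoids, int j) {
        double best = cost(points, medoids);
        double[] original = medoids[j];
        double[] bestMedoid = original;
        for (double[] candidate : points) {
            medoids[j] = candidate;
            double c = cost(points, medoids);
            if (c < best) {
                best = c;
                bestMedoid = candidate;
            }
        }
        medoids[j] = bestMedoid;
        return !Arrays.equals(bestMedoid, original);   // true if the medoid changed
    }

    static double manhattan(double[] p, double[] q) {
        return Math.abs(p[0] - q[0]) + Math.abs(p[1] - q[1]);
    }
}
```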

This is illustrated by the simple example in Figure 8.16, which shows 10 data points partitioned into two clusters; the two medoids are drawn as filled points. The first panel shows the initial configuration, together with the cost sums for its two clusters.

At the first part of step 3, the algorithm changes the medoid of C1 to y1 = x3 = (3,2). At the second part of step 3, that change causes the cluster assignments, and therefore the cost sums, to change. The resulting configuration is shown in the second panel of Figure 8.16.

Step 3 then repeats the process for cluster C2. The resulting configuration, with its recomputed cost sums, is shown in the third panel of Figure 8.16.

The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of Figure 8.16.

This version of k-medoid clustering is also called partitioning around medoids (PAM).

Like k-means clustering, k-medoids clustering is ill-suited for large datasets. It does, however, overcome the problem of outliers that is evident in Figure 8.14.

Affinity propagation clustering

One disadvantage of each of the clustering algorithms presented so far (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely used clustering methods. (B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science 315, Feb 16, 2007, http://science.sciencemag.org/content/315/5814/972.)

Like k-medoid clustering, affinity propagation selects cluster center points, called exemplars, from the dataset to represent the clusters. This is done by message-passing between the data points.

The algorithm works with three two-dimensional arrays:

sij = the similarity between xi and xj

rik = responsibility: message from xi to xk on how well-suited xk is as an exemplar for xi

aik = availability: message from xk to xi on how well-suited xk is as an exemplar for xi

We think of the rik as messages from xi to xk, and the aik as messages from xk to xi. By repeatedly re-computing these, the algorithm maximizes the total similarity between the data points and their exemplars.

Figure 8.17 shows how the message-passing works. Data point xi sends the message rik to data point xk, by updating the value of the array element r[i][k]. That value represents how well-suited (from the view of xi) the candidate xk would be as an exemplar (representative) of xi. Later, xk sends the message aik to data point xi, by updating the value of the array element a[i][k]. That value represents how well-suited (from the view of xk) the candidate xk would be as an exemplar (representative) of xi. In both cases, the higher the value of the array element, the higher the level of suitability.

The algorithm begins by setting the similarity values to sij = –d(xi, xj)², for i ≠ j, where d() is the Euclidean metric. Squaring the distance simply eliminates the unnecessary step of computing square roots. Changing the sign ensures that sij > sik when xi is closer to xj than to xk; i.e., xi is more similar to xj than to xk. For example, in Figure 8.17, x1 = (2,4), x2 = (4,1), and x3 = (5,3). Clearly, x2 is closer to x3 than to x1, and s23 > s21, because s23 = –5 > –13 = s21.

We also set each sii to the average of the sij, for which i ≠ j. To reduce the number of clusters, that common value can be reduced to the minimum instead of the average of the others. The algorithm then repeatedly updates all the responsibilities rik and the availabilities aik.

In general, the suitability of a candidate xk being an exemplar of a point xi will be determined by the sum:

aik + rik

This sum measures the suitability of such representation from the view of xk (availability) combined with that from the view of xi (responsibility). When that sum converges to a maximum value, it will have determined that representation.

Conversely, the higher aij + sij is for some other candidate j ≠ k, the less suited xk is as an exemplar of the point xi. This leads to our update formula for rik:

rik ← sik – max{aij + sij : j ≠ k}

For xk to represent a data point xi, we want the two points to be similar (high sik), but we don't want any other xj to be a better representative (low aij + sij for j ≠ k).

Note that, initially, all the aij (and the rij) will be zero. So, on the first iteration:

rik ← sik – max{sij : j ≠ k}

That is, each responsibility value is set equal to the corresponding similarity value minus the similarity value of the closest competitor.

Each candidate exemplar xk measures its availability aik to represent another data point xi by adding to its own self-responsibility rkk the sum of the positive responsibilities rjk that it receives from the other points:

aik ← min{0, rkk + Σj∉{i,k} max{0, rjk}}

Note that the expression is thresholded by zero (by taking its minimum with 0), so that only non-positive values will be assigned to aik.

The self-availability akk measuring the confidence that xk has in representing itself is updated separately:

akk ← Σj≠k max{0, rjk}

This simply reflects the fact that this self-confidence is the accumulated positive confidence (responsibilities) that the other points have for xk.

Here is the complete algorithm:

  1. Initialize the similarities: sij = –d(xi, xj)² for i ≠ j, and set each sii to the average of the other sij. Initialize all the rik and aik to zero.
  2. Repeat until the exemplar assignments stabilize:
    • Update all the responsibilities rik
    • Update all the availabilities aik
  3. Identify the exemplars: a point xk will be an exemplar for a point xi if aik + rik = maxj {aij + rij}.
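
The book's Listings 8.11 through 8.16 are not reproduced here, but the following self-contained sketch implements the same update rules on the five-point dataset used below; the 0.5 damping factor follows the Frey-Dueck recommendation discussed later, and the mean (rather than the median) is used for the diagonal similarities, as in the book's implementation:

```java
public class AffinityPropagationSketch {
    static final double DAMPER = 0.5;        // damping factor recommended by Frey and Dueck

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {2, 3}, {4, 1}, {4, 4}, {5, 3}};
        int n = x.length;
        double[][] s = new double[n][n];     // similarities
        double[][] r = new double[n][n];     // responsibilities
        double[][] a = new double[n][n];     // availabilities

        // Step 1: s[i][j] = negative squared Euclidean distance; the diagonal
        // gets the mean of the off-diagonal similarities.
        double sum = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < i; j++) {
                double d2 = (x[i][0] - x[j][0]) * (x[i][0] - x[j][0])
                          + (x[i][1] - x[j][1]) * (x[i][1] - x[j][1]);
                s[i][j] = s[j][i] = -d2;
                sum += -d2;
            }
        }
        double average = sum / ((n * n - n) / 2);
        for (int i = 0; i < n; i++) {
            s[i][i] = average;
        }

        // Step 2: repeatedly update responsibilities and availabilities.
        for (int iteration = 0; iteration < 10; iteration++) {
            for (int i = 0; i < n; i++) {              // r[i][k] update
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int j = 0; j < n; j++) {
                        if (j != k) max = Math.max(max, a[i][j] + s[i][j]);
                    }
                    r[i][k] = DAMPER * r[i][k] + (1 - DAMPER) * (s[i][k] - max);
                }
            }
            for (int i = 0; i < n; i++) {              // a[i][k] update
                for (int k = 0; k < n; k++) {
                    double sumPos = 0;                 // positive r[j][k], j != i, k
                    for (int j = 0; j < n; j++) {
                        if (j != i && j != k) sumPos += Math.max(0, r[j][k]);
                    }
                    double value = (i == k) ? sumPos   // self-availability a[k][k]
                                            : Math.min(0, r[k][k] + sumPos);
                    a[i][k] = DAMPER * a[i][k] + (1 - DAMPER) * value;
                }
            }
        }

        // Step 3: point k is the exemplar for point i if a[i][k] + r[i][k] is maximal.
        for (int i = 0; i < n; i++) {
            int exemplar = 0;
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][exemplar] + r[i][exemplar]) exemplar = k;
            }
            System.out.printf("x%d -> exemplar x%d%n", i, exemplar);
        }
    }
}
```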

The algorithm is implemented in the program shown in Listing 8.11.

It is run on the small dataset {(1,2), (2,3), (4,1), (4,4), (5,3)}, shown in Figure 8.18.

As the output shows, the five points are organized into two clusters, with exemplars x1 = (2,3) and x4 = (5,3).

The main() method initializes the similarities array s[][] at line 18. Then the main loop repeatedly updates the responsibilities array r[][] and the availabilities array a[][] at lines 19-22. Finally, the results are printed at line 23.

The initSimilarities() method is shown in Listing 8.12.

It implements step 1 of the algorithm. At line 30, the negation of the square of the Euclidean distance between the two points xi and xj is assigned to both s[i][j] and s[j][i]. That value is computed by the auxiliary method negSqEuclidDist(), which is defined at lines 68-74 (Listing 8.16). The sum of those values is accumulated in the variable sum, which is then used at line 33 to compute their mean average. (In their original 2007 paper, Frey and Dueck recommend using the median average here. In our implementation, we use the mean average instead). That average value is then re-assigned to all the diagonal elements s[i][i] at line 35, as directed by step 1 of the algorithm.

The initial value that is assigned to the diagonal elements sii at line 35 can be adjusted to affect the number of exemplars (clusters) that are generated. In their paper, Frey and Dueck show that, with their sample dataset of 25 points, they can obtain a range of results, from one cluster to 25 clusters, by varying that initial value from –100 to –0.1. So, a common practice is to run the algorithm using the average value on that diagonal and then re-run it after adjusting that initial value to generate a different number of clusters.

Note that the assignment to sum at line 30 executes (n² – n)/2 times as i iterates from 0 to n – 1 and j iterates from 0 to i – 1 (for example, if n = 5, then there will be 10 assignments to sum). So, at line 33, the variable average is assigned the sum divided by (n² – n)/2. This is the average of all the elements that lie below the diagonal. Then that constant is assigned to each diagonal element at line 35.

Note that, because of the double assignment at line 30, the array s[][] is (as a matrix) symmetric about its diagonal. So, the constant, average, is also the average of all the elements above the diagonal.

The updateResponsibilities() method is shown in Listing 8.13:

It implements the first part of step 2 of the algorithm. At lines 43-48, the value of max{aij + sij : j ≠ k} is computed. That max value is then used to compute sik – max{aij + sij : j ≠ k} at line 49. That value, after damping, is assigned to rik at line 50.

The damping performed at line 50 and again at line 63 is recommended by Frey and Dueck to avoid numerical oscillations. They recommend a damping factor of 0.5, which is the value to which the DAMPER constant is initialized at line 15 in Listing 8.15.

The updateAvailabilities() method is shown in Listing 8.14. It implements the second part of step 2 of the algorithm. The availability update value, min{0, rkk + Σj∉{i,k} max{0, rjk}}, is computed at line 59. The sum in that expression is computed separately by the auxiliary method sumOfPos(), which is defined at lines 76-86 (Listing 8.16).

It is the sum of all the positive rjk, excluding rik and rkk. The element aik is then assigned that (damped) value at line 63. The diagonal elements akk are re-assigned the value of sumOfPos(k,k) at line 61, as required by the second part of step 2 of the algorithm.

The printResults() method is shown in Listing 8.15. It computes and prints the exemplar (cluster representative) for each point of the dataset. These are determined by the criterion specified in the algorithm: the point xk is the exemplar for the point xi if aik + rik = maxj {aij + rij}. That index k is computed for each i at lines 90-98 and then printed at line 99.

In their original 2007 paper, Frey and Dueck recommend iterating until the exemplar assignments remain unchanged for 10 iterations. In this implementation, with such a small dataset, we made a total of only 10 iterations.

In his 2009 Ph.D. thesis, D. Dueck mentions that "one run of k-medoids may be needed to resolve contradictory solutions." (Affinity Propagation: Clustering Data by Passing Messages, U. of Toronto, 2009.)