Hierarchical clustering

Of the several clustering algorithms that we will examine in this chapter, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space.

The general setup is that we have a dataset S of m points in Euclidean space, which we want to partition into a given number k of clusters C1, C2,..., Ck, where within each cluster the points are relatively close together.

Here is the algorithm:

  1. Create a singleton cluster for each of the m data points.
  2. Repeat m – k times:
    • Find the two clusters whose centroids are closest
    • Replace those two clusters with a new cluster that contains their points

The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point (5, 4), because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in Figure 8.6:
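
To make the arithmetic concrete, here is a minimal Java sketch (independent of the Point and Cluster classes in Listings 8.3 and 8.4) that computes the centroid of that example cluster:

```java
import java.util.Arrays;
import java.util.List;

public class CentroidDemo {
    public static void main(String[] args) {
        // The example cluster C = {(2,4), (3,5), (6,6), (9,1)}
        List<double[]> cluster = Arrays.asList(
                new double[]{2, 4}, new double[]{3, 5},
                new double[]{6, 6}, new double[]{9, 1});

        double sumX = 0, sumY = 0;
        for (double[] p : cluster) {
            sumX += p[0];
            sumY += p[1];
        }
        int m = cluster.size();
        // Prints: Centroid: (5.0, 4.0)
        System.out.printf("Centroid: (%.1f, %.1f)%n", sumX / m, sumY / m);
    }
}
```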

Figure 8.6: The centroid of a cluster

A Java implementation of this algorithm is shown in Listing 8.2, with partial output in Figure 8.7. It uses a Point class and a Cluster class, which are shown in Listing 8.3 and Listing 8.4, respectively.

The dataset is defined at lines 11-12 in Listing 8.2:

Listing 8.2: An implementation of hierarchical clustering

These 13 data points are loaded in the clusters set at line 17. Then the loop at lines 18-22 iterates m – k times, as specified in the algorithm.

A cluster, as defined by the Cluster class in Listing 8.4, contains two objects: a set of points and a centroid, which is a single point. The distance between two clusters is defined as the Euclidean distance between their centroids. Note that, for simplicity, some of the code in Listing 8.4 has been folded.
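
To give a feel for that structure, here is a rough, hypothetical sketch of such a cluster class; it is not the book's Cluster class from Listing 8.4, and the names (SimpleCluster, recompute()) are my own:

```java
import java.util.HashSet;
import java.util.Set;

// A minimal stand-in for the idea in Listing 8.4: a set of points
// together with their centroid.
class SimpleCluster {
    private final Set<double[]> points = new HashSet<>();
    private double[] centroid;               // recomputed whenever the points change

    SimpleCluster(double... point) {
        points.add(point);
        centroid = point.clone();
    }

    void addAll(SimpleCluster that) {        // union of two clusters
        points.addAll(that.points);
        recompute();
    }

    private void recompute() {               // centroid = coordinate-wise mean
        double sumX = 0, sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        centroid = new double[]{sumX / points.size(), sumY / points.size()};
    }

    // Distance between clusters = Euclidean distance between their centroids
    double distanceTo(SimpleCluster that) {
        double dx = centroid[0] - that.centroid[0];
        double dy = centroid[1] - that.centroid[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    Set<double[]> getPoints() { return points; }
    double[] getCentroid()    { return centroid; }
}
```

Storing the centroid in a field and recomputing it only when the cluster changes avoids recomputing it for every distance calculation.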

The program uses the HashSet<Cluster> class to implement the set of clusters. That is why the Cluster class overrides the hashCode() and equals() methods (at lines 52-68 in Listing 8.4). That, in turn, requires the Point class to override its corresponding methods (at lines 27-43 in Listing 8.3).

Note that the Point class defines private fields xb and yb of type long. These hold the 64-bit representations of the double values of x and y, providing a more reliable way to determine when they are equal.

The output shown in Figure 8.7 is generated by the code at lines 19 and 21 of the program. The call to println() at line 21 implicitly invokes the overridden toString() method at lines 70-74 of the Cluster class.

The coalesce() method at lines 33-49 implements the two parts of step 2 of the algorithm. The double loop at lines 36-44 finds the two clusters that are closest to each other (step 2 first part). These are removed from the clusters set and their union is added to it at lines 45-47 (step 2, second part).
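
In the same spirit, the coalescing step might look like the following sketch, written against the hypothetical SimpleCluster class above; the book's actual coalesce() method in Listing 8.2 differs in its details:

```java
import java.util.Set;

class Coalescer {
    // Find the two closest clusters, remove them, and add their union
    // (step 2 of the algorithm).
    static void coalesce(Set<SimpleCluster> clusters) {
        SimpleCluster bestA = null, bestB = null;
        double minDist = Double.POSITIVE_INFINITY;
        for (SimpleCluster a : clusters) {          // double loop over all
            for (SimpleCluster b : clusters) {      // pairs of clusters
                if (a != b && a.distanceTo(b) < minDist) {
                    minDist = a.distanceTo(b);
                    bestA = a;
                    bestB = b;
                }
            }
        }
        if (bestA == null) return;                  // fewer than two clusters
        clusters.remove(bestA);
        clusters.remove(bestB);
        bestA.addAll(bestB);                        // union of the two clusters
        clusters.add(bestA);
    }
}
```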

The output in Figure 8.7 shows the results of two iterations of the main loop: six clusters coalescing into five, and then into four.

Figure 8.7: Partial output from hierarchical clustering program

These three stages are illustrated in Figure 8.8, Figure 8.9, and Figure 8.10.

Figure 8.8: Output with six clusters

Figure 8.9: Output with five clusters

Figure 8.10: Output with four clusters

Among the six clusters shown in Figure 8.8, you can see that the two closest are the ones whose centroids are (6.00, 3.50) and (6.33, 5.67). They get replaced by their union, the five-element cluster whose centroid is (6.20, 4.80).

Figure 8.11: Output with three clusters

We can set the number K to any value from 1 to M (at line 14). Although setting K = 1 has no practical value, it can be used to generate the dendrogram for the original dataset, shown in Figure 8.12. It graphically displays the hierarchical structure of the entire clustering process, identifying each coalesce step.

Note that the dendrogram is not a complete transcript of the process. It shows, for example, that (3,4) and (4,3) get united before they unite with (3,2), but it does not show whether (1,5) and (2,6) get united before or after (1,1) and (1,3) do.

Although easy to understand and implement, hierarchical clustering is not very efficient. The basic version, shown in Listing 8.2, runs in O(n³) time, which means that the number of operations performed is roughly proportional to the cube of the number of points in the dataset. So, if you double the number of points, the program will take about eight times as long to run. That is not practical for large datasets.

You can see where the O(n³) comes from by looking at the code in Listing 8.2. There, n = 13. The main loop (lines 18-22) iterates nearly n times. Each iteration calls the coalesce() method, which contains a double loop (lines 36-44) over the current clusters; if there are c clusters, each of the two loops iterates c times. The number of clusters decreases from n down to k, averaging about n/2, so each call to the coalesce() method performs about (n/2)² operations, which is proportional to n². Being called nearly n times gives us the O(n³) runtime.

Listing 8.3: A Point class

This kind of complexity analysis is standard for computer algorithms. The main idea is to find some simple function f(n) that classifies the algorithm's running time this way. The O(n³) classification means slow: the running time grows in proportion to the cube of the size of the input. (The letter O stands for "order of", so O(n³) means "on the order of n³".)

We can improve the runtime classification of the hierarchical clustering algorithm by using a more elaborate data structure. The idea is to maintain a priority queue (a balanced binary tree structure) of objects, where each object consists of a pair of points and the distance between them. Objects can be inserted into and removed from a priority queue in O(log n) time. Consequently, the whole algorithm can be run in O(n² log n) time, which is almost as good as O(n²).
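
For instance, java.util.PriorityQueue (a binary heap) can keep such pair objects ordered by distance, as in this small sketch; the PairEntry class is hypothetical, not part of the book's listings:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A hypothetical pair-of-points entry, ordered by the distance between them.
class PairEntry {
    final double x1, y1, x2, y2, distance;

    PairEntry(double x1, double y1, double x2, double y2) {
        this.x1 = x1; this.y1 = y1;
        this.x2 = x2; this.y2 = y2;
        this.distance = Math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
    }
}

class PriorityQueueDemo {
    public static void main(String[] args) {
        // The entry with the smallest distance is always at the head of the
        // queue; offer() and poll() each run in O(log n) time.
        PriorityQueue<PairEntry> queue =
                new PriorityQueue<>(Comparator.comparingDouble((PairEntry e) -> e.distance));

        queue.offer(new PairEntry(2, 4, 3, 5));
        queue.offer(new PairEntry(6, 6, 9, 1));

        PairEntry closest = queue.poll();    // the closest pair seen so far
        System.out.println("Closest pair distance: " + closest.distance);
    }
}
```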

This improvement is a good example of the classic trade-off between speed and simplicity in computing: we can often make an algorithm faster (more efficient) at the cost of making it more complicated. Of course, the same thing can be said about cars and airplanes.

The Point class in Listing 8.3 encapsulates the idea of a two-dimensional point in Euclidean space. The hashCode() and equals() methods must be included (overriding the default versions defined in the Object class) because we intend to use this class as the element type in a HashSet (in Listing 8.4).

The code at lines 26-29 defines a typical implementation; the expression new Double(x).hashCode() at line 26 returns the hashCode of the Double object that represents the value of x.

The code at lines 33-42 similarly defines a typical implementation of an equals() method. The first statement (lines 33-39) checks whether the explicit object is null, equals the implicit object (this), and is itself an instance of the Point class, taking the appropriate action in each case. If it passes those three tests, then it is cast as a Point object at line 40 so that we can access its x and y fields.

To check whether they match the corresponding double fields of the implicit object, we use an auxiliary bits() method, which simply returns a long integer containing all the bits that represent the specified double value.
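
The following sketch shows a typical Point class written along those lines. It is not a reproduction of Listing 8.3; in particular, it assumes that the book's bits() method behaves like Double.doubleToLongBits(), and the hashCode() formula is just one reasonable choice:

```java
import java.util.Objects;

// A sketch of a two-dimensional point with value-based equality, suitable
// for use as the element type of a HashSet.
public class Point2D {
    private final double x, y;

    public Point2D(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Returns the 64-bit representation of a double value.
    private static long bits(double value) {
        return Double.doubleToLongBits(value);
    }

    @Override
    public int hashCode() {
        return Objects.hash(bits(x), bits(y));
    }

    @Override
    public boolean equals(Object object) {
        if (object == this) {
            return true;                      // same object
        }
        if (!(object instanceof Point2D)) {
            return false;                     // null or wrong type
        }
        Point2D that = (Point2D) object;      // cast so we can access x and y
        return bits(this.x) == bits(that.x) && bits(this.y) == bits(that.y);
    }
}
```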

Listing 8.4: A Cluster class

The program in Listing 8.5, which uses the Weka library's HierarchicalClusterer class, is equivalent to the one in Listing 8.2:

You can see that the results are the same as illustrated in Figure 8.11: The first seven points are in cluster number 0, and all the others except (7,1) are in cluster number 1.

The load() method at lines 40-52 uses an ArrayList to specify the two attributes, x and y. Then it creates the dataset as an Instances object at line 44, loads the 13 data points as Instance objects in the loop at lines 45-50, and returns the dataset to the call at line 24.

The code at lines 25-26 specifies that the centroids of the clusters are to be used for computing the distances between the clusters.

The algorithm itself is run by the buildClusterer() method at line 29. Then, the loop at lines 30-34 prints the results. The clusterInstance() method returns the number of the cluster to which the specified instance belongs.
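
Listing 8.5 itself is not reproduced here, but a sketch along the lines just described might look like the following. It assumes Weka's Attribute, DenseInstance, Instances, and HierarchicalClusterer classes (Weka 3.7 or later), and it assumes that the "-L CENTROID" option is what selects centroid-based distances at lines 25-26; only a few of the 13 data points are shown:

```java
import java.util.ArrayList;

import weka.clusterers.HierarchicalClusterer;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class WekaHierarchicalSketch {
    public static void main(String[] args) throws Exception {
        Instances dataset = load();

        HierarchicalClusterer clusterer = new HierarchicalClusterer();
        // Assumed options: 3 clusters, centroid link type (distances measured
        // between cluster centroids).
        clusterer.setOptions(new String[]{"-N", "3", "-L", "CENTROID"});

        clusterer.buildClusterer(dataset);                 // run the algorithm

        for (int i = 0; i < dataset.numInstances(); i++) { // print the results
            System.out.printf("%s -> cluster %d%n",
                    dataset.instance(i), clusterer.clusterInstance(dataset.instance(i)));
        }
    }

    // Builds the dataset as an Instances object with two numeric attributes.
    private static Instances load() {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("x"));
        attributes.add(new Attribute("y"));
        Instances dataset = new Instances("points", attributes, 13);
        // A few of the points mentioned in the text; the book's dataset has 13.
        double[][] data = {{1, 1}, {1, 3}, {3, 2}, {3, 4}, {4, 3}, {7, 1}, {6, 4}, {7, 5}};
        for (double[] p : data) {
            dataset.add(new DenseInstance(1.0, p));
        }
        return dataset;
    }
}
```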

The dendrogram shown in Figure 8.12 can also be generated programmatically, by making three changes to the program in Listing 8.5.

The result is shown in Figure 8.13:

This is topologically the same as the tree in Figure 8.12.

The key code here is at line 65. The HierarchyVisualizer() constructor creates, from the clusterer's graph string, an object that can be displayed by adding it to the frame's content pane.
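
One plausible way to wire up that display is sketched below. It assumes Weka's HierarchyVisualizer class (in the weka.gui.hierarchyvisualizer package) and the clusterer's graph() method; the frame setup is mine, not the book's:

```java
import javax.swing.JFrame;

import weka.clusterers.HierarchicalClusterer;
import weka.gui.hierarchyvisualizer.HierarchyVisualizer;

public class DendrogramSketch {
    // Displays the clusterer's dendrogram in a Swing frame.  The clusterer
    // must already have been built (buildClusterer) on the dataset.
    public static void show(HierarchicalClusterer clusterer) throws Exception {
        String graph = clusterer.graph();    // Newick-format description of the tree

        JFrame frame = new JFrame("Dendrogram");
        frame.setSize(600, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        // Adding the visualizer to the frame's content pane displays the tree.
        frame.getContentPane().add(new HierarchyVisualizer(graph));
        frame.setVisible(true);
    }
}
```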

K-means clustering

A popular alternative to hierarchical clustering is the K-means algorithm. It is related to the K-Nearest Neighbor (KNN) classification algorithm that we saw in Chapter 7, Classification Analysis.

As with hierarchical clustering, the K-means clustering algorithm requires the number of clusters, k, as input. (This version is also called the K-Means++ algorithm.) It also requires k points, one for each cluster, to initialize the algorithm. These initial points can be selected at random, or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of the resulting clusters. Here is the algorithm:

  1. Select k points, one for each cluster, to serve as the initial centroids.
  2. Assign each data point to the cluster whose centroid is closest to it.
  3. Recompute each cluster's centroid, and repeat steps 2-3 until the clusters no longer change.
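
As a concrete illustration of steps 2 and 3, here is a minimal, self-contained sketch of one K-means pass in plain Java; it is not the book's Listing 8.7, and the method names are my own:

```java
import java.util.ArrayList;
import java.util.List;

public class KMeansSketch {
    // One pass of steps 2 and 3: assign each point to its nearest centroid,
    // then recompute each centroid as the mean of its cluster's points.
    // Returns true if any centroid moved, so the caller knows to iterate again.
    static boolean kMeansPass(double[][] points, double[][] centroids) {
        int k = centroids.length;
        List<List<double[]>> clusters = new ArrayList<>();
        for (int j = 0; j < k; j++) {
            clusters.add(new ArrayList<>());
        }

        for (double[] p : points) {                    // step 2: assignment
            int nearest = 0;
            for (int j = 1; j < k; j++) {
                if (sqDist(p, centroids[j]) < sqDist(p, centroids[nearest])) {
                    nearest = j;
                }
            }
            clusters.get(nearest).add(p);
        }

        boolean changed = false;
        for (int j = 0; j < k; j++) {                  // step 3: recompute centroids
            List<double[]> cluster = clusters.get(j);
            if (cluster.isEmpty()) continue;           // keep an empty cluster's centroid
            double sumX = 0, sumY = 0;
            for (double[] p : cluster) {
                sumX += p[0];
                sumY += p[1];
            }
            double[] mean = {sumX / cluster.size(), sumY / cluster.size()};
            if (sqDist(mean, centroids[j]) > 0) changed = true;
            centroids[j] = mean;
        }
        return changed;
    }

    static double sqDist(double[] p, double[] q) {
        double dx = p[0] - q[0], dy = p[1] - q[1];
        return dx * dx + dy * dy;
    }
}
```

The caller would repeat kMeansPass() until it returns false, that is, until the centroids stop moving.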

This algorithm is implemented in Listing 8.7:

Its loadData() method is shown in the preceding Listing 8.2. Its other four methods are shown in Listing 8.8:

Output from a run of the program is shown in Figure 8.14.

The program loads the data into the points set at line 22. Then it selects a point at random and removes it from that set. At lines 29-31, it creates a new point set named initSet and adds that random point to it. Then, at lines 33-38, it repeats that process for K–1 more points, each one selected as the point farthest from those already in the initSet. This completes step 1 of the algorithm. Step 2 is implemented at lines 40-44, and step 3 at lines 46-51.
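
The farthest-point selection described here can be sketched as follows, interpreting "farthest from the initSet" as maximizing the minimum distance to the points already chosen; the method and variable names are hypothetical, not the book's:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class FarthestPointInit {
    // Step 1: pick one point at random, then repeatedly add the point whose
    // nearest distance to the points already chosen is largest.
    // Assumes points.size() >= k; the chosen points are removed from the set.
    static Set<double[]> initialPoints(Set<double[]> points, int k, Random random) {
        Set<double[]> initSet = new HashSet<>();
        double[] first = points.stream()
                .skip(random.nextInt(points.size()))
                .findFirst().get();
        points.remove(first);
        initSet.add(first);

        for (int i = 1; i < k; i++) {
            double[] farthest = null;
            double best = -1;
            for (double[] p : points) {
                double nearest = Double.POSITIVE_INFINITY;  // distance to initSet
                for (double[] q : initSet) {
                    nearest = Math.min(nearest, dist(p, q));
                }
                if (nearest > best) {
                    best = nearest;
                    farthest = p;
                }
            }
            points.remove(farthest);
            initSet.add(farthest);
        }
        return initSet;
    }

    static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }
}
```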

This implementation of step 1 begins by selecting a point at random. Consequently, different results are likely from separate runs of the program. Note that this output is quite different from the results that we got from hierarchical clustering in Figure 8.11 above.

The Apache Commons Math library implements this algorithm with its KMeansPlusPlusClusterer class, illustrated in Listing 8.9:
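
Listing 8.9 is not reproduced here, but a minimal use of that class, assuming the org.apache.commons.math3.ml.clustering package of Commons Math 3.x, looks roughly like this:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

public class CommonsKMeansSketch {
    public static void main(String[] args) {
        List<DoublePoint> points = new ArrayList<>();
        // A few of the points mentioned in the text; the book uses its full
        // 13-point dataset here.
        double[][] data = {{1, 1}, {1, 3}, {3, 2}, {3, 4}, {4, 3}, {7, 1}, {6, 4}, {7, 5}};
        for (double[] p : data) {
            points.add(new DoublePoint(p));
        }

        // k = 3 clusters, at most 100 iterations
        KMeansPlusPlusClusterer<DoublePoint> clusterer =
                new KMeansPlusPlusClusterer<>(3, 100);

        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);
        for (CentroidCluster<DoublePoint> cluster : clusters) {
            System.out.println(cluster.getCenter() + " : " + cluster.getPoints());
        }
    }
}
```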

The output is similar to what we got with our other clustering programs.

A different, more deterministic implementation of step 1 would be to apply hierarchical clustering first and then select the point in each cluster that is closest to its centroid. For our dataset, we can see from Figure 8.11 that this would give us the initial set {(3,2), (7,1), (6,4)} or {(3,2), (7,1), (7,5)}, since (6,4) and (7,5) tie for being closest to the centroid (6.2, 4.8).

The simplest version of K-means picks all k of the initial points at random. This runs faster than the other two methods described here, but the results are usually not as satisfactory. Weka implements this version with its SimpleKMeans class, illustrated in Listing 8.10:
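
Listing 8.10 is likewise not shown here; a minimal SimpleKMeans sketch, reusing the kind of Instances dataset built earlier, might look like this:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;

public class SimpleKMeansSketch {
    // Clusters the dataset with Weka's SimpleKMeans, which chooses the
    // initial centroids at random (subject to its seed).
    static void run(Instances dataset, int k) throws Exception {
        SimpleKMeans clusterer = new SimpleKMeans();
        clusterer.setNumClusters(k);
        clusterer.setSeed(42);                       // change the seed for a different run
        clusterer.buildClusterer(dataset);

        for (int i = 0; i < dataset.numInstances(); i++) {
            System.out.printf("%s -> cluster %d%n",
                    dataset.instance(i), clusterer.clusterInstance(dataset.instance(i)));
        }
        System.out.println(clusterer.getClusterCentroids());
    }
}
```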

The output is shown in Figure 8.15.

This program is very similar to that in Listing 8.5, where we applied the HierarchicalClusterer from the same weka.clusterers package. But the results seem less satisfactory. It clusters (7,1) with the four points above it, and it clusters (3,2) with (1,1) and (1,3), but not with (3,4), which is closer.

K-medoids clustering

The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points rather than the mean of its points. The idea is to minimize the average distance from each medoid to the points in its cluster; the Manhattan metric is usually used for these distances. Since those averages are minimal if and only if the corresponding sums are, the algorithm reduces to minimizing the sum of all distances from the points to their medoids. This sum is called the cost of the configuration.

Here is the algorithm:

  1. Select k of the data points to serve as the initial medoids y1, y2,..., yk.
  2. Assign each data point to the cluster of its nearest medoid.
  3. Repeat until the cost stops decreasing:
    • Replace a cluster's medoid with a point that lowers the total cost
    • Reassign each data point to the cluster of its nearest medoid
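
A compact plain-Java sketch of the cost computation and the medoid-replacement step is given below; it is an illustration, not the book's code, and it simply tries every data point as a candidate replacement for a given medoid. The cluster reassignment (the second part of step 3) is implicit, because cost() always measures each point against its nearest medoid:

```java
import java.util.Arrays;

public class PamSketch {
    // Total cost: sum of Manhattan distances from each point to its nearest medoid.
    static double cost(double[][] points, double[][] medoids) {
        double total = 0;
        for (double[] p : points) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] m : medoids) {
                nearest = Math.min(nearest, manhattan(p, m));
            }
            total += nearest;
        }
        return total;
    }

    // Step 3, first part: try replacing medoid j with each data point and
    // keep the replacement that lowers the total cost the most.
    static boolean improveMedoid(double[][] points, double[][] medoids, int j) {
        double best = cost(points, medoids);
        double[] original = medoids[j];
        double[] bestMedoid = original;
        for (double[] candidate : points) {
            medoids[j] = candidate;
            double c = cost(points, medoids);
            if (c < best) {
                best = c;
                bestMedoid = candidate;
            }
        }
        medoids[j] = bestMedoid;
        return !Arrays.equals(bestMedoid, original);   // true if the medoid changed
    }

    static double manhattan(double[] p, double[] q) {
        return Math.abs(p[0] - q[0]) + Math.abs(p[1] - q[1]);
    }
}
```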

This is illustrated by the simple example in Figure 8.16, which shows 10 data points partitioned into two clusters; the two medoids are drawn as filled points. The first panel shows the initial configuration, together with the cost sums for its two clusters.

At the first part of step 3, the algorithm changes the medoid of C1 to y1 = x3 = (3,2). At the second part of step 3, that change causes the cluster assignments, and therefore the cost sums, to change. The resulting configuration is shown in the second panel of Figure 8.16.

Step 3 then repeats the process for cluster C2. The resulting configuration, with its recomputed cost sums, is shown in the third panel of Figure 8.16.

The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of Figure 8.16.

This version of k-medoid clustering is also called partitioning around medoids (PAM).

Like k-means clustering, k-medoids clustering is ill-suited for large datasets. It does, however, overcome the problem of outliers that is evident in Figure 8.14.

Affinity propagation clustering

One disadvantage of each of the clustering algorithms presented so far (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely used clustering methods. (B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science 315, Feb 16, 2007, http://science.sciencemag.org/content/315/5814/972.)

Like k-medoid clustering, affinity propagation selects cluster center points, called exemplars, from the dataset to represent the clusters. This is done by message-passing between the data points.

The algorithm works with three two-dimensional arrays:

sij = the similarity between xi and xj

rik = responsibility: message from xi to xk on how well-suited xk is as an exemplar for xi

aik = availability: message from xk to xi on how well-suited xk is as an exemplar for xi

We think of the rik as messages from xi to xk, and the aik as messages from xk to xi. By repeatedly re-computing these, the algorithm maximizes the total similarity between the data points and their exemplars.

Figure 8.17 shows how the message-passing works. Data point xi sends the message rik to data point xk, by updating the value of the array element r[i][k]. That value represents how well-suited (from the view of xi) the candidate xk would be as an exemplar (representative) of xi. Later, xk sends the message aik to data point xi, by updating the value of the array element a[i][k]. That value represents how well-suited (from the view of xk) the candidate xk would be as an exemplar (representative) of xi. In both cases, the higher the value of the array element, the higher the level of suitability.

The algorithm begins by setting the similarity values to sij = –d(xi, xj)², for i ≠ j, where d() is the Euclidean metric. Squaring the distance simply eliminates the unnecessary step of computing square roots. Changing the sign ensures that sij > sik when xi is closer to xj than to xk; i.e., xi is more similar to xj than to xk. For example, in Figure 8.17, x1 = (2,4), x2 = (4,1), and x3 = (5,3). Clearly, x2 is closer to x3 than to x1, and s23 > s21, because s23 = –5 > –13 = s21.

We also set each sii to the average of the sij, for which i ≠ j. To reduce the number of clusters, that common value can be reduced to the minimum instead of the average of the others. The algorithm then repeatedly updates all the responsibilities rik and the availabilities aik.

In general, the suitability of a candidate xk being an exemplar of a point xi will be determined by the sum:

aik + rik

This sum measures the suitability of such representation from the view of xk (availability) combined with that from the view of xi (responsibility). When that sum converges to a maximum value, it will have determined that representation.

Conversely, the higher aij + sij is for some other candidate j ≠ k, the less suited xk is as an exemplar of the point xi. This leads to our update formula for rik:

rik ← sik – max{aij + sij : j ≠ k}

For xk to represent a data point xi, we want the two points to be similar (high sik), but we don't want any other xj to be a better representative (low aij + sij for j ≠ k).

Note that, initially, all the aij (and the rij) will be zero. So, on the first iteration:

rik ← sik – max{sij : j ≠ k}

That is, each responsibility value is set equal to the corresponding similarity value minus the similarity value of the closest competitor.

Each candidate exemplar xk measures its availability aik to represent another data point xi by adding to its own self-responsibility rkk the sum of the positive responsibilities rjk that it receives from the other points:

aik ← min{0, rkk + Σj∉{i,k} max{0, rjk}}

Note that the expression is thresholded by zero (by taking its minimum with 0), so that only non-positive values will be assigned to aik.

The self-availability akk measuring the confidence that xk has in representing itself is updated separately:

akk ← Σj≠k max{0, rjk}

This simply reflects the fact that this self-confidence is the accumulated positive confidence (responsibilities) that the other points have for xk.

Here is the complete algorithm:

  1. Initialize the similarities: sij = –d(xi, xj)² for i ≠ j, and set each sii to the average of the other sij. Initialize all the rik and aik to zero.
  2. Repeat until the exemplar assignments stabilize:
    • Update all the responsibilities rik
    • Update all the availabilities aik
  3. Identify the exemplars: a point xk will be an exemplar for a point xi if aik + rik = maxj {aij + rij}.
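
The book's Listings 8.11 through 8.16 are not reproduced here, but the following self-contained sketch implements the same update rules on the five-point dataset used below; the 0.5 damping factor follows the Frey-Dueck recommendation discussed later, and the mean (rather than the median) is used for the diagonal similarities, as in the book's implementation:

```java
public class AffinityPropagationSketch {
    static final double DAMPER = 0.5;        // damping factor recommended by Frey and Dueck

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {2, 3}, {4, 1}, {4, 4}, {5, 3}};
        int n = x.length;
        double[][] s = new double[n][n];     // similarities
        double[][] r = new double[n][n];     // responsibilities
        double[][] a = new double[n][n];     // availabilities

        // Step 1: s[i][j] = negative squared Euclidean distance; the diagonal
        // gets the mean of the off-diagonal similarities.
        double sum = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < i; j++) {
                double d2 = (x[i][0] - x[j][0]) * (x[i][0] - x[j][0])
                          + (x[i][1] - x[j][1]) * (x[i][1] - x[j][1]);
                s[i][j] = s[j][i] = -d2;
                sum += -d2;
            }
        }
        double average = sum / ((n * n - n) / 2);
        for (int i = 0; i < n; i++) {
            s[i][i] = average;
        }

        // Step 2: repeatedly update responsibilities and availabilities.
        for (int iteration = 0; iteration < 10; iteration++) {
            for (int i = 0; i < n; i++) {              // r[i][k] update
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int j = 0; j < n; j++) {
                        if (j != k) max = Math.max(max, a[i][j] + s[i][j]);
                    }
                    r[i][k] = DAMPER * r[i][k] + (1 - DAMPER) * (s[i][k] - max);
                }
            }
            for (int i = 0; i < n; i++) {              // a[i][k] update
                for (int k = 0; k < n; k++) {
                    double sumPos = 0;                 // positive r[j][k], j != i, k
                    for (int j = 0; j < n; j++) {
                        if (j != i && j != k) sumPos += Math.max(0, r[j][k]);
                    }
                    double value = (i == k) ? sumPos   // self-availability a[k][k]
                                            : Math.min(0, r[k][k] + sumPos);
                    a[i][k] = DAMPER * a[i][k] + (1 - DAMPER) * value;
                }
            }
        }

        // Step 3: point k is the exemplar for point i if a[i][k] + r[i][k] is maximal.
        for (int i = 0; i < n; i++) {
            int exemplar = 0;
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][exemplar] + r[i][exemplar]) exemplar = k;
            }
            System.out.printf("x%d -> exemplar x%d%n", i, exemplar);
        }
    }
}
```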

The algorithm is implemented in the program shown in Listing 8.11.

It is run on the small dataset {(1,2), (2,3), (4,1), (4,4), (5,3)}, shown in Figure 8.18.

As the output shows, the five points are organized into two clusters, with exemplars x1 = (2,3) and x4 = (5,3).

The main() method initializes the similarities array s[][] at line 18. Then the main loop repeatedly updates the responsibilities array r[][] and the availabilities array a[][] at lines 19-22. Finally, the results are printed at line 23.

The initSimilarities() method is shown in Listing 8.12.

It implements step 1 of the algorithm. At line 30, the negation of the square of the Euclidean distance between the two points xi and xj is assigned to both s[i][j] and s[j][i]. That value is computed by the auxiliary method negSqEuclidDist(), which is defined at lines 68-74 (Listing 8.16). The sum of those values is accumulated in the variable sum, which is then used at line 33 to compute their mean average. (In their original 2007 paper, Frey and Dueck recommend using the median average here. In our implementation, we use the mean average instead). That average value is then re-assigned to all the diagonal elements s[i][i] at line 35, as directed by step 1 of the algorithm.

The initial value that is assigned to the diagonal elements sii at line 35 can be adjusted to affect the number of exemplars (clusters) that are generated. In their paper, Frey and Dueck show that, with their sample dataset of 25 points, they can obtain a range of results, from one cluster to 25 clusters, by varying that initial value from –100 to –0.1. So, a common practice is to run the algorithm using the average value on that diagonal and then re-run it after adjusting that initial value to generate a different number of clusters.

Note that the assignment to sum at line 30 executes (n² – n)/2 times as i iterates from 0 to n – 1 and j iterates from 0 to i – 1 (for example, if n = 5, then there will be 10 assignments to sum). So, at line 33, the variable average is assigned the sum divided by (n² – n)/2. This is the average of all the elements that lie below the diagonal. Then that constant is assigned to each diagonal element at line 35.

Note that, because of the double assignment at line 30, the array s[][] is (as a matrix) symmetric about its diagonal. So, the constant, average, is also the average of all the elements above the diagonal.

The updateResponsibilities() method is shown in Listing 8.13:

It implements the first part of step 2 of the algorithm. At lines 43-48, the value of max{aij + sij : j ≠ k} is computed. That max value is then used to compute sik – max{aij + sij : j ≠ k} at line 49. That value, after damping, is assigned to rik at line 50.

The damping performed at line 50 and again at line 63 is recommended by Frey and Dueck to avoid numerical oscillations. They recommend a damping factor of 0.5, which is the value to which the DAMPER constant is initialized at line 15 in Listing 8.15.

The updateAvailabilities() method is shown in Listing 8.14. It implements the second part of step 2 of the algorithm. The availability update value, min{0, rkk + Σj∉{i,k} max{0, rjk}}, is computed at line 59. The sum in that expression is computed separately by the auxiliary method sumOfPos(), which is defined at lines 76-86 (Listing 8.16).

It is the sum of all the positive rjk, excluding rik and rkk. The element aik is then assigned that (damped) value at line 63. The diagonal elements akk are re-assigned the value of sumOfPos(k,k) at line 61, as required by the second part of step 2 of the algorithm.

The printResults() method is shown in Listing 8.15. It computes and prints the exemplar (cluster representative) for each point of the dataset. These are determined by the criterion specified in the algorithm: the point xk is the exemplar for the point xi if aik + rik = maxj {aij + rij}. That index k is computed for each i at lines 90-98 and then printed at line 99.

In their original 2007 paper, Frey and Dueck recommend iterating until the exemplar assignments remain unchanged for 10 iterations. In this implementation, with such a small dataset, we made a total of only 10 iterations.

In his 2009 Ph.D. thesis, D. Dueck mentions that "one run of k-medoids may be needed to resolve contradictory solutions." (Affinity Propagation: Clustering Data by Passing Messages, U. of Toronto, 2009.)