K-means clustering

With k-means, we need to specify the exact number of clusters that we want up front. The algorithm then iterates until each observation belongs to exactly one of the k clusters. Its goal is to minimize the total within-cluster variation, as measured by squared Euclidean distance. The variation for the kth cluster is the sum of the squared Euclidean distances between all pairs of observations in that cluster, divided by the number of observations in the cluster.
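Written out, the definition above looks as follows. The notation is an assumption introduced here for illustration: C_k is the set of observations in cluster k, |C_k| is its size, p is the number of input variables, and x_ij is the value of variable j for observation i.

```latex
W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2
       = 2 \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2,
\qquad
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} W(C_k)
```

The second form, where \bar{x}_{kj} is the mean of variable j within cluster k, shows why cluster means matter: the pairwise definition is equivalent to twice the sum of squared distances from each observation to its cluster mean.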

Because the algorithm starts from a random assignment, one k-means result can differ greatly from another, even if you specify the same number of clusters. Let's see how this plays out when we specify three clusters:

1. Each observation is randomly assigned by the algorithm to one of the three clusters.
2. The algorithm calculates a centroid for each cluster, which is a vector of the variable means for the observations in that cluster. For example, if you have five input variables, each centroid would be a vector of five values.
3. Each observation is reassigned to the cluster whose centroid is closest to it in Euclidean distance.
4. Steps 2 and 3 are repeated until the within-cluster variation stops improving, that is, until the cluster assignments no longer change.
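The four steps above can be sketched in a few lines of base R. This is a minimal illustration only: `simple_kmeans` and its arguments are hypothetical names introduced here, the simulated data is made up for the demonstration, and this is not the algorithm that R's built-in `kmeans()` function uses.

```r
# Hypothetical helper for illustration; not the code behind stats::kmeans()
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k <= n)

  # Step 1: random assignment; rep_len() ensures no cluster starts empty
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(0, nrow = k, ncol = ncol(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: each centroid is the vector of variable means for its members
    for (j in seq_len(k)) {
      members <- cluster == j
      # If a cluster becomes empty, its previous centroid is kept
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }

    # Step 3: squared Euclidean distance from every observation to every
    # centroid, then reassign each observation to the nearest one
    d <- matrix(
      vapply(seq_len(k),
             function(j) rowSums(sweep(x, 2, centers[j, ])^2),
             numeric(n)),
      nrow = n
    )
    new_cluster <- max.col(-d, ties.method = "first")

    # Step 4: stop once no observation changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }

  # Sum of squared distances to the most recently computed centroids
  tot_withinss <- sum((x - centers[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centers, tot_withinss = tot_withinss)
}

# Simulated data: three groups of 20 observations on two variables
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)
fit <- simple_kmeans(x, k = 3)
table(fit$cluster)
```

Changing the value passed to `set.seed()` changes the random assignment in step 1, which is a quick way to see how the starting point can affect where the algorithm ends up.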

As you can see, the final result can vary because of the random initial assignment in step 1. Therefore, it is important to run multiple initial starts and let the software identify the best solution, that is, the one with the lowest total within-cluster variation. In R, this can be a simple process, as we will see.
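As a brief preview, the `kmeans()` function in R's built-in stats package takes an `nstart` argument that controls the number of random starts and returns the best of them. The data below is simulated purely for illustration, and the choice of 25 starts is an arbitrary assumption. Note that `kmeans()` initializes by picking random observations as starting centers rather than by randomly assigning cluster labels, but the motivation for multiple starts is the same.

```r
# Simulated data for illustration: three groups on two variables
set.seed(123)
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# A single random start
set.seed(123)
fit_single <- kmeans(x, centers = 3, nstart = 1)

# 25 random starts; kmeans() keeps the solution with the lowest
# total within-cluster sum of squares
set.seed(123)
fit_multi <- kmeans(x, centers = 3, nstart = 25)

# Compare the total within-cluster sum of squares of the two fits
fit_single$tot.withinss
fit_multi$tot.withinss
```

The `tot.withinss` component is the quantity that the multiple-start procedure uses to pick a winner, so comparing it across fits is a reasonable way to check whether extra starts made a difference.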