Introduction to Hierarchical Clustering
Uncovering Structure in State-level Demographic Data in R
Clustering tries to find structure in data by grouping observations with similar characteristics. The most famous clustering algorithm is likely K-means, but there are many other ways to cluster observations. Hierarchical clustering is an alternative class of clustering algorithms that produces a nested sequence of clusterings, from 1 cluster to n clusters, where n is the number of observations in the data set. As you go down the hierarchy from 1 cluster (containing all the data) to n clusters (each observation is its own cluster), the observations within each cluster become more and more similar to each other (almost always). There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).
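To make the "1 to n clusters" idea concrete, here is a minimal sketch that builds one tree and then cuts it at every level. The five toy values are made up for illustration, and complete linkage is an arbitrary choice; hclust(), dist(), and cutree() are all base R.

```r
# Toy data: five observations on a line (hypothetical values, for illustration only)
x <- matrix(c(0, 1, 5, 6, 20), ncol = 1,
            dimnames = list(c("a", "b", "c", "d", "e"), NULL))

# Build a single hierarchy from the pairwise Euclidean distances
hc <- hclust(dist(x), method = "complete")

# Cut the same tree at every level, from 1 cluster up to n clusters
n <- nrow(x)
cuts <- cutree(hc, k = 1:n)

# One column per choice of k; each entry is the cluster label of that observation
print(cuts)

# Number of distinct clusters in each column
n_clusters <- apply(cuts, 2, function(col) length(unique(col)))
print(n_clusters)
```

The point of the sketch is that the hierarchy is computed once, and any number of clusters between 1 and n can be read off afterwards.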
Divisive
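Divisive hierarchical clustering starts with 1 cluster containing the entire data set. The observation with the highest average dissimilarity to the rest of its cluster (farthest from the cluster by some metric) is reassigned to its own new cluster. Any observations in the old cluster that are closer, on average, to the new cluster than to the rest of the old cluster are then moved to the new cluster. This process repeats until each observation is its own cluster. In the diana() implementation used later in this post, the cluster chosen for the next split is the one with the largest diameter (the largest dissimilarity between any two of its observations).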
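The first split described above can be sketched by hand in base R. This is only an illustration of the idea on made-up values, not the full algorithm: the names splinter, rest, and gain are my own, and a real implementation would go on to split the resulting clusters again.

```r
# Toy data: five named observations on a line (hypothetical values)
x <- c(a = 0, b = 1, c = 5, d = 6, e = 20)
d <- as.matrix(dist(x))  # full matrix of pairwise Euclidean distances

# Step 1: the observation with the highest average dissimilarity to the others
# seeds the new ("splinter") cluster.
avg_diss <- rowSums(d) / (length(x) - 1)
splinter <- names(which.max(avg_diss))
rest <- setdiff(names(x), splinter)

# Step 2: keep moving the observation that is, on average, closer to the
# splinter cluster than to the rest of its old cluster, until none qualifies.
repeat {
  if (length(rest) < 2) break
  gain <- sapply(rest, function(i) {
    mean(d[i, setdiff(rest, i)]) - mean(d[i, splinter])
  })
  if (max(gain) <= 0) break
  mover <- names(which.max(gain))
  splinter <- c(splinter, mover)
  rest <- setdiff(rest, mover)
}

print(splinter)  # members of the new cluster
print(rest)      # members left in the old cluster
```

With these values, the outlying observation e has the highest average dissimilarity, and none of the others is closer to it than to its current neighbors, so the first split separates e from everything else.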
Agglomerative
Agglomerative clustering starts with each observation as its own cluster. The two closest clusters are joined into one cluster. The next closest clusters are grouped together and this process continues until there is only one cluster containing the entire data set.
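The merge sequence can be inspected directly on the object returned by hclust(). The sketch below uses made-up values and average linkage (an arbitrary choice; linkage is discussed in the next section) so that the merge heights are easy to check by hand.

```r
# Toy data: five named observations on a line (hypothetical values)
x <- c(a = 0, b = 1, c = 5, d = 7, e = 20)
hc <- hclust(dist(x), method = "average")

# Each row of $merge is one step. A negative entry is a single observation;
# a positive entry refers to the cluster formed at that earlier row.
print(hc$merge)

# Distance at which each merge happened:
# a+b at 1, c+d at 2, {a,b}+{c,d} at 5.5, and finally e joins at 16.75
print(hc$height)
```

There are always n - 1 merges for n observations, which is why the matrix has four rows here.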
What does it mean to be close?
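In the section above, I neglected to define what “close” means. The distance between two clusters is defined by a linkage criterion, which builds on a distance between individual observations (Euclidean distance in the code in this post). There are a variety of possible linkage criteria, but I will list the 4 most popular: single-linkage, complete-linkage, average-linkage, and centroid-linkage.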
Single-Linkage
Single-linkage (nearest neighbor) is the shortest distance between a pair of observations, one from each of two clusters. Because only the closest pair matters, clusters can grow by chaining, so an observation can end up closer to observations in a different cluster than to some observations within its own cluster. These clusters can appear spread out.
Complete-Linkage
Complete-linkage (farthest neighbor) measures the distance between the farthest pair of observations, one from each of two clusters. This method usually produces tighter clusters than single-linkage, but these tight clusters can end up very close together. Along with average-linkage, it is one of the more popular linkage criteria.
Average-Linkage
Average-linkage takes every pair of observations with one observation from each cluster, adds up the distances between them, and divides by the number of pairs to get an average inter-cluster distance. Average-linkage and complete-linkage are the two most popular linkage criteria in hierarchical clustering.
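The three linkage criteria described so far all start from the same table of cross-cluster distances and differ only in how they summarize it. A minimal sketch with two made-up one-dimensional clusters (A and B are hypothetical names and values):

```r
# Two small clusters on a line (hypothetical values)
A <- c(0, 1)
B <- c(5, 7)

# Distance between every observation in A (rows) and every observation in B (columns):
# |0-5| = 5, |0-7| = 7, |1-5| = 4, |1-7| = 6
cross <- abs(outer(A, B, "-"))
print(cross)

single_link   <- min(cross)   # closest pair:  4
complete_link <- max(cross)   # farthest pair: 7
average_link  <- mean(cross)  # mean over all 4 pairs: 5.5

print(c(single = single_link, complete = complete_link, average = average_link))
```

Single-linkage can never exceed average-linkage, which in turn can never exceed complete-linkage, since they are the minimum, mean, and maximum of the same set of distances.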
Centroid-Linkage
Centroid-linkage is the distance between the centroids of two clusters. Because a centroid moves each time clusters are merged, the centroid of the new, larger cluster can end up closer to a third cluster than the two merged clusters were to each other. The next merge then happens at a smaller distance than the previous one, causing an inversion in the dendrogram. This problem doesn’t arise in single-, complete-, or average-linkage, because under those criteria the distance at which clusters merge never decreases from one merge to the next.
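An inversion is easy to provoke with three points that form a nearly equilateral triangle. The sketch below uses made-up coordinates; it passes squared Euclidean distances because that is what the hclust() documentation recommends for the centroid method, and it runs complete linkage on the same input for comparison.

```r
# Three points in a nearly equilateral triangle (hypothetical coordinates)
pts <- rbind(p1 = c(0, 0), p2 = c(2, 0), p3 = c(1, 1.8))

# p1 and p2 are 2 apart; p3 is about 2.06 from each of them.
# After p1 and p2 merge, their centroid is (1, 0), which is only 1.8 from p3.
d2 <- dist(pts)^2  # squared Euclidean distances

hc_centroid <- hclust(d2, method = "centroid")
hc_complete <- hclust(d2, method = "complete")

# Merge heights in merge order (on the squared-distance scale)
print(hc_centroid$height)
print(hc_complete$height)

# TRUE means a later merge happened at a lower height than an earlier one
inverted_centroid <- is.unsorted(hc_centroid$height)
inverted_complete <- is.unsorted(hc_complete$height)
print(c(centroid = inverted_centroid, complete = inverted_complete))
```

Plotting hc_centroid would show the second merge drawn below the first, which is the inversion described above.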
Using Hierarchical Clustering on State-level Demographic Data in R
The concept of regions strongly shapes how we categorize states in the US. Regions are clusters of states defined by geography, but geography leads to additional economic, demographic, and cultural similarities between states. For example, Southern Florida is very close to Cuba, making it the main destination of Cuban refugees going to the US by sea. Thus, South Florida has the largest concentration of Cuban Americans.
To study how similar states are to each other today (actually in 2017), I downloaded data from the 2017 American Community Survey and used hierarchical clustering to group the states. The data set has many variables, so I used “eigenvector decomposition, a concept from quantum mechanics to tease apart the overlapping ‘notes’ in” demographic data (I know I’m late to the dog-pile, but I have to tell everyone that I took a linear algebra class too). The clustering uses the first five of the resulting components. The resulting dendrograms (with R code) are below.
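The decomposition step itself is not shown in this post, so here is a hedged sketch of what it could look like. It assumes the components come from a principal component analysis via prcomp(), and it uses the built-in USArrests data as a stand-in for the ACS data; the names pc, eig, and scores, and the choice of two components, are mine.

```r
# USArrests ships with base R; it stands in for the ACS data here
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# With scaled variables, the variance of each principal component equals
# an eigenvalue of the correlation matrix
eig <- eigen(cor(USArrests))
print(rbind(prcomp = pc$sdev^2, eigen = eig$values))

# Keep the leading components (2 is an arbitrary choice for this small data set)
# and cluster the states on those scores
scores <- pc$x[, 1:2]
hc <- hclust(dist(scores), method = "complete")
print(hc)
```

Clustering on a few leading components rather than on every raw variable reduces the influence of variables that mostly repeat each other.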
Agglomerative Hierarchical Clustering with Complete Linkage
# Agglomerative clustering on Euclidean distances between the first five components
hc.complete = hclust(dist(pc.state.full$x[,1:5]), method='complete')
plot(hc.complete, labels = X_state$State, main='Dendrogram of Regional Clusters using 2017 ACS Data (Agglomerative)', xlab='', sub='', cex=0.7)

Divisive Hierarchical Clustering
library(cluster)
# diss evaluates to FALSE here because the input is a data matrix, not a "dist" object
div.hc = diana(pc.state.full$x[,1:5], diss = inherits(pc.state.full$x[,1:5], "dist"), metric = "euclidean")
plot(div.hc, labels = X_state$State, main='Dendrogram of Regional Clusters using 2017 ACS Data (Divisive)', xlab='')

There are some differences between the clusters resulting from the agglomerative and divisive approaches. The agglomerative approach groups Georgia and North Carolina with Illinois, Delaware, Pennsylvania, and Rhode Island, while the divisive approach groups them with the South + Ohio, Michigan, Missouri, and Indiana. Broadly, though, the two results agree, which is what we would expect.
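One way to compare the two approaches beyond eyeballing the dendrograms is to cut both trees into the same number of groups and cross-tabulate the memberships. The sketch below again uses the built-in USArrests data as a stand-in for the ACS data; k = 4 and all object names are arbitrary choices for illustration.

```r
library(cluster)  # provides diana()

# Stand-in data: scores on the first two principal components of USArrests
scores <- prcomp(USArrests, scale. = TRUE)$x[, 1:2]

agg <- hclust(dist(scores), method = "complete")  # agglomerative
div <- diana(scores, metric = "euclidean")        # divisive

# Cut both trees into the same number of groups (k = 4 is arbitrary)
k <- 4
agg_groups <- cutree(agg, k = k)
div_groups <- cutree(as.hclust(div), k = k)

# Rows: agglomerative groups; columns: divisive groups.
# States counted in the same cell are grouped together by both methods.
tab <- table(agglomerative = agg_groups, divisive = div_groups)
print(tab)
```

If most of the counts sit in one cell per row, the two methods largely agree, even though the group labels themselves need not match.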
A few findings from these dendrograms:
-Virginia is closer to the Mid-Atlantic than the South
-Alaska is closest to the Upper Plains states (rural, white, with comparatively large indigenous populations)
-Ohio, Michigan, Missouri, and Indiana are closer to the South than other Midwestern states
-Hawaii and DC are very different from the other states
I hope you learned something new and saw how easy it is to implement these techniques in R. hclust() is available in base R, while diana() is in the cluster package.
My data and code are available here.





