Hierarchical Clustering in R
Hamilton Elkins
November 14, 2013

Agglomerative Clustering [1]
- Data analysis tool used to group items
- Builds a binary tree showing items in similar groups
- Provides a visual representation of the process
- Based on distances between items

[1] Blei, 2008

Agglomerative Clustering [2]
- Works from the bottom up
- Each observation starts in its own group
- The closest groups are merged together
- The process is repeated until all observations are merged into a single group
- Visual example from Blei (2008) starts on page 15 of the PDF

[2] Blei, 2008

R Packages
- stats package
  - Function hclust
- cluster package
  - Function agnes
  - Function daisy

hclust [3,4]
- Requires a pre-calculated distance matrix
- Plotting the results produces a dendrogram
- A dendrogram is a visual representation of the agglomerative clustering method
- Groups that merge at a much greater height than their subgroups merged can be natural clusters (Tibshirani et al., 2001)

[3] stat.berkeley
[4] Blei, 2008

[Figure: Dendrogram, 50 Obs]

Distance Matrix [5,6,7]
- Agglomerative clustering is distance dependent
- Working from a distance matrix provides greater generality than clustering the raw observations
- Function dist calculates the distance matrix
- Options for the distance between observations
  - euclidean, manhattan, maximum, canberra, binary, minkowski

[5] Blei, 2008
[6] stat.berkeley
[7] astrostatistics.psu

Distance Matrix [8]
- Function daisy in the cluster package offers more options
- Provides a Gower distance option that can calculate distances for non-numerical variables
- Allows individual treatment of variables
  - ordratio: treats ratio-scaled variables as ordinal
  - logratio: log-transforms ratio-scaled variables
  - asymm: asymmetric binary
  - symm: symmetric binary

[8] Maechler, 2013

Missing Values in hclust [10]
- Function dist accepts missing values; they are excluded from the calculations involving the rows in which they occur

[10] astrostatistics.psu

Number of Observations
- Hierarchical clustering is visual in nature [11]
- The dendrogram shows the entire tree, from single observations to one all-inclusive cluster
- Splits and heights matter for interpretation
- Being able to read the dendrogram is vital

[11] Blei, 2008

Standardizing Data
- Makes all variables contribute to the clusters equally

[12] stat.berkeley

Linkage [13,14]
- The method that links observations to form clusters
- Linkage is a measure of inter-cluster distance
- The hclust default is complete, but it offers other options
  - average, ward, single, mcquitty, median, centroid
  - In R 3.1.0 and later, ward is named ward.D, and ward.D2 is also available
- Different linkages produce different dendrograms

[13] ecology.msu
[14] stat.ethz

agnes (Agglomerative Nesting) [16,17]
- Can use either pre-calculated distances or raw values
- Offers options for the distance calculation
- Results differ depending on whether the data are standardized
- Produces an agglomerative coefficient
  - Measures the strength of the cluster structure
  - Average over observations of 1 - (dissimilarity at the observation's first merge / dissimilarity at the final merge)
  - Increases with sample size

[16] Maechler, 2013
[17] Glynn, 2005

Reading a Dendrogram [19]
- Height: how distant clusters are prior to merging
- Height is determined by the linkage method
- Smaller height jumps between branches show poorly differentiated clusters
- Larger height jumps between the last merged group and the current one indicate well-differentiated clusters

[19] stat.berkeley

Pitfalls [20,21]
- Choices matter and produce different results
  - Different distance measures can produce vastly different distance matrices
  - Different linkage choices can lead to vastly different clusters
- The algorithm finds clusters and groupings even if there are none in reality
- Better suited to descriptive purposes

[20] Blei, 2008
[21] stat.berkeley

References

astrostatistics.psu.edu. “Distance Matrix Computation”. http://www.astrostatistics.psu.edu/su07/R/stats/html/dist.html

Blei, D. 2008. “Hierarchical clustering, COS424”. Princeton University. http://www.cs.princeton.edu/courses/archive/spr08/cos424/slides/clustering-2.pdf

ecology.msu.montana.edu. “Lab 13- Cluster Analysis”. http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html

Glynn, E.F. 2005. “Correlation ‘Distances’ and Hierarchical Clustering”. Stowers Institute for Medical Research. http://research.stowers-institute.org/efg/R/Visualization/cor-cluster/index.htm

Jain, A.K., Murty, M.N., & Flynn, P.J. 1999. Data Clustering: A Review. ACM Computing Surveys 31(3): 264-323.

Maechler, M. 2013. Package ‘cluster’. http://cran.r-project.org/web/packages/cluster/cluster.pdf

stat.berkeley.edu. “Performing and Interpreting Cluster Analysis”. http://www.stat.berkeley.edu/users/spector/s133/Clus.html

stat.ethz.ch. “Hierarchical Clustering”. http://stat.ethz.ch/R-manual/Rdevel/library/stats/html/hclust.html

Tibshirani, R., Walther, G., & Hastie, T. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2): 411-423.

R code to run simulation shown during presentation

# The data set must be read in as the object compsetr.
# A CSV version is available from the ISQS6348 library.
library(cluster)  # provides agnes()

# Keep the analysis columns and drop rows with missing values
salary <- compsetr[, c(1:5, 7, 8, 10)]
sc <- salary[complete.cases(salary), ]
sal_st <- data.frame(sapply(sc, scale))  # standardize every variable

# Unstandardized data, default (complete) linkage
scd <- dist(as.matrix(sc))
scc <- hclust(scd)
plot(scc, labels = FALSE)

# Standardized data, default (complete) linkage
stdis <- dist(as.matrix(sal_st))
st_cl <- hclust(stdis)
plot(st_cl, labels = FALSE)

# Other linkage methods on the standardized distances
st_cl2 <- hclust(stdis, method = "average")
st_cl4 <- hclust(stdis, method = "single")
st_cl3 <- hclust(stdis, method = "ward.D")  # called "ward" before R 3.1.0
plot(st_cl2, labels = FALSE)
plot(st_cl4, labels = FALSE)
plot(st_cl3, labels = FALSE)

# agnes on the same distances; stand is ignored when diss = TRUE
snc <- agnes(stdis, diss = TRUE, stand = TRUE, method = "ward")
plot(snc, labels = FALSE)

# Cut the Ward tree into 3, 5, 7 and 10 groups and tabulate group sizes
st_clg3 <- cutree(st_cl3, 3)
st_clg5 <- cutree(st_cl3, 5)
st_clg7 <- cutree(st_cl3, 7)
st_clg10 <- cutree(st_cl3, 10)
table(st_clg3)
table(st_clg5)
table(st_clg7)
table(st_clg10)
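The presentation code depends on the compsetr data set, which is not included here. As a rough, self-contained sketch of the same steps (standardize, compute distances, cluster with several linkages, cut the tree), the example below substitutes the built-in USArrests data. That data set, the object names, and the choice of three groups are assumptions made for illustration and are not part of the original presentation.

```r
# Sketch only: USArrests stands in for the presentation's compsetr data.
data(USArrests)

# Standardize so every variable contributes equally to the distances
arrests_std <- scale(USArrests)

# Euclidean distance matrix (the dist default)
d <- dist(arrests_std, method = "euclidean")

# Same distances, three linkage choices
hc_complete <- hclust(d)                      # default: complete linkage
hc_average  <- hclust(d, method = "average")
hc_single   <- hclust(d, method = "single")

# Dendrogram for one of the fits
plot(hc_complete, labels = FALSE, main = "Complete linkage")

# Heights of the last five merges, largest first; big gaps between
# successive heights suggest well-differentiated clusters
rev(hc_complete$height)[1:5]

# Cut two of the trees into 3 groups and cross-tabulate the memberships
groups_complete <- cutree(hc_complete, k = 3)
groups_single   <- cutree(hc_single, k = 3)
table(complete = groups_complete, single = groups_single)
```

Cross-tabulating the memberships from two linkages is one simple way to see how much the linkage choice changes the resulting groups, which is the pitfall the slides warn about.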
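The slides also describe daisy (Gower distance with per-variable treatment) and the agglomerative coefficient reported by agnes, neither of which appears in the presentation code with mixed-type data. The following is a minimal sketch under stated assumptions: the toy data frame and its column names are hypothetical and invented purely for illustration, and the type settings are just one plausible way to treat these columns.

```r
library(cluster)  # provides daisy() and agnes()

# Hypothetical mixed-type data, invented purely for illustration
toy <- data.frame(
  income  = c(20, 35, 50, 80, 120, 200),
  region  = factor(c("east", "east", "west", "west", "north", "north")),
  insured = c(0, 1, 1, 0, 1, 1)
)

# Gower dissimilarity handles the factor column; the type list requests
# per-variable treatment: log-transform income, asymmetric binary for insured
d_gower <- daisy(toy, metric = "gower",
                 type = list(logratio = "income", asymm = "insured"))

# agnes on a pre-calculated dissimilarity
ag <- agnes(d_gower, diss = TRUE, method = "average")

# Agglomerative coefficient, a value between 0 and 1
ag$ac

# The agnes tree can be converted and cut like an hclust result
groups <- cutree(as.hclust(ag), k = 2)
groups
```

Because the coefficient grows with the number of observations, a value computed on six toy rows says little on its own; the sketch only shows where the number comes from.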