Hal's agglomerative clustering presentation

Hierarchical Clustering in R
Hamilton Elkins
November 14, 2013

Agglomerative Clustering [1]
- Data analysis tool used to group items
- Builds a binary tree showing items in similar groups
- Provides a visual representation of the process
- Based on distances between items
1. Blei, 2008

Agglomerative Clustering [2]
- Works from the bottom up
- Each observation starts in its own group
- The closest groups are merged together
- The process is repeated until all observations are merged into a single group
- Visual example from Blei (2008) starts on page 15 of the PDF
2. Blei, 2008

R Packages
- stats package
  - Function hclust
- cluster package
  - Function agnes
  - Function daisy

hclust [3, 4]
- Requires a pre-calculated distance matrix
- Plotting the result produces a dendrogram
- The dendrogram is a visual representation of the agglomerative clustering process
- Groups that merge at a much greater height than their subgroups merged can be natural clusters (Tibshirani et al., 2001)
3. stat.berkeley
4. Blei, 2008

[Figure: dendrogram, 50 observations]

Distance Matrix [5, 6, 7]
- Agglomerative clustering depends on distance
- Working from a distance matrix is more general than clustering the raw observations
- Function dist calculates the distance matrix
- Options for distance between observations
  - euclidean, manhattan, maximum, canberra, binary, minkowski
5. Blei, 2008
6. stat.berkeley
7. astrostatistics.psu

Distance Matrix [8]
- Function daisy in the cluster package offers more options
- Provides a Gower distance option that can handle non-numerical variables
- Allows individual treatment of variables
  - ordratio: treats ratio-scaled variables as ordinal
  - logratio: log-transforms variables
  - asymm: asymmetric binary
  - symm: symmetric binary
8. Maechler, 2013

Missing Values in hclust [10]
- Function dist accepts missing values (they are excluded from the computations involving the rows in which they occur)
10. astrostatistics.psu

Number of Observations
- Hierarchical clustering is visual in nature [11]
- The dendrogram shows the entire tree, from single observations to one all-inclusive cluster
- Splits and heights matter for interpretation
- Being able to read the dendrogram is vital
11. Blei, 2008

Standardizing Data
- Makes all variables contribute equally to the clusters
12. stat.berkeley

Linkage [13, 14]
- The method that links observations to form clusters
- Linkage is a measure of inter-cluster distance
- The hclust default is complete, but other options are offered
  - average, ward, single, mcquitty, median, centroid
- Different linkages produce different dendrograms
13. ecology.msu
14. stat.ethz

agnes (Agglomerative Nesting) [16, 17]
- Can use either a pre-calculated distance matrix or raw values
- Offers options for the distance calculation
- Results differ if the data are standardized
- Produces an agglomerative coefficient
  - Measures the strength of the cluster structure
  - Average over observations of 1 - (dissimilarity at the observation's first merge / dissimilarity at the final merge)
  - Increases with sample size
16. Maechler, 2013
17. Glynn, 2005

Reading a Dendrogram [19]
- Height: how distant the clusters are just before they merge
- Height is determined by the linkage method
- Smaller height jumps between branches show poorly differentiated clusters
- Larger height jumps between the last merged group and the current one indicate well-differentiated clusters
19. stat.berkeley

Pitfalls [20, 21]
- Choices matter and produce different results
- Different distance measures can produce vastly different distance matrices
- Different linkage choices can lead to vastly different clusters
- The algorithm finds clusters and groupings even if there are none in reality
- Better for descriptive purposes
20. Blei, 2008
21. stat.berkeley

References
- astrostatistics.psu.edu. “Distance Matrix Computation”. http://www.astrostatistics.psu.edu/su07/R/stats/html/dist.html
- Blei, D. 2008. “Hierarchical clustering, COS424”. Princeton University. http://www.cs.princeton.edu/courses/archive/spr08/cos424/slides/clustering-2.pdf
- ecology.msu.montana.edu. “Lab 13- Cluster Analysis”. http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html
- Glynn, E.F. 2005. “Correlation ‘Distances’ and Hierarchical Clustering”. Stowers Institute for Medical Research.
  http://research.stowers-institute.org/efg/R/Visualization/cor-cluster/index.htm
- Jain, A.K., Murty, M.N., & Flynn, P.J. 1999. Data Clustering: A Review. ACM Computing Surveys 31 (3): 264-323.
- Maechler, M. 2013. Package ‘cluster’. http://cran.r-project.org/web/packages/cluster/cluster.pdf
- stat.berkeley.edu. “Performing and Interpreting Cluster Analysis”. http://www.stat.berkeley.edu/users/spector/s133/Clus.html
- stat.ethz.ch. “Hierarchical Clustering”. http://stat.ethz.ch/R-manual/Rdevel/library/stats/html/hclust.html
- Tibshirani, R., Walther, G., & Hastie, T. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411-423.

R code to run simulation shown during presentation

    # Data set must be read in as object compsetr.
    # CSV version is available from ISQS6348 library.
    library(cluster)  # provides agnes()

    # Keep the variables of interest and drop rows with missing values
    salary <- compsetr[, c(1:5, 7, 8, 10)]
    sc <- salary[complete.cases(salary), ]
    sal_st <- data.frame(sapply(sc, scale))  # standardized copy

    # Unstandardized data, default (complete) linkage
    scd <- dist(as.matrix(sc))
    scc <- hclust(scd)
    plot(scc, labels = FALSE)

    # Standardized data, default (complete) linkage
    stdis <- dist(as.matrix(sal_st))
    st_cl <- hclust(stdis)
    plot(st_cl, labels = FALSE)

    # Other linkages on the standardized distances
    st_cl2 <- hclust(stdis, method = 'average')
    st_cl4 <- hclust(stdis, method = 'single')
    # 'ward' was renamed 'ward.D' in later versions of R
    st_cl3 <- hclust(stdis, method = 'ward.D')
    plot(st_cl2, labels = FALSE)
    plot(st_cl4, labels = FALSE)
    plot(st_cl3, labels = FALSE)

    # agnes on the same distances; stand is ignored when diss = TRUE
    snc <- agnes(stdis, diss = TRUE, stand = TRUE, method = 'ward')
    plot(snc, labels = FALSE)

    # Cut the Ward tree into 3, 5, 7 and 10 groups and tabulate group sizes
    st_clg3 <- cutree(st_cl3, 3)
    st_clg5 <- cutree(st_cl3, 5)
    st_clg7 <- cutree(st_cl3, 7)
    st_clg10 <- cutree(st_cl3, 10)
    table(st_clg3)
    table(st_clg5)
    table(st_clg7)
    table(st_clg10)
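The script above depends on the compsetr data set, which is not included here. As a rough, self-contained sketch of the same workflow (standardize, compute distances, compare linkages, cut the tree, read the agglomerative coefficient), the following uses R's built-in USArrests data as a stand-in. The choice of data set, the four-group cut, and all object names are assumptions made for illustration and are not part of the presentation. The "largest height jump" rule at the end is only a simple reading of the dendrogram heuristic from the slides, not the gap statistic of Tibshirani et al. (2001).

```r
# Illustrative sketch only: USArrests (built into R) stands in for compsetr.
library(cluster)  # agnes()

# Standardize so every variable contributes equally to the distances
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

# Same distances, three linkages: merge heights (and often groups) differ
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")

# Cut two of the trees into 4 groups (4 is an arbitrary choice here)
# and cross-tabulate the memberships to see how much the linkages disagree
g_complete <- cutree(hc_complete, k = 4)
g_single   <- cutree(hc_single, k = 4)
print(table(complete = g_complete, single = g_single))

# Agglomerative coefficient from agnes() on the same distance matrix
ag <- agnes(d, diss = TRUE, method = "average")
print(ag$ac)

# Rough heuristic: find the largest jump between consecutive merge heights.
# After merge i there are n - i clusters, so cutting inside the largest
# jump (between merge i and merge i + 1) leaves n - i clusters.
jumps <- diff(hc_complete$height)
k_suggested <- nrow(x) - which.max(jumps)
print(k_suggested)
```

Calling plot(hc_complete), plot(hc_single) and plot(hc_average) afterwards draws the three dendrograms for a visual comparison of the linkages.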
