Hal's agglomerative clustering presentation

Hierarchical Clustering in R
Hamilton Elkins
November 14, 2013
Agglomerative Clustering [1]
• Data analysis tool used to group items
• Builds a binary tree showing items in similar groups
• Provides a visual representation of the process
• Based on distances between items
1. Blei, 2008
Agglomerative Clustering [2]
• Works from the bottom up
• Each observation starts in its own group
• The closest groups are merged together
• The process is repeated until all observations are merged into a single group
• A visual example from Blei (2008) starts on page 15 of the PDF
2. Blei, 2008
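The bottom-up process can be watched on a tiny example. This is a minimal sketch using five made-up points on a line (the values and names are illustrative, not from the presentation) and single linkage, chosen only because its merge heights are easy to check by hand.

```r
# Five hypothetical points on a line: two tight pairs and one outlier.
x <- c(a = 1, b = 2, c = 6, d = 7, e = 15)

# Every observation starts as its own group; hclust merges the closest
# groups step by step until one group remains.
hc <- hclust(dist(x), method = "single")

# One row per merge step. Negative entries are single observations,
# positive entries refer to the cluster formed in that earlier row.
print(hc$merge)

# Distance at which each merge happened: the pairs at distance 1 first,
# then the two pairs (gap of 4), and finally the outlier (gap of 8).
print(hc$height)
```

With n observations there are always n - 1 merge steps, which is why the resulting tree is binary.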
R Packages
• stats package
  • Function hclust
• cluster package
  • Function agnes
  • Function daisy
hclust [3, 4]
• Requires a pre-calculated distance matrix
• Plotting the result produces a dendrogram
• The dendrogram is a visual representation of the agglomerative clustering method
• Groups that merge at a much greater height than their subgroups merged can be natural clusters (Tibshirani et al., 2001)
3. stat.berkeley 4. Blei, 2008
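The basic hclust workflow described above can be sketched as follows. The example assumes the built-in USArrests data set (50 states, 4 numeric variables) purely for illustration. It is not the data used in the presentation.

```r
# USArrests ships with R in the datasets package.
d <- dist(USArrests)   # hclust needs this pre-calculated distance object
hc <- hclust(d)        # default linkage is "complete"

# Plotting an hclust object draws the dendrogram.
plot(hc, labels = FALSE, main = "USArrests, complete linkage")
```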
[Figure: dendrogram of 50 observations]
Distance Matrix [5, 6, 7]
• Agglomerative clustering is distance dependent
• Working from a distance matrix is more general than clustering the raw observations
• Function dist calculates the distance matrix
• Options for distance between observations
  • euclidean, manhattan, maximum, canberra, binary, minkowski
5. Blei, 2008 6. stat.berkeley 7. astrostatistics.psu
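To make the distance options concrete, here is a small sketch with two made-up observations whose distances are easy to verify by hand. The matrix m and the variable names are illustrative assumptions.

```r
# Two hypothetical observations measured on three variables.
m <- rbind(p = c(0, 0, 0),
           q = c(3, 4, 0))

d_euc  <- dist(m, method = "euclidean")         # sqrt(3^2 + 4^2) = 5
d_man  <- dist(m, method = "manhattan")         # 3 + 4 = 7
d_max  <- dist(m, method = "maximum")           # max(3, 4) = 4
d_mink <- dist(m, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)

print(c(euclidean = as.numeric(d_euc),
        manhattan = as.numeric(d_man),
        maximum   = as.numeric(d_max),
        minkowski = as.numeric(d_mink)))
```

The same pair of observations is 5, 7, 4, or about 4.5 units apart depending on the measure, so the choice feeds directly into which groups look closest.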
Distance Matrix [8]
• Function daisy in the cluster package offers more options
• Provides the Gower distance option, which can calculate distances for non-numerical variables
• Allows individual treatment of variables
  • ordratio: treats ratio-scaled variables as ordinal
  • logratio: log-transforms ratio-scaled variables
  • asymm: asymmetric binary
  • symm: symmetric binary
8. Maechler, 2013
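A short sketch of daisy on mixed data follows. The data frame is hypothetical (the column names income, region, and smoker are made up for illustration), and the example assumes the cluster package is installed.

```r
library(cluster)  # provides daisy()

# Hypothetical mixed-type data: numeric, nominal, and 0/1 binary columns.
df <- data.frame(
  income = c(30, 45, 60, 200),
  region = factor(c("north", "south", "north", "east")),
  smoker = c(0, 1, 0, 1)
)

# Gower distance handles the factor directly. The type list assigns
# per-variable treatment: log-transform income, treat smoker as
# asymmetric binary.
d <- daisy(df, metric = "gower",
           type = list(logratio = "income", asymm = "smoker"))
print(d)
```

The result is a dissimilarity object that can be passed on to hclust or agnes in place of the output of dist.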
Missing Values in hclust [10]
• Function dist accepts missing values
10. astrostatistics.psu
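As a hedged illustration of how dist treats missing values: for a pair of rows, columns containing an NA are skipped and the remaining sum is scaled up in proportion to the number of columns used. The small matrix below is made up for the purpose.

```r
# Three hypothetical observations; row a has a missing third variable.
m <- rbind(a = c(1, 2, NA),
           b = c(4, 6, 3),
           c = c(1, 2, 3))

d <- dist(m)
print(d)
# a vs b uses only the first two columns: (3^2 + 4^2) * 3/2 = 37.5,
# so the distance is sqrt(37.5). b vs c uses all three columns and is 5.

# No NA remains in the distance object, so hclust can use it.
hc <- hclust(d)
```

If two rows share no complete columns at all, their distance is NA, and hclust does not accept a distance object that contains NA.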
Number of Observations
• Hierarchical clustering is visual in nature [11]
• The dendrogram shows the entire tree, from single observations up to one cluster containing everything
• Splits and heights matter for interpretation
• Being able to read the dendrogram is vital
11. Blei, 2008
Standardizing Data
• Makes all variables contribute to the clusters equally
12. stat.berkeley
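A small sketch of why standardizing matters, using made-up values on very different scales (the column names salary and years are illustrative only):

```r
# Hypothetical data: salary is in the tens of thousands, years is single digits.
raw <- data.frame(salary = c(40000, 52000, 61000, 75000),
                  years  = c(1, 3, 10, 12))

print(dist(raw))   # distances are driven almost entirely by salary

# scale() centers each column to mean 0 and rescales to standard deviation 1.
std <- scale(raw)
print(dist(std))   # both variables now contribute on a comparable scale
```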
Linkage [13, 14]
• Method that links observations to form clusters
• Linkage is a measure of inter-cluster distance
• hclust defaults to complete linkage but offers other options
  • average, ward, single, mcquitty, median, centroid
• Different linkages produce different dendrograms
13. ecology.msu 14. stat.ethz
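One way to see the effect of linkage is to cut trees built from the same distances into the same number of groups and cross-tabulate the assignments. This sketch assumes the built-in USArrests data and an arbitrary choice of four groups. Neither comes from the presentation.

```r
d <- dist(scale(USArrests))

# Same distances, three different linkage methods.
methods <- c("complete", "average", "single")
groups <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 4))

# Rows: groups under complete linkage. Columns: groups under single linkage.
# A table with counts spread across several cells means the methods disagree.
print(table(complete = groups[, "complete"], single = groups[, "single"]))
```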
agnes (Agglomerative Nesting) [16, 17]
• Can use either a pre-calculated distance matrix or raw values
• Offers options for the distance calculation
• Results differ if the data are standardized
• Produces an agglomerative coefficient
  • Measures the strength of the cluster structure
  • Average over observations of 1 - (dissimilarity at first merge / dissimilarity at final merge)
  • Increases with sample size
16. Maechler, 2013 17. Glynn, 2005
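A minimal sketch of both ways of calling agnes, again assuming the built-in USArrests data for illustration and that the cluster package is installed:

```r
library(cluster)  # provides agnes()

# Raw values: agnes computes the distances itself, here after
# standardizing the variables (stand = TRUE).
ag_raw <- agnes(USArrests, metric = "euclidean", stand = TRUE,
                method = "average")

# Pre-calculated distances: pass a dist object and set diss = TRUE.
ag_dis <- agnes(dist(USArrests), diss = TRUE, method = "average")

# Agglomerative coefficient for each fit, a value between 0 and 1.
print(c(standardized = ag_raw$ac, unstandardized = ag_dis$ac))
```

Because the coefficient grows with the number of observations, it is best compared only across fits on data sets of similar size.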
Reading a Dendrogram [19]
• Height: how distant clusters are just before they merge
• Height is determined by the linkage method
• Small height jumps between successive merges indicate poorly differentiated clusters
• Large height jumps between the last merged group and the current one indicate well-differentiated clusters
19. stat.berkeley
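The height jumps can also be read off numerically. The sketch below is one simple heuristic, not a method from the presentation: it finds the largest gap between successive merge heights and cuts the tree just below it. It assumes the built-in USArrests data and average linkage.

```r
hc <- hclust(dist(scale(USArrests)), method = "average")

h <- hc$height     # one height per merge, in merge order
jumps <- diff(h)   # gap between each merge and the next one

# After merge i there are n - i clusters, where n = length(h) + 1.
# Cutting just below the largest jump therefore leaves this many clusters:
k <- length(h) + 1 - which.max(jumps)

groups <- cutree(hc, k = k)
print(k)
print(table(groups))
```

A large jump near the top of the tree supports the split. If all jumps are of similar size, the dendrogram gives little evidence for any particular number of clusters.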
Pitfalls [20, 21]
• Choices matter and produce different results
  • Different distance measures can produce vastly different distance matrices
  • Different linkage choices can lead to vastly different clusters
• The algorithm finds clusters and groupings even if there are none in reality
• Better suited to descriptive purposes
20. Blei, 2008 21. stat.berkeley
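The last pitfall is easy to demonstrate. In this sketch the data are uniform random noise with no group structure by construction (the seed, sample size, and number of groups are arbitrary choices), yet the procedure still returns a tree and cluster labels.

```r
set.seed(1)

# 100 points scattered uniformly over the unit square: no real clusters.
noise <- matrix(runif(200), ncol = 2)

hc <- hclust(dist(noise))
grp <- cutree(hc, k = 3)

# Three "clusters" come back anyway.
print(table(grp))
```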
References

astrostatistics.psu.edu. “Distance Matrix Computation”.
http://www.astrostatistics.psu.edu/su07/R/stats/html/dist.html

Blei, D. 2008. “Hierarchical clustering, COS424”. Princeton University.
http://www.cs.princeton.edu/courses/archive/spr08/cos424/slides/clustering-2.pdf

ecology.msu.montana.edu. “Lab 13- Cluster Analysis”.
http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html

Glynn, E.F. 2005. “Correlation ‘Distances’ and Hierarchical Clustering”. Stowers Institute for Medical
Research. http://research.stowers-institute.org/efg/R/Visualization/cor-cluster/index.htm

Jain, A.K., Murty, M.N., & Flynn, P.J. 1999. Data Clustering: A Review. ACM Computing Surveys 31 (3):
264-323.

Maechler, M. 2013. Package ‘cluster’. http://cran.r-project.org/web/packages/cluster/cluster.pdf

stat.berkeley.edu. “Performing and Interpreting Cluster Analysis”.
http://www.stat.berkeley.edu/users/spector/s133/Clus.html

stat.ethz.ch. “Hierarchical Clustering”. http://stat.ethz.ch/R-manual/Rdevel/library/stats/html/hclust.html

Tibshirani, R., Walther, G., & Hastie, T. 2001. Estimating the number of clusters in a data set via the gap
statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411-423.

R code to run simulation shown during presentation
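# The data set must be read in as the object compsetr. A CSV version is available from the ISQS6348 library.
library(cluster)  # provides agnes()

# Keep the variables of interest and drop rows with missing values
salary <- compsetr[, c(1:5, 7, 8, 10)]
sc <- salary[complete.cases(salary), ]

# Standardize every column to mean 0 and standard deviation 1
sal_st <- as.data.frame(scale(sc))

# Complete-linkage clustering on the unstandardized data
scd <- dist(as.matrix(sc))
scc <- hclust(scd)
plot(scc, labels = FALSE)

# Complete-linkage clustering on the standardized data
stdis <- dist(as.matrix(sal_st))
st_cl <- hclust(stdis)
plot(st_cl, labels = FALSE)

# Other linkage methods on the standardized distances.
# "ward" was renamed "ward.D" in R 3.1.0; the old name still works but prints a message.
st_cl2 <- hclust(stdis, method = "average")
st_cl4 <- hclust(stdis, method = "single")
st_cl3 <- hclust(stdis, method = "ward.D")
plot(st_cl2, labels = FALSE)
plot(st_cl4, labels = FALSE)
plot(st_cl3, labels = FALSE)

# agnes on the same distances; stand is ignored when diss = TRUE
snc <- agnes(stdis, diss = TRUE, stand = TRUE, method = "ward")
plot(snc, labels = FALSE)

# Cut the Ward tree into 3, 5, 7, and 10 groups and tabulate the group sizes
st_clg3 <- cutree(st_cl3, 3)
st_clg5 <- cutree(st_cl3, 5)
st_clg7 <- cutree(st_cl3, 7)
st_clg10 <- cutree(st_cl3, 10)
table(st_clg3)
table(st_clg5)
table(st_clg7)
table(st_clg10)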