Clustering Approaches
Advanced Biostatistics
Dean C. Adams
Lecture 9
EEOB 590C

Objectives of Exploratory Data Analyses
•Investigate data using only the Y-matrix of variables
•Objects are points in high-dimensional data space
•Look for patterns and distributions of points in data space
•Generate summary plots of data space (ordination)
•Look for relationships among points (clustering)
[Figure: scatter of points in data space; axes PC I and PC II]

Ordination and Dimension Reduction
•Visualize high-dimensional data space as succinctly as possible
•Describe variation in original data with a new set of variables (typically orthogonal vectors)
•Order new variables by variation explained (most to least)
•Plot first few dimensions to summarize data
•Principal Components Analysis (PCA) is one approach (others include PCoA, MDS, CA, etc.)

Ordination Protocols
•Methods based on variables (VCV) or objects (distances)
•Can view methods as flow chart of operations
•PCA: Y → VCV (or R) → eigen → E, L → PROJ → PC scores
•PCoA: Y → S (or D) → DCntr → eigen → L → PCo scores
•NMDS: Y → S (or D) → ‘Guess’ → MD scores
NOTE: PCoA and NMDS can begin directly from D or S

Correspondence Analysis (CA)
•Ordination for contingency table data (counts, frequencies)
•Method VERY useful for ecological data (sites × species, individuals × prey they consumed, etc.)
•CA preserves the χ² distance among objects (a weighted Euclidean distance of conditional probabilities)

$$D_{Euclid} = \sqrt{\sum_k \left( y_{ki} - y_{kj} \right)^2} \qquad D_{\chi^2} = \sqrt{\sum_k \frac{1}{y_{k+}/y_{++}} \left( \frac{y_{ki}}{y_{+i}} - \frac{y_{kj}}{y_{+j}} \right)^2}$$

where $y_{k+}$ is the total for variable k, $y_{+i}$ and $y_{+j}$ are the totals for rows (objects) i and j, and $y_{++}$ is the grand total

•CA provides test for the ‘independence’ of rows and columns
•CA also called reciprocal averaging

CA: Protocol
1. Calculate the matrix of relative frequencies (proportions) from contingency table data: $p_{ij} = f_{ij} / f_{tot}$
2. Calculate elements of matrix Q as: $q_{ij} = \dfrac{p_{ij} - p_{i+}\,p_{+j}}{\sqrt{p_{i+}\,p_{+j}}}$, where $p_{i+}$ and $p_{+j}$ are the row and column sums of the proportions (matrix is centered by row and column means, hence ‘reciprocal averaging’)
3. Perform singular-value decomposition (SVD) on Q: $SVD(Q) = \hat{U} W U'$
  •Eigen-analysis NOT used because Q is rectangular (not square)
  •Û: factors (‘eigenvectors’) for rows
  •U': factors (‘eigenvectors’) for columns
  •W: singular values (related to eigenvalues)
  •Ordination plot from scaled & projected row and column factors (see Legendre and Legendre, 1998 for math)

CA: Comments
•Similarity of objects from frequency data can be viewed using ordination
•Test of independence between objects and variables (significance implies some objects have higher frequencies on particular variables)
•Ordination can be displayed as a biplot (simultaneous plot of rows and columns: objects and variables)
•Interesting Note: Eigenanalysis of QQ' yields Û, and eigenanalysis of Q'Q yields U
•Advantage of SVD is that it decomposes rows AND columns simultaneously

CA Example: Buzzwords in Ecology
•Use of 30 ecological terms (1982-1995) in 43 journals in 5 categories
•Journal categories (columns): A) Ecologists, B) Students, C) Funding/Govt., D) Scientists, E) Biologists
•Terms (rows): 1 Alien species, 2 Altruism, 3 Balance of nature, 4 Biodiversity, 5 Biome, 6 Carrying capacity, 7 Climax, 8 Community, 9 Competition, 10 Complexity, 11 Diversity, 12 Dominance, 13 Ecosystem, 14 Efficiency, 15 Entropy, 16 Equilibrium, 17 Exotic species, 18 Invasive species, 19 Limiting resources, 20 Limits on growth, 21 Niche, 22 Pioneer, 23 Population, 24 Sensitivity, 25 Stability, 26 Stress, 27 Succession, 28 Sustainability, 29 Tragedy of commons, 30 Trophic level
[Table: counts of each term (rows 1-30) in each journal category (columns A-E)]
Data from Adams et al. (1997). Oikos. 80:632-636.
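Steps 1 and 2 of the CA protocol, and the link to the test of independence, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the function names (`ca_q_matrix`, `pearson_chi_square`) and the 3 × 3 table of counts are hypothetical, every row and column total is assumed to be non-zero, and the SVD of step 3 is left to a numerical library. The sketch relies on the identity that the sum of squared elements of Q, multiplied by the table total, equals the Pearson χ² statistic for independence of rows and columns.

```python
import math

def ca_q_matrix(table):
    """Steps 1-2 of the CA protocol for a table of counts (list of rows)."""
    f_tot = sum(sum(row) for row in table)
    p = [[f / f_tot for f in row] for row in table]   # p_ij = f_ij / f_tot
    p_row = [sum(row) for row in p]                   # p_i+ (row sums)
    p_col = [sum(col) for col in zip(*p)]             # p_+j (column sums)
    return [[(p[i][j] - p_row[i] * p_col[j]) / math.sqrt(p_row[i] * p_col[j])
             for j in range(len(p_col))]
            for i in range(len(p_row))]

def pearson_chi_square(table):
    """Pearson chi-square statistic for independence of rows and columns."""
    f_tot = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            expected = row_tot[i] * col_tot[j] / f_tot
            chi2 += (f - expected) ** 2 / expected
    return chi2

# Hypothetical 3 x 3 table of counts (not data from the lecture)
counts = [[10, 2, 1], [3, 8, 2], [1, 4, 9]]
q = ca_q_matrix(counts)
total_inertia = sum(v * v for row in q for v in row)
f_tot = sum(map(sum, counts))
print(total_inertia * f_tot, pearson_chi_square(counts))  # the two values agree
```

When rows and columns are exactly independent (for example, a table whose rows are proportional to one another), every element of Q is zero and there is nothing left for the SVD to decompose.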
CA Plot
[Figure: CA biplot of terms (1-30) and journal categories (A-E); both axes run from -1.5 to 1.5]
•Journals for funding agencies and general public: buzzwords frequently used are sustainability, tragedy of commons, biodiversity
•2-D plot describes 89% of variation
Data from Adams et al. (1997). Oikos. 80:632-636.

Ordination Approaches: Closing Comments
•Extremely useful for obtaining low-dimensional view of data
•Do NOT use subset of PCs for subsequent analyses (can get into trouble)
•Don’t over-interpret axes (axes may be orthogonal, biological ‘factors’ are not!)
•Don’t ‘correct’ for patterns to identify ‘real’ pattern
•Example: the arch effect
  •Common in community data, or along environmental gradients
  •Some have proposed ‘corrections’ (detrending) to ‘reveal’ original gradient

The Arch Effect
[Figure, top: hypothetical abundance data of 3 species (Spec. 1, Spec. 2, Spec. 3) for sites 1-19 along an ecological gradient]
[Figure, bottom: ordination (CA) of sites from above data]
From Legendre and Legendre (1998). Numerical Ecology.
Note: similar curved patterns found in data in other fields (psychology: the ‘circumplex’ plot)

Detrended Correspondence Analysis (DCA)
•‘Correct’ arch effect: two approaches
1: DCA by segments
  •Create DC1 segments
  •DC2 values are mean-centered within each segment
2: DCA by polynomials
  •Constrain CA so DC2 is orthogonal to DC1, DC1², DC1³, etc.

DCA: Example
[Figure: two panels, ‘DCA by segments’ and ‘DCA by polynomials’]
From Legendre and Legendre (1998). Numerical Ecology.

DCA: Comments
•Methods completely arbitrary and border on absurd
•How to choose segments? how many? what degree polynomial to use?
•DC2 completely meaningless
•Can’t interpret object locations, b/c DC2 is arbitrarily created
•DCA eliminates arch, which IS the pattern!
•Detrending should absolutely be avoided!
•For detailed critique see Wartenberg, Ferson, and Rohlf, 1987. Am. Nat.

Clustering Methods
•Obtain groupings (clusters) of data based on similarity
•Clustering requires distances (or similarities) between points
•A complementary view to ordination
•Clustering is algorithmic, not algebraic (i.e., it is a procedure, or set of rules for connecting data)

Classifying Clustering Methods
•Hierarchical: clusters are nested (higher-rank clusters contain lower-rank clusters)
•Non-hierarchical: data are partitioned into clusters (no nesting)
•Other classification schemes
  •Agglomerative vs. Divisive: start with every specimen in its own group vs. all in 1 group
  •Sequential vs. Simultaneous: calculate clusters one at a time vs. all at once
  •Monothetic vs. Polythetic (divisive methods only): single descriptor for partitions vs. multiple descriptors

SAHN Methods
•SAHN: Sequential, agglomerative, hierarchical, nested
•Cluster most similar objects first and recalculate if needed
•Aggregate until all objects are part of largest cluster
•Produces dendrogram displaying similarities
•Different criteria for determining when to join objects to clusters

SAHN: Single Linkage vs. Complete Linkage
•Single linkage (nearest neighbor clustering): Add object to a cluster when the similarity to its closest member (the first object reached) is reached
•Complete linkage (farthest neighbor clustering): Add object to a cluster when the similarity to its most distant member (the last object reached) is reached
•Represent extremes of SAHN clustering (& can be sensitive to noise in data)
•Mostly ‘slides’ nodes towards tips (single) or root (complete)
•Cluster assignments may also change

Clustering Data: Example
•Skull shape similarity among populations of European moles
[Figure: shape residuals for 113 superimposed specimens]
•Data expressed as shape DISTANCE

                        1      2      3      4      5      6      7      8      9     10
 1 T_romana1            0   4.66   8.53   8.96    6.4   5.35   9.27   8.42    8.6   8.82
 2 T_romana2                   0   8.19  10.98   5.43    5.2   6.96   7.24   8.21   8.26
 3 M_touchei                          0   9.88   5.67  10.28    9.1   8.75   5.63   7.58
 4 P_breweri                                 0   9.91  10.32  11.93  12.01  10.85  12.09
 5 T_occidentalis                                   0   6.45   5.72   7.25    6.2   6.57
 6 T_stankovici                                            0   7.66   8.63   9.93  10.32
 7 T_caeca                                                        0   7.17   8.13   8.21
 8 T_europa1                                                             0   5.39   5.35
 9 T_europa2                                                                    0   4.64
10 T_europa3                                                                           0

Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.

Single Linkage vs. Complete Linkage: Example
[Figure: dendrograms of the mole distance data under single linkage and complete linkage; annotations: ‘Cluster change’, ‘Nodes shifted towards root’]
•Single linkage tip order: 10.T_eur, 9.T_eur, 8.T_eur, 3.M_touc, 2.T_rom, 1.T_rom, 6.T_stan, 5.T_occ, 7.T_cae, 4.P_bre
•Complete linkage tip order: 10.T_eur, 9.T_eur, 8.T_eur, 7.T_cae, 5.T_occ, 3.M_touc, 2.T_rom, 1.T_rom, 6.T_stan, 4.P_bre
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.

SAHN: UPGMA
•Unweighted Pair-Group Method using Arithmetic Averages
•Most commonly used SAHN method
•Uses averages of clusters to join additional objects
  •Connect closest 2 objects in a cluster
  •Calculate their average similarity to objects not in cluster
  •Replace original similarity scores with averages
  •Add new object to cluster when distance to average is reached
  •Recalculate average for cluster and continue
•Method unweighted because it gives same weight to original similarity scores (e.g., when 3rd object added, new average found by dividing by 3, etc.)

UPGMA: Example
[Figure: UPGMA dendrogram of the mole distance data, shown with the single linkage and complete linkage dendrograms for comparison]
•UPGMA tip order: 5.T_occ, 3.M_touc, 10.T_eur, 9.T_eur, 8.T_eur, 7.T_cae, 2.T_rom, 1.T_rom, 6.T_stan, 4.P_bre
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.
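The three joining criteria covered so far (single linkage, complete linkage, UPGMA) differ only in how the distance between two clusters is summarized. A minimal Python sketch makes this explicit. It is an illustration under stated assumptions, not the software used for the lecture figures: the function names are hypothetical, the upper-triangle values are transcribed from the mole shape-distance matrix of Rohlf et al. (1996) as given in this lecture, and the search over cluster pairs is deliberately naive (fine for 10 objects, slow for large data sets).

```python
from itertools import combinations

LABELS = ["T_romana1", "T_romana2", "M_touchei", "P_breweri", "T_occidentalis",
          "T_stankovici", "T_caeca", "T_europa1", "T_europa2", "T_europa3"]

# Upper triangle (diagonal omitted) of the mole shape-distance matrix
UPPER = [
    [4.66, 8.53, 8.96, 6.4, 5.35, 9.27, 8.42, 8.6, 8.82],
    [8.19, 10.98, 5.43, 5.2, 6.96, 7.24, 8.21, 8.26],
    [9.88, 5.67, 10.28, 9.1, 8.75, 5.63, 7.58],
    [9.91, 10.32, 11.93, 12.01, 10.85, 12.09],
    [6.45, 5.72, 7.25, 6.2, 6.57],
    [7.66, 8.63, 9.93, 10.32],
    [7.17, 8.13, 8.21],
    [5.39, 5.35],
    [4.64],
]

def square(upper):
    """Expand an upper triangle into a full symmetric distance matrix."""
    n = len(upper) + 1
    d = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row, start=1):
            d[i][i + offset] = d[i + offset][i] = value
    return d

def sahn(d, method):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, height)."""
    summarize = {
        "single": min,                            # nearest neighbor
        "complete": max,                          # farthest neighbor
        "average": lambda v: sum(v) / len(v),     # UPGMA: mean of original distances
    }[method]
    clusters = [[i] for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            height = summarize([d[i][j] for i in clusters[a] for j in clusters[b]])
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        clusters[a] = clusters[a] + clusters[b]   # join the two closest clusters
        del clusters[b]
    return merges

D = square(UPPER)
for method in ("single", "complete", "average"):
    first, last = sahn(D, method)[0], sahn(D, method)[-1]
    print(method, "first join at", first[2], "last join at", round(last[2], 2))
```

All three criteria start by joining the two closest objects (T_europa2 and T_europa3), and P_breweri is the last to join under single linkage and UPGMA. What changes is the height of the later nodes: lowest under single linkage, highest under complete linkage, with UPGMA in between for the final join.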
SAHN: WPGMA
•Weighted Pair-Group Method using Arithmetic Averages
•Same as UPGMA, but averages are weighted (always divided by 2)
•Thus, gives different weights to original objects (when 3+ in cluster)

WPGMA: Example
[Figure: WPGMA dendrogram of the mole distance data beside the UPGMA dendrogram]
•Tip order in both: 5.T_occ, 3.M_touc, 10.T_eur, 9.T_eur, 8.T_eur, 7.T_cae, 2.T_rom, 1.T_rom, 6.T_stan, 4.P_bre
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.

SAHN: UPGMC & WPGMC
•Use centroids of clusters to join additional objects
•Centroids obtained from similarity scores*
•Centroid methods can find ‘reversals’ or negative branch lengths (e.g., when distance of 3rd object to centroid is smaller than distance between original pair)
* For details see Legendre and Legendre (1998). Numerical Ecology.

Visualizing Centroid Clustering
•Can think of centroid clustering as connecting dots in ordination space
[Figure: centroid clustering drawn in ordination space]
From Legendre and Legendre (1998). Numerical Ecology.

UPGMC & WPGMC: Examples
[Figure: UPGMC and WPGMC dendrograms of the mole distance data]
•Tip order in both: 2.T_rom, 1.T_rom, 6.T_stan, 5.T_occ, 10.T_eur, 9.T_eur, 8.T_eur, 7.T_cae, 3.M_touc, 4.P_bre
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.

Ward’s Minimum Variance Method
•Use cluster variance (TESS: total error sum of squares)
•Add object that increases TESS the least
[Figure: Ward’s dendrogram of the mole distance data beside the UPGMA dendrogram]
•Ward’s tip order: 2.T_rom, 1.T_rom, 6.T_stan, 4.P_bre, 5.T_occ, 3.M_touc, 7.T_cae, 10.T_eur, 9.T_eur, 8.T_eur

Partition Methods: K-Means Clustering
•Partitions data into groups that minimize TESS
•Define # groups (k)
•Assign specimens to groups, calculate centroid, and TESS
•Repeat many times and choose solution with minimal TESS
•Can iterate for k = 2, 3, 4, etc. to find optimal # groups
•Does not yield dendrogram (not hierarchical); only group membership

Clustering: Comments
•Recall: these methods do NOT assume process!!
•Careful in interpretation (not based on evolutionary history)
•Change of metric/distance measure may alter results
•Useful to combine with ordination (are complementary)
[Figure: PCoA plot of the mole data (axes PCoA 1 and PCoA 2) beside the UPGMA dendrogram]
•Other methods exist: minimum spanning tree (MST), neighbor-joining, flexible-link clustering, probabilistic clustering, evolutionary model-based ‘clustering’ (parsimony, ML, Bayesian, etc.)
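The k-means procedure outlined in the Partition Methods slide (assign specimens to k groups, compute centroids and TESS, repeat many times, keep the solution with minimal TESS) can be sketched as below. This is a hedged illustration only: the slide does not say how each trial assignment is refined, so the sketch assumes the common Lloyd-style iteration from randomly chosen starting points, and the function names, the number of restarts, the seed, and the eight 2-D points are all hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means_once(points, k, rng, max_iter=100):
    """One run from a random start; returns (TESS, group labels)."""
    centroids = rng.sample(points, k)             # k observed points as starting centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda g: squared_distance(p, centroids[g]))
                      for p in points]
        if new_labels == labels:                  # assignments stable: converged
            break
        labels = new_labels
        for g in range(k):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:                           # an empty group keeps its old centroid
                centroids[g] = centroid(members)
    tess = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return tess, labels

def k_means(points, k, restarts=20, seed=0):
    """Repeat from several random starts and keep the solution with minimal TESS."""
    rng = random.Random(seed)
    return min((k_means_once(points, k, rng) for _ in range(restarts)),
               key=lambda result: result[0])

# Hypothetical data: two well-separated groups of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
for k in (1, 2, 3):
    tess, labels = k_means(points, k)
    print(k, round(tess, 3), labels)
```

TESS can only shrink as k grows, so the minimum itself does not pick the number of groups. In practice one looks for the k beyond which the decrease in TESS levels off.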