Clustering Approaches Advanced Biostatistics Dean C. Adams Lecture 9

EEOB 590C
Objectives of Exploratory Data Analyses
•Investigate data using only Y-matrix of variables
•Objects are points in high-dimensional data space
•Look for patterns and distributions of points in data space
•Generate summary plots of data space (ordination)
•Look for relationships of points (clustering)
[Figure: objects plotted as points on PC I vs. PC II]
Ordination and Dimension Reduction
•Visualize high dimensional data space as succinctly as possible
•Describe variation in original data with new set of variables
(typically orthogonal vectors)
•Order new variables by variation explained (most – least)
•Plot first few dimensions to summarize data
•Principal Components Analysis (PCA) is one approach (others include: PCoA, MDS, CA, etc.)
Ordination Protocols
•Methods based on variables (VCV) or objects (distances)
•Can view methods as flow chart of operations
•PCA: Y → VCV (or R) → eigen → E, L → PROJ → PC Scores
•PCoA: Y → S or D → DCntr → eigen → PCo Scores, L
•NMDS: Y → S or D → MD 'Guess' → Scores
NOTE: PCoA and NMDS can begin directly from D or S
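The DCntr step of the PCoA flow can be sketched in plain Python. This is a minimal sketch, assuming DCntr denotes Gower's double-centering of the squared distances; the function name `double_center` and the three-object example are hypothetical and not part of the lecture.

```python
def double_center(D):
    """Gower-center a distance matrix: b_ij = a_ij - rowmean_i - colmean_j + grandmean,
    where a_ij = -0.5 * d_ij^2 (assumed meaning of the DCntr step)."""
    n = len(D)
    # Element-wise transform of the distances
    A = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in A]
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    # Remove row and column means, add back the grand mean
    return [[A[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


# Hypothetical example: three objects on a line at positions 0, 2 and 4
D = [[0.0, 2.0, 4.0],
     [2.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
B = double_center(D)
for row in B:
    print(row)
```

For Euclidean distances the centered matrix holds the cross-products of the mean-centered coordinates, so its rows and columns sum to zero; the eigen step would then be applied to `B`.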
Correspondence Analysis (CA)
•Ordination for contingency table data (counts, frequencies)
•Method VERY useful for ecological data (sites × species,
individuals × prey they consumed, etc.)
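•CA preserves χ² distance among objects (a weighted Euclidean distance of conditional probabilities)
$$D_{Euclid} = \sqrt{\sum_k \left( y_{ki} - y_{kj} \right)^2}$$
$$D_{\chi^2} = \sqrt{\sum_k \frac{1}{y_k} \left( \frac{y_{ki}}{y_i} - \frac{y_{kj}}{y_j} \right)^2}$$
where $y_k$ is the total for variable k (summed over all rows), and $y_i$, $y_j$ are the totals for rows i and j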
•CA provides test for the ‘independence’ of rows and columns
•CA is also called reciprocal averaging
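The χ² distance between two rows can be sketched as follows. This is a minimal sketch, assuming rows are objects and columns are variables, with every row and column total non-zero; the function name `chi_square_distance` and the small count table are hypothetical.

```python
import math


def chi_square_distance(table, i, j):
    """Chi-square distance between rows i and j of a table of counts."""
    row_tot_i = sum(table[i])
    row_tot_j = sum(table[j])
    # Column totals weight each variable's contribution
    col_tot = [sum(col) for col in zip(*table)]
    total = 0.0
    for k in range(len(col_tot)):
        # Difference of conditional probabilities (row profiles)
        diff = table[i][k] / row_tot_i - table[j][k] / row_tot_j
        total += diff ** 2 / col_tot[k]
    return math.sqrt(total)


# Hypothetical counts: rows 0 and 1 have the same profile but different totals
table = [[2, 4],
         [10, 20],
         [6, 0]]
d01 = chi_square_distance(table, 0, 1)
d02 = chi_square_distance(table, 0, 2)
print(d01, d02)
```

Rows 0 and 1 are far apart in Euclidean distance, yet their χ² distance is zero because their conditional probabilities are identical.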
CA: Protocol
1. Calculate matrix of relative frequencies (proportions) from contingency table data: $p_{ij} = f_{ij} / f_{tot}$
2. Calculate elements of matrix Q as: $q_{ij} = \frac{p_{ij} - p_i p_j}{\sqrt{p_i p_j}}$, where $p_i$ and $p_j$ are the row and column totals of the proportions
(matrix is centered by row and column means, hence ‘reciprocal averaging’)
3. Perform singular-value decomposition (SVD) on Q: $\mathrm{SVD}(\mathbf{Q}) = \hat{\mathbf{U}} \mathbf{W} \mathbf{U}'$
• Eigen-analysis NOT used because Q is rectangular (not square)
• Û: factors (‘eigenvectors’) for rows
• U': factors (‘eigenvectors’) for columns
• W: singular values (related to eigenvalues)
• Ordination plot from scaled & projected row and column factors (see Legendre and Legendre, 1998 for math)
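Steps 1 and 2 of the protocol can be sketched in plain Python. This is a minimal sketch that stops before the SVD step; the function name `ca_q_matrix` and the 2 × 3 count table are hypothetical. It also illustrates the link to the test of independence: the sum of squared elements of Q, multiplied by the table total, equals the Pearson χ² statistic.

```python
import math


def ca_q_matrix(F):
    """Steps 1-2 of the CA protocol: proportions, then the Q matrix."""
    f_tot = sum(sum(row) for row in F)
    # Step 1: relative frequencies p_ij = f_ij / f_tot
    P = [[f / f_tot for f in row] for row in F]
    p_row = [sum(row) for row in P]          # row totals p_i
    p_col = [sum(col) for col in zip(*P)]    # column totals p_j
    # Step 2: q_ij = (p_ij - p_i * p_j) / sqrt(p_i * p_j)
    return [[(P[i][j] - p_row[i] * p_col[j]) / math.sqrt(p_row[i] * p_col[j])
             for j in range(len(p_col))] for i in range(len(p_row))]


# Hypothetical contingency table (2 objects x 3 variables)
F = [[10, 20, 30],
     [20, 20, 20]]
Q = ca_q_matrix(F)
inertia = sum(q * q for row in Q for q in row)
chi_square = inertia * sum(sum(row) for row in F)
print(chi_square)
```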
CA: Comments
•Similarity of objects from frequency data can be viewed using
ordination
•Test of independence between objects and variables (significance
implies some objects have higher frequencies on particular
variables)
•Ordination can be displayed as a biplot (simultaneous plot of rows and columns: objects and variables)
•Interesting Note: Eigenanalysis of QQ' yields Û, and eigenanalysis of Q'Q yields U'
•Advantage of SVD is that it decomposes rows AND columns simultaneously
CA Example: Buzzwords in Ecology
•Use of 30 ecological terms (1982-1995) in 43 journals in 5 categories (journals for: A) Ecologists, B) Students, C) Funding/Govt., D) Scientists, E) Biologists)
Terms: 1 Alien species; 2 Altruism; 3 Balance of nature; 4 Biodiversity; 5 Biome; 6 Carrying capacity; 7 Climax; 8 Community; 9 Competition; 10 Complexity; 11 Diversity; 12 Dominance; 13 Ecosystem; 14 Efficiency; 15 Entropy; 16 Equilibrium; 17 Exotic species; 18 Invasive species; 19 Limiting resources; 20 Limits on growth; 21 Niche; 22 Pioneer; 23 Population; 24 Sensitivity; 25 Stability; 26 Stress; 27 Succession; 28 Sustainability; 29 Tragedy of commons; 30 Trophic level
[Table: counts of each term by journal category (A-E)]
Data from Adams et al. (1997). Oikos. 80:632-636.
CA Plot
[Figure: CA biplot of the 30 terms (numbered) and the 5 journal categories (A-E); both axes run from -1.5 to 1.5]
•Journals for funding agencies and general public: buzzwords frequently used are sustainability, tragedy of commons, biodiversity
•2-D plot describes 89% of variation
Data from Adams et al. (1997). Oikos. 80:632-636.
Ordination Approaches: Closing Comments
•Extremely useful for obtaining low-dimensional view of data
•Do NOT use subset of PCs for subsequent analyses (can get into trouble)
•Don’t over-interpret axes (axes may be orthogonal, biological ‘factors’ are not!)
•Don’t ‘correct’ for patterns to identify ‘real’ pattern
•Example: the arch effect
•Common in community data, or along environmental gradients
•Some have proposed ‘corrections’ (detrending) to ‘reveal’
original gradient
The Arch Effect
[Figure: hypothetical abundance data of 3 species (Spec. 1, Spec. 2, Spec. 3) for sites 1-19 along an ecological gradient]
[Figure: ordination (CA) of sites from the above data]
From Legendre and Legendre (1998). Numerical Ecology.
Note: similar curved patterns are found in data in other fields (psychology: the ‘circumplex’ plot)
Detrended Correspondence Analysis (DCA)
•‘Correct’ arch effect: two approaches
1: DCA by segments
•Divide DC1 into segments
•DC2 values are mean-centered within each segment
2: DCA by polynomials
•Constrain CA so DC2 is orthogonal to DC1, DC1², DC1³, etc.
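The segment approach can be sketched in a few lines. This is a minimal sketch, assuming equal-width segments along DC1; the function name `detrend_by_segments`, the parameter `n_segments`, and the arch-shaped scores are all hypothetical.

```python
def detrend_by_segments(dc1, dc2, n_segments):
    """Mean-center DC2 scores within equal-width segments of DC1."""
    lo, hi = min(dc1), max(dc1)
    width = (hi - lo) / n_segments
    # Segment index of each site (the maximum DC1 value goes in the last segment)
    seg = [min(int((x - lo) / width), n_segments - 1) for x in dc1]
    out = list(dc2)
    for s in range(n_segments):
        members = [i for i, g in enumerate(seg) if g == s]
        if members:
            mean = sum(dc2[i] for i in members) / len(members)
            for i in members:
                out[i] = dc2[i] - mean
    return out


# Hypothetical arch: second-axis scores are a quadratic function of the first
dc1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
dc2 = [-4.0, -1.0, 0.0, -1.0, -4.0]
two_segments = detrend_by_segments(dc1, dc2, 2)
five_segments = detrend_by_segments(dc1, dc2, 5)
print(two_segments)
print(five_segments)
```

With five segments every site sits alone in its segment and all detrended scores become zero, while two segments give a different set of values: the result depends on the choice of `n_segments`.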
DCA: Example
[Figure: two panels, DCA by segments and DCA by polynomials]
From Legendre and Legendre (1998). Numerical Ecology.
DCA: Comments
•Methods completely arbitrary and border on absurd
•How to choose segments? how many? what degree polynomial to use?
•DC2 completely meaningless
•Can’t interpret object locations, b/c DC2 is arbitrarily created
•DCA eliminates arch, which IS the pattern!
•Detrending should absolutely be avoided!
•For detailed critique see Wartenberg, Ferson, and Rohlf, 1987. Am. Nat.
Clustering Methods
•Obtain groupings (clusters) of data based on similarity
•Clustering requires distances (or similarities) between points
•A complementary view to ordination
•Clustering is algorithmic, not algebraic (i.e., it is a procedure, or
set of rules for connecting data)
Classifying Clustering Methods
•Hierarchical: clusters are nested (higher-rank clusters contain lower-rank
clusters)
•Non-hierarchical: data are partitioned into clusters (no nesting)
•Other classification schemes
•Agglomerative vs. Divisive: start with every specimen in its own group vs.
all in 1 group
•Sequential vs. Simultaneous: calculate clusters one at a time vs. all at once
•Monothetic vs. Polythetic (divisive methods only): single descriptor
for partitions vs. multiple descriptors
SAHN Methods
•SAHN: Sequential, agglomerative, hierarchical, nested
•Cluster most similar objects first and recalculate if needed
•Aggregate until all objects are part of largest cluster
•Produces dendrogram displaying similarities
•Different criteria for determining when to join objects to clusters
SAHN: Single Linkage vs. Complete Linkage
•Single linkage (nearest neighbor clustering): Add object to a cluster when its similarity to the closest (first) member of the cluster is reached
•Complete linkage (farthest neighbor clustering): Add object to a cluster only when its similarity to the most distant (last) member of the cluster is reached
•Represent extremes of SAHN clustering (& can be sensitive to noise in data)
•Mostly ‘slides’ nodes towards tips (single) or root (complete)
•Cluster assignments may also change
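The two linkage rules differ only in how the distance between two clusters is summarized. The following is a minimal sketch of a SAHN-style loop on a distance matrix, where the `linkage` argument is Python's built-in `min` (single) or `max` (complete); the function name `agglomerate` and the 4-object distance matrix are hypothetical.

```python
def agglomerate(D, linkage):
    """Merge the two closest clusters until one remains.
    Returns the merge history as (members_a, members_b, height)."""
    clusters = [(i,) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance: min (single) or max (complete) over member pairs
                d = linkage(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical distances among 4 objects
D = [[0.0, 2.0, 6.0, 10.0],
     [2.0, 0.0, 5.0, 9.0],
     [6.0, 5.0, 0.0, 4.0],
     [10.0, 9.0, 4.0, 0.0]]
single_heights = [h for _, _, h in agglomerate(D, min)]
complete_heights = [h for _, _, h in agglomerate(D, max)]
print(single_heights)    # [2.0, 4.0, 5.0]
print(complete_heights)  # [2.0, 4.0, 10.0]
```

In this small example the cluster memberships are the same under both rules, but the final node moves from height 5 to height 10, i.e., towards the root under complete linkage.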
Clustering Data: Example
•Skull shape similarity among populations of European moles
Shape residuals for 113 superimposed specimens; data expressed as shape DISTANCE among the 10 populations (lower triangle):
               1      2      3      4      5      6      7      8      9     10
1.T_rom        0
2.T_rom     4.66      0
3.M_touc    8.53   8.19      0
4.P_bre     8.96  10.98   9.88      0
5.T_occ      6.4   5.43   5.67   9.91      0
6.T_stan    5.35    5.2  10.28  10.32   6.45      0
7.T_cae     9.27   6.96    9.1  11.93   5.72   7.66      0
8.T_eur     8.42   7.24   8.75  12.01   7.25   8.63   7.17      0
9.T_eur      8.6   8.21   5.63  10.85    6.2   9.93   8.13   5.39      0
10.T_eur    8.82   8.26   7.58  12.09   6.57  10.32   8.21   5.35   4.64      0
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.
Single Linkage vs. Complete Linkage: Example
[Figure: single linkage and complete linkage dendrograms of the 10 mole populations; annotations mark a cluster change and nodes shifted towards root under complete linkage]
Populations: 1 T_romana1, 2 T_romana2, 3 M_touchei, 4 P_breweri, 5 T_occidentalis, 6 T_stankovici, 7 T_caeca, 8 T_europa1, 9 T_europa2, 10 T_europa3
Data expressed as shape DISTANCE
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.
SAHN: UPGMA
•Unweighted Pair-Group Method using Arithmetic Averages
•Most commonly used SAHN method
•Uses averages of clusters to join additional objects
•Connect closest 2 objects in a cluster
•Calculate their average similarity to objects not in cluster
•Replace original similarity scores with averages
•Add new object to cluster when distance to average is reached
•Recalculate average for cluster and continue
•Method unweighted because it gives same weight to original similarity scores (e.g., when 3rd
object added, new average found by dividing by 3, etc.)
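The UPGMA steps above can be sketched as follows. This is a minimal sketch working on distances rather than similarities; the function name `upgma`, the labels, and the 4-object distance matrix are hypothetical.

```python
def upgma(D, labels):
    """UPGMA on a distance matrix. Returns [(cluster_name, height), ...]."""
    size = {lab: 1 for lab in labels}
    dist = {frozenset((labels[i], labels[j])): D[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    merges = []
    while len(size) > 1:
        # Join the closest pair of current clusters
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        height = dist.pop(pair)
        new = "(" + a + "," + b + ")"
        for k in size:
            if k in (a, b):
                continue
            d_ka = dist.pop(frozenset((k, a)))
            d_kb = dist.pop(frozenset((k, b)))
            # Size-weighted mean, so every original object counts equally
            dist[frozenset((k, new))] = ((size[a] * d_ka + size[b] * d_kb)
                                         / (size[a] + size[b]))
        size[new] = size.pop(a) + size.pop(b)
        merges.append((new, height))
    return merges


# Hypothetical distances among 4 objects
labels = ["A", "B", "C", "D"]
D = [[0.0, 2.0, 4.0, 10.0],
     [2.0, 0.0, 6.0, 12.0],
     [4.0, 6.0, 0.0, 8.0],
     [10.0, 12.0, 8.0, 0.0]]
merges = upgma(D, labels)
for name, height in merges:
    print(name, height)
```

When D joins the three-object cluster, its distance is (10 + 12 + 8) / 3 = 10, the plain average of the original distances to A, B and C.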
UPGMA: Example
[Figure: UPGMA dendrogram of the 10 mole populations, shown with the single linkage and complete linkage dendrograms]
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.
SAHN: WPGMA
•Weighted Pair-Group Method using Arithmetic Averages
•Same as UPGMA, but averages are weighted
(always divided by 2)
•Thus, gives different weights to original objects (when 3+ in cluster)
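The difference between the two averaging rules shows up as soon as a cluster holds two objects and a third is added. A minimal sketch, with hypothetical function names and distances:

```python
def upgma_update(d_ka, d_kb, n_a, n_b):
    """Distance from k to the merged cluster, weighting by cluster sizes."""
    return (n_a * d_ka + n_b * d_kb) / (n_a + n_b)


def wpgma_update(d_ka, d_kb):
    """Distance from k to the merged cluster, always dividing by 2."""
    return (d_ka + d_kb) / 2


# Hypothetical original distances from object D to objects A, B and C
d_DA, d_DB, d_DC = 10.0, 12.0, 8.0
# A and B were joined first, so D's distance to (A,B) is their mean
d_D_AB = (d_DA + d_DB) / 2
# Now C joins (A,B): update D's distance to the 3-object cluster
upgma_value = upgma_update(d_D_AB, d_DC, n_a=2, n_b=1)
wpgma_value = wpgma_update(d_D_AB, d_DC)
# Implied WPGMA weights on the original distances: A 1/4, B 1/4, C 1/2
wpgma_from_weights = 0.25 * d_DA + 0.25 * d_DB + 0.5 * d_DC
print(upgma_value, wpgma_value, wpgma_from_weights)
```

UPGMA returns the equal-weight average of the three original distances (10.0), whereas WPGMA returns 9.5 because the most recently added object C carries half of the weight.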
WPGMA: Example
[Figure: WPGMA dendrogram of the 10 mole populations, shown with the UPGMA dendrogram]
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.
SAHN: UPGMC & WPGMC
•Use centroids of clusters to join additional objects
•Centroids obtained from similarity scores*
•Centroid methods can find ‘reversals’ or negative branch lengths
(e.g., when distance of 3rd object to centroid is smaller than distance between original pair)
* For details see Legendre and Legendre (1998). Numerical Ecology.
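A reversal is easy to reproduce with three points that form a nearly equilateral triangle. This is a minimal sketch using hypothetical coordinates, with the centroid computed directly from the coordinates rather than from similarity scores (requires Python 3.8+ for `math.dist`).

```python
import math

# Hypothetical coordinates of three objects
A, B, C = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

d_AB = math.dist(A, B)
d_AC = math.dist(A, C)
d_BC = math.dist(B, C)

# A and B are the closest pair, so they are joined first at height d_AB
first_height = min(d_AB, d_AC, d_BC)
centroid_AB = ((A[0] + B[0]) / 2, (A[1] + B[1]) / 2)

# C then joins at its distance to the centroid of (A, B)
second_height = math.dist(C, centroid_AB)

# Reversal: the later join happens at a smaller distance than the earlier one
reversal = second_height < first_height
print(first_height, second_height, reversal)
```

Here the first join occurs at distance 2.0 but the third object lies only about 1.8 from the pair's centroid, so the second node would be drawn below the first.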
Visualizing Centroid Clustering
•Can think of centroid clustering as connecting dots in ordination
space
From Legendre and Legendre (1998). Numerical Ecology.
UPGMC & WPGMC: Examples
[Figure: UPGMC and WPGMC dendrograms of the 10 mole populations]
Data from Rohlf et al. (1996). Syst. Biol. 45:344-362.
Ward’s Minimum Variance Method
•Use cluster variance (TESS: total error sum of squares)
•Add object that increases TESS the least
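One step of this criterion can be sketched as follows. This is a minimal sketch, assuming clusters are given as lists of coordinate tuples; the function names `ess` and `ward_best_merge` and the three one-point clusters are hypothetical.

```python
def ess(points):
    """Error sum of squares of one cluster: squared distances to its centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))


def ward_best_merge(clusters):
    """Return (increase_in_TESS, a, b) for the merge that raises TESS the least."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            increase = (ess(clusters[a] + clusters[b])
                        - ess(clusters[a]) - ess(clusters[b]))
            if best is None or increase < best[0]:
                best = (increase, a, b)
    return best


# Hypothetical starting point: three single-object clusters
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(5.0, 0.0)]]
best = ward_best_merge(clusters)
print(best)
```

Repeating this step, each time replacing the two chosen clusters with their union, builds the full hierarchy.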
[Figure: Ward's minimum variance dendrogram of the 10 mole populations, shown with the UPGMA dendrogram]
Partition Methods: K-Means Clustering
•Partitions data into groups that minimize TESS
•Define # groups (k)
•Assign specimens to groups, calculate centroid, and TESS
•Repeat many times and choose solution with minimal TESS
•Can iterate for k = 2,3,4 etc. to find optimal # groups
•Does not yield dendrogram (not hierarchical); only group membership
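The procedure can be sketched in plain Python. This is a minimal sketch, assuming a Lloyd-style iteration (reassign specimens to the nearest centroid, then recompute centroids) started from randomly chosen specimens; the function names, the `n_starts` and `seed` parameters, and the six 2-D points are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_starts=10, seed=0, max_iter=100):
    """Return (TESS, labels) of the best of n_starts random starts."""
    rng = random.Random(seed)
    best = None
    for _start in range(n_starts):
        centroids = rng.sample(points, k)   # k distinct specimens as initial centroids
        labels = None
        for _it in range(max_iter):
            # Assign each specimen to its nearest centroid
            new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
            if new_labels == labels:
                break
            labels = new_labels
            # Recompute the centroid of every non-empty group
            for c in range(k):
                members = [p for p, g in zip(points, labels) if g == c]
                if members:
                    centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        tess = sum(sq_dist(p, centroids[g]) for p, g in zip(points, labels))
        if best is None or tess < best[0]:
            best = (tess, labels)
    return best


# Hypothetical data: two well-separated groups of three specimens
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
tess, labels = kmeans(points, k=2)
print(tess, labels)
```

Running the function for k = 2, 3, 4, etc. and comparing the returned TESS values is one way to explore the number of groups.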
Clustering: Comments
•Recall: these methods do NOT assume process!!
•Careful in interpretation (not based on evolutionary history)
•Change of metric/distance measure may alter results
•Useful to combine with ordination (they are complementary)
[Figure: PCoA plot (PCoA 1 vs. PCoA 2) of the 10 mole populations, shown with the UPGMA dendrogram]
•Other methods exist: minimum spanning tree (MST), neighbor-joining,
flexible-link clustering, probabilistic clustering, evolutionary model-based
‘clustering’ (parsimony, ML, Bayesian, etc.)