Ordination Approaches
Advanced Biostatistics
Dean C. Adams
Lecture 8
EEOB 590C
Objectives of Exploratory Data Analyses
•Investigate data using only Y-matrix of variables
•Objects are points in high-dimensional data space
•Look for patterns and distributions of points in data space
•Generate summary plots of data space (ordination)
•Look for relationships of points (clustering)
[Figure: example ordination scatter of objects on PC I vs. PC II]
Ordination and Dimension Reduction
•Visualize high-dimensional data space as succinctly as possible
•Describe variation in original data with a new set of variables (typically orthogonal vectors)
•Order new variables by variation explained (most to least)
•Plot first few dimensions to summarize data
•Principal Components Analysis (PCA) is one approach (others include: PCoA, MDS, CA, etc.)
Principal Components Analysis (PCA)
•Rigid rotation of the data by variables (R-mode)
•Axes rank-ordered by % variation explained
•Euclidean distances among specimens preserved (i.e., no distortion of relationships!)
•New axes (PC vectors) are linear combinations of original variables
•PC axes uncorrelated with one another
•PCA accomplished via SVD (singular value decomposition)*
*NOTE: PCA is commonly described using eigen-analysis, but SVD is computationally more efficient and more stable
PCA: What Does it Do?
•Rigid rotation of data
[Figure: scatter of femur vs. humerus measurements with the rotated PC1 and PC2 axes superimposed]
PCA: SVD Computations
1. Start with (column-centered) data matrix Y (n × p)
2. Decompose Y via SVD: Y = UDVᵗ
   V (p × p) = matrix of linear combinations of variables (rotated PC axes; right-singular vectors)
   U (n × p) = matrix of left-singular vectors
   D (p × p) = diagonal matrix of singular values
3. D² gives the variation (eigenvalues λ) explained by each PC axis
4. UD = PCA scores for each specimen
NOTE: U is also found from eigen-analysis of YYᵗ, and V from eigen-analysis of YᵗY.
The ‘classic’ implementation of PCA uses eigen-analysis of the covariance matrix: S = EΛEᵗ, where E contains the eigenvectors and Λ the eigenvalues. PCA scores are found as: P = YE
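As a concrete check of the steps above, here is a minimal R sketch of PCA via SVD, verified against the built-in prcomp(). The iris measurements stand in for any continuous Y matrix; all object names are illustrative.

# PCA via SVD, checked against prcomp() (iris as a stand-in Y matrix)
Y  <- as.matrix(iris[, 1:4])
Yc <- scale(Y, center = TRUE, scale = FALSE)   # step 1: column-center Y
s  <- svd(Yc)                                  # step 2: Y = UDV^t
prop.var <- s$d^2 / sum(s$d^2)                 # step 3: proportion of variation per axis
scores   <- s$u %*% diag(s$d)                  # step 4: PCA scores = UD
pca <- prcomp(Y)
all.equal(abs(scores), abs(pca$x), check.attributes = FALSE)  # TRUE (signs are arbitrary)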
PCA: Comments
•PCA is a rigid rotation where new axes explain % variation
•Distances among objects preserved!
•PC axes contain the loadings of each variable (variables with values closer to -1 & +1 are more influential in that direction)
•How well does a particular PC plot represent relationships among objects? Assess by:
 •% variation explained by PC axes
 •Mantel correlation and Shepard diagram (plot of distances in reduced PC space vs. distances in full data space); see the sketch below
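A minimal sketch of the Shepard-diagram check in R (again with iris as stand-in data, and k = 2 retained axes chosen purely for illustration):

# Shepard diagram: distances in reduced PC space vs. full data space
Y      <- as.matrix(iris[, 1:4])
pca    <- prcomp(Y)
d.full <- as.vector(dist(Y))             # distances in full data space
d.pca  <- as.vector(dist(pca$x[, 1:2]))  # distances using the first 2 PC axes
plot(d.full, d.pca, xlab = "FullDist", ylab = "PCADist")  # Shepard diagram
cor(d.full, d.pca)   # the distance-matrix correlation underlying a Mantel test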
PCA: Interpretations
•PCA does NOTHING to the data, except rotate it
•PCA does not find a particular factor (e.g., group differences, allometry): it identifies the direction of most variation, which may be interpretable as a ‘factor’ (but may not)
•Be careful interpreting ALL PC axes as biologically meaningful!! Remember they are constrained to be orthogonal, but biological variability is not
•Some criteria exist for how many PC axes to interpret (Kaiser-Guttman criterion; broken-stick model, sketched below: see Ch. 9 of Numerical Ecology)
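A minimal sketch of the broken-stick criterion (the standard formulation: interpret axis i while its observed proportion of variation exceeds the broken-stick expectation); iris is a stand-in:

# Broken-stick criterion for the number of PC axes to interpret
pca    <- prcomp(as.matrix(iris[, 1:4]))
p      <- length(pca$sdev)
bstick <- rev(cumsum(1 / (p:1))) / p     # expected proportion for axis i: (1/p) * sum(1/(i:p))
prop   <- pca$sdev^2 / sum(pca$sdev^2)   # observed proportion per axis
which(prop > bstick)                     # axes exceeding the broken-stick expectation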
PCA: Dimensionality
•Sometimes PCA dimensions have no variation (λ = 0)
•If p > N, the # of PC axes containing variation = N - 1 (3 points always lie in a plane, etc.)
•If N > p and the # of PC axes with variation < p, then some variables are redundant (e.g., compositional data (A+B+C=1), or some other linear dependency; see the sketch below)
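A two-line illustration of the last point, using hypothetical compositional data:

# A + B + C = 1 forces a linear dependency, so one eigenvalue is (numerically) zero
set.seed(1)
A <- runif(20); B <- runif(20) * (1 - A); C <- 1 - A - B
prcomp(cbind(A, B, C))$sdev^2    # the last variance is effectively zero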
PCA and the Influence of Scale
•Y variables with different scale or high variation can have undue influence on PCA (akin to outliers in regression)
•Scaling traits first alleviates this problem (note: this is equivalent to using the correlation matrix R in the eigen-analysis)
•Obtaining R (see the sketch below):
 1: cor(Y)
 2: cov(scale(Y)) *
*Recall that correlations are standardized covariances!
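Both routes give the same R, and prcomp(..., scale. = TRUE) performs the correlation-based PCA directly (iris as stand-in data):

Y <- as.matrix(iris[, 1:4])
all.equal(cor(Y), cov(scale(Y)))      # TRUE: two equivalent routes to R
pca.cor <- prcomp(Y, scale. = TRUE)   # PCA on R (variables scaled first)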
PCA Example: Bumpus Data
> summary(pca.bumpus)
                          PC1    PC2    PC3    PC4    PC5    PC6     PC7     PC8
Standard deviation     0.0699 0.0365 0.0218 0.0192 0.0162 0.0149 0.01362 0.01098
Proportion of Variance 0.6220 0.1694 0.0605 0.0469 0.0335 0.0284 0.02360 0.01534
Cumulative Proportion  0.6220 0.7915 0.8521 0.8990 0.9325 0.9610 0.98466 1.00000

> pca.bumpus$rotation
          PC1         PC2         PC3
TL  0.2023231 -0.09175940  0.50893954
AE  0.2477126 -0.04677641  0.25686306
BHL 0.2336021  0.05718419  0.20256651
HL  0.4019895  0.19154460 -0.05637123
FL  0.4165135  0.35321769 -0.12498579
TTL 0.4188801  0.45911607 -0.33681530
SW  0.2218929  0.07085675  0.65504837
SKL 0.5323137 -0.78029599 -0.26951186

PC1: loadings similar and positive (size)
PC2: relative shape: SKL vs. TTL & FL
PCA Plot
2-D plot describes 72.5% of variation
[Figure: PCA ordination of the Bumpus data (PC1 vs. PC2); males (red), females (blue). Inset Shepard plot of PCADist vs. FullDist; Mantel correlation from Shepard plot: 0.96]
Bi-Plot
•Ordination plot of objects (rows) and variables (columns)
•Look for sets of vectors with small angles, and clusters of points
•Can use to identify variables with high association with objects
[Figure: biplot of the Bumpus data (PC1 vs. PC2), with numbered specimens as points and variable vectors TL, AE, BHL, HL, FL, TTL, SW, SKL overlaid]
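In R, biplot() on a prcomp object produces this display directly; a minimal sketch (iris again stands in for the fitted object, e.g. pca.bumpus):

pca <- prcomp(as.matrix(iris[, 1:4]))
biplot(pca, cex = 0.7)   # objects plotted as points, variable loadings as arrows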
Principal Coordinates Analysis (PCoA)
•Ordination from distance matrix among objects (Q-mode)
•Preserves object distances from any distance measure (D_Euclid, Jaccard, etc.)
•Useful when PCA is not appropriate (e.g., binary data, species abundances)
•If D_Euclid is used, PCoA = PCA (Gower 1966)
•Sometimes called metric multidimensional scaling (MDS) because it preserves relationships among objects
PCoA: Protocol
1. Start with distance matrix among objects (zeros down diagonal); similarities are converted as: Dij = 1 - Sij
2. Transform the elements Dij to: Aij = -(1/2)Dij²
3. Double-center the matrix (subtract row and column means from each element, and add the grand mean; this positions the origin at the centroid of the scatter):
   G = (I - (1/n)11ᵗ) A (I - (1/n)11ᵗ)
4. Eigen-decomposition of the double-centered matrix
 •Eigenvectors, scaled by the square roots of their eigenvalues, are the COORDINATES for the ordination plot (they don’t describe aspects of the variables, since there are no variables, only distances among objects)
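A minimal R sketch of these four steps, checked against the built-in cmdscale() (iris as stand-in data; any distance matrix could replace D):

# PCoA by hand: transform, double-center, eigen-decompose
D  <- dist(as.matrix(iris[, 1:4]))
A  <- -0.5 * as.matrix(D)^2         # step 2: Aij = -(1/2)Dij^2
n  <- nrow(A)
Cn <- diag(n) - matrix(1/n, n, n)   # centering matrix: I - (1/n)11^t
G  <- Cn %*% A %*% Cn               # step 3: double-centered matrix
e  <- eigen(G, symmetric = TRUE)    # step 4
coords <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))  # scaled coordinates
pcoa <- cmdscale(D, k = 2)          # built-in equivalent
all.equal(abs(coords), abs(pcoa), check.attributes = FALSE)  # TRUE up to sign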
PCoA: Comments
•PCoA ‘embeds’ a set of objects into a Euclidean space
•PCA vs. PCoA: identical for continuous multivariate data
•PCoA strength: enables ordination for any data type where distances are available (genetic distance, Hamming distance, geographic distance, quantitative, semi-quantitative, qualitative, or mixed data)
•If PCoA yields negative eigenvalues, the distances are semimetric or nonmetric

Properties of Distances
Metric (Euclidean): (1) dii = 0, (2) dij = dji, (3) triangle inequality holds
Semimetric (pseudometric): triangle inequality need not hold
Nonmetric: dij may be negative
PCoA Example: Bumpus Data
[Figure: PCA plot (PC1 vs. PC2) and PCoA plot (PCoA[,1] vs. PCoA[,2]) of the Bumpus data, shown side by side]
Nonmetric Multi-Dimensional Scaling (NMDS)
•PCA and PCoA attempt to preserve distances among objects
•NMDS preserves the relative (rank) order of the objects, not their distances
NMDS: Protocol
1. Start with distance matrix
2. Specify number of NMDS dimensions a priori
3. Construct initial configuration of objects (a ‘guess’). (Important step: PCoA ordination often used)
4. Obtain fitted distances (d̂) from the regression D_NMDS ~ D_actual
5. Estimate goodness of fit (stress):
   Stress₁ = √[ Σ(d_fitted - d̂_fitted)² / Σ(d_fitted)² ]
   (Note: other stress equations exist)
6. Move objects in NMDS plot and repeat steps 4 & 5
7. Iterate until the change in stress is below a threshold (i.e., convergence)
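A minimal NMDS sketch using MASS::isoMDS (vegan::metaMDS is a common alternative), with the PCoA configuration as the initial ‘guess’; iris is stand-in data:

library(MASS)
D    <- dist(unique(as.matrix(iris[, 1:4])))  # drop duplicate rows: isoMDS requires nonzero distances
init <- cmdscale(D, k = 2)                    # step 3: PCoA as the starting configuration
nmds <- isoMDS(D, y = init, k = 2)            # iterates steps 4-7 until stress converges
nmds$stress                                   # final stress (reported as a percentage)
plot(nmds$points, xlab = "NMDS1", ylab = "NMDS2")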
NMDS: Comments
•NMDS seems arbitrary, but works rather well
•Positives:
 •Generally yields fewer dimensions than PCA, PCoA
 •Does not require full distance matrix (missing values ok)
•Negatives:
 •Arbitrary optimization
 •Results dependent on starting configuration (‘guess’)
 •Does not preserve distances among objects (though that is not the objective)
MDS Example: Bumpus Data
[Figure: NMDS ordination of the Bumpus data (bumpus.new[,1] vs. bumpus.new[,2]), with the PCA plot (PC1 vs. PC2) for comparison; males (red), females (blue)]
Final Stress = 12.828
Canonical Variates Analysis / Discriminant Analysis
•Ordination that maximally discriminates among known groups (g)
•Variation expressed as the ratio CW⁻¹CB (CW = pooled within-group VCV; CB = between-group VCV)
•Decomposition of CW⁻¹CB results in canonical vector space
•Suggests which groups differ on which variables
•Within-group variation in CVA plot is circular
•METHOD COMMONLY MISUSED BY BIOLOGISTS
Historical note: Fisher developed DFA (1936), which was generalized to CVA by Rao (1948; 1952)
DFA/CVA: Protocol
1. Partition variation:
   SSCP_Tot = (Y - Ȳ)ᵗ(Y - Ȳ)
   SSCP_W = Σᵢ (Yᵢ - Ȳᵢ)ᵗ(Yᵢ - Ȳᵢ)
   SSCP_B = SSCP_Tot - SSCP_W
   CB = SSCP_B / (g - 1)
   CW = SSCP_W / (n - g)
2. Obtain canonical axes of CW⁻¹CB:
   (CW⁻¹CB - λᵢI)uᵢ = 0;  U = [u₁ u₂ … u₍g₋₁₎]
   U contains the (g-1) eigenvectors of CW⁻¹CB. However, they are NOT orthogonal, because CW⁻¹CB is square but not symmetric. (These are sometimes called the discriminant functions.)
3. Calculate normalized canonical axes: Cvectors = U(UᵗCWU)^(-1/2)
4. Obtain canonical variates (CVA scores; Yc = centered data): F = YcCvectors
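In practice these steps are wrapped by MASS::lda(); a minimal sketch (iris as stand-in grouped data; lda() normalizes the axes to unit within-group variance, as in step 3):

library(MASS)
cva    <- lda(Species ~ ., data = iris)   # canonical axes of CW^-1 CB
scores <- predict(cva)$x                  # CVA scores: F = Yc Cvectors
plot(scores[, 1], scores[, 2], col = iris$Species,
     xlab = "LD1", ylab = "LD2")          # g - 1 = 2 axes for 3 groups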
CVA: What it Does
•Rotates and shears the data space to the space of normalized canonical axes (group variation will be circular)
[Figure: data space (a) → eigenvectors (b) → canonical axes (c); from Legendre and Legendre (1998)]
DFA/CVA: Classification
•Obtain CVA scores of specimens (and group means)
•Calculate Mahalanobis D² from each object to the group means
•Assign each object to the group to which it is closest
•Determine % misclassified
•Note: ideally this is done with a second set of data NOT used to generate the CVA (called cross-validation); see the sketch below
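lda() performs leave-one-out cross-validation directly, giving an honest misclassification rate; a minimal sketch with iris as stand-in data:

library(MASS)
cv  <- lda(Species ~ ., data = iris, CV = TRUE)   # leave-one-out cross-validation
tab <- table(observed = iris$Species, predicted = cv$class)
sum(diag(tab)) / sum(tab)                         # proportion correctly classified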
CVA Example: Bumpus Data
[Figure: CVA plot (LD1 vs. LD2) of 4 groups (male/female × alive/dead), colored by sex, with the PCA plot (PC1 vs. PC2) for comparison]
Note how groups are MORE separated in the CVA plot vs. the PCA plot
Classification: 85% correct by sex
Data Space Distortion With CVA
•Distances and directions among groups are distorted with CVA
[Figure: original data with 3 equidistant groups; CV1 drawn through the data space; CVA space, where the groups are NOT equidistant]
•CVA should NOT be used to describe patterns and variation in data space, only for describing group differences
Adapted from Klingenberg and Monteiro (2005). Syst. Biol.
Data Space Distortion With CVA
•CV axes are NOT orthogonal in original data space
[Figure: CV axes shown in the original data space vs. the CVA data space]
•Linear discrimination ONLY forms a linear plane IFF within-group covariances are identical (shown as ‘equal probability’ classification lines in the figure)
Adapted from Mitteroecker and Bookstein (2011). Evol. Biol.
Misleading Impression of Group Differences
•Increased number of variables increases discrimination… EVEN for IDENTICAL GROUPS!
Simulation of 50 specimens in each of 3 groups (150 variables using ‘rnorm’: identical mean & variance)
[Figure: CVA plots (LD1 vs. LD2) with 4, 50, and 150 variables; apparent group separation increases with the number of variables]
Adapted from Mitteroecker and Bookstein (2011). Evol. Biol.
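The simulation behind these panels is easy to reproduce in sketch form (identical groups throughout; lda() warns that variables are collinear once p approaches n, which is itself part of the problem):

library(MASS)
set.seed(2)
grp <- factor(rep(1:3, each = 50))          # 3 groups, n = 150, identical by construction
for (p in c(4, 50, 150)) {
  Y  <- matrix(rnorm(150 * p), nrow = 150)  # identical mean & variance for all groups
  sc <- predict(lda(Y, grouping = grp))$x   # CVA scores
  plot(sc[, 1], sc[, 2], col = grp, xlab = "LD1", ylab = "LD2",
       main = paste("CVA with", p, "variables"))
}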
DFA/CVA: Conclusions
•CVA ordination generally not useful
•Distorts distances and directions in data space
•Misrepresents within-group covariation and group distances
•Perceived group differences increase with additional variables (even for identical groups)
•Not overly useful for most applications (better to use PCA, or PCA of group means, for visualizing actual trends)