Multivariate Association
Advanced Biostatistics (EEOB 590C)
Dean C. Adams
Lecture 10
Multivariate Tests of Association
•Correlation assesses association between 2 variables (X & Y)
•What if Y and X are MATRICES (i.e. multivariate)?
•If X & Y are sets of variables, can we assess how they covary?
•Four main approaches: Mantel tests, canonical correlation, two-block partial least squares, Escoufier’s RV
Mantel Test
•General approach to assess relationship between two (or more) data sets
•Compares DISTANCE (or similarity) matrices (Mantel, 1967)
•Can assess association, but can also be used for other designs
•Assumptions: same set of objects used to generate both distance matrices, matrices are independent
Mantel Test of Association
• Protocol
1. Calculate appropriate distance matrices X & Y
2. ‘Unfold’ matrices into vectors of length $n(n-1)/2$
3. Compute Mantel statistic $z_M = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X_{ij} Y_{ij}$ and standardized Mantel coefficient (with $z_M$ computed from standardized distances): $r_M = \dfrac{z_M}{[n(n-1)/2] - 1}$
4. Assess significance through permutation test
• NOTE: permutation performed on OBJECTS in distance matrix, not on unfolded vector (there is a difference)
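The protocol above can be sketched in base R. This is a minimal illustration on simulated data, not the function used in the examples that follow; the objects `shape`, `food`, and the helper `mantel_r` are hypothetical names introduced here. Because the Pearson correlation of the unfolded vectors equals the sum of products of standardized distances divided by $[n(n-1)/2] - 1$, `cor()` is used for $r_M$.

```r
# Minimal Mantel test sketch (simulated data; all object names are hypothetical)
set.seed(1)
n <- 13
shape <- matrix(rnorm(n * 4), n, 4)                   # stand-in for shape variables
food  <- shape[, 1:2] + matrix(rnorm(n * 2), n, 2)    # stand-in for food variables

X <- as.matrix(dist(shape))                           # step 1: distance matrices
Y <- as.matrix(dist(food))

mantel_r <- function(X, Y) {
  lt <- lower.tri(X)                                  # step 2: unfold to n(n-1)/2 values
  cor(X[lt], Y[lt])                                   # step 3: standardized Mantel coefficient
}
r.obs <- mantel_r(X, Y)

# Step 4: permute OBJECTS (rows and columns of X together),
# not the elements of the unfolded vector
nperm <- 999
r.rand <- replicate(nperm, {
  idx <- sample(n)
  mantel_r(X[idx, idx], Y)
})
p.value <- (sum(r.rand >= r.obs) + 1) / (nperm + 1)   # observed value counts as one permutation
```

Permuting `X[idx, idx]` keeps each object's row and column together, which is the distinction the NOTE above draws.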
Example: Mantel Test
•Associate head shape (X) and food use (Y) in salamanders
> mantel(food.dist,shape.dist)
Mantel statistic r: 0.1511
Significance: 0.033
[Figure: plot with points labeled 1-13, and bar chart of prey frequency (0-350) by prey category (Chilopoda, Diplopoda, Oligochaeta, Isopoda, Orthoptera, Diptera, Araneida, Coleoptera, larvae, Gastropoda, Chelonethida, Hymenoptera, Collembola, eggs, Isoptera, Acarina); legend: symp hoff, symp cin]
Note: Chi-square distance used for food as data are counts
Adams and Rohlf (2000). PNAS 97:4106-4111.
Mantel Tests with Design Matrices
•Mantel procedure can be generalized for ‘ANOVA’ designs
•Design matrices describe groups
Same group: 0
Different group: 1
•Mantel statistic then sums a subset of distances (e.g., only between group SS)
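A small base-R sketch of the design-matrix idea, under the assumption of two made-up groups (`species`, `food`, and `design` are hypothetical objects created here). The design matrix holds 0 for same-group pairs and 1 for different-group pairs, so the cross-product with the distance matrix sums only the between-group distances.

```r
# Design-matrix Mantel sketch (simulated data; group labels are invented)
set.seed(2)
species <- factor(rep(c("A", "B"), times = c(7, 6)))
n <- length(species)
food <- matrix(rnorm(n * 3), n, 3)
food[species == "B", 1] <- food[species == "B", 1] + 2   # built-in group difference

food.dist <- as.matrix(dist(food))
# 0 = same group, 1 = different group
design <- outer(as.integer(species), as.integer(species), FUN = "!=") * 1

lt <- lower.tri(food.dist)
z.obs <- sum(food.dist[lt] * design[lt])    # sums only between-group distances
r.obs <- cor(food.dist[lt], design[lt])     # standardized version

# Permute objects in the design matrix (equivalent to shuffling group labels)
r.rand <- replicate(999, {
  idx <- sample(n)
  cor(food.dist[lt], design[idx, idx][lt])
})
p.value <- (sum(r.rand >= r.obs) + 1) / 1000
```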
Example: Design Matrix Mantel Test
•Food Use (Y) differences by species
> mantel(food.dist,species.dist)
Mantel statistic r: 0.2988
Significance: 0.001
Note: fake species designations for example only
[Figure: bar chart of prey frequency (0-350) by prey category; legend: symp hoff, symp cin]
Note: Chi-square distance used for food as data are counts
Adams and Rohlf (2000). PNAS 97:4106-4111.
Three-Way Mantel Test
•Allows a third data matrix to be incorporated (Smouse, Long, Sokal, 1986)
•Addresses whether association of X & Y exists while holding effects of Z constant
•Calculate a partial Mantel coefficient (analogous to partial correlation coefficient)
•Partial Mantel coefficient found from pairwise Mantel coefficients $r_{XY}$, $r_{XZ}$ and $r_{YZ}$: $r_{XY.Z} = \dfrac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{1 - r_{XZ}^2}\,\sqrt{1 - r_{YZ}^2}}$
•Several protocols for assessing significance
Three-Way Mantel Test: Protocol 1
1. Calculate distance matrices X, Y, & Z
2. ‘Unfold’ matrices into vectors of length (n(n-1)/2)
3. Perform regressions of X on Z, and Y on Z, and calculate residuals Xresid and Yresid
4. Compute partial Mantel statistic of Xresid vs. Yresid
5. Assess significance through permutation test
• Procedure extremely general and can be used for 4 (or more) matrices
Three-Way Mantel Test: Protocol 2
1. Calculate $r_{XY}$, $r_{XZ}$ and $r_{YZ}$
2. Calculate partial Mantel coefficient: $r_{XY.Z} = \dfrac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{1 - r_{XZ}^2}\,\sqrt{1 - r_{YZ}^2}}$
3. Permute X matrix and recalculate $r_{XY}$ & $r_{XZ}$
4. Compare observed $r_{XY.Z}$ to randomly generated $r_{XY.Z}$
• In general, residual approach (#1) faster to implement, but can have type I error rate > 0.05, particularly for small sample sizes (see Legendre and Legendre, 1998)
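The two protocols can be compared in a short base-R sketch. It assumes simulated data and hypothetical helper names (`unfold`, `partial_r`); the coefficient is computed both from the formula (Protocol 2) and from regression residuals (Protocol 1), and the permutation step follows Protocol 2 by permuting the objects of X only.

```r
# Partial (three-way) Mantel sketch (simulated data; names are hypothetical)
set.seed(3)
n <- 13
unfold <- function(M) M[lower.tri(M)]
Zraw <- matrix(rnorm(n * 2), n, 2)
Xraw <- Zraw + matrix(rnorm(n * 2), n, 2)
Yraw <- Zraw + matrix(rnorm(n * 2), n, 2)
X <- as.matrix(dist(Xraw))
Y <- as.matrix(dist(Yraw))
Z <- as.matrix(dist(Zraw))

partial_r <- function(X, Y, Z) {
  rXY <- cor(unfold(X), unfold(Y))
  rXZ <- cor(unfold(X), unfold(Z))
  rYZ <- cor(unfold(Y), unfold(Z))
  (rXY - rXZ * rYZ) / (sqrt(1 - rXZ^2) * sqrt(1 - rYZ^2))
}
r.obs <- partial_r(X, Y, Z)

# Protocol 1: the same coefficient from residuals of regressions on Z
x.res <- resid(lm(unfold(X) ~ unfold(Z)))
y.res <- resid(lm(unfold(Y) ~ unfold(Z)))
r.resid <- cor(x.res, y.res)

# Protocol 2: permute objects of X, recompute rXY and rXZ (rYZ is unchanged)
r.rand <- replicate(999, {
  idx <- sample(n)
  partial_r(X[idx, idx], Y, Z)
})
p.value <- (sum(r.rand >= r.obs) + 1) / 1000
```

The observed coefficient is identical under both protocols; they differ only in what is permuted when building the null distribution.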
Example: Three-Way Mantel Test
•Head Shape (X) vs. Food Use (Y) | Species (Z)
> mantel.partial(shape.dist,food.dist,species.dist,permutations=999)
Mantel statistic r: 0.0403
Significance: 0.286
Note: fake species designations for example only
[Figure: plot with points labeled 1-13, and bar chart of prey frequency (0-350) by prey category; legend: symp hoff, symp cin]
Note: Chi-square distance used for food as data are counts
Adams and Rohlf (2000). PNAS 97:4106-4111.
Mantel Test: Conclusions
•Exceedingly general and useful approach
•Datasets can have 1+ variables each
•Different distances can be used for different types of data
•HOWEVER,
•Can have low power (particularly when N < 10)
•Has lower power than GLM or permutational-MANOVA when
data can be analyzed as such (for details on both issues, see Legendre)
•General recommendation: use perm-MANOVA when possible
Complications: Different Matrix Types
•Sometimes, one data set contains variables (X), and another is in the form of distances ($D_Y$). What to do?
•1: Convert X to distances, use Mantel test
•2: Convert D to data for GLM
•Generate Y variables with PCoA
•Run multiple regression (i.e. GLM): $\hat{Y} = X (X^t X)^{-1} X^t Y$
(see Legendre and Anderson, 1999. Ecol. Monogr. 64:1-14)
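Option 2 can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, `cmdscale()` is used for the PCoA step, and `X`, `Yraw`, and `D` are hypothetical objects. Because the distances here are Euclidean, the PCoA axes reproduce them exactly; for other distance measures that will not generally hold.

```r
# PCoA followed by multivariate regression (simulated data; names are hypothetical)
set.seed(4)
n <- 20
X <- cbind(1, matrix(rnorm(n * 2), n, 2))     # predictors, with an intercept column
Yraw <- matrix(rnorm(n * 3), n, 3)
D <- dist(Yraw)                               # pretend only the distances are available

pcoa <- cmdscale(D, k = 3, eig = TRUE)        # PCoA: coordinates recovered from distances
Y <- pcoa$points                              # these act as the 'Y variables'

# Predicted values from the regression of Y on X
Yhat <- X %*% solve(t(X) %*% X) %*% t(X) %*% Y
```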
Combining Data Types
•Sometimes, one has discrete AND continuous data
•Recall: ALL distance measures require data in commensurate units
• $D_{Euclid}$ requires all Y are continuous
• $D_{Hamming}$ requires all Y are 0/1
•DO NOT combine for single distance!
•Instead, consider
A: Separate PCoA analyses on each data type
B: Combine the resulting continuous variables (PCoA axes) for subsequent analyses*
*Considerable work must be done to evaluate efficacy of the approach for particular datasets. Be CAUTIOUS!
Canonical Correlation Analysis (CCorA)
•Identify maximal rXY from pairs of linear combinations (LC) of each data set
•LC constrained to be orthogonal within each dataset (like PC axes)
•LC constrained to be orthogonal between datasets (e.g., Y1 ┴ X2)
•CCorA = multiple regression (when 1 data set has a single variable)
•Canonical: simplest reduction of a set of functions that does not lose generality (e.g., the canonical form of a VCV matrix is its set of eigenvalues, which perfectly express the variation in the VCV)
CCorA: Protocol
1. Calculate $R = \begin{bmatrix} R_{XX} & R_{XY} \\ R_{YX} & R_{YY} \end{bmatrix}$
2. Calculate canonical axes for each set of variables
• Calculate matrices: $A = R_{XX}^{-1/2} R_{XY} R_{YY}^{-1} R_{YX} R_{XX}^{-1/2}$ and $B = R_{YY}^{-1/2} R_{YX} R_{XX}^{-1} R_{XY} R_{YY}^{-1/2}$
• Linear combinations (U) for data set 1 are from eigenanalysis of A
• Linear combinations (V) for data set 2 are from eigenanalysis of B
3. Canonical correlations are $\lambda^{1/2}$ from eigenanalysis of: $C = R_{YY}^{-1} R_{YX} R_{XX}^{-1} R_{XY}$
4. Statistical significance determined with Pillai’s trace of C (and often randomization)
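The matrix steps above can be traced in base R. This is a sketch on simulated data, with hypothetical names (`inv_sqrt`, `Cx`, `Cy`). One step is added as an assumption beyond the slide: the eigenvectors of A and B are back-transformed by $R_{XX}^{-1/2}$ and $R_{YY}^{-1/2}$ to give weights for the standardized variables, which is the usual way to obtain canonical scores.

```r
# Canonical correlation from the blocks of R (simulated data; names are hypothetical)
set.seed(5)
n <- 50
X <- matrix(rnorm(n * 3), n, 3)
Y <- cbind(X[, 1] + rnorm(n), matrix(rnorm(n * 2), n, 2))

R <- cor(cbind(X, Y))                                  # step 1
ix <- 1:3; iy <- 4:6
Rxx <- R[ix, ix]; Rxy <- R[ix, iy]
Ryx <- R[iy, ix]; Ryy <- R[iy, iy]

# Inverse square root of a symmetric positive-definite matrix
inv_sqrt <- function(M) {
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

A <- inv_sqrt(Rxx) %*% Rxy %*% solve(Ryy) %*% Ryx %*% inv_sqrt(Rxx)   # step 2
B <- inv_sqrt(Ryy) %*% Ryx %*% solve(Rxx) %*% Rxy %*% inv_sqrt(Ryy)
eA <- eigen(A, symmetric = TRUE)
eB <- eigen(B, symmetric = TRUE)

can.cor <- sqrt(eA$values)                             # step 3: canonical correlations
C <- solve(Ryy) %*% Ryx %*% solve(Rxx) %*% Rxy
pillai <- sum(diag(C))                                 # step 4: Pillai's trace of C

# Assumed extra step: weights for standardized variables, then canonical scores
U <- inv_sqrt(Rxx) %*% eA$vectors
V <- inv_sqrt(Ryy) %*% eB$vectors
Cx <- scale(X) %*% U
Cy <- scale(Y) %*% V
abs(cor(Cx[, 1], Cy[, 1]))                             # matches can.cor[1]
```

A randomization test would shuffle the rows of one data set and recompute `pillai` each time.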
Example: Canonical Correlation
•Associate head shape (X) and food use (Y) in salamanders
> CCorA(shape,food,nperm=1000)
Pillai's trace: 5.800587
Significance of Pillai's trace:
based on 1000 permutations: 0.02297702
from F-distribution: 0.027531
> cor(res$Cx[,1],res$Cy[,1])
[1] 0.9167563
[Figure: plot with points labeled 1-13, and bar chart of prey frequency (0-350) by prey category; legend: symp hoff, symp cin]
Note: Chi-square distance used for food as data are counts
Adams and Rohlf (2000). PNAS 97:4106-4111.
Two-Block Partial Least Squares (2B-PLS)
•Identify maximal CovXY from pairs of LC of each data set
•LC ONLY constrained to be orthogonal within each dataset (not between)
•Calculations less complicated (fewer mathematical constraints)
2B-PLS: Protocol
1. Calculate $S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$
2. Linear combinations for set X (U) and set Y (V) found from SVD of: $S_{XY} = U D V^t$
3. The % covariation explained by each combination is the square of its singular value (diagonal of D), relative to the sum of all squared singular values
4. Correlations determined by projecting data on U and V, and computing product-moment correlation (see Rohlf and Corti, 2000. Syst. Biol. 49:740-753)
5. Statistical significance can be determined with randomization
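A base-R sketch of these steps, assuming simulated data and hypothetical object names (`x.scores`, `y.scores`, `r.pls`). The randomization shown shuffles the rows of Y and recomputes the first-axis correlation, which is one reasonable choice rather than the only one.

```r
# Two-block PLS via SVD of the cross-covariance block (simulated data; names are hypothetical)
set.seed(6)
n <- 40
X <- matrix(rnorm(n * 4), n, 4)
Y <- cbind(X[, 1] + rnorm(n), X[, 2] + rnorm(n), rnorm(n))
Xc <- scale(X, scale = FALSE)                  # centered data
Yc <- scale(Y, scale = FALSE)

Sxy <- cov(X, Y)                               # step 1: off-diagonal block of S
s <- svd(Sxy)                                  # step 2: Sxy = U D V^t
U <- s$u; V <- s$v
pct.cov <- 100 * s$d^2 / sum(s$d^2)            # step 3: % covariation per pair of axes

x.scores <- Xc %*% U                           # step 4: project data on U and V
y.scores <- Yc %*% V
r.pls <- cor(x.scores[, 1], y.scores[, 1])

# Step 5: randomization (shuffle rows of Y, redo the SVD, recompute the correlation)
r.rand <- replicate(999, {
  Yp <- Yc[sample(n), , drop = FALSE]
  sp <- svd(cov(Xc, Yp))
  cor(drop(Xc %*% sp$u[, 1]), drop(Yp %*% sp$v[, 1]))
})
p.value <- (sum(r.rand >= r.pls) + 1) / 1000
```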
Example: 2B-PLS
•Associate head shape (X) and food use (Y) in salamanders
> cor(pls.res$scores[,1],pls.res$Yscores[,1])
[1] 0.7590776
Prand = 0.0001
[Figure: plot with points labeled 1-13, and bar chart of prey frequency (0-350) by prey category; legend: symp hoff, symp cin]
Note: Chi-square distance used for food as data are counts
Adams and Rohlf (2000). PNAS 97:4106-4111.
Interpreting Association
•Use CA loadings to interpret each LC axis (as in PCA)
•Loadings on food axis (PLS):
prey      loading
oligo      0.09
gastro    -0.018
isopo      0.0424
diplo      0.0551
chilop     0.0933
acar      -0.507
aranei     0.2537
chelo     -0.112
coleo      0.4949
collem    -0.394
dipter     0.1658
hymen      0.3649
isopt     -0.119
orthop     0.1263
larvae     0.2158
eggs      -0.071
•Food axis mainly describes contrast of small vs. large prey items
Strength of Association: Escoufier’s RV
•Express CovXY relative to CovXX and CovYY (between relative to within)
1: Calculate S: $S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$
2: Estimate: $RV = \dfrac{tr(S_{XY} S_{YX})}{\sqrt{tr(S_{XX} S_{XX})\, tr(S_{YY} S_{YY})}}$
3: Significance via permutation
•RV: relative covariation on scale of 0 to 1
•RV is thus analogous to squared correlation coefficient
See Klingenberg 2009
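A compact base-R sketch of the RV coefficient with a permutation test, assuming simulated data; `rv_coef` is a hypothetical helper written for this illustration. Rows of Y are shuffled to break the association between blocks while leaving the within-block covariances intact.

```r
# Escoufier's RV with a permutation test (simulated data; rv_coef is a hypothetical helper)
set.seed(7)
n <- 40
X <- matrix(rnorm(n * 4), n, 4)
Y <- cbind(X[, 1] + rnorm(n), X[, 2] + rnorm(n), rnorm(n))

rv_coef <- function(X, Y) {
  Sxx <- cov(X); Syy <- cov(Y); Sxy <- cov(X, Y)
  sum(diag(Sxy %*% t(Sxy))) /
    sqrt(sum(diag(Sxx %*% Sxx)) * sum(diag(Syy %*% Syy)))
}
rv.obs <- rv_coef(X, Y)

# Shuffle rows of Y: Sxx and Syy are unchanged, only Sxy varies
rv.rand <- replicate(999, rv_coef(X, Y[sample(n), , drop = FALSE]))
p.value <- (sum(rv.rand >= rv.obs) + 1) / 1000
```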
Example: Escoufier’s RV
•Associate head shape (X) and food use (Y) in salamanders
> sum(diag(S12%*%t(S12)))/
sqrt(sum(diag(S11%*%t(S11)))%*%sum(diag(S22%*%t(S22))))
[,1]
[1,] 0.3351979
Prand = 0.0001
[Figure: plot with points labeled 1-13, and bar chart of prey frequency (0-350) by prey category; legend: symp hoff, symp cin]
Adams and Rohlf (2000). PNAS 97:4106-4111.
Summary: Multivariate Association
•Several ways to assess covariation between sets of variables
•Mantel Tests: associate distance matrices (can be design matrix)
•CCorA: maximum rXY using between & within ┴ constraints
•PLS: maximum CovXY using within ┴ constraints
•RV: CovXY relative to CovXX & CovYY
Canonical Ordination Approaches
•Considers association of two matrices (X & Y)
•Provides visualization (ordination) from Y~X (differing from PCA/PCoA)
•Several canonical ordination methods
•Canonical Variates Analysis (CVA)
•Redundancy Analysis (RDA)
•Canonical Correspondence Analysis (CCA)
*Mathematically, the canonical form (from Greek κανων, ‘kanôn’) is the simplest and most comprehensive representation of a relationship, without losing generality
Canonical Analysis: General Comments
•Recall univariate multiple regression: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
•Here, predicted values (Ŷ) are a 1-dimensional ‘ordination’ of original Y
data (along the regression line)
•Regression maximizes R2 between Y and Ŷ
•Represents the optimal LS relationship between X and Y
•Canonical analyses share this property for multivariate Y, and generate
ordinations of Y constrained by the maximal LS relationship to X
•Consider these different ways of dealing with joint-variation in X&Y: $S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$
•RDA and CCA do so in a predictive sense (i.e., regression)
•Canonical Correlation and 2B-PLS also deal with this, but in a ‘correlational’ sense
Canonical Variates Analysis/Discriminant Analysis
•Ordination that maximally discriminates among known groups (g)
•Variation expressed as ratio $C_W^{-1} C_B$ ($C_W$ = pooled within-group VCV; $C_B$ = between-group VCV)
•Decomposition of $C_W^{-1} C_B$ results in canonical vector space
•Suggests which groups differ on which variables
•Within-group variation in CVA plot is circular
•METHOD COMMONLY MISUSED BY BIOLOGISTS
Historical note: Fisher developed DFA (1936), which was generalized to CVA by Rao (1948; 1952)
DFA/CVA: Protocol
1. Partition variation:
$SSCP_{Tot} = (Y - \bar{Y})^t (Y - \bar{Y})$
$SSCP_W = \sum_i (Y_i - \bar{Y}_i)^t (Y_i - \bar{Y}_i)$
$SSCP_B = SSCP_{Tot} - SSCP_W$
$C_B = \dfrac{SSCP_B}{g-1}$, $C_W = \dfrac{SSCP_W}{n-g}$
2. Obtain canonical axes of $C_W^{-1} C_B$: $(C_W^{-1} C_B - \lambda_i I)\, u_i = 0$, with $U = [u_1 \; u_2 \; \dots \; u_{g-1}]$
U contains (g-1) eigenvectors of $C_W^{-1} C_B$ (sometimes called the discriminant functions). However, they are NOT orthogonal, because $C_W^{-1} C_B$ is square, but not symmetric.
3. Calculate normalized canonical axes: $C_{vectors} = U (U^t C_W U)^{-1/2}$
4. Obtain canonical variates (CVA scores, from $Y_c$ = centered data): $F = Y_c C_{vectors}$
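The four steps can be followed in a base-R sketch. It assumes simulated data with three made-up groups; `grp`, `C.vectors`, and `F.scores` are hypothetical names. `Re()` is used because `eigen()` on a non-symmetric matrix may return complex storage even when the leading eigenvalues are real.

```r
# CVA from the SSCP partition (simulated data; names are hypothetical)
set.seed(8)
g <- 3; n.per <- 20; p <- 4
grp <- factor(rep(1:g, each = n.per))
n <- length(grp)
Y <- matrix(rnorm(n * p), n, p)
Y[grp == 2, 1] <- Y[grp == 2, 1] + 3           # shift group means so groups differ
Y[grp == 3, 2] <- Y[grp == 3, 2] + 3

# Step 1: partition variation
Yc <- scale(Y, scale = FALSE)                  # centered on grand mean
Yw <- Y - apply(Y, 2, function(v) ave(v, grp)) # centered on own group mean
SSCP.tot <- t(Yc) %*% Yc
SSCP.w <- t(Yw) %*% Yw
SSCP.b <- SSCP.tot - SSCP.w
C.b <- SSCP.b / (g - 1)
C.w <- SSCP.w / (n - g)

# Step 2: eigenvectors of the non-symmetric matrix C.w^-1 C.b (first g-1 only)
e <- eigen(solve(C.w) %*% C.b)
U <- Re(e$vectors[, 1:(g - 1)])

# Step 3: normalize so within-group variation becomes circular
M <- t(U) %*% C.w %*% U
eM <- eigen(M, symmetric = TRUE)
M.inv.sqrt <- eM$vectors %*% diag(1 / sqrt(eM$values)) %*% t(eM$vectors)
C.vectors <- U %*% M.inv.sqrt

# Step 4: canonical variate scores
F.scores <- Yc %*% C.vectors
```

After step 3, the pooled within-group covariance of the scores is the identity matrix, which is why within-group scatter looks circular in a CVA plot.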
CVA: What it Does
• Rotates and shears data space to space of normalized canonical
axes (group variation will be circular)
[Figure: data space (a) → eigenvectors (b) → canonical axes (c). From Legendre and Legendre (1998)]
CVA Example: Pupfish Data
[Figure: two panels. "Actual data": PCA (PC1 vs. PC2). "View with CVA": LD1 vs. LD2 (note group separation). Groups: Salty ♀ ●, Salty ♂ ■, Fresh Water ♀ ●, Fresh Water ♂ ■]
Data courtesy of M. Collyer (Unpubl.).
CVA: Comments
•Ordination is ‘canonical’ in that it provides plot of specimens (Y)
that maximally separates a priori groups (X)
•Canonical vectors describe linear combinations of variables
that maximally distinguish group identity (X)
•Distorts actual relationships in dataspace by shearing along axes
of discrimination among groups
•Groups appear more separated than they actually are
•Not a faithful representation of the dataspace!
Data Space Distortion With CVA
•Distances and directions among groups distorted with CVA
[Figure: three panels: original data with 3 equidistant groups; CV1 through data space; CVA space, where groups are NOT equidistant]
•CVA should NOT be used to describe patterns and variation in data space, only for describing group differences
Adapted from Klingenberg and Monteiro. (2005). Syst. Biol.
Data Space Distortion With CVA
•CV axes NOT orthogonal in original data space
[Figure: original data space vs. CVA data space, with equal-probability classification lines]
•Linear discrimination ONLY forms a linear plane IFF within-group covariances are identical (shown as ‘equal probability classification lines’ in the figure)
Adapted from Mitteroecker and Bookstein. (2011). Evol. Biol.
Misleading Impression of Group Differences
•Increased number of variables increases discrimination…
EVEN for IDENTICAL GROUPS!
Simulation of 50 specimens in each of 3 groups (150 variables using ‘rnorm’: identical mean & variance)
[Figure: three LD1 vs. LD2 plots: CVA with 4 variables, CVA with 50 variables, CVA with 150 variables]
Adapted from Mitteroecker and Bookstein. (2011). Evol. Biol.
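The effect can be reproduced in spirit with a short base-R simulation. This sketch departs from the slide in one respect, stated as an assumption: it stops at 100 variables rather than 150, because with 150 specimens the pooled within-group covariance matrix is singular at 150 variables and cannot be inverted with `solve()`. The helper `cv1_ratio` is hypothetical; it returns the largest eigenvalue of $C_W^{-1} C_B$, i.e. between-group relative to within-group variation along CV1.

```r
# Identical groups look increasingly 'different' as variables are added
set.seed(9)
g <- 3; n.per <- 50
grp <- factor(rep(1:g, each = n.per))
n <- length(grp)
Yall <- matrix(rnorm(n * 100), n, 100)         # same mean & variance in every group

cv1_ratio <- function(p) {
  Y <- Yall[, 1:p, drop = FALSE]               # use the first p variables
  Yw <- Y - apply(Y, 2, function(v) ave(v, grp))
  SSCP.w <- crossprod(Yw)
  SSCP.b <- crossprod(scale(Y, scale = FALSE)) - SSCP.w
  C.w <- SSCP.w / (n - g)
  C.b <- SSCP.b / (g - 1)
  max(Re(eigen(solve(C.w) %*% C.b, only.values = TRUE)$values))
}

ratios <- sapply(c(4, 50, 100), cv1_ratio)     # grows with the number of variables
```

Because the variable sets are nested, the ratio cannot decrease as variables are added, even though no true group differences exist.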
DFA/CVA: Conclusions
•CVA ordination not useful
•Distorts distances and directions in data space
•Misrepresents within-group covariation and group distances
•Perceived group differences increase with additional variables
(even for identical groups)
•Undistorted (i.e. ‘pure’) view of multivariate space: PCA (or PCA from predicted values [e.g., group means] for visualizing actual trends)
Redundancy Analysis
•Direct extension of multiple regression for multivariate Y
•Redundancy synonymous with ‘explained variance’
•RDA is a constrained ordination of Y such that ordination vectors are
linear combinations of Y and linear combinations of X
•RDA is eigenanalysis of VCV of predicted values (Ŷ) from multivariate multiple regression
•Thus, RDA preserves Euclidean distances of objects in space of predicted values Ŷ (appropriate for continuous Y variables)
RDA: Computations
•Center X and Y variables and standardize*
•Perform multivariate multiple regression and obtain predicted values: $\hat{Y} = X (X^t X)^{-1} X^t Y$
•Calculate VCV of predicted values: $S_{\hat{Y}^t \hat{Y}} = \dfrac{1}{n-1} \hat{Y}^t \hat{Y} = S_{YX} S_{XX}^{-1} S_{YX}^t$, using the blocks of $S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$
•Ordination from PCA of $S_{\hat{Y}^t \hat{Y}} = C \Lambda C^t$
•NOTE: If X or Y begins as a distance matrix, first perform PCoA to generate a set of ‘variables’ for RDA (see Legendre and Anderson, 1999. Ecol. Monogr. 64:1-14)
*steps are not absolutely necessary, but simplify computations, and place variables in context for direct comparison
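These computations can be sketched in base R on simulated data; all object names (`Yhat`, `S.yhat`, `rda.scores`, `redundancy`) are hypothetical. The sketch also checks that the VCV of the predicted values equals the expression built from the blocks of S, and reports the proportion of variance in Y that is explained by X.

```r
# RDA as PCA of the fitted values (simulated data; names are hypothetical)
set.seed(10)
n <- 30
X <- scale(matrix(rnorm(n * 2), n, 2))                          # centered, standardized
Y <- scale(cbind(X[, 1] + rnorm(n), X[, 2] + rnorm(n), rnorm(n)))

# Multivariate multiple regression: predicted values
Yhat <- X %*% solve(t(X) %*% X) %*% t(X) %*% Y

# VCV of predicted values, two equivalent ways
S.yhat <- t(Yhat) %*% Yhat / (n - 1)
Sxx <- cov(X); Syx <- cov(Y, X)
S.alt <- Syx %*% solve(Sxx) %*% t(Syx)

# Ordination: PCA (eigenanalysis) of the VCV of predicted values
e <- eigen(S.yhat, symmetric = TRUE)
k <- ncol(X)                                   # at most min(p, q) non-zero axes
rda.scores <- Yhat %*% e$vectors[, 1:k]

# Redundancy: explained variance as a proportion of total variance in Y
redundancy <- sum(e$values[1:k]) / sum(diag(cov(Y)))
```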
RDA: What it Does
•Ordination provides plot of objects (Y) as maximally described by
independent variables (X)
•RDA: ordination of ‘fitted’ values based on GLM (Ŷ)
•Eigenvector loadings describe relative contributions of each variable to
ordination on that canonical axis (interpret like PC loadings)
•Ordination can be shown as biplot of X (as vectors) in PCA of Ŷ
•Reflects relative importance of variables on ordination
•Ordination is ‘standardized’ for regression of Y on X
RDA Example: Pupfish
[Figure: two panels. "Actual data": PCA (PC1 vs. PC2). "View with RDA": RDA 1 vs. RDA 2. Groups: Salty ♀ ●, Salty ♂ ■, Fresh Water ♀ ●, Fresh Water ♂ ■]
RDA: X: groups and SVL; Y: 3 body depth measurements
NOTE: in this case, RDA has done exactly what one should NOT do: fit a common slope when groups are diverging.
RDA: Comments
•Y~X step is useful, though just GLM
•Ordination step can lead to biological misinterpretation!
•If wrong model of Y~X, ordination not representative of pattern
•RDA should NOT be used to describe patterns and variation in data
•RDA only shows predicted patterns (based on a model, X)
•Exceptionally easy to misuse (knowingly or unknowingly)
•Provides false sense of pattern relative to error (noise)
•An incorrect model (X) results in incorrect ordination
Canonical Correspondence Analysis
•Extends correspondence analysis of Y to predictive framework
•CCA is a constrained ordination of Y such that ordination vectors are
linear combinations of Y and linear combinations of X
•Conceptually, CCA is the same as RDA, but with computational
adjustments for the nature of the Y data (which is not continuous)
•CCA is eigenanalysis of VCV of predicted values (Ŷ) from a weighted form of RDA
•Thus, CCA preserves Chi-square distances of objects in space of predicted values Ŷ (appropriate for frequency or presence/absence Y variables)
CCA: Computations
1. Calculate matrix (P) of relative frequencies (proportions) from contingency table data: $p_{ij} = f_{ij} / f_{tot}$
2. Calculate elements of matrix Q as: $q_{ij} = \dfrac{p_{ij} - p_{i+}\, p_{+j}}{\sqrt{p_{i+}\, p_{+j}}}$
(matrix is centered by row and column means, hence ‘reciprocal averaging’)
3. Perform weighted regression, where $D_{p_{i+}}$ are weights: $B = (X^t D_{p_{i+}} X)^{-1} X^t D_{p_{i+}}^{0.5} Q$ and $\hat{Y} = D_{p_{i+}}^{0.5} X B$
4. Calculate VCV of predicted values: $S_{\hat{Y}^t \hat{Y}} = \hat{Y}^t \hat{Y}$
5. Ordination from PCA of $S_{\hat{Y}^t \hat{Y}} = C \Lambda C^t$
For details see Legendre and Legendre 1998
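A base-R sketch of these steps on a simulated count table follows. The objects `f`, `X`, `p.row`, and `p.col` are hypothetical. One step is an assumption added here and not shown on the slide: X is first centered using the row weights $p_{i+}$, as in the weighted standardization described by Legendre and Legendre (1998).

```r
# CCA as weighted regression followed by eigenanalysis (simulated data; names are hypothetical)
set.seed(11)
n <- 20; q <- 5
f <- matrix(rpois(n * q, lambda = 5) + 1, n, q)   # count table with no empty rows or columns
X <- matrix(rnorm(n * 2), n, 2)                   # explanatory variables

P <- f / sum(f)                                   # step 1: relative frequencies
p.row <- rowSums(P)                               # p_i+
p.col <- colSums(P)                               # p_+j
E <- outer(p.row, p.col)                          # expected proportions p_i+ * p_+j
Q <- (P - E) / sqrt(E)                            # step 2

# Assumption: weighted centering of X with row weights p_i+
Xw <- sweep(X, 2, colSums(X * p.row))

D <- diag(p.row)                                  # step 3: weighted regression
D.half <- diag(sqrt(p.row))
B <- solve(t(Xw) %*% D %*% Xw) %*% t(Xw) %*% D.half %*% Q
Yhat <- D.half %*% Xw %*% B

S.yhat <- t(Yhat) %*% Yhat                        # step 4
e <- eigen(S.yhat, symmetric = TRUE)              # step 5: canonical eigenvalues and axes

total.inertia <- sum(Q^2)                         # chi-square statistic / f_tot
constrained <- sum(e$values[1:ncol(X)])           # portion explained by X
```

The ratio `constrained / total.inertia` plays the same role here that explained variance plays in RDA.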
CCA: Comments
•Ordination provides plot of objects (Q) as maximally described by
independent variables (X)
•CCA: ordination of ‘fitted’ values NOT of Y itself (based on Ŷ )
•Plot interpreted as in CA: ordination in ‘frequency space’
•CCA ordination can be shown as biplot of X (as vectors) in PCA of Ŷ
•Reflects relative importance of variables on ordination
•Ordination is ‘standardized’ for regression of Y on X
Conclusions
•CCorA & 2B-PLS provide useful multivariate extension to correlation
•Canonical ordination approaches (CVA, RDA, CCA) provide adjusted
view of data space (based on X)
•Elegant mathematically, but do NOT provide true view of data space
•Canonical ordinations should not be used to describe patterns of
variation
•Actual patterns should always be viewed first (e.g., through PCA)
to understand true biological variation
•Use EXTREME CAUTION when interpreting canonical ordination
plots: incorrect model leads to misleading plot!!!!!!