Multivariate Association Advanced Biostatistics Dean C. Adams Lecture 10 EEOB 590C 1 Multivariate Tests of Association •Correlation assesses association between 2 variables (X & Y) •What if Y and X are MATRICES (i.e. multivariate)? •If X & Y are sets of variables, can we assess how they covary? •Four main approaches: Mantel tests, canonical correlation, twoblock partial least squares, Escoufier’s RV 2 Mantel Test •General approach to assess relationship between two (or more) data sets •Compares DISTANCE (or similarity) matrices (Mantel, 1967) •Can assess association, but can be used for other designs •Assumptions: same set of objects used to generate both distance matrices, matrices are independent 3 Mantel Test of Association • Protocol 1. Calculate appropriate distance matrices X & Y 2. ‘Unfold’ matrices into vectors nof length (n(n-1)/2) 1 n 3. Compute Mantel statistic zM X ijYij and standardized i 1 j i 1 Mantel coefficient: zM rM n n 1 2 1 4. Assess significance through permutation test • NOTE: permutation performed on OBJECTS in distance matrix, not on unfolded vector (there is a difference) 4 Example: Mantel Test •Associate head shape (X) and food use (Y) in salamanders 12 11 13 > mantel(food.dist,shape.dist) 8 3 5 9 7 6 10 1 4 2 Hymenoptera eggs Mantel statistic r: 0.1511 Significance: 0.033 Collembola Gastropoda Chelonethida Isoptera Acarina Diplopoda Araneida Isopoda Coleoptera Chilopoda larvae Diptera Orthoptera Oligochaeta 350 symp hoff 250 symp cin 200 150 100 50 Chilopoda Diplopoda Oligochaeta Isopoda Orthoptera Diptera Araneida Coleoptera larvae Gastropoda Chelonethida Hymenoptera Collembola eggs Isoptera 0 Acarina frequency 300 Note: Chi-square distance used for food as data are counts Adams and Rohlf (2000). PNAS 97:4106-4111. 5 Mantel Tests with Design Matrices •Mantel procedure can be generalized for ‘ANOVA’ designs •Design matrices describe groups Same group: 0 Different group: 1 •Mantel statistic then sums a subset of distances (e.g., only between group SS) 6 Example: Design Matrix Mantel Test •Food Use (Y) differences by species > mantel(food.dist,species.dist) Mantel statistic r: 0.2988 Significance: 0.001 Note: fake species designations for example only Hymenoptera eggs Collembola Gastropoda Chelonethida Isoptera Acarina Diplopoda Araneida Isopoda Coleoptera Chilopoda larvae Diptera Orthoptera Oligochaeta 350 symp hoff 250 symp cin 200 150 Adams and Rohlf (2000). PNAS 97:4106-4111. 100 50 Chilopoda Diplopoda Oligochaeta Isopoda Orthoptera Diptera Araneida Coleoptera larvae Gastropoda Chelonethida Hymenoptera Collembola eggs Isoptera 0 Acarina frequency 300 Note: Chi-square distance used for food as data are counts 7 Three-Way Mantel Test •Allows a third data matrix to be incorporated (Smouse, Long, Sokal, 1986) •Addresses whether association of X & Y exists while holding effects of Z constant •Calculate a partial Mantel coefficient (analogous to partial correlation coefficient) •Partial Mantel coefficient found from pairwise Mantel coefficients rXY, rXZ and rYZ rXY .Z rXY rXZ rYZ 2 1 rXZ 1 rYZ2 •Several protocols for assessing significance 8 Three-Way Mantel Test: Protocol 1 1. Calculate distance matrices X, Y, & Z 2. ‘Unfold’ matrices into vectors of length (n(n-1)/2) 3. Perform regressions of X on Z, and Y on Z, and calculate residuals Xresid Yresid 4. Compute partial Mantel statistic of Xresid vs. Yresid 5. Assess significance through permutation test • Procedure extremely general and can be used for 4 (or more matrices) 9 Three-Way Mantel Test: Protocol 2 1. 2. 3. 4. Calculate rXY, rXZ and rYZ rXY rXZ rYZ rXY .Z 2 Calculate partial Mantel coefficient: 1 rXZ 1 rYZ2 Permute X matrix and recalculate rXY & rXZ Compare observed rXY.Z to randomly generated rXY.Z • In general, residual approach (#1) faster to implement, but can have type I error rate > 0.05, particularly for small sample sizes (see Legendre and Legendre, 1998) 10 Example: Three-Way Mantel Test •Head Shape (X) vs. Food Use (Y) | Species (Z) > mantel.partial(shape.dist,food.dist,species.dist,permutations=999) Mantel statistic r: 0.0403 Significance: 0.286 Note: fake species designations for example only 12 11 13 10 Hymenoptera eggs Collembola Gastropoda Chelonethida Isoptera Acarina larvae 5 4 Diplopoda Araneida Isopoda Coleoptera Chilopoda 8 9 7 6 3 1 2 Diptera Orthoptera Oligochaeta 350 symp hoff 250 symp cin 200 150 100 50 Chilopoda Diplopoda Oligochaeta Isopoda Orthoptera Diptera Araneida Coleoptera larvae Gastropoda Chelonethida Hymenoptera Collembola eggs Isoptera 0 Acarina frequency 300 Note: Chi-square distance used for food as data are counts Adams and Rohlf (2000). PNAS 97:4106-4111. 11 Mantel Test: Conclusions •Exceedingly general and useful approach •Datasets can have 1+ variables each •Different distances can be used for different types of data •HOWEVER, •Can have low power (particularly when N < 10) •Has lower power than GLM or permutational-MANOVA when data can be analyzed as such (for details on both issues, see Legendre) •General recommendation: use perm-MANOVA when possible 12 Complications: Different Matrix Types •Sometimes, one data set contains variables, and another is in the form of distances. What to do? X DY •1: Convert X to distances, use Mantel test •2: Convert D to data for GLM •Generate Y variables with PCoA •Run multiple regression (i.e. GLM) (see Legendre and Anderson, 1999. Ecol. Monogr. 64:1-14) t ˆ Y X X X X Y t 1 13 Combining Data Types •Sometimes, one has discrete AND continuous data •Recall: ALL distance measures require data in commensurate units • Deuclid requires all Y are continuous • Dhamming requires all Y are 0/1 •DO NOT combine for single distance! •Instead, consider A: Separate PCoA analsyes on each data type B: Combine these continuous variables for subsequent analyses* *Considerable work must be done to evaluate efficacy of the approach for particular datasets. Be CAUTIOUS! 14 Canonical Correlation Analysis(CCorA) •Identify maximal rXY from pairs of linear combinations LC of each •LC constrained to be orthogonal within each dataset (like PC axes) •LC constrained to be orthogonal between datasets (e.g., Y1 ┴ X2) •CCorA = multiple regression (when 1 data set has a single variable) •Canonical: simplest reduction of a set of functions that does not lose generality (e.g., canonical form of VCV are its eigenvalues, which perfectly express the variation in VCV) 15 CCorA: Protocol R XX 1. Calculate R = R YX R XY R YY 2. Calculate canonical axes for each set of variables • • • Calculate matrices: A R XX R XY R YY R YX R XX 2 -1 -1/ 2 B R -1/ R R R R YY XY XX YX YY Linear combinations (U) for data set 1 are from eigenanalysis of A Linear combinations (V) for data set 2 are from eigenanalysis of B -1/ 2 -1 -1/ 2 3. Canonical correlations are l1/2 from eigenanalysis of: C R -1YY R YX R -1XX R XY 4. Statistical significance determined with Pillai’s trace of C (and often randomization) 16 Example: Canonical Correlation •Associate head shape (X) and food use (Y) in salamanders > CCorA(shape,food,nperm=1000) 12 11 13 8 3 5 9 7 6 10 Pillai's trace: 1 4 2 5.800587 Significance of Pillai's trace: based on 1000 permutations: 0.02297702 from F-distribution: 0.027531 > cor(res$Cx[,1],res$Cy[,1]) [1] 0.9167563 Hymenoptera eggs 350 Collembola Gastropoda Chelonethida Isoptera Acarina Diplopoda Araneida Isopoda Coleoptera Chilopoda larvae Diptera Orthoptera Oligochaeta symp hoff 250 symp cin 200 150 100 50 Chilopoda Diplopoda Oligochaeta Isopoda Orthoptera Diptera Araneida Coleoptera larvae Gastropoda Chelonethida Hymenoptera Collembola eggs Isoptera 0 Acarina frequency 300 Note: Chi-square distance used for food as data are counts Adams and Rohlf (2000). PNAS 97:4106-4111. 17 Two-Block Partial Least Squares (2B-PLS) •Identify maximal CovXY from pairs of LC of each •LC ONLY constrained to be orthogonal within each dataset (not between) •Calculations less complicated (fewer mathematical constraints) 18 2B-PLS: Protocol S XX 1. Calculate S = S YX S XY S YY 2. Linear Combinations for set X (U) and set Y (V) found from t S = UDV SVD of: XY 3. The % covariation explained by each combination is the square of the singular values (diagonal of D) 4. Correlations determined by projecting data on U and V, and computing Product-moment correlation (see Rohlf and Corti, 2000. Syst. Biol. 49:740-753) 5. Statistical significance can be determined with randomization 19 Example: 2B-PLS •Associate head shape (X) and food use (Y) in salamanders > cor(pls.res$scores[,1],pls.res$Yscores[,1]) 12 11 13 8 3 5 9 7 6 10 [1] 0.7590776 1 4 2 Prand = 0.0001 Hymenoptera eggs 350 Collembola Gastropoda Chelonethida Isoptera Acarina Diplopoda Araneida Isopoda Coleoptera Chilopoda larvae Diptera Orthoptera Oligochaeta symp hoff 250 symp cin 200 150 100 50 Chilopoda Diplopoda Oligochaeta Isopoda Orthoptera Diptera Araneida Coleoptera larvae Gastropoda Chelonethida Hymenoptera Collembola eggs Isoptera 0 Acarina frequency 300 Note: Chi-square distance used for food as data are counts Adams and Rohlf (2000). PNAS 97:4106-4111. 20 Interpreting Association •Use CA loadings to interpret each LC axis (as in PCA) •Loadings on food axis (PLS): oligo 0.09 gastro -0.018 isopo 0.0424 diplo 0.0551 chilop 0.0933 acar -0.507 aranei 0.2537 chelo -0.112 coleo collem 0.4949 -0.394 dipter 0.1658 hymen 0.3649 isopt -0.119 orthop 0.1263 larvae 0.2158 eggs -0.071 •Food axis mainly describes contrast of small vs. large prey items 21 Strength of Assocation: Escoufier’s RV •Express CovXY relative to CovXX and CovYY (between relative to within) 1: Calculate S: 2: Estimate: S XX S= S YX S XY S YY tr (S XY S YX ) RV tr (S XXS XX )tr (S YY S YY ) 3:Significance via permutation •RV: relative covariation on scale of 01 •RV is thus analogous to squared correlation coefficient See Klingenberg 2009 22 Example: Escoufier’s RV •Associate head shape (X) and food use (Y) in salamanders > sum(diag(S12%*%t(S12)))/ sqrt(sum(diag(S11%*%t(S11)))%*%sum(diag(S22%*%t(S22)))) 12 11 13 8 3 5 [,1] [1,] 0.3351979 9 7 6 10 1 4 2 Prand = 0.0001 Hymenoptera eggs 350 Collembola Gastropoda Chelonethida Isoptera Acarina Diplopoda Araneida Isopoda Coleoptera Chilopoda larvae Diptera Orthoptera Oligochaeta symp hoff 250 symp cin 200 150 100 Adams and Rohlf (2000). PNAS 97:4106-4111. 50 Chilopoda Diplopoda Oligochaeta Isopoda Orthoptera Diptera Araneida Coleoptera larvae Gastropoda Chelonethida Hymenoptera Collembola eggs Isoptera 0 Acarina frequency 300 23 Summary: Multivariate Association •Several ways to assess covariation between sets of variables •Mantel Tests: associate distance matrices (can be design matrix) •CCorA: maximum rXY using between & within ┴ constraints •PLS: maximum CovXY using within ┴ constraints •RV: CovXY/CovXX & CovYY 24 Canonical Ordination Approaches •Considers association of two matrices (X & Y) •Provides visualization (ordination) from Y~X (differing from PCA/PCoA) •Several canonical ordination methods •Canonical Variates Analysis (CVA) •Redundancy Analysis (RDA) •Canonical Correspondence Analysis (CCA) *Mathematically, the canonical form (from Greek Greek κανων, ‘kanôn’) is the simplest and most comprehensive representation of relationship, without losing generality 25 Canonical Analysis: General Comments •Recall univariate multiple regression: Yi 0 1 X1i 2 X 2i i •Here, predicted values (Ŷ) are a 1-dimensional ‘ordination’ of original Y data (along the regression line) •Regression maximizes R2 between Y and Ŷ •Represents the optimal LS relationship between X and Y •Canonical analyses share this property for multivariate Y, and generate ordinations of Y constrained by the maximal LS relationship to X •Consider these different ways of dealing with joint-variation in X&Y S XX S= SYX S XY SYY •RDA and CCA do so in a predictive sense (i.e., regression) •Canonical Correlation and 2B-PLS also deal with this, but in a ‘correlational’ sense 26 Canonical Variates Analysis/Discriminant Analysis •Ordination that maximally discriminates among known groups (g) •Variation expressed as ratio CW1CB ( CW = pooled within VCV; CB = between-gp VCV) •Decomposition of CW1CB results in canonical vector space •Suggests which groups differ on which variables •Within-group variation in CVA plot is circular •METHOD COMMONLY MISUSED BY BIOLOGISTS Historical note: Fisher developed DFA (1936), which was generalized to CVA by Rao (1948; 1952) 27 DFA/CVA: Protocol 1. Partition variation: SSCPTot = Y - Y Y - Y t SSCPB = SSCPT - SSCPW CB = SSCPW = Yi - Yi SSCPB g 1 CW = Y - Y t i i SSCPW n g 2. Obtain canonical axes of CW1CB C -1 W CB li I ui 0 U u1 u 2 u g 1 U contains (g-1) eigenvectors of CW1CB . However, they are NOT orthogonal, because CW1CB is square, but not symmetric. (sometimes Called the discriminant functions). 3. Calculate normalized canonical axes Cvectors = U U Cw U t -1/ 2 4. Obtain canonical variates (CVA scores: from Yc = centered data) F = YcCvectors 28 CVA: What it Does • Rotates and shears data space to space of normalized canonical axes (group variation will be circular) Data sapce (a) eigenvectors (b) canonical axes (c) From Legendre and Legendre (1998) 29 4 6 CVA Example: Pupfish Data 2 View with CVA 0 -6 0 -4 2 -2 PCA PC2 (note group separation) -5 0 5 10 -2 LD2 Actual data -4 PC1 Salty ♀ ● Salty ♂■ Fresh Water ♀ ● Fresh Water ♂ ■ -4 Data courtesy of M. Collyer (Unpubl.). -2 0 LD1 2 30 CVA: Comments •Ordination is ‘canonical’ in that it provides plot of specimens (Y) that maximally separates a priori groups (X) •Canonical vectors describe linear combinations of variables that maximally distinguish group identity (X) •Distorts actual relationships in dataspace by shearing along axes of discrimination among groups •Groups appear more separated than they actually are •Not a faithful representation of the dataspace! 31 Data Space Distortion With CVA •Distances and directions among groups distorted with CVA Original data:3 equidistant groups CV1 through data space CVA space: groups NOT equidistant •CVA should NOT be used to describe patterns and variation in data space, only for describing group differences Adapted from Klingenberg and Monteiro. (2005). Syst. Biol. 32 Data Space Distortion With CVA •CV axes NOT orthogonal in original data space Original data space CVA data space •Linear discrimination ONLY forms linear plane IFF within-group covariances identical (shown as ‘equal probability classification lines below) Adapted from Mitteroecker and Bookstein. (2011). Evol. Biol. 33 Misleading Impression of Group Differences •Increased number of variables increases discrimination… EVEN for IDENTICAL GROUPS! LD2 LD2 -5 -2 -2 -1 -1 0 0 LD2 1 0 1 2 2 3 5 Simulation of 50 specimens in each of 3 groups (150 variables using ‘rnorm’: identical mean & variance) -3 -2 -1 0 1 2 3 -3 LD1 CVA with 4 variables Adapted from Mitteroecker and Bookstein. (2011). Evol. Biol. -2 -1 0 1 2 LD1 CVA with 50 variables 3 -5 0 5 LD1 CVA with 150 variables 34 DFA/CVA: Conclusions •CVA ordination not useful •Distorts distances and directions in data space •Misrepresents within-group covariation and group distances •Perceived group differences increase with additional variables (even for identical groups) •Undistorted (i.e. ‘pure’) view of multivariate space PCA (or PCA from predicted values [e.g., group means] for visualizing actual trends) 35 Redundancy Analysis •Direct extension of multiple regression for multivariate Y •Redundancy synonymous with ‘explained variance’ •RDA is a constrained ordination of Y such that ordination vectors are linear combinations of Y and linear combinations of X •RDA is eigenanalysis of VCV from ( Ŷ) multivariate multiple regression •Thus, RDA preserves Euclidean distances of objects in space of predicted values Ŷ (appropriate for continuous Y variables) 36 RDA: Computations •Center X and Y variables and standardize* •Perform multivariate multiple regression and obtain predicted values 1 t ˆ Y X X X Xt Y •Calculate VCV of predicted values S Yˆ t Yˆ 1 ˆtˆ Y Y S YXS -1 XXS t YX n 1 S S = XX SYX •Ordination from PCA of S XY SYY SYˆ t Yˆ =CΛCt •NOTE: If X or Y begins as a distance matrix, first perform PCoA to generate a set of ‘variables’ for RDA (see Legendre and Anderson, 1999. Ecol. Monogr. 64:1-14) *steps are not absolutely necessary, but simplify computations, and place variables in context for direct comparison 37 RDA: What it Does •Ordination provides plot of objects (Y) as maximally described by independent variables (X) •RDA: ordination of ‘fitted’ values based on GLM ( Ŷ) •Eigenvector loadings describe relative contributions of each variable to ordination on that canonical axis (interpret like PC loadings) •Ordination can be shown as biplot of X (as vectors) in PCA of Ŷ •Reflects relative importance of variables on ordination •Ordination is ‘standardized’ for regression of Y on X 38 6 RDA Example: Pupfish 4 0 0 5 PC1 PCA Salty ♀ ● Salty ♂■ Fresh Water ♀ ● Fresh Water ♂ ■ 10 -6 -5 0 -2 Actual data -4 -6 -4 RDA 2 2 -2 PC2 6 2 4 NOTE: in this case, RDA has done exactly what one should NOT do: Fit common slope when groups are diverging. -4 -2 0 2 4 6 8 10 RDA 1 RDA: X: groups and SVL Y: 3 body depth measurements View with RDA 39 RDA: Comments •Y~X step is useful, though just GLM •Ordination step can lead to biological misinterpretation! •If wrong model of Y~X, ordination not representative of pattern •RDA should NOT be used to describe patterns and variation in data •RDA only shows predicted patterns (based on a model, X) •Exceptionally easy to misuse (knowingly or unknowingly) •Provides false sense of pattern relative to error (noise) •An incorrect model (X) results in incorrect ordination 40 Canonical Correspondence Analysis •Extends correspondence analysis of Y to predictive framework •CCA is a constrained ordination of Y such that ordination vectors are linear combinations of Y and linear combinations of X •Conceptually, CCA is the same as RDA, but with computational adjustments for the nature of the Y data (which is not continuous) •CCA is eigenanalysis of VCV from (Ŷ) weighted form of RDA •Thus, CCA preserves Chi-square distances of objects in space of predicted values Ŷ (appropriate for frequency or presence/absence Y variables) 41 CCA: Computations 1. Calculate matrix (Q) of relative frequencies (proportions) from contingency table data: pij fij ftot p p p 2. Calculate elements of matrix Q as: q p p i ij j ij i j (matrix is centered by row and column means, hence ‘reciprocal averaging’) 3. Perform weighted regression, where Dpi+ are weights B X Dpi+ X t 1 ˆ D0.5 XB Y pi+ Xt Dp0.5i+ Q 4. Calculate VCV of predicted values ˆ tY ˆ SYˆ t Yˆ Y 5. Ordination from PCA of SYˆ t Yˆ =CΛCt For details see Legendre and Legendre 1998 42 CCA: Comments •Ordination provides plot of objects (Q) as maximally described by independent variables (X) •CCA: ordination of ‘fitted’ values NOT of Y itself (based on Ŷ ) •Plot interpreted as in CA: ordination in ‘frequency space’ •CCA ordination can be shown as biplot of X (as vectors) in PCA of Ŷ •Reflects relative importance of variables on ordination •Ordination is ‘standardized’ for regression of Y on X 43 Conclusions •CCorA & 2B-PLS provide useful multivariate extension to correlation •Canonical ordination approaches (CVA, RDA, CCA) provide adjusted view of data space (based on X) •Elegant mathematically, but do NOT provide true view of data space •Canonical ordinations should not be used to describe patterns of variation •Actual patterns should always be viewed first (e.g., through PCA) to understand true biological variation •Use EXTREME CAUTION when interpreting canonical ordination plots: incorrect model leads to misleading plot!!!!!! 44