Center for Biofilm Engineering

Some statistical considerations in molecular methods
Al Parker
Statistician and Research Engineer
Montana State University
BSTM, July 2009

Acknowledgments
Colleagues in the CBE:
James Moberly, Seth D’Imperio, Brent Peyton
Markus Dieser
Marty Hamilton

The problem
How to extract useful information from hundreds to thousands of response variables (e.g. microarray analysis) measured from only a few replicates (experiments or environmental samples)

Statistical thinking
Multivariate statistics attempts to organize and summarize data sets with large numbers of response variables
"organize and summarize" = dimension reduction
In this talk, I will focus on abundance data, estimated for example from microarray or clone analysis of PCR

Statistical thinking
Hierarchical Clustering
Principal Components
Canonical Correlation

Hierarchical Clustering (38 variables, 9 replicates)
Similarity or Distance
Linkage: how the similarity measure determines clusters

Two different ways to generate clusters with the same similarity measure

A Distance or Similarity Measure
Correlation measures the strength and direction of a linear relationship between paired variables x and y

$$\mathrm{Corr}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, S_x S_y}$$

where \bar{x} = mean(x), \bar{y} = mean(y), S_x and S_y are the sample standard deviations of x and y, and n is the number of paired observations
Unitless
Values between -1 and 1

An example (2 variables, 9 replicates)
Corr(Actinobacteria, Acidobacteria) = 0.7833

Another (made up) example
Corr(species 1, species 2) = 0.000

A matrix of scatterplots for 6 variables

A correlation matrix of 6 variables

                  Acidobacteria  Actinobacteria  Bacteroidetes  Chloroflexi  Proteobacteria  Verrucomicrobia
Acidobacteria     1              0.7833          0.7589         0.8556       0.8444          0.7975
Actinobacteria    0.7833         1               0.8993         0.8257       0.9698          0.8230
Bacteroidetes     0.7589         0.8993          1              0.7901       0.9393          0.8392
Chloroflexi       0.8556         0.8257          0.7901         1            0.8704          0.9699
Proteobacteria    0.8444         0.9698          0.9393         0.8704       1               0.8621
Verrucomicrobia   0.7975         0.8230          0.8392         0.9699       0.8621          1

Principal Components Analysis (PCA)
PCA uses the correlation matrix formed by the original variables to optimally construct a smaller number of new variables which capture the maximum amount of variability in the original variables
PCA applied to the correlation matrix is not affected by disparate units between the different variables
The number of new variables can be no larger than the number of replicates (and no larger than the number of original variables)

PCA with 2 (standardized) responses
[Figure; axes: Original variable #1, Original variable #2]
1st PC (78%) is loaded by Orig Var #1
2nd PC (22%) is loaded by Orig Var #2

PCA terminology
The new variables are called principal components
The amount of variability of the original data captured by each component is given
The correlations between the original variables and the principal components are called principal component loadings

Reducing 7 original variables to 2 PCs
Original variables:
1. Water depth
2. Core depth
3. Fe
4. Mn
5. Cu
6. Pb
7. Zn
New variables = Principal Components
1st PC: Metals (55%)
2nd PC: Water depth and Core depth (18%)
Total: 73%

PCA is another way to cluster

Canonical Correlation Analysis (CCA)
CCA uses the correlation matrix to determine the (linear) relationship between input variables (e.g. environmental variables) and response variables (e.g. phylogenetic data)
CCA simultaneously finds new variables from the input and response variables which have maximal correlation
The number of new variables (canonical components) can be no larger than the number of replicates

CCA Example (7 inputs, 6 outputs, 9 replicates)
Original environmental variables:
1. Water depth
2. Core depth
3. Fe
4. Mn
5. Cu
6. Pb
7. Zn
Original microbial variables:
1. Acidobacteria
2. Actinobacteria
3. Bacteroidetes
4. Chloroflexi
5. Proteobacteria
6. Verrucomicrobia
New environmental variables:
1st CC: Water depth and Core depth
2nd CC: Metals
New microbial variables:
1st CC: Acidobacteria, …, Verrucomicrobia
2nd CC: Bacteroidetes

Summary
PROBLEM: Lots of variables measured from a few samples
SOME APPROACHES:
Cluster similar variables together
Principal component analysis creates a few new variables which optimally represent the data
Canonical correlation analysis describes the optimal (linear) relationship between input and output variables

Fin

Principal Component Analysis: water depth , core depth (, Mn-Total, Fe-Total, C
Eigenanalysis of the Correlation Matrix

              PC1     PC2     PC3     PC4     PC5     PC6     PC7
Eigenvalue    3.8467  1.2443  1.0043  0.6628  0.1567  0.0830  0.0023
Proportion    0.550   0.178   0.143   0.095   0.022   0.012   0.000
Cumulative    0.550   0.727   0.871   0.965   0.988   1.000   1.000

Variable            PC1     PC2     PC3     PC4     PC5     PC6     PC7
water depth (cm)    0.090  -0.529  -0.732   0.338   0.131   0.201  -0.062
core depth (cm)    -0.193   0.702  -0.154   0.558   0.194   0.313   0.009
Mn-Total            0.488   0.163  -0.171  -0.016  -0.366   0.084   0.752
Fe-Total            0.477   0.228  -0.126  -0.057  -0.504   0.154  -0.651
Cu-Total            0.227  -0.358   0.608   0.633  -0.119   0.188   0.004
Zn-Total            0.463   0.019   0.147  -0.326   0.634   0.505  -0.026
Pb-Total            0.474   0.142  -0.055   0.253   0.376  -0.735  -0.080

CCA (7 input variables, 9 replicates)
1st CC: Water depth and Core depth
2nd CC: Metals

CCA (6 response variables, 9 replicates)
1st CC: Acidobacteria, …, Verrucomicrobia
2nd CC: Bacteroidetes

Hierarchical Clustering
The large number of variables is organized into a smaller number of clusters of similar variables
One can choose a representative variable from each cluster (e.g. a mean)
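
A minimal sketch of hierarchical clustering driven by correlation, using the 6-variable correlation matrix from the slides. The talk does not say which distance or linkage rule produced its dendrograms, so the choices here are assumptions: distance is taken as 1 - Corr and clusters are merged by average linkage. The function names `pearson` and `average_linkage` are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Sample correlation: sum((xi - mean(x)) * (yi - mean(y))) / ((n - 1) * Sx * Sy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)


# Correlation matrix copied from the "correlation matrix of 6 variables" slide.
NAMES = ["Acidobacteria", "Actinobacteria", "Bacteroidetes",
         "Chloroflexi", "Proteobacteria", "Verrucomicrobia"]
CORR = [
    [1.0000, 0.7833, 0.7589, 0.8556, 0.8444, 0.7975],
    [0.7833, 1.0000, 0.8993, 0.8257, 0.9698, 0.8230],
    [0.7589, 0.8993, 1.0000, 0.7901, 0.9393, 0.8392],
    [0.8556, 0.8257, 0.7901, 1.0000, 0.8704, 0.9699],
    [0.8444, 0.9698, 0.9393, 0.8704, 1.0000, 0.8621],
    [0.7975, 0.8230, 0.8392, 0.9699, 0.8621, 1.0000],
]


def average_linkage(corr, names):
    """Agglomerative clustering of variables.

    Assumed distance: 1 - correlation. Assumed linkage: average of all
    pairwise distances between two clusters. Returns the merge history
    as (left cluster names, right cluster names, distance) tuples.
    """
    clusters = [(i,) for i in range(len(names))]  # start with one variable per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(1.0 - corr[i][j] for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[j] for j in clusters[b]],
                       dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


if __name__ == "__main__":
    for left, right, dist in average_linkage(CORR, NAMES):
        print(f"{dist:.4f}  {left} + {right}")
```

Under these assumptions the first merge pairs Chloroflexi with Verrucomicrobia (the largest off-diagonal correlation, 0.9699), and Acidobacteria is the last variable to join. A different linkage rule (single or complete, say) can change the later merges even with the same similarity measure, which is the point of the "two different ways to generate clusters" slide.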