High Dimensional Molecular Data Tutorial in R
An introduction to the HDMD package
Prepared for the ICMSB Workshop, Shanghai, China, 2009
Lisa McFerrin and William Atchley

Table of Contents

Figure Index
Table Index
Installing the HDMD Package
Introduction
  In this Tutorial
  Notation
Principal Components Analysis (PCA)
  Methodology
  Input Data
  Functions
  Call
  Resulting Output
Factor Analysis (FA)
  Methodology
    Estimate Communality
    Varimax Rotation (Orthogonal)
    Promax Rotation (Oblique)
  Input Data
  Functions
  Call
  Resulting Output
  PCA and FA Comparison
  Differences in FA for R and SAS
  Using 6 Factors
Metric Transformation
  Methodology
  Input Data
  Functions
  Call
  Resulting Output
Discriminant Function Analysis (DFA)
  Methodology
  Input Data
  Functions
  Call
  Resulting Output
References

Figure Index

Figure 1: Scree Plot for Principal Components (screeplot)
Figure 2: Latent Variable Structure
Figure 3: Visualization of Rotation
Figure 4: Scree Plot for Factor Analysis (VSS.scree)
Figure 5: PCA - FA Scores for Factor 1 and 2 (and 3)
Figure 6: Sorted Communality Estimates for FA
Figure 7: Metric Solution Conversion
Figure 8: LDA for 2 classes
Figure 9: LDA for Factor1 using R Metric Transformation
Figure 10: LDA for Factor2 (pss) using R Metric Transformation
Figure 11: LDA for Factor3 (ms) using R Metric Transformation
Figure 12: LDA for Factor4 (cc) using R Metric Transformation
Figure 13: LDA for Factor5 (ec) using R Metric Transformation

Table Index

Table 1: Amino Acid Values for 54 Quantifiable Attributes (Standardized)
Table 2: Amino Acid Factor Scores from R
Table 3: Correlation of PCA and FA Score Estimates
Table 4: Factor Scores computed with SAS (Atchley et al. 2005)
Table 5: Correlation between R (row) and SAS (col) factor scores, K=5
Table 6: Correlation between R (row) and SAS (col) factor scores, K=6
Table 7: SAS Factor Scores (K=6)
Table 8: R Factor Scores (K=6)
Table 9: SAS Score Correlation for FA6 & FA5
Table 10: R Score Correlation for FA6 & FA5
Table 11: Representative sample of bHLH Amino Acid Sequences with group designation
Table 12: Metric Transformation of bHLH Sequences using AAMetric (subset shown)

Installing the HDMD package

A CRAN package is in preparation and will be submitted to CRAN at http://cran.r-project.org/. Currently, the HDMD package and its supplementary materials are available at www4.ncsu.edu/~lgmcferr/ICMSBWorkshop, where updates and package instructions may also be found.

To run the HDMD functions, download HDMD_1.0.tar.gz from www4.ncsu.edu/~lgmcferr/ICMSBWorkshop into a directory and unpack it from the command line using "tar -xvf HDMD_1.0.tar.gz". From the same directory, run "R CMD INSTALL HDMD". The HDMD package can then be loaded through the R package manager.

Several other packages are required for full functionality of the HDMD package. MASS, stats, and psych are necessary for internal HDMD computation, while scatterplot3d is optional and is recommended in the examples only for viewing purposes. Since psych is not included in the base R library, it must be downloaded; the call install.packages("psych") will install it from an external repository.

Routines implemented in HDMD are appropriate for PCA, FA, and DFA even when the number of variables exceeds the number of observations.
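As an alternative to the command-line installation, the source archive can also be installed from within an R session. This is a minimal sketch, assuming HDMD_1.0.tar.gz has been downloaded into the current working directory:

install.packages("psych")                                            # required dependency, fetched from CRAN
install.packages("scatterplot3d")                                    # optional, used only for the 3D score plots
install.packages("HDMD_1.0.tar.gz", repos = NULL, type = "source")   # install the local source archive
library(HDMD)                                                        # attaches MASS and psych as well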
HDMD also provides a function to transform amino acid sequences into numeric metric values. For more information, see www4.ncsu.edu/~lgmcferr/ICMSBWorkshop.

Version: 1.0
Depends: MASS, psych
Suggests: scatterplot3d
Published: 2009-06-08
Author: Lisa McFerrin
Maintainer: Lisa McFerrin <lgmcferr at ncsu dot edu>
URL: www4.ncsu.edu/~lgmcferr/ICMSBWorkshop
Package source: HDMD_1.0.tar.gz
Examples: HDMD_FunctionCalls.R

Introduction

This tutorial introduces the concept of "latent variables" and the relevant statistical methodology as they relate to analyses of high dimensional molecular data (HDMD). A latent variable model relates a set of observed or manifest variables to a set of underlying latent variables. Latent variables are not directly observed but are instead inferred through a mathematical model constructed from variables that are observed and directly measured.

HDMD typically contain thousands of data points arising from a substantially smaller number of sampling units. Such data present many complexities: they are typically highly interdependent, they exhibit complex underlying components of variability, and meaningful replication is rare. The workshop focuses on three multivariate statistical methods that can facilitate description and analysis of the latent variables and latent structure inherent to HDMD. These statistical methods reduce the dimensionality of HDMD so that biologically meaningful patterns of multidimensional covariability are exposed and relevant biological questions can be explored.

In this Tutorial

The tutorial and HDMD package were prepared for a workshop on HDMD to introduce principal components analysis, common factor analysis, and discriminant analysis. A metric transformation converting amino acid residues into biologically informative numerical values is also covered. Using exemplar published datasets, the relative suitability of these methods for exploring a series of important biological questions is briefly touched upon.

The three methods are demonstrated using a set of R programs together with annotated output and documentation. Specific examples are covered, with explicit function calls and output shown, to provide a complete walkthrough of an HDMD analysis. Annotations of the HDMD package functions are NOT covered here, but can be found in the HDMD manual. This software has not been stringently tested; please report any errors or bugs to Lisa McFerrin at lgmcferr@ncsu.edu. More information and package instructions can be found at www4.ncsu.edu/~lgmcferr/ICMSBWorkshop.

Notation

...               indicates only a portion of the table or output is shown
p                 number of variables
N                 number of observations
K                 number of components, factors, or discriminants
X                 data matrix (N x p)
X'                normalized data matrix, centered and/or scaled (N x p)
$\mu_j$           mean of variable j
$\Sigma$          covariance matrix (p x p)
$\sigma_i^2$      variance of variable i
$\sigma_{ij}^2$   covariance of variables i and j
R                 correlation matrix (p x p)
$\lambda$         vector of p eigenvalues (1 x p)
V                 eigenvector matrix (p x p)
$\mathrm{diag}(z)_m$   m x m matrix with diagonal elements set to z and off-diagonal elements set to 0

Principal Components Analysis (PCA)

Principal Component Analysis (PCA) is a data reduction tool. Generally speaking, PCA provides a framework for minimizing data dimensionality by identifying linear combinations of variables, called principal components, that maximally represent variation in the data.
Principal axes fit the original data linearly: the first principal axis minimizes the sum of squared deviations over all observations and therefore maximally reduces residual variation. Each subsequent principal axis maximally accounts for variation in the residual data and acts as the line of best fit orthogonal to all previously defined axes. Principal components represent the correlation between the variables and the corresponding principal axes. Conceptually, PCA is a greedy algorithm, fitting each axis to the data while conditioning on all previously defined axes.

Principal components project the original data onto these axes, and the axes are ordered so that Principal Component 1 (PC1) accounts for the most variation, followed by PC2, PC3, ..., PCp for p variables (dimensions). Since the PCs are mutually orthogonal, each component independently accounts for data variability and the Percent of Total Variation explained (PTV) is cumulative.

PCA produces as many principal components as variables in order to explain all variability in the data. However, only a subset of these principal components is notably informative. Since variability is shifted into the leading PCs, many of the remaining PCs account for little variation and can be disregarded, retaining maximal variability with reduced dimensionality.

While Singular Value Decomposition (SVD) can be used for PCA, we will focus on the eigenvector decomposition method employed in R. The eigenvalue

$$\lambda_k = \sum_{i=1}^{p} a_{ik}^2$$

is the variance explained by PCk, where $a_{ik}$ is the loading of variable i on component k. As an example, if 90% of the total variation should be retained in the model for p-dimensional data, the first K principal components should be kept such that

$$PTV = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \ge 0.90$$

PTV acts as a signal-to-noise ratio, which flattens out as components are added. Typically, the number of informative components K is chosen using one of three methods: 1) Kaiser's eigenvalue > 1 rule; 2) Cattell's scree plot; or 3) Bartlett's test of sphericity. Often K << p, as seen in the following example. PCA is therefore extremely useful for data compression and dimensionality reduction, since the process optimizes over the total variance. However, the reduced set of loading values relating variables and PCs may be difficult to interpret: while PCA provides a unique solution, the loading coefficients may not distinguish which variables contribute to variation along a particular principal axis. As an additional resource, Shlens (2009) provides a very useful PCA tutorial, including basic methods and explanations.

Methodology

The first step in PCA is to create a mean-centered data matrix X' by subtracting the variable means from each data element in X. This centers the data so that each column of X' has mean 0.

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{N1} & \cdots & x_{Np} \end{bmatrix}
\qquad
X' = \begin{bmatrix} x_{11}-\mu_1 & \cdots & x_{1p}-\mu_p \\ \vdots & & \vdots \\ x_{N1}-\mu_1 & \cdots & x_{Np}-\mu_p \end{bmatrix},
\qquad x'_{ij} = x_{ij} - \mu_j$$

Next, a covariance matrix is calculated for X':

$$\Sigma_{X'} = \frac{1}{N}X'^{T}X' = \begin{bmatrix} \sigma_{11}^2 & \cdots & \sigma_{1p}^2 \\ \vdots & & \vdots \\ \sigma_{p1}^2 & \cdots & \sigma_{pp}^2 \end{bmatrix}
\qquad \text{where } \sigma_{ij}^2 = \sigma_{ji}^2$$

Using eigenvector decomposition, the resulting transformation diagonalizes the covariance matrix, creating a set of uncorrelated eigenvectors that span the data. Eigenvectors are variable weightings used as coefficients in a linear combination, so that each successive eigenvector accounts for the most residual variability. The set of eigenvectors comprises the principal component matrix, and the corresponding eigenvalues quantify the variance explained by each vector (PC). Each element $v_{ij}$ is called a loading and represents the correlation of variable i with principal component j.
$$V = \begin{bmatrix} v_{11} & \cdots & v_{1p} \\ \vdots & & \vdots \\ v_{p1} & \cdots & v_{pp} \end{bmatrix}$$

where the rows index the p variables and the columns index the principal components.

Since principal components account for decreasing amounts of variation in redundant data, it is not necessary to retain all p components. Several methods propose cutoff values for reducing the number of components. The most direct is Kaiser's method of choosing all components with $\lambda \ge 1$; this ensures that each retained component explains at least as much variance as a single variable. Another frequently used method is Cattell's scree plot, which plots the eigenvalues in decreasing order. The number of components retained is determined by the elbow where the curve becomes asymptotic and additional components provide little further information.

Principal component scores are determined by using the loadings as coefficients that weight the observations in a linear combination, projecting the data onto the principal axes:

$$S = X'V = \begin{bmatrix} x_{11}-\mu_1 & \cdots & x_{1p}-\mu_p \\ \vdots & & \vdots \\ x_{N1}-\mu_1 & \cdots & x_{Np}-\mu_p \end{bmatrix}
\begin{bmatrix} v_{11} & \cdots & v_{1p} \\ \vdots & & \vdots \\ v_{p1} & \cdots & v_{pp} \end{bmatrix}
= \begin{bmatrix} s_{11} & \cdots & s_{1p} \\ \vdots & & \vdots \\ s_{N1} & \cdots & s_{Np} \end{bmatrix}$$

where the rows of S are observations and the columns are principal components.

Data observations can be estimated from scores and loadings, much as in a linear regression model. The first principal component gives the best single-component estimate of the observations, as it accounts for the most variation. Observation i, having p original explanatory variables and K informative components, can thus be estimated with the following equation:

$$x_i^{T} \approx V S_i^{T}$$

where V and $S_i$ may be restricted to the first K components.

Input Data

Taken from AAindex (http://www.genome.jp/aaindex/), amino acids can be quantified by multiple structural, chemical, and functional attributes. Following Atchley, Zhao et al. (2005), a subset of 54 informative indices describes the similarity and variability among amino acids. The variables in AA54 have been centered and scaled so that each of the 54 amino acid indices has mean zero and variance one.

Observations: Amino Acids (N = 20)
Variables: Quantifiable attributes or indices taken from AAindex (p = 54)

Table 1: Amino Acid Values for 54 Quantifiable Attributes (Standardized), shown as X' (20 x 54)

Functions

package   function
stats     princomp(X, covmat = cov.wt(X))
stats     screeplot

Implementation

Given a dataset X of N observations and p variables, princomp will return an error if N < p, stating "'princomp' can only be used with more units than variables". When N < p the estimated p x p covariance matrix has rank at most N - 1 and is therefore singular. This is a very common situation with HDMD. While the covariance can still be calculated, princomp does not permit it internally; typically a generalized inverse is used to circumvent this problem. A workaround is to supply the weighted covariance matrix in the princomp function call. The weighted covariance cov.wt returns a list with both the covariance and the centers (means) of each column (variable), whereas the R function cov returns only the covariance matrix and does not define the centers. In order for princomp to calculate scores, the centers must be defined, so cov.wt must be used.
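The same computation can be carried out by hand as a check of the methodology above. This is a minimal sketch (object names are illustrative) that reproduces the eigenvalues, loadings, and scores returned by princomp, up to the sign of each component:

cw <- cov.wt(AA54)                          # weighted covariance plus column means (centers)
eig <- eigen(cw$cov)                        # eigen decomposition of the covariance matrix
lambda <- eig$values                        # variance explained by each component
V <- eig$vectors                            # loadings: columns are principal axes
Xc <- sweep(as.matrix(AA54), 2, cw$center)  # mean-center the data
S <- Xc %*% V                               # principal component scores, S = X'V
PTV <- lambda / sum(lambda)                 # percent of total variation per component
K <- sum(lambda >= 1)                       # Kaiser's rule: keep components with eigenvalue >= 1
cumsum(PTV)[1:K]                            # cumulative variation explained by the retained components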
Call

library(scatterplot3d)

AA54_PCA = princomp(AA54, covmat = cov.wt(AA54))

screeplot(AA54_PCA, type="lines", npcs=length(AA54_PCA$sdev), main="Principal Components Scree Plot")

PTV = AA54_PCA$sdev^2 / sum(AA54_PCA$sdev^2)
CTV = cumsum(PTV)
TV = rbind(Lambda=round(AA54_PCA$sdev^2, digits=5), PTV=round(PTV, digits=5), CTV=round(CTV, digits=5))
TV

AA54_PCA$sdev
AA54_PCA$loadings
AA54_PCA$scores

PC3d = scatterplot3d(AA54_PCA$scores[,1:3], pch = AminoAcids, main="Principal Component Scores", box = FALSE, grid = FALSE)
PC3d$plane3d(c(0,0,0), col="grey")
PC3d$points3d(c(0,0), c(0,0), c(-3,2), lty="solid", type="l")
PC3d$points3d(c(0,0), c(-1.5,2), c(0,0), lty="solid", type="l")
PC3d$points3d(c(-1.5,2), c(0,0), c(0,0), lty="solid", type="l")
PC3d$points3d(AA54_PCA$scores[hydrophobic,1:3], col="blue", cex = 2.7, lwd=1.5)
PC3d$points3d(AA54_PCA$scores[polar,1:3], col="green", cex = 3.3, lwd=1.5)
PC3d$points3d(AA54_PCA$scores[small,1:3], col="orange", cex = 3.9, lwd=1.5)
legend(x=5, y=4.5, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21, box.lty=0)

Resulting Output

> AA54_PCA = princomp(AA54, covmat = cov.wt(AA54))
Warning message:
In princomp.default(AA54, covmat = cov.wt(AA54)) :
  both 'x' and 'covmat' were supplied: 'x' will be ignored

> screeplot(AA54_PCA, type="lines", npcs=length(AA54_PCA$sdev), main="Principal Components Scree Plot")

Figure 1: Scree Plot for Principal Components (screeplot)

> AA54_PCA$sdev
...

> AA54_PCA$loadings
V (54 x 54)
...

Loading coefficients < 0.10 are suppressed in printing by default. To specify a different cutoff, set the decimal precision, or sort the loading coefficients, use

> print(AA54_PCA$loadings, digits=2, cutoff=0, sort=TRUE)

> AA54_PCA$scores
S (20 x 54)
...

> TV
...

PCA Score Plots

> plot(AA54_PCA$scores[,1:2], pch = AminoAcids, main="Principal Component Scores")
> points(AA54_PCA$scores[hydrophobic,1:2], col="blue", cex = 2.7, lwd=1.5)
> points(AA54_PCA$scores[polar,1:2], col="green", cex = 3.3, lwd=1.5)
> points(AA54_PCA$scores[small,1:2], col="orange", cex = 3.9, lwd=1.5)
> legend(x=2, y=8, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21)

> PC3d = scatterplot3d(AA54_PCA$scores[,1:3], pch = AminoAcids, main="Principal Component Scores", box = FALSE, grid = FALSE)
> PC3d$plane3d(c(0,0,0), col="grey")
> PC3d$points3d(c(0,0), c(0,0), c(-3,2), lty="solid", type="l")
> PC3d$points3d(c(0,0), c(-1.5,2), c(0,0), lty="solid", type="l")
> PC3d$points3d(c(-1.5,2), c(0,0), c(0,0), lty="solid", type="l")
> PC3d$points3d(AA54_PCA$scores[hydrophobic,1:3], col="blue", cex = 2.7, lwd=1.5)
> PC3d$points3d(AA54_PCA$scores[polar,1:3], col="green", cex = 3.3, lwd=1.5)
> PC3d$points3d(AA54_PCA$scores[small,1:3], col="orange", cex = 3.9, lwd=1.5)
> legend(x=5, y=4.5, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21, box.lty=0)

As seen in the plots of PC scores, amino acids group by similarity, and 35% and 57% of the cumulative total variance is explained with just two and three components, respectively. Including K = 7 components explains over 90% of the total variance.

Factor Analysis (FA)

Factor Analysis (FA) is a dimension reduction tool that estimates the latent variable structure of the data by partitioning variability into a part common to all variables and a residual part unique or specific to each variable.
FA differs from PCA by estimating the communality of each variable so as to distinguish variation that is unrelated to the other variables, or due to error, from variation that can be explained by a common factor. By separating these sources of variation, FA decomposes the HDMD into an interpretable structure comprised of explanatory factors acting on multiple variables. Each factor represents a latent variable, with loadings or coefficients relating the observed variables to the factor. For model estimation, the number of factors must first be defined. As in PCA, Cattell's scree plot can be used to determine the number of informative factors.

A simplified diagram is shown in Figure 2, where two factors affect four observed variables, each of which also has its own unique variability. Loadings are calculated for all factor and variable combinations, although certain variables may be more closely associated with a particular factor than others. In this example, Factor1 is highly correlated with variables 1, 2, and 3, while Factor2 is highly correlated with only variables 3 and 4.

Figure 2: Latent Variable Structure (two latent factors, Factor1 and Factor2, acting on observed Variables 1-4, each with its own unique variability)

Note also that factors may be correlated among themselves. In a Varimax rotation, factors are defined to be uncorrelated and are represented by orthogonal vectors. In contrast, Promax implements an oblique rotation so that factors can be related. Conventionally, when a Promax rotation is applied, a Varimax rotation is implemented first. Because of this rotation procedure, Factor Analysis does not produce a unique solution.

Figure 3 <http://www.mega.nu/ampp/rummel/ufa.htm> displays how factors 1 and 2 form orthogonal axes that account for maximal variation among the 8 example variables. Orthogonal and oblique rotations further fit the axes to the variables, creating sparsity. This emphasizes some variable-factor relationships while reducing other variable-factor associations. While the variance explained by each factor does not change, rotating the loadings by an orthogonal matrix alters the coefficients and can lead to slightly different interpretations. In this regard, it is important to draw inferences carefully.

Figure 3: Visualization of Rotation

Methodology

Factor Analysis differs from PCA in that it separates and estimates common and unique variability. The model being fit is

$$x_i^{T} = \Lambda S_i^{T} + \varepsilon_i
\qquad
\begin{bmatrix} x_{i1}-\mu_1 \\ \vdots \\ x_{ip}-\mu_p \end{bmatrix}
= \begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1K} \\ \vdots & & \vdots \\ \lambda_{p1} & \cdots & \lambda_{pK} \end{bmatrix}
\begin{bmatrix} s_{i1} \\ \vdots \\ s_{iK} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{ip} \end{bmatrix}$$

where $\varepsilon$ is the estimated unique variation, including error. Since the unique and common components must be estimated, the factor loadings are not the initial eigenvector solution as in PCA.

Factor Analysis first standardizes the data matrix X by both centering and scaling each element, so that each column of X' has mean 0 and variance 1:

$$x'_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}
\qquad \text{where } \sigma_j = \sqrt{\tfrac{1}{N}\textstyle\sum_{i=1}^{N}(x_{ij}-\mu_j)^2} \text{ is the root mean square deviation of variable } j$$

The covariance matrix of X' is then the usual correlation matrix, with diagonal elements equal to 1:

$$R = \Sigma_{X'} = \begin{bmatrix} 1 & \cdots & \sigma_{1p}^2 \\ \vdots & & \vdots \\ \sigma_{p1}^2 & \cdots & 1 \end{bmatrix}
\qquad \text{where } \sigma_{ij}^2 = \sigma_{ji}^2$$

In order to determine the amount of variability that can be explained by the factor structure, the diagonal elements of R are replaced with the estimated communality $h^2$ of each variable. The default method initializes $h^2$ with the squared multiple correlation (SMC) value.
The SMC estimates the correlation of variable j with all the other variables, so that

$$h_j^2 = 1 - \frac{1}{R^{-1}_{jj}}$$

where $R^{-1}_{jj}$ is the jth diagonal element of the inverse correlation matrix. If the number of factors K is half the number of variables, or if imaginary eigenvalues are encountered in the first iteration, then the communality is initialized to 1. The total communality is simply the sum of the communalities of the variables, $h^2 = \sum_{j=1}^{p} h_j^2$.

Iteratively decomposing the correlation matrix into its eigenvector structure and updating the diagonal elements with the sum of squares of each loading vector estimates the common variance. The process is as follows:

Estimate Communality

1) Initialize the communality and the correlation matrix R:
$$Comm_0 = \sum_{j=1}^{p} h_j^2, \qquad h_j^2 = 1 - \frac{1}{R^{-1}_{jj}} \text{ or } h_j^2 = 1, \qquad
R = \begin{bmatrix} h_1^2 & \cdots & \sigma_{1p}^2 \\ \vdots & & \vdots \\ \sigma_{p1}^2 & \cdots & h_p^2 \end{bmatrix}$$

2) Solve the eigenvector structure of R, keeping the first K factors:
$$\Lambda = \begin{bmatrix} v_{11} & \cdots & v_{1K} \\ \vdots & & \vdots \\ v_{p1} & \cdots & v_{pK} \end{bmatrix}
\begin{bmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_K} \end{bmatrix}
= \begin{bmatrix} \sqrt{\lambda_1}\,v_{11} & \cdots & \sqrt{\lambda_K}\,v_{1K} \\ \vdots & & \vdots \\ \sqrt{\lambda_1}\,v_{p1} & \cdots & \sqrt{\lambda_K}\,v_{pK} \end{bmatrix}
\quad \text{(loadings)}$$

3) Determine and update the communality and the diagonal of R:
$$h_j^2 = \sum_{k=1}^{K} \Lambda_{jk}^2, \qquad R_{jj} = h_j^2, \qquad Comm_t = \sum_{j=1}^{p} h_j^2$$

4) Iterate steps 2-3 until the communality converges: $|Comm_t - Comm_{t-1}| \le c$, where c is a convergence threshold.

Once the common and unique variance estimates have stabilized using the above procedure, the factor loadings can be transformed using an orthogonal (Varimax) or oblique (Promax) rotation. Typically, when applying a Promax rotation, the loadings are first pre-rotated using Varimax; this establishes an orthogonally rotated basis that can then be updated according to the factor correlations. During the Varimax rotation calculations, the factor loadings are normalized by dividing by the variable communalities. An orthogonal rotation matrix T is determined iteratively and updated through variance maximization using Singular Value Decomposition (SVD) on the transformed loadings.

Varimax Rotation (Orthogonal)

1) Normalize the loadings by dividing each row by $h_j$, the square root of its communality:
$$\Lambda' = \mathrm{diag}\!\left(\frac{1}{h_1},\ldots,\frac{1}{h_p}\right)\Lambda
= \begin{bmatrix} \lambda_{11}/h_1 & \cdots & \lambda_{1K}/h_1 \\ \vdots & & \vdots \\ \lambda_{p1}/h_p & \cdots & \lambda_{pK}/h_p \end{bmatrix}$$

2) Initialize the transformation matrix T and the convergence distance d: $T_0 = I_{K \times K}$ (no rotation), $d_0 = 0$.

3) Transform the loadings: $\hat\Lambda = \Lambda' T$.

4) Fit the axes:
$$B = \Lambda'^{T}\left(\hat\Lambda^{3} - \frac{1}{p}\,\hat\Lambda\,\mathrm{diag}\!\left(\sum_{j=1}^{p}\hat\lambda_{j1}^2,\ldots,\sum_{j=1}^{p}\hat\lambda_{jK}^2\right)\right)$$
where $\hat\Lambda^{3}$ is the element-wise cube of $\hat\Lambda$.

5) Update the rotation matrix T through Singular Value Decomposition (SVD): $B = UDV^{T}$, where U and V are orthogonal matrices and D is diagonal; set $T_t = UV^{T}$.

6) Iterate steps 3-5 until convergence, tracking $d_t = \sum_{k=1}^{K} D_{kk}$.

7) The rotation has converged when $d_t < d_{t-1}(1 + c)$ for some threshold value c. Finalize by restoring the original scale of the loadings: $\hat\Lambda = \mathrm{diag}(h_1,\ldots,h_p)\,\Lambda' T$.

The Promax rotation fits the loadings without stipulating orthogonality among the axes. The coefficients describe the best-fit line for each factor in its respective direction. Since there is no restriction of orthogonality on the axes during this step, changes in axis direction may result in correlated factors. A factor correlation matrix close to the identity matrix implies orthogonal, uncorrelated factors.

Promax Rotation (Oblique)

1) Fit the target axes by raising the (Varimax-rotated) loadings to the power m element-wise, retaining the sign of each element: $Q_{jk} = \lambda_{jk}\,|\lambda_{jk}|^{m-1}$.

2) Fit $\Lambda$ to Q: U is the K x K matrix of least-squares coefficients relating $\Lambda$ and Q.

3) Weight the coefficients: $d = \mathrm{diag}\big((U^{T}U)^{-1}\big)$ and rescale $U \leftarrow U\,\mathrm{diag}(\sqrt{d_1},\ldots,\sqrt{d_K})$.

4) Rotated loadings and factor correlations: $\hat\Lambda = \Lambda U$ and $\Phi = U^{-1}\,(U^{-1})^{T}$.

As previously stated, FA separates and estimates common and unique variability. Given the data and means, the loadings are estimated through communality optimization, eigenvector decomposition, and the rotation procedures. However, the scores S and the error term $\varepsilon$ must still be determined to approximate the data, $x_i^{T} = \Lambda S_i^{T} + \varepsilon_i$. Several methods exist for estimating FA scores, including regression and Bartlett methods. In the regression method, the scores are estimated as
$$\hat{S}_{(N \times K)} = X'_{(N \times p)}\; R^{-1}_{(p \times p)}\; \Lambda_{(p \times K)}\; \Phi_{(K \times K)}$$

The regression method projects the data using the loadings while accounting for the correlations between variables (R) and between factors ($\Phi$).

Input Data

See the Principal Components Input Data.

Functions

package   function
HDMD      factor.pa.ginv(X)
psych     VSS.scree

While many factor analysis methods have been implemented in R, the factanal and factor.pa functions do not allow for singular covariance matrices, and several parameters for the factor rotations are hidden in these methods. Although it is standard to pre-rotate the loadings with an orthogonal Varimax rotation prior to an oblique Promax rotation, these steps are separated in factor.pa.ginv, which allows for greater flexibility. In addition, the power for the Promax rotation was fixed at m=4 in factor.pa, but it can be specified in the factor.pa.ginv function call (default m=4). For comparison with the SAS implementation employed by Atchley et al. (2005), m=3 is used in the following example.

Call

VSS.scree(AA54, main="Subset of 54 AA attributes Scree Plot")

Factor54 = factor.pa.ginv(AA54, nfactors = 5, m=3, prerotate=TRUE, rotate="Promax", scores="regression")
row.names(Factor54$scores) = row.names(AA54)

Factor54$loadings[order(Factor54$loadings[,1]),]
Factor54$scores

Resulting Output

> VSS.scree(AA54, main="Subset of 54 AA attributes Scree Plot")

Figure 4: Scree Plot for Factor Analysis (VSS.scree)

> Factor54 = factor.pa.ginv(AA54, nfactors = 5, m=3, prerotate=TRUE, rotate="Promax", scores="regression")
Could not solve for inverse correlation. Using general inverse ginv(r)
Warning message:
In smc(r) : Correlation matrix not invertible, smc's returned as 1s

> Factor54$loadings[order(Factor54$loadings[,1]),]
$\hat\Lambda$ (54 x 5)
...

Following the factor rotations, certain variable-factor relationships are accentuated while other variable-factor associations are minimized. Factor 1 has many high loadings on variables whose descriptions relate to polarity, accessibility, and hydrophobicity (PAH). Similarly, the variables emphasized in Factor 2 relate to the propensity for secondary structure (PSS), Factor 3 to molecular size (MS), Factor 4 to codon composition (CC), and Factor 5 to electrostatic charge (EC). Thus each factor can be represented by distinct amino acid attributes. The scores then convey the relationships among the amino acids for each factor. As expected, isoleucine and leucine have similar scores for each factor, while glycine and arginine have dissimilar values.

> Factor54$scores
S (20 x 5)

Table 2: Amino Acid Factor Scores from R

PCA and FA Comparison

As seen above, PCA and FA use similar computational methods to determine the major axes of variation and to reduce data dimensionality. The correspondence between the two methods relies heavily on the communality of the variables. For communalities close to 1, the majority of each variable's variation can be explained by the factor structure, and the diagonal of the correlation matrix in FA is the same as in PCA. In this case FA essentially reduces to PCA, as the unique variability approaches zero.

Table 3: Correlation of PCA and FA Score Estimates

Figure 5: PCA - FA Scores for Factor 1 and 2 (and 3)

Figure 6: Sorted Communality Estimates for FA

From the examples provided above, PCA and FA produce similar scores for the first factor/component, and only marginally similar scores for the remaining factors/components.
The similarity in scores is likely due to communality estimates close to 1.0: the correlation matrix in FA is then similar to that in PCA, resulting in an analogous eigenvector decomposition. This is not generally true, however, and PCA should not routinely be used in place of FA to describe latent structure.

Differences in FA for R and SAS

While R and SAS use the same methods to calculate the factor loadings, their implementations for estimating scores differ slightly. When N < p the covariance matrix is singular and a generalized inverse must be estimated. Atchley et al. (2005) used SAS for the calculations determining the latent structure of amino acids and thus obtained slightly different factor scores, as seen in Table 4. The correlation between the five factors from the R and SAS implementations is shown in Table 5, where rows correspond to R values and columns to SAS values. The scores are clearly similar, with the exception of Factors 3 and 5. This is possibly due to the treatment of factors whose variables are highly correlated with each other and with multiple factors.

Table 4: Factor Scores computed with SAS (Atchley et al. 2005)

Table 5: Correlation between R (row) and SAS (col) factor scores, K=5

Using 6 Factors

When the SAS and R Factor Analysis scores are compared with both methods estimating 6 factors, the correlation between the R and SAS results is much more evident (see Table 6). The individual factors may take on new interpretations, although the correlations of the R factor scores for K=5 and K=6, as well as of the SAS scores for K=5 and K=6, show that the factors are highly similar. The sixth factor may further explain the association between the inferred molecular size (ms) of Factor 3 and the polarity, accessibility, and hydrophobicity (pah) scores of Factor 1 when K=5.

Table 6: Correlation between R (row) and SAS (col) factor scores, K=6

Table 7: SAS Factor Scores (K=6)

Table 8: R Factor Scores (K=6)

Table 9: SAS Score Correlation for FA6 & FA5

Table 10: R Score Correlation for FA6 & FA5

Metric Transformation

The alphabetic representation of amino acids is a discrete coding that allows residues to be distinguished directly. However, the amino acid codes are letters with no underlying metric, which makes statistical analysis very difficult. Transforming the alphabetic letters into a biologically realistic set of numerical values greatly facilitates computation, and further incorporating the correlation among amino acids allows for sophisticated statistical analyses. By evaluating 54 amino acid indices, Atchley et al. (2005) found that 5 factors explain 83% of the analyzed variation. Converting a single vector of amino acid residues into 5 numeric vectors representing Polarity, Accessibility, and Hydrophobicity (pah), Propensity for Secondary Structure (pss), Molecular Size (ms), Codon Composition (cc), and Electrostatic Charge (ec) establishes a platform capable of handling rigorous statistical techniques such as analysis of variance, regression, and discriminant analysis.

Methodology

Using Factor Analysis, Atchley et al. (2005) identified 5 factors quantifying amino acid variability. For a single amino acid sequence, each residue is associated with 5 uncorrelated, informative numeric values: pah, pss, ms, cc, and ec. Each metric transformation can then be analyzed independently in subsequent analyses, as seen in the Discriminant Analysis section below. The conversion is sketched conceptually after this paragraph and is illustrated in Figure 7.
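Conceptually, the transformation is a row lookup in a 20 x 5 factor score matrix such as Table 2: each residue letter is replaced by its five factor scores. A minimal sketch, assuming AAMetric is the package's 20 x 5 score matrix with single-letter amino acid row names (as passed to FactorTransform below):

seq <- "MRAHN"                      # a short illustrative peptide
residues <- strsplit(seq, "")[[1]]  # split into single-letter residues
metric <- AAMetric[residues, ]      # look up pah, pss, ms, cc, ec for each site
metric                              # one row of five factor scores per residue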
Figure 7: Metric Solution Conversion (amino acid letters converted to the metric values pah, pss, ms, cc, and ec)

Input Data

bHLH288 contains 288 named sequences grouped into 5 categories representing their DNA binding affinities. The 5 groups are designated by their E-box specificity and the presence of additional domains: Group A binds the CAGCTG E-box motif; Group B binds the CACGTG E-box motif and is the most prevalent; Group C has an additional PAS domain; Group D lacks a basic region; and Group E binds the CACG[C/A]G N-box motif. Each bHLH sequence has 51 sites with no gaps. A representative subset is displayed here to show the sequence variability among the groups.

Table 11: Representative sample of bHLH Amino Acid Sequences with group designation

Functions

package   function
HDMD      FactorTransform

Call

AA54_MetricList_Factor1 = FactorTransform(as.vector(bHLH288[,2]), SeqName=names(bHLH288), Replace=AAMetric)

AA54_MetricFactor1 = matrix(unlist(AA54_MetricList_Factor1), nrow = length(AA54_MetricList_Factor1), byrow = TRUE, dimnames = list(names(AA54_MetricList_Factor1)))
AA54_MetricFactor1

Resulting Output

While the entire transformation from amino acid characters to the PAH metric is stored in AA54_MetricFactor1, a representative subset is shown here using the following commands:

AA54_MetricSubset = AA54_MetricList_Factor1[c(20:25, 137:147, 190:196, 220:229, 264:273)]
AA54_MetricSubset_Factor1 = matrix(unlist(AA54_MetricSubset), nrow = length(AA54_MetricSubset), byrow = TRUE, dimnames = list(names(AA54_MetricSubset)))

> AA54_MetricSubset_Factor1

Discriminant Function Analysis (DFA)

Discriminant Function Analysis (DFA) can be used for both exploratory and confirmatory classification of high dimensional correlated data. Similar to PCA and FA, DFA uses linear combinations of variables to summarize patterns of variation in the data. In DFA, the coefficients are estimated so as to minimize within-class variation and maximize between-class variation. Figure 8 shows how two variables cannot discriminate Group A from Group B independently, yet a linear function weighting both variables separates the groups easily. The coefficients quantify the relative importance of each variable. One method for determining the closeness of groups is the Mahalanobis distance, which accounts for the correlation among variables; when the variables are uncorrelated, the Mahalanobis distance is simply the Euclidean distance.

Figure 8: LDA for 2 classes

Methodology

Since the goal of DFA is to determine a set of linear functions that discriminate groups, most of the DFA procedure is group centric. This means that variables, and in this case sites, are normalized and transformed according to the group they belong to. First, to standardize the data, each element is centered by its group mean:

$$x'_{ij} = x_{ij} - \mu_{j,g(i)}$$

where g(i) denotes the group of observation i and $\mu_{j,g}$ is the mean of variable j in group g. A scaling factor is then defined by

$$F = \mathrm{diag}\!\left(\frac{1}{\sigma_1},\ldots,\frac{1}{\sigma_p}\right)$$

where $\sigma_j^2$ is the variance of variable j computed from the group-centered data X'. Using these standard estimators of mean and variance, the data are normalized as

$$Z = \frac{1}{\sqrt{N-m}}\; X'F$$

where m is the number of groups. Singular Value Decomposition (SVD) decomposes the data matrix Z into two orthonormal matrices and a diagonal scaling matrix, $Z = UDV^{T}$. The rank r is determined by the number of elements of D larger than some tolerance value (t = 0.0001).
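In R this step amounts to a call to svd(); a minimal sketch, where Z is the normalized matrix defined above:

sv <- svd(Z)           # Z = U D V^T; singular values in sv$d, singular vectors in sv$u and sv$v
tol <- 1e-4            # tolerance on the singular values
r <- sum(sv$d > tol)   # rank = number of singular values above the tolerance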
The scaling coefficients are then updated using this decomposition:

$$F' = F\,V_{[\,,1:r]}\,\mathrm{diag}\!\left(\frac{1}{d_1},\ldots,\frac{1}{d_r}\right)
= \mathrm{diag}\!\left(\frac{1}{\sigma_1},\ldots,\frac{1}{\sigma_p}\right)
\begin{bmatrix} v_{11} & \cdots & v_{1r} \\ \vdots & & \vdots \\ v_{p1} & \cdots & v_{pr} \end{bmatrix}
\mathrm{diag}\!\left(\frac{1}{d_1},\ldots,\frac{1}{d_r}\right)$$

This scaling minimizes the variance among variables within groups.

The group mean values are similarly decomposed by SVD to determine linear discriminants that maximize the variance between the group means. First the group matrix is initialized and scaled:

$$G = \mathrm{diag}\!\left(\sqrt{\frac{N\pi_1}{m-1}},\ldots,\sqrt{\frac{N\pi_m}{m-1}}\right)
\begin{bmatrix} \bar{x}_{1,g_1}-\bar{x}_1 & \cdots & \bar{x}_{p,g_1}-\bar{x}_p \\ \vdots & & \vdots \\ \bar{x}_{1,g_m}-\bar{x}_1 & \cdots & \bar{x}_{p,g_m}-\bar{x}_p \end{bmatrix} F'$$

where $\bar{x}_{j,g_c}$ is the mean of variable j in group c, $\bar{x}_j = \sum_{c=1}^{m} \pi_c\,\bar{x}_{j,g_c}$, $m_c$ is the number of observations in group c, and $\pi_c$ is the prior probability of group c. Unless otherwise specified, $\pi_c = m_c/N$ by default.

A second round of SVD is performed, this time on the group matrix G, so that $G = U_G D_G V_G^{T}$. The final matrix transformation maximizes the between-group distances through additional scaling, resulting in the coefficient matrix $\hat{F} = F' V_G$. The scaling factor is thus normalized so that the within-group covariance is spherical. The observations are transformed by the scaling coefficient matrix $\hat{F}$, and the scores $S = X\hat{F}$ maximize the variance between groups while minimizing the variance within groups.

To quantify group similarity, the mahalanobis function was used to measure the distance between group means while accounting for the correlation of the variables. In general,

$$D_m^2(x) = (x - \mu)\,\Sigma^{-1}\,(x - \mu)^{T}$$

where $\Sigma$ is the covariance matrix of the centered and scaled data X with mean $\mu$, making $\Sigma$ a correlation matrix. To calculate the Mahalanobis distance between groups b and c, the means of the two groups are compared over all K variables while accounting for the variable correlations:

$$D^2_{g_b,g_c} =
\begin{bmatrix} \mu_{1g_b}-\mu_{1g_c} & \cdots & \mu_{Kg_b}-\mu_{Kg_c} \end{bmatrix}
\begin{bmatrix} 1 & r_{12} & \cdots & r_{1K} \\ r_{21} & 1 & & \vdots \\ \vdots & & \ddots & \\ r_{K1} & \cdots & & 1 \end{bmatrix}^{-1}
\begin{bmatrix} \mu_{1g_b}-\mu_{1g_c} \\ \vdots \\ \mu_{Kg_b}-\mu_{Kg_c} \end{bmatrix}$$

If R is the identity matrix, with all off-diagonal elements equal to zero, there is no correlation between the variables and the Mahalanobis distance reduces to the Euclidean distance.

Input Data

In this example, the bHLH288 data of 288 sequences over 51 sites are transformed using the PAH amino acid factor transformation matrix described in the previous section. Sequence names have been dropped and the sequences are instead numbered for simplicity. Graphs for the PSS, MS, CC, and EC metrics were produced similarly.

Table 12: Metric Transformation of bHLH Sequences using AAMetric (subset shown)

Functions

package   function
MASS      lda
HDMD      pairwise.mahalanobis(X, grouping = NULL)

The mahalanobis function in the stats package determines the Mahalanobis distance between a vector and the mean of the data. In many instances multiple distance measurements are desired, such as all pairwise distances among a set of groups. pairwise.mahalanobis takes a data matrix X and determines all pairwise distances between groups. If a separate grouping vector is not specified, the function assumes the first column of X defines the groups.
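As a quick numerical check of the statement that the Mahalanobis distance reduces to the Euclidean distance when variables are uncorrelated, the two can be compared directly using the mahalanobis function from the stats package with an identity covariance matrix:

x  <- c(2, -1, 3)                                   # an arbitrary observation
mu <- c(0,  1, 1)                                   # an arbitrary mean vector
d2 <- mahalanobis(x, center = mu, cov = diag(3))    # squared Mahalanobis distance, identity covariance
sqrt(d2)                                            # equals the Euclidean distance
sqrt(sum((x - mu)^2))                               # Euclidean distance computed directly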
Call

Based on the factor scores determined by the R calculations in the FA and Metric Transformation sections above:

AA54_MetricList_Factor1 = FactorTransform(as.vector(bHLH288[,2]), Replace = AAMetric)
grouping = bHLH288[,1]

AA54_MetricFactor1 = matrix(unlist(AA54_MetricList_Factor1), nrow = length(AA54_MetricList_Factor1), byrow = TRUE, dimnames = list(names(AA54_MetricList_Factor1)))

AA54_lda_Metric1 = lda(AA54_MetricFactor1, grouping)

AA54_lda_RawMetric1 = as.matrix(AA54_MetricFactor1) %*% AA54_lda_Metric1$scaling
AA54_lda_RawMetric1Centered = scale(AA54_lda_RawMetric1, center = TRUE, scale = FALSE)

plot(-1*AA54_lda_RawMetric1Centered[,1], -1*AA54_lda_RawMetric1Centered[,2], pch = grouping, xlab="Canonical Variate 1", ylab="Canonical Variate 2", main="DA Scores (Centered Raw Coefficients)\nusing Factor1 (pah) from R transformation")
lines(c(0,0), c(-15,15), lty="dashed")
lines(c(-35,25), c(0,0), lty="dashed")

Mahala_1 = pairwise.mahalanobis(AA54_lda_RawMetric1Centered, grouping)
D = sqrt(Mahala_1$distance)
rownames(D) = colnames(D) = c("A", "B", "C", "D", "E")
round(D, digits=3)

Resulting Output

> AA54_lda_Metric1
$\hat{F}$ (p x K)
...

> Mahala_1
D^2 (5 x 5 matrix of squared pairwise group distances)
...

> round(D, digits=3)
...

> plot(-1*AA54_lda_RawMetric1Centered[,1], -1*AA54_lda_RawMetric1Centered[,2], pch = grouping, xlab="Canonical Variate 1", ylab="Canonical Variate 2", main="DA Scores (Centered Raw Coefficients)\nusing Factor1 (pah) from R transformation", xlim = c(-30,20), ylim=c(-11,10))
> lines(c(0,0), c(-15,15), lty="dashed")
> lines(c(-35,25), c(0,0), lty="dashed")

Figure 9: LDA for Factor1 using R Metric Transformation

Figure 10: LDA for Factor2 (pss) using R Metric Transformation

Figure 11: LDA for Factor3 (ms) using R Metric Transformation

Figure 12: LDA for Factor4 (cc) using R Metric Transformation

Figure 13: LDA for Factor5 (ec) using R Metric Transformation

References

Atchley, W. R. and A. D. Fernandes (2005). "Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network." Proc Natl Acad Sci USA 102(18): 6401-6406.

Atchley, W. R., J. Zhao, et al. (2005). "Solving the protein sequence metric problem." Proc Natl Acad Sci USA 102(18): 6395-6400.

Shlens, J. (2009). "A Tutorial on Principal Component Analysis." Center for Neural Science, New York University. Version 3.01, April 22, 2009. <http://www.snl.salk.edu/~shlens/notes.html>

Nakai, K., Kidera, A., and Kanehisa, M. (1988). "Cluster analysis of amino acid indices for prediction of protein structure and function." Protein Eng. 2: 93-100. [PMID: 3244698]

Tomii, K. and Kanehisa, M. (1996). "Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins." Protein Eng. 9: 27-36. [PMID: 9053899]

Kawashima, S., Ogata, H., and Kanehisa, M. (1999). "AAindex: amino acid index database." Nucleic Acids Res. 27: 368-369. [PMID: 9847231]

Kawashima, S. and Kanehisa, M. (2000). "AAindex: amino acid index database." Nucleic Acids Res. 28: 374. [PMID: 10592278]

Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M. (2008). "AAindex: amino acid index database, progress report 2008." Nucleic Acids Res. 36: D202-D205. [PMID: 17998252]