High Dimensional Molecular Data
Tutorial in R
An introduction to the HDMD package
Prepared for the ICMSB Workshop
Shanghai, China 2009
Lisa McFerrin and William Atchley
Table of Contents
FIGURE INDEX
TABLE INDEX
INSTALLING THE HDMD PACKAGE
INTRODUCTION
    IN THIS TUTORIAL
    NOTATION
PRINCIPAL COMPONENTS ANALYSIS (PCA)
    METHODOLOGY
    INPUT DATA
    FUNCTIONS
    CALL
    RESULTING OUTPUT
FACTOR ANALYSIS (FA)
    METHODOLOGY
        ESTIMATE COMMUNALITY
        VARIMAX ROTATION (ORTHOGONAL)
        PROMAX ROTATION (OBLIQUE)
    INPUT DATA
    FUNCTIONS
    CALL
    RESULTING OUTPUT
    PCA AND FA COMPARISON
    DIFFERENCES IN FA FOR R AND SAS
    USING 6 FACTORS
METRIC TRANSFORMATION
    METHODOLOGY
    INPUT DATA
    FUNCTIONS
    CALL
    RESULTING OUTPUT
DISCRIMINANT FUNCTION ANALYSIS (DFA)
    METHODOLOGY
    INPUT DATA
    FUNCTIONS
    CALL
    RESULTING OUTPUT
REFERENCES
Figure Index
Figure 1: Scree Plot for Principal Components (screeplot)
Figure 2: Latent Variable Structure
Figure 3: Visualization of Rotation
Figure 4: Scree Plot for Factor Analysis (VSS.scree)
Figure 5: PCA - FA Scores for Factor 1 and 2 (and 3)
Figure 6: Sorted Communality Estimates for FA
Figure 7: Metric Solution Conversion
Figure 8: LDA for 2 classes
Figure 9: LDA for Factor1 using R Metric Transformation
Figure 10: LDA for Factor2 (pss) using R Metric Transformation
Figure 11: LDA for Factor3 (ms) using R Metric Transformation
Figure 12: LDA for Factor4 (cc) using R Metric Transformation
Figure 13: LDA for Factor5 (ec) using R Metric Transformation
Table Index
Table 1: Amino Acid Values for 54 Quantifiable Attributes (Standardized)
Table 2: Amino Acid Factor Scores from R
Table 3: Correlation of PCA and FA Score Estimates
Table 4: Factor Scores Computed with SAS (Atchley et al. 2005)
Table 5: Correlation between R (row) and SAS (col) Factor Scores, K=5
Table 6: Correlation between R (row) and SAS (col) Factor Scores, K=6
Table 7: SAS Factor Scores (K=6)
Table 8: R Factor Scores (K=6)
Table 9: SAS Score Correlation for FA6 & FA5
Table 10: R Score Correlation for FA6 & FA5
Table 11: Representative Sample of bHLH Amino Acid Sequences with Group Designation
Table 12: Metric Transformation of bHLH Sequences using AAMetric (subset shown)
Installing the HDMD package
A CRAN package is in preparation and will be submitted to CRAN at http://cran.r-project.org/. Currently, the HDMD package along with supplementary materials is available at www4.ncsu.edu/~lgmcferr/ICMSBWorkshop. Updates and package instructions may be found there.
To run the HDMD functions, download HDMD_1.0.tar.gz located at www4.ncsu.edu/~lgmcferr/ICMSBWorkshop into a directory and unpack it from the command line using "tar -xvf HDMD_1.0.tar.gz". From the same directory, use the command "R CMD INSTALL HDMD". The HDMD package can then be loaded using the R package manager.
Several other packages are required for full functionality of the HDMD package. While MASS, stats, and psych are necessary for HDMD internal computation, scatterplot3d is optional and simply recommended in the examples for viewing purposes. Since psych is not included with the base R distribution, it must be downloaded; the initial call install.packages("psych") will install it from an external source (CRAN).
Routines implemented in HDMD are appropriate for PCA, FA, and DFA when the number of variables exceeds the number of observations. HDMD also provides a function to convert amino acid sequences into a numeric (metric) representation. For more information, see www4.ncsu.edu/~lgmcferr/ICMSBWorkshop.
Version: 1.0
Depends: MASS, psych
Suggests: scatterplot3d
Published: 2009-06-08
Author: Lisa McFerrin
Maintainer: Lisa McFerrin <lgmcferr at ncsu dot edu>
URL: www4.ncsu.edu/~lgmcferr/ICMSBWorkshop
Package source: HDMD_1.0.tar.gz
Examples: HDMD_FunctionCalls.R
Introduction
This tutorial introduces the concept of and relevant statistical methodology for "latent variables" as they relate to analyses of high dimensional molecular data (HDMD). A latent variable model relates a set of observed or manifest variables to a set of underlying latent variables. Latent variables are not directly observed but are rather inferred through a mathematical model constructed from variables that are observed and directly measured.
HDMD typically contain thousands of data points arising from a substantially smaller number of sampling units. Such data present many complexities. They are typically highly interdependent, exhibit complex underlying components of variability, and meaningful replication is rare. The workshop will focus on three multivariate statistical methods that can facilitate description and analysis of the latent variables and latent structure inherent to HDMD. These statistical methods are intended to reduce the dimensionality of HDMD so that biologically meaningful patterns of multidimensional covariability are exposed and relevant biological questions can be explored.
In this Tutorial
The tutorial and HDMD package were prepared for a workshop on HDMD to introduce principal components analysis, common factor analysis, and discriminant analysis. A metric transformation converting amino acid residues into numerical, biologically informative values will also be covered. Using exemplar published datasets, the relative suitability of these methods for exploring a series of important biological questions will be briefly touched upon. The three methods will be demonstrated using a set of R programs together with annotated output and documentation. Specific examples will be covered, with explicit function calls and output shown to provide a complete walkthrough of an HDMD analysis. Annotations of the HDMD package functions will NOT be covered, but can be found in the HDMD manual.
This software has not been stringently tested. Please report any errors or bugs to Lisa McFerrin at lgmcferr@ncsu.edu. More information and package instructions can be found at www4.ncsu.edu/~lgmcferr/ICMSBWorkshop.
Notation
... ... ...        indicates only a portion of the table or output is shown
p                  number of variables
N                  number of observations
K                  number of components, factors, or discriminants
X                  Data Matrix (N x p)
X'                 Normalized Data Matrix, centered and/or scaled (N x p)
\mu_j              mean of variable j
\Sigma             Covariance Matrix (p x p)
\sigma_i^2         Variance for variable i
\sigma_{ij}^2      Covariance for variables i and j
R                  Correlation Matrix (p x p)
\lambda            Eigenvalue vector of p eigenvalues (1 x p)
V                  Eigenvector matrix (p x p)
diag(z)_{m x m}    m x m matrix with diagonal elements set to z and off-diagonal elements set to 0
Principal Components Analysis (PCA)
Principal Component Analysis (PCA) is a data reduction tool. Generally speaking, PCA provides a framework for minimizing data dimensionality by identifying linear combinations of variables, called principal components, that maximally represent variation in the data. Principal axes linearly fit the original data, so the first principal axis minimizes the sum of squared residuals over all observations and thus maximally reduces residual variation. Each subsequent principal axis maximally accounts for variation in the residual data and acts as the line of best fit orthogonal to all previously defined axes. Principal components represent the correlation between the variables and the corresponding principal axes. Conceptually, PCA is a greedy algorithm fitting each axis to the data while conditioning on all previously defined axes. Principal component scores project the original data onto these axes, where the axes are ordered such that Principal Component 1 (PC1) accounts for the most variation, followed by PC2, PC3, ..., PCp for p variables (dimensions). Since the PCs are orthogonal, each component independently accounts for data variability and the Percent of Total Variation explained (PTV) is cumulative. PCA offers as many principal components as variables in order to explain all variability in the data. However, only a subset of these principal components is notably informative. Since variability is concentrated in the leading PCs, many of the remaining PCs account for little variation and can be disregarded, retaining maximal variability with reduced dimensionality.
While Singular Value Decomposition (SVD) can be used for PCA, we will focus on the eigenvector decomposition method employed in R. The eigenvalue

\lambda_k = \sum_{i=1}^{p} a_{ik}^2

is the variance explained by PC_k, where a_{ik} is the loading for variable i on component k. As an example, if 90% of the total variation should be retained in the model for p-dimensional data, the first K principal components should be kept such that

PTV = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \ge 0.90
PTV acts as a signal-to-noise measure, and it flattens as additional components are added. Typically, the number of informative components K is chosen using one of three methods: 1) Kaiser's eigenvalue > 1 rule; 2) Cattell's scree plot; or 3) Bartlett's test of sphericity. Often K << p, as seen in the following example. Thus PCA is extremely useful for data compression and dimensionality reduction, since the process optimizes over the total variance. However, the reduced set of loading values relating variables and PCs may be difficult to interpret. While PCA provides a unique solution, the loading coefficients may not distinguish which variables contribute to variation along a particular principal axis.
As an additional resource, Shlens provides a very useful PCA tutorial, including basic methods and explanations.
Methodology
The first step in PCA is to create a mean-centered data matrix X' by subtracting
variable means from each data element in X. This centers the data so each column of X' has
mean 0.
X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{N1} & \cdots & x_{Np} \end{bmatrix} \quad \text{with column means } \mu_1, \ldots, \mu_p

X' = \begin{bmatrix} x_{11} - \mu_1 & \cdots & x_{1p} - \mu_p \\ \vdots & & \vdots \\ x_{N1} - \mu_1 & \cdots & x_{Np} - \mu_p \end{bmatrix}, \qquad x'_{ij} = x_{ij} - \mu_j
Next a covariance matrix \Sigma is calculated for X':

\Sigma = \frac{1}{N} X'^{T} X' = \begin{bmatrix} \sigma_{11}^2 & \cdots & \sigma_{1p}^2 \\ \vdots & & \vdots \\ \sigma_{p1}^2 & \cdots & \sigma_{pp}^2 \end{bmatrix} \quad \text{where } \sigma_{ij}^2 = \sigma_{ji}^2
Using eigenvector decomposition, the resulting transformation aims to diagonalize the covariance matrix, creating a set of uncorrelated eigenvectors that span the data. Eigenvectors are variable weightings used as coefficients in a linear combination, so that each successive eigenvector accounts for the most residual variability. The set of eigenvectors comprises the principal component matrix, where the corresponding eigenvalues quantify the variance explained by each vector (PC). Each element v_{ij} is called a loading and represents the correlation of variable i with principal component j.

V = \begin{bmatrix} v_{11} & \cdots & v_{1p} \\ \vdots & & \vdots \\ v_{p1} & \cdots & v_{pp} \end{bmatrix}

where rows correspond to variables, columns to principal components, and the associated eigenvalues are \lambda_1, \ldots, \lambda_p.
Since principal components account for decreasing amounts of variation in redundant data, it is not necessary to retain all p components. Several methods propose cutoff values for reducing the number of components. The most direct is Kaiser's method of choosing all components with \lambda > 1. This ensures that each retained component explains at least as much variance as a single standardized variable. Another frequently used method is Cattell's scree plot, which plots the eigenvalues in decreasing order. The number of components retained is determined by the elbow, where the curve becomes asymptotic and additional components provide little additional information.
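Both cutoffs can be checked directly from the eigenvalues. This is a minimal sketch, assuming ev holds the eigenvalues in decreasing order (for example ev = AA54_PCA$sdev^2 from the Call section below).
ev <- AA54_PCA$sdev^2                                 # eigenvalues, largest first
K_kaiser <- sum(ev > 1)                               # Kaiser: keep components with eigenvalue > 1
K_ptv <- which(cumsum(ev) / sum(ev) >= 0.90)[1]       # smallest K explaining at least 90% of variance
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue")   # scree plot: look for the elbow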

Principal component scores are determined by using loadings as coefficients and weighting
observations in a linear combination to project the data onto the principal axes.


S = X'V = \begin{bmatrix} x_{11} - \mu_1 & \cdots & x_{1p} - \mu_p \\ \vdots & & \vdots \\ x_{N1} - \mu_1 & \cdots & x_{Np} - \mu_p \end{bmatrix} \begin{bmatrix} v_{11} & \cdots & v_{1p} \\ \vdots & & \vdots \\ v_{p1} & \cdots & v_{pp} \end{bmatrix} = \begin{bmatrix} s_{11} & \cdots & s_{1p} \\ \vdots & & \vdots \\ s_{N1} & \cdots & s_{Np} \end{bmatrix}

where the rows of S are observations and the columns are principal components.
Data observations can be estimated using scores and loadings, similar to a linear regression model. The first principal component gives the best single-component estimate of the observations, as it accounts for the most variation. Observation i, having p original explanatory variables and K informative components, can thus be estimated with the following equation:

\hat{x}_i = \mu + V S_i^{T}

where S_i is the (1 x K) vector of scores for observation i and V holds the first K loading vectors.
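A minimal sketch of this reconstruction, assuming AA54 and AA54_PCA are defined as in the Call section below and K is the number of retained components; scores and loadings come straight from the fitted object.
K  <- 7
mu <- colMeans(AA54)                           # variable means (near zero, since AA54 is standardized)
Vk <- AA54_PCA$loadings[, 1:K]                 # p x K loading matrix
Sk <- AA54_PCA$scores[, 1:K]                   # N x K score matrix
AA54_hat <- sweep(Sk %*% t(Vk), 2, mu, "+")    # x_i estimated as mu + V S_i for every observation
max(abs(AA54_hat - as.matrix(AA54)))           # reconstruction error shrinks as K grows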

Input Data
Taken from AAIndex (http://www.genome.jp/aaindex/), amino acids can be quantified by multiple structural, chemical, and functional attributes. Following Atchley, Zhao et al. (2005), a subset of 54 informative indices describes the similarity and variability among amino acids. The variables in AA54 have been centered and scaled so that the mean is zero and the variance is one for each of the 54 amino acid indices.
Observations: Amino Acids (N = 20)
Variables: Quantifiable attributes or indices taken from AAIndex (p = 54)
Table 1: Amino Acid Values for 54 Quantifiable Attributes (Standardized), i.e. the data matrix X' (20 x 54)
Functions
package    function
stats      princomp(X, covmat = cov.wt(X))
stats      screeplot
Implementation
Given a dataset X of N observations and p variables, princomp will return an error if N < p stating "'princomp' can only be used with more units than variables". When N < p, the p x p covariance matrix has rank at most N and is therefore singular. This is a very common problem with HDMD. While the covariance can still be calculated, princomp does not internally permit this case. Typically a generalized inverse solution is used to circumvent this problem. A workaround here is to supply the weighted covariance matrix in the princomp function call. The weighted covariance cov.wt returns a list with both the covariance and the centers (means) of each column (variable). The R function cov simply returns the covariance matrix and does not define the centers. In order to calculate the scores in princomp, centers must be defined, so cov.wt must be used.
Call
library(scatterplot3d)   # suggested package, needed only for the 3D score plot below
AA54_PCA = princomp(AA54, covmat = cov.wt(AA54))
screeplot(AA54_PCA, type="lines", npcs=length(AA54_PCA$sdev), main="Principal Components Scree Plot")
PTV = AA54_PCA$sdev^2 / sum(AA54_PCA$sdev^2)
CTV = cumsum(PTV)
TV = rbind(Lambda=round(AA54_PCA$sdev^2, digits=5), PTV=round(PTV, digits=5), CTV=round(CTV, digits=5))
TV
AA54_PCA$sdev
AA54_PCA$loadings
AA54_PCA$scores
PC3d = scatterplot3d(AA54_PCA$scores[,1:3], pch = AminoAcids, main="Principal Component Scores", box = F, grid=F)
PC3d$plane3d(c(0,0,0), col="grey")
PC3d$points3d(c(0,0), c(0,0), c(-3,2), lty="solid", type="l" )
PC3d$points3d(c(0,0), c(-1.5,2), c(0,0), lty="solid", type="l" )
PC3d$points3d(c(-1.5,2), c(0,0), c(0,0), lty="solid", type="l" )
PC3d$points3d(AA54_PCA$scores[hydrophobic,1:3], col="blue", cex = 2.7, lwd=1.5)
PC3d$points3d(AA54_PCA$scores[polar,1:3], col="green", cex = 3.3, lwd=1.5)
PC3d$points3d(AA54_PCA$scores[small,1:3], col="orange", cex = 3.9, lwd=1.5)
legend(x=5, y=4.5, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21, box.lty =0)
Resulting Output
> AA54_PCA = princomp(AA54, covmat = cov.wt(AA54))
Warning message:
In princomp.default(AA54, covmat = cov.wt(AA54)) :
  both 'x' and 'covmat' were supplied: 'x' will be ignored
> screeplot(AA54_PCA, type="lines", npcs=length(AA54_PCA$sdev), main="Principal Components Scree Plot")
Figure 1: Scree Plot for Principal Components (screeplot)
> AA54_PCA$sdev
= the component standard deviations, i.e. the square roots of the eigenvalues (output not shown)
> AA54_PCA$loadings
= V (54 x 54), the loading matrix (output not shown)
Loading coefficients < 0.10 are suppressed in printing by default. To specify a different cutoff threshold, set the decimal precision, or sort the loading coefficients, use
> print(AA54_PCA$loadings, digits=2, cutoff=0, sort=TRUE)
> AA54_PCA$scores
= S (20 x 54), the principal component scores
... ... ...
> TV
= eigenvalues, PTV, and cumulative PTV for each component (output not shown)
PCA Score Plots
> plot(AA54_PCA$scores[,1:2], pch = AminoAcids, main="Principal Component Scores")
> points(AA54_PCA$scores[hydrophobic,1:2], col="blue", cex = 2.7, lwd=1.5)
> points(AA54_PCA$scores[polar,1:2], col="green", cex = 3.3, lwd=1.5)
> points(AA54_PCA$scores[small,1:2], col="orange", cex = 3.9, lwd=1.5)
> legend(x=2, y=8, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21)
> PC3d = scatterplot3d(AA54_PCA$scores[,1:3], pch = AminoAcids, main="Principal Component Scores", box = FALSE, grid=F)
> PC3d$plane3d(c(0,0,0), col="grey")
> PC3d$points3d(c(0,0), c(0,0), c(-3,2), lty="solid", type="l" )
> PC3d$points3d(c(0,0), c(-1.5,2), c(0,0), lty="solid", type="l" )
> PC3d$points3d(c(-1.5,2), c(0,0), c(0,0), lty="solid", type="l" )
> PC3d$points3d(AA54_PCA$scores[hydrophobic,1:3], col="blue", cex = 2.7, lwd=1.5)
> PC3d$points3d(AA54_PCA$scores[polar,1:3], col="green", cex = 3.3, lwd=1.5)
> PC3d$points3d(AA54_PCA$scores[small,1:3], col="orange", cex = 3.9, lwd=1.5)
> legend(x=5, y=4.5, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21, box.lty =0)
As seen in the plots of PC scores, amino acids are grouped by similarity and 35% and 57%
of the cumulative total variance is explained with just two and three components,
respectively. Including K=7 components explains over 90% of the total variance.
Factor Analysis (FA)
Factor Analysis (FA) is a dimension reduction tool that estimates the latent variable
structure of data by partitioning variability into that common to all variables and a residual
value unique or specific to each variable. FA differs from PCA by estimating the communality of each variable so as to distinguish variation that is unrelated to other variables, or due to error, from variation that can be explained by a common factor. By separating these sources
of variation, FA decomposes the HDMD into an interpretable structure comprised of
explanatory factors acting on multiple variables. Each factor represents a latent variable
with loadings or coefficients relating observed variables to the factor. For model
estimation, the number of factors must first be defined. Similar to PCA, FA can use Cattell's
scree plot to determine the number of informative factors.
A simplistic diagram is shown in Figure 2 where two factors affect the four observed
variables each with their own amount of unique variability. Loadings are calculated for all
factor and variable combinations, although certain variables may be more closely
associated to a particular factor than others. In this example, Factor1 has high correlation
to variables 1, 2, and 3 while Factor2 is highly correlated to only variables 3 and 4.
Figure 2: Latent Variable Structure (diagram: latent Factor1 and Factor2 load on Variable1 through Variable4, each variable retaining its own unique variability)
Note also that factors may be correlated among themselves. In a Varimax rotation, factors are defined to be uncorrelated and are represented by orthogonal vectors. Contrastingly, Promax implements an oblique rotation so that factors can be related. Conventionally, when a Promax rotation is applied, a Varimax rotation is first implemented. Because of this rotation procedure, Factor Analysis does not produce a unique solution.
Figure 3 (from <http://www.mega.nu/ampp/rummel/ufa.htm>) displays how factors 1 and 2 form orthogonal axes to account for maximal variation among the 8 example variables. Orthogonal and oblique rotations further fit the axes to the variables, creating sparsity. This emphasizes some variable-factor relationships while reducing other variable-factor associations. While the variance explained by each factor does not change, rotating the loadings by an orthogonal matrix alters the coefficients and can lead to slightly different interpretations. In this regard, it is important to interpret results carefully.
Figure 3: Visualization of Rotation
Methodology
Factor Analysis differs from PCA in that it separates and estimates common and unique variability. The model equation is

x_i = \mu + \Lambda S_i^{T} + \varepsilon_i

\begin{bmatrix} x_{i1} \\ \vdots \\ x_{ip} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_p \end{bmatrix} + \begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1K} \\ \vdots & & \vdots \\ \lambda_{p1} & \cdots & \lambda_{pK} \end{bmatrix} \begin{bmatrix} s_{i1} \\ \vdots \\ s_{iK} \end{bmatrix} + \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{ip} \end{bmatrix}

where \varepsilon is the estimated unique variation, including error. Since the unique and common components must be estimated, the factor loadings are not the initial eigenvector solution as in PCA.

Factor Analysis first standardizes the data matrix X by both centering and scaling each
element so that each column of X' has mean 0 and variance 1.
N
x
x11

X  

x N1
x1p 


x Np 

   1
p 
x  
 11 1 
1


X'

x  
 N1 1 
1

x1p   p
x ij   j
j

 p 




x Np   p
 p 


is the Root Mean Square. The covariance matrix of X' is then the typical
i1

correlation matrix with diagonal elements scaled to 1.
j
2
i
 1

R  X'  
 2


 p1

2 
 1p

where  ij2   2ji

1 

In order to determine the amount of variability that can be explained by the factor structure, the diagonal elements of R are replaced with the estimated communality h_j^2 of each variable. The default method initializes h_j^2 with the squared multiple correlation (SMC), which estimates the correlation of variable j with all other variables:

h_j^2 = 1 - \frac{1}{R^{-1}_{jj}}

where R^{-1}_{jj} is the j-th diagonal element of the inverse correlation matrix. If the number of factors K is half the number of variables, or imaginary eigenvalues are encountered in the first iteration, then communality is initialized to 1. Total communality is simply defined as the sum of the communalities of each variable, h^2 = \sum_{j=1}^{p} h_j^2. Iteratively decomposing the correlation matrix into its eigenvector structure and updating the diagonal elements with the sum of squares of each vector estimates the common variance. The process is as follows:
Estimate Communality
1) Initialize the communality and the correlation matrix R:

Comm_0 = \sum_{j=1}^{p} h_j^2, \quad \text{where } h_j^2 = 1 - \frac{1}{R^{-1}_{jj}} \text{ or } h_j^2 = 1, \qquad R = \begin{bmatrix} h_1^2 & \sigma_{12}^2 & \cdots & \sigma_{1p}^2 \\ \vdots & \ddots & & \vdots \\ \sigma_{p1}^2 & \cdots & & h_p^2 \end{bmatrix}

2) Solve the eigenvector structure of R:

\text{Loadings } \Lambda = V_K \,\mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_K}) = \begin{bmatrix} \sqrt{\lambda_1}\, v_{11} & \cdots & \sqrt{\lambda_K}\, v_{1K} \\ \vdots & & \vdots \\ \sqrt{\lambda_1}\, v_{p1} & \cdots & \sqrt{\lambda_K}\, v_{pK} \end{bmatrix}

3) Determine and update the communality and the diagonal of R:

h_j^2 = \sum_{k=1}^{K} \lambda_k v_{jk}^2, \qquad R_{jj} = h_j^2, \qquad Comm_t = \sum_{j=1}^{p} h_j^2

4) Iterate steps 2-3 until the communality converges:

|Comm_t - Comm_{t-1}| \le c, where c is a threshold of convergence
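The loop below is a compact sketch of this iteration (illustrative only, not the factor.pa.ginv source), assuming R0 is the p x p correlation matrix and K the chosen number of factors.
h2 <- 1 - 1 / diag(solve(R0))        # SMC start values; if R0 is singular, initialize h2 to 1 instead
repeat {
  R <- R0
  diag(R) <- h2                      # replace the diagonal with the current communalities
  e <- eigen(R, symmetric = TRUE)
  L <- e$vectors[, 1:K] %*% diag(sqrt(e$values[1:K]), K)   # loadings: sqrt(lambda_k) * v_jk
  h2_new <- rowSums(L^2)             # updated communalities (sum of squared loadings)
  if (abs(sum(h2_new) - sum(h2)) < 1e-4) break
  h2 <- h2_new
}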
Once the common and unique variance estimates have stabilized using the above procedure, the factor loadings can be transformed using orthogonal (Varimax) or oblique (Promax) rotations. Typically, when a Promax rotation is applied, the loadings are prerotated using Varimax. This establishes an orthogonally rotated basis that can then be updated according to factor correlations. During the Varimax calculations, the factor loadings are normalized by dividing by the variable communality. An orthogonal rotation matrix T is then determined and updated iteratively, using Singular Value Decomposition (SVD) on the transformed loadings \Lambda.
Varimax Rotation (Orthogonal)
1) Normalize the loadings by the communalities:

\Lambda' = \begin{bmatrix} \sqrt{\lambda_1}\, v_{11}/h_1 & \cdots & \sqrt{\lambda_K}\, v_{1K}/h_1 \\ \vdots & & \vdots \\ \sqrt{\lambda_1}\, v_{p1}/h_p & \cdots & \sqrt{\lambda_K}\, v_{pK}/h_p \end{bmatrix}

2) Initialize the transformation matrix T and the convergence distance d:

T_0 = I_{K \times K} \text{ (no rotation)}, \qquad d_0 = 0

3) Transform the loadings:

\hat{\Lambda} = \Lambda' T

4) Fit the axes:

B = \Lambda'^{T} \left( \hat{\Lambda}^{3} - \frac{1}{p}\, \hat{\Lambda}\, \mathrm{diag}\!\left( \sum_{j=1}^{p} \hat{\lambda}_{j1}^{2}, \ldots, \sum_{j=1}^{p} \hat{\lambda}_{jK}^{2} \right) \right)

where the cube \hat{\Lambda}^{3} is taken element-wise.

5) Update the rotation matrix T through Singular Value Decomposition (SVD):

B = U D V^{T}, where U and V are orthogonal matrices and D is diagonal; \quad T_t = U V^{T}

6) Iterate 3-5 until convergence:

d_t = \sum_{k=1}^{K} D_{kk}; converged if d_t < d_{t-1}(1 + c) for some threshold value c.

7) Finalize:

\hat{\Lambda} = \mathrm{diag}(h_1, \ldots, h_p)\, \Lambda' T
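For reference, base R already implements this loop; a minimal sketch, assuming L is a p x K matrix of unrotated loadings.
vm <- varimax(L, normalize = TRUE)   # Kaiser-normalized varimax, as described above
L_rot <- vm$loadings                 # rotated loadings
T_rot <- vm$rotmat                   # orthogonal rotation matrix T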
Promax rotation fits the loadings without stipulating orthogonality among the axes. The coefficients describe the best-fit line of the factors in their respective directions. Since there is no restriction of orthogonality on the axes during this step, changes in axis direction may result in correlated factors. A factor correlation matrix \Phi close to the identity matrix implies orthogonal, uncorrelated factors.
Promax Rotation (Oblique)
1) Fit the axes by raising the loadings to the power m (element-wise), retaining the signs of \Lambda:

Q = \Lambda \circ |\Lambda|^{m-1}

2) Fit \Lambda and Q:

U = the K x K matrix of coefficients fitting \Lambda to Q

3) Weight the coefficients:

d = \mathrm{diag}\!\left[ (U^{T} U)^{-1} \right], \qquad U \leftarrow U \,\mathrm{diag}(\sqrt{d_1}, \ldots, \sqrt{d_K})

4) Rotated loadings and factor correlations:

\hat{\Lambda} = \Lambda U, \qquad \Phi = U^{-1} \left( U^{-1} \right)^{T}
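Again, base R offers this rotation directly; a minimal sketch, assuming L is a p x K loading matrix and m = 3 to match the SAS comparison later (stats::promax performs the Varimax prerotation internally).
pm <- promax(L, m = 3)             # oblique promax rotation with power m = 3
L_promax <- pm$loadings            # rotated (oblique) loadings
ui <- solve(pm$rotmat)             # inverse of the rotation matrix U
Phi <- ui %*% t(ui)                # factor correlation matrix Phi = U^-1 (U^-1)^T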
As previously stated, FA separates and estimates common and unique variability. Given the data and means, the loadings \Lambda are estimated from the communality optimization, eigenvector decomposition, and rotation procedures. However, the scores S and error term \varepsilon must still be determined for data approximation:

x_i = \mu + \Lambda S_i^{T} + \varepsilon_i

Several methods exist for estimating FA scores, including the regression and Bartlett methods. In the regression method, scores are estimated such that

S_{(N \times K)} = X'_{(N \times p)}\, R^{-1}_{(p \times p)}\, \hat{\Lambda}_{(p \times K)}\, \Phi_{(K \times K)}

The regression method projects the data using the loadings while accounting for the correlations between variables (R) and factors (\Phi).
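A minimal sketch of the regression estimator, assuming Xs is the standardized N x p data matrix, L the rotated p x K loadings, and Phi the K x K factor correlation matrix; MASS::ginv stands in for the inverse when R is singular (N < p).
library(MASS)                                    # for ginv()
R <- cor(Xs)                                     # p x p correlation matrix
S <- as.matrix(Xs) %*% ginv(R) %*% L %*% Phi     # N x K regression factor scores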

Input Data
See Principal Components Input Data.
Functions
package    function
HDMD       factor.pa.ginv(X)
psych      VSS.scree
While many factor analysis methods have been implemented in R, the factanal and factor.pa functions do not allow for singular covariance matrices. Several parameters for factor rotations are also hidden in these methods. Although it is standard to prerotate the loadings with an orthogonal Varimax rotation prior to an oblique Promax rotation, these steps are separated in factor.pa.ginv, which allows for greater flexibility. In addition, the power for the Promax rotation was fixed at m=4 in factor.pa but can be specified in the factor.pa.ginv function call (default m=4). For comparison with the SAS implementation employed by Atchley et al. (2005), m=3 is used in the following example.
Call
VSS.scree(AA54, main="Subset of 54 AA attributes scree Plot")
Factor54 = factor.pa.ginv(AA54, nfactors = 5, m=3, prerotate=TRUE, rotate="Promax", scores="regression")
row.names(Factor54$scores) = names(AA54)
Factor54$loadings[order(Factor54$loadings[,1]),]
Factor54$scores
Resulting Output
> VSS.scree(AA54, main="Subset of 54 AA attributes Scree Plot")
Figure 4: Scree Plot for Factor Analysis (VSS.scree)
> Factor54 = factor.pa.ginv(AA54, nfactors = 5, m=3, prerotate=TRUE, rotate="Promax", scores="regression")
Could not solve for inverse correlation. Using general inverse ginv(r)
Warning message:
In smc(r) : Correlation matrix not invertible, smc's returned as 1s
> Factor54$loadings[order(Factor54$loadings[,1]),]
= the rotated loading matrix (54 x 5), sorted by the Factor 1 loadings (output not shown)
Following the factor rotations, certain variable-factor relationships are accentuated while other variable-factor associations are minimized. Factor 1 has an abundance of high loadings on variables whose descriptions relate to polarity, accessibility, and hydrophobicity (PAH). Similarly, the variables emphasized in Factor 2 relate to the propensity for secondary structure (PSS), Factor 3 to molecular size (MS), Factor 4 to codon composition (CC), and Factor 5 to electrostatic charge (EC). Thus each factor can be represented by distinct amino acid attributes. The scores then convey the relationships among amino acids for that factor. As expected, isoleucine and leucine have similar scores for each factor while glycine and arginine have dissimilar values.
> Factor54$scores
= S (20 x 5), the amino acid factor scores
Table 2: Amino Acid Factor Scores from R
PCA and FA Comparison
As seen above, PCA and FA utilize similar computational methods for determining major axes of variation and reducing data dimensionality. The correlation between these methods relies heavily on the communality of the variables. For communalities close to 1, the majority of a variable's variation can be explained by the factor structure, and the diagonal of the correlation matrix in FA will be the same as in PCA. In this case, FA essentially reduces to PCA as the unique variability approaches zero.
Table 3: Correlation of PCA and FA Score Estimates
Figure 5: PCA - FA Scores for Factor 1 and 2 (and 3)
Figure 6: Sorted Communality Estimates for FA
From the examples provided above, it seems PCA and FA produce similar scores for the
first factor/component and only marginally so for the remaining factors/components. The
similarity in scores is likely due to communality estimates close to 1.0. The correlation
matrix in FA will be similar to that in PCA resulting in analogous eigenvector
decomposition. This however is not generally true, and PCA should not routinely be used
instead of FA to describe latent structure.
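The score correlations in Table 3 can be reproduced along these lines, assuming AA54_PCA and Factor54 are the fitted objects from the two Call sections above.
# cross-correlate the first five PC scores with the five factor scores
round(cor(AA54_PCA$scores[, 1:5], Factor54$scores), digits = 2)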
Differences in FA for R and SAS
While R and SAS use the same methods to calculate factor loadings, their implementation to
estimate scores is slightly different. For the case when N<p the covariance matrix is
singular and a generalized inverse must be estimated. Atchley et al. (2005) used SAS for their calculations in determining the latent structure of amino acids and thus obtained slightly different factor scores, as seen in Table 4. The correlation between the five factors for the R and SAS implementations is shown in Table 5, where rows correspond to R values and columns to SAS values. Clearly, the scores are similar with the exception of Factors 3 and 5.
This is possibly due to the treatment of factors with variables highly correlated to each
other and multiple factors.
Table 4: Factor Scores computed with SAS (Atchley et al 2005)
Table 5: Correlation between R (row) and SAS (col) factor scores K=5
Using 6 Factors
When the SAS and R Factor Analysis scores are compared with both methods estimating 6 factors, the correlation between the R and SAS methods is much more evident (see Table 6). The particular factors may have new interpretations, although the correlations of the R factor scores for K=5 and K=6, as well as of the SAS scores for K=5 and K=6, show that the factors are highly similar. The sixth factor may further explain the association between the inferred molecular size (ms) of Factor 3 and the polarity, accessibility, and hydrophobicity (pah) scores in Factor 1 when K=5.
Table 6: Correlation between R (row) and SAS (col) factor scores, K=6
Table 7: SAS Factor Scores (K=6)
Table 8: R Factor Scores (K=6)
Table 9: SAS Score Correlation for FA6 & FA5
Table 10: R Score Correlation for FA6 & FA5
Metric Transformation
The alphabetic nature of amino acid representation is a discrete coding that allows direct differentiation between residues. However, amino acid codes are alphabetic and have no general underlying metric, making statistical analyses very difficult. Transforming the alphabetic letters into a more realistic, biologically informed set of numerical values greatly facilitates computation. Further incorporating the correlation among amino acids allows for sophisticated statistical analyses. By evaluating 54 amino acid indices, Atchley et al. discovered that 5 factors explain 83% of the analyzed variation. Converting a single vector of amino acid residues into 5 numeric vectors representing Polarity, Accessibility, and Hydrophobicity (pah), Propensity for Secondary Structure (pss), Molecular Size (ms), Codon Composition (cc), and Electrostatic Charge (ec) establishes a platform capable of handling rigorous statistical techniques such as analysis of variance, regression, discriminant analysis, etc.
Methodology
Using Factor Analysis, Atchley et al. identified 5 factors quantifying amino acid variability. For a single amino acid sequence, each residue is associated with 5 uncorrelated and informative numeric values: pah, pss, ms, cc, and ec. Each metric transformation can then be analyzed independently, as seen in the Discriminant Function Analysis section below.
Figure 7: Metric Solution Conversion (each amino acid in the sequence is mapped to its five metric values: pah, pss, ms, cc, ec)
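Conceptually, the conversion is a table lookup. The sketch below is illustrative only and assumes the rows of Factor54$scores (the amino acid factor scores from the FA section) are named by one-letter amino acid codes; in practice the packaged FactorTransform and AAMetric shown below should be used.
seq1 <- unlist(strsplit("MKRNL", ""))     # hypothetical 5-residue sequence
metric <- Factor54$scores[seq1, ]         # one row of (pah, pss, ms, cc, ec) per residue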
Input Data
bHLH288 contains 288 named sequences grouped into 5 categories representing the DNA
binding affinities. The 5 groups are designated by their E-box specificity and presence of
additional domains where Group A binds to CAGCTG E-box motif, Group B binds to CACGTG
E-box motif and is most prevalent, Group C has an additional PAS domain, Group D lacks a
basic region, and Group E binds to CACG[C/A]G N-box motif. Each bHLH sequence has 51
sites with no gaps. A representative subset is displayed here to show sequence variability
among the groups.
Table 11: representative sample of bHLH Amino Acid Sequences with group designation
Functions
package    function
HDMD       FactorTransform
Call
AA54_MetricList_Factor1 = FactorTransform(as.vector(bHLH288[,2]), SeqName=names(bHLH288), Replace=AAMetric)
AA54_MetricFactor1 = matrix(unlist(AA54_MetricList_Factor1), nrow = length(AA54_MetricList_Factor1), byrow = TRUE, dimnames = list(names(AA54_MetricList_Factor1)))
AA54_MetricFactor1
Resulting Output
While the entire transformation from amino acid characters to the PAH metric is stored in AA54_MetricFactor1, a representative subset is shown here using the following commands:
Subset = c(20:25, 137:147, 190:196, 220:229, 264:273)
AA54_MetricSubset = AA54_MetricList_Factor1[Subset]
AA54_MetricSubset_Factor1 = matrix(unlist(AA54_MetricSubset), nrow = length(AA54_MetricSubset), byrow = TRUE, dimnames = list(names(AA54_MetricSubset)))
> AA54_MetricSubset_Factor1
Discriminant Function Analysis (DFA)
Discriminant Function Analysis (DFA) can be used for both exploratory and confirmatory classification of high dimensional correlated data. Similar to PCA and FA, DFA uses a linear combination of variables to summarize patterns of variation in the data. In DFA, coefficients are estimated so as to minimize within-class variation and maximize between-class variation. Figure 8 shows how two variables cannot discriminate Group A from Group B independently, but can easily separate the groups using a linear function weighting the variables. The coefficients quantify the relative importance of each variable. One method for determining the closeness of groups is to measure the Mahalanobis distance, which accounts for the correlation among variables. When variables are uncorrelated, the Mahalanobis distance is simply the Euclidean distance.
Figure 8: LDA for 2 classes
Methodology
Since the goal in DFA is to determine a set of linear functions that discriminate groups, most of the DFA procedure is group-centric. This means that variables, and in this case sites, are normalized and transformed according to the group they belong to. First, to standardize the data, each element is centered by its group mean:

X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{N1} & \cdots & x_{Np} \end{bmatrix} \;\text{(rows grouped into } g_1, \ldots, g_m), \qquad X' = \begin{bmatrix} x_{11} - \bar{g}_{11} & \cdots & x_{1p} - \bar{g}_{1p} \\ \vdots & & \vdots \\ x_{N1} - \bar{g}_{m1} & \cdots & x_{Np} - \bar{g}_{mp} \end{bmatrix}

where \bar{g}_{cj} is the mean of variable j within group c, so each observation has its own group's mean subtracted, variable by variable. A scaling factor is defined by
1
  1
F   0

 0


0 
N
2
2

0
where  j 
x' ij  j

1 
i1
 p 
0

0

is the variance of each variable for the centered data. Using the standard estimators of
mean and variance, the data is normalized so

x11  g
1

1

1
Z

N  m x N1  g
m
1


x1p  g1

 p 


x Np  g m

 p 

Singular Value Decomposition (SVD) decomposes the data matrix Z into two orthonormal

matrices and a unique diagonalized scaling matrix so Z UDV T . The rank r is determined
by the number of elements in D larger than some tolerance value (t= 0.0001). Scaling
coefficients are updated by the SVD decomposition so

F' diag 1
  1
1
  1
  0

 0

0
0
 

 V  diag 1

 p 
 d1
1

0 v11

0 

1 v p1
 p 
1

v1r 1 d
 1
 0

v pr 
0




dr 
0
0
0 

0 
1 
d r 

This scaling minimizes the variance among variables within groups. Group mean values are similarly decomposed by SVD to determine the linear discriminants that maximize the variance between group means. First, the group matrix is initialized and scaled so that

G = \begin{bmatrix} \sqrt{\frac{N \pi_1}{m-1}} \,(\bar{g}_1 - \bar{x}) \\ \vdots \\ \sqrt{\frac{N \pi_m}{m-1}} \,(\bar{g}_m - \bar{x}) \end{bmatrix} F', \qquad \bar{x}_j = \sum_{c=1}^{m} \pi_c\, \bar{g}_{cj}

where m_c is the number of observations in group c and \pi_c is the prior probability of group c. Unless otherwise specified, \pi_c = m_c / N by default. A second round of SVD is performed,
this time on the group matrix G, so that G = U_G D_G V_G^{T}. The final matrix transformation maximizes the between-group distances through additional scaling, resulting in the coefficient matrix \hat{F} = F' V_G. Thus the scaling factor is normalized so that the within-group covariance is spherical. The observations are transformed by the scaling coefficient matrix \hat{F}, so the scores S = X \hat{F} maximize the variance between groups while minimizing the variance within groups.

To quantify group similarity, the Mahalanobis function was used to measure the distance between group means while accounting for the correlation of variables. In general,

D_m = \sqrt{ (x - \bar{x})^{T} \,\Sigma^{-1}\, (x - \bar{x}) }

where \Sigma is the covariance matrix; for data X that have been centered and scaled (mean \mu), \Sigma is the correlation matrix. To calculate the Mahalanobis distance between groups b and c, the means of each group are compared over all K variables while accounting for variable correlations:

D^2(g_b, g_c) = \begin{bmatrix} \mu_{1 g_b} - \mu_{1 g_c} \\ \vdots \\ \mu_{K g_b} - \mu_{K g_c} \end{bmatrix}^{T} \begin{bmatrix} 1 & r_{12} & \cdots & r_{1K} \\ r_{21} & 1 & & \vdots \\ \vdots & & \ddots & \\ r_{K1} & \cdots & & 1 \end{bmatrix}^{-1} \begin{bmatrix} \mu_{1 g_b} - \mu_{1 g_c} \\ \vdots \\ \mu_{K g_b} - \mu_{K g_c} \end{bmatrix}

If R is the identity matrix, with off-diagonal elements equal to zero, there is no correlation between variables and the Mahalanobis distance reduces to the Euclidean distance.
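A minimal sketch of this pairwise computation for two of the bHLH groups, assuming scores is an N x K matrix of discriminant scores and grouping a vector of the group labels A-E (the packaged pairwise.mahalanobis shown below computes all group pairs at once).
Sig <- cov(scores)                                       # pooled covariance of the scores
mu_b <- colMeans(scores[grouping == "A", , drop = FALSE])
mu_c <- colMeans(scores[grouping == "B", , drop = FALSE])
d <- mu_b - mu_c
D2 <- as.numeric(t(d) %*% solve(Sig) %*% d)              # squared distance; take sqrt(D2) for the distance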

Input Data
In this example, the bHLH288 data of 288 sequences over 51 sites is transformed using the
PAH Amino Acid Factor Transformation Matrix described in the previous section.
Sequence names have been dropped and instead are numbered for simplicity. Graphs for
the PSS, MS, CC, and EC metrics were calculated similarly.
Table 12: Metric Transformation of bHLH Sequences using AAMetric (subset shown)
... ... ...
Functions
package    function
MASS       lda
HDMD       pairwise.mahalanobis(X, grouping = NULL)
The mahalanobis function in the stats package determines the Mahalanobis distance between a vector and the mean of the data. In many instances multiple distance measurements are desired, such as all pairwise distances among a set of groups. pairwise.mahalanobis takes a data matrix X and determines all pairwise distances between groups. If a separate grouping vector is not specified, the function assumes the first column groups the observations.
Call
Based on Factor Scores determined by R calculations in FA and Metric Transformation above
AA54_MetricList_Factor1 = FactorTransform(as.vector(bHLH288[,2]), Replace = AAMetric)
grouping = bHLH288[,1]
AA54_MetricFactor1 = matrix(unlist(AA54_MetricList_Factor1), nrow = length(AA54_MetricList_Factor1), byrow = TRUE, dimnames = list(names(AA54_MetricList_Factor1)))
AA54_lda_Metric1 = lda(AA54_MetricFactor1, grouping)
AA54_lda_RawMetric1 = as.matrix(AA54_MetricFactor1) %*% AA54_lda_Metric1$scaling
AA54_lda_RawMetric1Centered = scale(AA54_lda_RawMetric1, center = TRUE, scale = FALSE)
plot(-1*AA54_lda_RawMetric1Centered[,1], -1*AA54_lda_RawMetric1Centered[,2], pch = grouping, xlab="Canonical Variate 1", ylab="Canonical Variate 2", main="DA Scores (Centered Raw Coefficients)\nusing Factor1 (pah) from R transformation")
lines(c(0,0), c(-15,15), lty="dashed")
lines(c(-35,25), c(0,0), lty="dashed")
Mahala_1 = pairwise.mahalanobis(AA54_lda_RawMetric1Centered, grouping)
D = sqrt(Mahala_1$distance)
rownames(D) = colnames(D) = c("A", "B", "C", "D", "E")
round(D, digits=3)
Resulting Output
> AA54_lda_Metric1
= the lda fit; its scaling component is the coefficient matrix F (p x K) (output not shown)
> Mahala_1
= D^2, the matrix of squared pairwise Mahalanobis distances between the five groups (output not shown)
> round(D, digits=3)
= the pairwise Mahalanobis distances between groups A-E (output not shown)
> plot(-1*AA54_MetricRlda1_Centerprojection[,1], -1*AA54_MetricRlda1_Centerprojection[,2], pch = letter_grouping, xlab="Canonical Variate 1", ylab="Canonical Variate 2", main="DA Scores (Centered Raw Coefficients)\nusing Factor1 (pah) from R transformation", xlim = c(-30,20), ylim=c(-11,10))
> lines(c(0,0), c(-15,15), lty="dashed")
> lines(c(-35,25), c(0,0), lty="dashed")
Figure 9: LDA for Factor1 using R Metric Transformation
Figure 10: LDA for Factor2 (pss) using R Metric Transformation
Figure 11: LDA for Factor3 (ms) using R Metric Transformation
Figure 12: LDA for Factor4 (cc) using R Metric Transformation
Figure 13: LDA for Factor5 (ec) using R Metric Transformation
References
Atchley, W. R. and A. D. Fernandes (2005). "Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network." Proc Natl Acad Sci USA 102(18): 6401-6.
Atchley, W. R., J. Zhao, et al. (2005). "Solving the protein sequence metric problem." Proc Natl Acad Sci USA 102(18): 6395-400.
Shlens, Jonathon. A Tutorial on Principal Component Analysis. Center for Neural Science, New York University. Dated April 22, 2009; Version 3.01. <http://www.snl.salk.edu/~shlens/notes.html>
Nakai, K., Kidera, A., and Kanehisa, M. Cluster analysis of amino acid indices for prediction of protein structure and function. Protein Eng. 2, 93-100 (1988). [PMID:3244698]
Tomii, K. and Kanehisa, M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27-36 (1996). [PMID:9053899]
Kawashima, S., Ogata, H., and Kanehisa, M. AAindex: amino acid index database. Nucleic Acids Res. 27, 368-369 (1999). [PMID:9847231]
Kawashima, S. and Kanehisa, M. AAindex: amino acid index database. Nucleic Acids Res. 28, 374 (2000). [PMID:10592278]
Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202-D205 (2008). [PMID:17998252]