Principal Components Analysis (PCA)

A multivariate technique whose central aim is to reduce the dimensionality of a multivariate data set while accounting for as much as possible of the variation present in the original data.

The basic goal of PCA is to describe the variation in a set of correlated variables, X^T = (X1, ..., Xq), in terms of a new set of uncorrelated variables, Y^T = (Y1, ..., Yq), each of which is a linear combination of the X variables. The principal components Y1, ..., Yq are ordered so that each accounts for a decreasing share of the variation in the original data.

Principal components analysis is most commonly used for constructing an informative graphical representation of the data. Principal components can be useful when:
• There are too many explanatory variables relative to the number of observations.
• The explanatory variables are highly correlated.

The first principal component is the linear combination of the variables X1, X2, ..., Xq

  Y1 = a11 X1 + a12 X2 + ... + a1q Xq

that accounts for as much as possible of the variation in the original data among all linear combinations satisfying the constraint

  a11² + a12² + ... + a1q² = 1

The second principal component accounts for as much as possible of the remaining variation,

  Y2 = a21 X1 + a22 X2 + ... + a2q Xq

with the constraint a21² + a22² + ... + a2q² = 1, and with Y1 and Y2 uncorrelated.

The third principal component,

  Y3 = a31 X1 + a32 X2 + ... + a3q Xq, with a31² + a32² + ... + a3q² = 1,

is uncorrelated with Y1 and Y2. If there are q variables, there are q principal components.

Data: height and first-leaf length of Dactylorhiza orchids.

  Height  First leaf
     108          12
     111          11
     147          23
     218          21
     240          37
     223          30
     242          28
     480          77
     290          40
     263          55

Each observation is considered a coordinate in N-dimensional data space, where N is the number of variables and each axis of data space is one variable.
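The constrained maximisation described above can be sketched numerically. The following is a minimal illustration (in Python/NumPy, rather than the R used in the worked example later in these notes) that runs PCA on the orchid data via an eigendecomposition of the covariance matrix, and checks the properties just stated: unit-length loading vectors, components ordered by decreasing variance, and uncorrelated component scores.

```python
import numpy as np

# Orchid data from the slide: height and first-leaf length of 10 plants
height = np.array([108, 111, 147, 218, 240, 223, 242, 480, 290, 263], dtype=float)
leaf   = np.array([ 12,  11,  23,  21,  37,  30,  28,  77,  40,  55], dtype=float)
X = np.column_stack([height, leaf])

# Centre the data (PCA rotates axes about the mean of the dataset)
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix yields the loadings a_ij,
# each column subject to the constraint a_1^2 + ... + a_q^2 = 1
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                        # Y1, Y2 for each observation

print(np.round(eigvals, 1))                  # variances decrease
print(float(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]))  # ~0: uncorrelated
```

With q = 2 variables there are q = 2 components; the same recipe applies unchanged to the q = 6 mandible measurements used later.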
[Figure: scatterplot of the orchid data, with new axes whose origin lies at (mean height, mean leaf length).]

Step 1: A new set of axes is created, whose origin (0,0) is located at the mean of the dataset.
Step 2: The new axes are rotated around their origin until the first axis gives a least-squares best fit to the data (residuals are fitted orthogonally).

Principal Components Analysis (PCA)

PCA gives three useful sets of information about the dataset:
• the projection of the data onto the new coordinate axes (i.e. a new set of variables encapsulating the overall information content);
• the rotations needed to generate each new axis (i.e. the relative importance of each old variable to each new axis);
• the actual information content of each new axis.

Mechanics of PCA
• Normalising the data

Most multivariate datasets consist of widely differing variables (e.g. plant percentage cover ranges from 0% to 100%, animal population counts may exceed 10000, chemical concentrations may take any positive value). How can such disparate types of data be compared?

Approach: calculate the mean (µ) and standard deviation (s) of each variable separately, then convert each observation Xi into a corresponding Z score:

  Zi = (Xi − µ) / s

The Z score is dimensionless; each column of the data has been converted into a new variable which preserves the shape of the original data but has µ = 0 and s = 1. The process of converting to Z scores is known as normalisation.
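The Z-score conversion is a one-liner in practice. A small sketch (Python/NumPy for illustration), applied to the X column of the normalisation table shown on the next slide:

```python
import numpy as np

# X column from the slide's "before normalisation" table
x = np.array([1.716, 1.760, 1.933, 2.366, 2.582, 3.015, 3.232,
              1.616, 1.991, 2.741, 3.116])

# Z score: subtract the column mean, divide by the column standard deviation
z = (x - x.mean()) / x.std(ddof=1)    # sample sd, matching the slide's s = 0.6

print(round(x.mean(), 2), round(x.std(ddof=1), 2))  # µ = 2.37, s = 0.6
print(np.round(z, 2))   # matches the "after" column: -1.09, -1.02, -0.73, ...
```

The same operation applied to each column gives a dataset in which every variable has mean 0 and standard deviation 1, so no variable dominates merely because of its units.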
Mechanics of PCA
• Normalising the data

Before normalisation (x, y and z are the three variables; µ = mean, s = standard deviation):

      X       Y       Z
   1.716  -0.567   0.991
   1.760  -0.480   1.016
   1.933  -0.134   1.116
   2.366   0.732   1.366
   2.582   1.165   1.491
   3.015   2.031   1.741
   3.232   2.464   1.866
   1.616   1.232   0.933
   1.991   0.982   1.150
   2.741   0.482   1.582
   3.116   0.232   1.799
µ: 2.37    0.74    1.368
s: 0.60    0.97    0.346

After normalisation:

      X       Y       Z
   -1.09   -1.35   -1.09
   -1.02   -1.26   -1.02
   -0.73   -0.90   -0.73
   -0.01   -0.01   -0.01
    0.35    0.44    0.35
    1.08    1.33    1.08
    1.44    1.78    1.44
   -1.26    0.51   -1.26
   -0.63    0.25   -0.63
    0.62   -0.27    0.62
    1.24   -0.52    1.24
µ:  0       0       0
s:  1       1       1

Mechanics of PCA
• The extraction of principal components

The cloud of N-dimensional data points needs to be rotated to generate a set of N principal axes. The ordination is achieved by finding a set of numbers (loadings) which rotates the data to give the best fit.

How are the best possible values for the loadings found? Answer: by finding the eigenvectors and eigenvalues of the Pearson correlation matrix (the matrix of all possible Pearson correlation coefficients between the variables under examination):

     X      Y      Z
X  1.000  0.593  0.999
Y  0.593  1.000  0.594
Z  0.999  0.594  1.000

The covariance matrix can be used instead of the correlation matrix when all the original variables are on the same scale or when the data have been normalised.

Mechanics of PCA
• Eigenvalues and eigenvectors

When a square (N × N) matrix is multiplied by an (N × 1) vector, the result is a new (N × 1) vector. This operation can be repeated on the new vector, generating another one. After a number of repeats (iterations) the pattern of numbers generated settles down to a constant shape, although the actual values change by a constant factor at each step. This rate of growth (or shrinkage) per multiplication is known as the dominant eigenvalue, and the pattern the numbers form is the dominant (or principal) eigenvector.
  M V = λ V

  M — (N × N) matrix; V — (N × 1) eigenvector; λ — eigenvalue

Mechanics of PCA
• Eigenvalues and eigenvectors

First iteration:

  | 1.000 0.593 0.999 |   | 1 |   | 2.592 |
  | 0.593 1.000 0.594 | x | 1 | = | 2.187 |
  | 0.999 0.594 1.000 |   | 1 |   | 2.593 |

Second iteration:

  | 1.000 0.593 0.999 |   | 2.592 |   | 6.48 |
  | 0.593 1.000 0.594 | x | 2.187 | = | 5.26 |
  | 0.999 0.594 1.000 |   | 2.593 |   | 6.48 |

Resulting vector after iteration 5: (98.6, 79.3, 98.6); after iteration 10: (9181, 7384, 9181); after iteration 20: (7.96e7, 6.40e7, 7.96e7).

First eigenvector: (0.967, 0.777, 0.967)
Second eigenvector: (−0.253, 0.629, −0.253)
Dominant eigenvalue: 2.48 — once equilibrium is reached, each generation of numbers increases by a factor of 2.48.

Mechanics of PCA

PCA takes a set of R observations on N variables as a set of R points in an N-dimensional space. A new set of N principal axes is derived, each one defined by rotating the dataset by a certain angle with respect to the old axes. The first axis in the new space (the first principal axis of the data) encapsulates the maximum possible information content, the second axis contains the second greatest information content, and so on.

Eigenvectors — relative patterns of numbers which are preserved under matrix multiplication.
Eigenvalues — give a precise indication of the relative importance of each ordination axis: the largest eigenvalue is associated with the first principal axis, the second largest with the second principal axis, etc.

Mechanics of PCA

For example, a matrix with 20 species would generate 20 eigenvectors, but only the first three or four would be of any importance for interpreting the data. The relationship between eigenvalues and variance in PCA:

  Vm = 100 λm / N

  Vm — percent variance explained by the mth ordination axis; λm — the mth eigenvalue; N — number of variables

There is no formal test of significance available to decide whether any given ordination axis is meaningful, nor is there any test to decide whether individual variables contribute significantly to an ordination axis.
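The repeated-multiplication procedure above is the power-iteration method. A short sketch (Python/NumPy for illustration) applied to the slide's 3 × 3 correlation matrix, starting from (1, 1, 1) as on the slide:

```python
import numpy as np

# The slide's correlation matrix for x, y, z
M = np.array([[1.000, 0.593, 0.999],
              [0.593, 1.000, 0.594],
              [0.999, 0.594, 1.000]])

v = np.ones(3)                      # start from (1, 1, 1), as on the slide
for _ in range(20):                 # repeated multiplication (iterations)
    w = M @ v
    growth = w.max() / v.max()      # growth factor per multiplication
    v = w

# The growth factor settles at the dominant eigenvalue (~2.48), and the
# vector's shape settles into the dominant eigenvector — proportional to
# the slide's (0.967, 0.777, 0.967)
print(round(growth, 2))
print(np.round(v / np.linalg.norm(v), 3))
```

The first multiplication reproduces the slide's (2.592, 2.187, 2.593); only the *direction* of the final vector is meaningful, since any rescaling of an eigenvector is still an eigenvector.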
Mechanics of PCA
• Axis scores

The Nth axis of the ordination diagram is derived by multiplying the matrix of normalised data by the Nth eigenvector.

Multiplying the normalised data matrix (11 × 3) by the first eigenvector (0.967, 0.777, 0.967) gives the first axis scores:

  (-3.16, -2.95, -2.11, -0.02, 1.02, 3.12, 4.17, -2.04, -1.02, 0.99, 1.99)

Multiplying the same data matrix by the second eigenvector (−0.253, 0.629, −0.253) gives the second axis scores:

  (-0.30, -0.28, -0.20, 0.00, 0.10, 0.29, 0.39, 0.96, 0.48, -0.48, -0.95)

PCA Example

Excavations of prehistoric sites in northeast Thailand have produced a series of canid (dog) bones covering the period from about 3500 BC to the present. In order to clarify the ancestry of the prehistoric dogs, mandible measurements were made on the available specimens. These were then compared with similar measurements on the golden jackal, the Chinese wolf, the Indian wolf, the dingo, the cuon, and the modern dog from Thailand. How are these groups related, and how is the prehistoric group related to the others?

R data "Phistdog". Variables:
Mbreadth — breadth of mandible
Mheight — height of mandible below 1st molar
mlength — length of 1st molar
mbreadth — breadth of 1st molar
mdist — length from 1st to 3rd molars inclusive
pmdist — length from 1st to 4th premolars inclusive

PCA Example

# read the "Phistdog" data and use the first column as the row names
> Phistdog=read.csv("E:/Multivariate_analysis/Data/Prehist_dog.csv",header=T,row.names=1)

Calculate the variance of each variable in the Phistdog data set. The round command is used to limit the output to 2 decimals for reasons of space.
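The axis-score multiplication can be verified directly. A sketch (Python/NumPy for illustration) using the normalised X, Y, Z columns and the two eigenvectors from the slides:

```python
import numpy as np

# Normalised data matrix (rows = observations; columns = X, Y, Z)
Z = np.array([
    [-1.09, -1.35, -1.09],
    [-1.02, -1.26, -1.02],
    [-0.73, -0.90, -0.73],
    [-0.01, -0.01, -0.01],
    [ 0.35,  0.44,  0.35],
    [ 1.08,  1.33,  1.08],
    [ 1.44,  1.78,  1.44],
    [-1.26,  0.51, -1.26],
    [-0.63,  0.25, -0.63],
    [ 0.62, -0.27,  0.62],
    [ 1.24, -0.52,  1.24],
])
e1 = np.array([ 0.967, 0.777,  0.967])   # first eigenvector (slide)
e2 = np.array([-0.253, 0.629, -0.253])   # second eigenvector (slide)

axis1 = Z @ e1    # first axis scores:  -3.16, -2.95, -2.11, ...
axis2 = Z @ e2    # second axis scores: -0.30, -0.28, -0.20, ...
print(np.round(axis1, 2))
print(np.round(axis2, 2))
```

The output reproduces the slide's axis scores up to rounding (the slide computed them from unrounded data, so an entry near zero may differ in the last decimal).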
> round(sapply(Phistdog,var),2)
 Mbreath  Mheight  mlength mbreadth    mdist   pmdist
    2.88    10.56     9.61     1.36    24.30    31.52

The measurements are on a similar scale and the variances are not very different, so we can use either the correlation or the covariance matrix.

PCA Example

Calculate the correlation matrix of the data.

> round(cor(Phistdog),2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath     1.00    0.95    0.92     0.98  0.78   0.81
Mheight     0.95    1.00    0.88     0.95  0.71   0.85
mlength     0.92    0.88    1.00     0.97  0.88   0.94
mbreadth    0.98    0.95    0.97     1.00  0.85   0.91
mdist       0.78    0.71    0.88     0.85  1.00   0.89
pmdist      0.81    0.85    0.94     0.91  0.89   1.00

PCA Example

Calculate the covariance matrix of the data.

> round(cov(Phistdog),2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath     2.88    5.25    4.85     1.93  6.52   7.74
Mheight     5.25   10.56    8.90     3.59 11.45  15.58
mlength     4.85    8.90    9.61     3.51 13.39  16.31
mbreadth    1.93    3.59    3.51     1.36  4.86   5.92
mdist       6.52   11.45   13.39     4.86 24.30  24.60
pmdist      7.74   15.58   16.31     5.92 24.60  31.52

PCA Example

Calculate the eigenvectors and eigenvalues of the correlation matrix:

> eigen(cor(Phistdog))
$values
[1] 5.429026124 0.369268401 0.128686279 0.064760299 0.006117398 0.002141499

$vectors
           [,1]        [,2]        [,3]         [,4]         [,5]       [,6]
[1,] -0.4099426  0.40138614 -0.45937507 -0.005510479  0.009871866  0.6779992
[2,] -0.4033020  0.48774128  0.29350469 -0.511169325 -0.376186947 -0.3324158
[3,] -0.4205855 -0.08709575  0.02680772  0.737388619 -0.491604714 -0.1714245
[4,] -0.4253562  0.16567935 -0.12311823  0.170218718  0.739406740 -0.4480710
[5,] -0.3831615 -0.67111237 -0.44840921 -0.404660012 -0.136079802 -0.1394891
[6,] -0.4057854 -0.33995660  0.69705234 -0.047004708  0.226871533  0.4245063

PCA Example

Calculate the eigenvectors and eigenvalues of the covariance matrix:

> eigen(cov(Phistdog))
$values
[1] 72.512852567  4.855621390  2.156165476  0.666083782  0.024355099
[6]  0.005397877

$vectors
           [,1]       [,2]       [,3]        [,4]        [,5]         [,6]
[1,] -0.1764004 -0.2228937 -0.4113227 -0.10162260  0.65521113  0.557123088
[2,] -0.3363603 -0.6336812 -0.3401245  0.47472891 -0.36879498 -0.090818041
[3,] -0.3519843 -0.1506859 -0.1472096 -0.83773573 -0.36033271 -0.009453262
[4,] -0.1301150 -0.1132540 -0.1502766 -0.10976633  0.51257082 -0.820294484
[5,] -0.5446003  0.7091113 -0.3845381  0.20868622 -0.09193887 -0.026446421
[6,] -0.6467862 -0.1019554  0.7231913  0.08309978  0.18348673  0.087716189

PCA Example

Extract the principal components from the correlation matrix:

> Phistdog_Cor=princomp(Phistdog,cor=TRUE)
> summary(Phistdog_Cor,loadings=TRUE)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     2.3300271 0.60767458 0.35872870
Proportion of Variance 0.9048377 0.06154473 0.02144771
Cumulative Proportion  0.9048377 0.96638242 0.98783013

Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath  -0.410  0.401 -0.459
Mheight  -0.403  0.488  0.294
mlength  -0.421
mbreadth -0.425  0.166 -0.123
mdist    -0.383 -0.671 -0.448
pmdist   -0.406 -0.340  0.697

(R leaves loadings smaller than 0.1 in absolute value blank, which is why mlength shows no entry for Comp.2 and Comp.3.)

The first principal component accounts for 90% of the variance; each of the other components accounts for less than 10%.

PCA Example

Extract the principal components from the covariance matrix:

> Phistdog_Cov=princomp(Phistdog)
> summary(Phistdog_Cov,loadings=TRUE)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     7.8837728 2.04008853 1.35946380
Proportion of Variance 0.9039195 0.06052845 0.02687799
Cumulative Proportion  0.9039195 0.96444795 0.99132595

Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath  -0.176  0.223 -0.411
Mheight  -0.336  0.634 -0.340
mlength  -0.352  0.151 -0.147
mbreadth -0.130  0.113 -0.150
mdist    -0.545 -0.709 -0.385
pmdist   -0.647  0.102  0.723

The loadings obtained from the covariance matrix differ from those obtained from the correlation matrix, but the proportions of variance are similar.
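The "Proportion of Variance" row in the princomp summary follows directly from the eigenvalues via the earlier formula Vm = 100 λm / N (for a correlation-matrix PCA the eigenvalues sum to N). A quick check (Python/NumPy for illustration) using the eigenvalues from eigen(cor(Phistdog)):

```python
import numpy as np

# Eigenvalues of the correlation matrix, from eigen(cor(Phistdog))
lam = np.array([5.429026124, 0.369268401, 0.128686279,
                0.064760299, 0.006117398, 0.002141499])
N = len(lam)                      # number of variables

V = 100 * lam / N                 # percent variance per ordination axis
print(np.round(V, 2))             # 90.48, 6.15, 2.14, ...
print(np.round(np.cumsum(V), 2))  # cumulative: 90.48, 96.64, 98.78, ...
```

The first two values reproduce princomp's proportions 0.9048 and 0.0615, confirming that the summary table and the raw eigenvalues carry the same information.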
PCA Example

Plot the variances of the principal components:

> screeplot(Phistdog_Cor,main="Phistdog",cex.names=0.75)

[Figure: scree plot "Phistdog" showing the variances of Comp.1 to Comp.6.]

PCA Example

Equations for the first two principal components from the correlation matrix:

  Y1 = −0.41 Mbreadth − 0.40 Mheight − 0.42 mlength − 0.42 mbreadth − 0.38 mdist − 0.40 pmdist
  Y2 = 0.40 Mbreadth + 0.48 Mheight + 0.16 mbreadth − 0.67 mdist − 0.34 pmdist

Equations for the first two principal components from the covariance matrix:

  Y1 = −0.17 Mbreadth − 0.33 Mheight − 0.35 mlength − 0.13 mbreadth − 0.54 mdist − 0.64 pmdist
  Y2 = 0.22 Mbreadth + 0.63 Mheight + 0.15 mlength + 0.11 mbreadth − 0.70 mdist + 0.10 pmdist

All variables load negatively on the first principal axis; loadings on the second principal axis are mostly positive.

PCA Example

Calculate the axis scores for the principal components from the correlation matrix:

> round(Phistdog_Cor$scores,2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern        1.47   0.04  -0.05  -0.18  -0.08   0.09
G.jackal      3.32  -0.66  -0.25   0.34   0.05  -0.01
C.wolf       -4.33   0.03  -0.23   0.11   0.09   0.03
I.wolf       -2.13  -0.58  -0.09   0.03  -0.14  -0.05
Cuon          0.45   1.16   0.29   0.30  -0.03  -0.02
Dingo         0.08  -0.47   0.73  -0.20   0.06  -0.01
Prehistoric   1.14   0.49  -0.40  -0.40   0.04  -0.05

PCA Example

Calculate the axis scores for the principal components from the covariance matrix:

> round(Phistdog_Cov$scores,2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern        4.77  -0.27  -0.18   0.49   0.01  -0.15
G.jackal     10.23  -2.76   0.26  -1.04   0.08   0.03
C.wolf      -13.89   0.18  -0.83  -0.39   0.22  -0.01
I.wolf       -8.25  -1.67  -0.25  -0.23  -0.29   0.00
Cuon          3.98   4.31   0.17  -0.76  -0.07   0.01
Dingo        -2.00   0.02   2.83   0.82   0.04   0.04
Prehistoric   5.16   0.20  -2.01   1.10   0.01   0.08

PCA Example

Plot the first principal component vs. the second principal component obtained from the correlation matrix:

> plot(Phistdog_Cor$scores[,2]~Phistdog_Cor$scores[,1],xlab="PC1",ylab="PC2",pch=15,xlim=c(-4.5,3.5),ylim=c(-0.75,1.5))
> text(Phistdog_Cor$scores[,1],Phistdog_Cor$scores[,2],labels=row.names(Phistdog),cex=0.7,pos=rep(1,7))
> abline(h=0)
> abline(v=0)

and from the covariance matrix:

> plot(Phistdog_Cov$scores[,2]~Phistdog_Cov$scores[,1],xlab="PC1",ylab="PC2",pch=15,xlim=c(-14.5,11),ylim=c(-3.5,4.5))
> text(Phistdog_Cov$scores[,1],Phistdog_Cov$scores[,2],labels=row.names(Phistdog),cex=0.7,pos=rep(1,7))
> abline(v=0)
> abline(h=0)

[Figure: two PC1 vs. PC2 scatterplots — "PCA diagram based on Correlation" and "PCA diagram based on Covariance" — with the seven groups labelled.]

PCA Example

Even though the scores given by the covariance and correlation matrices differ, the two diagrams convey the same information. The Modern dog has the mandible measurements closest to those of the Prehistoric dog, which suggests the two groups are related. The Cuon and Dingo groups are the next closest to the Prehistoric dog. The I. wolf, C. wolf, and G. jackal are not closely related to the Prehistoric dog or to any other group.