Principal Components Analysis
Principal Components Analysis (PCA)
A multivariate technique whose central aim is to reduce the dimensionality
of a multivariate data set while accounting for as much as possible of the
variation present in the original data.
The basic goal of PCA is to describe the variation in a set of correlated
variables, XT = (X1, …, Xq), in terms of a new set of uncorrelated variables,
YT = (Y1, …, Yq), each of which is a linear combination of the X variables.
Y1, …, Yq are the principal components; they are ordered so that each
accounts for a decreasing amount of the variation in the original data.
Principal Components Analysis (PCA)
Principal components analysis is most commonly used to construct an
informative graphical representation of the data.
Principal components might be useful when:
• There are too many explanatory variables relative to the
number of observations.
• The explanatory variables are highly correlated.
Principal Components Analysis (PCA)
The first principal component is the linear combination of the
variables X1, X2, …, Xq:

Y1 = a11·X1 + a12·X2 + … + a1q·Xq

that accounts for as much as possible of the variation in the original
data among all linear combinations satisfying

a11² + a12² + … + a1q² = 1
Principal Components Analysis (PCA)
The second principal component accounts for as much as
possible of the remaining variation:

Y2 = a21·X1 + a22·X2 + … + a2q·Xq

subject to the constraint:

a21² + a22² + … + a2q² = 1
Y1 and Y2 are uncorrelated.
Principal Components Analysis (PCA)
The third principal component:

Y3 = a31·X1 + a32·X2 + … + a3q·Xq

with the constraint:

a31² + a32² + … + a3q² = 1
Y3 is uncorrelated with Y1 and Y2 .
If there are q variables, there are q principal components.
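The idea that q variables yield q uncorrelated components, each a linear combination of the X's, can be sketched numerically. The deck's worked examples use R; this is an illustrative Python/NumPy sketch with invented toy data and function names:

```python
import numpy as np

def pca_components(data):
    """Eigenvalues and eigenvectors of the correlation matrix of `data`
    (rows = observations, columns = variables), sorted by importance."""
    corr = np.corrcoef(data, rowvar=False)   # q x q Pearson correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh: eigendecomposition for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]

# toy data: q = 3 variables, the first two strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=100)
data = np.column_stack([base + rng.normal(scale=0.3, size=100),
                        base + rng.normal(scale=0.3, size=100),
                        rng.normal(size=100)])

eigvals, eigvecs = pca_components(data)

# projecting the standardised data onto the eigenvectors gives component
# scores: linear combinations of the X's that are mutually uncorrelated
standardised = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
scores = standardised @ eigvecs
```

With 3 input variables there are exactly 3 components, and the correlations between their score columns are zero up to floating-point error.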
Principal Components Analysis (PCA)
Data: height and first-leaf length of Dactylorhiza orchids.

Height  First leaf
   108      12
   111      11
   147      23
   218      21
   240      37
   223      30
   242      28
   480      77
   290      40
   263      55
Each observation is considered a coordinate in N-dimensional data space, where N
is the number of variables and each axis of data space is one variable.
[Figure: scatterplot of the orchid data, with the new axes drawn through the mean height and mean leaf length.]

Step 1: A new set of axes is created, whose origin (0,0) is located at the
mean of the dataset.
Step 2: The new axes are rotated around their origin until the first axis
gives a least-squares best fit to the data (residuals are fitted
orthogonally).
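Steps 1 and 2 can be sketched in code. The deck's later examples use R; here is a hedged Python/NumPy illustration using the ten orchid observations from the table above (the variable names are invented for illustration):

```python
import numpy as np

# the ten (height, first-leaf) observations from the orchid table
height = np.array([108, 111, 147, 218, 240, 223, 242, 480, 290, 263], dtype=float)
leaf   = np.array([ 12,  11,  23,  21,  37,  30,  28,  77,  40,  55], dtype=float)
data = np.column_stack([height, leaf])

# Step 1: place the origin of the new axes at the mean of the dataset
centered = data - data.mean(axis=0)          # mean is (232.2, 33.4)

# Step 2: the rotation giving the least-squares best fit (residuals
# measured orthogonally) points along the dominant eigenvector of the
# covariance matrix of the centred data
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
first_axis = eigvecs[:, np.argmax(eigvals)]  # unit direction of the new first axis
```

After centring, the data cloud has mean (0, 0), and `first_axis` is the direction the rotated first axis takes.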
Principal Components Analysis (PCA)
PCA gives three useful sets of information about the dataset:
• projection onto new coordinate axes (i.e. new set of
variables encapsulating the overall information content).
• the rotations needed to generate each new axis (i.e. the
relative importance of each old variable to each new axis).
• the actual information content of each new axis.
Mechanics of PCA
• Normalising the data
Most multivariate datasets consist of very different variables (e.g.
plant percentage cover ranges from 0% to 100%, animal population
counts may exceed 10,000, chemical concentrations may take any
positive value). How can such disparate types of data be compared?
Approach: calculate the mean (µ) and standard deviation (s) of each
variable (Xi) separately, then convert each observation into a
corresponding Z score:

Zi = (Xi − µ) / s

The Z score is dimensionless: each column of the data has been converted
into a new variable which preserves the shape of the original data but
has µ = 0 and s = 1. The process of converting to Z scores is known as
normalisation.
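The conversion can be checked in a few lines of stdlib Python (a sketch; the X column used here is the one that appears in the "before normalisation" table on the next slide):

```python
import statistics

# X variable from the "before normalisation" table
x = [1.716, 1.760, 1.933, 2.366, 2.582, 3.015, 3.232, 1.616, 1.991, 2.741, 3.116]

mu = statistics.mean(x)           # ~ 2.37
s = statistics.stdev(x)           # ~ 0.60 (sample standard deviation)
z = [(xi - mu) / s for xi in x]   # dimensionless Z scores
```

The resulting `z` list has mean 0 and standard deviation 1, while preserving the shape of the original column.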
Mechanics of PCA
• Normalising the data
Before normalisation (x, y and z are the three variables; µ = mean, s = standard deviation):

       X       Y       Z
   1.716  -0.567   0.991
   1.760  -0.480   1.016
   1.933  -0.134   1.116
   2.366   0.732   1.366
   2.582   1.165   1.491
   3.015   2.031   1.741
   3.232   2.464   1.866
   1.616   1.232   0.933
   1.991   0.982   1.150
   2.741   0.482   1.582
   3.116   0.232   1.799
µ:  2.37    0.74   1.368
s:  0.60    0.97   0.346

After normalisation:

       X       Y       Z
   -1.09   -1.35   -1.09
   -1.02   -1.26   -1.02
   -0.73   -0.90   -0.73
   -0.01   -0.01   -0.01
    0.35    0.44    0.35
    1.08    1.33    1.08
    1.44    1.78    1.44
   -1.26    0.51   -1.26
   -0.63    0.25   -0.63
    0.62   -0.27    0.62
    1.24   -0.52    1.24
µ:     0       0       0
s:     1       1       1
Mechanics of PCA
• The extraction of principal components
The cloud of N-dimensional data points needs to be rotated to generate
a set of N principal axes. The ordination is achieved by finding a set of
numbers (loadings) which rotates the data to give the best fit.
How do we find the best possible values for the loadings?
Answer: by finding the eigenvectors and eigenvalues of the Pearson
correlation matrix (the matrix of all possible Pearson correlation
coefficients between the variables under examination).
       X      Y      Z
X  1.000  0.593  0.999
Y  0.593  1.000  0.594
Z  0.999  0.594  1.000
The covariance matrix can be used instead of the correlation matrix when
all the original variables are on the same scale or when the data have
been normalised.
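As a quick check, the correlation matrix above can be reproduced from the un-normalised x, y, z columns of the earlier slide (a Python/NumPy sketch; Pearson correlation is unaffected by normalisation, so either version of the data gives the same matrix):

```python
import numpy as np

# un-normalised x, y, z columns from the earlier slide
x = [1.716, 1.760, 1.933, 2.366, 2.582, 3.015, 3.232, 1.616, 1.991, 2.741, 3.116]
y = [-0.567, -0.480, -0.134, 0.732, 1.165, 2.031, 2.464, 1.232, 0.982, 0.482, 0.232]
z = [0.991, 1.016, 1.116, 1.366, 1.491, 1.741, 1.866, 0.933, 1.150, 1.582, 1.799]

# matrix of all pairwise Pearson correlation coefficients
R = np.corrcoef([x, y, z])
```

`R[0, 1]` recovers the 0.593 entry of the slide's matrix, and `R[0, 2]` the near-perfect 0.999 correlation between x and z.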
Mechanics of PCA
• Eigenvalues and eigenvectors
When a square (N × N) matrix is multiplied by an (N × 1) vector, the result
is a new (N × 1) vector. The operation can be repeated on the new vector,
generating another (N × 1) vector. After a number of repeats (iterations)
the pattern of numbers generated settles down to a constant shape, although
the actual values grow (or shrink) by a constant factor at each step.
The growth (or shrinkage) factor per multiplication is known as the dominant
eigenvalue, and the pattern the numbers form is the dominant (or principal)
eigenvector.

M · V = λ · V

M - (N × N) matrix
V - (N × 1) vector (eigenvector)
λ - eigenvalue
Mechanics of PCA
• Eigenvalues and eigenvectors
First iteration:

| 1.000  0.593  0.999 |   | 1 |   | 2.592 |
| 0.593  1.000  0.594 | x | 1 | = | 2.187 |
| 0.999  0.594  1.000 |   | 1 |   | 2.593 |

Second iteration:

| 1.000  0.593  0.999 |   | 2.592 |   | 6.48 |
| 0.593  1.000  0.594 | x | 2.187 | = | 5.26 |
| 0.999  0.594  1.000 |   | 2.593 |   | 6.48 |

Iteration 5:  (98.6, 79.3, 98.6)
Iteration 10: (9181, 7384, 9181)
Iteration 20: (7.96e7, 6.40e7, 7.96e7)

First eigenvector:  (0.967, 0.777, 0.967)
Second eigenvector: (-0.253, 0.629, -0.253)
Dominant eigenvalue: 2.48

Once equilibrium is reached, each generation of numbers increases by a
factor of 2.48.
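The iteration above is the classic power method, and can be reproduced in a few lines (a Python/NumPy sketch; the rescaling step is added here to keep the numbers from exploding, which does not change the pattern):

```python
import numpy as np

# the Pearson correlation matrix from the previous slide
M = np.array([[1.000, 0.593, 0.999],
              [0.593, 1.000, 0.594],
              [0.999, 0.594, 1.000]])

v = np.ones(3)                  # starting pattern (1, 1, 1)
for _ in range(20):             # repeated multiplication (power iteration)
    w = M @ v
    v = w / np.linalg.norm(w)   # rescale; only the relative pattern matters

dominant_eigenvalue = v @ M @ v  # Rayleigh quotient at convergence
```

After 20 iterations `v` has settled into the dominant eigenvector pattern, and the growth factor per multiplication matches the 2.48 quoted on the slide.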
Mechanics of PCA
PCA takes a set of R observations on N variables as a set of R points
in an N-dimensional space. A new set of N principal axes is derived,
each one defined by rotating the dataset by a certain angle with respect
to the old axes.
The first axis in the new space (the first principal axis of the data)
encapsulates the maximum possible information content, the second axis
contains the second greatest information content, and so on.
Eigenvectors - the relative pattern of numbers which is preserved under
matrix multiplication.
Eigenvalues - give a precise indication of the relative importance of each
ordination axis, with the largest eigenvalue being associated with the
first principal axis, the second largest eigenvalue being associated
with the second principal axis, etc.
Mechanics of PCA
For example, a matrix with 20 species would generate 20 eigenvectors, but
only the first three or four would be of any importance for interpreting
the data.
The relationship between eigenvalues and variance in PCA:

Vm = 100 · λm / N

Vm - percent variance explained by the m-th ordination axis
λm - the m-th eigenvalue
N - number of variables
There is no formal test of significance available to decide if any given
ordination axis is meaningful, nor is there any test to decide whether
or not individual variables contribute significantly to an ordination axis.
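The Vm formula is direct to express in code (a plain-Python sketch; the eigenvalue 5.429 used below is the first eigenvalue of the 6-variable dog-mandible correlation matrix that appears later in the deck):

```python
def percent_variance(eigenvalue, n_vars):
    """Vm = 100 * lambda_m / N: percent variance explained by the m-th axis."""
    return 100 * eigenvalue / n_vars

# first eigenvalue of a 6-variable correlation matrix
v1 = percent_variance(5.429026, 6)   # ~ 90.48% on the first axis
```

Because the eigenvalues of an N-variable correlation matrix sum to N, the Vm values across all axes sum to 100%.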
Mechanics of PCA
Axis scores
The Nth axis of the ordination diagram is derived by multiplying the matrix
of normalized data by the Nth eigenvector.
Multiplying the normalised data (X, Y, Z) by the first eigenvector
(0.967, 0.777, 0.967) gives the first axis scores:

    X      Y      Z        first axis scores
-1.09  -1.35  -1.09            -3.16
-1.02  -1.26  -1.02            -2.95
-0.73  -0.90  -0.73            -2.11
-0.01  -0.01  -0.01            -0.02
 0.35   0.44   0.35             1.02
 1.08   1.33   1.08             3.12
 1.44   1.78   1.44             4.17
-1.26   0.51  -1.26            -2.04
-0.63   0.25  -0.63            -1.02
 0.62  -0.27   0.62             0.99
 1.24  -0.52   1.24             1.99

Multiplying the same data by the second eigenvector (-0.253, 0.629, -0.253)
gives the second axis scores:

-0.30, -0.28, -0.20, 0.00, 0.10, 0.29, 0.39, 0.96, 0.48, -0.48, -0.95
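The same multiplication in code (a Python/NumPy sketch using the normalised data and the two eigenvectors from the slides above):

```python
import numpy as np

# normalised data (X, Y, Z) from the earlier slide
Z = np.array([[-1.09, -1.35, -1.09],
              [-1.02, -1.26, -1.02],
              [-0.73, -0.90, -0.73],
              [-0.01, -0.01, -0.01],
              [ 0.35,  0.44,  0.35],
              [ 1.08,  1.33,  1.08],
              [ 1.44,  1.78,  1.44],
              [-1.26,  0.51, -1.26],
              [-0.63,  0.25, -0.63],
              [ 0.62, -0.27,  0.62],
              [ 1.24, -0.52,  1.24]])

e1 = np.array([ 0.967, 0.777,  0.967])   # first eigenvector
e2 = np.array([-0.253, 0.629, -0.253])   # second eigenvector

scores1 = Z @ e1   # first axis scores
scores2 = Z @ e2   # second axis scores
```

Each score is just the dot product of an observation's row with the eigenvector; the first entries reproduce the -3.16 and -0.30 shown in the tables.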
PCA Example
Excavations of prehistoric sites in northeast Thailand have produced a series of
canid (dog) bones covering a period from about 3500 BC to the present.
In order to clarify the ancestry of the prehistoric dogs, mandible measurements
were made on the available specimens. These were then compared with similar
measurements on the golden jackal, the Chinese wolf, the Indian wolf, the dingo,
the cuon, and the modern dog from Thailand. How are these groups related,
and how is the prehistoric group related to the others?
R data “Phistdog”
Variables:
Mbreadth- breadth of mandible
Mheight- height of mandible below 1st molar
mlength- length of 1st molar
mbreadth- breadth of 1st molar
mdist- length from 1st to 3rd molars inclusive
pmdist- length from 1st to 4th premolars inclusive
PCA Example
# read the “Phistdog” data and use the first column as the row names
> Phistdog=read.csv("E:/Multivariate_analysis/Data/Prehist_dog.csv",header=T,row.names=1)

Calculate the variances of the Phistdog data set. The round command is used
to limit the output to two decimals for reasons of space.
> round(sapply(Phistdog,var),2)
Mbreath Mheight mlength mbreadth  mdist pmdist
   2.88   10.56    9.61     1.36  24.30  31.52
The measurements are on a similar scale, and the variances are not very
different. We can use either the correlation or the covariance matrix.
PCA Example
Calculate the correlation matrix of the data.
> round(cor(Phistdog),2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath     1.00    0.95    0.92     0.98  0.78   0.81
Mheight     0.95    1.00    0.88     0.95  0.71   0.85
mlength     0.92    0.88    1.00     0.97  0.88   0.94
mbreadth    0.98    0.95    0.97     1.00  0.85   0.91
mdist       0.78    0.71    0.88     0.85  1.00   0.89
pmdist      0.81    0.85    0.94     0.91  0.89   1.00
PCA Example
Calculate the covariance matrix of the data.
> round(cov(Phistdog),2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath     2.88    5.25    4.85     1.93  6.52   7.74
Mheight     5.25   10.56    8.90     3.59 11.45  15.58
mlength     4.85    8.90    9.61     3.51 13.39  16.31
mbreadth    1.93    3.59    3.51     1.36  4.86   5.92
mdist       6.52   11.45   13.39     4.86 24.30  24.60
pmdist      7.74   15.58   16.31     5.92 24.60  31.52
PCA Example
Calculate the eigenvectors and eigenvalues of the correlation matrix:
> eigen(cor(Phistdog))
$values
[1] 5.429026124 0.369268401 0.128686279 0.064760299 0.006117398 0.002141499
$vectors
           [,1]        [,2]        [,3]         [,4]         [,5]       [,6]
[1,] -0.4099426  0.40138614 -0.45937507 -0.005510479  0.009871866  0.6779992
[2,] -0.4033020  0.48774128  0.29350469 -0.511169325 -0.376186947 -0.3324158
[3,] -0.4205855 -0.08709575  0.02680772  0.737388619 -0.491604714 -0.1714245
[4,] -0.4253562  0.16567935 -0.12311823  0.170218718  0.739406740 -0.4480710
[5,] -0.3831615 -0.67111237 -0.44840921 -0.404660012 -0.136079802 -0.1394891
[6,] -0.4057854 -0.33995660  0.69705234 -0.047004708  0.226871533  0.4245063
PCA Example
Calculate the eigenvectors and eigenvalues of the covariance matrix:
> eigen(cov(Phistdog))
$values
[1] 72.512852567 4.855621390 2.156165476 0.666083782 0.024355099
[6] 0.005397877
$vectors
           [,1]       [,2]       [,3]        [,4]        [,5]         [,6]
[1,] -0.1764004 -0.2228937 -0.4113227 -0.10162260  0.65521113  0.557123088
[2,] -0.3363603 -0.6336812 -0.3401245  0.47472891 -0.36879498 -0.090818041
[3,] -0.3519843 -0.1506859 -0.1472096 -0.83773573 -0.36033271 -0.009453262
[4,] -0.1301150 -0.1132540 -0.1502766 -0.10976633  0.51257082 -0.820294484
[5,] -0.5446003  0.7091113 -0.3845381  0.20868622 -0.09193887 -0.026446421
[6,] -0.6467862 -0.1019554  0.7231913  0.08309978  0.18348673  0.087716189
PCA Example
Extract the principal components from the correlation matrix:
> Phistdog_Cor=princomp(Phistdog,cor=TRUE)
> summary(Phistdog_Cor,loadings=TRUE)
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 2.3300271 0.60767458 0.35872870
Proportion of Variance 0.9048377 0.06154473 0.02144771
Cumulative Proportion 0.9048377 0.96638242 0.98783013
Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath  -0.410  0.401 -0.459
Mheight  -0.403  0.488  0.294
mlength  -0.421
mbreadth -0.425  0.166 -0.123
mdist    -0.383 -0.671 -0.448
pmdist   -0.406 -0.340  0.697
The first principal component accounts for 90% of the variance; the
remaining components together account for less than 10%.
PCA Example
Extract the principal components from the covariance matrix:
> Phistdog_Cov=princomp(Phistdog)
> summary(Phistdog_Cov,loadings=TRUE)
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 7.8837728 2.04008853 1.35946380
Proportion of Variance 0.9039195 0.06052845 0.02687799
Cumulative Proportion 0.9039195 0.96444795 0.99132595
Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath  -0.176  0.223 -0.411
Mheight  -0.336  0.634 -0.340
mlength  -0.352  0.151 -0.147
mbreadth -0.130  0.113 -0.150
mdist    -0.545 -0.709 -0.385
pmdist   -0.647  0.102  0.723

The loadings obtained from the covariance matrix differ from those
obtained from the correlation matrix; the proportions of variance are
similar.
PCA Example
Plot the variances of the principal components:
> screeplot(Phistdog_Cor,main="Phistdog",cex.names=0.75)

[Figure: screeplot “Phistdog” - variances (scale 0 to 60) of Comp.1 through Comp.6.]
PCA Example
Equations for the first two principal components from the correlation
matrix:

Y1 = -0.41·Mbreadth - 0.40·Mheight - 0.42·mlength - 0.42·mbreadth - 0.38·mdist - 0.41·pmdist
Y2 =  0.40·Mbreadth + 0.48·Mheight + 0.16·mbreadth - 0.67·mdist - 0.34·pmdist

Equations for the first two principal components from the covariance
matrix:

Y1 = -0.17·Mbreadth - 0.33·Mheight - 0.35·mlength - 0.13·mbreadth - 0.54·mdist - 0.64·pmdist
Y2 =  0.22·Mbreadth + 0.63·Mheight + 0.15·mlength + 0.11·mbreadth - 0.70·mdist + 0.10·pmdist

All variables have negative loadings on the first principal axis; loadings
on the second principal axis are mostly positive.
PCA Example
Calculate the axis scores for the principal components from the correlation matrix:
> round(Phistdog_Cor$scores,2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern        1.47   0.04  -0.05  -0.18  -0.08   0.09
G.jackal      3.32  -0.66  -0.25   0.34   0.05  -0.01
C.wolf       -4.33   0.03  -0.23   0.11   0.09   0.03
I.wolf       -2.13  -0.58  -0.09   0.03  -0.14  -0.05
Cuon          0.45   1.16   0.29   0.30  -0.03  -0.02
Dingo         0.08  -0.47   0.73  -0.20   0.06  -0.01
Prehistoric   1.14   0.49  -0.40  -0.40   0.04  -0.05
PCA Example
Calculate the axis scores for the principal components from the covariance matrix:
> round(Phistdog_Cov$scores,2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern        4.77  -0.27  -0.18   0.49   0.01  -0.15
G.jackal     10.23  -2.76   0.26  -1.04   0.08   0.03
C.wolf      -13.89   0.18  -0.83  -0.39   0.22  -0.01
I.wolf       -8.25  -1.67  -0.25  -0.23  -0.29   0.00
Cuon          3.98   4.31   0.17  -0.76  -0.07   0.01
Dingo        -2.00   0.02   2.83   0.82   0.04   0.04
Prehistoric   5.16   0.20  -2.01   1.10   0.01   0.08
PCA Example
Plot the first principal component vs. the second principal component
obtained from the correlation matrix:

> plot(Phistdog_Cor$scores[,2]~Phistdog_Cor$scores[,1],xlab="PC1",ylab="PC2",pch=15,xlim=c(-4.5,3.5),ylim=c(-0.75,1.5))
> text(Phistdog_Cor$scores[,1],Phistdog_Cor$scores[,2],labels=row.names(Phistdog),cex=0.7,pos=rep(1,7))
> abline(h=0)
> abline(v=0)

and from the covariance matrix:

> plot(Phistdog_Cov$scores[,2]~Phistdog_Cov$scores[,1],xlab="PC1",ylab="PC2",pch=15,xlim=c(-14.5,11),ylim=c(-3.5,4.5))
> text(Phistdog_Cov$scores[,1],Phistdog_Cov$scores[,2],labels=row.names(Phistdog),cex=0.7,pos=rep(1,7))
> abline(v=0)
> abline(h=0)
PCA Example

[Figure: two scatterplots of PC2 against PC1 - one based on the covariance
matrix, one based on the correlation matrix - with the seven groups
(Modern, G.jackal, C.wolf, I.wolf, Cuon, Dingo, Prehistoric) labelled.]
PCA Example
Although the scores given by the covariance and correlation matrices
differ, the information provided by the two diagrams is the same.
The Modern dog has the mandible measurements closest to the Prehistoric
dog, which suggests that the two groups are related.
The Cuon and Dingo groups are the next closest to the Prehistoric dog.
I.wolf, C.wolf, and G.jackal are not closely related to the Prehistoric
dog or to any other group.