Chapter 5. Principal Components Analysis
Revised: 2/17/2016

5:1 Analysis of Principal Components

5:2 Introduction
PCA is a method to study the structure of the data, with emphasis on determining the patterns of covariance among variables. Thus, PCA is the study of the structure of the variance-covariance matrix. In practical terms, PCA is a method to identify variables or sets of variables that are highly correlated with each other. The results can be used for multiple purposes:

• To construct a new set of variables that are linear combinations of the original variables and that contain exactly the same information as the original variables but that are orthogonal to each other.
• To identify patterns of multicollinearity in a data set and use the results to address the collinearity problem in multiple linear regression.
• To identify variables or factors, underlying the original variables, that are responsible for the variation in the data.
• To find the effective number of dimensions over which the data set exhibits variation, with the purpose of reducing the number of dimensions of the problem.
• To create a few orthogonal variables that contain most of the information in the data and that simplify the identification of groupings in the observations.
PCA is applied to a single group of variables; there is no distinction between explanatory and
response variables. In multiple linear regression (MLR), PCA is applied only to the set of X
variables, to study multicollinearity.
5:2.1 Example 1
Koenigs et al. (1982) used PCA to identify environmental gradients and to relate vegetation gradients to environmental variables. Forty-eight environmental variables, including soil chemical and physical characteristics, were measured at 40 different sites. In addition, over 17 vegetation variables, including % cover, height, and density, were measured at each site. PCA was applied to both the environmental and the vegetation variables.
The first 3 PC's of the vegetation variables explained 73.3% of all the variation in the data. These 3 dimensions were interpreted as the gradients naturally "perceived" by the plant community, and they were well correlated with environmental variables related to soil moisture.
5:2.2 Example 2
Jeffers (J. N. R. Jeffers. 1967. Two case studies in the application of principal component analysis. Applied Statistics 16:225-236) applied PCA to a sample of 40 winged aphids on which 19
different morphological characteristics had been measured. The characteristics measured included
body length, body width, forewing length, leg length, length of various antennal segments, number
of spiracles, etc. Winged aphids are difficult to identify, so the study used PCA to determine the
number of distinct taxa present in the sample. Although PCA is not a formal procedure to define the
clusters or groups of observations, it simplifies the data so they can be inspected graphically. The
first two PC’s explained 85% of the total variance in the correlation matrix. When the 40 samples
were plotted on the first two PC's, they formed 4 major groups distributed in an "interesting" S shape. Although 19 traits were measured, the data only contained information equivalent to 2-3
independent traits.
Figure 5-1. Use of principal components to facilitate the identification of taxa of aphids.
5:3 Model and concept
PCA does not have any model to be tested, although it is assumed that the variables are
linearly related. The analysis can be thought of as looking at the same set of data from a different
perspective. The perspective is changed by moving the origin of the coordinate system to the
centroid of the data and then rotating the axes.
Given a set of p variables (X1, ..., Xp), PCA calculates a set of p linear combinations of the
variables (PC1, ..., PCp) such that:

• The total variation in the new set of variables, or principal components, is the same as in the original variables.
• The first PC contains the most variance possible, i.e., as much variance as can be captured in a single axis.
• The second PC is orthogonal to the first one (their correlation is 0), and contains as much of the remaining variance as possible.
• The third PC is orthogonal to all previous PC's and also contains the most variance possible.
• Etc.
This procedure is achieved by calculating a matrix of coefficients whose columns are called
eigenvectors of the variance-covariance or of the correlation matrix of the data set. Some basic
consequences of the procedure are that:

• All original variables are involved in the calculation of PC scores (i.e., the location of each observation in the new set of axes formed by the PC's).
• The sum of the variances of the PC's equals the sum of the variances of the original variables when PCA is based on the variance-covariance matrix, or the sum of the variances of the standardized variables when PCA is based on the correlation matrix.
• There are p eigenvalues (p = number of variables in the data), each one associated with one eigenvector and one PC. These eigenvalues are the variances of the data along each PC. Thus, the sum of the eigenvalues based on the variance-covariance matrix is equal to the sum of the variances of the original variables.
PCA based on the correlation matrix is equivalent to PCA based on the variance-covariance matrix of the standardized variables. Because standardized variables have variance = 1, the sum of the eigenvalues is p, the number of variables.
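This last point can be verified numerically. The short SAS/IML sketch below is an illustration, not part of the original handout; it assumes a reasonably recent SAS/IML release and uses the spartina data set introduced later in the chapter. It extracts the eigenvalues of the correlation matrix and confirms that they add up to p = 14.

proc iml;
  /* read the 14 soil variables of the Spartina example into a matrix */
  use spartina;
  read all var {h2s sal eh7 ph acid p k ca mg na mn zn cu nh4} into X;
  close spartina;
  R = corr(X);                 /* 14 x 14 correlation matrix                */
  call eigen(lambda, V, R);    /* eigenvalues (lambda) and eigenvectors (V) */
  print lambda;                /* the variances of the PC's                 */
  sumEig = sum(lambda);        /* should equal p = 14                       */
  print sumEig[label="Sum of eigenvalues"];
quit;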
5:4 Assumptions and potential problems
5:4.1 Normality
For descriptive PCA no specific distribution is assumed. If the variables have a multivariate normal distribution, the results of the analysis are enhanced and tend to be clearer. Normality can be assessed by using the Analyze -> Distribution platform in JMP, or PROC UNIVARIATE in SAS. Transformations can be applied to approach normality as described in Figure 4.6 of Tabachnick and Fidell (1996). Multivariate normality can be assessed by looking at the pairwise scatterplots. If variables are normal and linearly related, the data will tend to exhibit multivariate normality. Strict testing of multivariate normality can be achieved by calculating the jackknifed squared Mahalanobis distance for each observation and then testing the hypothesis that its distribution is a χ² distribution with as many degrees of freedom as variables considered in the PCA. This test is very sensitive, so it is recommended that a very low α be used (e.g., 0.001 or 0.0005).
Open the file spartina.jmp. In JMP, select the ANALYZE -> MULTIVARIATE platform and
include all variables in the Response box; then, click OK.
In the results of the multivariate platform, select Outlier Analysis. This will display the Mahalanobis distance for all points in the dataset. In the picture below, the observations were already sorted by increasing distance, so the plot looks like an ordered string of dots.
Click on the red triangle by Outlier Analysis and select Save Jackknife Distance. This creates a new column in the data table, labeled Jackknife Distance. Create a new column where you calculate the squared jackknife Mahalanobis distance, which in the next picture is labeled Dsq.
Note that the two labels, “Mahalanobis” and “Jackknife” both refer to the same statistical
distance D. The difference is that the latter is calculated using the jackknife procedure whereby D
for each observation is calculated while holding the observation out of the data to obtain the
variance-covariance matrix.
D² is the variable that should be distributed like a χ² random variable, with p = 14 degrees of freedom in this case. In order to determine how close D² is to a χ², it is necessary to create a column of expected quantiles for the χ². For this, the data must be sorted in increasing order of D². The column with the χ² quantile is created with a formula as indicated in the next figure. Conceptually, the χ² quantile is the value of a χ² with p degrees of freedom that is greater than x% of the values, where x is the proportion of rows (observations) with lower values of D².
For the final step, simply regress D² on the χ² quantile. The line should be straight, with slope 1 and a zero intercept. In the specific Spartina example, there is at least one outlier that throws off multivariate normality.
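The same check can also be scripted in SAS. The sketch below is only an illustration, not part of the original handout: it assumes the squared jackknife distances have already been saved in a data set named distances with a column named Dsq (both names are placeholders), and it uses the common (i - 0.5)/n plotting position for the χ² quantiles with 14 degrees of freedom.

proc sort data=distances;           /* order the observations by increasing D-squared */
  by Dsq;
run;

data distances;
  set distances nobs=n;
  /* expected chi-square(14) quantile for this observation's rank */
  chisq_q = cinv((_n_ - 0.5) / n, 14);
run;

proc reg data=distances;            /* the fitted line should have slope near 1 and intercept near 0 */
  model Dsq = chisq_q;
run;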
5:4.2 Linearity
PCA assumes (i.e., only accounts for) linear relationships among variables. Lack of linearity works against the ability to "concentrate" variation in a few PC's. Linearity can be examined by looking at the pairwise scatterplots. If two variables are not linearly related, a transformation can be applied. The variable to be transformed should be carefully selected so as not to disrupt its linearity with the rest of the variables.
5:4.3 Sample size
A potential problem of PCA is that results are not reliable (i.e., they differ from sample to sample) if sample sizes are small. The problem is not as grave as for factor analysis, and it diminishes as the variables exhibit higher correlations. Although some textbooks recommend 300 cases or observations, this is probably more appropriate for social studies where many variables cannot be measured directly (e.g., intelligence). In agriculture and biology, scientists routinely use data sets with 30 or more observations and the results pass peer review.
5:4.4 Outliers
Multivariate outliers can be a major problem in PCA, because just one or a few observations
can completely distort the results. As indicated in the Data Screening topic, multivariate outliers can
be identified by testing the jackknifed squared Mahalanobis distance. Transformations can help. If
outliers remain after transformations, observations can be considered for deletion, but this has to be
fully reported, and it has to be understood that elimination of observations just because they do not
fit the rest of the data can have negative implications on the correspondence between sample and
population. If sample size is very large, then deletion of a few outliers will not severely restrict the
applicability of the results.
5:5 Geometry of PCA
The principal components are obtained by rotating a set of orthogonal (perpendicular and
independent) axes, pivoting at the centroid of the data. First, the direction that maximizes the
variance of the scores (or perpendicular projections of the data points on the moving axes) on the
first axis is determined, and the axis (PC1) is then fixed in that position. The rotation continues with
the constraint that PC1 is the axis of rotation until the second axis maximizes the variance of the
scores, which is the position for PC2. The procedure continues until PC p-1 is set. Because of the
orthogonality, setting PC p-1 also sets the last PC.
5:6 Procedure for analysis
5:6.1 JMP procedure.
There are two ways to get Principal Components in JMP: through the Multivariate and the
Graph -> Spinning Plot platforms. Both give all the numerical information. The Spinning Plot also
displays a biplot, which is described below. When using the Spinning Plot, make sure you select Principal Components and not "Std Principal Components"; the latter gives the standardized PC scores.
5:6.2 SAS code and output
proc princomp data=spartina out=spartpc;
var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;
In this example, PCA is done on the correlation matrix, which is equivalent to saying that PC’s
were calculated on the basis of the standardized variables. Using the correlation matrix is the
default option for the PROC PRINCOMP, and is the most common choice. Alternatively, the
analysis can use the covariance matrix, which just centers the data (i.e. all the principal
components go through the centroid of the sample). The rationale for choosing correlation or
covariance for PCA is discussed below.
Box 1: Simple statistics

Principal Component Analysis
45 Observations
14 Variables

Simple Statistics
Variable           Mean            StD
H2S        -601.7777778     30.6956385
SAL         30.26666667     3.71972629
EH7        -314.4000000     36.9559935
PH          4.602222222    1.246994366
ACID        3.861777778    2.506354913
P           32.29688889    27.58669395
K           797.6228889    297.6023371
CA          2365.318889    1718.327317
MG          3075.109333     939.406676
NA          16596.71111     6882.42337
MN          38.10054667    24.48057096
ZN          17.87524000     8.27980582
CU          3.988576667    1.036991704
NH4         87.45520000    47.27275022
Each eigenvalue represents the amount of variation from the original sample that is explained
by the corresponding PC. In this example PCA was based on the correlations or standardized
variables. Each standardized variable has a mean of zero and a variance of 1. Thus the sum of the
variances of the original variables is equal to the number of variables, and the first PC accounts for
4.924/14 or 0.3517 of the total sample variance.
The eigenvectors are vectors of coefficients that can be used to get the values of the projections of each observation on each new axis or PC.
Box 2: Correlation Matrix

         H2S      SAL      EH7       PH     ACID        P        K
H2S   1.0000   0.0958   0.3997   0.2735   -.3738   -.1154   0.0690
SAL   0.0958   1.0000   0.3093   -.0513   -.0125   -.1857   -.0206
EH7   0.3997   0.3093   1.0000   0.0940   -.1531   -.3054   0.4226
PH    0.2735   -.0513   0.0940   1.0000   -.9464   -.4014   0.0192
ACID  -.3738   -.0125   -.1531   -.9464   1.0000   0.3829   -.0702
P     -.1154   -.1857   -.3054   -.4014   0.3829   1.0000   -.2265
K     0.0690   -.0206   0.4226   0.0192   -.0702   -.2265   1.0000
CA    0.0933   0.0880   -.0421   0.8780   -.7911   -.3067   -.2652
MG    -.1078   -.0100   0.2985   -.1761   0.1305   -.0632   0.8622
NA    -.0038   0.1623   0.3425   -.0377   -.0607   -.1632   0.7921
MN    0.1415   -.2536   -.1113   -.4751   0.4204   0.4954   -.3475
ZN    -.2724   -.4208   -.2320   -.7222   0.7147   0.5574   0.0736
CU    0.0127   -.2660   0.0945   0.1814   -.1432   -.0531   0.6931
NH4   -.4262   -.1568   -.2390   -.7460   0.8495   0.4897   -.1176

          CA       MG       NA       MN       ZN       CU      NH4
H2S   0.0933   -.1078   -.0038   0.1415   -.2724   0.0127   -.4262
SAL   0.0880   -.0100   0.1623   -.2536   -.4208   -.2660   -.1568
EH7   -.0421   0.2985   0.3425   -.1113   -.2320   0.0945   -.2390
PH    0.8780   -.1761   -.0377   -.4751   -.7222   0.1814   -.7460
ACID  -.7911   0.1305   -.0607   0.4204   0.7147   -.1432   0.8495
P     -.3067   -.0632   -.1632   0.4954   0.5574   -.0531   0.4897
K     -.2652   0.8622   0.7921   -.3475   0.0736   0.6931   -.1176
CA    1.0000   -.4184   -.2482   -.3090   -.6999   -.1122   -.5826
MG    -.4184   1.0000   0.8995   -.2194   0.3452   0.7121   0.1082
NA    -.2482   0.8995   1.0000   -.3101   0.1170   0.5601   -.1070
MN    -.3090   -.2194   -.3101   1.0000   0.6033   -.2335   0.5270
ZN    -.6999   0.3452   0.1170   0.6033   1.0000   0.2121   0.7207
CU    -.1122   0.7121   0.5601   -.2335   0.2121   1.0000   0.0137
NH4   -.5826   0.1082   -.1070   0.5270   0.7207   0.0137   1.0000
The logic behind this is just a change of axes: just as the location of an observation can be expressed as a vector of p dimensions (in the example p = 14) whose elements are the measured variables, each observation can be expressed as a vector of p PC values or scores. The scores for all PC's for all observations can be saved into a SAS data set by using the OUT=filename option in the PROC PRINCOMP. This option creates a SAS file with all the information contained in the file specified by the DATA= option, plus the scores for all observations in all PC's. In JMP, from the Spinning Plot red triangle, select Save Principal Components.
To further the explanation of the eigenvectors, consider the first observation in the Spartina data
set. The score for that observation on PC1 can be calculated by multiplying the standardized value
for each variable by the corresponding element of the first column of the matrix of eigenvectors and
adding all the terms, as shown in Table 1.
Box 3. Eigenvalues

Eigenvalues of the Correlation Matrix
          Eigenvalue   Difference   Proportion   Cumulative
PRIN1       4.92391      1.22868     0.351708      0.35171
PRIN2       3.69523      2.08810     0.263945      0.61565
PRIN3       1.60713      0.27222     0.114795      0.73045
PRIN4       1.33490      0.64330     0.095350      0.82580
PRIN5       0.69160      0.19103     0.049400      0.87520
PRIN6       0.50057      0.11513     0.035755      0.91095
PRIN7       0.38544      0.00467     0.027531      0.93848
PRIN8       0.38077      0.21480     0.027198      0.96568
PRIN9       0.16597      0.02298     0.011855      0.97754
PRIN10      0.14299      0.05613     0.010214      0.98775
PRIN11      0.08687      0.04158     0.006205      0.99395
PRIN12      0.04529      0.01544     0.003235      0.99719
PRIN13      0.02985      0.02036     0.002132      0.99932
PRIN14      0.00949            .     0.000678      1.00000
Box 4. Eigenvectors

Eigenvectors
          PRIN1     PRIN2     PRIN3     PRIN4     PRIN5     PRIN6     PRIN7
H2S     -.163637  0.009086  0.231669  0.689722  0.014386  -.419348  0.300094
SAL     -.107894  0.017324  0.605727  -.270389  0.508742  0.010076  0.383770
EH7     -.123813  0.225247  0.458251  0.301313  -.166758  0.596651  -.296867
PH      -.408217  -.027467  -.282670  0.081726  0.091618  0.191256  0.056897
ACID    0.411680  -.000362  0.204919  -.165831  -.162713  -.024061  0.117085
P       0.273196  -.111277  -.160543  0.199965  0.747115  -.017903  -.336928
K       -.033446  0.487887  -.022907  0.043000  -.061998  -.016587  -.067421
CA      -.358562  -.180445  -.206595  -.054385  0.206152  0.427579  0.104949
MG      0.079033  0.498653  -.049515  -.036561  0.103793  0.034182  -.044195
NA      -.017130  0.470439  0.050575  -.054358  0.239519  -.060440  -.181661
MN      0.277082  -.182164  0.019849  0.483078  0.038899  0.299511  0.124567
ZN      0.404195  0.088823  -.176373  0.150047  -.007768  0.034351  -.072907
CU      -.010788  0.391707  -.376740  0.102023  0.063434  0.077993  0.562581
NH4     0.398754  -.025968  -.010607  -.104087  -.005857  0.381686  0.395252

          PRIN8     PRIN9     PRIN10    PRIN11    PRIN12    PRIN13    PRIN14
H2S     -.073755  0.168302  0.295840  0.222927  -.015407  0.006864  -.079812
SAL     0.100873  -.175066  -.227621  0.088425  -.156210  -.094878  0.089376
EH7     -.312742  -.226136  0.083754  -.023086  0.055421  -.033492  -.023123
PH      -.029538  0.023918  0.146959  0.041662  -.331152  0.025938  0.750134
ACID    -.152610  0.095416  0.101118  0.344782  0.455459  0.351392  0.477337
P       -.398662  0.077828  -.017685  -.034542  0.064822  0.065467  0.014741
K       -.115096  0.559085  -.555004  0.217893  -.030301  -.249524  0.072785
CA      0.185889  0.186412  0.073763  0.511310  0.346574  0.079545  -.307040
MG      0.170996  -.011293  0.111582  0.118799  -.397791  0.690127  -.192283
NA      0.449939  0.088170  0.439200  -.216233  0.363391  -.276211  0.143663
MN      0.531706  0.086117  -.361647  -.269913  0.077826  0.172893  0.140813
ZN      0.208525  -.439455  0.014406  0.568635  -.222750  -.396331  0.041311
CU      -.277074  -.376706  -.129195  -.192872  0.305087  -.000372  -.043094
NH4     -.145025  0.420100  0.393717  -.130247  -.301510  -.230796  -.117317
Table 1. Example showing how to calculate the PC1 score for observation 1. The values of the original variables are standardized because this PCA was performed on the correlation matrix.

PC1,1 = -0.164*[-610.00-(-601.78)]/30.70
      + -0.108*[33.00-(30.27)]/3.72
      + -0.124*[-290.00-(-314.40)]/36.96
      + -0.408*[5.00-(4.60)]/1.25
      + 0.412*[2.34-(3.86)]/2.51
      + 0.273*[20.24-(32.30)]/27.59
      + -0.033*[1441.67-(797.62)]/297.60
      + -0.359*[2150.00-(2365.32)]/1718.33
      + 0.079*[5169.05-(3075.11)]/939.41
      + -0.017*[35184.50-(16596.71)]/6882.42
      + 0.277*[14.29-(38.10)]/24.48
      + 0.404*[16.45-(17.88)]/8.28
      + -0.011*[5.02-(3.99)]/1.04
      + 0.399*[59.52-(87.46)]/47.27
      = -1.097
In matrix notation the calculation of PC scores is straightforward: simply multiply the matrix of standardized values in the original axes, Z (the standardized data matrix), by the matrix of eigenvectors V to obtain an n×p matrix of scores W:

W = ZV

The calculations in matrix notation are illustrated in the file HOW03.xls.
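The same product can also be reproduced in SAS/IML. The sketch below is only an illustration (it assumes SAS/IML is available in a reasonably recent release); note that the sign of each eigenvector is arbitrary, so a column of W may come out with the opposite sign of the scores saved by PROC PRINCOMP.

proc iml;
  use spartina;
  read all var {h2s sal eh7 ph acid p k ca mg na mn zn cu nh4} into X;
  close spartina;
  n = nrow(X);
  /* standardized data matrix Z (n x p): center and scale each column */
  Z = (X - repeat(mean(X), n, 1)) / repeat(std(X), n, 1);
  call eigen(lambda, V, corr(X));   /* V holds the eigenvectors as columns */
  W = Z * V;                        /* n x p matrix of PC scores           */
  print (W[1, 1])[label="PC1 score for observation 1 (compare with Table 1)"];
quit;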
5:6.3 Loadings
"Loadings" are the correlations between each one of the original variables and each one of the
principal components. Therefore, there are as many loadings as coefficients in the matrix of
eigenvectors. Loadings can be used to interpret the results of the PCA, because a high loading for
a variable in a PC indicates that the PC has a strong common component or relationship with the
variable. Loadings are interpreted by looking at the set of loadings for each PC and identifying
groups that are large and negative, and groups that are large and positive. This is then used to
interpret the PC as being an underlying factor that reflects increases in variables with positive
loadings, and decreases in variables with negative loadings.
Loadings can be calculated as a function of the eigenvectors and standard deviations of PC's
and original variables, or they can be calculated directly by saving the PC scores and correlating
them with the original variables.
r_ij = w_ij (s_j / s_i)

where r_ij is the loading of variable i on PC j, w_ij is the element of the eigenvector that goes with variable i in PC j, and s_i and s_j are the standard deviations of variable i and PC j, respectively.
In the Spartina example, the loading for H2S in PC1 is
-0.163637*sqrt(4.92391) = -0.3631. The loading for Mg in PC2 is 0.498653*sqrt(3.69523) = 0.9586.
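As a cross-check (an illustration, not part of the original handout), the loadings can also be obtained directly by correlating the PC scores saved in the OUT= data set (spartpc, created by the PROC PRINCOMP call shown earlier) with the original variables:

/* correlations between the first two PC's and the original variables = loadings */
proc corr data=spartpc;
  var prin1 prin2;
  with h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;

The H2S entry in the PRIN1 column of the output should be approximately -0.36, matching the calculation above.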
5:6.4 Interpretation of results
Interpretation of the results depends on the main goal for the PCA. We will consider two main
types of goals:
1. Reduction of the dimensionality of the data set and/or identification of underlying factors.
2. Analysis of collinearity among X variables in a regression problem.
5:6.4.1 Identification of fewer underlying factors
In the first case, the interpretation depends on whether the analysis identifies a clear subset of
PC's that explain a large proportion of all of the variance. When this is the case, it is possible to try
to interpret the most important PC's as underlying factors.
In the Spartina example, the first two principal components represent two systems of variables
that tend to vary together. PC1 is associated with pH, buffer acid, NH4, and Ca, and appears to
represent a gradient of acidity-alkalinity. Similar interpretations can be reached for the next two
principal components. The interpretation is simplified by using Gabriel's Biplots.
A Gabriel's biplot has two components (hence the name biplot) plotted on the same set of axes
represented by a pair of PC's (typically PC1 vs. PC2): a scatterplot of observations, and a plot of
vectors that represent the loadings or correlations of each original variable with each PC in the
biplot. The first element is obtained by plotting PC1 vs. PC2 for each observation as dots on the
graph. The second element is obtained by making a vector, for each variable, that goes from the
origin (0,0) to the point represented by (rxpc1, rxpc2), where rxpc1, and rxpc2 are the loadings for the
variable X with PC1 and PC2, respectively. Therefore, the Gabriel's biplot has as many points as
observations, and as many vectors as variables in the data set. For the Spartina example, there are
45 points and 14 vectors (Figure 1). Groups of vectors that point in about the same (or directly
opposite) directions indicate variables that tend to change together.
Note that JMP draws the “rays” or vectors in the Spinning Graph by linking the origin to the PC
scores or coordinates of a fictitious point that is at the average of all original variables except for the
one it represents, for which it has a value equal to 3 standard deviations. This facilitates viewing the
rays and scatter of points together, and it preserves the interpretability of the relative lengths of the
vectors, but the vectors no longer have a total length of 1 (over all PC's). I graphed the loadings for the Spartina example in the HW03.xls file to illustrate this point, with diamonds marking what
would be the tips of the rays in a biplot.
Why do the vectors have a length equal to 1.0? Think of the vector for one of the variables, say pH, and imagine it poking through the 14-dimensional space formed by the principal axes or components. The length of the vector is the length of the hypotenuse of a right triangle. By applying the Pythagorean theorem several times, one can calculate the squared length of the vector, which is the sum of the squares of the 14 coordinates. Recall that each coordinate is the correlation between the corresponding PC and pH, because of the way the vector for pH was constructed. Therefore, the square of each coordinate is the R² or proportion of the variance in pH explained by each PC. Since the PC's are orthogonal to each other and together they contain all of the variance in the sample, no portion of the variance of pH is explained by more than one component, and the sum of the variance explained by all components equals the total variance in pH. Therefore, the sum of the individual r²'s, which is the squared length of the vector, must equal 1, and so must the length itself.
The points on the plot help to see how the observations may form different groups or vary along
the "gradients" represented by the combination of both PC's. The vectors that have length close to
1 (say >.85) represent variables that have a strong association with the two components, i.e., the
two PC's capture a great deal of the variation of the original variable. When the vector length is
close to 1, the relationships of that vector and others that are also close to 1 will be accurately
displayed in the plot. The direction of the vector shows the sign of the correlations (loadings).
Moreover, the angle between any two vectors shows the degree of correlation between the
variation of the two variables that is captured on the PC1-PC2 plane (in fact, r = cos [angle]). When
vectors tend to form "bundles" they can be interpreted as systems of variables that describe the
gradient. For example, Ca, pH, NH4, acid, and Zn form such a system.
Figure 1. Gabriel's biplot for the Spartina example. Numbers next to each point indicate the
location of the sample.
5:6.4.2 Identification of collinearity
When the PCA is performed on a set of X variables that are considered as explanatory
variables for a multiple linear regression problem, the interpretation is different from above. The
main goal in this case is to determine if there are variables in the set that tend to be almost perfect
linear combinations of other variables in the set. These variables have to be identified and considered for deletion.
Eigenvalue   Condition number
  4.924           1.00
  3.695           1.15
  1.607           1.75
  1.335           1.92
  0.692           2.67
  0.501           3.14
  0.385           3.57
  0.381           3.60
  0.166           5.45
  0.143           5.87
  0.087           7.53
  0.045          10.43
  0.030          12.83
  0.010          22.77
Identification of variables that may be causing a collinearity
problem is achieved by calculating the Condition Number (CN) for
each PC. Keep in mind that collinearity is a problem for MLR, not for
PCA; we use PCA to work on the MLR problem.
CN_i = sqrt(λ_1 / λ_i)
Hence, the CN for PCi is the square root of the quotient between
the largest eigenvalue and the eigenvalue for the PC under
consideration. A value of CN of 30 or greater identifies a PC
implicated in the collinearity. Those variables can be identified by
requesting the COLLINOINT option in the PROC REG in SAS.
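A minimal sketch of that request follows (the response variable name bio is a placeholder; the notes do not name the response for the Spartina regression). The COLLINOINT option prints the collinearity diagnostics with the intercept adjusted out.

proc reg data=spartina;
  /* collinearity diagnostics for the 14 explanatory variables; bio is a placeholder response */
  model bio = h2s sal eh7 ph acid p k ca mg na mn zn cu nh4 / collinoint;
run;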
5:7 Some issues and Potential problems in PCA
5:7.1 Use of correlation or covariance matrix?
The most typical choice is to use the correlation matrix to perform PCA, because this removes
the impact of differences in the units used to express different variables. When the covariance
matrix is used (by choosing Principal Components on Covariance in JMP or specifying the COV
option in the PROC PRINCOMP statement in SAS), those variables that are expressed in units that
yield values of large numerical magnitude will tend to dominate the first PC's. A change of units in a variable, say from g to kg, will tend to reduce its contribution to the first PC's. In most cases this would be an undesirable artifact, because the results would depend on the units used.
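For reference, a covariance-based run only requires adding the COV option to PROC PRINCOMP (a sketch; the output data set name spartpc_cov is arbitrary):

/* PCA on the covariance matrix: variables are centered but not scaled */
proc princomp data=spartina out=spartpc_cov cov;
  var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;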
A nice example of a situation where the use of the covariance matrix is recommended is given by Lattin et al. (2003), page 112 (Lattin, J., J. Douglas Carroll, and Paul E. Green. 2003. Analyzing Multivariate Data. Thomson Brooks/Cole).
5:7.2 Interpretation depends on goal.
In a sense, when PCA is performed to identify underlying factors and to reduce problem dimensionality, one hopes to find a high degree of relation among some subgroups of variables. On the other hand, in multiple linear regression (MLR) one hopes to find that all measured X variables increase our ability to explain Y; in that case it is desirable to have many eigenvalues close to 1.
5:7.3 Interpretability.
One of the main problems with PCA is that, often, the new axes identified are difficult to interpret, and they may all involve a combination with a significant "component" of each original variable. There is no formal procedure to interpret PC's or to deal with lack of interpretability. Interpretation can be easier if the problem allows rotation of the PC's to transform them into more understandable "factors." This type of analysis, closely related to PCA, is called factor analysis, and it is widely used in the social sciences.
5:7.4 How many PC’s should be retained?
In using PCA for reduction of dimensionality, one must decide how many components to keep
for further analyses and explanation. In terms of visual presentation of results, it is very hard to
convey results in more than 2 or 3 dimensions.
There are at least three options for deciding how many PC's to use: scree plots, retaining PC's whose eigenvalues exceed a critical value, and retaining sufficient PC's to account for a critical proportion of the variance in the original data.
5:7.4.1 Scree plot.
The scree plot is a graph of the eigenvalues in decreasing order. The y-axis shows the eigenvalues and the x-axis shows their order. The graph is inspected visually to identify "elbows," and the location of these breaks in the line is used to select a given number of PC's.
Figure 5-2. Use of Scree plot to decide how many PC to retain. This choice is subjective,
and focuses on finding "breaks" in the continuity of the line. In this case, keeping 3 or 5 PC's
would be acceptable choices.
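In recent SAS releases (9.2 or later, with ODS Graphics enabled) PROC PRINCOMP can produce the scree plot directly; the sketch below assumes the PLOTS= option is available in the installed version.

ods graphics on;
/* PLOTS=SCREE draws the eigenvalues in decreasing order */
proc princomp data=spartina plots=scree;
  var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;
ods graphics off;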
5:7.4.2 Retain if > average.
When PCA is based on the correlation matrix, the sum of the eigenvalues equals p, the number of variables. Thus, the average value of the eigenvalues is 1.0. Any PC whose eigenvalue is greater than 1 explains more than the average amount of variation, and can be kept.
5:7.4.3 Retain as many as necessary for 80%.
Finally, a sufficient number of PC's can be retained to account for a desired proportion of the
total variance in the original data. This proportion can be chosen subjectively.