Stat 407 Lab 8 Principal Component Analysis SOLUTION Fall 2001

In this lab we will use S-Plus to conduct principal component analysis. We explore the effect of restructuring
variables into comparable units.
1. Determine the principal components associated with the following covariance matrix:
$$\Sigma = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}$$
Also calculate the proportion of total variance explained by the first principal component.
The menu interface to principal component analysis doesn’t seem to be able to handle data input as a
variance-covariance matrix, but this question can be done with scripting:
> x<-matrix(c(5,2,2,2),ncol=2,byrow=T)
> eigen(x)
$values:
[1] 6 1
$vectors:
           [,1]       [,2]
[1,] -0.8944272  0.4472136
[2,] -0.4472136 -0.8944272
The principal components are Y1 = −0.89X1 − 0.45X2 , Y2 = 0.45X1 − 0.89X2 and the proportion of total
variance explained by the first principal component is 6/(6 + 1) = 0.86.
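The eigen() result above can be checked by hand, since a symmetric 2x2 matrix has a closed-form eigendecomposition. The following is a minimal Python sketch of that check, not part of the original S-Plus session; the helper name eigen_sym_2x2 is made up for illustration, and it assumes the off-diagonal entry is nonzero.

```python
import math

def eigen_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, d]], largest eigenvalue first.

    Assumes b != 0 (hypothetical helper, for illustration only).
    """
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    values = [mean + radius, mean - radius]
    vectors = []
    for lam in values:
        # (b, lam - a) solves (A - lam*I) v = 0 whenever b != 0
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        vectors.append((vx / norm, vy / norm))
    return values, vectors

values, vectors = eigen_sym_2x2(5.0, 2.0, 2.0)
proportion_pc1 = values[0] / sum(values)

print(values)                    # eigenvalues, largest first
print(vectors)                   # one unit eigenvector per eigenvalue
print(round(proportion_pc1, 4))  # share of total variance in the first PC
```

The first eigenvector comes out as (0.894, 0.447) rather than (-0.894, -0.447). The sign of an eigenvector is arbitrary, so both describe the same principal component.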
2. For the women’s track data, we have examined the PCA for the variance-covariance matrix (raw data) and
the correlation matrix (standardized data). Now we will examine transforming the data in another way,
and re-doing PCA.
(a) Convert the national track records to speeds measured in meters per second (m/s). Notice that the
records for 800m, 1500m, 3000m, and marathon are given in minutes. The marathon is 26.2 miles or
42195 meters. Write down your equations for converting the measurements.
Let x be the time recorded for each event (seconds for the 100m, 200m and 400m; minutes for the 800m, 1500m, 3000m and marathon). Then
ms100 = 100/x
ms200 = 200/x
ms400 = 400/x
ms800 = 800/(x*60)
ms1500 = 1500/(x*60)
ms3000 = 3000/(x*60)
msmar = 42195/(x*60)
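As a quick check of these conversions, here is a small Python sketch (not the S-Plus code used in the lab). The function name to_speed and the example record times are hypothetical; only the distances and the seconds/minutes split come from the equations above.

```python
# Distance of each event in meters, keyed by the speed-variable names used above.
DISTANCES_M = {
    "ms100": 100, "ms200": 200, "ms400": 400, "ms800": 800,
    "ms1500": 1500, "ms3000": 3000, "msmar": 42195,
}
# Events whose records are given in minutes rather than seconds.
MINUTE_EVENTS = {"ms800", "ms1500", "ms3000", "msmar"}

def to_speed(event, record):
    """Convert a record time to speed in m/s (hypothetical helper)."""
    seconds = record * 60 if event in MINUTE_EVENTS else record
    return DISTANCES_M[event] / seconds

# Made-up record times, for illustration only (not taken from the track data).
speed_100 = to_speed("ms100", 11.0)    # 11.0 seconds
speed_800 = to_speed("ms800", 2.0)     # 2.0 minutes
speed_mar = to_speed("msmar", 150.0)   # 150 minutes

print(round(speed_100, 3), round(speed_800, 3), round(speed_mar, 3))
```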
(b) Now perform PCA on the covariance matrix of the speed data. Report your results (eigenvalues,
eigenvectors, percentage of total variance, scree plot, plot of PCs).
The eigenvalues are 0.9613, 0.0968, 0.0521, 0.0228, 0.0071, 0.0051, 0.0033, which are the squares of the
standard deviations reported below. The eigenvectors are the loadings for each component, printed
below. The proportion of total variance is displayed as cumulative proportion.
                          Comp. 1    Comp. 2    Comp. 3    Comp. 4
Standard deviation      0.9804609 0.31108587 0.22823611 0.15106129
Proportion of Variance  0.8369859 0.08425937 0.04535512 0.01986845
Cumulative Proportion   0.8369859 0.92124525 0.96660037 0.98646882
                            Comp. 5     Comp. 6     Comp. 7
Standard deviation      0.084105621 0.071555019 0.057854053
Proportion of Variance  0.006158964 0.004457977 0.002914239
Cumulative Proportion   0.992627784 0.997085761 1.000000000
Loadings:
       Comp. 1 Comp. 2 Comp. 3 Comp. 4 Comp. 5 Comp. 6 Comp. 7
ms100    0.291  -0.427   0.250   0.329  -0.156  -0.728
ms200    0.342  -0.558   0.320   0.132           0.629   0.215
ms400    0.339  -0.382  -0.321  -0.537   0.477          -0.343
ms800    0.305          -0.475  -0.309  -0.505  -0.137   0.558
ms1500   0.386   0.197  -0.372   0.362  -0.365   0.226  -0.599
ms3000   0.400   0.254  -0.215   0.475   0.591           0.393
msmar    0.531   0.507   0.567  -0.365
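The link between the reported standard deviations, the eigenvalues, and the proportions of variance can be verified with a few lines of arithmetic. This is a Python sketch rather than S-Plus; the variable names are illustrative, and the standard deviations are copied from the output above.

```python
# Standard deviations of the seven components, as reported in the output above.
sdev = [0.9804609, 0.31108587, 0.22823611, 0.15106129,
        0.084105621, 0.071555019, 0.057854053]

# Each eigenvalue is the variance of its component, i.e. the squared standard deviation.
eigenvalues = [s * s for s in sdev]
total = sum(eigenvalues)

# Proportion of total variance per component, and the running (cumulative) total.
proportion = [ev / total for ev in eigenvalues]
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

print([round(ev, 4) for ev in eigenvalues])
print([round(p, 4) for p in proportion])
print([round(c, 4) for c in cumulative])
```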
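Code for drawing Scree Plot
> eval <- c(0.9613, 0.0968, 0.0521, 0.0228, 0.0071, 0.0051, 0.0033) # eigenvalues reported above (assumed definition of eval)
> plot(c(1:7),eval,type="l",xlab="Component Number",
ylab="Eigenvalue",main="Scree Plot")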
(c) Compare these results to those obtained in the previous PCA results (raw, standardized). Are the
rankings of countries different?
Here are some brief summaries of the previous results.
RAW: The first principal component explains about 99% of the variance in the data, but the first
principal component is composed entirely of the variable of marathon times. This event does have the
most variation in the data, as expected. But the result doesn’t tell us anything useful about the data:
it isn’t appropriate to describe a country’s athletic performance by one event alone. Actually take note
of correlations between the first principal component and the events:
      100m 200m 400m 800m 1500m 3000m marathon
PC 1  0.69 0.69 0.71 0.78  0.88  0.90      1.0
So the first PC is correlated with all events but perfectly correlated with the marathon event.
CORR: The first principal component explains about 83% of the variance, and the second principal
component brings the proportion of total variance explained up to 92%. The loadings of the first
principal component indicate that it is a linear combination of all the events, roughly an average,
and the loadings for the second principal component indicate it is a contrast between short distance
events and long distance events. The correlations between the first two PCs and the events are:
100m 200m 400m 800m 1500m 3000m marathon
PC 1 0.89 0.88 0.92 0.93 0.94 0.94 0.88
PC 2 -0.40 -0.43 -0.20 0.13 0.29 0.28 0.30
The results on the data transformed to m/s are similar to the results on the correlation matrix (standardized data), except that the loadings get increasingly higher as the distance of the event increases.
      100ms 200ms 400ms 800ms 1500ms 3000ms marms
PC 1   0.84  0.84  0.88  0.92   0.96   0.96  0.94
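The correlations between a principal component and the original variables quoted in these summaries follow the standard formula corr(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(s_kk) for a PCA on a covariance matrix. The covariance matrix of the track data is not reproduced in this lab, so the Python sketch below illustrates the formula on the 2x2 matrix from question 1 instead; the function name is hypothetical.

```python
import math

def pc_variable_correlations(eigenvalue, eigenvector, variances):
    """Correlation of one principal component with each original variable.

    Uses corr(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(s_kk); hypothetical helper.
    """
    return [e * math.sqrt(eigenvalue) / math.sqrt(s)
            for e, s in zip(eigenvector, variances)]

# Question 1 quantities: diagonal of Sigma, first eigenvalue, first eigenvector.
# The eigenvector is written with positive signs; flipping its sign (as in the
# S-Plus output) flips the sign of every correlation.
variances = [5.0, 2.0]
lambda1 = 6.0
e1 = [2 / math.sqrt(5), 1 / math.sqrt(5)]

corr_pc1 = pc_variable_correlations(lambda1, e1, variances)
print([round(r, 4) for r in corr_pc1])
```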
(d) Which approach is most appropriate?
It’s not clear which approach is the most appropriate here. PCA has been used mostly as an exploratory
data analysis tool. As such, we look at the results and then try to understand what they tell us about
the structure in the data, in particular the variance structure. The most informative methods in this
sense are the PCA based on the correlation matrix and the PCA on records translated to m/s. Both say that
most of the variation in the data is due to an average (or weighted average) over all the events. So we may
be better off defining a new variable that is strictly the average over all events to describe the athletic
prowess of a country. The second largest source of variation in the data comes from distinguishing short
distance events from long distance events, and again we may be better off defining a new variable that is a
specific contrast between the average of the short distance events and the average of the long distance
events. In all, it seems that two PCs are probably adequate for describing the variation in the data.
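The two summary variables suggested here, an overall average and a short-versus-long contrast, could be computed as in the Python sketch below. Several things are assumptions and not part of the lab: the split into short (100m, 200m, 400m) and long (800m and up) events, which follows the sign pattern of the second-PC correlations; the direction of the contrast (long minus short); and the speeds of the example country, which are made up.

```python
# Assumed grouping of events, following the signs of the second-PC correlations.
SHORT = ("ms100", "ms200", "ms400")
LONG = ("ms800", "ms1500", "ms3000", "msmar")

def mean(values):
    return sum(values) / len(values)

def summary_scores(speeds):
    """Return (overall average speed, long-minus-short contrast) for one country."""
    overall = mean([speeds[e] for e in SHORT + LONG])
    contrast = mean([speeds[e] for e in LONG]) - mean([speeds[e] for e in SHORT])
    return overall, contrast

# Made-up speeds in m/s for one hypothetical country (not from the track data).
example = {"ms100": 9.0, "ms200": 8.8, "ms400": 7.8, "ms800": 6.6,
           "ms1500": 6.2, "ms3000": 5.8, "msmar": 4.7}

overall, contrast = summary_scores(example)
print(round(overall, 4), round(contrast, 4))
```

Because speeds fall as distance grows, the raw contrast is negative for every country. It is the differences between countries on this variable that would be of interest.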