Stat 407 Lab 8: Principal Component Analysis, SOLUTION
Fall 2001

In this lab we will use S-Plus to conduct principal component analysis. We explore the effect of restructuring variables into comparable units.

1. Determine the principal components associated with the following covariance matrix:

   $$\Sigma = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}$$

   Also calculate the proportion of total variance explained by the first principal component.

   The menu interface to principal component analysis doesn't seem to be able to handle data input as a variance-covariance matrix, but this question can be done with scripting:

   > x <- matrix(c(5,2,2,2), ncol=2, byrow=T)
   > eigen(x)
   $values:
   [1] 6 1

   $vectors:
              [,1]       [,2]
   [1,] -0.8944272  0.4472136
   [2,] -0.4472136 -0.8944272

   The principal components are

   $$Y_1 = -0.89X_1 - 0.45X_2, \qquad Y_2 = 0.45X_1 - 0.89X_2$$

   and the proportion of total variance explained by the first principal component is 6/(6 + 1) = 0.86.

2. For the women's track data, we have examined the PCA for the variance-covariance matrix (raw data) and the correlation matrix (standardized data). Now we will examine transforming the data in another way, and redoing the PCA.

(a) Convert the national track records to speeds measured in meters per second (m/s). Notice that the records for 800m, 1500m, 3000m, and marathon are given in minutes. The marathon is 26.2 miles or 42195 meters. Write down your equations for converting the measurements.

    Let x be the number recorded for each event. Then

    ms100  = 100/x
    ms200  = 200/x
    ms400  = 400/x
    ms800  = 800/(x*60)
    ms1500 = 1500/(x*60)
    ms3000 = 3000/(x*60)
    msmar  = 42195/(x*60)

(b) Now perform PCA on the covariance matrix of the speed data. Report your results (eigenvalues, eigenvectors, percentage of total variance, scree plot, plot of PCs).

    The eigenvalues are 0.9613, 0.0968, 0.0521, 0.0228, 0.0071, 0.0051, 0.0033, which are the squares of the standard deviations reported below. The eigenvectors are the loadings for each component, printed below.
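As a quick check on the statement above, the eigenvalues and the variance proportions can be recomputed from the standard deviations in the S-Plus output for this part. This is only a sketch: the names `sdev`, `evals`, `prop` and `cumprop` are illustrative and are not part of the lab, and the code is written so that it should also run in R.

```r
# Standard deviations of the seven components, copied from the
# S-Plus output for part (b)
sdev <- c(0.9804609, 0.31108587, 0.22823611, 0.15106129,
          0.084105621, 0.071555019, 0.057854053)

# For a PCA on a covariance matrix, the eigenvalues are the
# variances of the components, i.e. the squared standard deviations
evals <- sdev^2

# Proportion and cumulative proportion of total variance
prop <- evals / sum(evals)
cumprop <- cumsum(prop)

print(round(evals, 4))
print(round(prop, 4))
print(round(cumprop, 4))
```

The rounded values of `evals` should agree with the eigenvalues listed in part (b), and `prop` and `cumprop` with the corresponding rows of the output.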
The proportion of total variance is displayed as the cumulative proportion.

                            Comp. 1    Comp. 2    Comp. 3    Comp. 4
   Standard deviation     0.9804609 0.31108587 0.22823611 0.15106129
   Proportion of Variance 0.8369859 0.08425937 0.04535512 0.01986845
   Cumulative Proportion  0.8369859 0.92124525 0.96660037 0.98646882

                              Comp. 5     Comp. 6     Comp. 7
   Standard deviation     0.084105621 0.071555019 0.057854053
   Proportion of Variance 0.006158964 0.004457977 0.002914239
   Cumulative Proportion  0.992627784 0.997085761 1.000000000

   Loadings:
           Comp. 1 Comp. 2 Comp. 3 Comp. 4 Comp. 5 Comp. 6 Comp. 7
   ms100     0.291  -0.427   0.250   0.329  -0.156  -0.728
   ms200     0.342  -0.558   0.320   0.132           0.629   0.215
   ms400     0.339  -0.382  -0.321  -0.537   0.477          -0.343
   ms800     0.305          -0.475  -0.309  -0.505  -0.137   0.558
   ms1500    0.386   0.197  -0.372   0.362  -0.365   0.226  -0.599
   ms3000    0.400   0.254  -0.215   0.475   0.591           0.393
   msmar     0.531   0.507   0.567  -0.365

   (Blank entries are loadings too small to be printed.)

   Code for drawing the scree plot:

   > plot(c(1:7), eval, type="l", xlab="Component Number",
          ylab="Eigenvalue", main="Scree Plot")

(c) Compare these results to those obtained in the previous PCAs (raw, standardized). Are the rankings of countries different?

    Here are some brief summaries of the previous results.

    RAW: The first principal component explains about 99% of the variance in the data, but it is composed almost entirely of the marathon times. This event does have the most variation in the data, as expected. But the result doesn't tell us anything useful about the data: it isn't appropriate to describe a country's athletic performance by one event alone. Take note of the correlations between the first principal component and the events:

           100m  200m  400m  800m  1500m  3000m  marathon
    PC 1   0.69  0.69  0.71  0.78   0.88   0.90       1.0

    So the first PC is correlated with all events, but perfectly correlated with the marathon.
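Correlations between a principal component and the original variables, like those in the RAW summary, can be obtained from the eigen decomposition. For a PCA on a covariance matrix the standard relation is corr(Y_i, X_k) = e_ki sqrt(lambda_i) / sqrt(sigma_kk), where e_ki is the k-th element of the i-th eigenvector. The track data are not reproduced here, so the sketch below applies the relation to the 2 x 2 covariance matrix from question 1 instead; the name `pc.var.cor` is illustrative only.

```r
# Covariance matrix from question 1, reused as a small worked case
Sigma <- matrix(c(5, 2, 2, 2), ncol = 2, byrow = TRUE)
e <- eigen(Sigma)

# Multiply column i of the eigenvector matrix by sqrt(lambda_i),
# then divide row k by sqrt(sigma_kk):
#   pc.var.cor[k, i] = e_ki * sqrt(lambda_i) / sqrt(sigma_kk)
pc.var.cor <- sweep(e$vectors, 2, sqrt(e$values), "*") / sqrt(diag(Sigma))
dimnames(pc.var.cor) <- list(c("X1", "X2"), c("Y1", "Y2"))

print(round(pc.var.cor, 2))
```

The signs of the eigenvectors are arbitrary, so only the magnitudes and the relative signs within a column are meaningful. Each row of `pc.var.cor` has squared entries summing to 1, since the components together account for all of the variance of each variable.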
    CORR: The first principal component explains about 83% of the variance, and the second principal component brings the proportion of total variance explained up to 92%. The loadings of the first principal component indicate that it is a linear combination of all the events, roughly an average, and the loadings of the second principal component indicate that it is a contrast between short distance events and long distance events. The correlations between the first two PCs and the events are:

           100m   200m   400m   800m  1500m  3000m  marathon
    PC 1   0.89   0.88   0.92   0.93   0.94   0.94      0.88
    PC 2  -0.40  -0.43  -0.20   0.13   0.29   0.28      0.30

    The results on the data transformed to m/s are similar to the results on the correlation matrix (standardized data), except that the loadings get increasingly higher as the distance of the event increases. The correlations between the first PC and the events are:

           ms100  ms200  ms400  ms800  ms1500  ms3000  msmar
    PC 1    0.84   0.84   0.88   0.92    0.96    0.96   0.94

(d) Which approach is most appropriate?

    It's not clear which approach is the most appropriate here. PCA has been used mostly as an exploratory data analysis tool. As such, we look at the results and then try to understand what they tell us about the structure in the data, in particular the variance structure. The most informative methods in this sense are the PCAs based on the correlation matrix or on the records translated to m/s. They say that the most variation in the data is due to an average (or weighted average) over all the events. So we may be better off defining a new variable that is strictly the average over all events to describe the athletic prowess of a country. The second largest source of variation in the data comes from distinguishing short distance events from long distance events, and again we may be best off defining a new variable that is a specific contrast between the average of the short distance events and the average of the long distance events. In all, it seems that two PCs are probably adequate for describing the variation in the data.
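The two summary variables suggested in (d) could be computed along the following lines. This is a sketch under stated assumptions, not part of the lab solution: the speeds below are invented numbers for three made-up countries (they are NOT the track data), the names `speeds`, `overall` and `contrast` are illustrative, and the split into short events (100m to 400m) and long events (800m to marathon) is assumed from the sign pattern of the second PC in the CORR summary.

```r
# Invented speeds in m/s for three made-up countries (NOT the track data)
speeds <- matrix(c(9.0, 8.9, 7.8, 6.7, 6.2, 5.8, 4.6,
                   8.7, 8.6, 7.5, 6.6, 6.1, 5.7, 4.7,
                   8.3, 8.2, 7.1, 6.1, 5.5, 5.1, 4.0),
                 ncol = 7, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"),
                                 c("ms100", "ms200", "ms400", "ms800",
                                   "ms1500", "ms3000", "msmar")))

# Assumed grouping of events, taken from the signs of PC 2
short <- c("ms100", "ms200", "ms400")
long  <- c("ms800", "ms1500", "ms3000", "msmar")

# Overall performance: plain average speed over all seven events
overall <- rowMeans(speeds)

# Contrast: average of long distance events minus average of short ones
contrast <- rowMeans(speeds[, long]) - rowMeans(speeds[, short])

print(round(cbind(overall, contrast), 3))
```

Unlike the principal components, these two variables use fixed, equal weights, so they are easier to interpret but are not guaranteed to be uncorrelated or to capture as much variance as the first two PCs.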