#7.26
a) For each of the response variables, I first fit a main-effects model and then dropped the variables that were not significant. The models I finally chose are as follows: Y1: Y2: Y3: Y4:

The residual plots and leverage plots are shown below. In all four models there appear to be potential negative outliers at the high fitted values, namely observations 51, 52, and 56. The leverage plots suggest that observations 60 and 61 are influential points. Finally, a prediction interval for the new observation with the given values of the covariates is (1.26, 4.4).

b) I fit the main-effects multivariate multiple linear regression model; the matrix of estimated coefficients and the estimated error covariance matrix are shown below:

The residual plots for the four response variables are shown below. There are potential outliers at the high fitted values of all four variables. The simultaneous prediction intervals for a new observation at the specified values of the covariates are as follows:

The interval for SF is wider than the one calculated in 7.26 a). This is expected: simultaneous prediction intervals must be wider to maintain the same joint coverage.
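The widening noted in 7.26 b) can be made explicit. The following is a sketch only, assuming the usual normal-theory intervals for a regression with n observations, r predictors, and m responses. The symbols n, r, m, the new covariate vector z_0, the design matrix Z, and s_ii (the unbiased estimate of the error variance of response i) are notation introduced here for illustration, not values taken from the fitted models.

```latex
% One-at-a-time prediction interval for response i (univariate fit)
\[
\mathbf{z}_0'\hat{\boldsymbol\beta}_{(i)} \;\pm\; t_{n-r-1}(\alpha/2)\,
  \sqrt{\bigl(1+\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\bigr)\, s_{ii}}
\]

% Simultaneous prediction intervals for all m responses (multivariate fit)
\[
\mathbf{z}_0'\hat{\boldsymbol\beta}_{(i)} \;\pm\;
  \sqrt{\frac{m(n-r-1)}{n-r-m}\,F_{m,\,n-r-m}(\alpha)}\;
  \sqrt{\bigl(1+\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\bigr)\, s_{ii}},
  \qquad i = 1,\dots,m
\]
```

With m = 1 the two multipliers coincide, because F_{1,n-r-1}(alpha) = t^2_{n-r-1}(alpha/2). With m = 4 the F-based multiplier is larger, so for the same set of predictors the simultaneous interval is wider. The comparison with part a) is only approximate, since the models there were reduced and therefore use a different r.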
#7.27
Type II MANOVA Tests:

Sum of squares and products for error:
               assessment implementation
assessment          1.128         0.5730
implementation      0.573         2.0455

Term: severity
Sum of squares and products for the hypothesis:
               assessment implementation
assessment        4.68075       12.48200
implementation   12.48200       33.28533

Multivariate Tests: severity
                 Df test stat approx F num Df den Df     Pr(>F)
Pillai            1  0.943124  49.7464      2      6 0.00018399 ***
Wilks             1  0.056876  49.7464      2      6 0.00018399 ***
Hotelling-Lawley  1 16.582133  49.7464      2      6 0.00018399 ***
Roy               1 16.582133  49.7464      2      6 0.00018399 ***

Term: complexity
Sum of squares and products for the hypothesis:
               assessment implementation
assessment       37.07408       66.53325
implementation   66.53325      119.40075

Multivariate Tests: complexity
                 Df test stat approx F num Df den Df     Pr(>F)
Pillai            1   0.98548 203.5457      2      6 3.0642e-06 ***
Wilks             1   0.01452 203.5457      2      6 3.0642e-06 ***
Hotelling-Lawley  1  67.84857 203.5457      2      6 3.0642e-06 ***
Roy               1  67.84857 203.5457      2      6 3.0642e-06 ***

Term: experience
Sum of squares and products for the hypothesis:
               assessment implementation
assessment         11.532       26.25700
implementation     26.257       59.78408

Multivariate Tests: experience
                 Df test stat approx F num Df den Df     Pr(>F)
Pillai            1  0.968544 92.37208      2      6 3.1124e-05 ***
Wilks             1  0.031456 92.37208      2      6 3.1124e-05 ***
Hotelling-Lawley  1 30.790694 92.37208      2      6 3.1124e-05 ***
Roy               1 30.790694 92.37208      2      6 3.1124e-05 ***

Term: severity:complexity
Sum of squares and products for the hypothesis:
               assessment implementation
assessment          1.458         2.3490
implementation      2.349         3.7845

Multivariate Tests: severity:complexity
                 Df test stat approx F num Df den Df   Pr(>F)
Pillai            1 0.6973225  6.91154      2      6 0.027729 *
Wilks             1 0.3026775  6.91154      2      6 0.027729 *
Hotelling-Lawley  1 2.3038466  6.91154      2      6 0.027729 *
Roy               1 2.3038466  6.91154      2      6 0.027729 *

Term: complexity:experience
Sum of squares and products for the hypothesis:
               assessment implementation
assessment          0.512         0.1760
implementation      0.176         0.0605

Multivariate Tests: complexity:experience
                 Df test stat approx F num Df den Df  Pr(>F)
Pillai            1 0.3158987 1.385315      2      6 0.32016
Wilks             1 0.6841013 1.385315      2      6 0.32016
Hotelling-Lawley  1 0.4617718 1.385315      2      6 0.32016
Roy               1 0.4617718 1.385315      2      6 0.32016

Term: severity:experience
Sum of squares and products for the hypothesis:
               assessment implementation
assessment           0.20         0.2900
implementation       0.29         0.4205

Multivariate Tests: severity:experience
                 Df test stat  approx F num Df den Df  Pr(>F)
Pillai            1 0.2178130 0.8353998      2      6 0.47855
Wilks             1 0.7821870 0.8353998      2      6 0.47855
Hotelling-Lawley  1 0.2784666 0.8353998      2      6 0.47855
Roy               1 0.2784666 0.8353998      2      6 0.47855

#8.10
a) The sample covariance matrix S is shown below:

The sample principal components are

b) The first principal component explains 52.9% of the total sample variance, the second explains 27.1%, and the third explains 9.8%. The first component is a negatively weighted sum of all five variables, with the greatest weights on Royal Dutch Shell and Exxon Mobil. The second component is a contrast between the first three stocks (Morgan, Citibank, Wells Fargo) and the last two (Royal Dutch Shell, Exxon Mobil). The third is a contrast between Morgan and Exxon Mobil on one side and Citibank, Wells Fargo, and Royal Dutch Shell on the other.

c) The 90% confidence intervals for the variances of the first three population components are:

d) Since 89.88% of the total sample variance is explained by the first three principal components, I believe the stock data can be summarized in three dimensions rather than five without much loss of information.

#8.18
a) Shown below are the sample correlation matrix, the eigenvalues, and the eigenvectors: R:

b) The first two principal components are:

The cumulative percentages of the total variance explained by the first two principal components of the standardized data are 82.9% and 92.3%, respectively. Shown below are the correlations between the two components and the standardized variables.
      100m   200m   400m   800m  1500m  3000m Marathon
PC1  0.888  0.880  0.919  0.927  0.938  0.937    0.884
PC2  0.396  0.434  0.199 -0.126 -0.291 -0.281   -0.298

c) The first component is a sum of all seven variables with roughly equal weights, so it might measure each country’s overall athletic ability. The second component is a contrast between the first three distances and the last four, so it might measure a country's relative strength at short versus long running distances.

d) The top ten countries according to the first principal component are GDR, USSR, USA, Czech Republic, FRG, GBNI, Poland, Canada, Finland, and Italy. The bottom ten countries are Singapore, Indonesia, Dominican Republic, Malaysia, Costa Rica, Guatemala, Papua New Guinea, Mauritius, Cook Islands, and Western Samoa. I am not surprised by the rankings according to the first component.

#8.19
I performed a principal component analysis using the covariance matrix S of the speed data. The cumulative percentages of the total variance explained by the first two principal components are 83.7% and 92.1%, respectively. Correlations between the two components and the standardized variables are shown below. Since the pattern of correlations is essentially the same as in the previous problem (apart from the sign of the second component), the interpretations of the first two principal components remain the same.

      100m   200m   400m   800m  1500m  3000m Marathon
PC1  0.869  0.864  0.893  0.917  0.947  0.949    0.927
PC2 -0.405 -0.448 -0.319 -0.008  0.154  0.191    0.281

The ranking of the nations by their scores on the first principal component is shown below; it is similar to the ranking in problem 8.18. I prefer the analysis in 8.18: since performances at different running distances are not directly comparable, it is better to standardize the variables.

#8.28
a) Scatterplots of the two specified pairs of variables are shown below. Based on these scatterplots, I removed the four outliers (observations 25, 34, 69, and 72) from the dataset.
b) The cumulative proportions of variance explained by the principal components are (0.465, 0.625, 0.745, 0.833, 0.900, 0.941, 0.968, 0.987, 1.000). Based on these and on the scree plot shown below, I would summarize this dataset with the first five principal components.

c) The first component appears to be a farm-size component, as it is positively correlated with all the count variables and has approximately zero correlation with the distance variable. The second component might differentiate between families that focus more on crops and those that focus more on livestock, as it is positively correlated with cattle and goats and negatively correlated with cotton and maize. The third component is most directly related to distance. The fourth is positively related to millet and negatively related to distance to road and cattle. The fifth might distinguish between families that raise cattle and those that raise goats.
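The choice of five components in 8.28 b) can be checked directly from the cumulative proportions quoted there. The sketch below differences them to recover each component's individual share and then finds the smallest number of components reaching a cutoff. The 0.90 cutoff and the function names are assumptions made for illustration; the solution itself only cites the cumulative proportions and the scree plot.

```python
# Cumulative proportions of variance as reported in 8.28 b).
cumulative = [0.465, 0.625, 0.745, 0.833, 0.900, 0.941, 0.968, 0.987, 1.000]


def individual_proportions(cum):
    """Difference the cumulative proportions to get each component's share."""
    previous = 0.0
    shares = []
    for value in cum:
        # Round to 3 decimals, the precision of the reported figures.
        shares.append(round(value - previous, 3))
        previous = value
    return shares


def components_needed(cum, threshold):
    """Smallest number of components whose cumulative share reaches threshold."""
    for k, value in enumerate(cum, start=1):
        if value >= threshold:
            return k
    return len(cum)


shares = individual_proportions(cumulative)
# 0.90 is an assumed cutoff, not one stated in the solution.
k90 = components_needed(cumulative, 0.90)

print("individual shares:", shares)
print("components needed for 90%:", k90)
```

Under that assumed cutoff the result agrees with the five components chosen above. The individual shares also show where the curve flattens: the drop from the first component to the second is large, and the later components each add only a few percent.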