1. KNNL Problem 2.26. To change the axis on a plot, add an “order” statement to the axis definition. For example, to make the vertical axis go from -30 to 30 in increments of 10, do axis3 label=(angle=90 ‘Residuals’) order=(-30 to 30 by 10); Then use vaxis=axis3 as an option to the plot statement, along with haxis and vref=0. The mean value of Y appears as Dependent Mean in the output from proc reg. 2.26: Refer to Plastic hardness Problem 1.22. (a) Set up the ANOVA table. Analysis of Variance Source DF Sum of Mean Squares Square F Value Pr > F Model 1 5297.51250 5297.51250 Error 14 146.42500 10.45893 Corrected Total 15 5443.93750 506.51 <.0001 (b) Test by means of an F test whether or not there is a linear association between the hardness of the plastic and the elapsed time. Use a = .01. State the alternatives, decision rule, and conclusion. Testing H0: β1 = 0 vs Ha: β1 0. Reject H0 if P-value < 0.01. Since P-value < 0.0001, we reject H0 and conclude that there is a significant linear relationship between hardness and time. (c) Plot the deviations Yi Yˆi against Xi on a graph. Plot the deviations Yˆi Y against Xi on another graph, using the same scales as for the first graph. From your two graphs, does SSE or SSR (SSM) appear to be the larger component of SSTO (SST)? What does this imply about the magnitude of r2? Re s i d u a l s ( Ob s e r v e d - Pr e d i c t e d ) 30 20 10 0 - 10 - 20 - 30 10 20 30 40 t i me ( h o u r s ) De v i a t i o n s ( Pr e d i c t e d - Me a n ) 30 20 10 0 - 10 - 20 - 30 10 20 30 40 t i me ( h o u r s ) On the graphs, it is apparent that the residuals Yi Yˆi are much smaller than the deviations Yˆ Y of the predictors from the mean. Thus the SSM should be much larger i than the SSE, and we expect r 2 SSM to be close to 1. SST (d) Calculate r2 and r. The values of r2 and r can be obtained from the SAS output: r2 = 0.9731 and so r = 0.9865. (Choose the positive square root because b1 > 0). Root MSE 3.23403 R-Square 0.9731 Dependent Mean 225.56250 Adj R-Sq 0.9712 Coeff Var 1.43376 2. Problem 2.31 a) Source Regression Error Total SS 93,462,942 455,273,165 548,736,107 df MS 1 93,462,942 82 5,552,112 83 b). H0 : β1 = 0, Ha : β1 = 0. F ∗ = 93, 462, 942/5, 552, 112 = 16.8338, F (.99; 1, 82) =6.9544. If F ∗ ≤ 6.9544 conclude H0 , otherwise Ha. Conclude Ha . (t∗ )2 = (−4.102895)2 = 16.8338 = F ∗ . [t(.995; 82)]2 = (2.63712)2 = 6.9544 = F (.99; 1, 82) so the two tests are, indeed, equivalent c) SSR = 93, 462, 942 which is 17.03% or 0.1703 d) -0.4127 3. Given that R2 = SSM/SST, it can be shown that R2 / (1 – R2) = SSM / SSE. If you have n = 28 cases and R2 = 0.35, what is the F statistic for the test that the slope is equal to zero? dfR = 1 and dfE = n – 2 = 26. The F-statistic is MSM SSM / df M SSM df E R 2 26 0.35 26 F 14 MSE SSE / df E SSE df M 0.65 1 R2 1 The α = 0.05 critical value for (1,24) df is 4.26 (from page 1342) and the P-value using that df is 0.001. The results for (1,26) df will be similar. We reject H0: β1 = 0 and conclude that the slope is non-zero. For the next 3 questions use the grade point average data described in the text with KNNL problem 1.19. 7. Describe the distribution of the explanatory variable. Show the plots and output that were helpful in learning about this variable. Using PROC UNIVARIATE we see there are 20 observations ranging between 3.9 and 6.3 with a mean of 5 and median of 4.85; their standard deviation is 0.69. There are no extreme observations (i.e., ones far away from the others) as shown in the histogram plot below. The distribution appears to be reasonably symmetric but not completely so; it has a slight skew to the right. N The UNIVARIATE Procedure Variable: testscore Moments 20 Sum Weights 20 Mean Std Deviation Skewness Uncorrected SS Coeff Variation 5 Sum Observations 0.69282032 Variance 0.49695171 Kurtosis 509.12 Corrected SS 13.8564065 Std Error Mean Basic Statistical Measures Location Variability Mean 5.000000 Std Deviation Median 4.850000 Variance Mode 4.700000 Range Interquartile Range Extreme Observations ----Lowest-------Highest--Value Obs Value Obs 3.9 4 5.5 1 4.1 19 5.9 18 4.3 16 6.0 7 4.3 10 6.2 6 4.5 5 6.3 14 Stem 6 5 5 4 4 3 Leaf 023 59 0024 5677789 133 9 ----+----+----+----+ # 3 2 4 7 3 1 100 0.48 -0.595083 9.12 0.15491933 0.69282 0.48000 2.40000 0.90000 Boxplot | | +--+--+ *-----* | | 35 30 25 P e r c e n t 20 15 10 5 0 2. 7 3. 3 3. 9 4. 5 5. 1 5. 7 6. 3 6. 9 7. 5 t est scor e 6. 5 6. 0 5. 5 t e s t s c o r e 5. 0 4. 5 4. 0 3. 5 - 2 - 1 0 No r ma l 1 2 Qu a n t i l e s 8. Run the linear regression to predict GPA from the entrance test score and obtain the residuals (do not include a list of the residuals in your solution). a. Verify that the sum of the residuals is zero by running PROC UNIVARIATE with the output from the regression. The UNIVARIATE Procedure Variable: resid (Residual) Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation 20 0 0.4234117 0.05677081 3.40627193 . Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean 20 0 0.17927747 -1.0342174 3.40627193 0.09467773 b. Plot the residuals versus the explanatory variable and briefly describe the plot noting any unusual patterns or points. r esi dual pl ot 0. 8 0. 7 0. 6 0. 5 0. 4 0. 3 0. 2 0. 1 0. 0 - 0. 1 - 0. 2 - 0. 3 - 0. 4 - 0. 5 - 0. 6 - 0. 7 - 0. 8 3 4 5 Tes t 6 7 Sc o r e There does not appear to be any obvious pattern or outlier in this residual plot. It looks like a random scatter of points and the variance is reasonably constant. c. Plot the residuals versus the order in which the data appear in the data file and briefly describe the plot noting any unusual patterns or points. r esi dual pl ot 0. 8 0. 7 0. 6 0. 5 0. 4 0. 3 0. 2 0. 1 0. 0 - 0. 1 - 0. 2 - 0. 3 - 0. 4 - 0. 5 - 0. 6 - 0. 7 - 0. 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ob s e r v a t i o n S e q u e n c e There is some suggestion that the residuals may become more negative over time but it does not appear to be a big problem. d. Examine the distribution of the residuals by getting a histogram and a normal probability plot of the residuals by using the histogram and qqplot statements in PROC UNIVARIATE. What do you conclude? 30 25 20 P e r c e n t 15 10 5 0 - 1. 65 - 1. 35 - 1. 05 - 0. 75 - 0. 45 - 0. 15 0. 15 0. 45 0. 75 1. 05 1. 35 1. 65 Re s i d u a l 1. 00 0. 75 0. 50 R e s i d u a l 0. 25 0 - 0. 25 - 0. 50 - 0. 75 - 2 - 1 0 No r ma l 1 2 Qu a n t i l e s The residuals appear reasonably normal and symmetric since the histogram appears fairly normal and the qqplot is fairly linear. There is some suggestion of an S-shape to the qqplot and a flattening of the histogram, but it is not too bad.