252solnJ2 11/26/07 (Open this document in 'Page Layout' view!)

J. MULTIPLE REGRESSION
1. Two explanatory variables
   a. Model
   b. Solution
2. Interpretation
   Text 14.1, 14.3, 14.4 [14.1, 14.3, 14.4] (14.1, 14.3, 14.4). Minitab output for 14.4 will be available on the website; you must be able to answer the problem from it.
3. Standard errors
   J1, Text Problems 14.9, 14.14, 14.23, 14.26 [14.13, 14.16, 14.20, 14.23] (14.17, 14.19, 14.24, 14.27)
4. Stepwise regression
   Problem J2 (J1), Text exercises 14.32, 14.34 [14.28, 14.29] (14.32, 14.33) (Computer Problem - instructions to be given)

This document includes solutions to text problems 14.28 through 14.29 and Problem J2. Again, there are extra problems included. They are well worth looking at!
________________________________________________________________________________
Stepwise Regression Problems

Exercise 14.32 [14.28 in 9th] (14.32 in 8th edition): Assume the following ANOVA summary, where there are 2 independent variables, the regression sum of squares for X1 is 20, and the regression sum of squares for X2 is 15.
a) Is there a significant relationship between Y and each of the independent variables at the 5% significance level?
b) Compute $r_{Y1.2}^2$ and $r_{Y2.1}^2$.

SOURCE       DF    SS   MS   F   p
Regression    2    30
Error        10   120
Total        12   150

Solution: Complete the ANOVA.

Analysis of Variance
SOURCE       DF    SS   MS            F      p
Regression    2    30   30/2 = 15     1.25
Error        10   120   120/10 = 12
Total        12   150

SOURCE   DF   SSR
X1        1    20
X2        1    15

(a) For X1: $SSR(X_1 \mid X_2) = SSR(X_1 \text{ and } X_2) - SSR(X_2) = 30 - 15 = 15$. This is the additional explanatory power from adding X1 after X2. We are itemizing the regression sum of squares. The ANOVA would read

SOURCE   DF    SS   MS            F      p
X2        1    15
X1        1    15   15            1.25
Error    10   120   120/10 = 12
Total    12   150

$F = \frac{SSR(X_1 \mid X_2)}{MSE} = \frac{15}{120/10} = 1.25$. This is compared to $F_{.05}^{(1,10)} = 4.96$. Do not reject $H_0$. There is not sufficient evidence that the variable X1 contributes to a model already containing X2.

For X2: $SSR(X_2 \mid X_1) = SSR(X_1 \text{ and } X_2) - SSR(X_1) = 30 - 20 = 10$. This is the additional explanatory power from adding X2 after X1. The ANOVA would read

SOURCE   DF    SS   MS            F       p
X1        1    20
X2        1    10   10            0.833
Error    10   120   120/10 = 12
Total    12   150

$F = \frac{SSR(X_2 \mid X_1)}{MSE} = \frac{10}{120/10} = 0.833$. This is compared to $F_{.05}^{(1,10)} = 4.96$. Do not reject $H_0$. There is not sufficient evidence that the variable X2 contributes to a model already containing X1.

Neither independent variable X1 nor X2 makes a significant contribution to the model in the presence of the other variable. Also, the overall regression equation involving both independent variables is not significant:
$F = \frac{MSR}{MSE} = \frac{30/2}{120/10} = 1.25$. This is compared to $F_{.05}^{(2,10)} = 4.10$.
Neither variable should be included in the model, and other variables should be investigated.

(b) $r_{Y1.2}^2 = \frac{SSR(X_1 \mid X_2)}{SST - SSR(X_1 \text{ and } X_2) + SSR(X_1 \mid X_2)} = \frac{15}{150 - 30 + 15} = 0.1111$. The denominator is what is unexplained after adding X2 only. Holding constant the effect of variable X2, 11.11% of the variation in Y can be explained by the variation in variable X1.

$r_{Y2.1}^2 = \frac{SSR(X_2 \mid X_1)}{SST - SSR(X_1 \text{ and } X_2) + SSR(X_2 \mid X_1)} = \frac{10}{150 - 30 + 10} = 0.0769$. For this one the denominator is what is unexplained after adding X1 only. Holding constant the effect of variable X1, 7.69% of the variation in Y can be explained by the variation in variable X2.
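The arithmetic above is easy to check by machine. The following Python sketch is my own illustration, not part of the text problem; it reproduces the partial F tests and the coefficients of partial determination from the sums of squares, assuming scipy is available for the critical value.

```python
from scipy.stats import f

# Sums of squares from Exercise 14.32
sst = 150.0          # total sum of squares
ssr_both = 30.0      # SSR(X1 and X2)
ssr_x1 = 20.0        # SSR(X1 alone)
ssr_x2 = 15.0        # SSR(X2 alone)
n, k = 13, 2         # observations and independent variables
mse = (sst - ssr_both) / (n - k - 1)          # 120/10 = 12

ssr_x1_given_x2 = ssr_both - ssr_x2           # 15: X1 added after X2
ssr_x2_given_x1 = ssr_both - ssr_x1           # 10: X2 added after X1

f_crit = f.ppf(0.95, 1, n - k - 1)            # F.05(1,10) = 4.96
print(ssr_x1_given_x2 / mse, f_crit)          # 1.25 < 4.96: do not reject H0
print(ssr_x2_given_x1 / mse, f_crit)          # 0.833 < 4.96: do not reject H0

# Coefficients of partial determination
r2_y1_2 = ssr_x1_given_x2 / (sst - ssr_both + ssr_x1_given_x2)  # 0.1111
r2_y2_1 = ssr_x2_given_x1 / (sst - ssr_both + ssr_x2_given_x1)  # 0.0769
print(round(r2_y1_2, 4), round(r2_y2_1, 4))
```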
Exercise 14.34 [14.29 in 9th] (14.33 in 8th edition): Recall that in the Warecost problem (14.4 in 10th edition) we had

Analysis of Variance
Source            DF     SS      MS      F      P
Regression         2  3368.1  1684.0  74.13  0.000
Residual Error    21   477.0    22.7
Total             23  3845.1

Source   DF  Seq SS
Sales     1  2726.8
Orders    1   641.3

Note that these two add to the Regression SS in the ANOVA. Or we could run the independent variables in the opposite sequence.

Source            DF     SS      MS      F      P
Regression         2  3368.1  1684.0  74.13  0.000
Residual Error    21   477.0    22.7
Total             23  3845.1

Source   DF  Seq SS
Orders    1  3246.1
Sales     1   122.0

a) Find whether the independent variables make a significant contribution to the regression model and what the most appropriate model is. Use $\alpha = .05$.
b) Compute $r_{Y1.2}^2$ and $r_{Y2.1}^2$.

Solution: (a) For X1: $SSR(X_1 \mid X_2) = SSR(X_1 \text{ and } X_2) - SSR(X_2) = 3368.1 - 3246.1 = 122.0$.
$F = \frac{SSR(X_1 \mid X_2)}{MSE} = \frac{122.0}{477/21} = 5.37$. Compare this with $F_{.05}^{(1,21)} = 4.32$. Reject $H_0$. There is evidence that the variable X1 contributes to a model already containing X2.

For X2: $SSR(X_2 \mid X_1) = SSR(X_1 \text{ and } X_2) - SSR(X_1) = 3368.1 - 2726.8 = 641.3$.
$F = \frac{SSR(X_2 \mid X_1)}{MSE} = \frac{641.3}{477/21} = 28.23$. Compare this with $F_{.05}^{(1,21)} = 4.32$. Reject $H_0$. There is evidence that the variable X2 contributes to a model already containing X1.

Since each independent variable X1 and X2 makes a significant contribution to the model in the presence of the other variable, both variables should be included in the model.

(b) $r_{Y1.2}^2 = \frac{SSR(X_1 \mid X_2)}{SST - SSR(X_1 \text{ and } X_2) + SSR(X_1 \mid X_2)} = \frac{122.0}{3845.1 - 3368.1 + 122.0} = .2037$. Holding constant the effect of the number of orders, 20.37% of the variation in Y can be explained by the variation in sales.

$r_{Y2.1}^2 = \frac{SSR(X_2 \mid X_1)}{SST - SSR(X_1 \text{ and } X_2) + SSR(X_2 \mid X_1)} = \frac{641.3}{3845.1 - 3368.1 + 641.3} = .5735$. Holding constant the effect of sales, 57.35% of the variation in Y can be explained by the variation in the number of orders.

More of Old text exercise 11.5: The Minitab printout read

The regression equation is
y = 1.10 + 1.64 x - 0.160 x*x

Predictor      Coef      Stdev    t-ratio    p
Constant     1.09524   0.09135     11.99   0.000
x            1.63571   0.07131     22.94   0.000
x*x         -0.15952   0.01142    -13.97   0.000

s = 0.1047   R-sq = 99.7%   R-sq(adj) = 99.6%

Analysis of Variance
SOURCE       DF      SS      MS       F      p
Regression    2  15.0305  7.5152  686.17  0.000
Error         4   0.0438  0.0110
Total         6  15.0743

SOURCE   DF   SEQ SS
x         1  12.8929
x*x       1   2.1376

Two sections remain unexplained. First, R-squared adjusted: $R^2$ adjusted for degrees of freedom is $\bar{R}_a^2$ or $R_k^2 = \frac{(n-1)R^2 - k}{n-k-1}$, where k is the number of independent variables and n is the number of observations. It is intended to compensate for the fact that increasing the number of independent variables always raises $R^2$. In this version of the regression, we have $n = 7$ observations and $k = 2$ independent variables, so $R_k^2 = \frac{(7-1)(0.997) - 2}{7-2-1} = .9955$. If this does not go up as you add new independent variables, you can be rather sure that the new variables accomplish nothing.
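As a quick check on the adjusted R-squared formula, here is a minimal Python sketch (my own illustration, not from the text); the second call uses a made-up R-squared of 0.9971 purely to show the penalty for adding a variable that accomplishes nothing.

```python
def r2_adjusted(r2, n, k):
    """Adjusted R-squared: ((n-1)*R^2 - k) / (n - k - 1)."""
    return ((n - 1) * r2 - k) / (n - k - 1)

# Old exercise 11.5: n = 7 observations, k = 2 independent variables
print(r2_adjusted(0.997, 7, 2))    # 0.9955, matching the handout

# Hypothetical: a third variable barely moves R^2, so adjusted R^2 falls
print(r2_adjusted(0.9971, 7, 3))   # about 0.9942 -- the new variable adds nothing
```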
Second, sequential sums of squares: the two values given, 12.8929 and 2.1376, represent an itemization of the regression sum of squares, 15.0305. This means that we could split up the ANOVA to read

SOURCE   DF      SS       MS        F        p
x         1  12.8929  12.8929  1172.08
x*x       1   2.1376   2.1376   194.32
Error     4   0.0438   0.0110
Total     6  15.0743

If we compare these Fs to $F_{.05}^{(1,4)}$, for example, we will see that both Fs are highly significant, indicating that x explained Y well, but that adding x*x definitely improved the explanation. Note that the coefficient of x*x has a t-ratio of -13.97. If this is squared, it will give us, except for rounding errors, 194.32, so the test is essentially the same as a t-test on the last independent variable added.

Problem J1: n = 80, k = 3, $R^2 = .95$; n = 80, k = 4, $R^2 = .99$. Use an F test to show whether the second regression is an improvement.

Solution: There are two ways to do this.
a) Fake an ANOVA. Call the first result $R_3^2 = .95$ and the second $R_4^2 = .99$. Remember that $R^2 = \frac{SSR}{SST}$, so that if $SST = 100$ and $R_3^2 = .95$, then $SSR = 95$. For the two regressions we get

Source    SS   DF        Source    SS   DF
3 Xs      95    3        4 Xs      99    4
Error      5   76        Error      1   75
Total    100   79        Total    100   79

If we combine these and get new values of F by dividing the MS values by 0.013333, our new error MS, we get

Source      SS   DF   MS         F         F.05
3 Xs        95    3   31.67      2375.24   F.05(3,75) = 2.78
1 more X     4    1   4           300.00   F.05(1,75) = 3.97
Error        1   75   0.013333
Total      100   79

The second F test gives us our answer. We reject the hypothesis that the 4th x does not contribute to the explanation of Y.

b) If we add r independent variables so that we end with k independent variables, use the formula
$F^{(r,\,n-k-1)} = \frac{n-k-1}{r} \cdot \frac{R_k^2 - R_{k-r}^2}{1 - R_k^2}$.
Here $k = 4$ and $r = 1$, so $n - k - 1 = 80 - 4 - 1 = 75$ and
$F^{(1,75)} = \frac{75}{1} \cdot \frac{.99 - .95}{1 - .99} = 300$.
The test gives the same results as in a).

Exercise 11.87 in James T. McClave, P. George Benson and Terry Sincich, Statistics for Business and Economics, 8th ed., Prentice Hall, 2001, last year's text:
a) Minitab was used to fit the complete model,
$\hat{Y} = 14.6 + 0.611X_1 + 0.439X_2 + 0.080X_3 + 0.064X_4$,
and the reduced model,
$\hat{Y} = 14.0 + 0.642X_1 + 0.396X_2$, with n = 20.
The ANOVAs follow.

Complete model:
Analysis of Variance
SOURCE       DF     SS      MS      F      p
Regression    4  831.09  207.77  20.41  0.002
Error        15  152.66   10.18
Total        19  983.75

Reduced model:
Analysis of Variance
SOURCE       DF     SS      MS      F      p
Regression    2  823.31  411.66  43.61  0.000
Error        17  160.44    9.44
Total        19  983.75

b) The Minitab printout shows that in the complete model the error sum of squares is 152.66 and in the reduced model it is 160.44. These represent the unexplained part of each model. The amount of reduction of the unexplained part was thus only 7.78 out of 160.44.
c) We have 5 parameters in the complete model and 3 in the reduced model.
d) We can investigate the null hypothesis $H_0\!: \beta_3 = \beta_4 = 0$ against the alternative that at least one of the betas is significant.
e) We can do this using an ANOVA or using the formula in Problem J1. Note that between the two regressions, the regression sum of squares rose from 823.31 to 831.09, an increase of 7.78. If we combine these two ANOVA tables and get new values of F by dividing the MS values by the new MSE, we get

Source          SS   DF   MS          F        F.05
2 Xs        823.31    2   411.66      40.449   F.05(2,15) = 3.68
2 more Xs     7.78    2     3.89       0.3822
Error       152.66   15    10.17733
Total       983.75   19

We cannot reject the null hypothesis because our computed F is less than the table F.
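The combined-ANOVA computation in e) is equivalent to the usual nested-model (partial) F test built from the two error sums of squares. A short Python sketch of that equivalence, using the numbers above (the function name is mine, not the text's):

```python
from scipy.stats import f

def nested_f(sse_reduced, sse_complete, r, n, k):
    """Partial F test for dropping r variables from a k-variable model."""
    num = (sse_reduced - sse_complete) / r    # MS for the extra variables
    den = sse_complete / (n - k - 1)          # MSE of the complete model
    return num / den

F = nested_f(sse_reduced=160.44, sse_complete=152.66, r=2, n=20, k=4)
print(round(F, 4))                            # 0.3822, as in the combined ANOVA
print(f.ppf(0.95, 2, 15))                     # F.05(2,15) = 3.68: cannot reject H0
```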
Alternately, if we add r independent variables so that we end with k independent variables, use the formula
$F^{(r,\,n-k-1)} = \frac{n-k-1}{r} \cdot \frac{R_k^2 - R_{k-r}^2}{1 - R_k^2}$.
Here $k = 4$ and $r = 2$, so $n - k - 1 = 20 - 4 - 1 = 15$ and
$F^{(2,15)} = \frac{15}{2} \cdot \frac{.845 - .837}{1 - .845} = 0.387$.
(Here $R_k^2 = 831.09/983.75 = .845$ for the complete model and $R_{k-r}^2 = 823.31/983.75 = .837$ for the reduced model.)
f) By comparing 0.38 with other values of $F^{(2,15)}$ on the F table, you should be able to figure out that the p-value is above 10%.
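The R-squared version of the test lends itself to the same kind of machine check. A minimal sketch, assuming scipy for the p-value; it reproduces both the Problem J1 result (300) and the 0.387 here, and it bears out f), since the p-value comes out far above 10%.

```python
from scipy.stats import f

def f_from_r2(r2_full, r2_reduced, r, n, k):
    """F = ((n-k-1)/r) * (R2_k - R2_{k-r}) / (1 - R2_k)."""
    return (n - k - 1) / r * (r2_full - r2_reduced) / (1 - r2_full)

# Problem J1: adding a 4th variable, n = 80
print(f_from_r2(0.99, 0.95, r=1, n=80, k=4))    # 300.0

# Exercise 11.87: dropping X3 and X4, n = 20
F = f_from_r2(0.845, 0.837, r=2, n=20, k=4)
print(round(F, 3))                               # 0.387
print(round(f.sf(F, 2, 15), 2))                  # p-value about 0.69, well above 10%
```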