252y0741 5/7/07
ECO252 QBA2 Final EXAM May 2007 Version 1
Name and Class hour:______KEY_______________

I. (18+ points) Do all the following. Note that answers without reasons and citation of appropriate statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else).

In his text Allen L. Webster presents the following data. The hope was to explain three-year returns of the funds and to compare them to one-year returns. The data given are as follows.

MTB > print c1 c2 c3 c4 c5 c6

Data Display
Row  3YRR  1YRR  LOAD  3YRT  1YRT  ASSETS
  1   5.6   0.1     0   112    58  220.00
  2   4.7   1.9     1    95    62  158.00
  3   4.5   2.6     1   241    65  227.25
  4   4.8   2.0     1    87    61  242.40
  5   5.7   3.5     0    98    57  287.85
  6   4.1  -4.3     1   102    66  207.05
  7   4.7   3.2     1    72    63  237.35
  8   4.1  -4.1     1    96    65  207.05
  9   5.2   2.2     0    78    59  262.60
 10   3.7   2.1     1   118    87  186.85
 11   6.2   5.3     0    98    47  313.10
 12   6.6  11.0     0    87    41  333.30
 13   5.2   0.3     0   117    61  262.60
 14   5.5  -2.1     0    87    46  277.75
 15   5.6   4.7     0    85    35  282.00

In the above, ‘1YRR’ is the rate of return over a 1-year period. ‘LOAD’ is a dummy variable that is 1 when there is a load and zero if it is a no-load fund. ‘1YRT’ is a turnover rate for the fund and tells you the percent of the fund bought or sold during the year. ‘ASSETS’ is the assets in $billions when the fund opened. The remaining columns are for 3-year data and are not used here. I played with these data for a while and got very discouraging results for the 3-year data and nearly as discouraging results for the 1-year data, as you will see in Regression 3 below. I finally created the set of new independent variables that appear below.
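The new independent variables mentioned here are simple transformations of the columns above. As a rough illustration outside Minitab, the following Python sketch builds squared, logarithmic and LOAD-interaction versions of ‘ASSETS’ and ‘1YRT.’ The Python variable names are illustrative assumptions and are not part of the original worksheet.

```python
import math

# ASSETS, 1YRT and LOAD copied from the data display above.
assets = [220.00, 158.00, 227.25, 242.40, 287.85, 207.05, 237.35, 207.05,
          262.60, 186.85, 313.10, 333.30, 262.60, 277.75, 282.00]
turnover_1yr = [58, 62, 65, 61, 57, 66, 63, 65, 59, 87, 47, 41, 61, 46, 35]
load = [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Squares allow a curved (quadratic) relationship.
assets_sq = [a ** 2 for a in assets]
turnover_sq = [t ** 2 for t in turnover_1yr]

# Natural logarithms are an alternative way to capture curvature.
ln_assets = [math.log(a) for a in assets]
ln_turnover = [math.log(t) for t in turnover_1yr]

# Interaction variables: the original value for load funds, zero otherwise.
assets_x_load = [a * d for a, d in zip(assets, load)]
turnover_x_load = [t * d for t, d in zip(turnover_1yr, load)]

for row in range(3):
    print(row + 1, round(assets_sq[row]), round(ln_assets[row], 5),
          turnover_sq[row], assets_x_load[row], turnover_x_load[row],
          round(ln_turnover[row], 5))
```

The interaction columns are what later let a single regression carry separate slopes for load and no-load funds.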
MTB > print c6 c7 c9 c11 c13 c14 c15

Data Display
Row  ASSETS  ASSTsq  lnAssets  1YRTsq  AssetsL  1YRTL   ln1YRT
  1  220.00   48400   5.39363    3364     0.00      0  4.06044
  2  158.00   24964   5.06260    3844   158.00     62  4.12713
  3  227.25   51643   5.42605    4225   227.25     65  4.17439
  4  242.40   58758   5.49059    3721   242.40     61  4.11087
  5  287.85   82858   5.66244    3249     0.00      0  4.04305
  6  207.05   42870   5.33296    4356   207.05     66  4.18965
  7  237.35   56335   5.46954    3969   237.35     63  4.14313
  8  207.05   42870   5.33296    4225   207.05     65  4.17439
  9  262.60   68959   5.57063    3481     0.00      0  4.07754
 10  186.85   34913   5.23031    7569   186.85     87  4.46591
 11  313.10   98032   5.74652    2209     0.00      0  3.85015
 12  333.30  111089   5.80904    1681     0.00      0  3.71357
 13  262.60   68959   5.57063    3721     0.00      0  4.11087
 14  277.75   77145   5.62672    2116     0.00      0  3.82864
 15  282.00   79524   5.64191    1225     0.00      0  3.55535

‘ASSTsq’ is the square of the ‘ASSETS’ variable. ‘lnAssets’ is the natural logarithm of the ‘ASSETS’ variable. ‘1YRTsq’ is the square of the one-year turnover. ‘AssetsL’ is an interaction variable, the product of ‘ASSETS’ and ‘LOAD.’ ‘1YRTL’ is the product of ‘1YRT’ and ‘LOAD.’ ‘ln1YRT’ is the natural logarithm of ‘1YRT.’

---------- 5/1/2007 9:16:52 PM ----------
Welcome to Minitab, press F1 for help.
Results for: 252x07041-01A.MTW

MTB > let c14 = c5*c3
MTB > Stepwise c2 c3 c5 c6 c7 c9 c11 c13 c14;
SUBC> Backward;
SUBC> ARemove 0.1;
SUBC> Best 0;
SUBC> Constant.

Regression 1
Stepwise Regression: 1YRR versus LOAD, 1YRT, ...
Backward elimination.
Alpha-to-Remove: 0.1
Response is 1YRR on 8 predictors, with N = 15

Step               1        2        3
Constant        1485     1458     1099

LOAD              -2
T-Value        -0.04
P-Value        0.971

1YRT           -2.08    -2.12    -2.10
T-Value        -1.74    -3.65    -3.81
P-Value        0.132    0.008    0.005

ASSETS          1.80     1.75     0.96
T-Value         0.84     1.11     5.37
P-Value        0.432    0.303    0.001

ASSTsq       -0.0009  -0.0008
T-Value        -0.41    -0.50
P-Value        0.698    0.630

lnAssets        -332     -325     -234
T-Value        -1.23    -1.73    -5.03
P-Value        0.265    0.127    0.001

1YRTsq        0.0216   0.0221   0.0218
T-Value         1.74     3.81     3.98
P-Value        0.133    0.007    0.004

AssetsL        0.256    0.253    0.251
T-Value         2.23     3.73     3.90
P-Value        0.067    0.007    0.005

1YRTL          -0.92    -0.94    -0.94
T-Value        -1.26    -3.61    -3.79
P-Value        0.255    0.009    0.005

S               2.03     1.88     1.79
R-Sq           87.79    87.79    87.34
R-Sq(adj)      71.51    75.57    77.85
Mallows C-p      9.0      7.0      5.2

More? (Yes, No, Subcommand, or Help)
SUBC> y
No variables entered or removed
More? (Yes, No, Subcommand, or Help)
SUBC> n

1) Regression 1 is a reverse stepwise equation that I ran on the ‘full’ model to see if there were any obvious candidates for elimination. The third column is the 3rd regression that was run. What can you say about the significance of the coefficients of the variables that ‘stepwise’ forced out? Why? Do any of the variables left in have a suspicious sign? (2)

Solution: Both of the variables forced out (‘LOAD’ and ‘ASSTsq’) had extremely high p-values (.971 and .630), which means that their coefficients were not significant. Their lack of explanatory power is also brought out by the fact that, while R-squared fell somewhat as variables were removed, R-squared adjusted actually rose. We would expect most of the signs of the coefficients in the last column. There is nothing wrong with a negative sign unless there is a good reason not to expect it. Generally a high turnover rate is considered a bad reflection on investors’ opinion of a fund’s prospects, so it ought to have a negative sign. We would expect the size of assets to have a positive effect on valuation, so the positive sign of ‘ASSETS’ is reasonable.
The negative sign of ‘lnAssets’ seems to indicate a nonlinear relationship between ‘1YRR’ and assets, which is offset by the positive sign of ‘ASSETS.’ The negative sign of ‘1YRTL’ is no surprise, since it is made up of both ‘LOAD’ and ‘1YRT,’ both of which should depress the incentive of an investor to buy the fund. It should be added that the extremely low value of C-p (5.2) relative to the number of ‘independent’ variables (6) and the fact that every coefficient has a p-value below 1% make this equation look good. In fact, the major liability of the equation is that analysts rarely use both a variable and its logarithm in the same equation.

Regression 2
MTB > Regress c2 6 c5 c6 c9 c11 c13 c14;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.

Regression Analysis: 1YRR versus 1YRT, ASSETS, ...

The regression equation is
1YRR = 1099 - 2.10 1YRT + 0.962 ASSETS - 234 lnAssets + 0.0218 1YRTsq + 0.251 AssetsL - 0.945 1YRTL

Predictor      Coef   SE Coef      T      P    VIF
Constant     1098.8     220.7   4.98  0.001
1YRT        -2.1015    0.5518  -3.81  0.005  203.2
ASSETS       0.9623    0.1792   5.37  0.001  320.4
lnAssets    -233.89     46.50  -5.03  0.001  379.9
1YRTsq     0.021819  0.005488   3.98  0.004  289.5
AssetsL     0.25128   0.06438   3.90  0.005  217.9
1YRTL       -0.9446    0.2491  -3.79  0.005  332.7

S = 1.79353   R-Sq = 87.3%   R-Sq(adj) = 77.9%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  177.595  29.599  9.20  0.003
Residual Error   8   25.734   3.217
Total           14  203.329

Source    DF  Seq SS
1YRT       1  38.826
ASSETS     1  33.901
lnAssets   1  48.554
1YRTsq     1   4.363
AssetsL    1   5.702
1YRTL      1  46.249

2) Regression 2 is the regression suggested by the ‘stepwise’ command. What is most alarming about it? (1)

Solution: The worst part of this is the high values of VIF, indicating that large amounts of collinearity are present. Unless we know where this collinearity is coming from, we must be quite suspicious of the results.

MTB > Regress c2 3 c3 c5 c6;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
Regression 3
Regression Analysis: 1YRR versus LOAD, 1YRT, ASSETS

The regression equation is
1YRR = - 13.1 + 1.76 LOAD - 0.011 1YRT + 0.0599 ASSETS

Predictor     Coef  SE Coef      T      P  VIF
Constant    -13.11    13.22  -0.99  0.343
LOAD         1.758    2.801   0.63  0.543  2.6
1YRT       -0.0106   0.1153  -0.09  0.929  2.5
ASSETS     0.05992  0.03328   1.80  0.099  3.1

S = 3.38563   R-Sq = 38.0%   R-Sq(adj) = 21.1%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       3   77.24  25.75  2.25  0.140
Residual Error  11  126.09  11.46
Total           14  203.33

Source  DF  Seq SS
LOAD     1   26.01
1YRT     1   14.07
ASSETS   1   37.16

3) Regression 3 is an attempt to look at the original independent variables. Only one part of the results is encouraging. Comment on the coefficient of determination and the tests for significance and (multi)collinearity. (2.5)

Solution: After the high apparent significance of the coefficients in Regression 2, the p-values here are a shock. None is below 5%, and only the p-value for ‘ASSETS’ (.099) is even marginally below 10%, so none of the coefficients are significant at the 5% level. This is, of course, echoed in the high p-value from the ANOVA (.140) and a pretty poor R-squared. At this point, if we had not seen Regression 2, we would suspect that our independent variables are complete duds. However, the good news is the very low VIFs (all near 3 or below), indicating no serious collinearity.

MTB > Regress c2 4 c3 c5 c11 c6;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
Regression 4
Regression Analysis: 1YRR versus LOAD, 1YRT, 1YRTsq, ASSETS

The regression equation is
1YRR = 4.0 + 1.81 LOAD - 0.564 1YRT + 0.00455 1YRTsq + 0.0558 ASSETS

Predictor      Coef   SE Coef      T      P   VIF
Constant       3.98     19.12   0.21  0.839
LOAD          1.815     2.743   0.66  0.523   2.6
1YRT        -0.5635    0.4689  -1.20  0.257  42.9
1YRTsq     0.004554  0.003748   1.22  0.252  39.5
ASSETS      0.05581   0.03275   1.70  0.119   3.1

S = 3.31461   R-Sq = 46.0%   R-Sq(adj) = 24.4%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       4   93.46  23.37  2.13  0.152
Residual Error  10  109.87  10.99
Total           14  203.33

Source  DF  Seq SS
LOAD     1   26.01
1YRT     1   14.07
1YRTsq   1   21.48
ASSETS   1   31.90

Unusual Observations
Obs  LOAD   1YRR     Fit  SE Fit  Residual  St Resid
  2  1.00  1.900  -2.820   2.381     4.720     2.05R
R denotes an observation with a large standardized residual.

4) Regression 4 is a first step in building the model. What can we say about R-squared and R-squared adjusted compared to Regression 3? (1) [5.5]

Solution: I should have looked at graphs of the residuals, but old bad habits die hard and we already had evidence that a nonlinear equation was the way to go. R-squared and R-squared adjusted both rose, which is encouraging. The p-values for ‘1YRT’ and ‘1YRTsq’ are, at least, much less laughable than the p-value for ‘1YRT’ alone and, because of the tiny coefficient of the squared term, still seem to indicate that rising turnover will hurt market performance. The high VIFs we are seeing are not important, since we can now guess that they are caused by the necessary relationship between ‘1YRT’ and ‘1YRTsq.’

MTB > Regress c2 3 c3 c15 c6;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
Regression 5
Regression Analysis: 1YRR versus LOAD, ln1YRT, ASSETS

The regression equation is
1YRR = - 3.2 + 1.96 LOAD - 2.35 ln1YRT + 0.0554 ASSETS

Predictor     Coef  SE Coef      T      P  VIF
Constant     -3.19    30.30  -0.11  0.918
LOAD         1.956    2.774   0.70  0.496  2.5
ln1YRT      -2.355    6.318  -0.37  0.716  2.4
ASSETS     0.05541  0.03307   1.68  0.122  3.1

S = 3.36573   R-Sq = 38.7%   R-Sq(adj) = 22.0%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       3   78.72  26.24  2.32  0.132
Residual Error  11  124.61  11.33
Total           14  203.33

Source  DF  Seq SS
LOAD     1   26.01
ln1YRT   1   20.92
ASSETS   1   31.79

5) Regression 5 investigates the possibility of replacing the two terms in ‘1YRT’ with its logarithm. Why did I decide this was a bad idea? (1)

Solution: The only thing good about this regression is that ‘ln1YRT’ has the expected negative sign. R-squared and R-squared adjusted both fell and the p-value for the coefficient of ‘ln1YRT’ is above 50%.

Identical to Regression 4
MTB > Regress c2 4 c3 c5 c11 c6;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
(The output repeats the Regression 4 results shown above: R-Sq = 46.0%, R-Sq(adj) = 24.4%, F = 2.13, P = 0.152.)

Regression 5
MTB > Regress c2 5 c3 c5 c11 c6 c7;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
Regression Analysis: 1YRR versus LOAD, 1YRT, 1YRTsq, ASSETS, ASSTsq

The regression equation is
1YRR = 30.7 + 1.39 LOAD - 0.259 1YRT + 0.00242 1YRTsq - 0.260 ASSETS + 0.000653 ASSTsq

Predictor       Coef    SE Coef      T      P   VIF
Constant       30.66      21.31   1.44  0.184
LOAD           1.387      2.408   0.58  0.579   2.6
1YRT         -0.2591     0.4368  -0.59  0.568  48.8
1YRTsq      0.002416   0.003444   0.70  0.501  43.6
ASSETS       -0.2596     0.1588  -1.63  0.137  96.3
ASSTsq     0.0006533  0.0003235   2.02  0.074  98.8

S = 2.89834   R-Sq = 62.8%   R-Sq(adj) = 42.2%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       5  127.726  25.545  3.04  0.070
Residual Error   9   75.604   8.400
Total           14  203.329

Source  DF  Seq SS
LOAD     1  26.006
1YRT     1  14.072
1YRTsq   1  21.482
ASSETS   1  31.904
ASSTsq   1  34.263

6) At long last we seem to be getting somewhere in Regression 5, though we may have to use a 10% significance level to claim any accomplishment. What cheered me up? (1) [7.5]

Solution: The good news is that the p-value for the ANOVA just fell below 10%, which we can consider a minimum acceptable significance level. Both R-squared and R-squared adjusted went up. On the other hand, the negative sign of ‘ASSETS’ is disturbing, though it could be offset by the positive sign of ‘ASSTsq.’

MTB > Regress c2 4 c3 c5 c11 c9;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
Regression 6
Regression Analysis: 1YRR versus LOAD, 1YRT, 1YRTsq, lnAssets

The regression equation is
1YRR = - 34.3 + 1.32 LOAD - 0.642 1YRT + 0.00497 1YRTsq + 10.1 lnAssets

Predictor      Coef   SE Coef      T      P   VIF
Constant     -34.26     49.59  -0.69  0.505
LOAD          1.324     2.886   0.46  0.656   2.6
1YRT        -0.6421    0.4874  -1.32  0.217  41.9
1YRTsq     0.004971  0.003928   1.27  0.234  39.2
lnAssets     10.081     7.856   1.28  0.228   2.9

S = 3.48891   R-Sq = 40.1%   R-Sq(adj) = 16.2%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       4   81.60  20.40  1.68  0.231
Residual Error  10  121.73  12.17
Total           14  203.33

Source    DF  Seq SS
LOAD       1   26.01
1YRT       1   14.07
1YRTsq     1   21.48
lnAssets   1   20.04

Unusual Observations
Obs  LOAD   1YRR     Fit  SE Fit  Residual  St Resid
  2  1.00  1.900  -2.601   2.801     4.501     2.16R
R denotes an observation with a large standardized residual.

Identical to Regression 5
MTB > Regress c2 5 c3 c5 c11 c6 c7;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
(The output repeats the results for 1YRR versus LOAD, 1YRT, 1YRTsq, ASSETS, ASSTsq shown above: R-Sq = 62.8%, R-Sq(adj) = 42.2%, F = 3.04, P = 0.070.)

7) So why did I backtrack? (1) [8.5]

Solution: Again, trying to replace a quadratic equation for the relationship between assets and the rate of return with a logarithm lowered R-squared and R-squared adjusted and raised the p-values for the ANOVA and for the coefficients of the asset terms.

MTB > Regress c2 6 c3 c5 c11 c6 c7 c13;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.
Regression 7
Regression Analysis: 1YRR versus LOAD, 1YRT, ...

The regression equation is
1YRR = 121 - 47.4 LOAD - 0.460 1YRT + 0.00424 1YRTsq - 0.859 ASSETS + 0.00170 ASSTsq + 0.204 AssetsL

Predictor       Coef    SE Coef      T      P    VIF
Constant      120.50      51.93   2.32  0.049
LOAD          -47.36      26.33  -1.80  0.110  392.1
1YRT         -0.4604     0.4021  -1.14  0.285   52.6
1YRTsq      0.004239   0.003207   1.32  0.223   48.1
ASSETS       -0.8591     0.3521  -2.44  0.041  603.0
ASSTsq     0.0017033  0.0006339   2.69  0.028  482.5
AssetsL       0.2043     0.1100   1.86  0.100  309.9

S = 2.56960   R-Sq = 74.0%   R-Sq(adj) = 54.5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  150.507  25.084  3.80  0.043
Residual Error   8   52.823   6.603
Total           14  203.329

Source   DF  Seq SS
LOAD      1  26.006
1YRT      1  14.072
1YRTsq    1  21.482
ASSETS    1  31.904
ASSTsq    1  34.263
AssetsL   1  22.781

Unusual Observations
Obs  LOAD   1YRR     Fit  SE Fit  Residual  St Resid
  2  1.00  1.900  -0.053   2.391     1.953     2.08R
R denotes an observation with a large standardized residual.

8) a) Victory? What happened in the Regression 7 ANOVA that hasn’t occurred since Regression 2? What still stinks? Why have I stopped worrying about VIFs? (3) [10.5]

Solution: We finally have a p-value for the ANOVA below 5%. On the other hand, the only coefficients that are significant at the 5% level are those of ‘ASSETS’ (p = .041) and ‘ASSTsq’ (p = .028). As I stated earlier, the cause of the high VIFs seems to be the necessary fact that if one variable enters the equation in different forms, those forms will be correlated.

b) Use an F-test to check the value of adding ‘ASSTsq’ and ‘AssetsL.’ Don’t repeat any tests that have already been done. (3)

Solution: Your raw materials here are the ANOVA

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  150.507  25.084  3.80  0.043
Residual Error   8   52.823   6.603
Total           14  203.329

and the sequential sums of squares relating to the mentioned variables.

Source   DF  Seq SS
ASSTsq    1  34.263
AssetsL   1  22.781
Sum       2  57.044

If we use the sum I just calculated to break up the ANOVA, we get the following.
Source             DF       SS      MS     F  F.05
4 indep variables   4   93.463  23.365  3.54  F.05(4,8) = 3.84
2 new variables     2   57.044  28.522  4.32  F.05(2,8) = 4.46
Residual Error      8   52.823   6.603
Total              14  203.329

I’m surprised. Though the improvement is evident at the 10% level, it is not at the 5% level that I used here (4.32 is below 4.46).

Regression 8
MTB > Regress c2 7 c3 c5 c11 c14 c6 c7 c13;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.

Regression Analysis: 1YRR versus LOAD, 1YRT, ...

The regression equation is
1YRR = 161 + 47.8 LOAD - 2.77 1YRT + 0.0288 1YRTsq - 1.41 1YRTL - 0.809 ASSETS + 0.00167 ASSTsq + 0.166 AssetsL

Predictor       Coef    SE Coef      T      P     VIF
Constant      161.11      46.37   3.47  0.010
LOAD           47.81      48.13   0.99  0.354  1947.8
1YRT          -2.768      1.094  -2.53  0.039   578.5
1YRTsq       0.02882    0.01142   2.52  0.040   907.7
1YRTL        -1.4055     0.6353  -2.21  0.063  1567.3
ASSETS       -0.8091     0.2897  -2.79  0.027   606.7
ASSTsq     0.0016686  0.0005201   3.21  0.015   482.9
AssetsL      0.16616    0.09185   1.81  0.113   321.2

S = 2.10735   R-Sq = 84.7%   R-Sq(adj) = 69.4%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       7  172.243  24.606  5.54  0.019
Residual Error   7   31.087   4.441
Total           14  203.329

Source   DF  Seq SS
LOAD      1  26.006
1YRT      1  14.072
1YRTsq    1  21.482
1YRTL     1   2.573
ASSETS    1  41.804
ASSTsq    1  51.772
AssetsL   1  14.535

MTB > Name c16 "RESI1"

Regression 9
MTB > Regress c2 6 c5 c11 c14 c6 c7 c13;
SUBC> Residuals 'RESI1';
SUBC> GHistogram;
SUBC> GNormalplot;
SUBC> NoDGraphs;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Brief 2.

Regression Analysis: 1YRR versus 1YRT, 1YRTsq, ...
The regression equation is
1YRR = 163 - 1.88 1YRT + 0.0193 1YRTsq - 0.842 1YRTL - 0.947 ASSETS + 0.00189 ASSTsq + 0.222 AssetsL

Predictor       Coef    SE Coef      T      P    VIF
Constant      162.54      46.31   3.51  0.008
1YRT         -1.8806     0.6305  -2.98  0.018  192.4
1YRTsq      0.019313   0.006217   3.11  0.015  269.5
1YRTL        -0.8416     0.2848  -2.96  0.018  315.5
ASSETS       -0.9475     0.2537  -3.73  0.006  466.3
ASSTsq     0.0018892  0.0004699   4.02  0.004  394.9
AssetsL      0.22154    0.07293   3.04  0.016  202.8

S = 2.10557   R-Sq = 82.6%   R-Sq(adj) = 69.5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  167.862  27.977  6.31  0.010
Residual Error   8   35.467   4.433
Total           14  203.329

Source   DF  Seq SS
1YRT      1  38.826
1YRTsq    1  22.179
1YRTL     1   0.663
ASSETS    1  31.076
ASSTsq    1  34.203
AssetsL   1  40.915

9) a) So why am I very happy with Regression 9? (1.5)

Solution: What happened in Regressions 8 and 9 was that I added the interaction variable ‘1YRTL’ and then removed ‘LOAD.’ While ‘LOAD’ never had a significant coefficient, all the coefficients now are significant.

a) Why was I willing to accept a lower value of R-squared than in Regression 8? (1)

Solution: R-squared adjusted actually rose.

b) Let’s check out the effect of the signs of the coefficients. To start with, we have what are essentially two equations here because of the interaction variables. What are they? (2)

Solution: The regression equation is
1YRR = 162.54 - 1.8806 1YRT + 0.019313 1YRTsq - 0.8416 1YRTL - 0.9475 ASSETS + 0.0018892 ASSTsq + 0.22154 AssetsL

For a firm with no load, ‘1YRTL’ and ‘AssetsL’ are zero, so the equation reads as below.
1YRR = 162.54 - 1.8806 1YRT + 0.019313 1YRTsq - 0.9475 ASSETS + 0.0018892 ASSTsq

For a firm with a load, we can add the coefficient of ‘1YRTL’ to the coefficient of ‘1YRT’ and add the coefficient of ‘AssetsL’ to the coefficient of ‘ASSETS.’
1YRR = 162.54 - 2.7222 1YRT + 0.01931 1YRTsq - 0.7257 ASSETS + 0.00189 ASSTsq

c) The mean value of 1YRT is 58.2 and the mean value of ASSETS is 247.01.
If these two values apply to a given firm, what is the predicted value of 1YRR for (i) a no-load firm and (ii) a load firm? (2)

Solution: We have ‘1YRT’ = 58.2, ‘1YRTsq’ = 3387.24, ‘ASSETS’ = 247.01 and ‘ASSTsq’ = 61013.94.

For a no-load firm
1YRR = 162.54 - 1.8806 1YRT + 0.019313 1YRTsq - 0.9475 ASSETS + 0.0018892 ASSTsq
     = 162.54 - 1.8806(58.2) + 0.019313(3387.24) - 0.9475(247.01) + 0.0018892(61013.94)
     = 162.54 - 109.45 + 65.42 - 234.04 + 115.26 = -0.27
Minitab gets -0.27.

For a firm with a load
1YRR = 162.54 - 2.7222 1YRT + 0.01931 1YRTsq - 0.7257 ASSETS + 0.00189 ASSTsq
     = 162.54 - 2.7222(58.2) + 0.01931(3387.24) - 0.7257(247.01) + 0.00189(61013.94)
     = 162.54 - 158.43 + 65.42 - 179.32 + 115.26 = 5.47
Minitab gets 5.54.

d) So how do rises of 10 in 1-year turnover and 10 in Assets affect these firms? You need four answers here. (4)

Solution: After a rise of 10 in ‘1YRT’ we have ‘1YRT’ = 68.2, ‘1YRTsq’ = 4651.24, ‘ASSETS’ = 247.01 and ‘ASSTsq’ = 61013.94.
For a no-load firm 1YRR = 162.54 - 1.8806(68.2) + 0.019313(4651.24) - 0.9475(247.01) + 0.0018892(61013.94) = 5.34, up from -0.27.
For a firm with a load 1YRR = 162.54 - 2.7222(68.2) + 0.01931(4651.24) - 0.7257(247.01) + 0.00189(61013.94) = 2.73, down from 5.54.

After a rise of 10 in ‘ASSETS,’ we have ‘1YRT’ = 58.2, ‘1YRTsq’ = 3387.24, ‘ASSETS’ = 257.01 and ‘ASSTsq’ = 66054.14.
For a no-load firm 1YRR = 162.54 - 1.8806(58.2) + 0.019313(3387.24) - 0.9475(257.01) + 0.0018892(66054.14) = -0.22, up from -0.27.
For a firm with a load 1YRR = 162.54 - 2.7222(58.2) + 0.01931(3387.24) - 0.7257(257.01) + 0.00189(66054.14) = 7.80, up from 5.54.

e) So, in view of d) are there any coefficients here that don’t seem reasonable? Why? (1.5) [25.5]

Solution: The results for no-load firms don’t seem to make much sense. Rises in assets hurt the rate of return and rises in turnover help it.

[Figure: Normplot of Residuals for 1YRR]
[Figure: Residual Histogram for 1YRR]

MTB > NormTest c16;
SUBC> KSTest.
[Figure: Probability Plot of RESI1]

MTB > NormTest c16.
[Figure: Probability Plot of RESI1]

MTB > NormTest c16;
SUBC> RJTest.
[Figure: Probability Plot of RESI1]

Wording in little boxes is:
Mean StDev N KS p-value
6.14175E-14 1.592 15 .186
RJ >0.150 0.463
AD >0.100 .506 .119

Regression 10
MTB > BReg c2 c5 c11 c14 c6 c7 c13 ;
SUBC> NVars 1 6;
SUBC> Best 2;
SUBC> Constant.

Best Subsets Regression: 1YRR versus 1YRT, 1YRTsq, ...
Response is 1YRR

Vars  R-Sq  R-Sq(adj)  Mallows C-p       S
   1  41.7       37.2         15.7  3.0203
   1  35.7       30.8         18.5  3.1705
   2  58.2       51.3         10.2  2.6599
   2  47.0       38.1         15.3  2.9980
   3  60.8       50.1         11.0  2.6918
   3  60.7       50.0         11.0  2.6938
   4  61.9       46.7         12.5  2.7826
   4  61.4       46.0         12.7  2.7998
   5  63.5       43.2         13.7  2.8710
   5  63.2       42.7         13.9  2.8851
   6  82.6       69.5          7.0  2.1056

Predictor columns (marked with X in the original display): 1YRT, 1YRTsq, 1YRTL, ASSETS, ASSTsq, AssetsL.

MTB > Save "C:\Documents and Settings\RBOVE\My Documents\Minitab\252x07041-01A.MTW";
SUBC> Replace.
Saving file as: 'C:\Documents and Settings\RBOVE\My Documents\Minitab\252x07041-01A.MTW'
Existing file replaced.

10) So now I am even happier. The plots are for the Lilliefors test and two other similar tests. What comforting news are they telling me? Regression 10 didn’t tell me much that I didn’t know already, except for one thing. But how is it telling me that I have finished my job? (1.5) [27]

Solution: The only regression that has a satisfactory C-p is the one we just did. The C-p is supposed to be 1 plus the number of independent variables, which is exactly what it is here (7.0 with 6 independent variables). According to the plots, at least for this model, I seem to have Normal residuals, as assumed.

Note: The general technique employed here was to look at all possible variables first. The high VIFs killed that approach. I then restricted myself to the three basic variables that I had started with. After assuring myself that there was no collinearity present, I started worrying about significance and R-squared.
I probably should have given the computer the option of putting back some of the excluded variables when I used the best subsets approach at the end.

II. Hand in your fourth computer problem. (2 to 8 points)

III. Do at least 4 of the following 6 Problems (at least 12 each), or do sections adding to at least 50 points. (Anything extra you do helps, and grades wrap around.) It is especially important to do more if you have skipped much of parts I or II. You must do part b) of problem 1. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests. That is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has about 160 points, but 70 is considered a perfect score. Don’t waste my time by telling me that two means, proportions, variances or medians don’t look the same to you. You need statistical tests! There are some blank pages below. Put your name on as many loose pages as possible!

1) The data below represent four independent random samples. Each column represents hours between breakdowns for the machines named. We assume that the underlying distribution is Normal. Please do the following. Mark sections of your answer clearly.

a) If I want to test to see if the mean of $x_1$ is larger than the mean of $x_2$, my null hypotheses are: (Note: $D = \mu_1 - \mu_2$) Only check one answer! (2)
i) $\mu_1$ __ $\mu_2$ and $D$ __ 0
ii) $\mu_1$ __ $\mu_2$ and $D$ __ 0
iii) $\mu_1$ __ $\mu_2$ and $D$ __ 0
iv) $\mu_1$ __ $\mu_2$ and $D$ __ 0
v) $\mu_1$ __ $\mu_2$ and $D$ __ 0
vi) $\mu_1$ __ $\mu_2$ and $D$ __ 0
vii) $\mu_1$ __ $\mu_2$ and $D$ __ 0
viii) $\mu_1$ __ $\mu_2$ and $D$ __ 0

Solution: Since this question will appear on future exams, it is left to the student.
Machine 1  Machine 2  Machine 3  Machine 4
      6.5        8.7       11.1        9.9
      7.9        7.4       10.3       12.8
      5.4        9.4        9.7       12.1
      7.5       10.1       10.3       10.8
      8.5        9.2        9.2       11.3
      7.3        9.8        8.8       11.5

Note the following: $n_1 = n_2 = n_3 = n_4 = 6$.
For machine 1: $\sum x_1 = 43.1$, $\sum x_1^2 = 315.61$, $s_1^2 = 1.201667$
For machine 2: $\sum x_2 = 54.6$
For machine 3: $\sum x_3 = 59.4$, $\sum x_3^2 = 591.56$, $s_3^2 = 0.700000$
For machine 4: $\sum x_4 = 68.4$, $\sum x_4^2 = 784.84$, $s_4^2 = 1.016000$

b) Find the sample variance for machine 2. Note that you may need $\sum x_2^2$ later in the problem. (2)
c) Test the hypothesis that the mean for machine 2 is 10. Find an approximate p-value. (3)
d) Assume that the variances are the same for machine 2 and machine 3 and test to see if the mean of machine 3 is larger than the mean of machine 2. Use a confidence interval, a test ratio and a critical value for the mean (6). Or use just one of these three methods (3) [13]
e) Just to be on the safe side, test to see if the variances of the two machines were similar (3)
f) Assume equality of variances for all of the machines and test the hypothesis that all of the machines have equal means. (6)
g) Explain how to test these columns to see if they have equal variances (1) [23]

Solution:
b) Find the sample variance for machine 2. Note that you may need $\sum x_2^2$ later in the problem. (2)

   $x_2$    $x_2^2$
    8.7    75.69
    7.4    54.76
    9.4    88.36
   10.1   102.01
    9.2    84.64
    9.8    96.04
   54.6   501.50

$n_2 = 6$, $\sum x_2 = 54.6$, $\sum x_2^2 = 501.50$. So $\bar{x}_2 = \dfrac{\sum x_2}{n} = \dfrac{54.6}{6} = 9.10$ and $s_2^2 = \dfrac{\sum x_2^2 - n\bar{x}_2^2}{n - 1} = \dfrac{501.50 - 6(9.10)^2}{5} = 0.9280$. Some of you still couldn’t do this!

c) Test the hypothesis that the mean for machine 2 is 10. Find an approximate p-value. (3)
Table 3 has the material below. But if we want a p-value we should use a test ratio.
Interval for: Mean ($\sigma$ unknown)
  Confidence Interval: $\mu = \bar{x} \pm t_{\alpha/2}\, s_{\bar{x}}$
  Hypotheses: $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$
  Test Ratio: $t = \dfrac{\bar{x} - \mu_0}{s_{\bar{x}}}$
  Critical Value: $\bar{x}_{cv} = \mu_0 \pm t_{\alpha/2}\, s_{\bar{x}}$
  $DF = n - 1$, $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$

Interval for: Difference between Two Means ($\sigma$ unknown, variances assumed equal)
  Confidence Interval: $D = d \pm t_{\alpha/2}\, s_d$
  Hypotheses: $H_0: D = D_0$*, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
  Test Ratio: $t = \dfrac{d - D_0}{s_d}$
  Critical Value: $d_{cv} = D_0 \pm t_{\alpha/2}\, s_d$
  $s_d = \hat{s}_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$, $\hat{s}_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, $DF = n_1 + n_2 - 2$

$H_0: \mu = 10$, $H_1: \mu \ne 10$. $s_{\bar{x}}^2 = \dfrac{s_2^2}{n_2} = \dfrac{0.9280}{6} = 0.1546667$, so $s_{\bar{x}} = \sqrt{0.1546667} = 0.39328$ and $t = \dfrac{9.10 - 10}{0.39328} = -2.288$. We have 5 degrees of freedom and the relevant part of the t table is below.

df    .45    .40    .35    .30    .25    .20    .15    .10    .05   .025    .01   .005   .001
 1  0.158  0.325  0.510  0.727  1.000  1.376  1.963  3.078  6.314  12.71  31.82  63.66  318.3
 2  0.142  0.289  0.445  0.617  0.816  1.061  1.386  1.886  2.920  4.303  6.965  9.925  22.33
 3  0.137  0.277  0.424  0.584  0.765  0.978  1.250  1.638  2.353  3.182  4.541  5.841  10.21
 4  0.134  0.271  0.414  0.569  0.741  0.941  1.190  1.533  2.132  2.776  3.747  4.604  7.173
 5  0.132  0.267  0.408  0.559  0.727  0.920  1.156  1.476  2.015  2.571  3.365  4.032  5.893
 6  0.131  0.265  0.404  0.553  0.718  0.906  1.134  1.440  1.943  2.447  3.143  3.707  5.208

Since 2.288 lies between $t_{.05}^{(5)} = 2.015$ and $t_{.025}^{(5)} = 2.571$, for a left-sided test we would say .025 < p-value < .05. But since this is a 2-sided test, we must double these numbers to .05 < p-value < .10. Thus we cannot reject the null hypothesis at the 95% confidence level.

d) Assume that the variances are the same for machine 2 and machine 3 and test to see if the mean of machine 3 is larger than the mean of machine 2. Use a confidence interval, a test ratio and a critical value for the mean (6). Or use just one of these three methods (3) [13]

Let $D = \mu_2 - \mu_3$. Our alternative hypothesis is $\mu_2 < \mu_3$, since it does not contain an equality. Our hypotheses read $H_0: \mu_2 \ge \mu_3$, $H_1: \mu_2 < \mu_3$ or $H_0: D \ge 0$, $H_1: D < 0$. Wake up! This is a one-sided test. This means that confidence intervals and critical values must be one-sided.
We thus have $\sum x_2 = 54.6$, $\sum x_3 = 59.4$, $s_2^2 = 0.9280$ and $s_3^2 = 0.7000$, so $\bar{x}_2 = \dfrac{54.6}{6} = 9.10$, $\bar{x}_3 = \dfrac{59.4}{6} = 9.90$ and $d = \bar{x}_2 - \bar{x}_3 = -0.8$. Then

$\hat{s}_p^2 = \dfrac{5(0.9280) + 5(0.7000)}{10} = 0.8140$ and $s_d = \sqrt{\hat{s}_p^2\left(\dfrac{1}{n_2} + \dfrac{1}{n_3}\right)} = \sqrt{0.8140\left(\dfrac{2}{6}\right)} = \sqrt{0.2713} = 0.5209$.

Degrees of freedom are 6 + 6 - 2 = 10. I have copied part of the t-table below, and we can see that $t_{.05}^{(10)} = 1.812$.

df    .45    .40    .35    .30    .25    .20    .15    .10    .05   .025    .01   .005   .001
10  0.129  0.260  0.397  0.542  0.700  0.879  1.093  1.372  1.812  2.228  2.764  3.169  4.144
11  0.129  0.260  0.396  0.540  0.697  0.876  1.088  1.363  1.796  2.201  2.718  3.106  4.025
12  0.128  0.259  0.395  0.539  0.695  0.873  1.083  1.356  1.782  2.179  2.681  3.055  3.930

Confidence interval: If we check the alternate hypothesis $H_1: D < 0$ and form a 1-sided confidence interval in the same direction, we will make the given formula $D = d \pm t_{\alpha/2}\, s_d$ into $D \le d + t_{\alpha}\, s_d = -0.8 + 1.812(0.5209) = -0.8 + 0.944 = 0.144$. If $D \le 0.144$, $H_0: D \ge 0$ can be true. So we do not reject the null hypothesis.

Critical value for $d$: If we look at the alternate hypothesis, we are looking for a critical value for $d$ that is below zero. The given formula $d_{cv} = D_0 \pm t_{\alpha/2}\, s_d$ becomes $d_{cv} = 0 - t_{\alpha}\, s_d = -1.812(0.5209) = -0.944$. We will reject the null hypothesis if $d$ is below -0.944. Since $d = -0.8$ is not below the critical value, we cannot reject the null hypothesis. Make a diagram. Show an almost Normal curve centered at zero and a ‘reject’ region below -0.944. This is a left-sided test unless you defined $D$ as $D = \mu_3 - \mu_2$.

Test Ratio: The format for the test ratio was given as $t = \dfrac{d - D_0}{s_d} = \dfrac{-0.8 - 0}{0.5209} = -1.536$. Make a diagram. Show an almost Normal curve centered at zero and a ‘reject’ region below $-t_{.05}^{(10)} = -1.812$. Since our computed value of t is not below -1.812, do not reject the null hypothesis.

e) Just to be on the safe side, test to see if the variances of the two machines were similar (3).
$H_0: \sigma_2^2 = \sigma_3^2$, $H_1: \sigma_2^2 \ne \sigma_3^2$. We test the two variance ratios $\dfrac{s_2^2}{s_3^2} = \dfrac{0.9280}{0.7000} = 1.326$ and $\dfrac{s_3^2}{s_2^2} = \dfrac{0.7000}{0.9280} < 1$ against $F_{.025}^{(10,10)} = 3.72$.
Since the computed ratio is below the table value, do not reject the null hypothesis.

F table for $df_2 = 10$
alpha   df1=1     2     3     4     5     6     7     8     9    10    11    12
0.100    3.29  2.92  2.73  2.61  2.52  2.46  2.41  2.38  2.35  2.32  2.30  2.28
0.050    4.96  4.10  3.71  3.48  3.33  3.22  3.14  3.07  3.02  2.98  2.94  2.91
0.025    6.94  5.46  4.83  4.47  4.24  4.07  3.95  3.85  3.78  3.72  3.66  3.62
0.010   10.04  7.56  6.55  5.99  5.64  5.39  5.20  5.06  4.94  4.85  4.77  4.71

f) Assume equality of variances for all of the machines and test the hypothesis that all of the machines have equal means. (6)
A multiple test of means from random samples is 1-way ANOVA. $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$

                    x.1      x.2      x.3      x.4
                    6.5      8.7     11.1      9.9
                    7.9      7.4     10.3     12.8
                    5.4      9.4      9.7     12.1
                    7.5     10.1     10.3     10.8
                    8.5      9.2      9.2     11.3
                    7.3      9.8      8.8     11.5
Sum                43.1     54.6     59.4     68.4    $\sum x = 225.5$
$n_j$                 6        6        6        6    $n = 24$
$\bar{x}_{.j}$   7.1833   9.1000   9.9000   11.400    $\bar{x} = 9.3958$
SS               315.61   501.50   591.56   784.84    $\sum x^2 = 2193.51$
$\bar{x}_{.j}^2$ 51.5998  82.8100  98.0100  129.960    $\sum \bar{x}_{.j}^2 = 362.3798$

SOURCE            SS      DF  MS      F
machine           55.534   3  18.511  19.25
error (within)    19.231  20  0.9616
total             74.765  23

Since $F_{.05}^{(3,20)} = 3.10$ and 19.25 is above it, reject $H_0$. SST and SSB are computed below.

$SST = \sum x^2 - n\bar{x}^2 = 2193.51 - 24(9.3958)^2 = 74.7646$
$SSB = \sum n_j \bar{x}_{.j}^2 - n\bar{x}^2 = 6(362.380) - 24(9.3958)^2 = 2174.279 - 24(9.3958)^2 = 55.534$

g) Explain how to test these columns to see if they have equal variances (1) [23]
Since we are assuming that the parent distributions are Normal, the most appropriate test is a Bartlett test.

2) Let us go back to the data in Problem 1. The data are now presented with their ranks. Assume that the underlying distribution was different from the Normal.

Row  Mach1  RankM1  Mach2  RankM2  Mach3  RankM3  Mach4  RankM4
  1    6.5       2    8.7     8.0   11.1    20.0    9.9      15
  2    7.9       6    7.4     4.0   10.3    17.5   12.8      24
  3    5.4       1    9.4    12.0    9.7    13.0   12.1      23
  4    7.5       5   10.1    16.0   10.3    17.5   10.8      19
  5    8.5       7    9.2    10.5    9.2    10.5   11.3      21
  6    7.3       3    9.8    14.0    8.8     9.0   11.5      22

a) Assume that we believe that Machine 1 and Machine 4 have the same median.
Treat them as a single sample and test that the median is 8. (2 or 3 depending on method) It may simplify your work in a-c) if I present the data as a single sample. The original numbers are in 'Mc1,4stacked.' 'McNo1,4' just identifies the machine that they came from. 'RankMc1,4' ranks the numbers from 1 through 12. 'Mc1,4-8' is simply 'Mc1,4stacked' with 8 subtracted from it.

Row  Mc1,4stacked  McNo1,4  RankMc1,4  Mc1,4-8
1    6.5           Mc1      2          -1.5
2    7.9           Mc1      5          -0.1
3    5.4           Mc1      1          -2.6
4    7.5           Mc1      4          -0.5
5    8.5           Mc1      6          0.5
6    7.3           Mc1      3          -0.7
7    9.9           Mc4      7          1.9
8    12.8          Mc4      12         4.8
9    12.1          Mc4      11         4.1
10   10.8          Mc4      8          2.8
11   11.3          Mc4      9          3.3
12   11.5          Mc4      10         3.5

The most appropriate method is the Wilcoxon signed rank test, though the sign test can be done with the same data. We are testing $H_0: \eta = 8$.

Row  x     Machine  r   x - 8  |x - 8|  rank  signed rank
1    6.5   Mc1      2   -1.5   1.5      5.0   -5.0
2    7.9   Mc1      5   -0.1   0.1      1.0   -1.0
3    5.4   Mc1      1   -2.6   2.6      7.0   -7.0
4    7.5   Mc1      4   -0.5   0.5      2.5   -2.5
5    8.5   Mc1      6   0.5    0.5      2.5   2.5
6    7.3   Mc1      3   -0.7   0.7      4.0   -4.0
7    9.9   Mc4      7   1.9    1.9      6.0   6.0
8    12.8  Mc4      12  4.8    4.8      12.0  12.0
9    12.1  Mc4      11  4.1    4.1      11.0  11.0
10   10.8  Mc4      8   2.8    2.8      8.0   8.0
11   11.3  Mc4      9   3.3    3.3      9.0   9.0
12   11.5  Mc4      10  3.5    3.5      10.0  10.0

We have $T^- = 19.5$ and $T^+ = 58.5$. To check for accuracy of ranking note that $19.5 + 58.5 = 78 = \frac{12(13)}{2}$. We check Table 7 and find that for a 2-sided 95% test with $n = 12$, we use 14 as a critical value. Because 19.5 is not below the critical value, we cannot reject the null hypothesis. To do a sign test, note that there are 5 numbers below 8 and 7 above 8. With $n = 12$ and $p = .5$, a split this even gives a p-value of $2P(x \le 5) = 2(.3872) = .774$, so the sign test does not reject the null hypothesis either.

b) Remember that these are actually two independent samples. Test for the equality of the medians of Machines 1 and 4. (3) If we look at the column marked $r$, we can sum the ranks for each machine. We get $SR_1 = 21$ and $SR_4 = 57$. Table 6 says that for the null hypothesis of equality to be true the smaller of these two must be above 26. It isn't, so we reject the null hypothesis of equal medians.
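The rank sums in a) and b) are easy to get wrong by hand, so a short check may help. The sketch below uses plain Python with no libraries; the helper name `average_ranks` is my own and is not from the course outline. It ranks the stacked Machine 1 and Machine 4 data, giving tied values their average rank, and recomputes the signed-rank sums about 8 and the two machine rank sums. The critical values still have to come from Tables 6 and 7.

```python
mach1 = [6.5, 7.9, 5.4, 7.5, 8.5, 7.3]
mach4 = [9.9, 12.8, 12.1, 10.8, 11.3, 11.5]


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


pooled = mach1 + mach4
n = len(pooled)

# a) Wilcoxon signed rank test of "median = 8" on the pooled sample.
diffs = [x - 8 for x in pooled]
abs_ranks = average_ranks([abs(d) for d in diffs])
t_minus = sum(r for d, r in zip(diffs, abs_ranks) if d < 0)
t_plus = sum(r for d, r in zip(diffs, abs_ranks) if d > 0)

# b) Rank sums for the two independent samples.
ranks = average_ranks(pooled)
sr1 = sum(ranks[:len(mach1)])
sr4 = sum(ranks[len(mach1):])

print("T- =", t_minus, "T+ =", t_plus)   # 19.5 and 58.5
print("SR1 =", sr1, "SR4 =", sr4)        # 21.0 and 57.0
print("check total:", n * (n + 1) / 2)   # both pairs should add to 78.0
```

Both pairs of sums should add to $n(n+1)/2 = 78$, which is the same accuracy check used in the hand solution.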
It says 'independent samples' here, so why did some people use a Wilcoxon signed rank test? To check for accuracy of ranking note that $SR_1 + SR_4 = 21 + 57 = 78 = \frac{12(13)}{2}$.

c) There is another test for the median available. The Wald-Wolfowitz test is a version of the runs test. Put the numbers in order (Ranking has been done for you.) Now mark each number with a + or a - depending on whether the number came from Machine 1 or Machine 4. Do a runs test. If the null hypothesis of randomness is rejected, you can say that the medians are not the same. What do you find? (3) The original data follows.

Row  Mc1,4stacked  McNo1,4  RankMc1,4  Mc1,4-8
1    6.5           Mc1      2          -1.5
2    7.9           Mc1      5          -0.1
3    5.4           Mc1      1          -2.6
4    7.5           Mc1      4          -0.5
5    8.5           Mc1      6          0.5
6    7.3           Mc1      3          -0.7
7    9.9           Mc4      7          1.9
8    12.8          Mc4      12         4.8
9    12.1          Mc4      11         4.1
10   10.8          Mc4      8          2.8
11   11.3          Mc4      9          3.3
12   11.5          Mc4      10         3.5

If we put the data in order as suggested we get the table below. Of course, since we have 'RankMc1,4,' we do not need the 'Mc1,4stacked' column.

Row  Mc1,4stacked  McNo1,4  RankMc1,4  Sign
1    5.4           Mc1      1          +
2    6.5           Mc1      2          +
3    7.3           Mc1      3          +
4    7.5           Mc1      4          +
5    7.9           Mc1      5          +
6    8.5           Mc1      6          +
7    9.9           Mc4      7          -
8    10.8          Mc4      8          -
9    11.3          Mc4      9          -
10   11.5          Mc4      10         -
11   12.1          Mc4      11         -
12   12.8          Mc4      12         -

The sequence is + + + + + + - - - - - -. So $n_1 = 6$, $n_2 = 6$ and the number of runs is $r = 2$. The runs test table says to reject randomness if the number of runs is 3 or lower or 11 or higher. So we reject the null hypothesis of equal medians.

d) Do a test of the four independent samples to see if their medians differ (4) [12, 35]

Row  x1   r1  x2    r2    x3    r3    x4    r4
1    6.5  2   8.7   8.0   11.1  20.0  9.9   15
2    7.9  6   7.4   4.0   10.3  17.5  12.8  24
3    5.4  1   9.4   12.0  9.7   13.0  12.1  23
4    7.5  5   10.1  16.0  10.3  17.5  10.8  19
5    8.5  7   9.2   10.5  9.2   10.5  11.3  21
6    7.3  3   9.8   14.0  8.8   9.0   11.5  22
Sum       24        64.5        87.5        124

Solution: If the columns are independent random samples and the distribution is not Normal, we have a Kruskal-Wallis test. The null hypothesis is that we have equal medians. The sums of ranks should add to $\frac{24(25)}{2} = 300$.
The number of items in all the columns is $n = 24$, the number of items in each column is $n_i = 6$ and we have 4 columns, which is too many items in a column to use the K-W table, so we say that we will consider the K-W statistic to have a Chi-squared distribution with 4 - 1 = 3 degrees of freedom. We must compute the Kruskal-Wallis statistic

$H = \frac{12}{n(n+1)} \sum_i \frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{24(25)} \left[ \frac{24^2}{6} + \frac{64.5^2}{6} + \frac{87.5^2}{6} + \frac{124^2}{6} \right] - 3(25) = \frac{12}{600} \cdot \frac{27768.5}{6} - 75 = 92.5617 - 75 = 17.5617$.

Since this is larger than $\chi^{2(3)}_{.05} = 7.8147$, we reject our null hypothesis.

3) Curses! It seems that these were not random samples at all. Each row represents a set of measurements that were taken on one of 6 randomly chosen days. Thus the data is blocked by days.

Day  Mach1  Mach2  Mach3  Mach4  RowSum  RowSum of Squares
1    6.5    8.7    11.1   9.9    36.2    339.16
2    7.9    7.4    10.3   12.8   38.4    387.10
3    5.4    9.4    9.7    12.1   36.6    358.02
4    7.5    10.1   10.3   10.8   38.7    380.99
5    8.5    9.2    9.2    11.3   38.2    369.22
6    7.3    9.8    8.8    11.5   37.4    359.02

a) Using the fact that the data is cross classified and assuming that the underlying distribution is Normal, test that the means of Machines 1 and 4 are equal. (4) Note that the credit on this problem is large because none of the computations above or in previous problems will help you here.

The table below is the setup for both a) and c). The differences and their squares that are needed for a) are calculated. For the Wilcoxon signed rank test in c) the absolute values of the differences are calculated in $|d|$. These are ranked in rank and have their signs restored in signed rank.

Row  x1   x4    d = x1 - x4  d^2     |d|  rank  signed rank
1    6.5  9.9   -3.4         11.56   3.4  3     -3
2    7.9  12.8  -4.9         24.01   4.9  5     -5
3    5.4  12.1  -6.7         44.89   6.7  6     -6
4    7.5  10.8  -3.3         10.89   3.3  2     -2
5    8.5  11.3  -2.8         7.84    2.8  1     -1
6    7.3  11.5  -4.2         17.64   4.2  4     -4
Sum             -25.3        116.83             -21

The formula table entry for the Difference between Two Means (paired data) reads as follows.
Confidence Interval: $D = \bar d \pm t_{\alpha/2} s_{\bar d}$, where $d = x_1 - x_2$ and $df = n - 1$ with $n_1 = n_2 = n$
Hypotheses: $H_0: D = D_0$*, $H_1: D \ne D_0$, $D = \mu_1 - \mu_2$
Test Ratio: $t = \frac{\bar d - D_0}{s_{\bar d}}$
Critical Value: $\bar d_{cv} = D_0 \pm t_{\alpha/2} s_{\bar d}$, where $s_{\bar d} = \frac{s_d}{\sqrt n}$

Our hypotheses read $H_0: \mu_1 = \mu_4$, $H_1: \mu_1 \ne \mu_4$ or $H_0: D = 0$, $H_1: D \ne 0$. We have $\sum d = -25.3$ and $\sum d^2 = 116.83$. So $\bar d = \frac{\sum d}{n} = \frac{-25.3}{6} = -4.2167$ and $s_d^2 = \frac{\sum d^2 - n\bar d^2}{n - 1} = \frac{116.83 - 6(-4.2167)^2}{5} = 2.0297$, so that $s_{\bar d} = \sqrt{\frac{2.0297}{6}} = 0.5817$. Since $n = 6$, $df = n - 1 = 5$ and $t^{(5)}_{.025} = 2.571$. Use only one of the following methods.

Confidence interval: $D = \bar d \pm t_{\alpha/2} s_{\bar d}$ is now $D = -4.2167 \pm 2.571(0.5817) = -4.2167 \pm 1.4955$. Since the error term is smaller in absolute value than the mean difference, the interval does not include zero and we must reject the null hypothesis.

Critical value for $\bar d$: The given formula $\bar d_{cv} = D_0 \pm t_{\alpha/2} s_{\bar d}$ becomes $\bar d_{cv} = 0 \pm 2.571(0.5817) = \pm 1.4955$. Make a diagram. Show an almost Normal curve centered at zero and 'reject' regions below -1.4955 and above 1.4955. Since $\bar d = -4.2167$ falls in the lower reject region, we reject the null hypothesis.

Test Ratio: The format for the test ratio was given as $t = \frac{\bar d - D_0}{s_{\bar d}} = \frac{-4.2167}{0.5817} = -7.249$. Make a diagram. Show an almost Normal curve centered at zero and 'reject' regions below $-t^{(5)}_{.025} = -2.571$ and above $t^{(5)}_{.025} = 2.571$. Since our computed value of t is below -2.571, reject the null hypothesis.

b) Using the fact that the data is cross classified and assuming that the underlying distribution is Normal, test that the means of all four machines are equal. Note that you already should have done about half of this problem in 1f). (4) If we copy the data at the beginning of the problem we have

Day            Mach1    Mach2    Mach3    Mach4    Sum     n_i  mean x_i.  SS (sum of x^2)  (mean x_i.)^2
1              6.5      8.7      11.1     9.9      36.2    4    9.050      339.16           81.9025
2              7.9      7.4      10.3     12.8     38.4    4    9.600      387.10           92.1600
3              5.4      9.4      9.7      12.1     36.6    4    9.150      358.02           83.7225
4              7.5      10.1     10.3     10.8     38.7    4    9.675      380.99           93.6056
5              8.5      9.2      9.2      11.3     38.2    4    9.550      369.22           91.2025
6              7.3      9.8      8.8      11.5     37.4    4    9.350      359.02           87.4225
Sum            43.1     54.6     59.4     68.4     225.5   24   (9.3958)   2193.51          530.0156
n_j            6        6        6        6        24
mean x.j       7.1833   9.1000   9.9000   11.400   (9.3958)
SS             315.61   501.50   591.56   784.84   2193.51
(mean x.j)^2   51.5998  82.8100  98.0100  129.960  362.3798

The grand mean $\bar x = \frac{\sum x}{n} = \frac{225.5}{24} = 9.3958$ is shown in parentheses because it is not a sum. We already know $\sum x^2 = 2193.51$ and $\sum \bar x_{.j}^2 = 362.3798$. With $R = 6$ rows and $C = 4$ columns,

$SST = \sum x^2 - n\bar x^2 = 2193.51 - 24(9.3958)^2 = 74.7646$

$SSC = R \sum \bar x_{.j}^2 - n\bar x^2 = 6(362.3798) - 24(9.3958)^2 = 2174.279 - 2118.7454 = 55.534$

$SSR = C \sum \bar x_{i.}^2 - n\bar x^2 = 4(530.0156) - 24(9.3958)^2 = 2120.0624 - 2118.7454 = 1.3170$

Source   SS       DF  MS       F           F.05               H0
Rows     1.317    5   0.2634   0.2206 ns   F(5,15) = 2.90     Row means equal
Columns  55.534   3   18.5113  15.5010 s   F(3,15) = 3.29     Column means equal
Within   17.9136  15  1.1942
Total    74.7646  23

Predictably, a Minitab run shows us to have generated a lot of rounding error, but the facts are the same.

Two-way ANOVA: McStacked versus Row, McNo
Source  DF  SS       MS       F      P
Row     5   1.3021   0.2604   0.22   0.949
McNo    3   55.5213  18.5071  15.49  0.000
Error   15  17.9263  1.1951
Total   23  74.7496
S = 1.093   R-Sq = 76.02%   R-Sq(adj) = 63.23%

c) But let's go back to the possibility that the underlying distribution is not Normal. Using the fact that the data is cross classified and assuming that the underlying distribution is not Normal, test that the medians of Machines 1 and 4 are equal. Remember that none of the methods that we used to compare medians actually involves computing the median! (3) If we go back to the table in a), we find that the two signed rank sums are $T^+ = 0$ and $T^- = 21$. According to Table 7, for a 2-sided 5% test, we reject the null hypothesis of equal medians if the smaller of the two rank sums is $\le 1$. Thus we reject the null hypothesis.

d) Using the fact that the data is cross classified and assuming that the underlying distribution is not Normal, test that the medians of all four machines are equal. (4) Solution: This is a Friedman test. $H_0: \eta_1 = \eta_2 = \eta_3 = \eta_4$, where the subscripts refer to Machines 1 through 4. $H_1$: At least one of the medians differs. First we rank the data within rows. The data appears below in columns marked $x_1$ to $x_4$ and the ranks are in columns marked $r_1$ to $r_4$.
Row  x1   r1  x2    r2    x3    r3    x4    r4
1    6.5  1   8.7   2     11.1  4     9.9   3
2    7.9  2   7.4   1     10.3  3     12.8  4
3    5.4  1   9.4   2     9.7   3     12.1  4
4    7.5  1   10.1  2     10.3  3     10.8  4
5    8.5  1   9.2   2.5   9.2   2.5   11.3  4
6    7.3  1   9.8   3     8.8   2     11.5  4
Sum       7         12.5        17.5        23

To check the ranking, note that the sum of the four rank sums is 7 + 12.5 + 17.5 + 23 = 60, and that the sum of the rank sums should be $\sum SR_i = \frac{rc(c+1)}{2} = \frac{6(4)(5)}{2} = 60$. Now compute the Friedman statistic

$\chi^2_F = \frac{12}{rc(c+1)} \sum_i SR_i^2 - 3r(c+1) = \frac{12}{6(4)(5)} \left[ 7^2 + 12.5^2 + 17.5^2 + 23^2 \right] - 3(6)(5) = \frac{12}{120}(1040.5) - 90 = 14.05$.

This problem is too large for the Friedman table, so we say that we will consider the Friedman statistic to have a Chi-squared distribution with 4 - 1 = 3 degrees of freedom. Since this is larger than $\chi^{2(3)}_{.05} = 7.8147$, we reject our null hypothesis of equal column medians.

e) To check whether c) or d) is correct do a Normality test on the entire data set. Most of this is done for you. Column (1) is the original data. Column (2) is the original data standardized. Column (3) is a cumulative Normal probability. Column (5) is column (4) divided by 24. Column (6) is the difference between Column (3) and Column (5). Please explain carefully how you complete this test, including what table you are using.
(3) [18, 53]

Row  (1)   (2)       (3)       (4)  (5)      (6)
1    5.4   -2.21650  0.013329  1    0.04167  -0.0283379
2    6.5   -1.60632  0.054101  2    0.08333  -0.0292319
3    7.3   -1.16256  0.122504  3    0.12500  -0.0024965
4    7.4   -1.10709  0.134127  4    0.16667  -0.0325396
5    7.5   -1.05162  0.146486  5    0.20833  -0.0618468
6    7.9   -0.82974  0.203343  6    0.25000  -0.0466575
7    8.5   -0.49692  0.309623  7    0.29167  0.0179560
8    8.7   -0.38598  0.349756  8    0.33333  0.0164224
9    8.8   -0.33051  0.370507  9    0.37500  -0.0044926
10   9.2   -0.10863  0.456748  10   0.41667  0.0400817
11   9.2   -0.10863  0.456748  11   0.45833  -0.0015850
12   9.4   0.00231   0.500922  12   0.50000  0.0009221
13   9.7   0.16872   0.566992  13   0.54167  0.0253256
14   9.8   0.22419   0.588696  14   0.58333  0.0053627
15   9.9   0.27966   0.610132  15   0.62500  -0.0148684
16   10.1  0.39060   0.651954  16   0.66667  -0.0147122
17   10.3  0.50154   0.692005  17   0.70833  -0.0163279
18   10.3  0.50154   0.692005  18   0.75000  -0.0579946
19   10.8  0.77889   0.781979  19   0.79167  -0.0096878
20   11.1  0.94530   0.827748  20   0.83333  -0.0055851
21   11.3  1.05624   0.854572  21   0.87500  -0.0204282
22   11.5  1.16718   0.878432  22   0.91667  -0.0382345
23   12.1  1.50001   0.933194  23   0.95833  -0.0251398
24   12.8  1.88830   0.970507  24   1.00000  -0.0294930

This problem was 95% done for you, but no one did it. The best method to use here is Lilliefors because the data is not stated by intervals, the distribution for which we are testing is Normal, and the parameters of the distribution are unknown. We began by putting the data in order and computing $z = \frac{x - \bar x}{s}$ (actually $t$) and proceeded as in the Kolmogorov-Smirnov method. For example, in the second row $z = \frac{x - \bar x}{s} = \frac{6.5 - 9.3958}{1.8028} = -1.60632$, $O = 1$ because there is only one number in each interval, and $n = \sum O = 24$. To compute the observed relative distribution, we summed the $O$ column into the CumO column and then divided by $n = 24$ so that each value of $F_o$ shows the fraction of observations that were $\le x$. $F_e$ is a fairly exact computation of the standardized Normal probability of being below $x$. It can be approximated using the Normal table.
For example, for row 2, $z = -1.60632$ and the cumulative probability will be $F_e = F(-1.60) = P(z \le -1.60) = P(z \le 0) - P(-1.60 \le z \le 0) = .5 - .4452 = .0548 \approx .054101$. The last column should be the absolute value of the difference between $F_e$ and $F_o$, but the signs have not been removed. The largest absolute value seems to be 0.0618468. According to the Lilliefors table, the 5% critical value for $n = 24$ is between .190 and .180. Since none of the numbers in the $D$ column are that large in absolute value, we cannot reject the null hypothesis that the distribution is Normal.

Row  x     z         Fe        O  CumO  Fo       D = Fe - Fo
1    5.4   -2.21650  0.013329  1  1     0.04167  -0.0283379
2    6.5   -1.60632  0.054101  1  2     0.08333  -0.0292319
3    7.3   -1.16256  0.122504  1  3     0.12500  -0.0024965
4    7.4   -1.10709  0.134127  1  4     0.16667  -0.0325396
5    7.5   -1.05162  0.146486  1  5     0.20833  -0.0618468
6    7.9   -0.82974  0.203343  1  6     0.25000  -0.0466575
7    8.5   -0.49692  0.309623  1  7     0.29167  0.0179560
8    8.7   -0.38598  0.349756  1  8     0.33333  0.0164224
9    8.8   -0.33051  0.370507  1  9     0.37500  -0.0044926
10   9.2   -0.10863  0.456748  1  10    0.41667  0.0400817
11   9.2   -0.10863  0.456748  1  11    0.45833  -0.0015850
12   9.4   0.00231   0.500922  1  12    0.50000  0.0009221
13   9.7   0.16872   0.566992  1  13    0.54167  0.0253256
14   9.8   0.22419   0.588696  1  14    0.58333  0.0053627
15   9.9   0.27966   0.610132  1  15    0.62500  -0.0148684
16   10.1  0.39060   0.651954  1  16    0.66667  -0.0147122
17   10.3  0.50154   0.692005  1  17    0.70833  -0.0163279
18   10.3  0.50154   0.692005  1  18    0.75000  -0.0579946
19   10.8  0.77889   0.781979  1  19    0.79167  -0.0096878
20   11.1  0.94530   0.827748  1  20    0.83333  -0.0055851
21   11.3  1.05624   0.854572  1  21    0.87500  -0.0204282
22   11.5  1.16718   0.878432  1  22    0.91667  -0.0382345
23   12.1  1.50001   0.933194  1  23    0.95833  -0.0251398
24   12.8  1.88830   0.970507  1  24    1.00000  -0.0294930

4) (Allen L.
Webster) A regional manager wishes to explain the amount spent on a given day, in dollars ($y$), on the basis of income in thousands of dollars ($x_1$), gender ($x_2$, a dummy variable, 1 for female) and an interaction variable ($x_3$) which is a product of the other two independent variables. The data is below (Use $\alpha = .05$). The last column was supposed to be computed by you!

Row  Expenditure y  Income x1  Gender x2  Inter x3  x2*y
1    51             40         1          40        51
2    30             25         0          0         0
3    32             27         0          0         0
4    45             32         1          32        45
5    51             45         1          45        51
6    31             29         0          0         0
7    50             42         1          42        50
8    47             38         1          38        47
9    45             30         0          0         0
10   39             29         1          29        39
Sum                                                 283

The following are given to help you. $\sum y = 421$, $\sum y^2 = 18367$, $\sum x_1 = 337$, $\sum x_1 y = 14655$, $\sum x_2 y = ?$, $\sum x_1^2 = 11793$, $\sum x_2 = 6$, $\sum x_2^2 = 6$, $\sum x_1 x_2 = 226$ and $n = 10$. You do not need all of these in this problem. [53]
a. Compute a simple regression of expenditure against $x_1$. (5)
b. Compute $R^2$ (4)
c. Compute $s_e$ (3)
d. Compute $s_{b_1}$ (the std. deviation of the slope) and do a confidence interval for $\beta_0$. On the basis of these two exercises, are either or both significant? (4.5)
e. My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Create an appropriate interval to predict her spending. (3.5) [20, 73]

a) The quantities given above are used here. You do not need all of these.

Spare Parts Computation: $\sum x_2 y = 283$, $\bar Y = \frac{421}{10} = 42.1$, $\bar X_1 = \frac{337}{10} = 33.7$, $\bar X_2 = \frac{6}{10} = 0.60$

$\sum X_1^2 - n\bar X_1^2 = SSX_1 = 11793 - 10(33.70)^2 = 436.10$*
$\sum X_2^2 - n\bar X_2^2 = SSX_2 = 6 - 10(0.60)^2 = 2.40$*†
$\sum Y^2 - n\bar Y^2 = SST = SSY = 18367 - 10(42.1)^2 = 642.90$*
$\sum X_1 Y - n\bar X_1 \bar Y = SX_1Y = 14655 - 10(33.7)(42.1) = 467.30$
$\sum X_2 Y - n\bar X_2 \bar Y = SX_2Y = 283 - 10(0.6)(42.1) = 30.4$†
$\sum X_1 X_2 - n\bar X_1 \bar X_2 = SX_1X_2 = 226 - 10(33.7)(0.60) = 23.80$†

*Must be positive. The rest may well be negative. †Needed only in next problem.

Solution: The coefficients are $b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar X \bar Y}{\sum X^2 - n\bar X^2} = \frac{467.30}{436.1} = 1.0715$ and $b_0 = \bar Y - b_1 \bar X = 42.1 - 1.0715(33.7) = 5.9905$. So $\hat Y = 5.9905 + 1.0715X$.
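The spare-parts arithmetic for the simple regression in a) can be checked from the given sums alone. This is only a sketch in plain Python; the variable names are mine and are not part of the course outline.

```python
# Sums given in the problem (n = 10 observations).
n = 10
sum_x, sum_y = 337, 421          # sum of income (x1) and expenditure (y)
sum_x2, sum_xy = 11793, 14655    # sum of x1^2 and sum of x1*y

x_bar = sum_x / n
y_bar = sum_y / n
ss_x = sum_x2 - n * x_bar ** 2       # SSX1, must be positive
s_xy = sum_xy - n * x_bar * y_bar    # SX1Y, may be negative

b1 = s_xy / ss_x                     # slope
b0 = y_bar - b1 * x_bar              # intercept

print("SSx =", round(ss_x, 2), "Sxy =", round(s_xy, 2))
print("b1 =", round(b1, 4), "b0 =", round(b0, 4))
```

Carrying full precision gives an intercept near 5.989, as in the Minitab output, rather than 5.9905; the hand value comes from rounding $b_1$ to 1.0715 before multiplying by 33.7.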
b) Compute $R^2$ (4) Solution: $SSR = b_1 S_{xy} = 1.0715(467.3) = 500.7120$. So $R^2 = \frac{SSR}{SST} = \frac{b_1 S_{xy}}{SS_y} = \frac{500.7120}{642.90} = .7788$ or $R^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{467.30^2}{436.10(642.90)} = .7789$.

c) Compute $s_e$ (3) Solution: $s_e^2 = \frac{SSE}{n-2} = \frac{SST - SSR}{n-2} = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{642.9 - 500.7120}{8} = 17.7735$ and $s_e = \sqrt{17.7735} = 4.2159$.

d) Compute $s_{b_1}$ (the std deviation of the slope) and do a confidence interval for $\beta_0$. On the basis of these two exercises, are either or both significant? (4.5) Solution: We have $n - 2 = 8$ degrees of freedom, so we use $t^{(8)}_{.025} = 2.306$. Recall $\bar X_1 = 33.7$. To test $H_0: \beta_0 = 0$ against $H_1: \beta_0 \ne 0$, compute $s_{b_0}^2 = s_e^2 \left( \frac{1}{n} + \frac{\bar X^2}{SS_x} \right) = 17.7735 \left( \frac{1}{10} + \frac{33.7^2}{436.1} \right) = 17.7735(2.7042) = 48.0630$. So $s_{b_0} = \sqrt{48.0630} = 6.9327$. The confidence interval would be $\beta_0 = 5.9905 \pm 2.306(6.9327) = 5.99 \pm 15.99$. Since the error part is larger than 5.99 in absolute value, this interval includes zero, so do not reject the null hypothesis.

To test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \ne 0$, compute $s_{b_1}^2 = s_e^2 \left( \frac{1}{SS_x} \right) = \frac{17.7735}{436.1} = 0.04076$. So $s_{b_1} = \sqrt{0.04076} = 0.20188$ and $t = \frac{b_1 - 0}{s_{b_1}} = \frac{1.0715}{0.20188} = 5.308$. The 'reject' regions are below $-t^{(8)}_{.025} = -2.306$ and above $t^{(8)}_{.025} = 2.306$. Since our computed t ratio is in the upper 'reject' zone, reject the null hypothesis.

e) My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Create an appropriate interval to predict her spending. (3.5) [20, 73] The appropriate interval is a prediction interval. Solution: The Prediction Interval is $Y_0 = \hat Y_0 \pm t s_Y$, where $s_Y^2 = s_e^2 \left( \frac{1}{n} + \frac{(X_0 - \bar X)^2}{SS_x} + 1 \right)$ and $\hat Y_0 = b_0 + b_1 X_0$. Here $X_0 = 25$ and $\bar X_1 = 33.7$. We have already found $\hat Y = 5.9905 + 1.0715X$, $s_e^2 = 17.7735$ and $SS_x = 436.10$. So $\hat Y_0 = 5.9905 + 1.0715(25) = 32.778$, $t^{(8)}_{.025} = 2.306$, $s_Y^2 = 17.7735 \left( \frac{1}{10} + \frac{(25 - 33.7)^2}{436.10} + 1 \right) = 17.7735(1.1 + 0.17356) = 22.6356$ and $s_Y = 4.7577$, so $Y_0 = \hat Y_0 \pm t s_Y = 32.778 \pm 2.306(4.7577) = 32.78 \pm 10.97$.

5) Data from the previous problem is repeated below. (Use $\alpha = .05$).
Row  Expenditure y  Income x1  Gender x2  Inter x3
1    51             40         1          40
2    30             25         0          0
3    32             27         0          0
4    45             32         1          32
5    51             45         1          45
6    31             29         0          0
7    50             42         1          42
8    47             38         1          38
9    45             30         0          0
10   39             29         1          29

The following are given to help you. $\sum y = 421$, $\sum y^2 = 18367$, $\sum x_1 = 337$, $\sum x_1 y = 14655$, $\sum x_2 y = ?$, $\sum x_1^2 = 11793$, $\sum x_2 = 6$, $\sum x_2^2 = 6$, $\sum x_1 x_2 = 226$ and $n = 10$.
a. Do a multiple regression of expenditure against $x_1$ and $x_2$. This involves a simultaneous equation solution. Attempting to recycle $b_1$ from the previous page won't work. (12)
b. Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. The F test here is one to see if adding a new independent variable improves the regression. (4)
c. Compute the regression sum of squares and use it in an ANOVA F test to test the usefulness of this regression. (5)
d. My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Use your regression to predict how much she will spend. Do a confidence interval and a prediction interval. (4) [25, 98]

Solution: a) This is copied from the last problem.

Spare Parts Computation: $\bar Y = \frac{421}{10} = 42.1$, $\bar X_1 = \frac{337}{10} = 33.7$, $\bar X_2 = \frac{6}{10} = 0.60$

$\sum X_1^2 - n\bar X_1^2 = SSX_1 = 11793 - 10(33.70)^2 = 436.10$*
$\sum X_2^2 - n\bar X_2^2 = SSX_2 = 6 - 10(0.60)^2 = 2.40$*†
$\sum Y^2 - n\bar Y^2 = SST = SSY = 18367 - 10(42.1)^2 = 642.90$*
$\sum X_1 Y - n\bar X_1 \bar Y = SX_1Y = 14655 - 10(33.7)(42.1) = 467.30$
$\sum X_2 Y - n\bar X_2 \bar Y = SX_2Y = 283 - 10(0.6)(42.1) = 30.4$†
$\sum X_1 X_2 - n\bar X_1 \bar X_2 = SX_1X_2 = 226 - 10(33.7)(0.60) = 23.80$†

*Must be positive. The rest may well be negative.

We substitute these numbers into the Simplified Normal Equations:

$\sum X_1 Y - n\bar X_1 \bar Y = b_1 \left( \sum X_1^2 - n\bar X_1^2 \right) + b_2 \left( \sum X_1 X_2 - n\bar X_1 \bar X_2 \right)$
$\sum X_2 Y - n\bar X_2 \bar Y = b_1 \left( \sum X_1 X_2 - n\bar X_1 \bar X_2 \right) + b_2 \left( \sum X_2^2 - n\bar X_2^2 \right)$,

which are

$467.30 = 436.10 b_1 + 23.80 b_2$
$30.40 = 23.80 b_1 + 2.40 b_2$

and solve them as two equations in two unknowns for $b_1$ and $b_2$.
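The pair of simplified normal equations above can also be solved directly, which gives a quick check on the elimination done by hand. The sketch below uses Cramer's rule in plain Python; this is my choice of method for a 2-by-2 system, not the method in the outline, and the variable names are hypothetical.

```python
# Spare parts from the computation above.
ss_x1, ss_x2, s_x1x2 = 436.10, 2.40, 23.80
s_x1y, s_x2y = 467.30, 30.40
y_bar, x1_bar, x2_bar = 42.1, 33.7, 0.60

# Solve   s_x1y = ss_x1*b1  + s_x1x2*b2
#         s_x2y = s_x1x2*b1 + ss_x2*b2     by Cramer's rule.
det = ss_x1 * ss_x2 - s_x1x2 ** 2
b1 = (s_x1y * ss_x2 - s_x1x2 * s_x2y) / det
b2 = (ss_x1 * s_x2y - s_x1x2 * s_x1y) / det
b0 = y_bar - b1 * x1_bar - b2 * x2_bar

print("b1 =", round(b1, 4), "b2 =", round(b2, 4), "b0 =", round(b0, 4))

# Substitute back into both normal equations as a check.
print(round(ss_x1 * b1 + s_x1x2 * b2, 2), round(s_x1x2 * b1 + ss_x2 * b2, 2))
```

The two values on the last line should reproduce 467.30 and 30.40. Small differences from hand results in the third or fourth decimal place are rounding.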
These are a fairly tough pair of equations to solve until we notice that, if we multiply 2.40 by $\frac{23.80}{2.40} = 9.916667$ we get 23.80. We multiply the second equation by 9.916667 and the equations become

$467.30 = 436.10 b_1 + 23.80 b_2$
$301.46667 = 236.01667 b_1 + 23.80 b_2$.

If we subtract the second from the first, we get $165.83333 = 200.08333 b_1$. This means that $b_1 = \frac{165.83333}{200.08333} = 0.8288$. Now remember that $30.40 = 23.80 b_1 + 2.40 b_2$. This can be rewritten as $2.40 b_2 = 30.40 - 23.80 b_1$. If we divide by 2.40, we get $b_2 = 12.6667 - 9.916667 b_1$. Let's substitute in $b_1 = \frac{165.83333}{200.08333} = 0.8288$. So $b_2 = 12.6667 - 9.916667(0.8288) = 12.6667 - 8.21945 = 4.4476$. (It's worth checking your work by substituting your values of $b_1$ and $b_2$ back into the normal equations.) Finally we get $b_0$ by using $\bar Y = 42.1$, $\bar X_1 = 33.7$ and $\bar X_2 = 0.60$ in $b_0 = \bar Y - b_1 \bar X_1 - b_2 \bar X_2 = 42.1 - 0.82888(33.7) - 4.4476(0.60) = 42.1 - 27.9333 - 2.6686 = 11.4981$. Thus our equation is $\hat Y = b_0 + b_1 X_1 + b_2 X_2 = 11.4981 + 0.8288X_1 + 4.4476X_2$.

b) Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. The F test here is one to see if adding a new independent variable improves the regression. (4) Solution: From the first regression we have $SST = SS_y = 642.90$, $R^2 = R^2_{Y.1} = 0.7788$*, $\sum X_1 Y - n\bar X_1 \bar Y = SX_1Y = 467.30$, $b_1 = 1.0715$ and $SSR = b_1 S_{xy} = 1.0715(467.3) = 500.7120$. From the second regression, $\hat Y = b_0 + b_1 X_1 + b_2 X_2 = 11.4981 + 0.8288X_1 + 4.4476X_2$ and $\sum X_2 Y - n\bar X_2 \bar Y = SX_2Y = 30.4$. This means that $SSR = b_1 Sx_1y + b_2 Sx_2y = 0.8288(467.30) + 4.4476(30.4) = 387.2982 + 135.2070 = 522.5052$ and $R^2 = R^2_{Y.12} = \frac{SSR}{SST} = \frac{522.5052}{642.90} = 0.8127$*. If we use $\bar R^2$, which is $R^2$ adjusted for degrees of freedom, for the first regression, the number of independent variables was $k = 1$ and $n - k - 1 = 8$, so $\bar R^2 = \frac{(n-1)R^2 - k}{n - k - 1} = \frac{9(0.7788) - 1}{8} = .7511$, and for the second regression $k = 2$ and $n - k - 1 = 7$, so $\bar R^2 = \frac{(n-1)R^2 - k}{n - k - 1} = \frac{9(0.8127) - 2}{7} = .7592$.
R-squared adjusted is supposed to rise if our new variable has any explanatory power. (Recall that $n - k - 1 = 8$ for the first regression and that, with $k = 2$, $n - k - 1 = 7$ for the second.) Note: * These numbers must be positive. The rest may well be negative.

There are two ways to do the F test. We can use the second regression to give us $SSE = SST - SSR_2 = 642.90 - 522.5052 = 120.3948$. In the second regression, the explained sum of squares rises by 522.5052 - 500.7120 = 21.7932. We can make an ANOVA table for looking at a new variable as follows. Assume that we have $SSR_1 = 500.7120$ for the first regression on $k$ independent variables and add $r$ new independent variables and get a new $SSR_2 = 522.5052$.

Source            SS            DF             MS    Fcalc      F
First Regression  SSR1          k              MSR1  MSR1/MSE   F(k, n-k-r-1)
2nd Regression    SSR2 - SSR1   r              MSR2  MSR2/MSE   F(r, n-k-r-1)
Error             SSE           n - k - r - 1  MSE
Total             SST           n - 1

Source            SS        DF  MS        Fcalc     F.05
First Regression  500.7120  1   500.7120  29.11 s   F(1,7) = 5.59
2nd Regression    21.7932   1   21.7932   1.27 ns   F(1,7) = 5.59
Error             120.3948  7   17.1993
Total             642.90    9

We can get the same results using $R^2$. Remember $R^2_{Y.12} = 0.8127$ and $R^2_{Y.1} = 0.7788$.

Source            SS                                               DF  MS      Fcalc     F.05
First Regression  R2(Y.1) = 0.7788                                 1   .7788   29.10 s   F(1,7) = 5.59
2nd Regression    R2(Y.12) - R2(Y.1) = 0.8127 - 0.7788 = .0339     1   .0339   1.27 ns   F(1,7) = 5.59
Error             1 - R2(Y.12) = 1 - 0.8127 = .1873                7   .02676
Total             Column must add to 1.000                         9

Note that this seems to show that the second independent variable made no significant improvement in the amount of variation in Y explained by the independent variables.

c) Compute the regression sum of squares and use it in an ANOVA F test to test the usefulness of this regression. (5) We already did this in b). Remember $SSE = SST - SSR_2 = 642.90 - 522.5052 = 120.3948$.

Source                    SS        DF  MS      Fcalc     F.05
First and 2nd Regression  522.5052  2   261.25  15.19 s   F(2,7) = 4.74
Error                     120.3948  7   17.20
Total                     642.90    9

The null hypothesis is no connection between Y and the X's. It is rejected because the calculated F is above the table value.
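Both F tests in b) and c) come from the same three sums of squares, so they are easy to recompute. The following is a minimal sketch in plain Python; the variable names are mine, and the 5% table values 5.59 and 4.74 are simply copied from the text rather than computed.

```python
n = 10          # observations
k, r = 1, 1     # variables in the first regression, variables added
sst = 642.90
ssr1 = 500.7120   # income only
ssr2 = 522.5052   # income and gender

sse = sst - ssr2
df_error = n - k - r - 1
mse = sse / df_error

f_added = ((ssr2 - ssr1) / r) / mse      # does gender add anything?
f_overall = (ssr2 / (k + r)) / mse       # is the regression useful at all?

F_1_7, F_2_7 = 5.59, 4.74                # 5% values quoted from the F table

print("MSE =", round(mse, 4))
print("F for added variable =", round(f_added, 2), "reject" if f_added > F_1_7 else "do not reject")
print("F for whole regression =", round(f_overall, 2), "reject" if f_overall > F_2_7 else "do not reject")
```

The first F asks whether the extra sum of squares explained by Gender is large relative to the error mean square; the second asks the same question of both variables together.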
d) My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Use your regression to predict how much she will spend. Do a confidence interval and a prediction interval. Solution: From the second regression $\hat Y = b_0 + b_1 X_1 + b_2 X_2 = 11.4981 + 0.8288X_1 + 4.4476X_2$. The problem says $X_1 = 25$, and, since Aunt Gertrude is female, $X_2 = 1$. So $\hat Y = 11.4981 + 0.8288(25) + 4.4476(1) = 11.4981 + 20.7200 + 4.4476 = 36.67$. The error mean square in the ANOVA is 17.20 and has 7 degrees of freedom. So, for 2-sided intervals with $\alpha = .05$, use $t^{(7)}_{.025} = 2.365$ and $s_e = \sqrt{17.20} = 4.1473$. The confidence interval is an interval for an average value of Y when the independent variables are 25 and 1, but there is nothing average about my Aunt Gert, so the prediction interval is more appropriate than the confidence interval. The outline says that an approximate confidence interval is $\mu_{Y_0} = \hat Y_0 \pm t \frac{s_e}{\sqrt n} = 36.67 \pm 2.365 \frac{4.1473}{\sqrt{10}} = 36.67 \pm 3.10$ and an approximate prediction interval is $Y_0 = \hat Y_0 \pm t s_e = 36.67 \pm 2.365(4.1473) = 36.67 \pm 9.81$.

The regression output from Minitab follows.

MTB > Regress c1 1 c2 ;
SUBC> Constant;
SUBC> Brief 3.

Regression Analysis: Expend versus Income

The regression equation is
Expend = 5.99 + 1.07 Income

Predictor  Coef    SE Coef  T     P
Constant   5.989   6.932    0.86  0.413
Income     1.0715  0.2019   5.31  0.001

S = 4.21556   R-Sq = 77.9%   R-Sq(adj) = 75.1%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   500.73  500.73  28.18  0.001
Residual Error  8   142.17  17.77
Total           9   642.90

Obs  Income  Expend  Fit    SE Fit  Residual  St Resid
1    40.0    51.00   48.85  1.84    2.15      0.57
2    25.0    30.00   32.78  2.20    -2.78     -0.77
3    27.0    32.00   34.92  1.90    -2.92     -0.78
4    32.0    45.00   40.28  1.38    4.72      1.19
5    45.0    51.00   54.21  2.64    -3.21     -0.98
6    29.0    31.00   37.06  1.64    -6.06     -1.56
7    42.0    50.00   50.99  2.14    -0.99     -0.27
8    38.0    47.00   46.71  1.59    0.29      0.07
9    30.0    45.00   38.14  1.53    6.86      1.75
10   29.0    39.00   37.06  1.64    1.94      0.50

MTB > Regress c1 2 c2 c3;
SUBC> Constant;
SUBC> Brief 3.
Regression Analysis: Expend versus Income, Gender

The regression equation is
Expend = 11.5 + 0.829 Income + 4.45 Gender

Predictor  Coef    SE Coef  T     P
Constant   11.500  8.396    1.37  0.213
Income     0.8288  0.2932   2.83  0.026
Gender     4.448   3.952    1.13  0.298

S = 4.14707   R-Sq = 81.3%   R-Sq(adj) = 75.9%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      2   522.51  261.26  15.19  0.003
Residual Error  7   120.39  17.20
Total           9   642.90

Source  DF  Seq SS
Income  1   500.73
Gender  1   21.78

Obs  Income  Expend  Fit    SE Fit  Residual  St Resid
1    40.0    51.00   49.10  1.83    1.90      0.51
2    25.0    30.00   32.22  2.22    -2.22     -0.63
3    27.0    32.00   33.88  2.09    -1.88     -0.52
4    32.0    45.00   42.47  2.37    2.53      0.74
5    45.0    51.00   53.24  2.74    -2.24     -0.72
6    29.0    31.00   35.54  2.11    -4.54     -1.27
7    42.0    50.00   50.76  2.12    -0.76     -0.21
8    38.0    47.00   47.44  1.70    -0.44     -0.12
9    30.0    45.00   36.36  2.18    8.64      2.45R
10   29.0    39.00   39.98  3.05    -0.98     -0.35

R denotes an observation with a large standardized residual.

6) Data from the previous problem is repeated below. (Use $\alpha = .05$).

Row  Expenditure y  Income x1  Gender x2  Inter x3
1    51             40         1          40
2    30             25         0          0
3    32             27         0          0
4    45             32         1          32
5    51             45         1          45
6    31             29         0          0
7    50             42         1          42
8    47             38         1          38
9    45             30         0          0
10   39             29         1          29

Part of the printout from the last regression follows.

Regression Analysis: Expend versus Income, Gender, Inter

The regression equation is
Expend = - 28.5 + 2.27 Income + 48.8 Gender - 1.56 Inter

Predictor  Coef    SE Coef  T      P
Constant   -28.53  27.62    -1.03  0.342
Income     2.2712  0.9930   2.29   0.062
Gender     48.80   29.61    1.65   0.150
Inter      -1.557  1.032    -1.51  0.182

S = 3.81355   R-Sq = 86.4%   R-Sq(adj) = 79.6%

a) Check the above printout i) to see if adding the interaction variable improved our regression and ii) to compute a partial correlation between interaction and expenditure. Explain the meaning of the partial correlation. (2.5)
b) Use the R-squared here and the R-squared in your first regression to do an F-test to show if the gender and interaction variables together accomplished anything.
(2)
c) Compute the sample correlation between income and expenditure and test it for significance. (4.5)
d) Test the same correlation to see if it is .9. (5)
e) Compute the rank correlation between income and expenditure and test it for significance. Please do not forget to rank the variables first. (5) [19, 117]

Solution: a) Check the above printout i) to see if adding the interaction variable improved our regression and ii) to compute a partial correlation between interaction and expenditure. Explain the meaning of the partial correlation. (2.5)
i) Well, R-squared adjusted rose, but the fact that none of the coefficients are significant at the 5% level indicates that we have accomplished very little.
ii) The outline says $r^2_{Y3.12} = \frac{t_3^2}{t_3^2 + df} = \frac{1.51^2}{1.51^2 + 6} = 0.2754$. This indicates that after Income and Gender are accounted for, there is a relatively weak linear relationship between Interaction and Expenditure.

b) Use the R-squared here and the R-squared in your first regression to do an F-test to show if the gender and interaction variables together accomplished anything. (2) Here $n = 10$. First regression: $k = 1$; $R^2 = .7788$, $\bar R^2 = .7511$. This regression: $k = 3$; $R^2 = .864$, $\bar R^2 = .796$.

Source                SS                                               DF  MS     Fcalc     F.05
First Regression      R2(Y.1) = 0.7788                                 1   .7788  34.31 s   F(1,6) = 5.99
2nd & 3rd Regression  R2(Y.123) - R2(Y.1) = 0.864 - 0.7788 = .085      2   .0425  1.87 ns   F(2,6) = 5.14
Error                 1 - R2(Y.123) = 1 - 0.864 = .136                 6   .0227
Total                 Column must add to 1.000                         9

No surprise! Although the adjusted R-squared rose, the insignificant F means that both variables together do little.

c) Compute the sample correlation between income and expenditure and test it for significance. (4.5) We already know that $R^2 = .7788$, $n = 10$ and $df = n - k - 1 = 8$. Since the regression coefficient is positive, we can say $r = +\sqrt{.7788} = .8825$. The test of the null hypothesis $H_0: \rho_{xy} = 0$ against $H_1: \rho_{xy} \ne 0$ is $t^{(n-2)} = \frac{r}{s_r} = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.8825}{\sqrt{\frac{1 - 0.7788}{8}}} = \frac{.8825}{0.16628} = 5.307$, which agrees, apart from rounding, with the t ratio for the slope in the simple regression. The 'reject' regions are below $-t^{(8)}_{.025} = -2.306$ and above $t^{(8)}_{.025} = 2.306$.
Since our computed t ratio is in the upper 'reject' zone, reject the null hypothesis.

d) Test the same correlation to see if it is .9. (5) Test $H_0: \rho_{xy} = 0.9$ against $H_1: \rho_{xy} \ne 0.9$ when $n = 10$, $r = .8825$, $r^2 = .7788$ and $\alpha = .05$. This time compute Fisher's z-transformation (because $\rho_0$ is not zero).

$\tilde z = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right) = \frac{1}{2} \ln \left( \frac{1.8825}{0.1175} \right) = \frac{1}{2} \ln(16.02128) = \frac{1}{2}(2.77392) = 1.38696$

$\mu_z = \frac{1}{2} \ln \left( \frac{1 + \rho_0}{1 - \rho_0} \right) = \frac{1}{2} \ln \left( \frac{1.9}{0.1} \right) = \frac{1}{2} \ln(19.0000) = \frac{1}{2}(2.94444) = 1.47222$

$s_z = \sqrt{\frac{1}{n - 3}} = \sqrt{\frac{1}{7}} = 0.37796$

Finally $t = \frac{\tilde z - \mu_z}{s_z} = \frac{1.38696 - 1.47222}{0.37796} = -0.226$. Compare this with $\pm t^{(n-2)}_{\alpha/2} = \pm t^{(8)}_{.025} = \pm 2.306$. Since -0.226 lies between these two values, do not reject the null hypothesis.

e) Compute the rank correlation between income and expenditure and test it for significance. Please do not forget to rank the variables first. (5) [19, 117] The original data is printed below in columns (1) and (2). The expenditure numbers are ranked in (3) and the income numbers are ranked in (4). $d$ is the difference between the ranks, which is squared in $d^2$. Note that the rank columns must add to $\frac{n(n+1)}{2} = \frac{10(11)}{2} = 55$, and that the $d$ column must add to zero.

     (1)          (2)     (3)               (4)          (5)   (6)
Row  Expenditure  Income  Rank Expenditure  Rank Income  d     d^2
1    51           40      9.5               8.0          1.5   2.25
2    30           25      1.0               1.0          0.0   0.00
3    32           27      3.0               2.0          1.0   1.00
4    45           32      5.5               6.0          -0.5  0.25
5    51           45      9.5               10.0         -0.5  0.25
6    31           29      2.0               3.5          -1.5  2.25
7    50           42      8.0               9.0          -1.0  1.00
8    47           38      7.0               7.0          0.0   0.00
9    45           30      5.5               5.0          0.5   0.25
10   39           29      4.0               3.5          0.5   0.25
Sum                       55.0              55.0         0.0   7.50

A correlation coefficient between the ranks, $r_x$ and $r_y$, can be computed as in c) above, but it is easier to compute $d = r_x - r_y$, and then $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(7.50)}{10(100 - 1)} = 1 - .0455 = .9545$. This can be given a t test for $H_0: \rho_s = 0$ as in c) above, but for $n$ between 4 and 30, a special table should be used. Part of the table from the Supplement is repeated below. We do not reject the null hypothesis of no rank correlation if $r_s$ is between ±.6364.
Since our value of the rank correlation is above +.6364, reject the null hypothesis.

n    .100   .050   .025   .010   .005   .001
4    .8000  .8000
5    .7000  .8000  .9000  .9000
6    .6000  .7714  .8286  .8857  .9429
7    .5357  .6786  .7450  .8571  .8929  .9643
8    .5000  .6190  .7143  .8095  .8571  .9286
9    .4667  .5833  .6833  .7667  .8167  .9000
10   .4424  .5515  .6364  .7333  .8167  .8667

7) The following are entirely tests of proportions. Samples of senior executives were polled on three separate dates as to the business outlook. Please test the following using a 99% confidence level.
a) In Year 1 a third of executives were unsure or predicting no change. (2)
b) The proportion that predicted no change or were undecided did not change between Year 1 and Year 2. Do this problem using a confidence interval, a test ratio and a critical value (5) or just by using one of these three methods. (3)
c) In b) do a 97% confidence interval for the difference between the proportions that predicted no change. Do not use a value from the t-table to do this. (2)
d) The executives' outlook did not change over time (5). [14, 131]

Outlook    Year 1  Year 2  Year 3  Total
Go Up      152     177     101     430
Go Down    104     72      36      212
? or Same  144     152     265     561
Total      400     401     402     1203

Solution: a) In Year 1 a third of executives were unsure or predicting no change. (2) From Table 4 of the syllabus supplement, we have the following for a Proportion.
Confidence Interval: $p = \bar p \pm z_{\alpha/2} s_{\bar p}$, where $s_{\bar p} = \sqrt{\frac{\bar p \bar q}{n}}$ and $\bar q = 1 - \bar p$
Hypotheses: $H_0: p = p_0$, $H_1: p \ne p_0$
Test Ratio: $z = \frac{\bar p - p_0}{\sigma_{\bar p}}$, where $\sigma_{\bar p} = \sqrt{\frac{p_0 q_0}{n}}$ and $q_0 = 1 - p_0$
Critical Value: $\bar p_{cv} = p_0 \pm z_{\alpha/2} \sigma_{\bar p}$

Here $\bar p = \frac{144}{400} = .360$. Note that a few people told me that they either did or didn't think that .36 was close to $\frac{1}{3}$. That's a good way to make me suspect that you haven't learned anything in this course. Our hypotheses are $H_0: p = \frac{1}{3}$ and $H_1: p \ne \frac{1}{3}$, with $\alpha = .01$, $z_{\alpha/2} = z_{.005} = 2.576$, $\sigma_{\bar p} = \sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{\frac{1}{3} \cdot \frac{2}{3}}{400}} = 0.02357$ and $\bar p = .360$.

Test Ratio Method: $z = \frac{\bar p - p_0}{\sigma_{\bar p}} = \frac{.360 - \frac{1}{3}}{.02357} = 1.13137$. This is between $\pm 2.576$, so do not reject $H_0$.

Critical Value Method: $\bar p_{cv} = p_0 \pm z_{\alpha/2} \sigma_{\bar p} = \frac{1}{3} \pm 2.576(0.02357) = \frac{1}{3} \pm 0.0607$.
Do not 3 2.576 0.02357 reject the null hypothesis if p is between .273 and .283. This interval includes .36 so do not reject H 0 . Confidence Interval Method: s p pq .360 .640 0.02400 n 400 p p z 2 s p .36 196 . .024 .36.047 or .313 to .407. Since this interval includes 1 3 do not reject H 0 . 33 252y0741 5/7/07 b) The proportion that predicted no change or were undecided did not change between Year 1 and Year 2. Do this problem using a confidence interval, a test ratio and a critical value (5) or just by using one of these three methods. (3) From Table 4 of the syllabus supplement, we have the following. Interval for Confidence Hypotheses Test Ratio Critical Value Interval pcv p0 z 2 p Difference p p 0 H 0 : p p 0 p p z 2 s p z between If p 0 p H 1 : p p 0 p p1 p 2 proportions 1 1 If p 0 p p 0 q 0 p 0 p 01 p 02 p1 q1 p 2 q 2 q 1 p n n 2 1 s p p 01q 01 p 02 q 02 p or p 0 0 n1 n2 n1 n2 n1 p1 n 2 p 2 p0 n1 n 2 Or use s p H 1 : p1 p 2 or p 0 . .01 , z z .005 2.576 , H 0 : p1 p 2 or p 0 2 p1 144 152 .3600 and p 2 .3791 400 401 n p n2 p2 144 152 .3695 Test Ratio Method: p0 1 1 n1 n2 400 401 p p0 q 0 1 n1 1 n2 .3695.6305 1 400 1 401 .00116 .03413 p p1 p 2 144 400 152 401 .3600 .3791 .0191 p p 0 .0191 0 z 0.5596 . Between 2.576 so do not reject H 0 . p .03413 Critical Value Method: pcv p0 z p 0 2.5760.03413 0.0879 . This 2 interval includes -.0191 so do not reject H 0 . Confidence Interval Method: s p p1 q1 p 2 q 2 .3600 .6400 .3791 .6209 400 401 n1 n2 .00057600 .00058699 .00116299 .034103 . So p p z s p 2 .0191 2.576 .034103 .0191 .08784 . Since the error part of the interval is larger in absolute value than -.0191, the interval includes zero. So do not reject H 0 . c) In b) do a 97% confidence interval for the difference between the proportions that predicted no change. Do not use a value from the t-table to do this. (2) .03 , z z .015 and s p .034103. Make a diagram! Show a Normal curve for z with a 2 center at zero. 
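The two-proportion calculations in b) and the 97% interval asked for in c) can be cross-checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `z_upper` and `interval` are made up for this illustration, and `NormalDist().inv_cdf` stands in for the Normal table lookup.

```python
from math import sqrt
from statistics import NormalDist


def z_upper(tail_area):
    """Return the z value that has the given probability above it."""
    return NormalDist().inv_cdf(1 - tail_area)


x1, n1 = 144, 400  # Year 1: '? or Same' count and sample size
x2, n2 = 152, 401  # Year 2: '? or Same' count and sample size
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Test ratio: under H0 the two proportions are equal, so pool them.
p0 = (x1 + x2) / (n1 + n2)
sigma = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
z = diff / sigma
reject_h0 = abs(z) > z_upper(0.01 / 2)

# Confidence intervals use the unpooled standard error.
s = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def interval(alpha):
    """Two-sided (1 - alpha) confidence interval for p1 - p2."""
    half_width = z_upper(alpha / 2) * s
    return diff - half_width, diff + half_width


ci99 = interval(0.01)
ci97 = interval(0.03)

print(f"z = {z:.4f}, reject H0 at 1%: {reject_h0}")
print(f"99% interval: {ci99[0]:.4f} to {ci99[1]:.4f}")
print(f"97% interval: {ci97[0]:.4f} to {ci97[1]:.4f}")
```

Because the code carries full precision instead of rounded intermediate values, its results can differ from the hand calculation in the third or fourth decimal place.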
By definition $z_{.015}$ is the point with 1.5% above it. Since the probability above zero is .5, the diagram should show $P(0 \le z \le z_{.015}) = .5 - .0150 = .4850$. According to the Normal table $P(0 \le z \le 2.17) = .4850$, so $z_{.015} = 2.17$. If we wish to check this, $P(z \ge 2.17) = P(z \ge 0) - P(0 \le z \le 2.17) = .5 - .4850 = .0150$. So our interval is $\Delta p = \Delta\bar p \pm z_{\alpha/2}\, s_{\Delta p} = -.0191 \pm 2.17(.034103) = -.0191 \pm .0740$ or -.093 to .055.

d) The executives' outlook did not change over time (5). [14, 131]
A test of multiple proportions is a Chi-square Test of Homogeneity. $H_0$ is homogeneity. The given table is repeated below with row sums and the fraction in each row added.

Outlook      Year 1   Year 2   Year 3   Total    p_r
Go Up          152      177      101      430   .3574
Go Down        104       72       36      212   .1762
? or Same      144      152      265      561   .4663
Total          400      401      402     1203   .9999

$p_r$ is the fraction of the total of 1203 in each row. Our observed data can be displayed as

O        Year 1   Year 2   Year 3   Total
Up         152      177      101      430
Down       104       72       36      212
Same       144      152      265      561
Total      400      401      402     1203

and our expected values are

E        Year 1    Year 2    Year 3     Total
Up       142.98    143.33    143.69    430.00
Down      70.49     70.67     70.84    212.00
Same     186.53    187.00    187.47    561.00
Total    400.00    401.00    402.00   1203.00

where 142.98, for example, is .3574(400), computed with the unrounded fraction 430/1203. (.3574 is the fraction of the data that was observed in the 'Up' row and 400 is the sum of the 'Year 1' column.) The $\chi^2$ computations are done two ways below, with each observed count paired with the expected count for the same cell. Degrees of freedom are $(r-1)(c-1) = 2(2) = 4$.

Cell            O         E       O^2/E      O-E     (O-E)^2/E
Up, Year 1     152     142.98    161.59     9.024      0.570
Up, Year 2     177     143.33    218.57    33.667      7.908
Up, Year 3     101     143.69     70.99   -42.691     12.684
Down, Year 1   104      70.49    153.44    33.510     15.930
Down, Year 2    72      70.67     73.36     1.333      0.025
Down, Year 3    36      70.84     18.29   -34.843     17.137
Same, Year 1   144     186.53    111.16   -42.534      9.699
Same, Year 2   152     187.00    123.55   -35.000      6.551
Same, Year 3   265     187.47    374.60    77.534     32.067
Sum           1203    1203.00   1305.57     0.000    102.569

$\chi^2 = \sum \frac{O^2}{E} - n = 1305.57 - 1203 = 102.57$ or $\chi^2 = \sum \frac{(O-E)^2}{E} = 102.57$, and $\chi^{2\,(4)}_{.05} = 9.4877$. So, since our computed chi-square is larger than the chi-square from the table, reject $H_0$ and conclude that the executives' outlook did change over time. (At the 1% level the table value is $\chi^{2\,(4)}_{.01} = 13.2767$, and the conclusion is the same.)
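The homogeneity test in d) can be checked with a short Python sketch that builds each expected count as (row total)(column total)/(grand total) and pairs it with the observed count for the same cell. It uses only built-in functions; the variable names are chosen for this illustration, and the critical value 9.4877 is simply copied from the solution above and not computed.

```python
observed = {
    "Go Up":     [152, 177, 101],
    "Go Down":   [104, 72, 36],
    "? or Same": [144, 152, 265],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

# Expected count for each cell under homogeneity.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Pair every observed count with the expected count of the same cell.
pairs = [(o, e)
         for o_row, e_row in zip(rows, expected)
         for o, e in zip(o_row, e_row)]

chi2 = sum((o - e) ** 2 / e for o, e in pairs)        # sum of (O-E)^2/E
chi2_shortcut = sum(o ** 2 / e for o, e in pairs) - n  # sum of O^2/E, minus n

df = (len(rows) - 1) * (len(col_totals) - 1)
table_value = 9.4877  # chi-square table value quoted in the solution (4 df)
reject_h0 = chi2 > table_value

print(f"df = {df}, chi-square = {chi2:.2f} (shortcut {chi2_shortcut:.2f})")
print("reject H0" if reject_h0 else "do not reject H0")
```

Both formulas must give the same statistic, which is a useful check that no observed count has been matched with the wrong expected count.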