252y0781 12/11/07

ECO252 QBA2 Final Exam, December 12-15, 2007, Version 1
Name and Class hour:__________KEY_______

I. (25+ points) Do all the following. Note that answers without reasons and/or citation of appropriate statistical tests receive no credit. Most answers require a statistical test; that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven't done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else). There are over 150 possible points, but the exam is normed on 75 points.

In their 2000 text the Lees noted that before 1979 the Federal Reserve targeted interest rates, letting the money supply grow in such a way that interest rates would remain stable. After 1979, the Fed switched to targeting the money supply. The Lees did a regression of the money supply against GNP (I had to replace this with GDP), the prime rate (PrRt) and a dummy variable (Dummy) that is 1 before 1979 and zero from 1979 until 1990, when their analysis stops. They report a high R-squared and extremely significant coefficients for the prime rate, GNP and the dummy variable, which seems to tell us that the Fed's change of regime had a real effect on the money supply. Later in the text they suggest the addition of an interaction variable (GDPPr), which is the product of the prime rate and GDP, and a second interaction variable (GDPdum), the product of GDP and the dummy variable. I added the year and its square measured from 1958, population, and GDP squared. My attempt to update the Lees' results was terribly discouraging. The dependent variable is M1 or its logarithm (logM1).
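Before the Minitab session, it may help to see exactly how the constructed columns are defined. The sketch below (Python with pandas; the DataFrame and its column names are hypothetical stand-ins for the Minitab worksheet, and only a few illustrative rows are shown) builds the dummy, the two interactions, the time trend and the log of M1 as described above.

    import numpy as np
    import pandas as pd

    # A minimal sketch of the constructed variables; values are a few rows
    # copied from the data display below, purely for illustration.
    df = pd.DataFrame({
        "Year": [1959, 1960, 1979, 1980],
        "M1":   [140.0, 140.7, 381.8, 408.5],
        "PrRt": [4.50, 5.00, 11.65, 12.63],
        "GDP":  [506.60, 526.40, 2563.30, 2789.50],
    })
    df["Dummy"]  = (df["Year"] < 1979).astype(int)  # 1 before 1979, 0 from 1979 on
    df["GDPPr"]  = df["GDP"] * df["PrRt"]           # prime rate x GDP interaction
    df["GDPdum"] = df["GDP"] * df["Dummy"]          # GDP x dummy interaction
    df["year"]   = df["Year"] - 1958                # time trend measured from 1958
    df["yearsq"] = df["year"] ** 2
    df["GDPsq"]  = df["GDP"] ** 2
    df["logM1"]  = np.log(df["M1"])
    print(df)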
————— 12/3/2007 11:31:46 PM ————————————————————
Welcome to Minitab, press F1 for help.
MTB > WOpen "C:\Documents and Settings\RBOVE\My Documents\Minitab\M1PrRGDP.MTW".
Retrieving worksheet from file: 'C:\Documents and Settings\RBOVE\My Documents\Minitab\M1PrRGDP.MTW'
Worksheet was saved on Mon Dec 03 2007

MTB > print c5 c2 c4 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15

Data Display

Row    C5      M1   PrRt         GDP  Dummy   GDPPr  GDPdum  year  yearsq
  1  1959   140.0   4.50     $506.60      1    2280   506.6     1       1
  2  1960   140.7   5.00     $526.40      1    2632   526.4     2       4
  3  1961   145.2   4.50     $544.70      1    2451   544.7     3       9
  4  1962   147.8   4.50     $585.60      1    2635   585.6     4      16
  5  1963   153.3   4.50     $617.70      1    2780   617.7     5      25
  6  1964   160.3   4.50     $663.60      1    2986   663.6     6      36
  7  1965   167.8   4.50     $719.10      1    3236   719.1     7      49
  8  1966   172.0   5.52     $787.80      1    4349   787.8     8      64
  9  1967   183.3   5.50     $832.60      1    4579   832.6     9      81
 10  1968   197.4   6.50     $910.00      1    5915   910.0    10     100
 11  1969   203.9   8.23     $984.60      1    8103   984.6    11     121
 12  1970   214.4   8.00   $1,038.50      1    8308  1038.5    12     144
 13  1971   228.3   5.50   $1,127.10      1    6199  1127.1    13     169
 14  1972   249.2   5.04   $1,238.30      1    6241  1238.3    14     196
 15  1973   262.9   7.49   $1,382.70      1   10356  1382.7    15     225
 16  1974   274.2  11.54   $1,500.00      1   17310  1500.0    16     256
 17  1975   287.1   7.07   $1,638.30      1   11583  1638.3    17     289
 18  1976   306.2   7.20   $1,825.30      1   13142  1825.3    18     324
 19  1977   330.9   6.75   $2,030.90      1   13709  2030.9    19     361
 20  1978   357.3   8.63   $2,294.70      1   19803  2294.7    20     400
 21  1979   381.8  11.65   $2,563.30      0   29862     0.0    21     441
 22  1980   408.5  12.63   $2,789.50      0   35231     0.0    22     484
 23  1981   436.7  20.03   $3,128.40      0   62662     0.0    23     529
 24  1982   474.8  16.50   $3,255.00      0   53708     0.0    24     576
 25  1983   521.4  10.50   $3,536.70      0   37135     0.0    25     625
 26  1984   551.6  12.60   $3,933.20      0   49558     0.0    26     676
 27  1985   619.8   9.78   $4,220.30      0   41275     0.0    27     729
 28  1986   724.7   8.50   $4,462.80      0   37934     0.0    28     784
 29  1987   750.2   8.25   $4,739.50      0   39101     0.0    29     841
 30  1988   786.7   9.00   $5,103.80      0   45934     0.0    30     900
 31  1989   792.9  11.07   $5,484.40      0   60712     0.0    31     961
 32  1990   824.7  10.00   $5,803.10      0   58031     0.0    32    1024
 33  1991   896.9   8.50   $5,995.90      0   50965     0.0    33    1089
 34  1992  1024.8   6.50   $6,337.70      0   41195     0.0    34    1156
 35  1993  1129.7   6.00   $6,657.40      0   39944     0.0    35    1225
 36  1994  1150.7   7.25   $7,072.20      0   51273     0.0    36    1296
 37  1995  1127.4   9.00   $7,397.70      0   66579     0.0    37    1369
 38  1996  1081.4   8.25   $7,816.90      0   64489     0.0    38    1444
 39  1997  1072.8   8.50   $8,304.30      0   70587     0.0    39    1521
 40  1998  1095.9   8.50   $8,747.00      0   74350     0.0    40    1600
 41  1999  1123.0   7.75   $9,268.40      0   71830     0.0    41    1681
 42  2000  1087.7   9.50   $9,817.00      0   93262     0.0    42    1764
 43  2001  1182.0   6.98  $10,128.00      0   70693     0.0    43    1849
 44  2002  1219.5   4.75  $10,469.60      0   49731     0.0    44    1936
 45  2003  1305.5   4.22  $10,960.80      0   46255     0.0    45    2025
 46  2004  1375.2   4.01  $11,685.90      0   46860     0.0    46    2116
 47  2005  1373.2   6.01  $12,433.90      0   74728     0.0    47    2209
 48  2006  1365.9   8.02  $13,194.70      0  105821     0.0    48    2304

Row     Pop      GDPsq   log M1   logM1l
  1  176289     256644  4.94164  4.89222
  2  179979     277097  4.94663  4.94164
  3  182992     296698  4.97811  4.94663
  4  185771     342927  4.99586  4.97811
  5  188483     381553  5.03240  4.99586
  6  191141     440365  5.07705  5.03240
  7  193526     517105  5.12277  5.07705
  8  195576     620629  5.14749  5.12277
  9  197457     693223  5.21112  5.14749
 10  199399     828100  5.28523  5.21112
 11  201385     969437  5.31763  5.28523
 12  203984    1078482  5.36784  5.31763
 13  206827    1270354  5.43066  5.36784
 14  209284    1533387  5.51826  5.43066
 15  211357    1911859  5.57177  5.51826
 16  213342    2250000  5.61386  5.57177
 17  215465    2684027  5.65983  5.61386
 18  217583    3331720  5.72424  5.65983
 19  219760    4124555  5.80182  5.72424
 20  222095    5265648  5.87858  5.80182
 21  224567    6570507  5.94490  5.87858
 22  227225    7781310  6.01249  5.94490
 23  229466    9786887  6.07925  6.01249
 24  231664   10595025  6.16289  6.07925
 25  233792   12508247  6.25652  6.16289
 26  235825   15470062  6.31282  6.25652
 27  237924   17810932  6.42940  6.31282
 28  240133   19916584  6.58576  6.42940
 29  242289   22462860  6.62034  6.58576
 30  244499   26048774  6.66785  6.62034
 31  246819   30078643  6.67570  6.66785
 32  249623   33675970  6.71502  6.67570
 33  252981   35950817  6.79894  6.71502
 34  256514   40166441  6.93225  6.79894
 35  259919   44320975  7.02971  6.93225
 36  263126   50016013  7.04813  7.02971
 37  266278   54725965  7.02767  7.04813
 38  269394   61103926  6.98601  7.02767
 39  272647   68961398  6.97803  6.98601
 40  275854   76510009  6.99933  6.97803
 41  279040   85903239  7.02376  6.99933
 42  282217   96373489  6.99182  7.02376
 43  285226  102576384  7.07496  6.99182
 44  288126  109612524  7.10620  7.07496
 45  290796  120139137  7.17434  7.10620
 46  293638  136560259  7.22635  7.17434
 47  296507  154601869  7.22490  7.22635
 48  299398  174100108  7.21957  7.22490
I followed the course suggested by the textbook to find what variables were actually important in predicting the money supply. The following is quoted from an earlier final exam solution. The best subsets approach, according to Berenson et al., involves: (i) choosing a large set of candidate independent variables; (ii) running a regression with all the candidate variables and using the VIF option, which tests for collinearity; (iii) eliminating variables with a VIF over 5; (iv) continuing to run regressions and eliminate candidate variables until there are no variables with a VIF over 5; (v) performing a best-subsets regression on the model without high VIFs and computing Cp; (vi) shortlisting the models with a Cp less than or close to k + 1, where k is the number of independent variables in that regression; (vii) choosing from the shortlist on the basis of things like significance of coefficients and R-squared; (viii) using residual analysis and influence analysis to further refine the model by adding nonlinear terms, transforming variables and eliminating suspicious observations. Note that terms like a squared term are largely exempt from the VIF rules, even though they are correlated with the untransformed variable.

Results for: M1PrRGDP.MTW

MTB > Regress c2 5 c4 c6 c7 c10 c12;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 1
Regression Analysis: M1 versus PrRt, GDP, Dummy, year, Pop
The regression equation is
M1 = 2874 - 19.1 PrRt + 0.0714 GDP - 115 Dummy + 46.2 year - 0.0149 Pop

Predictor       Coef   SE Coef      T      P      VIF
Constant        2874      1232   2.33  0.025
PrRt         -19.116     3.941  -4.85  0.000    2.241
GDP          0.07138   0.01762   4.05  0.000   62.461
Dummy        -114.81     48.62  -2.36  0.023    8.260
year           46.23     15.57   2.97  0.005  668.523
Pop        -0.014888  0.007176  -2.07  0.044  917.418

S = 57.7863   R-Sq = 98.4%   R-Sq(adj) = 98.2%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       5  8498077  1699615  508.98  0.000
Residual Error  42   140249     3339
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454
year     1    80187
Pop      1    14371

Unusual Observations
Obs  PrRt       M1     Fit  SE Fit  Residual  St Resid
 23  20.0   436.70  361.08   37.33     75.62     1.71 X
 35   6.0  1129.70  982.60   18.35    147.10     2.68R
 36   7.3  1150.70  986.80   14.01    163.90     2.92R
 37   9.0  1127.40  975.89   11.81    151.51     2.68R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

So the regression above was my first attempt. There are several questions that can be asked at this point.
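For anyone who wants to reproduce a run like Regression 1 outside Minitab, here is a minimal sketch using statsmodels. The random data is a stand-in for the worksheet, so only the mechanics (the fit, t-ratios, p-values, R-squared and VIFs) carry over.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Stand-in data; replace with the real worksheet columns to match Minitab.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(48, 5)),
                      columns=["PrRt", "GDP", "Dummy", "year", "Pop"])
    df["M1"] = 3 * df["GDP"] - 2 * df["PrRt"] + rng.normal(size=48)

    X = sm.add_constant(df[["PrRt", "GDP", "Dummy", "year", "Pop"]])
    fit = sm.OLS(df["M1"], X).fit()
    print(fit.summary())                   # coefficients, t, p, R-sq, ANOVA F

    for i, name in enumerate(X.columns[1:], start=1):  # VIFs, skipping the constant
        print(name, variance_inflation_factor(X.values, i))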
1) Why does this regression look awfully good as far as significance and the amount of the variation in the Y variable that is explained by the equation? (3)
Answer: The p-value for the ANOVA is below 1%, which means that there is a significant relationship between the X values and Y. We have an R-squared of 98.4%. This means that the regression explains 98.4% of the variation in Y. Also, every p-value is below 5%, which means that all of the coefficients are significant at the 5% level. A coefficient βi is insignificant if the null hypothesis H0: βi = 0 cannot be shown to be false at a reasonable significance level (usually α = .05 or α = .01). In practice this means that the t-ratio t = bi/s(bi) is between -t(α/2) and +t(α/2), or equivalently that the p-value, 2P(t ≥ t computed) or, if the t-ratio is negative, 2P(t ≤ t computed), is above a reasonable significance level.

2) There are only two coefficients here whose sign you can predict in advance. What are they, what did you predict and why, and were you right? (2)
Answer: The amount of money demanded should fall as the interest rate rises, so the coefficient of PrRt should be negative. On the other hand, we should expect the need for money to rise as GDP rises, so the coefficient of GDP should be positive. These are both as forecasted in this regression.

3) What does the Analysis of Variance tell us? What hypothesis did it cause you to reject? (1)
Answer: The p-value for the ANOVA is below 1%, which means that there is a significant relationship between the X values and Y. The null hypothesis for a basic regression ANOVA is that there is no linear relation between the X variables and Y.

MTB > Regress c2 4 c4 c6 c7 c10;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 2
Regression Analysis: M1 versus PrRt, GDP, Dummy, year
The regression equation is
M1 = 321 - 20.7 PrRt + 0.0415 GDP - 174 Dummy + 14.5 year

Predictor     Coef  SE Coef      T      P     VIF
Constant    321.24    66.06   4.86  0.000
PrRt       -20.668    4.016  -5.15  0.000   2.160
GDP        0.04152  0.01055   3.94  0.000  20.791
Dummy      -173.71    40.96  -4.24  0.000   5.444
year        14.530    3.077   4.72  0.000  24.254

S = 59.9651   R-Sq = 98.2%   R-Sq(adj) = 98.0%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       4  8483706  2120927  589.83  0.000
Residual Error  43   154620     3596
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454
year     1    80187

Unusual Observations
Obs  PrRt       M1     Fit  SE Fit  Residual  St Resid
 23  20.0   436.70  371.34   38.39     65.36     1.42 X
 35   6.0  1129.70  982.21   19.04    147.49     2.59R
 36   7.3  1150.70  988.13   14.53    162.57     2.79R
 37   9.0  1127.40  980.00   12.08    147.40     2.51R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

MTB > Regress c2 3 c4 c6 c7;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 3
Regression Analysis: M1 versus PrRt, GDP, Dummy
The regression equation is
M1 = 451 - 14.3 PrRt + 0.0865 GDP - 240 Dummy

Predictor      Coef   SE Coef      T      P    VIF
Constant     450.99     73.19   6.16  0.000
PrRt        -14.269     4.605  -3.10  0.003  1.914
GDP        0.086456  0.005548  15.58  0.000  3.875
Dummy       -239.76     46.90  -5.11  0.000  4.809

S = 73.0515   R-Sq = 97.3%   R-Sq(adj) = 97.1%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       3  8403519  2801173  524.91  0.000
Residual Error  44   234807     5337
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454

Unusual Observations
Obs  PrRt      M1    Fit  SE Fit  Residual  St Resid
 23  20.0   436.7  435.7    43.7       1.0     0.02 X
 35   6.0  1129.7  941.0    20.6     188.7     2.69R
 36   7.3  1150.7  959.0    16.0     191.7     2.69R
 37   9.0  1127.4  962.1    14.0     165.3     2.30R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

4) What did I do to get from Regression 1 to Regression 3 and why? (2)
Answer: I removed 'year' and 'Pop' because they had the highest VIFs. This means that there was a great deal of collinearity present and that these variables were not really needed. Note that R-squared has barely fallen.
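Steps (ii)-(iv) of the quoted procedure amount to a simple loop: refit, look at the largest VIF, drop that variable if it is above 5, and repeat. A sketch (not Minitab's own code; the data and names are invented for illustration):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def drop_high_vif(df, x_cols, limit=5.0):
        # Refit, find the worst VIF, drop that column if above the limit, repeat.
        x_cols = list(x_cols)
        while len(x_cols) > 1:
            X = sm.add_constant(df[x_cols])
            vifs = [variance_inflation_factor(X.values, i + 1)
                    for i in range(len(x_cols))]
            worst = int(np.argmax(vifs))
            if vifs[worst] <= limit:
                break
            x_cols.pop(worst)
        return x_cols

    # x3 is nearly a copy of x1, so one of that pair should be eliminated.
    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
    df["x3"] = df["x1"] + rng.normal(scale=0.01, size=40)
    print(drop_high_vif(df, ["x1", "x2", "x3"]))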
5) Why was I now ready to quit dropping variables and do a 'best subsets' regression? (1) [9]
Answer: The VIFs are now all below 5. Of course, the signs of the coefficients and their significance are fine too.

6) What would the money supply be that would be predicted for 1970, assuming that the numbers given for 1970 are correct? By what percent is it off the actual value? (2)
Answer: The regression equation is M1 = 450.99 - 14.269 PrRt + 0.086456 GDP - 239.76 Dummy. The line from our data says

Row    C5     M1  PrRt        GDP  Dummy  GDPPr  GDPdum  year  yearsq
 12  1970  214.4  8.00  $1,038.50      1   8308  1038.5    12     144

So M1 = 450.99 - 14.269(8.00) + 0.086456(1038.50) - 239.76(1) = 450.99 - 114.152 + 89.785 - 239.76 = 186.86. The observed value was 214.4. The difference between the two numbers is about 13% of the observed value.

7) Can you make this into a rough prediction interval? Does this include the actual value for 1970? (2) [13]
Answer: The outline says "We can use this (se) to find an approximate prediction interval Y0 = Ŷ0 ± t·se." The ANOVA says that there are 44 degrees of freedom and that S = 73.0515. We can use t.025(44) = 2.015. The interval is thus 186.86 ± 2.015(73.0515) = 186.9 ± 147.2. This rather gigantic interval includes the actual value.

MTB > BReg c2 c4 c6 c7;
SUBC>   NVars 1 3;
SUBC>   Best 2;
SUBC>   Constant.

Regression 4
Best Subsets Regression: M1 versus PrRt, GDP, Dummy
Response is M1

Vars  R-Sq  R-Sq(adj)  Mallows Cp       S  PrRt  GDP  Dummy
   1  95.6       95.6        26.5  90.432         X
   1  67.8       67.1       477.7  246.02              X
   2  96.7       96.5        11.6  79.727         X    X
   2  95.7       95.5        28.1  91.197    X    X
   3  97.3       97.1         4.0  73.051    X    X    X

8) What is Regression 4 telling me to do? Why can you say that? (2)
Answer: The only satisfactory value of Mallows' Cp is 4.0, which belongs to the regression with all three independent variables. (With k = 3 we want Cp less than or close to k + 1 = 4, and none of the smaller models comes close.)

MTB > Regress c2 3 c4 c6 c7;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 5
Regression Analysis: M1 versus PrRt, GDP, Dummy
The regression equation is
M1 = 451 - 14.3 PrRt + 0.0865 GDP - 240 Dummy

Predictor      Coef   SE Coef      T      P    VIF
Constant     450.99     73.19   6.16  0.000
PrRt        -14.269     4.605  -3.10  0.003  1.914
GDP        0.086456  0.005548  15.58  0.000  3.875
Dummy       -239.76     46.90  -5.11  0.000  4.809

S = 73.0515   R-Sq = 97.3%   R-Sq(adj) = 97.1%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       3  8403519  2801173  524.91  0.000
Residual Error  44   234807     5337
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454

Unusual Observations
Obs  PrRt      M1    Fit  SE Fit  Residual  St Resid
 23  20.0   436.7  435.7    43.7       1.0     0.02 X
 35   6.0  1129.7  941.0    20.6     188.7     2.69R
 36   7.3  1150.7  959.0    16.0     191.7     2.69R
 37   9.0  1127.4  962.1    14.0     165.3     2.30R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 0.445619

Residual Plots for M1 [graphs not shown]

9) Regression 5 is just a repeat of Regression 3, but now I am doing residual analysis. What are the Durbin-Watson statistic and the plot of residuals vs. order telling me is present? What 2 conditions for regression seem to be being violated? (3) [18]
Answer: The low value of the Durbin-Watson statistic and the graphs seem to show a lot of serial correlation. The errors seem to come in waves. Furthermore, the graphs seem to indicate that the errors are getting bigger as time passes (and GDP grows). The basic theorem for regression says that there should be no serial correlation and that the variance of the errors should be constant rather than growing with the independent variables; both conditions seem to be violated.
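The Durbin-Watson statistic Minitab prints is the sum of squared successive residual differences over the residual sum of squares; values near 2 suggest no first-order serial correlation, and values near 0 (like the 0.446 here) strong positive correlation. A sketch of the computation on made-up, wave-like residuals:

    import numpy as np
    from statsmodels.stats.stattools import durbin_watson

    def dw(residuals):
        # Durbin-Watson: sum of squared successive differences over the
        # residual sum of squares; near 2 means no first-order correlation.
        e = np.asarray(residuals, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    e = np.sin(np.linspace(0, 6, 48))  # invented residuals that come in waves
    print(dw(e))                       # well below 2 -> positive autocorrelation
    print(durbin_watson(e))            # statsmodels gives the same number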
MTB > Regress c2 4 c4 c6 c7 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 6
Regression Analysis: M1 versus PrRt, GDP, Dummy, GDPsq
The regression equation is
M1 = 131 - 13.1 PrRt + 0.187 GDP - 26.3 Dummy - 0.000007 GDPsq

Predictor         Coef     SE Coef      T      P     VIF
Constant        131.36       64.18   2.05  0.047
PrRt           -13.142       3.050  -4.31  0.000   1.919
GDP            0.18659     0.01370  13.62  0.000  53.994
Dummy           -26.33       41.88  -0.63  0.533   8.764
GDPsq      -0.00000671  0.00000088  -7.59  0.000  33.120

S = 48.3231   R-Sq = 98.8%   R-Sq(adj) = 98.7%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       4  8537916  2134479  914.07  0.000
Residual Error  43   100410     2335
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454
GDPsq    1   134396

Unusual Observations
Obs  PrRt       M1      Fit  SE Fit  Residual  St Resid
 23  20.0   436.70   386.21   29.65     50.49     1.32 X
 35   6.0  1129.70   997.46   15.53    132.24     2.89R
 36   7.3  1150.70  1020.24   13.32    130.46     2.81R
 37   9.0  1127.40  1026.39   12.53    101.01     2.16R
 42   9.5  1087.70  1191.94   14.93   -104.24    -2.27R
 48   8.0  1365.90  1320.38   30.91     45.52     1.23 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 0.551845

Residual Plots for M1 [graphs not shown]

10) I now felt free to add the square of GDP as a new independent variable. What happened to the VIFs? Do I care? Why? (2)
Answer: The VIFs went up. But they only show that there is a close relationship between GDP and its square. Since these are, in a sense, the same variable, I don't care.

11) What did adding the square of GDP do to the significance of my coefficients and the fraction of the variation of Y that is explained by the equation? (2) [22]
Answer: This is the real problem here. The p-value for the coefficient of the dummy variable has gone through the roof, indicating that it is no longer significant. Both R-squared and R-squared adjusted have risen, so maybe we are doing something right.

MTB > let c14 = loge(c2)
MTB > Regress c14 4 c4 c6 c7 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 7
Regression Analysis: log M1 versus PrRt, GDP, Dummy, GDPsq
The regression equation is
log M1 = 4.79 + 0.00846 PrRt + 0.000453 GDP + 0.0289 Dummy - 0.000000 GDPsq

Predictor         Coef     SE Coef       T      P     VIF
Constant        4.7882      0.1358   35.26  0.000
PrRt          0.008461    0.006453    1.31  0.197   1.919
GDP         0.00045309  0.00002899   15.63  0.000  53.994
Dummy          0.02889     0.08862    0.33  0.746   8.764
GDPsq      -0.00000002  0.00000000  -11.66  0.000  33.120

S = 0.102246   R-Sq = 98.5%   R-Sq(adj) = 98.4%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       4  29.3981  7.3495  703.01  0.000
Residual Error  43   0.4495  0.0105
Total           47  29.8476

Source  DF   Seq SS
PrRt     1   1.2680
GDP      1  25.6375
Dummy    1   1.0725
GDPsq    1   1.4202

Unusual Observations
Obs  PrRt  log M1     Fit  SE Fit  Residual  St Resid
 23  20.0  6.0792  6.1618  0.0627   -0.0826    -1.02 X
 42   9.5  6.9918  7.2158  0.0316   -0.2239    -2.30R
 48   8.0  7.2196  7.0393  0.0654    0.1803     2.29RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 0.306367

Residual Plots for log M1 [graphs not shown]

12) I just replaced the money supply by its logarithm. The residual analysis tells me this was a sort of good idea. What does that mean? (1) [23]
Answer: The last diagram seems to show that there is a somewhat smaller tendency of the error to grow with time.
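Why taking the log of the dependent variable can calm errors that grow with the level of the series: if the error is multiplicative, y = trend × noise, the level-model residuals fan out over time while the log-model residuals have roughly constant spread. A tiny simulated illustration (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(2)
    t = np.arange(48)
    y = np.exp(0.05 * t) * rng.lognormal(mean=0.0, sigma=0.05, size=48)

    # Residual spread around the trend, early half vs. late half of the sample.
    level_resid = y - np.exp(0.05 * t)
    log_resid = np.log(y) - 0.05 * t
    print(np.std(level_resid[:24]), np.std(level_resid[24:]))  # grows sharply
    print(np.std(log_resid[:24]), np.std(log_resid[24:]))      # roughly constant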
13) What is really weird about these coefficients? Which one has the wrong sign? (1)
Answer: The coefficients are awfully small, though this probably could be fixed for GDP and GDP squared; better technique might have fixed this by dividing GDP by 10³ (and GDP squared by 10⁶) before we started. But the coefficient of the interest rate now has an unexpected positive sign, and it is also insignificant. The dummy variable also has an enormous p-value.

MTB > Regress c14 3 c4 c6 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 8
Regression Analysis: log M1 versus PrRt, GDP, GDPsq
The regression equation is
log M1 = 4.83 + 0.00732 PrRt + 0.000445 GDP - 0.000000 GDPsq

Predictor         Coef     SE Coef       T      P     VIF
Constant       4.83016     0.04310  112.06  0.000
PrRt          0.007316    0.005359    1.37  0.179   1.351
GDP         0.00044536  0.00001650   26.99  0.000  17.854
GDPsq      -0.00000002  0.00000000  -15.60  0.000  18.176

S = 0.101203   R-Sq = 98.5%   R-Sq(adj) = 98.4%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  29.3970  9.7990  956.75  0.000
Residual Error  44   0.4506  0.0102
Total           47  29.8476

Source  DF   Seq SS
PrRt     1   1.2680
GDP      1  25.6375
GDPsq    1   2.4915

Unusual Observations
Obs  PrRt  log M1     Fit  SE Fit  Residual  St Resid
 23  20.0  6.0792  6.1606  0.0620   -0.0814    -1.02 X
 42   9.5  6.9918  7.2104  0.0267   -0.2186    -2.24R
 48   8.0  7.2196  7.0413  0.0644    0.1783     2.28RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 0.289829

Residual Plots for log M1 [graphs not shown]

MTB > Regress c14 2 c6 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 9
Regression Analysis: log M1 versus GDP, GDPsq
The regression equation is
log M1 = 4.87 + 0.000457 GDP - 0.000000 GDPsq

Predictor         Coef     SE Coef       T      P     VIF
Constant       4.87027     0.03184  152.96  0.000
GDP         0.00045654  0.00001446   31.58  0.000  13.455
GDPsq      -0.00000002  0.00000000  -18.76  0.000  13.455

S = 0.102169   R-Sq = 98.4%   R-Sq(adj) = 98.4%

Analysis of Variance
Source          DF      SS      MS        F      P
Regression       2  29.378  14.689  1407.18  0.000
Residual Error  45   0.470   0.010
Total           47  29.848

Source  DF  Seq SS
GDP      1  25.705
GDPsq    1   3.673

Unusual Observations
Obs    GDP  log M1     Fit  SE Fit  Residual  St Resid
 42   9817  6.9918  7.1988  0.0256   -0.2070    -2.09R
 47  12434  7.2249  7.0925  0.0478    0.1324     1.47 X
 48  13195  7.2196  7.0041  0.0590    0.2154     2.58RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 0.208342

Residual Plots for log M1 [graphs not shown]
MTB > Regress c14 3 c6 c13 c8;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 10
Regression Analysis: log M1 versus GDP, GDPsq, GDPPr
The regression equation is
log M1 = 4.87 + 0.000465 GDP - 0.000000 GDPsq - 0.000001 GDPPr

Predictor         Coef     SE Coef       T      P     VIF
Constant       4.86787     0.03240  150.23  0.000
GDP         0.00046548  0.00002208   21.08  0.000  30.892
GDPsq      -0.00000002  0.00000000  -16.38  0.000  17.958
GDPPr      -0.00000070  0.00000130   -0.54  0.593   5.889

S = 0.102985   R-Sq = 98.4%   R-Sq(adj) = 98.3%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  29.3810  9.7937  923.42  0.000
Residual Error  44   0.4667  0.0106
Total           47  29.8476

Source  DF   Seq SS
GDP      1  25.7052
GDPsq    1   3.6727
GDPPr    1   0.0031

Unusual Observations
Obs    GDP  log M1     Fit  SE Fit  Residual  St Resid
 42   9817  6.9918  7.1826  0.0396   -0.1908    -2.01R
 48  13195  7.2196  6.9803  0.0741    0.2393     3.35RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 0.196041

Residual Plots for log M1 not shown.

14) What has happened to significance and the fraction of the variation in the dependent variable explained by the regression in Regressions 8, 9 and 10? In terms of significance etc., which of these 3 is the 'best' regression? Why would the Chairman of the FRB be very annoyed? (3) [27]
Answer: The values of R-squared and R-squared adjusted barely changed as we dropped the dummy variable and the interest rate. Trying to put the interest rate back in as an interaction variable did no good. Regression 9 has no insignificant coefficients, but since the dropped independent variables are 'monetary' variables, the equation seems to say that the FRB has no influence on the money supply.

MTB > Regress c14 4 c6 c13 c8 c15;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.

Regression 11
Regression Analysis: log M1 versus GDP, GDPsq, GDPPr, logM1l
The regression equation is
log M1 = - 0.174 + 0.000001 GDP - 0.000000 GDPsq - 0.000001 GDPPr + 1.04 logM1l

Predictor         Coef     SE Coef      T      P      VIF
Constant       -0.1738      0.2820  -0.62  0.541
GDP         0.00000085  0.00002708   0.03  0.975  383.407
GDPsq      -0.00000000  0.00000000  -0.36  0.723  136.981
GDPPr      -0.00000109  0.00000045  -2.39  0.021    5.902
logM1l         1.04474     0.05838  17.89  0.000   80.236

S = 0.0358443   R-Sq = 99.8%   R-Sq(adj) = 99.8%

Analysis of Variance
Source          DF       SS      MS        F      P
Regression       4  29.7924  7.4481  5797.02  0.000
Residual Error  43   0.0552  0.0013
Total           47  29.8476

Source  DF   Seq SS
GDP      1  25.7052
GDPsq    1   3.6727
GDPPr    1   0.0031
logM1l   1   0.4114

Unusual Observations
Obs    GDP   log M1      Fit   SE Fit  Residual  St Resid
 28   4463  6.58576  6.49641  0.00824   0.08935     2.56R
 37   7398  7.02767  7.09767  0.00984  -0.07000    -2.03R
 38   7817  6.98601  7.07589  0.00849  -0.08988    -2.58R
 48  13195  7.21957  7.18793  0.02830   0.03164     1.44 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 1.17315

Residual Plots for log M1 not displayed.

15) So what problem did this fix? Incidentally, what I added to the independent variables was the money supply of the previous period. (1) [28]
Answer: This represents a half-hearted attempt to fix the autocorrelation by using the previous year's money supply as an independent variable. It raised the D-W statistic closer to the desired 2, but, if we work with the D-W table in the text, not enough so that we can say that there is no autocorrelation.
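The lagged money supply in Regression 11 is just the logM1 column shifted down one year. A sketch of the construction and the Durbin-Watson check (simulated stand-in data; the column names follow the worksheet):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(3)
    df = pd.DataFrame({"GDP": np.linspace(500, 13000, 48)})
    df["logM1"] = 4.9 + 0.0002 * df["GDP"] + rng.normal(scale=0.02, size=48).cumsum()
    df["logM1l"] = df["logM1"].shift(1)  # previous year's log money supply
    df = df.dropna()                     # the first year has no lagged value

    X = sm.add_constant(df[["GDP", "logM1l"]])
    fit = sm.OLS(df["logM1"], X).fit()
    print(durbin_watson(fit.resid))      # closer to 2 than without the lag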
II. Do at least 4 of the following 8 problems (at least 12 points each), or do sections adding to at least 50 points. (Anything extra you do helps, and grades wrap around.) It is especially important to do more if you have skipped much of Part I. Show your work! State H0 and H1 where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests; that is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has about 160 points, but 70 is considered a perfect score. Don't waste my time by telling me that two means, proportions, variances or medians don't look the same to you. You need statistical tests! There are some blank pages below. Put your name on as many loose pages as possible! Mark sections of your answer clearly.

1) Multiple choice.
a) If I want to test to see if the mean of x1 is larger than the mean of x2, my null hypothesis is: (Note: D = μ1 - μ2.) Only check one answer! (2)
i)   μ1 = μ2 and D = 0      v)    μ1 ≤ μ2 and D ≤ 0
ii)  μ1 ≠ μ2 and D ≠ 0      vi)   μ1 ≥ μ2 and D ≥ 0
iii) μ1 < μ2 and D < 0      vii)  μ1 ≤ μ2 and D ≥ 0
iv)  μ1 > μ2 and D > 0      viii) μ1 ≥ μ2 and D ≤ 0
No answer is provided because this question will be repeated on future exams.

b) Compared to multiple regression, simple regression is different in having only one
i) Observation
ii) Parameter
iii) Dependent variable
iv) *Independent variable
v) Y-intercept
vi) All of the above

c) For the following quantities, mark their lines with yes (Y) or no (N) as to whether they must be positive.
___ R² adjusted for degrees of freedom
___ The correlation r(x1,x2) between two independent variables x1 and x2
___ Sxy = Σxy - n·x̄·ȳ
___ The coefficient b0 in a multiple regression.
Solution: Though there are many quantities in a multiple regression that must be positive, none of the above is in that category.

d) Assume that we wish to test the hypothesis that a mean is greater than 3 and we compute the ratio t = (x̄ - 3)/s(x̄), where our sample statistics are computed from a sample of 29. If α = .05, we reject the null hypothesis if
i) t is above 1.645 or below -1.645
ii) t is above 1.960 or below -1.960
iii) t is below -1.645
iv) t is below -1.960
v) t is above 1.645
vi) t is above 1.960
vii) *None of the above. (Fill in a more appropriate answer!)
Explanation: This is for all of you out there who don't believe in the t table. We have a right-sided test and df = n - 1 = 28. Reject the null hypothesis H0: μ ≤ 3 if t > t.05(28) = 1.701.

e) Consumers are asked to take the Pepsi Challenge. They were asked which cola they preferred, and the number that preferred Pepsi was recorded. Sample 1 was males and sample 2 was females. The following was run on Minitab.

MTB > PTwo 109 46 52 13;
SUBC>   Pooled.

Test and CI for Two Proportions
Sample   X    N  Sample p
1       46  109  0.422018
2       13   52  0.250000
Difference = p (1) - p (2)
Estimate for difference: 0.172018
95% CI for difference: (0.0221925, 0.321844)
Test for difference = 0 (vs not = 0): Z = 2.12  P-Value = 0.034

On the basis of the printout above we can say one of the following.
i) At a 99% confidence level we can say that we have enough evidence to state that the proportion of men that prefer Pepsi differs from the proportion of women that prefer Pepsi.
ii) *At a 95% confidence level we can say that we have enough evidence to state that the proportion of men that prefer Pepsi differs from the proportion of women that prefer Pepsi.
iii) At a 99% confidence level we can say that we have enough evidence to state that the proportion of men that prefer Pepsi equals the proportion of women that prefer Pepsi.
iv) At a 96% confidence level there is insufficient evidence to indicate that the proportion of men that prefer Pepsi differs from the proportion of women that prefer Pepsi.
Explanation: This is a two-sided test. The null hypothesis is H0: p1 = p2. Because of the p-value of 3.4%, we reject the null hypothesis if α = .05 but not if α = .01. On the other hand, a null hypothesis is never 'proved.'

f) A researcher is comparing room temperatures preferred by random samples of 135 adults and 80 children. The Minitab output follows.

MTB > TwoT 135 77.5 4.5 80 76.5 2.5;
SUBC>   Alternative 1.

Two-Sample T-Test and CI
Sample    N   Mean  StDev  SE Mean
1       135  77.50   4.50     0.39
2        80  76.50   2.50     0.28
Difference = mu (1) - mu (2)
Estimate for difference: 1.000
95% lower bound for difference: 0.211
T-Test of difference = 0 (vs >): T-Value = 2.09  P-Value = 0.019  DF = 212

On the basis of what you see here and the way we have stated null-alternate hypothesis pairs in class, we come to the following conclusion if we use a 99% confidence level.
i) Do not reject H0: μ1 = μ2
ii) Do not reject H0: μ1 ≥ μ2
iii) *Do not reject H0: μ1 ≤ μ2
iv) Reject H0: μ1 = μ2
v) Reject H0: μ1 ≥ μ2
vi) Reject H0: μ1 ≤ μ2
vii) None of the above (Fill in a more appropriate answer!)
Explanation: This is a one-sided test. The alternate hypothesis is H1: μ1 > μ2, so the opposite null hypothesis is H0: μ1 ≤ μ2. Because of the p-value of 1.9%, we do not reject the null hypothesis if α = .01.

2) The data below represent the sales of Friendly Autos for 7 randomly selected months. They believe that the number of cars sold depends on the average price for that month (in $ thousands), the number of advertising spots that appeared on the local TV station, and whether other types of advertising were used in that month (a dummy variable that is 1 if other types of advertising were used in a given month).

Row  Sold  Price  Adv  Type  Product
 1    10   28.2   10    1      10
 2     8   28.7    6    1       6
 3    12   27.9   14    1      14
 4    13   27.8   18    0       0
 5     9   28.1   10    0       0
 6    14   28.8   19    1      19
 7    15   28.9   20    1      20
                               --
                               69

Sum of Sold = 81, Sum of Price = 198.4, Sum of Adv = 97, Sum of Sold squared = 979, Sum of Price squared = 5624.44, Sum of Adv squared = 1517, Sum of Sold * Price = 2297.4, Sum of Sold * Adv = 1206, Sum of Price * Adv = 2751.4.

a) If advertising (Adv) is x5 (it isn't) and Type is x6, compute Σx5x6. (2)
b) Compute the coefficients of the equation Ŷ = b0 + b1x to predict the value of 'Sold' on the basis of 'Price.' (5)
c) Compute R² and R² adjusted for degrees of freedom. (4)
d) Compute the standard error se. (3)
e) Is the slope of the simple regression significant at the 1% level? Do not answer this question without appropriate calculations! (3) [17]
f) Is the sign of the coefficient of Price what you expected? Why or why not? (1)
g) Predict the average number of cars that will be sold when the price is $30 thousand using the equation you got, and make it into an appropriate interval. (4)
h) Do a 1% confidence interval for β0, the y-intercept. (3) [24, 36]
Solution: The quantities below are given: ΣY = 81, ΣX1 = 198.4, ΣX2 = 97, n = 7, ΣY² = 979, ΣX1² = 5624.44, ΣX2² = 1517, ΣX1Y = 2297.4, ΣX2Y = 1206 and ΣX1X2 = 2751.4. You do not need all of these for this problem. Please note that if you told me that ΣX1² = (198.4)², or ΣX1Y = (198.4)(81), or anything similar, it was instant death!

Spare Parts Computation:
Ȳ = ΣY/n = 81/7 = 11.57143, X̄1 = ΣX1/n = 198.4/7 = 28.34286, X̄2 = ΣX2/n = 97/7 = 13.85714
SSX1 = ΣX1² - nX̄1² = 5624.44 - 7(28.34286)² = 1.21601*
SSX2 = ΣX2² - nX̄2² = 1517 - 7(13.85714)² = 172.8577*†
SST = SSY = ΣY² - nȲ² = 979 - 7(11.57143)² = 41.7141*
SX1Y = ΣX1Y - nX̄1Ȳ = 2297.4 - 7(28.34286)(11.57143) = 1.62806
SX2Y = ΣX2Y - nX̄2Ȳ = 1206 - 7(13.85714)(11.57143) = 83.5715†
SX1X2 = ΣX1X2 - nX̄1X̄2 = 2751.4 - 7(28.34286)(13.85714) = 2.14315†
*Must be positive; the rest may well be negative. †Needed only in the next problem.

a) If advertising (Adv) is x5 (it isn't) and Type is x6, compute Σx5x6. (2) See the column labeled 'Product' above: Σx5x6 = 69.

b) Compute the coefficients of the equation Ŷ = b0 + b1x to predict the value of 'Sold' on the basis of 'Price.' (5) The coefficients are b1 = SX1Y/SSX1 = 1.62806/1.21601 = 1.33885 and b0 = Ȳ - b1X̄1 = 11.57143 - 1.33885(28.34286) = -26.3754. So Ŷ = -26.3754 + 1.3389X.

c) Compute R² and R² adjusted for degrees of freedom. (4) We have found SX1Y = 1.62806, SSX1 = 1.21601, b1 = 1.33885 and SST = SSY = 41.7141. This means that the regression (explained) sum of squares is SSR = b1·SX1Y = 1.33885(1.62806) = 2.17973, so R² = SSR/SST = 2.17973/41.7141 = .05225, or R² = (SX1Y)²/(SSX1·SSY) = (1.62806)²/(1.21601 × 41.7141) = .05225. These are terribly low, but things get worse when we compute R̄², adjusted for degrees of freedom: R̄² = [(n-1)R² - k]/(n - k - 1) = [6(.05225) - 1]/5 = -0.1373. Well, I did tell you that it could be negative, but this is ridiculous.

d) Compute the standard error se. (3) se² = SSE/(n-2) = (SST - SSR)/(n-2) = (41.7141 - 2.1797)/5 = 7.90688, so se = √7.90688 = 2.8119.

e) Is the slope of the simple regression significant at the 1% level? Do not answer this question without appropriate calculations! (3) To test for significance, H0: β1 = 0, we compute s²(b1) = se²/SSX1 = 7.90688/1.21601 = 6.50232, so s(b1) = 2.5500. Then t = (b1 - 0)/s(b1) = 1.33885/2.5500 = 0.525. Since α = .01, t.005(5) = 4.032. The 'do not reject' zone is between ±4.032. Since our computed t lies between these values, we cannot reject the null hypothesis and we must declare b1 insignificant.

f) Is the sign of the coefficient of Price what you expected? Why or why not? (1) If this equation represents demand for cars, we would expect quantity demanded to rise as price falls, so the coefficient of price should be negative. This is not happening in this equation. Note that, since we have already shown that the coefficient is not significant, a confidence interval will include both positive and negative values.

g) Predict the average number of cars that will be sold when the price is $30 thousand using the equation you got, and make it into an appropriate interval. (4) The confidence interval is Y0 = Ŷ0 ± t·s(Ŷ), where s²(Ŷ) = se²[1/n + (X0 - X̄)²/SSx], se² = 7.90688, X̄1 = 28.34286, SSX1 = 1.21601 and X0 = 30. So Ŷ0 = -26.3754 + 1.3389X0 = -26.3754 + 1.3389(30) = 13.79 and s²(Ŷ) = 7.90688[1/7 + (30 - 28.34286)²/1.21601] = 7.90688(0.14286 + 2.25831) = 18.986, so s(Ŷ) = 4.357. If α = .01, t.005(5) = 4.032 (for α = .05 we would use t.025(5) = 2.571). The 1% interval is 13.79 ± 4.032(4.357) = 13.79 ± 17.57, which is amazingly vague.
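The hand computations in b) through e) can be verified in a few lines; the comments show the values the formulas above give.

    import numpy as np

    sold  = np.array([10, 8, 12, 13, 9, 14, 15], dtype=float)
    price = np.array([28.2, 28.7, 27.9, 27.8, 28.1, 28.8, 28.9])
    n = len(sold)

    Sxy = np.sum(price * sold) - n * price.mean() * sold.mean()  # 1.62806
    SSx = np.sum(price ** 2) - n * price.mean() ** 2             # 1.21601
    SSy = np.sum(sold ** 2) - n * sold.mean() ** 2               # 41.7141
    b1 = Sxy / SSx                                               # 1.33885
    b0 = sold.mean() - b1 * price.mean()                         # -26.3754
    R2 = b1 * Sxy / SSy                                          # 0.05225
    se2 = (SSy - b1 * Sxy) / (n - 2)                             # 7.90688
    t_b1 = b1 / np.sqrt(se2 / SSx)                               # about 0.52
    print(b0, b1, R2, se2, t_b1)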
h) Do a 1% confidence interval for β0, the y-intercept. (3) b0 = -26.3754 and s²(b0) = se²[1/n + X̄²/SSx] = 7.90688[1/7 + (28.34286)²/1.21601] = 7.90688(660.76) = 5224.5, which means s(b0) = 72.28, so β0 = -26.3754 ± 4.032(72.28) = -26.38 ± 291.4, which strongly indicates that the intercept is not significant.

3) The data below represent the sales of Friendly Autos for 7 randomly selected months. They believe that the number of cars sold depends on the average price for that month (in $ thousands), the number of advertising spots that appeared on the local TV station, and whether other types of advertising were used in that month (a dummy variable that is 1 if other types of advertising were used in a given month).

Row  Sold  Price  Adv  Type
 1    10   28.2   10    1
 2     8   28.7    6    1
 3    12   27.9   14    1
 4    13   27.8   18    0
 5     9   28.1   10    0
 6    14   28.8   19    1
 7    15   28.9   20    1

Sum of Sold = 81, Sum of Price = 198.4, Sum of Adv = 97, Sum of Sold squared = 979, Sum of Price squared = 5624.44, Sum of Adv squared = 1517, Sum of Sold * Price = 2297.4, Sum of Sold * Adv = 1206, Sum of Price * Adv = 2751.4.

a) Do a multiple regression of 'Sold' against 'Price' and 'Advertising.' Attempts to recycle b1 from the previous page or to compute b2 by using a simple regression formula won't work and won't get any credit. (12)
b) Compute R² and R² adjusted for degrees of freedom. (3)
c) i) Do an ANOVA for the simple regression using either your regression sum of squares or R². (2) ii) Do a similar ANOVA for the multiple regression. (2) iii) Combine the two ANOVAs to do an F test to see if the addition of 'Adv' was worthwhile. (2) [21]
d) Predict the average number of cars that will be sold when the price is $30 thousand and there are 15 spots, using the equation you got, and make it into an appropriate interval. (3) [24, 60]

Solution: The Spare Parts Computation is repeated. Serious errors occurred because of excess rounding.
Ȳ = 81/7 = 11.57143, X̄1 = 198.4/7 = 28.34286, X̄2 = 97/7 = 13.85714
SSX1 = ΣX1² - nX̄1² = 5624.44 - 7(28.34286)² = 1.21601*
SSX2 = ΣX2² - nX̄2² = 1517 - 7(13.85714)² = 172.8577*
SST = SSY = ΣY² - nȲ² = 979 - 7(11.57143)² = 41.7141*
SX1Y = ΣX1Y - nX̄1Ȳ = 2297.4 - 7(28.34286)(11.57143) = 1.62806
SX2Y = ΣX2Y - nX̄2Ȳ = 1206 - 7(13.85714)(11.57143) = 83.5715
SX1X2 = ΣX1X2 - nX̄1X̄2 = 2751.4 - 7(28.34286)(13.85714) = 2.14315

a) Do a multiple regression of 'Sold' against 'Price' and 'Advertising.' (12) We substitute our spare parts into the Simplified Normal Equations:
SX1Y = b1·SSX1 + b2·SX1X2
SX2Y = b1·SX1X2 + b2·SSX2
which are
1.62806 = 1.21601 b1 + 2.14315 b2
83.5715 = 2.14315 b1 + 172.8577 b2
and solve them as two equations in two unknowns for b1 and b2. These are a fairly tough pair of equations to solve until we notice that, if we multiply 2.14315 by 172.8577/2.14315 = 80.6559, we get 172.8577. If we multiply the first equation by 80.6559, the equations become
131.31313 = 98.07758 b1 + 172.8577 b2
 83.5715  =  2.14315 b1 + 172.8577 b2
If we subtract these, we get 47.7416 = 95.9344 b1. This means that b1 = 47.7416/95.9344 = 0.4976. The first of our equations was 1.62806 = 1.21601 b1 + 2.14315 b2. This can be rewritten as 2.14315 b2 = 1.62806 - 1.21601 b1. We may as well divide through by 2.14315 and then substitute in our value of b1 to get b2 = 0.759658 - 0.567394 b1 = 0.759658 - 0.567394(0.4976) = 0.4773. (It's worth checking your work by substituting your values of b1 and b2 back into the normal equations.)
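One quick way to do that check is to solve the same normal equations numerically; the matrix below is just [SSX1 SX1X2; SX1X2 SSX2] with right-hand side (SX1Y, SX2Y).

    import numpy as np

    A = np.array([[1.21601,   2.14315],
                  [2.14315, 172.8577]])
    c = np.array([1.62806, 83.5715])
    b1, b2 = np.linalg.solve(A, c)
    print(b1, b2)   # approximately 0.4976 and 0.4773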
Finally we get b0 by using Ȳ = 11.57143, X̄1 = 28.34286 and X̄2 = 13.85714 in b0 = Ȳ - b1X̄1 - b2X̄2 = 11.57143 - 0.4976(28.34286) - 0.4773(13.85714) = 11.57143 - 14.10341 - 6.61401 = -9.1460. Thus our equation is Ŷ = b0 + b1X1 + b2X2 = -9.1460 + 0.4976X1 + 0.4773X2.

b) Compute R² and R² adjusted for degrees of freedom. (3) Remember that we had R²(Y.1) = .05225 and R̄²(Y.1) = -0.1373 from the first regression. Our two spare parts indicating covariation between Y and the independent variables were SX1Y = 1.62806 and SX2Y = 83.5715. The formula for the regression sum of squares is SSR = b1·SX1Y + b2·SX2Y = 0.4976(1.62806) + 0.4773(83.5715) = 0.8101 + 39.8887 = 40.6988. Our total sum of squares was SST = SSY = 41.7141, so we can say R²(Y.12) = SSR/SST = 40.6988/41.7141 = .9757 and R̄²(Y.12) = [(n-1)R² - k]/(n - k - 1) = [6(.9757) - 2]/4 = .9636. This, of course, represents a fabulous improvement over our first regression!

c) i) Do an ANOVA for the simple regression using either your regression sum of squares or R². (2) Let's collect our stuff again. For both regressions we had SST = SSY = 41.7141 and, for the first regression, we had SSR = b1·SXY = 2.17973, SSE = SST - SSR = 41.7141 - 2.17973 = 39.5344 and R² = SSR/SST = .05225. So our ANOVA table will be as below. Of course, there is no need to do both the 'SS' and 'R-squared' tables.

Source        SS      DF  MS       F
Regression    2.1797   1  2.1797   0.276   F.05(1,5) = 6.61 or F.01(1,5) = 16.26
Error        39.5344   5  7.90688
Total        41.7141   6

If we recall R² = .05225 for this regression, we can rewrite the table as below.

Source       R²       DF  'MS'     F
Regression   0.05225   1  0.05225  0.276   F.05(1,5) = 6.61 or F.01(1,5) = 16.26
Error        0.94775   5  0.18955
Total        1.00000   6

Note that, since the computed F is far below the table values, we cannot reject the null hypothesis of no relationship between the dependent and independent variables.

ii) Do a similar ANOVA for the multiple regression. (2) SSR = b1·SX1Y + b2·SX2Y = 40.6988 and SSE = SST - SSR = 41.7141 - 40.6988 = 1.0153.

Source        SS      DF  MS        F
Regression   40.6988   2  20.3494   80.169  F.05(2,4) = 6.94 or F.01(2,4) = 18.00
Error         1.0153   4   0.25383
Total        41.7141   6
Here n Yˆ 9.1460 0.4976 X 0.4773 X 9.1460 0.4976 30 0.4773 15 12.9415 , 0 1,0 2,0 0.25383 4 4 2.776 or t .005 4.604 . Thus the interval could be 0.19042 and we use t .025 7 n 12.94 0.05. se 20 252y0781 12/11/07 4) The data below represent the sales of Friendly Autos for 7 randomly selected months. They believe that the number of cars sold depends on the average price for that month (in $ thousands), Number of advertising spots that appeared on the local TV station and whether other types of advertising were used in that month (a dummy variable that is 1 if other types of advertising were used in a given month. Row 1 2 3 4 5 6 7 Sold 10 8 12 13 9 14 15 Price 28.2 28.7 27.9 27.8 28.1 28.8 28.9 Adv 10 6 14 18 10 19 20 Type 1 1 1 0 0 1 1 The Minitab output below gives the full regression of ‘Sold’ against all three independent variables. Regression Analysis: Sold versus Adv, Price, Type The regression equation is Sold = 8.46 + 0.487 Adv - 0.153 Price + 0.982 Type Predictor Constant Adv Price Type Coef 8.457 0.48699 -0.1530 0.9815 S = 0.218501 SE Coef 6.990 0.01696 ………… 0.2297 R-Sq = 99.7% Analysis of Variance Source DF SS Regression 3 41.571 Residual Error 3 0.143 Total 6 41.714 Source Adv Price Type DF 1 1 1 T 1.21 28.72 ………… 4.27 P 0.313 0.000 0.586 0.024 R-Sq(adj) = 99.3% MS 13.857 0.048 F 290.24 P 0.000 Seq SS 40.404 0.295 0.872 2 a) Using the material in this output find the value of R for a regression against ‘Adv’ alone. (2) b) Look at the line that represents the coefficient of ‘Price.’ What about the coefficient makes me happy? What about the coefficient makes me sad? (2) c) Find the partial correlation of ‘Type’ with ‘Sold.’ (2) d) Since you now have enough information to do it, use an F test the see whether the addition of the two advertising independent variables as a pair was worthwhile. (4) [10] Solution: a) Using the material in this output, find the value of R 2 for a regression against ‘Adv’ alone. (2) The Sequential sum of squares gives us for ‘Adv’, a sum of squares of 40.404. If we divide that by the total 40 .404 .96869 sum of squares we get RY2.2 41 .714 b) Look at the line that represents the coefficient of ‘Price.’ What about the coefficient makes me happy? What about the coefficient makes me sad? (2) The coefficient of ‘Price’ is negative as it should be. However, with a p-value of .586 it is still not significant. c) Find the partial correlation of ‘Type’ with ‘Sold.’ (2). According to the outline, the partial correlation, R2 R2 .99657 .97567 .8590 , the additional explanatory power of which comes from rY23.12 Y .123 2 Y .12 1 .97567 1 RY .12 the third independent variable after the effects of the first two are considered can be computed easily by t2 4.27 2 .8587 , using its t-ratio from the computer printout. rY23.12 2 3 4.27 2 3 t 3 df 21 252y0781 12/11/07 d) Since you now have enough information to do it, use an F test the see whether the addition of the two advertising independent variables as a pair was worthwhile. (4) [10] The ANOVA given in the printout was as below. Analysis of Variance Source DF SS Regression 3 41.571 Residual Error 3 0.143 Total 6 41.714 MS 13.857 0.048 F 290.24 P 0.000 The regression sum of squares for ‘price’ alone was 2.1797. So we can do the following. 
d) Since you now have enough information to do it, use an F test to see whether the addition of the two advertising independent variables as a pair was worthwhile. (4) [10] The ANOVA given in the printout was as below.

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  41.571  13.857  290.24  0.000
Residual Error   3   0.143   0.048
Total            6  41.714

The regression sum of squares for 'Price' alone was 2.1797. So we can do the following.

Source        SS     DF  MS       F
Regression   41.571   3
  X1          2.180   1
  X2, X3     39.391   2  19.696   412.90  F.05(2,3) = 9.55 or F.01(2,3) = 30.82
Error         0.143   3   0.0477
Total        41.714   6

Note that, since the computed F is far above the table values, we can reject the null hypothesis of no additional explanatory power from the added independent variables.

e) Compute the correlation between 'Adv' and 'Price' and test it for significance. Try to use the spare parts that you already have. (4) [14]
Solution: The relevant parts of the Spare Parts Computation are repeated.
SSX1 = ΣX1² - nX̄1² = 5624.44 - 7(28.34286)² = 1.21601*
SSX2 = ΣX2² - nX̄2² = 1517 - 7(13.85714)² = 172.8577*
SX1X2 = ΣX1X2 - nX̄1X̄2 = 2751.4 - 7(28.34286)(13.85714) = 2.14315
The simple sample correlation coefficient is r = Sxy/√(SSx·SSy). So the correlation we need is r = SX1X2/√(SSX1·SSX2) = 2.14315/√(1.21601 × 172.8577) = 0.14782, so r² = 0.02185. If we want to test H0: ρ(x1x2) = 0 against H1: ρ(x1x2) ≠ 0 and the variables are normally distributed, we use t(n-2) = r/sr = r√(n-2)/√(1-r²) = 0.14782√5/√(1 - 0.02185) = 0.33421. The rejection zone is above t(α/2) and below -t(α/2) with 5 degrees of freedom. Whether we use t.005(5) = 4.032 or t.025(5) = 2.571, it should be clear that we will not reject the null hypothesis of insignificance.

f) Test the same correlation to see if it is 0.2. (4) [18, 78] To test H0: ρ(x1x2) = ρ0 against H1: ρ(x1x2) ≠ ρ0, where ρ0 = 0.2, use Fisher's z-transformation. Let z̃ = (1/2)ln[(1+r)/(1-r)]. This has an approximate mean of ζ0 = (1/2)ln[(1+ρ0)/(1-ρ0)] and a standard deviation of sz = 1/√(n-3), so that t = (z̃ - ζ0)/sz. We have r = 0.14782, so z̃ = (1/2)ln(1.14782/0.85218) = (1/2)ln(1.34692) = (1/2)(0.29782) = 0.14891, ζ0 = (1/2)ln(1.2/0.8) = (1/2)ln(1.5) = (1/2)(0.40547) = 0.20273 and sz = 1/√(7-3) = 0.5. Finally t = (0.14891 - 0.20273)/0.5 = -0.1076. The rejection zone is above t(α/2) and below -t(α/2). Whether we use t.005(5) = 4.032 or t.025(5) = 2.571, it should be clear that we will not reject the null hypothesis.

g) Don't forget to hand in your last computer problem. Check here if you did. __________________. (2 to 7) [78+]
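Note that np.arctanh is exactly Fisher's z-transformation (1/2)ln[(1+r)/(1-r)], so the whole test in f) takes a few lines:

    import numpy as np

    r, rho0, n = 0.14782, 0.2, 7
    z_r = np.arctanh(r)            # 0.14891
    zeta0 = np.arctanh(rho0)       # 0.20273
    s_z = 1.0 / np.sqrt(n - 3)     # 0.5
    print((z_r - zeta0) / s_z)     # about -0.108, nowhere near rejection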
5) The manager of a computer network has the following data on the 200 service interruptions that have occurred over the last 100 days.

x:  0   1   2   3   4   5   6   7
O:  2  51  18  12  11   4   1   1   (total 100)

a) Test to see if these follow a Poisson distribution. (6)
b) Use another method to test whether this has a Poisson distribution with a parameter of 1.8. (5)
c) A coin is to be tested to see if it is fair. In order to test it, the coin is given 5 flips 100 times and the number of heads in 5 flips is recorded:

x:  0   1   2   3   4   5
O:  3  16  30  29  18   4

This means that there are a total of 500 flips and the coin has come up heads 255 times. Construct a 99% confidence interval for the proportion of times it comes up heads. Test the hypothesis that the proportion is 50% using this interval. (4)
d) The distribution shown here should be a binomial distribution with n = 5 and p = .5. A more powerful test of the fairness of the coin should be to use probabilities from your cumulative binomial table to check whether this distribution is correct. (4) [19, 97]
e) Assume that a coin is flipped 20 times and comes up heads half the time. If the sequence of heads and tails is HHHTTTHHHTTTHHHTTTTH, can we say that the sequence is random? (This is not a yes or no question – I want a statistical test for randomness!) (2)
f) Now assume that there are 5 times as many flips and 5 times as many runs, and heads half the time. Can we say that the sequence is random now? (3) [24, 102]

Solution: a) Since the mean is unknown, we must use a chi-squared test. If there are 200 interruptions in 100 days, we gather that the mean is 2 and use the Poisson table for a parameter of 2.0. The relevant parts of the Poisson tables follow.

      Poisson(1.8)        Poisson(2.0)        Poisson(2.5)
 k   P(x=k)   P(x≤k)    P(x=k)   P(x≤k)     P(x=k)   P(x≤k)
 0  0.165299  0.16530  0.135335  0.13534   0.082085  0.08208
 1  0.297538  0.46284  0.270671  0.40601   0.205213  0.28730
 2  0.267784  0.73062  0.270671  0.67668   0.256516  0.54381
 3  0.160671  0.89129  0.180447  0.85712   0.213763  0.75758
 4  0.072302  0.96359  0.090224  0.94735   0.133602  0.89118
 5  0.026029  0.98962  0.036089  0.98344   0.066801  0.95798
 6  0.007809  0.99743  0.012030  0.99547   0.027834  0.98581
 7  0.002008  0.99944  0.003437  0.99890   0.009941  0.99575
 8  0.000452  0.99989  0.000859  0.99976   0.003106  0.99886
 9  0.000090  0.99998  0.000191  0.99995   0.000863  0.99972
10  0.000016  1.00000  0.000038  0.99999   0.000216  0.99994
11  0.000003  1.00000  0.000007  1.00000   0.000049  0.99999
12  0.000000  1.00000  0.000001  1.00000   0.000010  1.00000
13  0.000000  1.00000  0.000000  1.00000   0.000002  1.00000

If we take the Poisson(2.0) frequencies and multiply by n = 100 to get our E, we get into big trouble: none of the values of E for x ≥ 5 is above 5. We will thus create a single category for x ≥ 5, with E = 100[1 - P(x ≤ 4)] = 5.2653 and O = 4 + 1 + 1 = 6. (Note that for accuracy this was done by computer.) The data is put into 6 categories. Since we estimated a mean from the data, we have 6 - 1 - 1 = 4 degrees of freedom. Because there were no E cells below 5, the shortcut method χ² = Σ(O²/E) - n = 136.589 - 100 = 36.589 was used.

 x    E          O    O²/E
 0    13.5335    2    0.2956
 1    27.0671   51   96.0947
 2    27.0671   18   11.9703
 3    18.0447   12    7.9802
 4     9.0224   11   13.4111
 5+    5.2653    6    6.8372
     100.0001  100  136.5891

According to the chi-squared table, χ².05(4) = 9.4877. Since the computed value is larger than the table value, we can reject H0: x ~ Poisson.

b) If the mean is known, we can use a Kolmogorov-Smirnov test. The Poisson(1.8) table is used to get the expected cumulative frequency FE. To get the observed cumulative frequency FO, add the O column to get the Cum O column; then divide the Cum O column by n = 100. The D column is the absolute difference between the FO and FE columns.

 x   FE        O  Cum O   FO      D
 0  .16530     2     2   .02   .14530
 1  .46284    51    53   .53   .06716
 2  .73062    18    71   .71   .02062
 3  .89129    12    83   .83   .06129
 4  .96359    11    94   .94   .02359
 5  .98962     4    98   .98   .00962
 6  .99743     1    99   .99   .00743
 7  .99944     1   100  1.00   .00056
 8  .99989     0   100  1.00   .00011
 9  .99998     0   100  1.00   .00002

For the K-S test with a sample size above 40, the 5% critical value is 1.36/√100 = .136. The largest difference, .14530 at x = 0, is just above this, so this test also rejects the null hypothesis H0: x ~ Poisson(1.8).
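The computation in a) can be checked with scipy; the pooling of the x ≥ 5 tail and the shortcut formula Σ(O²/E) - n are reproduced below (a sketch, not the only way to do it):

    import numpy as np
    from scipy import stats

    obs = np.array([2, 51, 18, 12, 11, 6], dtype=float)  # last cell is x >= 5
    p = stats.poisson.pmf(np.arange(5), 2.0)
    p = np.append(p, 1 - p.sum())                        # P(x >= 5)
    exp = 100 * p
    print(np.sum(obs ** 2 / exp) - 100)                  # about 36.6
    print(stats.chi2.ppf(0.95, df=4))                    # 9.4877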
c) The formula table excerpt for a proportion is below.
Confidence interval: p = p̄ ± z(α/2)·sp, where sp = √(p̄q̄/n) and q̄ = 1 - p̄.
Hypotheses: H0: p = p0, H1: p ≠ p0.
Test ratio: z = (p̄ - p0)/σp, where σp = √(p0q0/n) and q0 = 1 - p0.
Critical value: p̄cv = p0 ± z(α/2)·σp.

Our hypotheses are H0: p = .5 and H1: p ≠ .5. We have p̄ = x/n = 255/500 = .5100, z.005 = 2.576 and sp = √(p̄q̄/n) = √[(.510)(.490)/500] = √.00050 = .02236. So p = p̄ ± z(α/2)·sp = .5100 ± 2.576(.02236) = .5100 ± .0576, or .4524 to .5576. This interval (which looks terribly large) includes .5, so we cannot reject the null hypothesis.

d) The distribution shown here should be a binomial distribution with n = 5 and p = .5. A more powerful test of the fairness of the coin should be to use probabilities from your cumulative binomial table to check whether this distribution is correct. (4) [19, 97]
H0: x ~ Binomial(n = 5, p = .5). FE is copied from the cumulative binomial table.

 x   FE        O  Cum O   FO      D
 0   .03125    3     3   .03   .00125
 1   .18750   16    19   .19   .00250
 2   .50000   30    49   .49   .01000
 3   .81250   29    78   .78   .03250
 4   .96875   18    96   .96   .00875
 5  1.00000    4   100  1.00   0

According to the K-S table, the 5% critical value for the largest number in D is 1.36/√100 = 0.136. No number in D is that large, so we cannot reject the null hypothesis.

e) Assume that a coin is flipped 20 times and comes up heads half the time. If the sequence of heads and tails is HHHTTTHHHTTTHHHTTTTH, can we say that the sequence is random? (2) This is a runs test with n1 = 10 and n2 = 10. We count the runs and find r = 7.

HHH TTT HHH TTT HHH TTTT H
 1   2   3   4   5   6   7

The 5% runs test table has a null hypothesis H0: randomness. The critical values given for n1 = 10 and n2 = 10 are 6 and 16. Since the number of runs is between these two numbers, we cannot reject the null hypothesis.

f) Now assume that there are 5 times as many flips and 5 times as many runs, and heads half the time. Can we say that the sequence is random now? (3) [24, 102] If n1 and n2 are too large for the table, r follows the Normal distribution with mean μr = 2n1n2/n + 1 = 2(50)(50)/100 + 1 = 51 and variance σr² = 2n1n2(2n1n2 - n)/[n²(n - 1)] = 5000(5000 - 100)/[100²(99)] = 24.74747. If the number of runs is r = 5(7) = 35, we have z = (35 - 51)/√24.74747 = -16/4.97468 = -3.21629. A conventional 5% normal test is to fail to reject the null hypothesis if this value of z is between -1.960 and +1.960. It is not, so we reject the null hypothesis. Note how the larger sample size has made the runs test more powerful.
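The normal approximation in f) in a few lines (same numbers as above):

    import numpy as np

    n1 = n2 = 50
    n = n1 + n2
    mu_r = 2 * n1 * n2 / n + 1                                     # 51
    var_r = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))   # 24.747
    print((35 - mu_r) / np.sqrt(var_r))                            # about -3.22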
6) Do the following. Use a 1% significance level in this problem!
a) (Multiple choice) I wish to test to see if a distribution is Normal, but I must first use my data to figure out the mean and standard deviation. I have 100 data points divided into 0 to under 20, 20 to under 40, 40 to under 60, 60 to under 80 and 80 to under 100. Assume that my expected frequency is 5 or larger for each class. I could use
(i) A chi-squared test with 4 degrees of freedom or a Kolmogorov-Smirnov test.
(ii) A chi-squared test with 2 degrees of freedom or a Kolmogorov-Smirnov test.
(iii) A chi-squared test with 4 degrees of freedom or a Lilliefors test.
(iv) *A chi-squared test with 2 degrees of freedom or a Lilliefors test.
(v) Only a Lilliefors test.
(vi) Only a Kolmogorov-Smirnov test.
(vii) Only a chi-squared test. (2)
b) (Bassett et al) An industrial process is run at 4 different temperatures on four different days. A random sample of 3 units is taken and scored. The results are as follows. Do the scores differ according to temperature?

100C  120C  140C  160C
 41    54    50    38
 44    56    52    36
 48    53    48    41

Minitab has computed the following. Sum of 100C = 133, Sum of 120C = 163, Sum of 140C = 150, Sum of 160C = 115, Sum of squares of 100C = 5921, Sum of squares of 120C = 8861, Sum of squares of 140C = 7508, Sum of squares of 160C = 4421, Bartlett's Test - Test statistic = 1.22, p-value = 0.748 and Levene's Test - Test statistic = 0.43, p-value = 0.736. Assume that the scores are not considered to come from the Normal distribution, state your null hypothesis and test it. (5)
c) Assume that the scores are considered to come from the Normal distribution, state your null hypothesis and test it. (6)
d) Why were the Bartlett and Levene tests run? Which of the two is correct here if the underlying distribution is Normal? What do they tell us? (2) [15]
e) Ignore everything that has gone before. Assume that the Normal distribution applies and test the hypothesis that the mean of the 120C population is larger than the mean of the 100C population. Assume that the underlying distributions are Normal and have equal variances (4), or assume that the underlying distributions are Normal and do not necessarily have equal variances (6). Do not do both! [19, 116]

Solution: a) The answer is (iv). Since the parameters are not known, we cannot use a Kolmogorov-Smirnov test. We have 5 classes and have estimated two parameters, so df = 5 - 1 - 2 = 2 for the chi-squared test.

b) If these scores are a random sample from a non-Normal population, we must use the Kruskal-Wallis test, with the null hypothesis of equal medians. Replace the numbers by their ranks.

x1  r1     x2  r2    x3  r3     x4  r4
41  3.5    54  11    50   8     38   2
44  5      56  12    52   9     36   1
48  6.5    53  10    48   6.5   41   3.5
    15.0       33        23.5        6.5

We must check that 15.0 + 33 + 23.5 + 6.5 = 78 = n(n+1)/2 = 12(13)/2. Now compute the Kruskal-Wallis statistic
H = [12/(n(n+1))]·Σ(SRi²/ni) - 3(n+1) = [12/(12·13)][(15² + 33² + 23.5² + 6.5²)/3] - 3(13) = (12/156)(1908.5/3) - 39 = 48.936 - 39 = 9.936.
The sample sizes are too large for the Kruskal-Wallis table, so this should be compared with χ².01(3) = 11.3449. Since our computed statistic is smaller than the table value, do not reject the null hypothesis of equal medians.
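scipy reproduces part b), with one caveat: stats.kruskal applies a tie correction, so its H (about 10.01) is a little above the uncorrected hand value of 9.936; the conclusion at the 1% level is the same.

    from scipy import stats

    H, p = stats.kruskal([41, 44, 48], [54, 56, 53], [50, 52, 48], [38, 36, 41])
    print(H, p)   # p is above .01, so do not reject equal medians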
c) Assume that the scores are considered to come from the Normal distribution, state your null hypothesis and test it. (6) Minitab has computed the following. Sum of 100C = 133, Sum of 120C = 163, Sum of 140C = 150, Sum of 160C = 115, Sum of squares of 100C = 5921, Sum of squares of 120C = 8861, Sum of squares of 140C = 7508, Sum of squares of 160C = 4421, Bartlett's Test - Test statistic = 1.22, p-value = 0.748 and Levene's Test - Test statistic = 0.43, p-value = 0.736.

This is a one-way ANOVA with $H_0\!:$ the four column (temperature) means are equal.

                  Temperature
Process       1         2         3         4
             41        54        50        38
             44        56        52        36
             48        53        48        41
Sum         133    +  163    +  150    +  115     =   561   = $\sum x_{ij}$
n_j           3    +    3    +    3    +    3     =    12   = $n$
x-bar_j   44.333    54.333    50.000    38.333
SS         5921    + 8861    + 7508    + 4421     = 26711   = $\sum x_{ij}^2$
x-bar_j^2 1965.44 + 2952.07 + 2500.00 + 1469.44   = 8886.95 = $\sum \bar{x}_j^2$

$\bar{x} = \frac{\sum x_{ij}}{n} = \frac{561}{12} = 46.750$
$SST = \sum x_{ij}^2 - n\bar{x}^2 = 26711 - 12(46.750)^2 = 26711 - 26226.75 = 484.25$
$SSB = \sum_j n_j \bar{x}_j^2 - n\bar{x}^2 = 3(8886.95) - 12(46.750)^2 = 26660.85 - 26226.75 = 434.10$
$SSW = SST - SSB = 484.25 - 434.10 = 50.15$

Source     SS       DF   MS        F        F.01
Between   434.10     3   144.70    23.08s   $F_{.01}(3,8) = 7.59$
Within     50.15     8   6.2688
Total     484.25    11

Since the computed F = 23.08 exceeds $F_{.01}(3,8) = 7.59$ (the 's' marks a significant result), we reject $H_0$: the column means are not all equal, so the scores do differ according to temperature.

d) Why were the Bartlett and Levene tests run? Which of the two is correct here if the underlying distribution is Normal? What do they tell us? (2) [15] Since we do an ANOVA on the assumption that the underlying data are Normal, we should use the Bartlett test results. The null hypothesis of both tests is equal variances, which is another requirement of ANOVA. The extremely high p-value means that we cannot reject that null hypothesis, so the equal-variance assumption stands.
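The ANOVA and both variance tests Minitab reported can be reproduced with scipy (a sketch, assuming scipy is installed; scipy's levene defaults to median centering, the Brown-Forsythe form, which appears to be what Minitab's Levene test used here since it reproduces the 0.43):

    from scipy import stats

    t100 = [41, 44, 48]
    t120 = [54, 56, 53]
    t140 = [50, 52, 48]
    t160 = [38, 36, 41]
    F, p = stats.f_oneway(t100, t120, t140, t160)
    print(F, p)                                    # F about 23.08, p about .0003
    print(stats.bartlett(t100, t120, t140, t160))  # about 1.22, p about .748
    print(stats.levene(t100, t120, t140, t160))    # about 0.43, p about .736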
e) Ignore everything that has gone before; assuming Normality, we test whether the mean of the 120C population is larger than the mean of the 100C population. The formulas for the two methods requested are below.

Difference between two means, $D = \mu_1 - \mu_2$ ($\sigma$ unknown, variances assumed equal):
Confidence Interval: $D = d \pm t_{\alpha/2} s_d$, where $d = \bar{x}_1 - \bar{x}_2$
Hypotheses: $H_0\!: D = D_0$, $H_1\!: D \neq D_0$
Test Ratio: $t = \frac{d - D_0}{s_d}$
Critical Value: $d_{cv} = D_0 \pm t_{\alpha/2} s_d$
Here $s_d = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ with $\hat{s}_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $DF = n_1 + n_2 - 2$.

Difference between two means ($\sigma$ unknown, variances assumed unequal):
The interval, hypotheses, test ratio and critical value are as above, but $s_d = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ and $DF = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$ (the Satterthwaite approximation).

Minitab has computed the following. Sum of 100C = 133, Sum of 120C = 163, Sum of squares of 100C = 5921, Sum of squares of 120C = 8861. To summarize, $n_1 = n_2 = 3$, $\sum x_1 = 133$, $\sum x_1^2 = 5921$, $\sum x_2 = 163$ and $\sum x_2^2 = 8861$. We have already computed $\bar{x}_1 = 44.333$ and $\bar{x}_2 = 54.333$, so $d = \bar{x}_1 - \bar{x}_2 = -10$. We now need
$s_1^2 = \frac{\sum x_1^2 - n_1 \bar{x}_1^2}{n_1 - 1} = \frac{5921 - 3(44.333)^2}{2} = 12.3777$, so $s_1 = 3.5182$, and
$s_2^2 = \frac{\sum x_2^2 - n_2 \bar{x}_2^2}{n_2 - 1} = \frac{8861 - 3(54.333)^2}{2} = 2.3877$, so $s_2 = 1.5452$.
If we assume equal variances, we get $\hat{s}_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{2(12.3777) + 2(2.3877)}{4} = 7.3827$. This means $s_d = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{7.3827\left(\frac{1}{3} + \frac{1}{3}\right)} = \sqrt{4.9218} = 2.2185$. Our degrees of freedom are $n_1 + n_2 - 2 = 3 + 3 - 2 = 4$ and we will use $t_{.01}^{(4)} = 3.747$ for a left-sided test. We are testing $H_0\!: \mu_1 \geq \mu_2$ against $H_1\!: \mu_1 < \mu_2$ or, if $D = \mu_1 - \mu_2$, $H_0\!: D \geq 0$ against $H_1\!: D < 0$. If we use a t-ratio we get $t = \frac{d - D_0}{s_d} = \frac{-10 - 0}{2.2185} = -4.5076$. Since this is below $-t_{.01}^{(4)} = -3.747$, we reject the null hypothesis. If we use a critical value for $d$, we need a single value below zero: $d_{cv} = D_0 \pm t_{\alpha/2} s_d$ becomes $d_{cv} = D_0 - t_{.01} s_d = 0 - 3.747(2.2185) = -8.313$. Since $d = \bar{x}_1 - \bar{x}_2 = -10$ is below this number, we reject the null hypothesis.
If we do not assume that the variances are equal, we need to do some arithmetic with our standard errors: $s_{\bar{x}_1}^2 = \frac{s_1^2}{n_1} = \frac{12.3777}{3} = 4.1259$ and $s_{\bar{x}_2}^2 = \frac{s_2^2}{n_2} = \frac{2.3877}{3} = 0.7959$. These sum to $\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} = 4.1259 + 0.7959 = 4.9218$, so $s_d = \sqrt{4.9218} = 2.2185$ and
$DF = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} = \frac{(4.9218)^2}{\frac{(4.1259)^2}{2} + \frac{(0.7959)^2}{2}} = \frac{24.2241}{8.51153 + 0.31673} = \frac{24.2241}{8.82826} = 2.7439,$
which is rounded down to 2. We will use $t_{.01}^{(2)} = 6.965$ for a left-sided test of the same hypotheses. The t-ratio is unchanged: $t = \frac{d - D_0}{s_d} = \frac{-10 - 0}{2.2185} = -4.5076$. Since this is not below $-t_{.01}^{(2)} = -6.965$, we cannot reject the null hypothesis. If we use a critical value for $d$, we need a single value below zero: $d_{cv} = D_0 - t_{.01} s_d = 0 - 6.965(2.2185) = -15.452$. Since $d = \bar{x}_1 - \bar{x}_2 = -10$ is not below this number, we cannot reject the null hypothesis.

7) The following are tests of proportions (Bassett et al). You must do legitimate tests at the 10% significance level.
a) Is there any association between forecasted and observed rainfall? 173 forecasts are considered.

              Observed rainfall
Forecast      No rain   Light rain   Heavy rain
None            34          24           17
Light rain      21           4            3
Heavy rain      23           9           38

State your null and alternative hypotheses and test them. (7)
b) Are there significant differences in the proportions of female insects in 3 different locations? In location 1, 44% of 100 bugs are female. In location 2, 43% of 200 bugs are female. In location 3, 55% of 200 bugs are female. First test to see if there is a significant difference between the proportions in locations 1 and 2. (4)
c) In b, test whether proportions of females are independent of location using all three proportions. (5) [16, 132]

Solution: a) This is a chi-squared test of independence or homogeneity. The null hypothesis is independence (no association between forecast and observed rainfall). We repeat the data as our O table; the row sums are made into proportions in the 'rp' column.

O        No rain   Light   Heavy   Sum     rp
None        34       24      17     75   0.433526
Light       21        4       3     28   0.161850
Heavy       23        9      38     70   0.404624
Sum         78       37      58    173   1.000000

The E table is gotten by applying the row proportions to the column totals.

E        No rain    Light     Heavy    Sum
None     33.8150   16.0405   25.1445    75
Light    12.6243    5.9884    9.3873    28
Heavy    31.5607   14.9711   23.4682    70
Sum      78        37        58        173

Because none of the expected frequencies is below 5, the shortcut formula $\chi^2 = \sum \frac{O^2}{E} - n$ can be used:

  O      E         O^2/E
 34    33.8150    34.1860
 21    12.6243    34.9327
 23    31.5607    16.7614
 24    16.0405    35.9092
  4     5.9884     2.6718
  9    14.9711     5.4104
 17    25.1445    11.4936
  3     9.3873     0.9587
 38    23.4682    61.5300
173   173.0000   203.8538

So $\chi^2 = 203.854 - 173 = 30.854$. The degrees of freedom are $(3-1)(3-1) = 4$. The 10% table value of chi-squared with 4 degrees of freedom is 7.7794 (the 5% value is 9.4877, but you were supposed to use the 10% value). Since our computed chi-squared is above the table value, we reject the null hypothesis and conclude that there is some association between the forecast and the observed weather.
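Both halves of this page can be checked in scipy (a sketch, not the official solution method; ttest_ind's alternative argument needs a reasonably recent scipy, and scipy uses the unrounded Welch degrees of freedom, about 2.74, rather than rounding down to 2):

    import numpy as np
    from scipy import stats

    t100 = [41, 44, 48]
    t120 = [54, 56, 53]
    # Pooled-variance test of H1: mean at 100C is below mean at 120C
    print(stats.ttest_ind(t100, t120, equal_var=True, alternative='less'))
    # t about -4.51 on 4 df, p about .005 -- significant at the 1% level
    print(stats.ttest_ind(t100, t120, equal_var=False, alternative='less'))
    # Welch df about 2.74, p just above .01 -- not significant at the 1% level

    # Chi-squared test of independence for the rainfall table in 7 a)
    obs = np.array([[34, 24, 17],
                    [21,  4,  3],
                    [23,  9, 38]])
    chi2, p, df, expected = stats.chi2_contingency(obs)
    print(chi2, df, p)    # about 30.85 with 4 df, p far below .10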
b) First we test whether there is a significant difference between the proportions in locations 1 and 2. Behold, the usual excerpt from the formula table.

Difference between proportions, $\Delta p = p_1 - p_2$, $\Delta p_0 = p_{01} - p_{02}$:
Confidence Interval: $\Delta p \pm z_{\alpha/2} s_{\Delta p}$, where $s_{\Delta p} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$ and $q = 1 - p$
Hypotheses: $H_0\!: \Delta p = \Delta p_0$, $H_1\!: \Delta p \neq \Delta p_0$
Test Ratio: $z = \frac{\Delta p - \Delta p_0}{\sigma_{\Delta p}}$. If $\Delta p_0 = 0$, $\sigma_{\Delta p} = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ with $p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$; if $\Delta p_0 \neq 0$, $\sigma_{\Delta p} = \sqrt{\frac{p_{01} q_{01}}{n_1} + \frac{p_{02} q_{02}}{n_2}}$, or use $s_{\Delta p}$.
Critical Value: $\Delta p_{cv} = \Delta p_0 \pm z_{\alpha/2} \sigma_{\Delta p}$

Our null hypothesis is $H_0\!: p_1 = p_2$. Again, I'll skip the confidence interval. Our facts are $n_1 = 100$, $p_1 = .44$, $n_2 = 200$, $p_2 = .43$, so $\Delta p = p_1 - p_2 = .01$, and $\alpha = .10$. I'm in a hurry, so we will say $p_0 = \frac{44 + 86}{100 + 200} = \frac{130}{300} = .43333$. The standard error is $\sigma_{\Delta p} = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{.43333(.56667)\left(\frac{1}{100} + \frac{1}{200}\right)} = \sqrt{.0036833} = .06069$. The test ratio is $z = \frac{\Delta p - \Delta p_0}{\sigma_{\Delta p}} = \frac{.01 - 0}{.06069} = 0.1648$. This is a 2-sided test, so I will use $z_{.05} = 1.645$. We do not reject the null hypothesis if our value of z is between $\pm 1.645$; it is, so we cannot reject the null hypothesis. If we use a critical value, we need $\Delta p_{cv} = \Delta p_0 \pm z_{.05}\sigma_{\Delta p} = 0 \pm 1.645(.06069) = \pm .0998$. As long as $\Delta p = p_1 - p_2$ is between these values we cannot reject the null hypothesis, and $\Delta p = .01$ is well inside the 'do not reject' region. The two proportions are not significantly different.

c) Our facts are now $n_1 = 100$, $p_1 = .44$, $n_2 = 200$, $p_2 = .43$, $n_3 = 200$ and $p_3 = .55$. If anyone tried to use $H_0\!: p_1 - p_2 - p_3 = 0$, I can only say that I warned you - what does this expression equal when all the p's are equal to, say, .3? Tests involving more than 2 proportions are chi-squared tests. Our observed table is gotten by multiplying the sample sizes by the observed proportions; the 'rp' column holds the row proportions.

O        Loc 1   Loc 2   Loc 3   Sum     rp
Female     44      86     110    240   0.480
Male       56     114      90    260   0.520
Sum       100     200     200    500   1.000

The E table is gotten by applying the row proportions to the column totals.

E        Loc 1   Loc 2   Loc 3   Sum
Female     48      96      96    240
Male       52     104     104    260
Sum       100     200     200    500

Because none of the expected frequencies is below 5, the shortcut formula $\chi^2 = \sum \frac{O^2}{E} - n$ can be used:

  O      E      O^2/E
 44     48     40.333
 56     52     60.308
 86     96     77.042
114    104    124.962
110     96    126.042
 90    104     77.885
500    500    506.571

So $\chi^2 = 506.571 - 500 = 6.571$. The degrees of freedom are $(2-1)(3-1) = 2$, and the 10% table value of chi-squared with 2 degrees of freedom is 4.6052. Since our computed chi-squared is above the table value, we reject the null hypothesis and conclude that the proportion of females differs significantly by location.
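As a check on b) and c) (a Python sketch using the same pooled-proportion formula; not part of the original answer):

    from math import sqrt
    import numpy as np
    from scipy import stats

    # b) two-proportion z test for locations 1 and 2
    n1, x1, n2, x2 = 100, 44, 200, 86
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)           # pooled proportion .43333
    z = (p1 - p2) / sqrt(p0 * (1 - p0) * (1/n1 + 1/n2))
    print(z)                             # about 0.165, inside +/-1.645

    # c) 2x3 chi-squared test using all three locations
    obs = np.array([[44,  86, 110],
                    [56, 114,  90]])
    chi2, p, df, expected = stats.chi2_contingency(obs)
    print(chi2, df, p)   # about 6.57 with 2 df, p about .037, below .10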
8) The following are odds and ends that don't fit anywhere else. We are selling our production in an imperfect market. x1 is the number of units produced and x2 is our revenue. r1 and r2 are the ranks of the items in x1 and x2. $\alpha = .05$.

Row    x1     x2    r1   r2
  1   330    221     7    7
  2   263    194     5    5
  3   428    245     9   10
  4   584    243    10    8
  5   423    244     8    9
  6   219    171     4    4
  7   308    213     6    6
  8   123    108     1    1
  9   173    143     3    3
 10   140    120     2    2

Minitab has computed the following: Sum of x1 = 2991, Sum of x2 = 1902, Sum of squares of x1 = 1088721, Sum of squares of x2 = 386210 and Sum of x1x2 = 631812.
a) Test x1 to see if its median is 200. Do not use the sign test or compute any medians. (4)
b) Assuming that x1 and x2 are both random samples from a nonnormal distribution, test to see if they have similar medians. (4)
c) Compute the correlation between x1 and x2 and the rank correlation between them. Why is the rank correlation higher? (6)
d) Test the rank correlation for significance. (2) [16]

Solution: a) Do a Wilcoxon signed rank test with the alleged median in place of x2. The null hypothesis is $H_0\!: \eta = 200$.

Row    x1    eta   d = x1 - eta   |d|   rank    r*
  1   330    200        130       130     7      7
  2   263    200         63        63     4      4
  3   428    200        228       228     9      9
  4   584    200        384       384    10     10
  5   423    200        223       223     8      8
  6   219    200         19        19     1      1
  7   308    200        108       108     6      6
  8   123    200        -77        77     5     -5
  9   173    200        -27        27     2     -2
 10   140    200        -60        60     3     -3

Compute the two rank totals: T+ = 7 + 4 + 9 + 10 + 8 + 1 + 6 = 45 and T- = 5 + 2 + 3 = 10. Check that these sum to $\frac{10(11)}{2} = 55$. Since this is a 2-sided test, look up a 2.5% critical value for $n = 10$; the critical value is 8. If the smaller of the two T's were less than or equal to 8, we would reject the null hypothesis. Since it is not, we do not reject the null hypothesis.

b) This is a good old Wilcoxon-Mann-Whitney test. Rank the data within the whole sample (from 1 to 20 here).

Row    x1    r1     x2    r2
  1   330    17    221    11
  2   263    15    194     8
  3   428    19    245    14
  4   584    20    243    12
  5   423    18    244    13
  6   219    10    171     6
  7   308    16    213     9
  8   123     3    108     1
  9   173     7    143     5
 10   140     4    120     2
            129.0         81.0

The two rank sums are 129 and 81. (We should check that they add to the sum of the first 20 ranks, $\frac{20(21)}{2} = 210$.) We do not have a table that covers samples as large as these. $W$, the smaller of the two rank sums, has the Normal distribution with mean $\mu_W = \frac{1}{2}n_1(n_1 + n_2 + 1) = 0.5(10)(21) = 105$ and variance $\sigma_W^2 = \frac{1}{6}n_2\mu_W = \frac{1}{6}(10)(105) = 175$. This means $z = \frac{W - \mu_W}{\sigma_W} = \frac{81 - 105}{\sqrt{175}} = -1.814$. Since this is between $\pm 1.96$, we cannot reject the null hypothesis of equal medians.

c) With the sums given here you should be able to find the ordinary correlation:
$r = \frac{\sum x_1 x_2 - n\bar{x}_1\bar{x}_2}{\sqrt{\left(\sum x_1^2 - n\bar{x}_1^2\right)\left(\sum x_2^2 - n\bar{x}_2^2\right)}} = \frac{631812 - 10(299.1)(190.2)}{\sqrt{\left(1088721 - 10(299.1)^2\right)\left(386210 - 10(190.2)^2\right)}} = \frac{62923.8}{\sqrt{194112.9(24449.6)}} = .913.$
The formula for rank correlation is $r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$, where $d = r_1 - r_2$. Using what was given, we get the table below.

Row    x1     x2    r1   r2    d    d^2
  1   330    221     7    7    0     0
  2   263    194     5    5    0     0
  3   428    245     9   10   -1     1
  4   584    243    10    8    2     4
  5   423    244     8    9   -1     1
  6   219    171     4    4    0     0
  7   308    213     6    6    0     0
  8   123    108     1    1    0     0
  9   173    143     3    3    0     0
 10   140    120     2    2    0     0
                                     6

We have $r_s = 1 - \frac{6(6)}{10(10^2 - 1)} = 1 - \frac{36}{990} = .9636$. The rank correlation is higher than the ordinary correlation because the relationship between the two variables has a slight curvature; the rank correlation measures monotonicity rather than linearity, so the curvature does not count against it.

d) The table says that for $n = 10$ the 5% critical value for $r_s$ is .5515 and the 2.5% critical value is .6364. Whether our hypotheses are the two-sided pair $H_0\!: \rho_s = 0$ against $H_1\!: \rho_s \neq 0$ (compare with .6364) or, more likely, the one-sided pair $H_0\!: \rho_s \leq 0$ against $H_1\!: \rho_s > 0$ (compare with .5515), $r_s = .9636$ exceeds the critical value, so we reject the null hypothesis and say that the rank correlation is significant.
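All four parts of this problem have scipy one-liners that can be used to check the hand work (a sketch, assuming scipy is available; scipy reports exact or approximate p-values for these sample sizes rather than table critical values, so compare its p-values with $\alpha = .05$):

    import numpy as np
    from scipy import stats

    x1 = np.array([330, 263, 428, 584, 423, 219, 308, 123, 173, 140])
    x2 = np.array([221, 194, 245, 243, 244, 171, 213, 108, 143, 120])

    print(stats.wilcoxon(x1 - 200))    # a) statistic T = 10, p about .08, above .05
    print(stats.mannwhitneyu(x1, x2, alternative='two-sided'))
    # b) equivalent to the rank-sum test above; p about .07, above .05
    print(stats.pearsonr(x1, x2))      # c) r about .913
    print(stats.spearmanr(x1, x2))     # c) r_s about .964
    # d) spearmanr's tiny p-value confirms the rank correlation is significant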