252x0541 4/22/05
ECO252 QBA2 Final EXAM, May 4, 2005
Name and Class hour: _________________________

I. (25+ points) Do all the following. Note that answers without reasons receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else).

The next 12 pages contain computer output. This comes from a data set on the text CD-ROM called Auto2002. There are 121 observations. The dependent variable is MPG (miles per gallon). The columns in the data set are:

Name              The make and model
SUV               ‘Yes’ if it’s an SUV, ‘No’ if not
Drive Type        All wheel, front wheel, rear wheel or four wheel
Horsepower        An independent variable
Fuel Type         Premium or regular
MPG               The dependent variable
Length            In inches, an independent variable
Width             In inches, an independent variable
Weight            In pounds, an independent variable
Cargo Volume      Square feet, an independent variable
Turning Circle    Feet, an independent variable

I added the following:

SUV_D             A dummy variable based on ‘SUV’, 1 for an SUV, otherwise zero
Fuel_D            A dummy variable based on ‘Fuel Type’, 1 for Premium fuel, otherwise zero
SUVwt             An interaction variable, the product of ‘SUV_D’ and ‘Weight’
SUVtc             An interaction variable, the product of ‘SUV_D’ and ‘Turning Circle’
HPsq              The square of ‘Horsepower’
AWD_D             A dummy variable based on ‘Drive Type’, 1 for all wheel drive, otherwise zero
FWD_D             A dummy variable based on ‘Drive Type’, 1 for front wheel drive, otherwise zero
RWD_D             A dummy variable based on ‘Drive Type’, 1 for rear wheel drive, otherwise zero
SUV_L             An interaction variable, the product of ‘SUV_D’ and ‘Length’

Questions are included with the regressions and thus cannot be in order of difficulty. It’s probably a good idea to look over the questions and explanations before you do anything.
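The added columns can be sketched in a few lines of Python. This is only an illustration, not the way the variables were actually built in Minitab: the function name, the dictionary layout and the exact text codes for ‘Drive Type’ ('All wheel', 'Front wheel', 'Rear wheel') are assumptions, and HPsq is taken to be the square of Horsepower on the strength of its name. The sample record uses the Chevrolet Suburban figures quoted later in the exam.

```python
def add_derived_columns(row):
    """Return a copy of one Auto2002-style record with the added columns.

    The key names and category spellings are assumptions made for this sketch.
    """
    out = dict(row)
    # Dummy variables: 1 when the category matches, otherwise 0.
    out["SUV_D"] = 1 if row["SUV"] == "Yes" else 0
    out["Fuel_D"] = 1 if row["Fuel Type"] == "Premium" else 0
    out["AWD_D"] = 1 if row["Drive Type"] == "All wheel" else 0
    out["FWD_D"] = 1 if row["Drive Type"] == "Front wheel" else 0
    out["RWD_D"] = 1 if row["Drive Type"] == "Rear wheel" else 0
    # Interaction variables: the SUV dummy times a measured variable.
    out["SUVwt"] = out["SUV_D"] * row["Weight"]
    out["SUVtc"] = out["SUV_D"] * row["Turning Circle"]
    out["SUV_L"] = out["SUV_D"] * row["Length"]
    # Squared term (assumed meaning of HPsq).
    out["HPsq"] = row["Horsepower"] ** 2
    return out


# Chevrolet Suburban values as quoted in the exam questions.
suburban = {
    "SUV": "Yes",
    "Drive Type": "Rear wheel",
    "Horsepower": 285,
    "Fuel Type": "Regular",
    "Length": 219,
    "Weight": 5590,
    "Turning Circle": 46,
}

derived = add_derived_columns(suburban)
print(derived["SUV_D"], derived["SUVwt"], derived["SUV_L"], derived["HPsq"])
```

Note that four wheel drive gets no dummy of its own, so a four wheel drive vehicle has AWD_D, FWD_D and RWD_D all equal to zero.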
----- 4/28/2005 6:18:32 PM --------------------
Welcome to Minitab, press F1 for help.
Results for: 252x0504-4.MTW

MTB > Stepwise 'MPG' 'Horsepower' 'Length' 'Width' 'Weight' 'Cargo Volume' &
CONT> 'Turning Circle' 'SUV_D' 'Fuel_D' 'SUVwt' 'HPsq' 'AWD_D' &
CONT> 'FWD_D' 'RWD_D' 'SUV_L';
SUBC> AEnter 0.15;
SUBC> ARemove 0.15;
SUBC> Best 0;
SUBC> Constant.

Because I had relatively little idea of what to do, I ran a stepwise regression. You probably have not seen one of these before, but they are relatively easy to read. Note that it dropped 2 observations, so the results will not be quite the same as those I got later. The first numbered column represents the single independent variable that seems to have the most explanatory effect on MPG. The equation reads MPG = 38.31 - 0.00491 Weight, and the t-ratio of the Weight coefficient is -15.34. The fact that Weight entered first with a negative coefficient should surprise no one. At the bottom appear $s_e$, $R^2$, $\bar{R}^2$ (adjusted $R^2$) and the $C_p$ statistic mentioned in your text. The value of the t-ratio and its p-value appear below the coefficient.

Stepwise Regression: MPG versus Horsepower, Length, ...
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is MPG on 14 predictors, with N = 119
N(cases with missing observations) = 2  N(all cases) = 121

Step                     1         2         3         4         5         6
Constant             38.31     36.75     41.59     50.06     50.15     59.00

Weight            -0.00491  -0.00436  -0.00578  -0.00495  -0.00424  -0.00339
T-Value             -15.34    -11.87    -12.82     -9.31     -6.74     -5.61
P-Value              0.000     0.000     0.000     0.000     0.000     0.000

SUV_D                          -1.72    -33.71    -35.29    -35.12    -18.68
T-Value                        -2.84     -4.99     -5.36     -5.40     -2.71
P-Value                        0.005     0.000     0.000     0.000     0.008

SUV_L                                    0.180     0.185     0.182     0.088
T-Value                                   4.75      5.04      5.01      2.26
P-Value                                  0.000     0.000     0.000     0.026

Turning Circle                                    -0.285    -0.292    -0.255
T-Value                                            -2.79     -2.90     -2.75
P-Value                                            0.006     0.004     0.007

Horsepower                                                  -0.0124   -0.1619
T-Value                                                       -2.01     -5.04
P-Value                                                       0.046     0.000

HPsq                                                                  0.00040
T-Value                                                                  4.73
P-Value                                                                 0.000

S                     2.50      2.43      2.23      2.17      2.14      1.96
R-Sq                 66.78     68.94     74.04     75.70     76.55     80.45
R-Sq(adj)            66.50     68.40     73.36     74.85     75.51     79.41
Mallows C-p           71.5      61.4      34.8      27.4      24.7       4.8

More? (Yes, No, Subcommand, or Help)
SUBC> y

I’m greedy, so while I was surprised that Minitab had found six explanatory (independent) variables that actually seemed to affect miles per gallon, I wanted more. For the first time ever (for me), Minitab found another variable.

Step                     7
Constant             58.50

Weight            -0.00342
T-Value              -5.74
P-Value              0.000

SUV_D                -19.0
T-Value              -2.79
P-Value              0.006

SUV_L                0.090
T-Value               2.36
P-Value              0.020

Turning Circle      -0.210
T-Value              -2.24
P-Value              0.027

Horsepower          -0.175
T-Value              -5.43
P-Value              0.000

HPsq               0.00042
T-Value               5.03
P-Value              0.000

Fuel_D                0.92
T-Value               2.11
P-Value              0.037

S                     1.93
R-Sq                 81.21
R-Sq(adj)            80.02
Mallows C-p            2.5

More? (Yes, No, Subcommand, or Help)
SUBC> y
No variables entered or removed
More? (Yes, No, Subcommand, or Help)
SUBC> n

Because I was worried about collinearity, I had the computer do a table of correlations between all the independent variables. The table is triangular since the correlation between, say, Length and Horsepower is going to be the same as the correlation between Horsepower and Length.
So, for example, the correlation between Horsepower and Length is .648 and the p-value of zero below it evaluates the null hypothesis that the correlation is insignificant. The explanation of Predicted $R^2$ that appears below the correlation table was a new one on me, but could help you in comparing the regressions.

MTB > Correlation 'Horsepower' 'Length' 'Width' 'Weight' 'Cargo Volume' &
CONT> 'Turning Circle' 'SUV_D' 'Fuel_D' 'SUVwt' 'SUVtc' 'HPsq' 'AWD_D' &
CONT> 'FWD_D' 'RWD_D' 'SUV_L'.

Correlations: Horsepower, Length, Width, Weight, Cargo Volume, ...

                Horsepower      Length       Width      Weight
Length               0.648
                     0.000

Width                0.660       0.825
                     0.000       0.000

Weight               0.673       0.634       0.780
                     0.000       0.000       0.000

Cargo Volume         0.296       0.395       0.546       0.716
                     0.001       0.000       0.000       0.000

Turning Circ         0.497       0.750       0.658       0.650
                     0.000       0.000       0.000       0.000

SUV_D                0.160      -0.102       0.180       0.535
                     0.080       0.265       0.049       0.000

Fuel_D               0.321      -0.013      -0.042       0.057
                     0.000       0.886       0.645       0.540

SUVwt                0.182      -0.077       0.206       0.562
                     0.045       0.403       0.023       0.000

SUVtc                0.185      -0.062       0.211       0.577
                     0.042       0.502       0.020       0.000

HPsq                 0.989       0.632       0.645       0.668
                     0.000       0.000       0.000       0.000

AWD_D                0.059      -0.118      -0.037       0.065
                     0.523       0.199       0.691       0.483

FWD_D               -0.370      -0.001      -0.163      -0.453
                     0.000       0.994       0.076       0.000

RWD_D                0.334       0.070       0.151       0.351
                     0.000       0.445       0.101       0.000

SUV_L                0.197      -0.053       0.219       0.582
                     0.030       0.564       0.016       0.000

              Cargo Volume Turning Circ       SUV_D      Fuel_D
Turning Circ         0.486
                     0.000

SUV_D                0.459       0.139
                     0.000       0.127

Fuel_D              -0.245      -0.069      -0.147
                     0.007       0.456       0.110

SUVwt                0.473       0.161       0.999      -0.141
                     0.000       0.078       0.000       0.125

SUVtc                0.484       0.196       0.996      -0.142
                     0.000       0.031       0.000       0.121

HPsq                 0.289       0.480       0.173       0.296
                     0.001       0.000       0.058       0.001

AWD_D                0.021      -0.068       0.185       0.218
                     0.823       0.461       0.043       0.017

FWD_D               -0.165      -0.027      -0.517      -0.280
                     0.071       0.771       0.000       0.002

RWD_D                0.108       0.015       0.364       0.098
                     0.239       0.874       0.000       0.288

SUV_L                0.487       0.181       0.996      -0.145
                     0.000       0.047       0.000       0.114

                     SUVwt       SUVtc        HPsq       AWD_D
SUVtc                0.998
                     0.000

HPsq                 0.198       0.200
                     0.030       0.028

AWD_D                0.184       0.174       0.040
                     0.044       0.057       0.667

FWD_D               -0.522      -0.526      -0.369      -0.366
                     0.000       0.000       0.000       0.000

RWD_D                0.367       0.374       0.347      -0.137
                     0.000       0.000       0.000       0.135

SUV_L                0.999       0.998       0.215       0.176
                     0.000       0.000       0.018       0.054

                     FWD_D       RWD_D
RWD_D               -0.810
                     0.000

SUV_L               -0.529       0.381
                     0.000       0.000

Cell Contents: Pearson correlation
               P-Value

PRESS
Assesses your model's predictive ability. In general, the smaller the prediction sum of squares (PRESS) value, the better the model's predictive ability. PRESS is used to calculate the predicted $R^2$. PRESS, similar to the error sum of squares (SSE), is the sum of squares of the prediction error. PRESS differs from SSE in that each fitted value, $\hat{y}_i$, for PRESS is obtained by deleting the ith observation from the data set, estimating the regression equation from the remaining n - 1 observations, then using the fitted regression function to obtain the predicted value for the ith observation.

Predicted $R^2$
Similar to $R^2$. Predicted $R^2$ indicates how well the model predicts responses for new observations, whereas $R^2$ indicates how well the model fits your data. Predicted $R^2$ can prevent overfitting the model and is more useful than adjusted $R^2$ for comparing models because it is calculated with observations not included in model calculation. Predicted $R^2$ is between 0 and 1 and is calculated from the PRESS statistic. Larger values of predicted $R^2$ suggest models of greater predictive ability.

So now it’s time to get serious. My first regression was based on what I had learned from the stepwise regression. The only one of the variables from the stepwise regression that I left out was FUEL_D.

1. Look at the results of Regression 1. But don’t forget what has gone before.
   a. What does the Analysis of variance show us? Why? (1)
   b. What suggests that the relation of MPG to one of the variables is nonlinear? (1)
   c. What does the equation suggest that the difference is between an extra inch on an SUV and a non-SUV? (1)
   d. Why did I leave out FUEL_D? (2)
   e. Which coefficients are not significant? Why? (2)
   f.
What do the values of the VIFs tell us? (2)

MTB > Regress 'MPG' 6 'Weight' 'SUV_D' 'SUV_L' 'Turning Circle' &
CONT> 'Horsepower' 'HPsq';
SUBC> Constant;
SUBC> Brief 2.

MTB > Regress 'MPG' 6 'Weight' 'SUV_D' 'SUV_L' 'Turning Circle' &
CONT> 'Horsepower' 'HPsq';
SUBC> GNormalplot;
SUBC> NoDGraphs;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Press;
SUBC> Brief 2.

Regression Analysis: MPG versus Weight, SUV_D, ... (Regression 1)

The regression equation is
MPG = 63.1 - 0.00303 Weight - 14.8 SUV_D + 0.0653 SUV_L - 0.264 Turning Circle
      - 0.213 Horsepower + 0.000522 HPsq

Predictor             Coef     SE Coef       T      P    VIF
Constant            63.105       3.978   15.86  0.000
Weight          -0.0030345   0.0006859   -4.42  0.000    5.6
SUV_D              -14.812       7.957   -1.86  0.065  282.1
SUV_L              0.06527     0.04478    1.46  0.148  307.9
Turning Circle     -0.2639      0.1050   -2.51  0.013    2.0
Horsepower        -0.21251     0.03575   -5.94  0.000   63.5
HPsq            0.00052249  0.00009459    5.52  0.000   61.3

S = 2.27485   R-Sq = 77.5%   R-Sq(adj) = 76.4%
PRESS = 752.906   R-Sq(pred) = 71.34%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        6  2037.34  339.56  65.62  0.000
Residual Error  114   589.95    5.17
Total           120  2627.29

Source          DF   Seq SS
Weight           1  1605.19
SUV_D            1    47.29
SUV_L            1   132.83
Turning Circle   1    52.31
Horsepower       1    41.83
HPsq             1   157.89

Unusual Observations
Obs  Weight     MPG     Fit  SE Fit  Residual  St Resid
 16    5590  13.000  15.361   1.137    -2.361     -1.20 X
 34    7270  10.000   6.856   1.461     3.144      1.80 X
 40    5590  13.000  15.361   1.137    -2.361     -1.20 X
 62    4065  19.000  14.633   0.654     4.367      2.00R
108    2150  38.000  30.489   0.632     7.511      3.44R
111    2750  41.000  33.473   1.133     7.527      3.82RX
114    2935  41.000  29.806   0.777    11.194      5.24R
115    2940  24.000  29.791   0.778    -5.791     -2.71R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

2. Look at the results of Regression 2. But don’t forget what has gone before.
   a. What variable did I drop? Why? (2)
   b. Are there any coefficients that have a sign that you would not expect? Why? (1)
   c.
A Chevrolet Suburban is an SUV with rear wheel drive and 285 horsepower, that takes Regular fuel, has a length of 219 inches, a width of 79 inches, a weight of 5590 pounds, a cargo volume of 77.0 square feet and a turning circle of 46 feet (!!! Maybe it was inches?). What miles per gallon does the equation predict? What would it be if the vehicle was not classified as an SUV? (3)
   d. Why do I like this regression better than the previous one? (2) [17]

MTB > Regress 'MPG' 5 'Weight' 'SUV_D' 'Turning Circle' 'Horsepower' &
CONT> 'HPsq';
SUBC> GNormalplot;
SUBC> NoDGraphs;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Press;
SUBC> Brief 2.

Regression Analysis: MPG versus Weight, SUV_D, ... (Regression 2)

The regression equation is
MPG = 63.1 - 0.00250 Weight - 3.25 SUV_D - 0.250 Turning Circle
      - 0.239 Horsepower + 0.000593 HPsq

Predictor             Coef     SE Coef       T      P    VIF
Constant            63.137       3.998   15.79  0.000
Weight          -0.0025020   0.0005834   -4.29  0.000    4.0
SUV_D              -3.2492      0.6272   -5.18  0.000    1.7
Turning Circle     -0.2501      0.1051   -2.38  0.019    1.9
Horsepower        -0.23928     0.03082   -7.76  0.000   46.7
HPsq            0.00059313  0.00008163    7.27  0.000   45.2

S = 2.28595   R-Sq = 77.1%   R-Sq(adj) = 76.1%
PRESS = 744.047   R-Sq(pred) = 71.68%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        5  2026.35  405.27  77.56  0.000
Residual Error  115   600.94    5.23
Total           120  2627.29

Source          DF   Seq SS
Weight           1  1605.19
SUV_D            1    47.29
Turning Circle   1    46.32
Horsepower       1    51.65
HPsq             1   275.90

Unusual Observations
Obs  Weight     MPG     Fit  SE Fit  Residual  St Resid
 16    5590  13.000  14.381   0.921    -1.381     -0.66 X
 34    7270  10.000   5.945   1.328     4.055      2.18RX
 40    5590  13.000  14.381   0.921    -1.381     -0.66 X
108    2150  38.000  30.081   0.570     7.919      3.58R
111    2750  41.000  33.910   1.098     7.090      3.54RX
114    2935  41.000  30.060   0.761    10.940      5.08R
115    2940  24.000  30.047   0.762    -6.047     -2.81R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
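The three R-squared figures that Minitab prints can be checked by hand from the Analysis of Variance table and the PRESS value. The sketch below does this for Regression 2; the function name and argument names are made up for illustration, and the formulas are the usual textbook ones (R-Sq = SSR/SST, the adjusted version corrects for degrees of freedom, and the predicted version replaces SSE with PRESS), which is presumably, though not certainly, what Minitab computes.

```python
def fit_summary(ss_regression, ss_error, press, n, k):
    """Return (R-Sq, R-Sq(adj), R-Sq(pred)) as fractions.

    n is the number of observations, k the number of independent variables.
    """
    ss_total = ss_regression + ss_error
    r_sq = ss_regression / ss_total
    # Adjusted R-squared compares mean squares instead of sums of squares.
    r_sq_adj = 1 - (ss_error / (n - k - 1)) / (ss_total / (n - 1))
    # Predicted R-squared uses PRESS, the leave-one-out analogue of SSE.
    r_sq_pred = 1 - press / ss_total
    return r_sq, r_sq_adj, r_sq_pred


# Numbers copied from the Regression 2 printout.
r_sq, r_sq_adj, r_sq_pred = fit_summary(
    ss_regression=2026.35, ss_error=600.94, press=744.047, n=121, k=5
)
print(f"R-Sq = {100 * r_sq:.1f}%  R-Sq(adj) = {100 * r_sq_adj:.1f}%  "
      f"R-Sq(pred) = {100 * r_sq_pred:.2f}%")
```

Because PRESS can never be smaller than SSE, the predicted figure always comes out at or below the ordinary R-Sq, which is why it is a more conservative yardstick for comparing regressions.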
Because I wanted to look at the effect of the three drive variables on MPG, I ran another stepwise regression. The first part of this is identical to the last stepwise regression, but after the 6th regression, I forced out SUV_L and forced in AWD_D, FWD_D and RWD_D. Because I had to make the regressions comparable, I threw an observation with an anomalous drive variable out and redid my two regressions as Regressions 3 and 4. I then added in all the drive variables as a package in Regression 5.

MTB > Stepwise 'MPG' 'Horsepower' 'Length' 'Width' 'Weight' 'Cargo Volume' &
CONT> 'Turning Circle' 'SUV_D' 'Fuel_D' 'SUVwt' 'HPsq' 'AWD_D' &
CONT> 'FWD_D' 'RWD_D' 'SUV_L';
SUBC> AEnter 0.15;
SUBC> ARemove 0.15;
SUBC> Best 0;
SUBC> Constant.

Stepwise Regression: MPG versus Horsepower, Length, ...

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is MPG on 14 predictors, with N = 119
N(cases with missing observations) = 2  N(all cases) = 121

Step                     1         2         3         4         5         6
Constant             38.31     36.75     41.59     50.06     50.15     59.00

Weight            -0.00491  -0.00436  -0.00578  -0.00495  -0.00424  -0.00339
T-Value             -15.34    -11.87    -12.82     -9.31     -6.74     -5.61
P-Value              0.000     0.000     0.000     0.000     0.000     0.000

SUV_D                          -1.72    -33.71    -35.29    -35.12    -18.68
T-Value                        -2.84     -4.99     -5.36     -5.40     -2.71
P-Value                        0.005     0.000     0.000     0.000     0.008

SUV_L                                    0.180     0.185     0.182     0.088
T-Value                                   4.75      5.04      5.01      2.26
P-Value                                  0.000     0.000     0.000     0.026

Turning Circle                                    -0.285    -0.292    -0.255
T-Value                                            -2.79     -2.90     -2.75
P-Value                                            0.006     0.004     0.007

Horsepower                                                  -0.0124   -0.1619
T-Value                                                       -2.01     -5.04
P-Value                                                       0.046     0.000

HPsq                                                                  0.00040
T-Value                                                                  4.73
P-Value                                                                 0.000

S                     2.50      2.43      2.23      2.17      2.14      1.96
R-Sq                 66.78     68.94     74.04     75.70     76.55     80.45
R-Sq(adj)            66.50     68.40     73.36     74.85     75.51     79.41
Mallows C-p           71.5      61.4      34.8      27.4      24.7       4.8

More? (Yes, No, Subcommand, or Help)
SUBC> remove 'SUV_L'.
Step                     7         8         9
Constant             59.15     59.00     58.50

Weight            -0.00267  -0.00339  -0.00342
T-Value              -5.10     -5.61     -5.74
P-Value              0.000     0.000     0.000

SUV_D                -3.13    -18.68    -18.95
T-Value              -5.51     -2.71     -2.79
P-Value              0.000     0.008     0.006

SUV_L                          0.088     0.090
T-Value                         2.26      2.36
P-Value                        0.026     0.020

Turning Circle      -0.236    -0.255    -0.210
T-Value              -2.51     -2.75     -2.24
P-Value              0.013     0.007     0.027

Horsepower          -0.199    -0.162    -0.175
T-Value              -7.09     -5.04     -5.43
P-Value              0.000     0.000     0.000

HPsq               0.00050   0.00040   0.00042
T-Value               6.75      4.73      5.03
P-Value              0.000     0.000     0.000

Fuel_D                                    0.92
T-Value                                   2.11
P-Value                                  0.037

S                     2.00      1.96      1.93
R-Sq                 79.56     80.45     81.21
R-Sq(adj)            78.66     79.41     80.02
Mallows C-p            7.8       4.8       2.5

More? (Yes, No, Subcommand, or Help)
SUBC> enter 'AWD_D' 'FWD_D' 'RWD_D'.

Step                    10        11        12        13
Constant             60.14     59.11     58.50     58.50

Weight            -0.00355  -0.00346  -0.00344  -0.00342
T-Value              -5.75     -5.72     -5.72     -5.74
P-Value              0.000     0.000     0.000     0.000

SUV_D                -19.5     -19.1     -18.8     -19.0
T-Value              -2.82     -2.77     -2.74     -2.79
P-Value              0.006     0.007     0.007     0.006

SUV_L                0.092     0.090     0.089     0.090
T-Value               2.37      2.32      2.30      2.36
P-Value              0.020     0.022     0.023     0.020

Turning Circle      -0.207    -0.205    -0.202    -0.210
T-Value              -2.10     -2.09     -2.07     -2.24
P-Value              0.038     0.039     0.041     0.027

Horsepower          -0.175    -0.177    -0.176    -0.175
T-Value              -5.33     -5.42     -5.41     -5.43
P-Value              0.000     0.000     0.000     0.000

HPsq               0.00042   0.00043   0.00042   0.00042
T-Value               4.98      5.04      5.02      5.03
P-Value              0.000     0.000     0.000     0.000

Fuel_D                0.73      0.80      0.87      0.92
T-Value               1.49      1.66      1.92      2.11
P-Value              0.139     0.099     0.057     0.037

AWD_D                 -1.1
T-Value              -0.76
P-Value              0.451

FWD_D                -1.36     -0.51     -0.17
T-Value              -0.98     -0.62     -0.32
P-Value              0.331     0.535     0.752

RWD_D                -1.23     -0.42
T-Value              -0.93     -0.55
P-Value              0.353     0.586

S                     1.95      1.95      1.94      1.93
R-Sq                 81.37     81.27     81.22     81.21
R-Sq(adj)            79.65     79.73     79.86     80.02
Mallows C-p            7.6       6.1       4.4       2.5

More? (Yes, No, Subcommand, or Help)
SUBC> no

Results for: 252x0504-41.MTW

MTB > WSave "C:\Documents and Settings\rbove\My Documents\Minitab\252x050441.MTW";
SUBC> Replace.
Saving file as: 'C:\Documents and Settings\rbove\My Documents\Minitab\252x0504-41.MTW'

MTB > erase c21
MTB > Regress 'MPG' 6 'Weight' 'SUV_D' 'SUV_L' 'Turning Circle' &
CONT> 'Horsepower' 'HPsq';
SUBC> GNormalplot;
SUBC> NoDGraphs;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Press;
SUBC> Brief 2.

Regression Analysis: MPG versus Weight, SUV_D, ... (Regression 3)

The regression equation is
MPG = 64.4 - 0.00284 Weight - 15.8 SUV_D + 0.0694 SUV_L - 0.305 Turning Circle
      - 0.214 Horsepower + 0.000524 HPsq

Predictor             Coef     SE Coef       T      P    VIF
Constant            64.364       3.973   16.20  0.000
Weight          -0.0028431   0.0006832   -4.16  0.000    5.7
SUV_D              -15.843       7.867   -2.01  0.046  276.4
SUV_L              0.06943     0.04423    1.57  0.119  301.7
Turning Circle     -0.3045      0.1055   -2.89  0.005    2.0
Horsepower        -0.21444     0.03528   -6.08  0.000   63.1
HPsq            0.00052386  0.00009332    5.61  0.000   61.0

S = 2.24427   R-Sq = 78.3%   R-Sq(adj) = 77.2%
PRESS = 725.963   R-Sq(pred) = 72.34%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        6  2055.21  342.54  68.01  0.000
Residual Error  113   569.15    5.04
Total           119  2624.37

Source          DF   Seq SS
Weight           1  1602.61
SUV_D            1    49.58
SUV_L            1   135.39
Turning Circle   1    61.04
Horsepower       1    47.88
HPsq             1   158.71

Unusual Observations
Obs  Weight     MPG     Fit  SE Fit  Residual  St Resid
 16    5590  13.000  15.259   1.123    -2.259     -1.16 X
 34    7270  10.000   6.907   1.442     3.093      1.80 X
 36    2715  24.000  28.432   0.493    -4.432     -2.02R
 40    5590  13.000  15.259   1.123    -2.259     -1.16 X
107    2150  38.000  30.543   0.624     7.457      3.46R
110    2750  41.000  33.747   1.126     7.253      3.74RX
113    2935  41.000  30.000   0.772    11.000      5.22R
114    2940  24.000  29.985   0.774    -5.985     -2.84R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

MTB > Regress 'MPG' 5 'Weight' 'SUV_D' 'Turning Circle' 'Horsepower' &
CONT> 'HPsq';
SUBC> GNormalplot;
SUBC> NoDGraphs;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Press;
SUBC> Brief 2.

Regression Analysis: MPG versus Weight, SUV_D, ...
(Regression 4)

The regression equation is
MPG = 64.4 - 0.00228 Weight - 3.53 SUV_D - 0.288 Turning Circle
      - 0.243 Horsepower + 0.000599 HPsq

Predictor             Coef     SE Coef       T      P    VIF
Constant            64.352       3.999   16.09  0.000
Weight          -0.0022848   0.0005871   -3.89  0.000    4.2
SUV_D              -3.5330      0.6366   -5.55  0.000    1.8
Turning Circle     -0.2884      0.1057   -2.73  0.007    2.0
Horsepower        -0.24278     0.03051   -7.96  0.000   46.6
HPsq            0.00059879  0.00008071    7.42  0.000   45.0

S = 2.25865   R-Sq = 77.8%   R-Sq(adj) = 76.9%
PRESS = 720.507   R-Sq(pred) = 72.55%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        5  2042.80  408.56  80.09  0.000
Residual Error  114   581.57    5.10
Total           119  2624.37

Source          DF   Seq SS
Weight           1  1602.61
SUV_D            1    49.58
Turning Circle   1    52.45
Horsepower       1    57.33
HPsq             1   280.82

Unusual Observations
Obs  Weight     MPG     Fit  SE Fit  Residual  St Resid
 16    5590  13.000  14.223   0.914    -1.223     -0.59 X
 34    7270  10.000   5.938   1.312     4.062      2.21RX
 40    5590  13.000  14.223   0.914    -1.223     -0.59 X
107    2150  38.000  30.108   0.563     7.892      3.61R
110    2750  41.000  34.201   1.095     6.799      3.44RX
113    2935  41.000  30.262   0.759    10.738      5.05R
114    2940  24.000  30.251   0.760    -6.251     -2.94R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

3. Look at the results of Regression 5 and Regression 4. But don’t forget what has gone before.
   a. Do an F test to see if Regression 5 is better than Regression 4. If you can, include the results from my forcing variables in the last stepwise regression. (6)
   b. Should I have included another dummy variable to represent 4-wheel drive? Why? (2)
   c. Are there any coefficients in Regression 5 that have a sign that you would not expect? Why? (1)
   d. A Chevrolet Suburban is an SUV with rear wheel drive and 285 horsepower, that takes Regular fuel, has a length of 219 inches, a width of 79 inches, a weight of 5590 pounds, a cargo volume of 77.0 square feet and a turning circle of 46 feet (!!! Maybe it was inches?). How do the predictions for MPG in Equations 2 and 4 differ in percentage terms? (3) [29]

MTB > Regress 'MPG' 8 'Weight' 'SUV_D' 'Turning Circle' 'Horsepower' &
CONT> 'HPsq' 'AWD_D' 'FWD_D' 'RWD_D';
SUBC> GNormalplot;
SUBC> NoDGraphs;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Press;
SUBC> Brief 2.

Regression Analysis: MPG versus Weight, SUV_D, ... (Regression 5)

The regression equation is
MPG = 66.4 - 0.00248 Weight - 3.83 SUV_D - 0.254 Turning Circle
      - 0.251 Horsepower + 0.000618 HPsq - 1.21 AWD_D - 2.10 FWD_D - 1.70 RWD_D

Predictor             Coef     SE Coef       T      P    VIF
Constant            66.435       4.400   15.10  0.000
Weight          -0.0024795   0.0006077   -4.08  0.000    4.4
SUV_D              -3.8302      0.6814   -5.62  0.000    2.0
Turning Circle     -0.2541      0.1116   -2.28  0.025    2.2
Horsepower        -0.25082     0.03122   -8.03  0.000   48.6
HPsq            0.00061833  0.00008244    7.50  0.000   46.7
AWD_D               -1.213       1.620   -0.75  0.455    3.4
FWD_D               -2.103       1.490   -1.41  0.161   11.2
RWD_D               -1.697       1.434   -1.18  0.239    8.6

S = 2.26416   R-Sq = 78.3%   R-Sq(adj) = 76.8%
PRESS = 727.840   R-Sq(pred) = 72.27%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        8  2055.33  256.92  50.12  0.000
Residual Error  111   569.03    5.13
Total           119  2624.37

Source          DF   Seq SS
Weight           1  1602.61
SUV_D            1    49.58
Turning Circle   1    52.45
Horsepower       1    57.33
HPsq             1   280.82
AWD_D            1     2.00
FWD_D            1     3.36
RWD_D            1     7.17

Unusual Observations
Obs  Weight     MPG     Fit  SE Fit  Residual  St Resid
 34    7270  10.000   5.609   1.377     4.391      2.44RX
 57    4735  14.000  13.622   1.447     0.378      0.22 X
 72    4720  15.000  15.901   1.374    -0.901     -0.50 X
107    2150  38.000  30.231   0.574     7.769      3.55R
109    5435  14.000  13.477   1.338     0.523      0.29 X
110    2750  41.000  34.346   1.106     6.654      3.37RX
113    2935  41.000  30.341   0.765    10.659      5.00R
114    2940  24.000  30.329   0.766    -6.329     -2.97R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

II.
Do at least 4 of the following 6 Problems (at least 15 each), or do sections adding to at least 60 points. (Anything extra you do helps, and grades wrap around.) Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests. That is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has 175 points, but 100 is considered a perfect score.

Exhibit 1. A tear-off copy of this exhibit appears at the end of the exam. An entrepreneur believes that her business is growing steadily and wants to compute a trend line for her output $Y$ against time $x_1 = T$. She also decides to repeat the regression after adding $x_2 = T^2$ as a second independent variable. Her data and results follow. The t statistics have been relabeled ‘t-ratio’ to prevent confusion with $T$.

Row      Y    T   T^2
  1  53.43    1     1
  2  59.09    2     4
  3  59.58    3     9
  4  64.75    4    16
  5  68.65    5    25
  6  65.53    6    36
  7  68.44    7    49
  8  70.93    8    64
  9  72.85    9    81
 10  73.60   10   100
 11  72.93   11   121
 12  75.14   12   144
 13  73.88   13   169
 14  76.55   14   196
 15  79.05   15   225

Regression Analysis: Y versus T

The regression equation is
Y = 56.7 + 1.54 T

Predictor    Coef  SE Coef  t-ratio      P
Constant   56.659    1.283    44.15  0.000
T          1.5377   0.1411    10.89  0.000

S = 2.36169   R-Sq = 90.1%   R-Sq(adj) = 89.4%

Regression Analysis: Y versus T, TSQ

The regression equation is
Y = 52.4 + 3.04 T - 0.0939 TSQ

Predictor      Coef  SE Coef  t-ratio      P
Constant     52.401    1.545    33.91  0.000
T            3.0405   0.4444     6.84  0.000
TSQ        -0.09392  0.02701    -3.48  0.005

S = 1.73483   R-Sq = 95.1%

If you need them, her means and spare parts are below.

$\bar{Y} = 68.9600$, $\bar{X}_1 = 8.00$, $\bar{X}_2 = 82.6667$
$\sum X_1^2 - n\bar{X}_1^2 = SSX_1 = 280$
$\sum X_2^2 - n\bar{X}_2^2 = SSX_2 = 75805.3$
$\sum Y^2 - n\bar{Y}^2 = SST = SSY = 734.556$
$\sum X_1 Y - n\bar{X}_1\bar{Y} = SX_1Y = 430.550$
$\sum X_2 Y - n\bar{X}_2\bar{Y} = SX_2Y = 6501.33$
$\sum X_1 X_2 - n\bar{X}_1\bar{X}_2 = SX_1X_2 = 4480.00$

1. Do the following using Exhibit 1.
a) Explain what numbers in the printout were used to compute the t-ratio 6.84, what table value you would compare it with to do a 2-sided 1% significance test and whether and why the coefficient is significant. (3)
b) The entrepreneur looked at the residual analysis of the first regression and decided that she needs a time squared term. What is she likely to have seen to cause her to make that decision? (1)
c) Use the values of $R^2$ to do an F test to see if the addition of the $T^2$ term makes a significant improvement in the regression. (4)
d) Get $\bar{R}^2$ ($R^2$ adjusted for degrees of freedom) for the second regression and explain what it seems to show. (2)
e) In the first regression, the Durbin-Watson statistic was 1.07 and for the second it was 1.94. What do these numbers indicate? (Do a significance test.) (5)
f) For the second regression, make a prediction of the output in the 16th year and use the suggestion in the outline to make it into a prediction interval. Why would a confidence interval be inappropriate here? (3) [15]

2. Do the following using Exhibit 1.
a) Compute the (Pearson’s) sample correlation between output and time and test it for significance. (5)
b) Test the hypothesis that the population correlation $\rho = 0.8$. (5)
c) Do Spearman’s rank correlation between output and time and test it for significance. Why is the rank correlation higher than Pearson’s? (6) [16, 31]

3. (Berenson et al.) The operations manager of a light bulb factory wants to determine if there is any difference between the average life expectancy of a light bulb manufactured by two different machines. A random sample of 25 light bulbs from machine 1 has a sample mean of 375 hours. A random sample of 25 light bulbs from machine 2 has a sample mean of 362 hours.
a) Test whether the mean lives of the bulbs differ at the 5% significance level.
Assume that 110 is the population standard deviation for machine 1 and that 125 is the population standard deviation for machine 2. Do not use a confidence interval. State your null hypothesis! (4)
b) Find a p-value for the null hypothesis in part a) and interpret it. Do not use the t-table. (3)
c) Test whether the mean lives of the bulbs differ at the 5% significance level. Assume that 110 is the sample standard deviation for machine 1 and that 125 is the sample standard deviation for machine 2. Do not use a confidence interval. State your null hypothesis! Make and state an assumption about the equality of the two standard deviations. (3 or 5)
d) Test the assumption about the standard deviations that you made in c). State your null hypothesis! (2)
e) Make the following confidence intervals.
   (i) A confidence interval for the difference between the means in a). (1)
   (ii) A confidence interval for the difference between the means in c). (2)
   (iii) A confidence interval for the ratio of the population variances in d). (2) [15, 46]

4. (Berenson et al.) A student team is investigating the size of the bubbles that can be blown with various brands of bubble gum. The data below is the total diameter in inches of the bubbles and is presented in two different ways. These Exhibits are repeated as a tear-off sheet at the end of the exam with the sums and sums of squares computed for you.

Exhibit 2: Size of Bubbles Blocked by Blower

                Brand 1  Brand 2  Brand 3  Brand 4
Row  Student        x1       x2       x3       x4
  1  Loopy        8.75      9.5      8.5     11.5
  2  Percival     9.50      4.0      8.5     11.0
  3  Poopsy       9.25      5.5      7.5      7.5
  4  Dizzy        9.50      8.5      7.5      7.5
  5  Booger       9.25      4.5      8.0      8.0

Exhibit 3: Size of Bubbles Blocked by Blower but ranked as four independent random samples.
Row  Student      x1    r1    x2    r2    x3    r3    x4    r4
  1  Loopy      8.75  13.0   9.5  17.0   8.5  11.0  11.5  20.0
  2  Percival   9.50  17.0   4.0   1.0   8.5  11.0  11.0  19.0
  3  Poopsy     9.25  14.5   5.5   3.0   7.5   5.5   7.5   5.5
  4  Dizzy      9.50  17.0   8.5  11.0   7.5   5.5   7.5   5.5
  5  Booger     9.25  14.5   4.5   2.0   8.0   8.5   8.0   8.5

Compare the data in 3 different ways.
a) Do only one of the following.
   (i) Consider the data random samples from Normally distributed populations and compare means. (6)
   (ii) Consider the data blocked data from Normally distributed populations and compare means. (8)
b) Consider the data random samples from non-Normally distributed populations and compare medians. (5)
c) Consider the data blocked data from non-Normally distributed populations and compare medians. (5) [24, 70]

5. (Berenson et al.) The time it takes to design and launch a marketing campaign is called a cycle time. Marketing campaigns are classified by cycle time (in months) and effectiveness. Don’t even think of answering any part of this question without doing a statistical test!

                             Duration
Effectiveness    < 1 mo.  1-2 mo.  2-4 mo.  > 4 mo.  Total
Very Effective        15       28       24        6     73
Effective              9       26       33       19     87
Ineffective            5        2        3        5     15
Total                 29       56       60       30    175

a. Test to see if the proportion in the various effectiveness categories is related to cycle time. (8)
b. Of the campaigns that took 0-2 months, 7 were ineffective. Of the campaigns that took more than two months, 8 were ineffective. Is the fraction that were ineffective in the first category below the fraction in the second category? (5)
c. Test the hypothesis that 45% of campaigns are very effective. (4)
d. As you know, a Jorcillator has two components, a Phillinx and a Flubberall. We recorded the order in which they were replaced over the last year to see if there was a pattern or the replacement sequence was just random. We got PPPFFFPPPFFPPFFPPFFF. Test it! (3)
e. (Anderson et al.)
The number of emergency calls our Fire department receives is believed to have a Poisson distribution with a parameter of 3. Test this against data for a period of 120 days: 0 calls on 9 days, 1 call on 12 days, 2 calls on 30 days, 3 calls on 27 days, 4 calls on 22 days, 5 calls on 13 days and 7 calls on 6 days. (5) [25, 95]

6. Test to see if the price of new homes rose between 2001 and 2002. The following data represents a random sample of typical prices in thousands in 10 zip codes in 2001 and 2002.

                          2001      2002     change
Row  Location               x1        x2   d = x1 - x2
  1  Alexandria        245.795   293.266    -47.471
  2  Boston            391.750   408.803    -17.053
  3  Decatur           205.270   227.561    -22.291
  4  Kirkland          326.524   333.569     -7.045
  5  New York          545.363   531.098     14.265
  6  Philadelphia      185.736   197.874    -12.138
  7  Phoenix           170.413   175.030     -4.617
  8  Raleigh           210.015   196.094     13.921
  9  San Bruno         385.387   391.409     -6.022
 10  Tampa             194.205   199.858     -5.653

Some of the following data may be of use to you.
$\sum x_1 = 2860.458$, $\sum x_1^2 = 953941.216$, $\sum x_2 = 2954.562$, $\sum x_2^2 = 999628.915$, $\sum d = -94.104$, $\sum d^2 = 3724.975$

If you want to receive full credit, you must clearly label each section that you do!
a) Remember that the data is cross classified. Assume that the underlying distribution is not Normal and compare medians. (5)
b) Remember that the data is cross classified. Assume that the underlying distribution is Normal and compare means. (4)
c) Forget that the data is cross classified. Assume instead that it represents two random samples from Chester County, one for each year, and that the underlying distribution is not Normal. Compare medians. (6) [15, 110]

ECO252 QBA2 Final EXAM, May 2-6, 2005
TAKE HOME SECTION
Name: _________________________
Student Number: _________________________
Class days and time: _________________________

1) Please Note: computer problems 2, 3 and 4 should be turned in with the exam (2). In problem 2, the 2 way ANOVA table should be checked.
The three F tests should be done with a 5% significance level and you should note whether there was (i) a significant difference between drivers, (ii) a significant difference between cars and (iii) significant interaction. In problem 3, you should show on your third graph where the regression line is. Check what your text says about normal probability plots and analyze the plot you did. Explain the results of the t and F tests using a 5% significance level. (2)

2) 4th computer problem (4+) This is an internet project. You are trying to answer the question, ‘how well does manufacturing explain differences in income?’ You should use some measure of income per person or family in each state as your dependent variable and try to explain it as a function of (to start with) percent of output or labor force in manufacturing. This should start out as a simple regression. Then you should try to see whether there are other variables that explain the differences as well. One possibility is the per cent of the adult population with college or high school diplomas. Possible sources of data are below, but think about what you use, and try to find some other sources. Total income of a state, for example, is a very poor choice compared with some per capita measure, because it is simply going to be high for places with a lot of people without indicating how well off they are. Similarly, the fraction of the workforce with a certain education level is far better than the number. For instructions on how to do a regression, try the material in Doing a Regression.

http://www.nam.org/s_nam/sec.asp?CID=5&DID=3 Manufacturing share in state economies (http://www.nam.org/Docs/IEA/26767_20002001ManufacturingShareandChangeinStateEconomies.pdf?DocTypeID=9&TrackID=&Param=@CategoryID=1156@TPT=2002-2001+Manufacturing+Share+and+Change+in+State+Economics)
http://www.nemw.org/data.htm Per capita income by state.
http://www.nemw.org/data.htm State personal income per capita.
http://www.bea.doc.gov/bea/regional/data.htm Personal income per capita by state.
http://www.census.gov/statab/www/ Many state statistics, including persons with bachelor’s degrees.
http://www.epinet.org/content.cfm/datazone_index Income inequality, median income, unemployment rates.

Anyway, your job is to add whatever variable you think ought to explain your income measure. Consider all 50 states your sample. Your report should tell what numbers you used, from where and from what years. Which coefficients were significant, and do you think on the basis of your results that manufacturing is an important predictor of a state’s prosperity? Mark all significant F and t coefficients using a 5% significance level. Explain VIFs. Of course, if you don’t like this assignment, get approval to research something else on the internet. For example, does the percent of the population in prison affect the crime rate (maybe with a few years’ lag)? Or are there better predictors? And get out the Durbin-Watson statistic: prison vs. crime rate is a time series project. [8]

3) Hotshot Associates is afraid of sex discrimination charges and collects the data below. The dependent variable is income in thousands of dollars and the two independent variables are education in years and a dummy variable indicating sex (1 means a female). The lines in the middle are missing because the totals are reliable and you don’t need them. The only thing that is missing is you. Add yourself to the sample as a 21st observation with 12 years of education and an income of 100.0 (thousand) plus the last two digits of your student number as hundreds. For example, Roland Dough’s student number is 123689, so he adds $8900 to $100000 to get 108900, which he records as 108.9.
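The personalization rule above can be sketched in a few lines of Python. This is only an illustration: the function name `personalized_income` is hypothetical, and the sketch assumes the student number is an integer whose last two digits are read as hundreds of dollars, as in the Roland Dough example.

```python
def personalized_income(student_number: int) -> float:
    """Income (in thousands) for the added 21st observation.

    Assumed rule, taken from the worked example in the text:
    100.0 plus the last two digits of the student number, read as hundreds.
    """
    last_two = student_number % 100           # e.g. 123689 -> 89
    return round(100.0 + last_two / 10, 1)    # 89 hundreds = 8.9 thousand


# Roland Dough's student number from the text
print(personalized_income(123689))
```

The 21st observation then pairs this income with 12 years of education, as the problem states.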
Row    y       x1     x2    x1^2   x2^2   y^2        x1*y      x2*y    x1*x2
       INC     EDUC   SEX
  1    39.0     2      0       4      0    1521.00     78.0      0.0      0
  2    43.7     4      1      16      1    1909.69    174.8     43.7      4
  3    62.6     8      0      64      0    3918.76    500.8      0.0      0
  4    42.8     8      1      64      1    1831.84    342.4     42.8      8
  5    55.0     8      0      64      0    3025.00    440.0      0.0      0
  ...
 17    72.9    16      0     256      0    5314.41   1166.4      0.0      0
 18    56.1    16      1     256      1    3147.21    897.6     56.1     16
 19    67.1    17      0     289      0    4502.41   1140.7      0.0      0
 20    82.3    21      0     441      0    6773.29   1728.3      0.0      0
Sum  1168.5   241      7    3285      7   70091.67  14783.9    370.6     81

a. Compute the regression equation $\hat{Y} = b_0 + b_1 x_1$ to predict salaries on the basis of education only. (2)
b. Compute $R^2$. (2)
c. Compute $s_e$. (2)
d. Compute $s_{b_1}$ and do a significance test on $b_1$. (1.5)
e. Compute $s_{b_0}$ and do a confidence interval for $b_0$. (1.5)
f. You are about to hire your nephew for the summer and want to know how much to pay him. He has 14 years of education. Using this, create a prediction interval for his salary. Explain why a confidence interval for the salary is inappropriate. (3)
g. Do an ANOVA table for the regression. What conclusion can you draw from the hypothesis test in the ANOVA? (2) [22]
Extra credit from here on.
h. Do a multiple regression of income against education and sex. (5)
i. Compute R-squared and R-squared adjusted for degrees of freedom for this regression and compare them with the values for the previous problem. (2)
j. Using either the R-squared values or SST, SSR and SSE, do F tests (ANOVA). First check the usefulness of the simple regression and then the value of ‘sex’ as an improvement to the regression. How should this impact Hotshot Associates’ discrimination problem? (Don’t say a word without referring to a statistical test.) (3)
k. Predict what you will pay your nephew now. How much change is there from your last prediction? (2)

4) An airport authority wants to compare training of air traffic controllers at three locations. Data is on the next page. To personalize these data, add the last two digits of your student number as a 9th number to column C.
a.
Compare the performance of locations A, B, and C assuming that the underlying distribution is non-Normal. (4) [26]
b. Use a one-way ANOVA to test the hypothesis of equal means. (5) It is legitimate to check your results by computer, but I expect to see hand computations every step of the way. [31]
c. (Extra Credit) Decide between the methods that you used in a) and b). To do this, test for equal variances and for Normality on the computer. What is your decision? Why? (4)

You can do most of this with the following commands in Minitab if you put your data in 3 columns of Minitab with A, B, and C above them.

MTB > AOVOneway A B C       #Does a 1-way ANOVA
MTB > stack A B C c11;      #Stacks the data in c11, col.no. in c12.
SUBC> subscripts c12;
SUBC> UseNames.
MTB > rank c11 c13          #Puts the ranks of the stacked data in c13
MTB > vartest c11 c12       #Does a bunch of tests, including Levene’s,
                            # on stacked data in c11 with IDs in c12.
MTB > Unstack (c13);        #Unstacks the ranks in the next 5 available
SUBC> Subscripts c12;       # columns. Uses IDs in c12.
SUBC> After;
SUBC> VarNames.
MTB > NormTest 'A';         #Does a test (apparently Lilliefors) for Normality
SUBC> KSTest.               # on column A.

Data for Problem 4
Row    A    B    C
  1   96   65   60
  2   82   74   73
  3   88   72   85
  4   70   66   61
  5   90   79   79
  6   91   82   85
  7   87   73   88
  8   88        79

This might help.
MTB > sum c1
Sum of A
Sum of A = 692
MTB > ssq c1
Sum of Squares of A
Sum of squares (uncorrected) of A = 60278
MTB > sum c2
Sum of B
Sum of B = 511
MTB > ssq c2
Sum of Squares of B
Sum of squares (uncorrected) of B = 37535

Name:_______________________
Exhibit 1. An entrepreneur believes that her business is growing steadily and wants to compute a trend line for her output $Y$ against time $x_1 = T$. She also decides to repeat the regression after adding $x_2 = T^2$ as a second independent variable. Her data and results follow. The t statistics have been relabeled ‘t-ratio’ to prevent confusion with $T$.
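As a rough cross-check on the first regression in this exhibit, the slope and intercept of the trend line can be recomputed from deviation sums. This is a minimal Python sketch, not the Minitab procedure; the Y and T values are copied from the exhibit's data listing and the variable names are illustrative.

```python
# Exhibit 1 data, copied from the exhibit's listing: output Y at T = 1..15
Y = [53.43, 59.09, 59.58, 64.75, 68.65, 65.53, 68.44, 70.93,
     72.85, 73.60, 72.93, 75.14, 73.88, 76.55, 79.05]
T = list(range(1, 16))
n = len(Y)

mean_T = sum(T) / n
mean_Y = sum(Y) / n

# Deviation sums:
#   SSX1 = sum(T^2) - n * mean_T^2
#   SX1Y = sum(T*Y) - n * mean_T * mean_Y
ss_x = sum(t * t for t in T) - n * mean_T ** 2
s_xy = sum(t * y for t, y in zip(T, Y)) - n * mean_T * mean_Y

b1 = s_xy / ss_x            # slope of the trend line
b0 = mean_Y - b1 * mean_T   # intercept

print(f"Y = {b0:.3f} + {b1:.4f} T")
```

The two coefficients should agree, up to rounding, with the Constant and T rows of the Minitab output for the regression of Y on T.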
Row      Y       T    TSQ
  1    53.43     1      1
  2    59.09     2      4
  3    59.58     3      9
  4    64.75     4     16
  5    68.65     5     25
  6    65.53     6     36
  7    68.44     7     49
  8    70.93     8     64
  9    72.85     9     81
 10    73.60    10    100
 11    72.93    11    121
 12    75.14    12    144
 13    73.88    13    169
 14    76.55    14    196
 15    79.05    15    225

Regression Analysis: Y versus T

The regression equation is
Y = 56.7 + 1.54 T

Predictor    Coef      SE Coef   t-ratio   P
Constant     56.659    1.283     44.15     0.000
T            1.5377    0.1411    10.89     0.000

S = 2.36169   R-Sq = 90.1%   R-Sq(adj) = 89.4%

Regression Analysis: Y versus T, TSQ

The regression equation is
Y = 52.4 + 3.04 T - 0.0939 TSQ

Predictor    Coef       SE Coef   t-ratio   P
Constant     52.401     1.545     33.91     0.000
T            3.0405     0.4444     6.84     0.000
TSQ          -0.09392   0.02701   -3.48     0.005

S = 1.73483   R-Sq = 95.1%

If you need them, her means and spare parts are below.
$\bar{Y} = 68.9600$, $\bar{X}_1 = 8.00$, $\bar{X}_2 = 82.6667$
$\sum X_1^2 - n\bar{X}_1^2 = SSX_1 = 280$
$\sum X_2^2 - n\bar{X}_2^2 = SSX_2 = 75805.3$
$\sum Y^2 - n\bar{Y}^2 = SST = SSY = 734.556$
$\sum X_1 Y - n\bar{X}_1\bar{Y} = SX_1Y = 430.550$
$\sum X_2 Y - n\bar{X}_2\bar{Y} = SX_2Y = 6501.33$
$\sum X_1 X_2 - n\bar{X}_1\bar{X}_2 = SX_1X_2 = 4480.00$

Name:_______________________
Exhibit 2: Size of Bubbles Blocked by Blower

Row  Student     Brand 1 (x1)   Brand 2 (x2)   Brand 3 (x3)   Brand 4 (x4)
  1  Loopy           8.75           9.5            8.5           11.5
  2  Percival        9.50           4.0            8.5           11.0
  3  Poopsy          9.25           5.5            7.5            7.5
  4  Dizzy           9.50           8.5            7.5            7.5
  5  Booger          9.25           4.5            8.0            8.0

Column sums and sums of squares are as follows.
$\sum x_1 = 46.25$, $\sum x_1^2 = 428.188$, $\sum x_2 = 32$, $\sum x_2^2 = 229$,
$\sum x_3 = 40$, $\sum x_3^2 = 321$, $\sum x_4 = 45.5$, $\sum x_4^2 = 429.75$

Exhibit 3: Size of Bubbles Blocked by Blower but ranked as four independent random samples.

Row  Student      x1     r1      x2     r2      x3     r3      x4     r4
  1  Loopy        8.75   13.0    9.5    17      8.5    11.0    11.5   20.0
  2  Percival     9.50   17.0    4.0     1      8.5    11.0    11.0   19.0
  3  Poopsy       9.25   14.5    5.5     3      7.5     5.5     7.5    5.5
  4  Dizzy        9.50   17.0    8.5    11      7.5     5.5     7.5    5.5
  5  Booger       9.25   14.5    4.5     2      8.0     8.5     8.0    8.5

Row sums and sums of squares are as follows.
$\sum x_{1\cdot} = 38.25$, $\sum x_{1\cdot}^2 = 371.313$
$\sum x_{2\cdot} = 33.00$, $\sum x_{2\cdot}^2 = 299.500$
$\sum x_{3\cdot} = 29.75$, $\sum x_{3\cdot}^2 = 228.313$
$\sum x_{4\cdot} = 33.00$, $\sum x_{4\cdot}^2 = 275.000$
$\sum x_{5\cdot} = 29.75$, $\sum x_{5\cdot}^2 = 233.813$
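The ranks in Exhibit 3 can be reproduced by pooling the 20 bubble sizes from Exhibit 2 and giving tied values their average rank. The sketch below does this in plain Python; the helper `average_ranks` is a hypothetical name. The last step computes a Kruskal-Wallis H statistic without a tie correction, on the assumption that this is the test the "four independent random samples" wording points to.

```python
# Exhibit 2 bubble sizes by brand (rows: Loopy, Percival, Poopsy, Dizzy, Booger)
brands = {
    "Brand 1": [8.75, 9.50, 9.25, 9.50, 9.25],
    "Brand 2": [9.5, 4.0, 5.5, 8.5, 4.5],
    "Brand 3": [8.5, 8.5, 7.5, 7.5, 8.0],
    "Brand 4": [11.5, 11.0, 7.5, 7.5, 8.0],
}


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1           # rank of the first tied value
        count = ordered.count(v)               # how many values are tied
        rank_of[v] = first + (count - 1) / 2   # mean of first..first+count-1
    return [rank_of[v] for v in values]


pooled = [x for col in brands.values() for x in col]
ranks = average_ranks(pooled)

# Split the pooled ranks back into brands and sum them
rank_sums = {}
start = 0
for name, col in brands.items():
    rank_sums[name] = sum(ranks[start:start + len(col)])
    start += len(col)

# Kruskal-Wallis H, uncorrected for ties (assumed to be the intended test)
n = len(pooled)
H = (12 / (n * (n + 1))
     * sum(r ** 2 / len(brands[k]) for k, r in rank_sums.items())
     - 3 * (n + 1))

print(rank_sums)
print(round(H, 3))
```

The per-brand rank sums can be compared against the r1 to r4 columns of Exhibit 3; H would then be referred to a chi-squared table with 3 degrees of freedom.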