Midterm Exam
Stat 530
Spring 2007

Name ______________________________

Exam Rules: This exam is closed book, closed notes. However, you are permitted to have one cheat sheet (regular-sized paper, with writing on both sides). Use of a calculator is recommended.

When answering the problems below, make your methods clear. You may make use of any of the Minitab output that you see on the exam. You do not need to redo calculations if they appear on the output. There are a total of 40 points on the exam.

When performing a hypothesis test, use a level of α = 0.05. If using a p-value to make the decision for a test, reject if the p-value equals α. If in need of a t-multiplier or critical value, use 2, but clearly state the degrees of freedom. If in need of an F critical value, use 3, but clearly state the degrees of freedom for the test.

1. Data were collected on economic expenditures and several other variables in 48 U.S. states (Alaska and Hawaii are excluded). The data were analyzed in an attempt to construct a model describing expenditures (Expen) as a function of

   Ecab   - economic ability index, created from a variety of other variables
   Met    - percentage of population living in metropolitan areas
   Met^2  - Met squared
   Grow   - percent change in population, 1950 - 1960
   Young  - percent of population aged 5 - 19 years
   Old    - percent of population aged more than 65 years
   West   - an indicator of western states

[Figure: Matrix Plot of Expen, Ecab, Met]

a. (3 pts) Assume that we will fit the model Expen = Ecab. Circle the case with the highest leverage on the plot of Expen vs. Ecab. Circle the same point on the plots of Expen vs. Met and Ecab vs. Met.

b. (2 pts) These data were run through a stepwise selection procedure (see the output on the Data Sheet). Describe the final model resulting from the stepwise procedure in short-hand notation.

c. (3 pts) If you were to remove one variable from the model in part b, which variable would you remove? Why? Make your decision on the basis of the limited output available to you.

d. (2 pts) The plot below is a normal probability plot of the standardized residuals after fitting a particular regression model to these data. Do the residuals appear to be normally distributed? If not, briefly describe the departure(s) from normality.

[Figure: Probability Plot of SRES1, Normal - 95% CI. Axes: Percent vs. SRES1. Legend: Mean -0.01778, StDev 1.058, N 48, AD 0.269, P-Value 0.666]

2. Forbes collected data relating the boiling point of water to the logarithm of barometric pressure. Another investigator, Hooker, did the same. In this problem, we seek to create a model that encompasses both investigators' data. Forbes' apparent outlier was removed before running the data through Minitab. The variables that are under consideration are:

   Temp      the boiling point of water, measured in degrees Fahrenheit
   Temp^2    the square of Temp
   LPres     100 times the logarithm (base 10) of barometric pressure, in inches of mercury
   Hooker    0 if the case was collected by Forbes, 1 if collected by Hooker
   H * Temp  the product of Hooker and Temp

The Data Sheet contains a variety of Minitab output that will be useful for this problem. Some of the output has been trimmed; for example, the sequential sums of squares and unusual observations portions of the output have been deleted from most of the regressions.

a. (5 pts) Is the model of separate regression lines for Hooker and Forbes useful for predicting LPres? Perform a formal F-test for model utility at the 0.05 level. Clearly state your null and alternative hypotheses, state the degrees of freedom for the test, and reach a conclusion.

b. (7 pts) You have output for several models on the Data Sheet.
Among the models which can be described as "a single regression line", "parallel regression lines" for the two investigators, and "separate regression lines" for the two investigators, which seems to be most appropriate? Base your choice on a set of hypothesis tests. Make your method clear.

c. (3 pts) Do Forbes' and Hooker's data appear to follow a single multiple linear regression model? Briefly comment on the assumptions of linearity and constant variance. If they do not follow a single model, how would you modify the model to better fit the data?

d. (5 pts) Perform a formal F-test for lack of fit of the model LPres = Temp. Clearly state your null and alternative hypotheses, and make your method clear.

e. (3 pts) Use the Minitab output on the Data Sheet to compare the model LPres = Temp and the model LPres = Temp + Temp^2. Do the data suggest the need for a quadratic term in the model? If your answer to this part seems to differ from that in part d, how do you reconcile your two answers?

f. (5 pts) Using the model LPres = Temp, form a 95% confidence interval for the median barometric pressure when the boiling point of water is 186.0 degrees.
   Note: Case 21 has a boiling point of water of 186.0 degrees.
   Note: This question is about barometric pressure, not 100 * log(base 10) barometric pressure!

g. (2 pts) It is possible that Hooker's and Forbes' thermometers were calibrated differently. Suppose that miscalibration, if it exists, is additive. For example, Forbes' thermometer might always measure 0.2 degrees warmer than Hooker's thermometer. Translate this notion of miscalibration into a formal hypothesis that would generalize the model LPres = Temp. State your hypothesis in terms of a well-defined parameter. Use the output from the Data Sheet to test your hypothesis.


DATA SHEET

Problem 1.

Stepwise Regression: Expen versus Ecab, Met, ...
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15

Response is Expen on 7 predictors, with N = 48

Step               1        2        3
Constant       119.1    143.3    174.3

Ecab            1.40     1.37     1.41
T-Value         3.65     4.80     4.98
P-Value        0.001    0.000    0.000

Met            -3.04    -3.04    -3.00
T-Value        -4.01    -4.06    -4.01
P-Value        0.000    0.000    0.000

Grow            0.70     0.69     0.52
T-Value         1.83     1.90     1.60
P-Value        0.074    0.064    0.116

Young            0.6
T-Value         0.09
P-Value        0.931

Old              4.1      3.6
T-Value         0.63     1.04
P-Value        0.534    0.304

West              34       34       35
T-Value         2.78     3.08     3.12
P-Value        0.008    0.004    0.003

Met^2         0.0309   0.0307   0.0305
T-Value         3.45     3.64     3.62
P-Value        0.001    0.001    0.001

S               35.4     35.0     35.0
R-Sq           69.13    69.12    68.31
R-Sq(adj)      63.73    64.61    64.54
Mallows C-p      8.0      6.0      5.1


Problem 2.

Descriptive Statistics: LPres

Variable  Hooker   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median
LPres     0       16   1  139.43     1.32   5.29   131.79  135.77  138.02
          1       31   0  129.44     1.42   7.93   118.68  122.74  128.75

Variable  Hooker      Q3  Maximum
LPres     0       145.19   147.80
          1       134.10   146.55


Regression Analysis: LPres versus Temp

The regression equation is
LPres = - 43.8 + 0.903 Temp

Predictor      Coef   SE Coef       T      P
Constant   -43.8105    0.9251  -47.36  0.000
Temp       0.903329  0.004725  191.18  0.000

S = 0.302950   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance

Source          DF      SS      MS         F      P
Regression       1  3354.4  3354.4  36548.96  0.000
Residual Error  45     4.1     0.1
Total           46  3358.6

Unusual Observations

Obs  Temp    LPres      Fit  SE Fit  Residual  St Resid
 21   186  123.606  124.209   0.063    -0.603    -2.03R
 22   186  123.203  123.847   0.065    -0.644    -2.18R
 31   181  118.684  119.331   0.083    -0.646    -2.22R

R denotes an observation with a large standardized residual.
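As an arithmetic check on how the columns of the LPres versus Temp output above relate to one another, the short sketch below recomputes the T value for Temp as Coef divided by SE Coef and compares its square with the reported F. The variable names are illustrative and not part of the Minitab output; the numbers are copied from the output, so small rounding differences are expected.

```python
# Values copied from the "LPres versus Temp" regression output.
coef_temp = 0.903329      # Coef for Temp
se_coef_temp = 0.004725   # SE Coef for Temp
f_reported = 36548.96     # F from the Analysis of Variance table

# T is the coefficient divided by its standard error.
t_temp = coef_temp / se_coef_temp

# With a single predictor, the model-utility F equals the square of the slope's T
# (up to rounding in the printed output).
f_from_t = t_temp ** 2

print(f"T for Temp: {t_temp:.2f}")
print(f"T squared:  {f_from_t:.1f} (reported F: {f_reported})")
```

The two versions of F agree only approximately because SE Coef is printed to four significant digits.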
Regression Analysis: LPres versus Temp, Hooker

The regression equation is
LPres = - 43.8 + 0.904 Temp + 0.006 Hooker

Predictor      Coef   SE Coef       T      P
Constant    -43.849     1.173  -37.38  0.000
Temp       0.903506  0.005770  156.59  0.000
Hooker       0.0063    0.1139    0.05  0.956

S = 0.306363   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance

Source          DF      SS      MS         F      P
Regression       2  3354.4  1677.2  17869.60  0.000
Residual Error  44     4.1     0.1
Total           46  3358.6


Problem 2, continued

Regression Analysis: LPres versus Temp, H * Temp

The regression equation is
LPres = - 43.9 + 0.904 Temp + 0.000053 H * Temp

Predictor       Coef    SE Coef       T      P
Constant     -43.868      1.119  -39.21  0.000
Temp        0.903590   0.005523  163.59  0.000
H * Temp   0.0000535  0.0005667    0.09  0.925

S = 0.306343   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance

Source          DF      SS      MS         F      P
Regression       2  3354.4  1677.2  17872.00  0.000
Residual Error  44     4.1     0.1
Total           46  3358.6


Regression Analysis: LPres versus Temp, Hooker, H * Temp

The regression equation is
LPres = - 41.3 + 0.891 Temp - 3.06 Hooker + 0.0153 H * Temp

Predictor     Coef  SE Coef       T      P
Constant   -41.335    2.704  -15.29  0.000
Temp       0.89111  0.01332   66.88  0.000
Hooker      -3.056    2.970   -1.03  0.309
H * Temp   0.01525  0.01478    1.03  0.308

S = 0.306137   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance

Source          DF      SS      MS         F      P
Regression       3  3354.5  1118.2  11931.04  0.000
Residual Error  43     4.0     0.1
Total           46  3358.6


Regression Analysis: LPres versus Temp, Temp^2

The regression equation is
LPres = - 82.3 + 1.30 Temp - 0.00100 Temp^2

Predictor        Coef    SE Coef      T      P
Constant       -82.35      19.17  -4.29  0.000
Temp           1.2973     0.1959   6.62  0.000
Temp^2     -0.0010047  0.0004993  -2.01  0.050

S = 0.293182   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance

Source          DF      SS      MS         F      P
Regression       2  3354.8  1677.4  19514.47  0.000
Residual Error  44     3.8     0.1
Total           46  3358.6


One-way ANOVA: LPres versus Temp

Source  DF        SS      MS       F      P
Temp    44  3358.369  76.327  833.50  0.001
Error    2     0.183   0.092
Total   46  3358.552

S = 0.3026   R-Sq = 99.99%   R-Sq(adj) = 99.87%


[Figure: Scatterplot of SRES7 vs FITS7, with points marked by Hooker (0, 1). Standardized residuals vs. fits from the model LPres = Temp.]
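Because the printed residual sums of squares on this Data Sheet are rounded to one decimal place, it can help to see how they are recovered from S and then used to compare nested models. The sketch below is illustrative only: the function names sse_from_s and nested_f are hypothetical, and the extra-sum-of-squares F statistic shown is the standard textbook form, not a method prescribed by the exam. It uses S and the error degrees of freedom from the LPres = Temp and LPres = Temp + Temp^2 outputs.

```python
def sse_from_s(s: float, df_error: int) -> float:
    """Recover the residual sum of squares from Minitab's S, using S^2 = SSE / DF."""
    return s ** 2 * df_error


def nested_f(sse_reduced: float, df_reduced: int,
             sse_full: float, df_full: int) -> float:
    """Extra-sum-of-squares F statistic for a reduced model nested in a full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# S and residual-error DF copied from the Data Sheet output.
sse_linear = sse_from_s(0.302950, 45)      # model LPres = Temp
sse_quadratic = sse_from_s(0.293182, 44)   # model LPres = Temp + Temp^2

f_stat = nested_f(sse_linear, 45, sse_quadratic, 44)
print(f"F on 1 and 44 df: {f_stat:.2f}")
```

When the full model adds a single term, this F should match the square of that term's T value in the full-model output, up to rounding.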