Statistics 401B
December 14, 2004
Final Exam

Name:

INSTRUCTIONS: Read the questions carefully and completely. Answer each question and show work in the space provided. Partial credit will not be given if work is not shown. Use the JMP output when appropriate. When asked to explain, describe, or comment, do so within the context of the problem. Be sure to include units where appropriate.

1. [55 pts] A random sample of 25 cities from the northeastern U.S. and a second random sample of 25 cities from the midwestern U.S. are selected. For each city the population and the number of auto thefts are recorded. Refer to the JMP output “Analysis of auto thefts in two regions of the U.S.”

   (a) [10] Consider the simple linear model that has auto thefts as the response variable and population as the explanatory variable.

       i. [3] Give the prediction equation.

       ii. [3] Give an interpretation of the estimated slope coefficient. It may be easier to consider a change in population of 1000.

       iii. [1] How much of the variation in auto thefts is explained by the linear relationship with population?

       iv. [3] What does the plot of residuals by population tell you about the fit of the simple linear model? Explain briefly.

   (b) [6] Look at the quadratic model that has auto thefts as the response variable and population as the explanatory variable.

       i. [2] What is the difference in explained variation between the quadratic model and the simple linear model?

       ii. [4] Is this difference in explained variation statistically significant? Support your answer.

   (c) [19] The models in (a) and (b) do not take into account the region of the country. The dummy variable Region = 0 if the city is in the Northeast and Region = 1 if the city is in the Midwest. Refer to the model that has Population, Region and Population*Region as explanatory variables.

       i. [3] Give the prediction equation.

       ii. [3] Why doesn’t the intercept have a sensible interpretation within the context of the problem?

       iii. [4] Give an interpretation of the estimated slope for Population. Again, it may be easier to consider a change in population of 1000.

       iv. [5] Give an interpretation of the estimated slope for Population*Region.

       v. [4] Is the interaction term statistically significant? Support your answer.

   (d) [20] Various residuals and measures of leverage and influence are obtained for the model with Population, Region and Population*Region. When conducting a set of 50 tests the critical values are: z_critical = 3.29, t_critical = 3.52 and F_critical = 6.42.

       i. [4] Are there any significant outliers? Support your answer and identify any significant outliers by giving the name of the city.

       ii. [3] Give the values and identify the cities that have high leverage.

       iii. [3] What is causing these cities to have high leverage?

       iv. [6] Do the cities with high leverage have statistically significant leverage? Support your answer.

       v. [4] Do the cities with Cook’s D greater than 1 have statistically significant influence? Support your answer.

2. [40 pts] Data are collected on Height (cm), Weight (kg), Age (years), Sex (1=male, 0=female), Smoker (1=yes, 0=no), Alcohol (1=yes, 0=no), Exercise (1=high, 2=moderate, 3=low), and Pulse Rate (beats per minute) for 108 students in an introductory statistics class. In addition to the original variables, 4 new variables are constructed: Height*Sex, Weight*Sex, Age*Sex and Exercise*Sex. Refer to the JMP output “Predicting Pulse Rate.”

   (a) [4] What is the best single variable for predicting Pulse Rate? Give the value of R² for the simple linear regression of Pulse Rate on this explanatory variable.

   (b) [4] After two steps of the Forward selection procedure (using the default settings), Height and Exercise are included in the model. What will be the next variable added to the model? Explain briefly.

   (c) [4] What is the first variable removed from the full model in the Backward selection procedure? Explain briefly.

   (d) [4] After four steps of the Backward procedure (using the default settings) the model contains: Height, Weight, Age, Sex, Height*Sex, Weight*Sex and Exercise*Sex. What will be the next variable removed from the model? Explain briefly.

   (e) [4] Running the Backward procedure with Prob to Leave = 0.05 ends with the model with Weight, Sex, Weight*Sex and Exercise*Sex. Give the prediction equation for this model.

   (f) [5] Could the model in (e) be the “best” model? Explain briefly.

   (g) [3] Using the prediction equation in (e), predict the pulse rate for a female who weighs 54 kg.

   (h) [8] Looking at the partial listing of all possible models, which model (this can be identified by giving the number of variables in the model) has the
       • most desirable R²? What is the value of R²?
       • most desirable adjusted R²? What is the value of adjusted R²?
       • most desirable RMSE? What is the value of RMSE?
       • most desirable Cp? What is the value of Cp?

   (i) [4] The best five variable model has the following prediction equation.

       Predicted Pulse Rate = 169.65 − 0.49*Height − 0.58*Age − 82.8*Sex + 0.41*Height*Sex + 5.95*Exercise*Sex

       Give separate prediction equations for males and females.

3. [30 pts] A subset of the data from problem 2 is used to look at differences between the three exercise categories. There are 13 students in the high category, 54 students in the moderate category, and 35 students in the low category. Two dummy variables are constructed. High = 1 if Exercise is high, High = 0 otherwise. Low = 1 if Exercise is low, Low = 0 otherwise. Refer to the JMP output “Relationship between Pulse Rate and Exercise.” For this problem the t* value for 95% confidence is 1.984.

   (a) [3] Give the equation for predicting Pulse Rate using the two dummy variables High and Low as explanatory variables.

   (b) [9] Give an interpretation, within the context of the problem, of each estimated parameter in the model.
       • Intercept
       • High
       • Low

   (c) [5] Construct a 95% confidence interval for the difference between the mean pulse rate for students in the high exercise category and the mean pulse rate for students in the moderate exercise category.

   (d) [4] Based on your interval in (c), is the difference in means statistically significant? Explain briefly.

   (e) [5] Construct a 95% prediction interval (confidence interval for an individual) for the pulse rate of an individual in the moderate exercise group.

   (f) [4] What does the plot of residuals by predicted values tell you about the conditions necessary for the statistical analysis? Explain briefly.

   (g) [5 pts Extra Credit] Is the difference in means for the high category compared to the low category statistically significant? Explain briefly.
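
Note on the critical values in problem 1(d): the exam states the critical values for a set of 50 tests but does not say how they were obtained. The sketch below assumes a two-sided Bonferroni adjustment at an overall significance level of 0.05, which is an assumption and not something the exam confirms. The function name bonferroni_z_critical and its parameters are hypothetical. Under that assumption the z value rounds to the stated 3.29; the t and F values depend on degrees of freedom that the exam does not give, so they are not reproduced here.

```python
from statistics import NormalDist


def bonferroni_z_critical(n_tests: int, alpha: float = 0.05) -> float:
    """Two-sided z critical value with a Bonferroni adjustment.

    Assumption (not stated in the exam): the overall level alpha is
    split evenly across n_tests tests, and each test is two-sided.
    """
    # Tail area allotted to each side of each individual test.
    per_test_tail = alpha / (2 * n_tests)
    # Upper quantile of the standard normal distribution.
    return NormalDist().inv_cdf(1 - per_test_tail)


if __name__ == "__main__":
    # 50 tests, one per city in the two samples of 25.
    print(round(bonferroni_z_critical(50), 2))
```

With n_tests = 1 the same function returns the familiar unadjusted value of about 1.96, which shows how much the adjustment for 50 tests raises the bar for declaring a single city an outlier.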