Statistics 401B
December 14, 2004
Final Exam
Name:
INSTRUCTIONS: Read the questions carefully and completely. Answer each question and show
work in the space provided. Partial credit will not be given if work is not shown. Use the JMP
output when appropriate. When asked to explain, describe, or comment, do so within the context
of the problem. Be sure to include units where appropriate.
1. [55 pts] A random sample of 25 cities from the northeastern U.S. and a second random
sample of 25 cities from the midwestern U.S. are selected. For each city the population and
the number of auto thefts are recorded. Refer to the JMP output “Analysis of auto thefts in
two regions of the U.S.”
(a) [10] Consider the simple linear model that has auto thefts as the response variable and
population as the explanatory variable.
i. [3] Give the prediction equation.
ii. [3] Give an interpretation of the estimated slope coefficient. It may be easier to
consider a change in population of 1000.
iii. [1] How much of the variation in auto thefts is explained by the linear relationship
with population?
iv. [3] What does the plot of residuals by population tell you about the fit of the simple
linear model? Explain briefly.
(b) [6] Look at the quadratic model that has auto thefts as the response variable and population as the explanatory variable.
i. [2] What is the difference in explained variation between the quadratic model and
the simple linear model?
ii. [4] Is this difference in explained variation statistically significant? Support your
answer.
(c) [19] The models in a) and b) do not take into account the region of the country. The
dummy variable Region = 0 if the city is in the Northeast and Region = 1 if the city is
in the Midwest. Refer to the model that has Population, Region and Population*Region
as explanatory variables.
i. [3] Give the prediction equation.
ii. [3] Why doesn’t the intercept have a sensible interpretation within the context of
the problem?
iii. [4] Give an interpretation of the estimated slope for Population. Again it may be
easier to consider a change in population of 1000.
iv. [5] Give an interpretation of the estimated slope for Population*Region.
v. [4] Is the interaction term statistically significant? Support your answer.
(d) [20] Various residuals and measures of leverage and influence are obtained for the model
with Population, Region and Population*Region. When conducting a set of 50 tests, the
critical values are: z_critical = 3.29, t_critical = 3.52, and F_critical = 6.42.
i. [4] Are there any significant outliers? Support your answer and identify any significant outliers by giving the name of the city.
ii. [3] Give the values and identify the cities that have high leverage.
iii. [3] What is causing these cities to have high leverage?
iv. [6] Do the cities with high leverage have statistically significant leverage? Support
your answer.
v. [4] Do the cities with Cook’s D greater than 1 have statistically significant influence? Support your answer.
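The exam does not state how the critical values in part (d) were obtained. One reading, offered here only as an assumption, is a Bonferroni adjustment: with 50 tests and an overall two-sided significance level of 0.05, each test uses 0.05/50 = 0.001. The sketch below checks that reading against the quoted z critical value; the function name and the 0.05 level are illustrative choices, not part of the exam.

```python
from statistics import NormalDist


def bonferroni_z_critical(n_tests: int, alpha: float = 0.05) -> float:
    """Two-sided z critical value after a Bonferroni adjustment.

    Assumption: the overall level alpha is split equally across n_tests.
    """
    per_test_alpha = alpha / n_tests
    # Two-sided test: put half of the per-test alpha in each tail.
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)


z_crit = bonferroni_z_critical(50)
print(round(z_crit, 2))  # 3.29, matching the z critical value quoted in part (d)
```

Under the same assumption, the t critical value would come from the same tail probability applied to a t distribution with the model's error degrees of freedom.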
2. [40 pts] Data are collected on Height (cm), Weight (kg), Age (years), Sex (1=male, 0=female), Smoker (1=yes, 0=no), Alcohol (1=yes, 0=no), Exercise (1=high, 2=moderate, 3=low),
and Pulse Rate (beats per minute) for 108 students in an introductory statistics class. In
addition to the original variables, 4 new variables are constructed: Height*Sex, Weight*Sex,
Age*Sex and Exercise*Sex. Refer to the JMP output “Predicting Pulse Rate.”
(a) [4] What is the best single variable for predicting Pulse Rate? Give the value of R² for
the simple linear regression of Pulse Rate on this explanatory variable.
(b) [4] After two steps of the Forward selection procedure (using the default settings), Height
and Exercise are included in the model. What will be the next variable added to the
model? Explain briefly.
(c) [4] What is the first variable removed from the full model in the Backward selection
procedure? Explain briefly.
(d) [4] After four steps of the Backward procedure (using the default settings) the model
contains: Height, Weight, Age, Sex, Height*Sex, Weight*Sex and Exercise*Sex. What
will be the next variable removed from the model? Explain briefly.
(e) [4] Running the Backward procedure with Prob to Leave = 0.05 ends with the model
with Weight, Sex, Weight*Sex and Exercise*Sex. Give the prediction equation for this
model.
(f) [5] Could the model in e) be the “best” model? Explain briefly.
(g) [3] Using the prediction equation in e), predict the pulse rate for a female who weighs
54 kg.
(h) [8] Looking at the partial listing of all possible models, which model (this can be identified
by giving the number of variables in the model) has the
• most desirable R²? What is the value of R²?
• most desirable adj R²? What is the value of adj R²?
• most desirable RMSE? What is the value of RMSE?
• most desirable Cp? What is the value of Cp?
(i) [4] The best five variable model has the following prediction equation.
Predicted Pulse Rate = 169.65 − 0.49*Height − 0.58*Age − 82.8*Sex + 0.41*Height*Sex
+ 5.95*Exercise*Sex.
Give separate prediction equations for males and females.
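As a minimal sketch of how the dummy variable Sex works in the five-variable equation of part (i), the function below evaluates that equation directly. The coefficients are copied from the exam; the function name and the example inputs (height 170 cm, age 20, moderate exercise) are hypothetical and chosen only for illustration.

```python
def predicted_pulse(height_cm: float, age_years: float, sex: int, exercise: int) -> float:
    """Evaluate the five-variable prediction equation from part (i).

    sex: 1 = male, 0 = female; exercise: 1 = high, 2 = moderate, 3 = low.
    """
    return (
        169.65
        - 0.49 * height_cm
        - 0.58 * age_years
        - 82.8 * sex
        + 0.41 * height_cm * sex
        + 5.95 * exercise * sex
    )


# Hypothetical students: same height, age, and exercise level, differing only in Sex.
female = predicted_pulse(170, 20, sex=0, exercise=2)
male = predicted_pulse(170, 20, sex=1, exercise=2)
print(f"female: {female:.2f}, male: {male:.2f}")
```

Setting Sex = 0 makes every term that contains Sex vanish, while setting Sex = 1 folds those terms into the intercept and the Height slope and adds an Exercise term.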
3. [30 pts] A subset of the data from problem 2 is used to look at differences between the three
exercise categories. There are 13 students in the high category, 54 students in the moderate
category, and 35 students in the low category. Two dummy variables are constructed. High
= 1 if Exercise is high, High = 0 otherwise. Low = 1 if Exercise is low, Low = 0 otherwise.
Refer to the JMP output “Relationship between Pulse Rate and Exercise.” For this problem
the t* value for 95% confidence is 1.984.
(a) [3] Give the equation for predicting Pulse Rate using the two dummy variables High
and Low as explanatory variables.
(b) [9] Give an interpretation, within the context of the problem, of each estimated parameter in the model.
• Intercept
• High
• Low
(c) [5] Construct a 95% confidence interval for the difference between the mean pulse rate
for students in the high exercise category and the mean pulse rate for students in the
moderate exercise category.
(d) [4] Based on your interval in c), is the difference in means statistically significant? Explain briefly.
(e) [5] Construct a 95% prediction interval (confidence interval for an individual) for the
pulse rate of an individual in the moderate exercise group.
(f) [4] What does the plot of residuals by predicted values tell you about the conditions
necessary for the statistical analysis? Explain briefly.
(g) [5 pts Extra Credit] Is the difference in means for the high category compared to the
low category statistically significant? Explain briefly.
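The interval in part (e) can be sketched in code. In a model with only the dummy variables High and Low, the intercept estimates the mean of the moderate group, so a prediction interval for one individual in that group takes the form mean ± t* × RMSE × sqrt(1 + 1/n). The group size (54) and t* (1.984) come from the problem statement; the mean of 70 and RMSE of 10 below are placeholders, not values from the JMP output.

```python
import math


def prediction_interval(group_mean: float, rmse: float, n_group: int, t_star: float):
    """Prediction interval for one individual in a group of a one-way dummy-variable model.

    Assumes equal error variance across groups, estimated by rmse.
    """
    # The "1" covers individual variation; 1/n_group covers uncertainty in the group mean.
    margin = t_star * rmse * math.sqrt(1 + 1 / n_group)
    return group_mean - margin, group_mean + margin


# Placeholder mean and RMSE; n = 54 and t* = 1.984 are given in the problem.
low, high = prediction_interval(group_mean=70.0, rmse=10.0, n_group=54, t_star=1.984)
print(f"({low:.1f}, {high:.1f})")
```

Dropping the "1 +" inside the square root would give a confidence interval for the group mean instead, which is much narrower than an interval for an individual.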