Solutions to Homework 5 - Wharton Statistics Department

Homework 5, Statistics 112, Fall 2004
This homework is due Tuesday, October 19th at the beginning of class.
1. In most jurisdictions, driving an automobile with a blood alcohol level in excess of .08
is a felony. Because of a number of factors, it is difficult to provide guidelines on when it
is safe for someone who has consumed alcohol to drive a car. In an experiment to
examine the relationship between blood alcohol level and the weight of a drinker, 50 men
of varying weights were each given three beers to drink and 1 hour later their blood
alcohol level was measured. The data are stored in bloodalcohol.JMP on the web site.
(a) Fit a simple linear regression model to predict blood alcohol level based on weight.
Check the assumptions of the simple linear regression model by constructing a residual
plot and a normal quantile plot of the residuals. Do these plots indicate any problems
with the assumptions of the simple linear regression model? If yes, what problems are
indicated and what indicates the problem? If no, what indicates that there are no
problems?
Solution:
[Figure: Bivariate Fit of B/A Level By Weight. Scatterplot of B/A Level (vertical axis) against Weight (horizontal axis) with the linear fit.]
Linear Fit
B/A Level = 0.0331795 + 0.000225 Weight
Summary of Fit
RSquare                       0.174495
RSquare Adj                   0.157297
Root Mean Square Error        0.013979
Mean of Response              0.0774
Observations (or Sum Wgts)    50

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   0.0331795    0.014023    2.37      0.0221
Weight      0.000225     0.000071    3.19      0.0025
[Figure: Residual plot. Residuals (vertical axis) against Weight (horizontal axis).]
[Figure: Distributions, Residuals B/A Level. Normal Quantile Plot of the residuals.]
We can see that there is no obvious pattern in the residual plot, in particular the mean of
the residuals for all ranges of X appears to be roughly zero and the spread of the residuals
appears to be roughly constant. From the normal quantile plot, we see that all points are
within the 95% confidence bands so the normality assumption appears reasonable. Thus,
there are no clear problems with the assumptions of the simple linear regression model.
For the rest of the problem, we will assume that the simple linear regression model holds
in spite of any problems you may have found in part (a).
(b) Give a 95% confidence interval for the amount by which the mean blood alcohol level
changes for a one pound increase in weight.
Solution:
We need a 95% confidence interval for the slope, which is approximately the estimate
plus or minus two standard errors:
(0.000225-2*0.000071, 0.000225+2*0.000071)=(0.000083, 0.000367)
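The arithmetic for this interval can be sketched in a few lines of Python. The inputs are the slope estimate and standard error from the Parameter Estimates table; the multiplier of 2 is the same rough stand-in for the t quantile that the solution uses (the exact quantile for 48 degrees of freedom is about 2.01, which changes the interval only slightly). The variable names are illustrative.

```python
# Slope estimate and standard error, copied from the JMP Parameter Estimates table.
slope = 0.000225
se_slope = 0.000071

# The solution uses a multiplier of 2 as a stand-in for the t quantile
# with n - 2 = 48 degrees of freedom (roughly 2.01).
multiplier = 2

lower = slope - multiplier * se_slope
upper = slope + multiplier * se_slope
print(f"Approximate 95% CI for the slope: ({lower:.6f}, {upper:.6f})")
```

Because the whole interval lies above zero, it is consistent with the positive association tested in part (c).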
(c) Is there strong evidence that weight is associated with blood alcohol level? State
hypotheses, give a p-value and state your conclusion.
Solution:
H0: blood alcohol level is not linearly related to weight (slope=0).
H1: blood alcohol level is linearly related to weight (slope≠0).
Using the t test, the p-value is 0.0025. Because the p-value is <0.05, we reject the null
hypothesis. There is strong evidence that weight is associated with blood alcohol level.
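For readers without JMP, the t ratio and two-sided p-value can be approximately recovered from the Summary of Fit table alone. The sketch below relies on the simple-regression identity t^2 = R^2 (n - 2) / (1 - R^2) and integrates the t density numerically with Simpson's rule; the function names are hypothetical helpers, not part of any package, and the result should agree with the JMP output only up to rounding.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_p_value(t, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] by Simpson's rule."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3   # P(0 < T < |t|)
    return 1 - 2 * central_area    # P(|T| > |t|)

# Values from the JMP Summary of Fit table.
r_square, n = 0.174495, 50
df = n - 2

# Magnitude of the slope's t ratio; its sign is the sign of the slope (positive here).
t_ratio = math.sqrt(r_square * df / (1 - r_square))
p_value = two_sided_p_value(t_ratio, df)
print(f"t = {t_ratio:.2f}, two-sided p-value = {p_value:.4f}")
```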
2. Problem 1 continued.
(a) Calculate a 95% confidence interval for the mean blood alcohol level one hour later
after drinking three beers for the population of 160 pound men.
Solution: We use JMP to find 95% confidence intervals for the mean response and 95%
prediction intervals.
[Figure: Bivariate Fit of B/A Level By Weight. B/A Level (vertical axis) against Weight (horizontal axis).]
Using the crosshair tool, the 95% confidence interval for the mean blood alcohol level
one hour after drinking three beers for 160 pound men is approximately (0.063, 0.076).
(b) Steve is 160 pounds and thinks he can drive legally one hour after drinking three
beers. Give a 95% prediction interval for Steve’s BAC. Given that driving with a blood
alcohol level greater than .08 is illegal, can Steve be confident that he won’t be arrested if
he drives and is stopped?
Solution: Using the crosshair tool, a 95% prediction interval for Steve’s BAC is
approximately (0.040,0.098).
Because 0.08 is in the 95% prediction interval, Steve cannot be confident that he won’t
be arrested if he drives.
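The crosshair readings in parts (a) and (b) can be cross-checked with the usual formulas for a confidence interval for the mean response and a prediction interval for a new observation. This is only a sketch under stated assumptions: the mean of Weight and the sum of squares Sxx are not given in the output, so they are backed out from the rounded JMP values, and the t quantile 2.0106 (48 degrees of freedom) is hard-coded rather than computed.

```python
import math

# Values from the JMP output for the regression of B/A Level on Weight.
intercept, slope = 0.0331795, 0.000225
se_slope = 0.000071
rmse = 0.013979
mean_y = 0.0774
n = 50

# Assumption: t quantile for 95% confidence with n - 2 = 48 degrees of freedom.
t_crit = 2.0106

# Back out the mean of Weight and Sxx = sum((x - mean_x)^2) from the output.
# The fitted line passes through (mean_x, mean_y), and se_slope = rmse / sqrt(Sxx).
mean_x = (mean_y - intercept) / slope
sxx = (rmse / se_slope) ** 2

x0 = 160
fit = intercept + slope * x0
leverage = 1 / n + (x0 - mean_x) ** 2 / sxx

se_mean = rmse * math.sqrt(leverage)       # standard error of the estimated mean
se_pred = rmse * math.sqrt(1 + leverage)   # standard error for a new observation

ci = (fit - t_crit * se_mean, fit + t_crit * se_mean)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
print(f"95% CI for mean BAC at {x0} lb: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI for one man at {x0} lb:  ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is much wider than the confidence interval because it adds the variability of an individual man's blood alcohol level to the uncertainty in the estimated mean.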
(c) The police want to establish guidelines for whether it is safe for a 160 pound man to
drive one hour after drinking three beers. What would you advise the police based on the
regression analysis?
Solution: I would think that the police are conservative, and would only advise that it is safe for
someone to drive if they think it is unlikely that the person will have a blood alcohol level
above 0.08. The 95% prediction interval for the blood alcohol level of a 160 pound man
one hour after drinking three beers is (0.040,0.098). Because the 95% prediction interval
contains 0.08, it is not unlikely that a 160 pound man will have a blood alcohol level
above 0.08 one hour after drinking three beers. I would advise the police to recommend
that it is not safe for a 160 pound man to drive one hour after drinking three beers.
3. The data in wineheart.JMP are the average wine consumption rates (in liters per
person) and number of ischemic heart disease deaths (per 1,000 men aged 55 to 64 years
old) for 18 industrialized countries (Data from A.S. St Leger et al., “Factors Associated
with Cardiac Mortality in Developed Countries with Particular Reference to the
Consumption of Wine”, Lancet, 1979).
(a) Fit a simple linear regression to predict mortality from heart disease based on wine
consumption. Construct a residual plot. What is the most obvious problem you see with
the residual plot compared to what you would expect to see if the ideal simple linear
regression model holds?
Solution:
[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption. Scatterplot of Heart Disease Mortality (vertical axis) against Wine Consumption (horizontal axis) with the linear fit.]
Linear Fit
Heart Disease Mortality = 7.6865549 - 0.0760809 Wine Consumption
Summary of Fit
RSquare                       0.555872
RSquare Adj                   0.528114
Root Mean Square Error        1.618923
Mean of Response              6.433333
Observations (or Sum Wgts)    18

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model       1   52.485428        52.4854       20.0256    0.0004
Error      16   41.934572         2.6209
C. Total   17   94.420000

Parameter Estimates
Term               Estimate     Std Error   t Ratio   Prob>|t|
Intercept          7.6865549    0.473322    16.24     <.0001
Wine Consumption   -0.076081    0.017001    -4.48     0.0004

[Figure: Residual plot. Residuals (vertical axis) against Wine Consumption (horizontal axis).]
The residual plot has a pattern in the mean of the residuals like a "U". In an ideal linear
regression residual plot, there is no pattern.
(b) Using Tukey’s Bulging rule, try three appropriate transformations to try to achieve a
better fit. Use the transformation of x to log(x) and y to log(y) as one of your
transformations. Report the transformations you tried. Which achieves the best fit
(explain the reason for your answer)?
Solution: I tried the following three transformations.
1. Transform X to log X and Y to log Y. The root mean square error measured on
the original scale is: 1.6116274.
2. Transform X to sqrt(X) and Y to sqrt(Y). The root mean square error is: 1.475877.
3. Transform X to 1/X and Y to 1/Y. The root mean square error is: 3.0843388.
So, the transformation of X to sqrt(X) and Y to sqrt(Y) has the smallest root mean
square error. It achieves the best fit.
For the remaining part of the problem, we use the transformation of x to log (x) and y to
log(y).
[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption. Scatterplot of Heart Disease Mortality (vertical axis) against Wine Consumption (horizontal axis) with the transformed fit.]
Transformed Fit Log to Log
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
Summary of Fit
RSquare                       0.738433
RSquare Adj                   0.722085
Root Mean Square Error        0.228537
Mean of Response              1.78335
Observations (or Sum Wgts)    18

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model       1   2.3591756        2.35918       45.1698    <.0001
Error      16   0.8356647        0.05223
C. Total   17   3.1948403

Parameter Estimates
Term                    Estimate     Std Error   t Ratio   Prob>|t|
Intercept               2.5555519    0.126897    20.14     <.0001
Log(Wine Consumption)   -0.355596    0.052909    -6.72     <.0001

Fit Measured on Original Scale
Sum of Squared Error      41.557487
Root Mean Square Error    1.6116274
RSquare                   0.5598656
Sum of Residuals          2.3201106

[Figure: Residual plot. Residuals (vertical axis) against Wine Consumption (horizontal axis).]
(c) Using the transformation of x to log (x) and y to log (y), which country’s heart disease
mortality rate is most surprisingly high given its wine consumption rate? Which
country’s heart disease mortality rate is most surprisingly low given its wine consumption
rate? Using the rule of thumb that a point with a residual that is more than three root
mean square errors away from zero is an outlier in the direction of the scatterplot, would
you consider either of these two countries outliers in the direction of the scatterplot?
Solution:
From saving the residuals, we find that the country whose mortality rate is most
surprisingly high given its wine consumption is Australia (residual = 3.03) and the
country whose mortality rate is most surprisingly low given its wine consumption is
Norway (residual = -2.73). The root mean square error of the fit measured on the original
scale is 1.61. Neither of these countries has a residual that is more than three root mean
square errors away from zero, so neither would be considered an outlier in the direction
of the scatterplot using the rule of thumb.
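The rule-of-thumb check can be written out explicitly. The sketch below recomputes the original-scale root mean square error from the Sum of Squared Error reported by JMP for the log-log fit (41.557487, with n - 2 = 16 degrees of freedom) and compares each of the two residuals quoted in the solution with three root mean square errors. The dictionary of residuals contains only those two values, not the full data set.

```python
import math

# Values from the "Fit Measured on Original Scale" output for the log-log fit.
sse_original_scale = 41.557487
n = 18

# Root mean square error = sqrt(SSE / (n - 2)) for simple linear regression.
rmse = math.sqrt(sse_original_scale / (n - 2))

# Largest positive and negative residuals reported in the solution.
residuals = {"Australia": 3.03, "Norway": -2.73}

for country, resid in residuals.items():
    ratio = resid / rmse
    is_outlier = abs(ratio) > 3  # rule of thumb: more than 3 RMSEs from zero
    print(f"{country}: residual / RMSE = {ratio:.2f}, outlier = {is_outlier}")
```

Both ratios are below 2 in absolute value, well short of the threshold of 3.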
(d) Using the transformation of x to log (x) and y to log (y), predict the heart disease
mortality rate for a country with a wine consumption of 6 liters per person.
Solution:
From the regression:
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
So, the estimated heart disease mortality rate for a country with a wine consumption of 6
liters per person is
\[
\begin{aligned}
E(\text{HeartDisease} \mid \text{WineConsumption} = 6)
&= \exp\{E(\log(\text{HeartDisease}) \mid \text{WineConsumption} = 6)\} \\
&= \exp\{E(\log(\text{HeartDisease}) \mid \log(\text{WineConsumption}) = \log(6))\} \\
&= \exp\{2.5556 - 0.3556 \times 1.792\} = \exp(1.918) = 6.81
\end{aligned}
\]
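The back-transformation can be checked numerically. This short sketch plugs the fitted coefficients into the log-log equation and exponentiates; it assumes natural logarithms, which matches log(6) = 1.792 in the solution. Variable names are illustrative.

```python
import math

# Coefficients of the fitted log-log regression from the JMP output.
intercept = 2.5555519
slope = -0.3555959

wine = 6  # liters per person

# Predict on the log scale, then back-transform to the original scale.
log_pred = intercept + slope * math.log(wine)
pred = math.exp(log_pred)
print(f"Predicted log mortality: {log_pred:.3f}")
print(f"Predicted mortality: {pred:.2f} deaths per 1,000 men")
```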
4. Problem 3 continued.
(a) Is there strong evidence that wine consumption is associated with heart disease
mortality? State hypotheses, give a p-value and state your conclusion. If you found that
there is strong evidence that wine consumption is associated with heart disease mortality,
what is the direction of the association?
Solution:
Assuming the simple linear regression model holds for Y=log(heart disease mortality)
and X=log(wine consumption), $\log(\text{HeartDisease}) = \beta_0 + \beta_1 \log(\text{WineConsumption})$, we
can test if heart disease mortality is associated with wine consumption by testing whether
the slope is zero for the regression of log(heart disease mortality) on log(wine
consumption). The null hypothesis is $H_0: \beta_1 = 0$ and the alternative hypothesis is
$H_a: \beta_1 \neq 0$ for the regression of log(heart disease mortality) on log(wine consumption).
The t statistic is -6.72 and the p-value is <0.0001. Thus, we reject H0. There is strong
evidence that wine consumption is associated with heart disease mortality, and they are
negatively related. From the sign of the slope, more wine consumption is associated with
lower heart disease mortality.
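As a consistency check on the JMP output for the log-log fit, the t ratio is the slope estimate divided by its standard error, and in simple linear regression its square equals the F ratio in the Analysis of Variance table. The sketch below uses the rounded values from the Parameter Estimates table, so small discrepancies from the printed output are expected.

```python
# Slope estimate and standard error for Log(Wine Consumption), from the JMP output.
slope = -0.355596
se_slope = 0.052909

t_ratio = slope / se_slope
f_from_t = t_ratio ** 2  # should be close to the reported F Ratio of 45.1698

print(f"t ratio = {t_ratio:.2f}")
print(f"t ratio squared = {f_from_t:.2f}")
```

The negative sign of the t ratio carries the direction of the association: higher wine consumption goes with lower heart disease mortality.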
(b) Based on your regression analysis, your friend decides to drink more wine. Perhaps
your friend is just using your regression analysis as an excuse, but anyhow, comment on
whether your regression analysis justifies your friend’s decision to drink more wine.
Discuss some additional data you would be interested in collecting to better understand
the causal relationship between wine drinking and heart disease (see Section 2.5 of
Moore and McCabe on Establishing Causation).
Solution: The regression analysis establishes a negative association between wine consumption and
heart disease mortality, but it does not establish that more wine consumption causes
lower heart disease mortality. An important lurking variable is diet. For example,
countries that consume less wine might consume more red meat. It would be useful to
collect additional data on the diet of the different countries and to see whether or not
there is still an association between heart disease mortality and wine consumption if we
hold diet fixed. It would also be good to see if the association is consistent by doing
studies of the association between wine consumption and heart disease mortality in
different regions and on individuals rather than countries/regions.