Homework 5, Statistics 112, Fall 2004

This homework is due Tuesday, October 19th at the beginning of class.

1. In most jurisdictions, driving an automobile with a blood alcohol level in excess of .08 is a felony. Because of a number of factors, it is difficult to provide guidelines on when it is safe for someone who has consumed alcohol to drive a car. In an experiment to examine the relationship between blood alcohol level and the weight of a drinker, 50 men of varying weights were each given three beers to drink, and 1 hour later their blood alcohol level was measured. The data are stored in bloodalcohol.JMP on the web site.

(a) Fit a simple linear regression model to predict blood alcohol level based on weight. Check the assumptions of the simple linear regression model by constructing a residual plot and a normal quantile plot of the residuals. Do these plots indicate any problems with the assumptions of the simple linear regression model? If yes, what problems are indicated and what indicates the problem? If no, what indicates that there are no problems?

Solution:

[Figure: Bivariate Fit of B/A Level By Weight. Scatterplot of B/A Level against Weight with the linear fit.]

Linear Fit
B/A Level = 0.0331795 + 0.000225 Weight

Summary of Fit
RSquare                      0.174495
RSquare Adj                  0.157297
Root Mean Square Error       0.013979
Mean of Response             0.0774
Observations (or Sum Wgts)   50

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.0331795   0.014023    2.37      0.0221
Weight      0.000225    0.000071    3.19      0.0025

[Figure: Residual plot. Residuals against Weight.]

[Figure: Normal Quantile Plot of Residuals B/A Level.]

We can see that there is no obvious pattern in the residual plot. In particular, the mean of the residuals for all ranges of X appears to be roughly zero, and the spread of the residuals appears to be roughly constant.
From the normal quantile plot, we see that all points are within the 95% confidence bands, so the normality assumption appears reasonable. Thus, there are no clear problems with the assumptions of the simple linear regression model.

For the rest of the problem, we will assume that the simple linear regression model holds in spite of any problems you may have found in part (a).

(b) Give a 95% confidence interval for the amount by which the mean blood alcohol level changes for a one pound increase in weight.

Solution: We need a 95% confidence interval for the slope. Using the approximation of the estimate plus or minus two standard errors, the interval is (0.000225 - 2*0.000071, 0.000225 + 2*0.000071) = (0.000083, 0.000367).

(c) Is there strong evidence that weight is associated with blood alcohol level? State hypotheses, give a p-value and state your conclusion.

Solution: H0: blood alcohol level is not linearly related to weight (slope = 0). H1: blood alcohol level is linearly related to weight (slope > 0 or slope < 0). Using the t test, the p-value is 0.0025. Because the p-value is < 0.05, we reject the null hypothesis. There is strong evidence that weight is associated with blood alcohol level.

2. Problem 1 continued.

(a) Calculate a 95% confidence interval for the mean blood alcohol level one hour after drinking three beers for the population of 160 pound men.

Solution: We use JMP to find 95% confidence intervals for the mean response and 95% prediction intervals.

[Figure: Bivariate Fit of B/A Level By Weight. B/A Level against Weight.]

Using the crosshair tool, the 95% confidence interval for the mean blood alcohol level one hour after drinking three beers for 160 pound men is approximately (0.063, 0.076).

(b) Steve is 160 pounds and thinks he can drive legally one hour after drinking three beers. Give a 95% prediction interval for Steve’s BAC.
Given that driving with a blood alcohol level greater than .08 is illegal, can Steve be confident that he won’t be arrested if he drives and is stopped?

Solution: Using the crosshair tool, a 95% prediction interval for Steve’s BAC is approximately (0.040, 0.098). Because 0.08 is in the 95% prediction interval, Steve cannot be confident that he won’t be arrested if he drives.

(c) The police want to establish guidelines for whether it is safe for a 160 pound man to drive one hour after drinking three beers. What would you advise the police based on the regression analysis?

Solution: I would think that the police are conservative, and would only advise that it is safe for someone to drive if they think it is unlikely that the person will have a blood alcohol level above 0.08. The 95% prediction interval for the blood alcohol level of a 160 pound man one hour after drinking three beers is (0.040, 0.098). Because the 95% prediction interval contains 0.08, it is not unlikely that a 160 pound man will have a blood alcohol level above 0.08 one hour after drinking three beers. I would advise the police to recommend that it is not safe for a 160 pound man to drive one hour after drinking three beers.

3. The data in wineheart.JMP are the average wine consumption rates (in liters per person) and number of ischemic heart disease deaths (per 1,000 men aged 55 to 64 years old) for 18 industrialized countries (Data from A.S. St Leger et al., “Factors Associated with Cardiac Mortality in Developed Countries with Particular Reference to the Consumption of Wine”, Lancet, 1979).

(a) Fit a simple linear regression to predict mortality from heart disease based on wine consumption. Construct a residual plot. What is the most obvious problem you see with the residual plot compared to what you would expect to see if the ideal simple linear regression model holds?
Solution:

[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption. Scatterplot of Heart Disease Mortality against Wine Consumption with the linear fit.]

Linear Fit
Heart Disease Mortality = 7.6865549 - 0.0760809 Wine Consumption

Summary of Fit
RSquare                      0.555872
RSquare Adj                  0.528114
Root Mean Square Error       1.618923
Mean of Response             6.433333
Observations (or Sum Wgts)   18

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1    52.485428        52.4854       20.0256   0.0004
Error      16   41.934572        2.6209
C. Total   17   94.420000

Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          7.6865549   0.473322    16.24     <.0001
Wine Consumption   -0.076081   0.017001    -4.48     0.0004

[Figure: Residual plot. Residuals against Wine Consumption.]

The mean of the residuals follows a "U"-shaped pattern. In an ideal linear regression residual plot, there is no pattern.

(b) Using Tukey’s Bulging rule, try three appropriate transformations to try to achieve a better fit. Use the transformation of x to log(x) and y to log(y) as one of your transformations. Report the transformations you tried. Which achieves the best fit (explain the reason for your answer)?

Solution: I tried the following three transformations.
1. Transform X to log X and Y to log Y. The root mean square error measured on the original scale is: 1.6116274.
2. Transform X to X and Y to Y . The root mean square error is: 1.475877.
3. Transform X to 1/X and Y to 1/Y. The root mean square error is: 3.0843388.
So, the transformation of X to X and Y to Y has the smallest RMSE. It achieves the best fit.

For the remaining part of the problem, we use the transformation of x to log(x) and y to log(y).
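As a quick check on the root mean square errors being compared, the sketch below recomputes two of them from the sums of squared errors that JMP reports: 41.934572 for the untransformed linear fit and 41.557487 for the log-log fit measured on the original scale. It assumes that RMSE = sqrt(SSE / (n - 2)) with n = 18 countries, which is consistent with the reported values. The helper name rmse_from_sse is hypothetical and is not a JMP function.

```python
import math


def rmse_from_sse(sse: float, n_obs: int, n_params: int = 2) -> float:
    """Hypothetical helper: RMSE = sqrt(SSE / (n - p)) for a fit with p parameters."""
    return math.sqrt(sse / (n_obs - n_params))


N_COUNTRIES = 18  # number of observations in wineheart.JMP

# Sums of squared errors copied from the JMP output (original scale).
rmse_linear = rmse_from_sse(41.934572, N_COUNTRIES)
rmse_loglog = rmse_from_sse(41.557487, N_COUNTRIES)

print(round(rmse_linear, 4), round(rmse_loglog, 4))
```

Because both values are on the original scale of heart disease mortality, they can be compared directly with each other and with the RMSEs of the other transformations.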
[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption. Scatterplot of Heart Disease Mortality against Wine Consumption with the transformed fit, log to log.]

Transformed Fit Log to Log
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)

Summary of Fit
RSquare                      0.738433
RSquare Adj                  0.722085
Root Mean Square Error       0.228537
Mean of Response             1.78335
Observations (or Sum Wgts)   18

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1    2.3591756        2.35918       45.1698   <.0001
Error      16   0.8356647        0.05223
C. Total   17   3.1948403

Parameter Estimates
Term                    Estimate    Std Error   t Ratio   Prob>|t|
Intercept               2.5555519   0.126897    20.14     <.0001
Log(Wine Consumption)   -0.355596   0.052909    -6.72     <.0001

Fit Measured on Original Scale
Sum of Squared Error     41.557487
Root Mean Square Error   1.6116274
RSquare                  0.5598656
Sum of Residuals         2.3201106

[Figure: Residual plot. Residuals against Wine Consumption.]

(c) Using the transformation of x to log(x) and y to log(y), which country’s heart disease mortality rate is most surprisingly high given its wine consumption rate? Which country’s heart disease mortality rate is most surprisingly low given its wine consumption rate? Using the rule of thumb that a point with a residual that is more than three root mean square errors away from zero is an outlier in the direction of the scatterplot, would you consider either of these two countries outliers in the direction of the scatterplot?

Solution: From saving the residuals, we find that the country whose mortality rate is most surprisingly high given its wine consumption is Australia (residual = 3.03) and the country whose mortality rate is most surprisingly low given its wine consumption is Norway (residual = -2.73). The root mean square error of the fit measured on the original scale is 1.61.
Neither of these countries has a residual that is more than three root mean square errors (3 × 1.61 = 4.83) away from zero, so neither would be considered an outlier in the direction of the scatterplot using the rule of thumb.

(d) Using the transformation of x to log(x) and y to log(y), predict the heart disease mortality rate for a country with a wine consumption of 6 liters per person.

Solution: From the regression:
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
So, the estimated heart disease mortality rate for a country with a wine consumption of 6 liters per person is

\begin{align*}
\hat{E}(\text{HeartDisease} \mid \text{WineConsumption} = 6)
&= \exp\{\hat{E}(\log(\text{HeartDisease}) \mid \text{WineConsumption} = 6)\} \\
&= \exp\{\hat{E}(\log(\text{HeartDisease}) \mid \log(\text{WineConsumption}) = \log(6))\} \\
&= \exp\{2.5556 - 0.3556 \times 1.792\} \\
&= \exp(1.918) \approx 6.81
\end{align*}

4. Problem 3 continued.

(a) Is there strong evidence that wine consumption is associated with heart disease mortality? State hypotheses, give a p-value and state your conclusion. If you found that there is strong evidence that wine consumption is associated with heart disease mortality, what is the direction of the association?

Solution: Assuming the simple linear regression model holds for Y = log(heart disease mortality) and X = log(wine consumption),

\[ \log(\text{HeartDisease}) = \beta_0 + \beta_1 \log(\text{WineConsumption}) + \varepsilon, \]

we can test if heart disease mortality is associated with wine consumption by testing whether the slope is zero for the regression of log(heart disease mortality) on log(wine consumption). The null hypothesis is $H_0: \beta_1 = 0$ and the alternative hypothesis is $H_a: \beta_1 \neq 0$. The t statistic is -6.72 and the p-value is <0.0001. Thus, we reject H0. There is strong evidence that wine consumption is associated with heart disease mortality. Because the estimated slope is negative, the association is negative: more wine consumption is associated with lower heart disease mortality.
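The back-transformed prediction in Problem 3(d) and the t ratio used in part (a) can be reproduced from the reported estimates. The sketch below is only an arithmetic check, not a refit of the data: the intercept, slope, and standard error are copied from the JMP output for the log-log fit, and the function name predict_mortality is hypothetical.

```python
import math

# Estimates copied from the JMP output for the log-log fit.
INTERCEPT = 2.5555519
SLOPE = -0.3555959
SLOPE_STD_ERROR = 0.052909


def predict_mortality(wine_consumption: float) -> float:
    """Hypothetical helper: evaluate the fitted line on the log scale, then exponentiate."""
    log_prediction = INTERCEPT + SLOPE * math.log(wine_consumption)
    return math.exp(log_prediction)


prediction_at_6 = predict_mortality(6.0)

# The t ratio for the slope is the estimate divided by its standard error.
t_ratio = SLOPE / SLOPE_STD_ERROR

print(f"{prediction_at_6:.2f} {t_ratio:.2f}")  # prints "6.81 -6.72"
```

Both numbers agree with the values used in the solutions: a predicted mortality of about 6.81 and a t statistic of -6.72.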
(b) Based on your regression analysis, your friend decides to drink more wine. Perhaps your friend is just using your regression analysis as an excuse, but anyhow, comment on whether your regression analysis justifies your friend’s decision to drink more wine. Discuss some additional data you would be interested in collecting to better understand the causal relationship between wine drinking and heart disease (see Section 2.5 of Moore and McCabe on Establishing Causation).

Solution: The regression analysis establishes a negative association between wine consumption and heart disease mortality, but it does not establish that more wine consumption causes lower heart disease mortality, so it does not by itself justify your friend’s decision. An important lurking variable is diet. For example, countries that consume less wine might consume more red meat. It would be useful to collect additional data on the diet of the different countries and to see whether there is still an association between heart disease mortality and wine consumption when diet is held fixed. It would also be good to check whether the association is consistent by studying the association between wine consumption and heart disease mortality in different regions and on individuals rather than countries or regions.