Homework 6, Statistics 112, Fall 2004

This homework is due Thursday, November 11th at the beginning of class.

1. Life insurance companies are keenly interested in predicting how long their customers will live because their premiums and profitability depend on such numbers. An actuary for one insurance company gathered data from 100 recently deceased male customers. She recorded the age at death of the customer (variable = longevity), the age at death of his mother (variable = mother), the age at death of his father (variable = father), the mean age at death of his grandmothers (variable = gmothers) and the mean age at death of his grandfathers (variable = gfathers). The data are stored in lifetimes.JMP.

(a) Report the estimated multiple linear regression coefficients for the regression of age at death of customer on age at death of mother, age at death of father, mean age at death of grandmothers and mean age at death of grandfathers.

Solution: The estimated coefficients are in the Estimate column of the Parameter Estimates table below.

[Actual by Predicted Plot: Longevity Actual vs. Longevity Predicted; P<.0001, RSq=0.74, RMSE=2.6641]

Summary of Fit
  RSquare                       0.74105
  RSquare Adj                   0.730147
  Root Mean Square Error        2.664075
  Mean of Response              72.32
  Observations (or Sum Wgts)    100

Analysis of Variance
  Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
  Model        4       1929.5170        482.379      67.9666     <.0001
  Error       95        674.2430          7.097
  C. Total    99       2603.7600

Parameter Estimates
  Term         Estimate     Std Error    t Ratio    Prob>|t|
  Intercept    3.2438212    5.423412      0.60      0.5512
  Mother       0.4508583    0.054502      8.27      <.0001
  Father       0.4111835    0.049788      8.26      <.0001
  Gmothers     0.016553     0.066107      0.25      0.8028
  Gfathers     0.0868583    0.065657      1.32      0.1890

Effect Tests
  Source      Nparm    DF    Sum of Squares    F Ratio    Prob > F
  Mother        1       1      485.68649       68.4326     <.0001
  Father        1       1      484.07151       68.2051     <.0001
  Gmothers      1       1        0.44499        0.0627     0.8028
  Gfathers      1       1       12.42104        1.7501     0.1890

(b) Will the multiple regression model typically be able to forecast a customer’s age at death to within one year? Will the multiple regression model typically be able to forecast a customer’s age at death to within six years? Justify your answers.

Solution: The RMSE is 2.664075 years. Under the regression model, about 68% of customers’ ages at death fall within one RMSE (2.66 years) of the predicted value, and about 95% fall within 2*RMSE = 5.32815 years. So the model will typically forecast a customer’s age at death to within six years, but it will not typically forecast it to within one year.

(c) Examine the residual plot vs. predicted, the normal quantile plot of the residuals and the Cook’s distances and leverages. Would you recommend that we try any transformations? Are there any points that are highly influential that need to be further investigated? [Don’t do any further analysis; just say what you would do next (if anything).]

Solution:

[Bivariate Fit of Residual Longevity By Predicted Longevity]

[Normal Quantile Plot of Residual Longevity]

The plot of residuals vs. predicted values shows no pattern, so there is no indication that the linearity assumption is violated. The normal quantile plot does not indicate any violation of normality.
No point has a Cook’s distance greater than 1 (so there are no influential points), and no point has leverage greater than 15/100 (so there are no high-leverage points). The assumptions of the multiple linear regression model appear to be satisfied, so I would not recommend any transformations or further investigation of individual points.

2. Problem 1 continued.

(a) Find a 95% confidence interval for the change in the mean age at death of customer that is associated with a one year increase in the age at death of mother, holding fixed age at death of father, mean age at death of grandmothers and mean age at death of grandfathers.

Solution: The 95% CI is (0.34, 0.56).

(b) Is there strong evidence that mean age at death of grandmothers is useful for predicting customer’s age of death, not taking into account any of the other explanatory variables? Justify your answer using a test.

Solution: The test is H0: β_grandmother = 0 in the simple regression of longevity on gmothers. The p-value of this test is 0.0002, so there is strong evidence that mean age at death of grandmothers is useful for predicting customer’s age of death when no other explanatory variables are taken into account.

(c) Is there strong evidence that mean age at death of grandmothers is useful for predicting customer’s age of death once age at death of mother, age at death of father and mean age at death of grandfathers have been taken into account? Justify your answer using a test.

Solution: The test is H0: β_grandmother = 0 in the multiple regression from problem 1. With age at death of mother, age at death of father and mean age at death of grandfathers in the model, the p-value is 0.8028, so we fail to reject the null hypothesis. There is no strong evidence that the mean age at death of grandmothers is useful once age at death of mother, age at death of father and mean age at death of grandfathers have been taken into account.
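The interval in 2(a) and the test in 2(c) can both be checked by hand from the Parameter Estimates table in problem 1(a). The following is a minimal Python sketch, not the JMP procedure itself. It assumes a t critical value of about 1.9853 for 95% confidence with 95 error degrees of freedom (read from a t table); the function and variable names are illustrative.

```python
# Estimates and standard errors copied from the JMP Parameter Estimates
# table in problem 1(a).
ESTIMATE_MOTHER = 0.4508583
STD_ERROR_MOTHER = 0.054502
ESTIMATE_GMOTHERS = 0.016553
STD_ERROR_GMOTHERS = 0.066107

# Assumption: two-sided 95% t critical value for 95 error degrees of
# freedom, taken from a t table (approximately 1.9853).
T_CRIT_95_DF95 = 1.9853


def confidence_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


def t_ratio(estimate, std_error):
    """t statistic for testing H0: coefficient = 0."""
    return estimate / std_error


lower, upper = confidence_interval(ESTIMATE_MOTHER, STD_ERROR_MOTHER, T_CRIT_95_DF95)
print(f"95% CI for the Mother coefficient: ({lower:.2f}, {upper:.2f})")

t_gmothers = t_ratio(ESTIMATE_GMOTHERS, STD_ERROR_GMOTHERS)
print(f"t ratio for Gmothers: {t_gmothers:.2f}")
print("Reject H0 at the 5% level:", abs(t_gmothers) > T_CRIT_95_DF95)
```

The t ratio for Gmothers is far below the critical value, which agrees with the large p-value (0.8028) reported by JMP.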
(d) Find a 95% prediction interval for the age at death of an individual man whose mother lived to be 70, whose father lived to be 75, whose grandmothers’ average lifetime was 80 years and whose grandfathers’ average lifetime was 78 years.

Solution: The 95% prediction interval is (67.76, 79.72).

(e) Find a 95% confidence interval for the mean age of death of men whose mothers live to be 70, whose fathers live to be 75, whose grandmothers’ average lifetime was 80 years and whose grandfathers’ average lifetime was 78 years.

Solution: The 95% confidence interval is (70.95, 76.53). Both intervals are centered at the predicted longevity of 73.74 years; the prediction interval in (d) is wider because it also accounts for the variation of an individual around the mean.

3. Some believe that individuals with a constant sense of time urgency (often called type-A behavior) are more susceptible to heart disease than are more relaxed individuals. Although most studies of this issue have focused on individuals, some psychologists have investigated geographical areas. They considered the relationship of city-wide heart disease rates and general measures of the pace of life in the city. For each region of the United States (Northeast, Midwest, South and West), they selected three large metropolitan areas, three medium-size cities and three smaller cities. In each city they measured three indicators of the pace of life. The variable walk is the walking speed of pedestrians over a distance of 60 feet during business hours on a clear summer day along a main downtown street. The variable bank is the average time a sample of bank clerks takes to make change for two $20 bills or to give $20 bills for change. The variable talk was obtained by recording responses of postal clerks explaining the difference between regular, certified and insured mail and by dividing the total number of syllables by the time of their response. The researchers also obtained the age-adjusted death rates from ischemic heart disease (a decreased flow of blood to the heart) for each city (variable heart). The data are in paceoflife.JMP. The variables have been standardized, so there are no units of measurement involved.
(a) Draw a scatterplot matrix for heart, bank, walk and talk. Does the scatterplot matrix suggest that any transformations of the explanatory variables are needed for multiple regression analysis?

Solution:

Multivariate Correlations
           BANK      WALK      TALK      HEART
  BANK     1.0000    0.0674    0.3520    0.3176
  WALK     0.0674    1.0000    0.3274    0.3477
  TALK     0.3520    0.3274    1.0000    0.0999
  HEART    0.3176    0.3477    0.0999    1.0000

[Scatterplot Matrix of BANK, WALK, TALK and HEART]

The scatterplot matrix does not suggest that any transformation of the variables is needed.

(b) Compute the multiple linear regression of heart on bank, walk and talk.

Solution:

[Actual by Predicted Plot: HEART Actual vs. HEART Predicted; P=0.0416, RSq=0.22, RMSE=4.805]

Summary of Fit
  RSquare                       0.223642
  RSquare Adj                   0.150858
  Root Mean Square Error        4.804986
  Mean of Response              19.80556
  Observations (or Sum Wgts)    36

Analysis of Variance
  Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
  Model        3      212.82642         70.9421       3.0727     0.0416
  Error       32      738.81246         23.0879
  C. Total    35      951.63889

Parameter Estimates
  Term         Estimate     Std Error    t Ratio    Prob>|t|
  Intercept    3.1786957    6.336946      0.50      0.6194
  TALK        -0.17961      0.222215     -0.81      0.4249
  WALK         0.4516011    0.200874      2.25      0.0316
  BANK         0.405217     0.197102      2.06      0.0480

[Residual by Predicted Plot: HEART Residual vs. HEART Predicted]

(c) Construct residual plots of the residuals versus each of the explanatory variables. Comment on these residual plots, the residual plot vs. predicted, normal quantile plot of the residuals and the Cook’s distances and leverages. Would you recommend that we try any transformations? Are there any points that are highly influential that need to be further investigated? [Don’t do any further analysis; just say what you would do next (if anything).]

Solution:

[Bivariate Fit of Residual HEART By BANK, with linear fit]

[Bivariate Fit of Residual HEART By WALK, with linear fit]

[Bivariate Fit of Residual HEART By TALK, with linear fit]

[Normal Quantile Plot of Residual HEART]

[Bivariate Fit of Residual HEART By Predicted HEART, with linear fit]

From these plots, we do not see any indication that the assumptions of linearity, constant variance or normality are violated. Also, the highest Cook’s distance is 0.22 < 1, so there do not seem to be any highly influential points. I would not recommend any transformations.

4. Problem 3 continued.

(a) Is there strong evidence that the multiple regression model using the three pace of life variables bank, walk and talk provides better predictions of heart than using the sample mean of heart to predict heart? Justify your answer using a test.

Solution: The overall F-test of the null hypothesis that the coefficients on bank, walk and talk are all zero has F = 3.0727 and p-value 0.0416. Since the p-value is below 0.05, we reject the null hypothesis: there is evidence that the multiple regression model predicts heart better than the sample mean of heart does.

(b) Although there may be many lurking variables and other problems with this study, comment on whether the signs of the coefficients on bank, walk and talk are consistent with the hypothesis that type-A individuals are more susceptible to heart disease.

Solution: For cities with a fast pace of life, we would expect bank to be low, walk to be high and talk to be high. Consequently, if type-A individuals are more susceptible to heart disease, we would expect the coefficient on bank to be negative, walk to be positive and talk to be positive.
In fact, the coefficient on bank is positive (0.41), the coefficient on walk is positive (0.45) and the coefficient on talk is negative (-0.18). So the coefficient on walk is consistent with the hypothesis, but the coefficients on bank and talk are contrary to it.

(c) A critic of this study says that it is not pace of life which causes heart disease but smoking, which is associated with a fast pace of life, that causes heart disease. If you had data on the smoking rates for each city, how would you use multiple regression analysis to examine the critic’s claim? Describe briefly what multiple regression model you would fit and what you would look for.

Solution: The critic is claiming that smoking is a lurking variable and that once smoking is controlled for, the coefficients on bank, walk and talk should be zero. To examine the critic’s claim, I would fit a multiple regression of heart on the explanatory variables bank, walk, talk and smoking rate. I would then do t-tests of whether the coefficients on bank, walk and talk are zero. If none of the three is significantly different from zero once smoking rate is in the model, that supports the critic’s claim.

(d) Salt Lake City has a predominantly Mormon population. The Mormon religion strongly encourages hard work but prohibits smoking. Compute the residual for Salt Lake City for the multiple regression in problem 3. Does the sign of the residual provide support or not provide support for the critic’s claim in part (c) that smoking, which is associated with a fast pace of life, but not fast pace of life itself causes heart disease? Explain briefly.

Solution: If the critic’s claim is correct, Salt Lake City should have a relatively low heart disease rate, since it has a low smoking rate, but a relatively high predicted heart disease rate from the regression in problem 3, because it has a fast pace of life (assuming that a fast pace of life is in fact associated with heart disease rate; the regression in problem 3 provides mixed evidence for this). Thus, if the critic’s claim is correct, we would expect the residual for Salt Lake City to be negative.
In fact, the residual for Salt Lake City is positive, so it does not provide support for the critic’s claim in (c).
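The residual in part (d) is the observed heart value minus the fitted value from the regression in problem 3. A minimal Python sketch of that calculation follows, using the coefficient estimates from the Parameter Estimates table in 3(b). The input values in the example call are hypothetical placeholders, not Salt Lake City's actual data, which are in paceoflife.JMP; the function names are illustrative.

```python
# Coefficient estimates from the Parameter Estimates table in problem 3(b).
INTERCEPT = 3.1786957
COEF_TALK = -0.17961
COEF_WALK = 0.4516011
COEF_BANK = 0.405217


def predicted_heart(bank, walk, talk):
    """Fitted heart value from the regression of heart on bank, walk and talk."""
    return INTERCEPT + COEF_BANK * bank + COEF_WALK * walk + COEF_TALK * talk


def residual(heart, bank, walk, talk):
    """Residual = observed heart minus fitted heart."""
    return heart - predicted_heart(bank, walk, talk)


# Hypothetical placeholder inputs for illustration only; substitute the
# Salt Lake City row from paceoflife.JMP to reproduce the answer to part (d).
example_residual = residual(heart=20.0, bank=25.0, walk=20.0, talk=20.0)
print(f"residual = {example_residual:.3f}")
```

A positive residual means the city's heart disease rate is higher than the regression predicts; a negative residual means it is lower.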