Solutions - Wharton Statistics Department

Homework 6, Statistics 112, Fall 2004
This homework is due Thursday, November 11th at the beginning of class.
1. Life insurance companies are keenly interested in predicting how long their customers
will live because their premiums and profitability depend on such numbers. An actuary
for one insurance company gathered data from 100 recently deceased male customers.
She recorded the age at death of the customer (variable = longevity), the age at death of
his mother (variable = mother), the age at death of his father (variable = father), the mean
age at death of his grandmothers (variable = gmothers) and the mean age at death of his
grandfathers (variable = gfathers). The data are stored in lifetimes.JMP.
(a) Report the estimated multiple linear regression coefficients for the regression of age at
death of customer on age at death of mother, age at death of father, mean age at death of
grandmothers and mean age at death of grandfathers.
Solution:
[Figure: Actual by Predicted Plot. Longevity Actual vs. Longevity Predicted; P<.0001, RSq=0.74, RMSE=2.6641]
Summary of Fit
RSquare                       0.74105
RSquare Adj                   0.730147
Root Mean Square Error        2.664075
Mean of Response              72.32
Observations (or Sum Wgts)    100
Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        4         1929.5170        482.379    67.9666      <.0001
Error       95          674.2430          7.097
C. Total    99         2603.7600
Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    3.2438212    5.423412        0.60      0.5512
Mother       0.4508583    0.054502        8.27      <.0001
Father       0.4111835    0.049788        8.26      <.0001
Gmothers     0.016553     0.066107        0.25      0.8028
Gfathers     0.0868583    0.065657        1.32      0.1890
Effect Tests
Source      Nparm    DF    Sum of Squares    F Ratio    Prob > F
Mother          1     1         485.68649    68.4326      <.0001
Father          1     1         484.07151    68.2051      <.0001
Gmothers        1     1           0.44499     0.0627      0.8028
Gfathers        1     1          12.42104     1.7501      0.1890
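As a cross-check on the JMP output for part (a), the summary statistics can be recomputed from the sums of squares and the parameter table. The following is a minimal sketch using only numbers copied from the output; the variable names are illustrative and are not part of the original solution.

```python
import math

# Values copied from the JMP output for Problem 1(a).
n, k = 100, 4                       # observations, explanatory variables
ss_model, ss_error, ss_total = 1929.5170, 674.2430, 2603.7600

df_model, df_error = k, n - k - 1   # 4 and 95
ms_model = ss_model / df_model      # Mean Square for Model
ms_error = ss_error / df_error      # Mean Square for Error

r_square = ss_model / ss_total      # share of variation explained
rmse = math.sqrt(ms_error)          # Root Mean Square Error
f_ratio = ms_model / ms_error       # overall F Ratio

# Each t Ratio is Estimate / Std Error.
table = {
    "Intercept": (3.2438212, 5.423412),
    "Mother": (0.4508583, 0.054502),
    "Father": (0.4111835, 0.049788),
    "Gmothers": (0.016553, 0.066107),
    "Gfathers": (0.0868583, 0.065657),
}
t_ratios = {term: est / se for term, (est, se) in table.items()}

print(f"RSquare={r_square:.5f}  RMSE={rmse:.6f}  F={f_ratio:.4f}")
for term, t in t_ratios.items():
    print(f"{term:10s} t = {t:.2f}")
```

The recomputed values should agree with the JMP tables up to rounding.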
(b) Will the multiple regression model typically be able to forecast a customer’s age at
death to within one year? Will the multiple regression model typically be able to forecast
a customer’s age at death to within six years? Justify your answers.
Solution:
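Since RMSE = 2.664075, about 95% of customers' ages at death will fall within
2*RMSE = 5.32815 years of the model's prediction, and only about 68% will fall within one
RMSE (about 2.66 years). So the model can typically forecast a customer's age at death to
within six years, but not to within one year.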
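The RMSE rule of thumb can be made slightly more concrete. The sketch below assumes that the prediction errors are approximately normal with standard deviation equal to the RMSE and ignores the uncertainty in the estimated coefficients, so the shares it gives are rough approximations only. The helper name share_within is hypothetical.

```python
from statistics import NormalDist

rmse = 2.664075  # Root Mean Square Error from the JMP output


def share_within(tolerance, sd):
    """Approximate share of errors within +/- tolerance, assuming normal errors."""
    z = tolerance / sd
    return 2 * NormalDist().cdf(z) - 1


share_1yr = share_within(1, rmse)   # forecasts within one year
share_6yr = share_within(6, rmse)   # forecasts within six years

print(f"within 1 year:  about {share_1yr:.0%}")
print(f"within 6 years: about {share_6yr:.0%}")
```

Under these assumptions, only a minority of forecasts land within one year, while the large majority land within six years.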
(c) Examine the residual plot vs. predicted, the normal quantile plot of the residuals and
the Cook’s distances and leverages. Would you recommend that we try any
transformations? Are there any points that are highly influential that need to be further
investigated? [Don’t do any further analysis – just say what you would do next (if
anything)]
Solution:
[Figure: Bivariate Fit of Residual Longevity By Predicted Longevity]
[Figure: Normal Quantile Plot of Residual Longevity]
The plot of residuals vs. predicted values shows no pattern, so there is no evidence that
linearity is violated, and the normal quantile plot shows no violation of normality.
No point has a Cook's distance greater than 1 (so there are no influential points) and
no point has leverage greater than 15/100 = 0.15 (so there are no high-leverage points).
The assumptions of the multiple linear regression model appear to be satisfied, so I would
not recommend any transformations or any further investigation of individual points.
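The cutoffs used above can be written down explicitly. The sketch below assumes that the 15/100 leverage cutoff comes from the rule of thumb 3(K+1)/n with K = 4 explanatory variables and n = 100 observations; the solution does not state the rule, so treat that as an assumption. The function name is hypothetical.

```python
def leverage_cutoff(n_obs, n_predictors):
    """Rule-of-thumb leverage cutoff 3(K+1)/n (assumed, not stated in the solution)."""
    return 3 * (n_predictors + 1) / n_obs


COOKS_CUTOFF = 1.0  # Cook's distance above 1 flags an influential point

cutoff = leverage_cutoff(n_obs=100, n_predictors=4)
print(f"leverage cutoff = {cutoff:.2f}, Cook's distance cutoff = {COOKS_CUTOFF}")
```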
2. Problem 1 continued.
(a) Find a 95% confidence interval for the change in the mean age at death of customer
that is associated with a one year increase in the age at death of mother, holding fixed age
at death of father, mean age at death of grandmothers and mean age at death of grandfathers.
Solution:
The 95% CI is (0.34, 0.56).
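The interval can be reproduced by hand as estimate ± t × (standard error), using the Mother row of the parameter estimates. This is a minimal sketch; the critical value 1.9853 is an assumed table value for the 0.975 quantile of the t distribution with 95 degrees of freedom, not a number taken from the JMP output.

```python
estimate = 0.4508583    # coefficient on Mother
std_error = 0.054502    # its standard error
t_crit = 1.9853         # assumed table value: t quantile 0.975, df = 95

margin = t_crit * std_error
lower, upper = estimate - margin, estimate + margin

print(f"95% CI for the Mother coefficient: ({lower:.2f}, {upper:.2f})")
```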
(b) Is there strong evidence that mean age at death of grandmothers is useful for
predicting customer’s age of death, not taking into account any of the other explanatory
variables? Justify your answer using a test.
Solution:
The test is $H_0: \beta_{\text{grandmother}} = 0$ in the simple linear regression of longevity on gmothers.
From this simple regression model, we know that the p-value of this test is 0.0002, so there is
strong evidence that mean age at death of grandmothers is useful for predicting customer’s age of death.
(c) Is there strong evidence that mean age at death of grandmothers is useful for
predicting customer’s age of death once age at death of mother, age at death of father and
mean age at death of grandfathers have been taken into account? Justify your answer using
a test.
Solution:
The test is $H_0: \beta_{\text{grandmother}} = 0$ in the multiple regression model.
When taking into account the age at death of mother, age at death of father and mean age
at death of grandfathers, the p-value is 0.8028. We fail to reject the null hypothesis. This
means there is no evidence that the mean age at death of grandmothers is useful once age at
death of mother, age at death of father and mean age at death of grandfathers have been
taken into account.
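The p-value of 0.8028 comes from the t ratio in the Gmothers row, and the F Ratio in the Effect Tests table is the square of that t ratio. A minimal sketch of the arithmetic follows; the critical value 1.9853 is an assumed table value for the t distribution with 95 degrees of freedom.

```python
estimate = 0.016553     # coefficient on Gmothers in the multiple regression
std_error = 0.066107    # its standard error
t_crit = 1.9853         # assumed table value: t quantile 0.975, df = 95

t_ratio = estimate / std_error   # test statistic for H0: coefficient = 0
f_ratio = t_ratio ** 2           # matches the Effect Tests F Ratio for Gmothers
reject_h0 = abs(t_ratio) > t_crit

print(f"t = {t_ratio:.2f}, F = {f_ratio:.4f}, reject H0 at 5%: {reject_h0}")
```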
(d) Find a 95% prediction interval for the age at death of an individual man whose mother
lived to be 70, whose father lived to be 75, whose grandmothers’ average lifetime was 80
years and whose grandfathers’ average lifetime was 78 years.
Solution:
The 95% prediction interval is (67.76, 79.72).
(e) Find a 95% confidence interval for the mean age of death of men whose mothers live
to be 70, whose fathers live to be 75, whose grandmothers’ average lifetime was 80 years
and whose grandfathers’ average lifetime was 78 years.
Solution:
The 95% confidence interval is (70.95, 76.53).
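Both the prediction interval in (d) and the confidence interval in (e) are centered at the same fitted value, and the prediction interval is wider because it also includes the variability of an individual around the mean. The sketch below recomputes the fitted value from the parameter estimates and backs out the implied standard errors; the critical value 1.9853 is an assumed table value for the t distribution with 95 degrees of freedom.

```python
import math

coef = {"Intercept": 3.2438212, "Mother": 0.4508583, "Father": 0.4111835,
        "Gmothers": 0.016553, "Gfathers": 0.0868583}
x = {"Mother": 70, "Father": 75, "Gmothers": 80, "Gfathers": 78}

# Fitted value: intercept plus each coefficient times its explanatory variable.
predicted = coef["Intercept"] + sum(coef[name] * value for name, value in x.items())

pred_interval = (67.76, 79.72)   # reported in (d)
conf_interval = (70.95, 76.53)   # reported in (e)
rmse = 2.664075
t_crit = 1.9853                  # assumed table value: t quantile 0.975, df = 95

# Standard errors implied by the interval half-widths.
se_mean = (conf_interval[1] - conf_interval[0]) / 2 / t_crit
se_pred = (pred_interval[1] - pred_interval[0]) / 2 / t_crit

# For an individual: se_pred^2 = RMSE^2 + se_mean^2 (up to rounding).
se_pred_check = math.sqrt(rmse ** 2 + se_mean ** 2)

print(f"fitted value = {predicted:.2f}")
print(f"se_mean = {se_mean:.3f}, se_pred = {se_pred:.3f}, check = {se_pred_check:.3f}")
```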
3. Some believe that individuals with a constant sense of time urgency (often called
type-A behavior) are more susceptible to heart disease than are more relaxed individuals.
Although most studies of this issue have focused on individuals, some psychologists have
investigated geographical areas. They considered the relationship of city-wide heart
disease rates and general measures of the pace of life in the city.
For each region of the United States (Northeast, Midwest, South and West), they
selected three large metropolitan areas, three medium-size cities and three smaller cities.
In each city they measured three indicators of the pace of life. The variable walk is the
walking speed of pedestrians over a distance of 60 feet during business hours on a clear
summer day along a main downtown street. Bank is the average time a sample of bank
clerks takes to make change for two $20 bills or to give $20 bills for change. The
variable talk was obtained by recording responses of postal clerks explaining the
difference between regular, certified and insured mail and by dividing the total number of
syllables by the time of their response. The researchers also obtained the age-adjusted
death rates from ischemic heart disease (a decreased flow of blood to the heart) for each
city (heart). The data is in paceoflife.JMP. The variables have been standardized, so
there are no units of measurement involved.
(a) Draw a scatterplot matrix for heart, bank, walk and talk. Does the scatterplot matrix
suggest that any transformations of the explanatory variables are needed for multiple
regression analysis?
Solution:
Multivariate
Correlations
           BANK      WALK      TALK     HEART
BANK     1.0000    0.0674    0.3520    0.3176
WALK     0.0674    1.0000    0.3274    0.3477
TALK     0.3520    0.3274    1.0000    0.0999
HEART    0.3176    0.3477    0.0999    1.0000
[Figure: Scatterplot Matrix of BANK, WALK, TALK and HEART]
This scatterplot matrix does not suggest that any transformation of the variables is needed.
(b) Compute the multiple linear regression of heart on bank, walk and talk.
Solution:
[Figure: Actual by Predicted Plot. HEART Actual vs. HEART Predicted; P=0.0416, RSq=0.22, RMSE=4.805]
Summary of Fit
RSquare                       0.223642
RSquare Adj                   0.150858
Root Mean Square Error        4.804986
Mean of Response              19.80556
Observations (or Sum Wgts)    36
Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        3         212.82642        70.9421     3.0727      0.0416
Error       32         738.81246        23.0879
C. Total    35         951.63889
Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    3.1786957    6.336946        0.50      0.6194
TALK         -0.17961     0.222215       -0.81      0.4249
WALK         0.4516011    0.200874        2.25      0.0316
BANK         0.405217     0.197102        2.06      0.0480
[Figure: Residual by Predicted Plot. HEART Residual vs. HEART Predicted]
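With only 36 cities and three explanatory variables, the gap between RSquare and RSquare Adj is noticeable. The sketch below checks the sum-of-squares decomposition and recomputes the adjusted R-square from the values in the output; the variable names are illustrative only.

```python
# Values copied from the JMP output for Problem 3(b).
n, k = 36, 3
ss_model, ss_error, ss_total = 212.82642, 738.81246, 951.63889

# The total sum of squares splits into a model part and an error part.
decomposition_gap = abs((ss_model + ss_error) - ss_total)

r_square = ss_model / ss_total
# The adjustment penalizes R-square for the number of explanatory variables.
r_square_adj = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(f"SS gap = {decomposition_gap:.5f}")
print(f"RSquare = {r_square:.6f}, RSquare Adj = {r_square_adj:.6f}")
```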
(c) Construct residual plots of the residuals versus each of the explanatory variables.
Comment on these residual plots, the residual plot vs. predicted, normal quantile plot of
the residuals and the Cook’s distances and leverages. Would you recommend that we try
any transformations? Are there any points that are highly influential that need to be
further investigated? [Don’t do any further analysis -- just say what you would do next
(if anything)]
Solution:
[Figure: Bivariate Fit of Residual HEART By BANK, with linear fit]
[Figure: Bivariate Fit of Residual HEART By WALK, with linear fit]
[Figure: Bivariate Fit of Residual HEART By TALK, with linear fit]
[Figure: Distribution and Normal Quantile Plot of Residual HEART]
[Figure: Bivariate Fit of Residual HEART By Predicted HEART, with linear fit]
From these plots, we do not see any indication that the assumptions of linearity, constant
variance or normality are violated, so no transformations appear to be needed. Also, the
highest Cook’s distance is 0.22, which is less than 1, so there do not seem to be any highly
influential points that need further investigation.
4. Problem 3 continued
(a) Is there strong evidence that the multiple regression model using the three pace of life
variables bank, walk and talk provides better predictions of heart than using the sample
mean of heart to predict heart? Justify your answer using a test.
Solution:
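The overall F-test of the null hypothesis that the coefficients on bank, walk and talk are
all zero gives F = 3.0727 with a p-value of 0.0416. Since 0.0416 < 0.05, we reject the null
hypothesis at the 5% level: there is evidence that the multiple regression model provides
better predictions of heart than the sample mean.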
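The p-value of the overall F-test is the upper-tail area of the F distribution with 3 and 32 degrees of freedom beyond the observed F ratio. The sketch below approximates that area with Simpson's rule using only the standard library; it is an illustrative numerical check, not how JMP computes the value, and the function names are hypothetical.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0:
        return 0.0  # correct at x = 0 only when d1 > 2, which holds here
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_upper_tail(f_obs, d1, d2, steps=20000):
    """Approximate P(F > f_obs) as 1 minus a Simpson's-rule integral on [0, f_obs]."""
    h = f_obs / steps  # steps must be even for Simpson's rule
    total = f_pdf(0.0, d1, d2) + f_pdf(f_obs, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1 - total * h / 3


f_ratio = 70.9421 / 23.0879          # Mean Square Model / Mean Square Error
p_value = f_upper_tail(f_ratio, 3, 32)
print(f"F = {f_ratio:.4f}, approximate p-value = {p_value:.4f}")
```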
(b) Although there may be many lurking variables and other problems with this study,
comment on whether the signs of the coefficients on bank, walk and talk are consistent
with the hypothesis that type-A individuals are more susceptible to heart disease.
Solution:
For cities with a fast pace of life, we would expect bank to be low, walk to be high and
talk to be high. Consequently, if type-A individuals are more susceptible to heart disease,
we would expect the coefficient on bank to be negative, walk to be positive and talk to be
positive. In fact, the coefficient on bank is positive, walk is positive and talk is negative
so the coefficient on walk is consistent with the hypothesis but the coefficients on bank
and talk are contrary to the hypothesis.
(c) A critic of this study says that it is not pace of life which causes heart disease but
smoking, which is associated with a fast pace of life, that causes heart disease. If you had
data on the smoking rates for each city, how would you use multiple regression analysis
to examine the critic’s claim? Describe briefly what multiple regression model you
would fit and what you would look for.
Solution:
The critic is claiming that smoking is a lurking variable and that once smoking is
controlled for, the coefficients on bank, walk and talk should be zero. To examine the
critic’s claim, I would fit a multiple regression of heart on the explanatory variables bank,
walk, talk and smoking rate. I would do t-tests of whether the coefficients on bank, walk
and talk are zero to test the critic’s claim.
(d) Salt Lake City has a predominately Mormon population. The Mormon religion
strongly encourages hard work but prohibits smoking. Compute the residual for Salt
Lake City for the multiple regression in problem 3. Does the sign of the residual provide
support or not provide support for the critic’s claim in part (c) that smoking, which is
associated with a fast pace of life, rather than a fast pace of life itself, causes heart disease?
Explain briefly.
Solution:
If the critic’s claim is correct, Salt Lake City should have a relatively small heart disease
rate, since it has a low smoking rate, but a relatively high predicted heart disease rate
from the regression in problem 3 because it has a fast pace of life (assuming that fast pace
of life is in fact associated with heart disease rate; the regression in Problem 3 provides
mixed evidence for this). Thus, if the critic’s claim is correct, we would expect the
residual for Salt Lake City to be negative. In fact the residual for Salt Lake City is
positive, meaning that it does not provide support for the critic's claim in (c).