Stat 557
Solutions to Assignment 6
Fall 2002
1. Two answers can be derived for problem 1. We first present a solution where π_i is the probability that
a female turtle emerges from a randomly selected egg incubated at the i-th temperature. Then we present a
solution where π_i is the probability that a male turtle emerges from a randomly selected egg incubated at
the i-th temperature. The data file for the Illinois turtle eggs has three sets of counts for each temperature. I
made a new data file with five lines, one for each temperature, where I combined the counts at each
temperature. Values of goodness-of-fit statistics were obtained by applying the GENMOD procedure to
this second data file. The same parameter estimates are obtained with either of the data files.
(a)
Using π_i to represent the probability that a female turtle emerges from a randomly selected egg
incubated at the i-th temperature, the estimated logistic regression model is
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = \underset{(12.022)}{-61.318} + \underset{(0.431)}{2.211}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. Goodness-of-fit
statistics could be computed in two ways. There are only 5 temperature levels. If you total the
counts for the five temperature levels before fitting the model, you obtain G² = 14.86 and X² =
14.95 with 3 d.f. (p-value < .005). This model does not fit the data well. It does not allow the
probability of a female to increase quickly enough between 27.2 and 28.5 °C.
If you fit the model with 15 sets of counts, three for each temperature, you obtain G² = 24.94
and X² = 26.24 with 13 d.f. (p-value = .02). The first approach has more power for detecting
that the proposed model may be inadequate because the counts are larger. Also, the chi-square
approximation to the null distributions of G² and X² would give more accurate p-values and better
control of the type I error level because the counts are larger. The second approach allows you to
examine variation among replicates at the same temperature, which could reveal that the probability of
success (hatching a female) varies among replicates at the same temperature. That would be a
violation of the independent binomial model used to establish the likelihood function.
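To see how slowly the fitted logistic curve rises over the critical range, one can evaluate it at the two temperatures mentioned in part (a). The following is a minimal sketch that plugs the reported coefficients into the inverse logit; the function name `logistic_prob` is a label chosen here for illustration and is not part of the original GENMOD analysis.

```python
import math

# Reported estimates for the female-probability logistic model in part (a).
B0, B1 = -61.318, 2.211

def logistic_prob(temperature, b0=B0, b1=B1):
    """Fitted P(female) at a given incubation temperature (inverse logit)."""
    eta = b0 + b1 * temperature
    return 1.0 / (1.0 + math.exp(-eta))

for t in (27.2, 28.5):
    print(f"{t:.1f} C: fitted P(female) = {logistic_prob(t):.3f}")
```

Under this fit the probability of a female moves from roughly 0.24 at 27.2 °C to roughly 0.84 at 28.5 °C, which is the gradual rise that the goodness-of-fit tests flag as too slow.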
(b)
The estimated complementary log-log regression model is
$$\log\left(-\log(1-\hat{\pi}_i)\right) = \underset{(4.474)}{-24.983} + \underset{(0.158)}{0.889}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. If you total the
counts across the three replicates at each temperature before fitting the model, the goodness-of-fit test
statistics are G² = 22.91 and X² = 21.34 with 3 d.f. (p-value < .005). This model does not fit
the data well.
If you fit the model with 15 sets of counts, three for each temperature, you obtain G² = 32.99
and X² = 33.8 with 13 d.f. (p-value < .01).
(c)
Consider a model of the form
$$\log\left(\frac{(1-\pi_i)^{-\alpha} - 1}{\alpha}\right) = \beta_0 + \beta_1\,(\text{temperature}).$$
The approximate procedure for selecting a value of α yields
$$\hat{\alpha} = 1 - \hat{\gamma} = 1 - (-1.99) \approx 3.00.$$
Larger values of α lead to models that appear to fit the data a little better. Some deviance values are
shown below. These were obtained by first totaling the counts at each of the five temperature
levels before fitting the model; you would get different deviance values if you used the 15 sets of
counts that are posted on the data file.

Alpha    Deviance
1.0      14.86
3.0       9.82
10.0      6.81
20.0      6.44
100.0     6.37

As α increases the curve rises more rapidly between 27.2 and 28.5 °C. Determination of the sex
of turtle hatchlings appears to take place in a very narrow temperature range. Below 27.2 °C most
hatchlings are male and above 28.5 °C most hatchlings are female. The model for α = 10
appears to be adequate and there seems to be no need to go beyond α = 20. The estimated model
for α = 20 is
$$\log\left(\frac{(1-\hat{\pi}_i)^{-20} - 1}{20}\right) = \underset{(114.21)}{-763.63} + \underset{(4.18)}{28.03}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. For this model G² =
6.44 and X² = 6.75 with 3 d.f. (p-value > .05).
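The link in part (c) can be inverted in closed form, since (1 − π)^(−α) = 1 + α·exp(η), where η is the linear predictor. The sketch below is one way to write that inverse and evaluate the reported α = 20 fit; the name `inverse_link` is a hypothetical helper, and `log1p` is used only as a numerical convenience. Setting α = 1 recovers the logistic model, and letting α approach 0 recovers the complementary log-log model.

```python
import math

def inverse_link(eta, alpha):
    """Solve log(((1 - p)**(-alpha) - 1) / alpha) = eta for p."""
    # Rearranged: (1 - p)**(-alpha) = 1 + alpha * exp(eta)
    return 1.0 - math.exp(-math.log1p(alpha * math.exp(eta)) / alpha)

# Reported alpha = 20 fit for the Illinois eggs (female probability).
ALPHA, B0, B1 = 20.0, -763.63, 28.03

for t in (27.2, 28.5):
    p = inverse_link(B0 + B1 * t, ALPHA)
    print(f"{t:.1f} C: fitted P(female) = {p:.3f}")
```

With α = 20 the fitted probability at 27.2 °C is well below the logistic fit from part (a), while the value at 28.5 °C is similar, so the curve is steeper across the critical range.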
(d)
For the model in part (c), compute
$$T_{\text{Ill},0.5} = \frac{\log\left(\dfrac{2^{\alpha}-1}{\alpha}\right) - \hat{\beta}_0}{\hat{\beta}_1} = 27.63\ ^{\circ}\text{C}.$$
Using the delta method, the large sample standard error for this estimate is the square root of
$$\begin{bmatrix} -0.035674 & -0.98567 \end{bmatrix}
\begin{bmatrix} 13043.62 & -477.02 \\ -477.02 & 17.4478 \end{bmatrix}
\begin{bmatrix} -0.035674 \\ -0.98567 \end{bmatrix} = .004387.$$
An approximate 95% confidence interval is
$$27.63 \pm (1.96)(.06623) \;\Rightarrow\; (27.50,\ 27.76).$$
(e)
Fitting the model in part (c) with α = 20 to the New Mexico data yields
$$\log\left(\frac{(1-\hat{\pi}_i)^{-20} - 1}{20}\right) = \underset{(139.91)}{-455.58} + \underset{(5.06)}{16.57}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. For this model G² =
1.29 and X² = 1.30 with 1 d.f. (p-value > .25). This model appears to be adequate.
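The temperature giving a 50% chance of a female follows from setting π = 0.5 in the model of part (c), which makes (1 − π)^(−α) equal to 2^α. The following sketch applies that inversion to the two reported α = 20 fits and forms the interval in part (d) from the quoted delta-method variance. The helper names are hypothetical, and the variance is taken as given rather than recomputed.

```python
import math

def median_temperature(b0, b1, alpha):
    """Temperature at which the fitted probability equals 0.5."""
    # At p = 0.5, (1 - p)**(-alpha) = 2**alpha.
    return (math.log((2.0 ** alpha - 1.0) / alpha) - b0) / b1

def wald_interval(estimate, variance, z=1.96):
    """Approximate 95% interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Reported alpha = 20 coefficients for the Illinois and New Mexico eggs.
t_ill = median_temperature(-763.63, 28.03, 20.0)
t_nm = median_temperature(-455.58, 16.57, 20.0)

# Delta-method variance quoted in part (d) for the Illinois estimate.
ci_ill = wald_interval(t_ill, 0.004387)

print(f"Illinois:   T_0.5 = {t_ill:.2f} C, 95% CI = ({ci_ill[0]:.2f}, {ci_ill[1]:.2f})")
print(f"New Mexico: T_0.5 = {t_nm:.2f} C")
```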
(f)
For the model in part (e), compute
$$T_{\text{NM},0.5} = \frac{\log\left(\dfrac{2^{\alpha}-1}{\alpha}\right) - \hat{\beta}_0}{\hat{\beta}_1} = 28.15\ ^{\circ}\text{C}.$$
Using the delta method, the large sample standard error for this estimate is the square root of
$$\begin{bmatrix} -0.0603453 & -1.69859 \end{bmatrix}
\begin{bmatrix} 19574.33 & -707.95 \\ -707.95 & 25.6199 \end{bmatrix}
\begin{bmatrix} -0.0603453 \\ -1.69859 \end{bmatrix} = .067271.$$
An approximate 95% confidence interval is
$$28.15 \pm (1.96)(.25937) \;\Rightarrow\; (27.64,\ 28.66).$$
(g)
Assuming that results for the Illinois eggs are completely independent of results for the New
Mexico eggs, Var(T_Ill,0.5 − T_NM,0.5) = Var(T_Ill,0.5) + Var(T_NM,0.5), and a test statistic that
has an approximate standard normal distribution under the null hypothesis is
$$Z = \frac{28.15 - 27.63}{\sqrt{0.067271 + .004387}} = 1.94$$
with p-value = .052.
There is some indication that the temperature that produces 50% females is higher in New Mexico.
A more accurate inference could be made if more eggs from New Mexico were included in the
study.
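The comparison in part (g) is a two-sample z test on independent estimates. As a rough sketch, assuming a two-sided alternative (which is what the reported p-value corresponds to), the arithmetic can be reproduced with the standard library alone; `two_sample_z` is a hypothetical helper name.

```python
import math

def two_sample_z(est1, var1, est2, var2):
    """z statistic and two-sided normal p-value for independent estimates."""
    z = (est1 - est2) / math.sqrt(var1 + var2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# New Mexico versus Illinois, using the estimates and variances reported above.
z, p = two_sample_z(28.15, 0.067271, 27.63, 0.004387)
print(f"Z = {z:.2f}, two-sided p-value = {p:.3f}")
```

Most of the variance in the denominator comes from the New Mexico estimate, which is why additional New Mexico eggs would sharpen the comparison.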
Now we present a solution where π_i is the probability that a male turtle emerges from a randomly selected
egg incubated at the i-th temperature.
(a)
The estimated logistic regression model is
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = \underset{(12.022)}{61.318} - \underset{(0.431)}{2.211}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. Relabeling the response
does not change the fit of a logistic regression, so the goodness-of-fit statistics are the same as for
the female version: G² = 14.86 and X² = 14.95 with 3 d.f. (p-value < .005). This model is not adequate
for these data. It does not allow the probability of a male to decrease rapidly enough between 27.2 and 28.5 °C.
(b)
The estimated complementary log-log regression model is
$$\log\left(-\log(1-\hat{\pi}_i)\right) = \underset{(9.0407)}{55.6449} - \underset{(0.3261)}{1.8770}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. For this model G² =
10.97 and X² = 11.34 with 3 d.f. This model fits better than the logistic regression model in part
(a). It allows the probability of a male to decrease more quickly between 27.2 and 28.5 °C, but it
does not quite bend the curve fast enough.
(c)
Consider a model of the form
$$\log\left(\frac{(1-\pi_i)^{-\alpha} - 1}{\alpha}\right) = \beta_0 + \beta_1\,(\text{temperature}).$$
The procedure for selecting a value of α yields
$$\hat{\alpha} = 1 - \hat{\gamma} = 1 - (4.13) \;\Rightarrow\; 0.$$
The complementary log-log model is suggested.

Alpha    Deviance
1.0      14.86
0.5      13.08
0.2      11.86
0.1      11.42
0.01     11.02
0        10.97

Everyone using this setup selected the complementary log-log model from part (b). You could
have explored other models such as raising the cdf of the extreme value distribution to a power,
but we will not pursue this here.
(d)
For the model in part (b), compute
$$T_{\text{Ill},0.5} = \frac{\log(\log(2)) - \hat{\beta}_0}{\hat{\beta}_1} = 27.71\ ^{\circ}\text{C}.$$
Using the delta method, the large sample standard error for this estimate is the square root of
$$\begin{bmatrix} 0.532765 & 14.762875 \end{bmatrix}
\begin{bmatrix} 81.73365 & -2.94815 \\ -2.94815 & 0.10637 \end{bmatrix}
\begin{bmatrix} 0.532765 \\ 14.762875 \end{bmatrix} = .0064585.$$
An approximate 95% confidence interval is
$$27.71 \pm (1.96)(.080365) \;\Rightarrow\; (27.55,\ 27.87).$$
(e)
Fitting the complementary log-log model to the New Mexico data yields
$$\log\left(-\log(1-\hat{\pi}_i)\right) = \underset{(11.70)}{35.8123} - \underset{(0.417)}{1.2778}\,(\text{temperature}).$$
Standard errors are reported in parentheses below the estimated coefficients. For this model G² =
3.83 and X² = 3.63 with 1 d.f.
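The delta-method variance in part (d) is the quadratic form g′Vg, where V is the estimated covariance matrix of the two coefficients and g holds the partial derivatives of T with respect to them (here −1/β̂1 and −T/β̂1). The sketch below evaluates that quadratic form with the gradient and covariance values printed above; `delta_method_variance` is a hypothetical helper. The three terms nearly cancel, so the rounded inputs reproduce the reported .0064585 only approximately.

```python
import math

def delta_method_variance(grad, cov):
    """Quadratic form grad' * cov * grad for a 2-vector and a 2x2 matrix."""
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(2) for j in range(2))

# Gradient and covariance matrix as printed for the Illinois fit in part (d).
grad = [0.532765, 14.762875]
cov = [[81.73365, -2.94815],
       [-2.94815, 0.10637]]

var_t = delta_method_variance(grad, cov)
se_t = math.sqrt(var_t)
print(f"variance = {var_t:.5f}, standard error = {se_t:.4f}")
```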
(f)
For the model in part (e), compute
$$T_{\text{NM},0.5} = \frac{\log(\log(2)) - \hat{\beta}_0}{\hat{\beta}_1} = 28.31\ ^{\circ}\text{C}.$$
Using the delta method, the large sample standard error for this estimate is the square root of
$$\begin{bmatrix} 0.782595 & 22.15790 \end{bmatrix}
\begin{bmatrix} 136.84 & -4.87178 \\ -4.87178 & 0.17356 \end{bmatrix}
\begin{bmatrix} 0.782595 \\ 22.15790 \end{bmatrix} = .057494.$$
An approximate 95% confidence interval is
$$28.31 \pm (1.96)(.23978) \;\Rightarrow\; (27.84,\ 28.78).$$
(g)
Assuming that results for the Illinois eggs are completely independent of results for the New
Mexico eggs, Var(T_Ill,0.5 − T_NM,0.5) = Var(T_Ill,0.5) + Var(T_NM,0.5), and a test statistic that
has an approximate standard normal distribution under the null hypothesis is
$$Z = \frac{28.31 - 27.71}{\sqrt{0.057494 + .0064585}} = 2.37$$
with p-value = .018.
There is some indication that the temperature that produces 50% females is higher in New Mexico,
although this test is based on a complementary log-log model that does not quite fit the data.
The analyses of the turtle egg data shown above were based on the assumption that each egg responds
independently of any other egg. Each line of the original data file corresponds to a different box put into an
incubator. The recorded temperature was taken from a thermometer inside the incubator, but
temperatures may vary across different locations in an incubator and the temperature in any particular box
may have varied from the thermometer reading. Given the narrow temperature range in which the
probability of females rapidly increases, a small deviation in the temperature inside a box could have a big
effect on the proportion of female turtles emerging from the eggs in the box. Hence, results from the same
box may exhibit positive correlation due to fluctuation in temperature within an incubator. How should you
deal with this?
2. (a) Using π_i to represent the probability that black medic is present, the estimated logistic
regression model is
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = \underset{(0.4351)}{-1.154} + \underset{(0.1082)}{0.3652}\,(\text{mounds}).$$
Standard errors are reported in parentheses below the estimated coefficients. In this case, we
cannot reliably use large sample chi-square approximations for the null distributions of G² and
X² tests of the fit of this model against the general alternative.
(b)
Gamma = 0.796 indicates that this model assigns higher probabilities to most of the cases where
black medic is present than to cases where it is absent. In this sense, the model in part (a) seems to
be a reasonable approximation.
(c)
The value of the Hosmer-Lemeshow goodness-of-fit test is 39.76 with 5 d.f. and p-value < .0001.
In this case 7 categories are constructed:

                   Presence              Absence
Group   Total   Observed  Expected   Observed  Expected
1       21      2         5.03       19        15.97
2        6      3         2.21        3         3.79
3        8      6         4.16        2         3.84
4        7      6         4.63        1         2.37
5        6      4         4.62        2         1.38
6        7      7         6.37        0         0.63
7        9      8         8.97        1         0.03

Although the biggest absolute differences between the observed and expected counts occur in group
1, where there are no gopher mounds in the previous year and estimated probabilities of the
presence of black medic are relatively low, the major contribution to the Hosmer-Lemeshow test
statistic comes from the presence of case 15 (where black medic is absent) in category 7 (where the
model assigns very high probability to the presence of black medic). Note that (1 − .03)²/(.03) =
31.36 contributes the most to the Hosmer-Lemeshow test. The Hosmer-Lemeshow test is
sensitive to the existence of a single case where the actual outcome does not match the predicted
probability.
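The size of each cell's contribution can be checked directly from the seven-group table, since the statistic is a sum of (observed − expected)²/expected terms over the presence and absence cells. The sketch below uses the counts as printed; because the expected counts are rounded to two decimals, the total it produces only approximates the reported 39.76, and the variable names are illustrative.

```python
# Observed and expected counts from the seven-group table (presence modeled).
presence_obs = [2, 3, 6, 6, 4, 7, 8]
presence_exp = [5.03, 2.21, 4.16, 4.63, 4.62, 6.37, 8.97]
absence_obs = [19, 3, 2, 1, 2, 0, 1]
absence_exp = [15.97, 3.79, 3.84, 2.37, 1.38, 0.63, 0.03]

def pearson_terms(observed, expected):
    """Per-cell contributions (O - E)^2 / E."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]

presence_terms = pearson_terms(presence_obs, presence_exp)
absence_terms = pearson_terms(absence_obs, absence_exp)
hl_statistic = sum(presence_terms) + sum(absence_terms)
largest = max(presence_terms + absence_terms)

for group, (p, a) in enumerate(zip(presence_terms, absence_terms), start=1):
    print(f"group {group}: presence {p:6.2f}, absence {a:6.2f}")
print(f"total from rounded table = {hl_statistic:.2f}")
```

The absence cell of group 7 accounts for most of the total, which matches the observation that a single case drives the test.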
Several diagnostic measures (c, cbar, difdev, difchisq, and the dfbeta values for both the intercept
and mound effect) indicate that case 15 might be an outlier. It is not a high leverage case. Black
medic was not present, but the model gives a very high probability to the presence of black medic
because there were 16 gopher mounds present in the previous year. Black medic was present in all
other cases where at least 8 gopher mounds were present in the previous year. This is a valid data
point, however, and it cannot be simply thrown away.
If you use π_i to represent the probability that black medic is absent, the estimated logistic
regression model is
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = \underset{(0.4351)}{1.154} - \underset{(0.1082)}{0.3652}\,(\text{mounds}).$$
This is the same model, but the value of the Hosmer-Lemeshow test (Goodness-of-fit Statistic =
11.364 with 5 df and p-value = 0.0446) is different because slightly different categories are made:

                   Absence               Presence
Group   Total   Observed  Expected   Observed  Expected
1        6       0         0.00       6         6.00
2        6       1         0.19       5         5.81
3        7       2         1.06       5         5.94
4       10       1         3.15       9         6.85
5        8       2         3.84       6         4.16
6        6       3         3.79       3         2.21
7       21      19        15.97       2         5.03

Now case 15 is in group 2. Note that (1 − .19)²/(.19) = 3.45 contributes the most to the
Hosmer-Lemeshow test.
3. (a) Using π_i to represent the probability that black medic is present, the estimated logistic
regression model is
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = \underset{(1.0634)}{-4.0389} + \underset{(0.1076)}{0.2790}\,(\text{mounds}) + \underset{(0.3136)}{1.0712}\,(\text{elevation}).$$
Standard errors are reported in parentheses below the estimated coefficients.
(b)
G² = 61.857 − 44.174 = 17.683 with 1 df and p-value = 0.000026. Adding the elevation
term provides a significant improvement in the model.
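The likelihood ratio test in part (b) differences two fit statistics and refers the result to a chi-square distribution with 1 d.f. For a single degree of freedom the upper-tail probability has a closed form in terms of the complementary error function, so a small sketch needs only the standard library. The function name `lr_test_1df` is hypothetical.

```python
import math

def lr_test_1df(stat_reduced, stat_full):
    """Likelihood ratio statistic and p-value for one added parameter."""
    g2 = stat_reduced - stat_full
    # For 1 d.f., P(chi-square > g2) = erfc(sqrt(g2 / 2)).
    p_value = math.erfc(math.sqrt(g2 / 2.0))
    return g2, p_value

# Mounds-only model versus mounds + elevation, values as reported in part (b).
g2, p = lr_test_1df(61.857, 44.174)
print(f"G2 = {g2:.3f}, p-value = {p:.6f}")
```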
(c)
Gamma = 0.839
(d)
As in problem 2, the value of the Hosmer-Lemeshow test depends on the definition of π_i. Using
π_i to represent the probability that black medic is present, the following categories are made:

                   Presence              Absence
Group   Total   Observed  Expected   Observed  Expected
1        6       0         0.23       6         5.77
2        6       0         0.36       6         5.64
3        6       1         0.69       5         5.31
4        6       1         1.66       5         4.34
5        6       4         2.74       2         3.26
6        6       5         4.46       1         1.54
7        6       5         5.02       1         0.98
8        6       6         5.30       0         0.70
9        6       4         5.61       2         0.39
10       6       6         5.93       0         0.07
11       4       4         4.00       0         0.00

Goodness-of-fit Statistic = 10.319 with 9 df and p-value = 0.3253. The inclusion of the
linear elevation effect reduces the estimated probability for case 15 from .93 to .73. Case 15 is
now in group 9. Note that (2 − .39)²/(.39) = 7.20 contributes the most to the Hosmer-Lemeshow
test, but it is not enough to reject the fit of the model in this case. Including the elevation term also
improves the model by allowing it to give estimated probabilities closer to zero when there were
no gopher mounds in the previous year.
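The p-values quoted for the Hosmer-Lemeshow statistics are upper-tail chi-square probabilities with (number of groups − 2) degrees of freedom. As a sketch that avoids any external library, the function below uses the standard finite-sum expressions for integer degrees of freedom; `chi2_sf` is a name chosen here, and a statistics package would normally supply this function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite sum of odd powers of sqrt(x).
    root = math.sqrt(x)
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total

# Hosmer-Lemeshow statistic reported above: 11 groups, so 9 d.f.
print(f"p-value = {chi2_sf(10.319, 9):.4f}")
```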
There are no cases with high leverage that are cause for concern. Cases 1, 15, and 45 are
identified as highly influential cases. Case 45 is one of two cases where black medic is present
and there were no gopher mounds in the previous year. The other case (case 21) has a much
higher elevation. Removing case 45 results in a smaller intercept to allow the estimated
probability of black medic to become closer to zero when mounds=0 and elevation is low. Case 1
is the only high elevation case where black medic is absent. Removing case 1 results in an
increase in the slope on elevation and a decrease in the intercept. Case 15 is the only case where
there are more than 8 mounds and black medic is absent. Deleting this case results in a larger
estimated slope for the mounds variable. All of these are valid data points and I would not
remove any of these cases from the data. Knowing that the estimated coefficients are sensitive to
the existence of a few cases in this data set, however, may affect your confidence in the estimated
model and influence your decision about whether or not more data should be collected.
Using π_i to represent the probability that black medic is absent, the following categories are made:

                   Absence               Presence
Group   Total   Observed  Expected   Observed  Expected
1        6       0         0.01       6         5.99
2        6       1         0.17       5         5.83
3        6       1         0.47       5         5.53
4        6       0         0.82       6         5.18
5        6       1         1.14       5         4.86
6        6       1         2.00       5         4.00
7        6       3         3.68       3         2.32
8        7       6         5.60       1         1.40
9        6       6         5.51       0         0.49
10       9       9         8.61       0         0.39

Goodness-of-fit Statistic = 7.8805 with 8 df and p-value = 0.4452. Case 15 is in group 2, and
(1 − .17)²/(.17) = 4.05 contributes the most to the Hosmer-Lemeshow statistic.
4. (a)
Using π_i to represent the probability that black medic is present, the estimated logistic
regression model is
$$\log\left(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\right) = \underset{(1.1690)}{-3.6062} + \underset{(0.2903)}{0.0937}\,(\text{mounds}) + \underset{(0.3995)}{0.8775}\,(\text{elevation}) + \underset{(0.1167)}{0.0770}\,(\text{mounds})(\text{elevation}).$$
Standard errors are reported in parentheses below the estimated coefficients.
(b)
G² = 44.1740 − 43.7143 = 0.4597 with 1 df and p-value = 0.498. Adding the
interaction term does not provide a significant improvement in the model.
(c)
The AIC and SC values are:

Model       AIC     SC      Gamma
Problem 2   65.86   70.17   0.796
Problem 3   50.17   56.65   0.839
Problem 4   51.71   60.35   0.841

These values indicate that the model from problem 3 is an improvement over the model from
problem 2, and the model from problem 3 is essentially as good as the model from problem 4.
Note that the value of gamma steadily increases as the model is made more complex.
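The AIC and SC columns can be reproduced from the fit statistics used in the likelihood ratio tests, taking AIC as the statistic plus 2k and SC as the statistic plus k·log(n), where k counts the estimated coefficients. The sketch below rests on two assumptions that are not stated outright in the text: that 61.857, 44.174, and 43.7143 are the −2 log-likelihood values of the three models, and that n = 64, the sum of the group totals in the Hosmer-Lemeshow tables.

```python
import math

N_CASES = 64  # assumption: sum of the group totals in the Hosmer-Lemeshow tables

# (assumed -2 log-likelihood, number of estimated coefficients incl. intercept)
models = {
    "Problem 2": (61.857, 2),
    "Problem 3": (44.174, 3),
    "Problem 4": (43.7143, 4),
}

def aic(neg2loglik, k):
    """Akaike information criterion."""
    return neg2loglik + 2 * k

def sc(neg2loglik, k, n):
    """Schwarz criterion."""
    return neg2loglik + k * math.log(n)

for name, (stat, k) in models.items():
    print(f"{name}: AIC = {aic(stat, k):.2f}, SC = {sc(stat, k, N_CASES):.2f}")
```

Both criteria penalize the interaction term in problem 4 by more than it improves the fit, which is why the model from problem 3 has the smallest values.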
(d)
Predicted probabilities tend to be nearly the same for the models from problems 3 and 4.
For many cases with zero gopher mounds in the previous year, where black medic tends to be
absent, the model from problem 3 tends to provide probabilities much closer to zero than the
model from problem 2. Standard errors and lengths of confidence intervals for estimated
probabilities can be much larger for the model from problem 4 than for the model from problem 3.
This is the consequence of adding an insignificant interaction term to the model.
(e)
Some diagnostic results were described above. It appears that the model from problem 3 provides
an adequate description, but values of parameter estimates are sensitive to the presence or absence
of cases 1, 15, or 45.