Stat 557
Solutions to Assignment 7
Fall 2002
1. (a)
After adjusting for the effects of the other variables in the model, only sex (p=.014) and age
(p=.0013) appear to have a significant association with the incidence of bile duct hyperplasia
(BDH). This does not mean that other variables are completely unimportant. Putting several
highly correlated variables into a model can result in none of the variables being declared
significant after adjusting for the effects of the other variables, even though some of the variables
have significant marginal associations with the response variable. The ratio of the global deviance
to its degrees of freedom for this model is 321.475/310 = 1.037. This does not reveal any serious deficiency in the model. The
level of concordance between the observed results and the predicted probabilities provided by the
estimated model is reflected in a moderate gamma value of .416. Note that there are a substantial
number of cases with estimated probabilities for BDH between 0.3 and 0.7. In such situations it is
not unusual to observe a relatively large number of discordant pairs among the observed binary
responses and the predicted probabilities.
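The gamma statistic mentioned above can be read as (C - D)/(C + D), where C and D count the concordant and discordant pairs formed from one case with BDH and one case without. The following is a minimal Python sketch of that calculation. The data are made up for illustration (they are not the rat data), and the function name is hypothetical.

```python
from itertools import product

def goodman_kruskal_gamma(y, p):
    """Gamma = (C - D) / (C + D) over all (event, non-event) pairs.

    A pair is concordant when the case with the event has the larger
    fitted probability and discordant when it has the smaller one.
    Tied pairs count toward neither total.
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = discordant = 0
    for pe, pn in product(events, non_events):
        if pe > pn:
            concordant += 1
        elif pe < pn:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical data: most fitted probabilities fall between 0.3 and 0.7.
y = [1, 0, 1, 0, 1, 0]
p = [0.6, 0.4, 0.35, 0.5, 0.7, 0.3]
gamma = goodman_kruskal_gamma(y, p)
```

When many fitted probabilities sit in the middle of the range, events and non-events overlap heavily, so discordant pairs are common and gamma stays moderate.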
(b)
The deviance plot levels out below the global deviance, indicating that some of the information
in the explanatory variables has not been captured by this model. Note that warning messages
appear when 15 cluster effects are added to the logistic regression model. This happens because
clusters 10, 12 and 15 contain either all successes or all failures. It can also happen when a
combination of values for the explanatory variables corresponds to a situation consisting of either
all successes or all failures. Maximum likelihood estimates do not exist in this case. You should
refit the model after removing some cluster effects. Numerical problems are avoided by adding
effects for clusters 1-9, 11, 13, and 14 to the original logistic regression model. Now the deviance
plot levels out only slightly below the line for the global deviance of the original model, suggesting
that only slight improvements can be made to the model.
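One way to anticipate the warning messages is to check, before fitting, which clusters contain only successes or only failures, since the indicator for such a cluster has no finite maximum likelihood estimate. The sketch below uses invented toy data and a hypothetical helper name. It is only an illustration of the check, not the procedure used to produce the results above.

```python
from collections import defaultdict

def separated_clusters(cluster_ids, responses):
    """Return the clusters whose binary responses are all 0 or all 1."""
    by_cluster = defaultdict(list)
    for cid, y in zip(cluster_ids, responses):
        by_cluster[cid].append(y)
    return sorted(cid for cid, ys in by_cluster.items() if len(set(ys)) == 1)

# Hypothetical toy data, not the rat study.
clusters = [1, 1, 2, 2, 3, 3, 3]
bdh = [0, 1, 1, 1, 0, 0, 0]
flagged = separated_clusters(clusters, bdh)  # clusters 2 and 3 here
```

Effects for the flagged clusters would be the ones to leave out when refitting.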
(c)
The added variable partial residual plots do not reveal much about binary explanatory variables
like sex and the dummy variables for the cage tier levels. A nearly horizontal line suggests that a
binary explanatory variable has little association with the incidence of BDH after adjustments are
made for other terms in the model.
The partial residual plot for initial weight is nearly a straight line with zero slope, indicating that
initial weight has essentially no association with incidence of BDH that cannot be attributed to
other variables in the model. The added variable partial residual plot for PBB shows curvature at
the lower levels of PBB with an asymptote as PBB increases. Given the nature of the spacing
between the levels of PBB, replacing PBB with log(PBB) may slightly improve the model. The
added variable partial residual plot for age indicates that after adjusting for tiers, initial weight, sex
and level of exposure to PBB, the incidence of BDH gradually decreases as age increases. These
plots could change if some variables are removed from the model.

(d)
$\hat\beta_5 = -1.22$ with standard error 0.4964, and $\exp(\hat\beta_5) = 0.295$ is an estimate of a common
conditional odds ratio
$$\frac{\text{odds of BDH in females}}{\text{odds of BDH in males}}$$
for any combination of levels for age, initial weight, level of exposure to PBB, and cage tier.
An approximate 95% confidence interval is
$0.295 \pm (1.96)(0.295)(0.4964) \Rightarrow (0.008, 0.582)$. The odds of BDH in males are from 1.7
to 125 times greater than the odds of BDH in females.
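The interval in part (d) applies the delta method on the odds-ratio scale, taking the standard error of exp(b) to be approximately exp(b) times the standard error of b. A minimal Python sketch of that arithmetic follows. The second function is not the method used above: it shows the common alternative of building the interval on the log scale and exponentiating the endpoints, included only for comparison. Function names are illustrative.

```python
import math

def delta_method_or_interval(beta, se, z=1.96):
    """Interval for exp(beta) using se(exp(beta)) ~= exp(beta) * se(beta)."""
    odds_ratio = math.exp(beta)
    half_width = z * odds_ratio * se
    return odds_ratio - half_width, odds_ratio + half_width

def log_scale_or_interval(beta, se, z=1.96):
    """Alternative: form the interval for beta, then exponentiate the endpoints."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

beta_sex, se_sex = -1.22, 0.4964  # values quoted in part (d)
lo, hi = delta_method_or_interval(beta_sex, se_sex)
lo_log, hi_log = log_scale_or_interval(beta_sex, se_sex)
```

The delta-method interval is symmetric about the estimated odds ratio and its lower limit falls close to zero, whereas the log-scale interval (roughly 0.11 to 0.78 here) is asymmetric and always stays positive.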

(e)
$\hat\beta_8 = 0.030$ with standard error 0.0405, and $\exp(\hat\beta_8) = 1.03$ is an estimate of a conditional
odds ratio
$$\frac{\text{odds of BDH at level } z+1 \text{ of PBB}}{\text{odds of BDH at level } z \text{ of PBB}}$$
for any combination of levels for age, initial weight, sex, and cage tier. It suggests that there is
approximately a 3 percent increase in the odds for BDH for each 1 unit increase in the level of PBB,
if the other factors in the model are held constant. This is not a statistically significant result.
An approximate 95% confidence interval for this conditional odds ratio is
$1.030 \pm (1.96)(1.030)(0.0405) \Rightarrow (0.95, 1.11)$. Consequently,
there appears to be no strong association between odds of BDH and level of PBB, after adjusting
for tier, age, sex and initial weight.
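The statement that the PBB effect is not statistically significant can be checked with a Wald test, dividing the estimate by its standard error and referring the result to a standard normal distribution. The sketch below uses only the two numbers quoted in part (e). The function name is illustrative, and the normal approximation is an assumption of the Wald approach.

```python
import math

def wald_test(estimate, se):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = wald_test(0.030, 0.0405)  # PBB coefficient from part (e)
```

Here z is about 0.74 and the two-sided p-value is about 0.46, consistent with a confidence interval that contains 1.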
(f)
A backward elimination procedure identified sex, age, and sex*PBB as the only significant
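terms. The PBB effect was included to create the hierarchical model:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = 2.4252 - 2.7973\,(\text{sex}) - 0.0319\,(\text{age}) - 0.0326\,(\text{PBB}) + 0.5976\,(\text{sex} \times \text{PBB})$$

with standard errors 0.9815 (intercept), 0.8305 (sex), 0.0093 (age), 0.0858 (PBB), and 0.2239 (sex*PBB).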
Although the PBB term is not significant, adding it to the model does not result in large changes in
the estimates of the coefficients for the other factors in the model. This model suggests that PBB
has little effect on the incidence of BDH in male rats, but increasing exposure to PBB increases the
incidence rate of BDH in female rats. Given any particular age, an estimate of the conditional
odds ratio
$$\frac{\text{odds of BDH at level } z+1 \text{ of PBB}}{\text{odds of BDH at level } z \text{ of PBB}}$$
for female rats is exp(.5976 - .0326) = 1.76. This suggests that
there is approximately a 76 percent increase in the odds for BDH in female rats for each one unit
increase in the level of PBB, when age is held constant. An approximate 95% confidence interval
for this conditional odds ratio is ( 1.04 , 2.48 ). There appears to be a significant effect of PBB
on the incidence rate of BDH, but it is not measured very precisely by this study.
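Because the model contains a sex*PBB interaction, the odds ratio for a one-unit increase in PBB depends on sex. The sketch below evaluates the fitted linear predictor from part (f) and takes differences. It assumes the sex indicator equals 1 for females, as parts (d) and (f) imply. The default age of 100 is an arbitrary placeholder, since age cancels in the difference. A confidence interval is not computed because it would require the covariance between the PBB and sex*PBB estimates, which is not reported here.

```python
import math

# Coefficients of the model reported in part (f).
COEF = {"intercept": 2.4252, "sex": -2.7973, "age": -0.0319,
        "pbb": -0.0326, "sex_pbb": 0.5976}

def log_odds(sex, age, pbb):
    """Linear predictor; sex is the 0/1 indicator (assumed 1 = female)."""
    return (COEF["intercept"] + COEF["sex"] * sex + COEF["age"] * age
            + COEF["pbb"] * pbb + COEF["sex_pbb"] * sex * pbb)

def pbb_odds_ratio(sex, age=100.0, pbb=0.0):
    """Odds ratio for a one-unit increase in PBB at fixed sex and age."""
    return math.exp(log_odds(sex, age, pbb + 1) - log_odds(sex, age, pbb))

or_females = pbb_odds_ratio(sex=1)  # exp(-0.0326 + 0.5976)
or_males = pbb_odds_ratio(sex=0)    # exp(-0.0326)
```

The female odds ratio reproduces the 1.76 quoted above, while the male odds ratio is close to 1, matching the statement that PBB has little effect in male rats.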
(g)
The deviance plot for the model in the solution to part (f) is very similar to the deviance plot for
the model in part (a), suggesting that a small improvement in the model can be made by using
the explanatory variables in a better way. The following are diagnostic results for the model
in part (f). Results vary with the model selected in part (f).
Cases 19 and 20 have large Pearson residuals. These are the only two females not exposed to PBB
that had BDH. Since there is a low incidence of BDH among females with little or no exposure to
PBB, the model assigns relatively low probabilities of developing BDH to these cases. The Dfbeta
values indicate that removing cases 19 and 20 from the data would substantially affect the estimated
coefficients for the sex and sex*PBB factors. The estimated coefficient for the sex*PBB factor
would increase to reflect a stronger effect of PBB on the incidence rate of BDH in female rats.
These are valid observations, however, and I would retain them in the analysis. In deciding how
strongly to believe in the model, it is worthwhile to realize that the presence or absence of two
cases has a substantial influence on the estimated effect of PBB on the incidence rate of BDH in
females.
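To see why cases such as 19 and 20 produce large Pearson residuals, recall that for a single binary response the residual is (y - p)/sqrt(p(1 - p)), where p is the fitted probability. The sketch below uses a hypothetical fitted probability of 0.10, chosen only for illustration. It is not a value taken from the fitted model.

```python
import math

def pearson_residual(y, p_hat):
    """Pearson residual for one binary response y with fitted probability p_hat."""
    return (y - p_hat) / math.sqrt(p_hat * (1.0 - p_hat))

# Hypothetical fitted probability, not a value from the fitted model.
r_unexpected = pearson_residual(1, 0.10)  # event despite a low fitted probability
r_expected = pearson_residual(0, 0.10)    # no event, as the model anticipates
```

An event at a fitted probability of 0.10 gives a residual of about 3, while a non-event at the same probability gives a residual of about -0.33.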
Cases 93, 94 and 106 are relatively high leverage points corresponding to short survival times and
exposure to the highest level of PBB. Case 1 is a high leverage point corresponding to a short
survival time and no exposure to PBB. The other diagnostic measures indicate that parameter
estimates are not much affected by the presence or absence of cases 1 and 106. The Dfbeta values
indicate that deleting cases 93 and 94 would have moderate effects on the estimated coefficients
for the age and PBB factors and the intercept. Since cases 93 and 94 did not develop BDH,
deleting those cases results in a more negative coefficient for age and a stronger positive
coefficient for PBB. One possible explanation is that these animals died of other causes before
having enough time to develop BDH. These animals only lived about 40 days after treatment,
while most other animals lived at least 100 days. Should animals that died or left the study too
early be removed from the analysis because they did not have enough time to develop
BDH? We will not pursue this further at this time, but censored observations are always a concern
and potential complication in this type of study.
(h)
A number of models were proposed. Most of them address the following patterns in the data:
• Cage tiers have little effect on incidence of BDH.
• Incidence of BDH is lower among rats that survive longer, especially if they live longer than
80 weeks. Most models include an age effect to account for this. Some models also include
quadratic or other additional age effects to account for the complexity of this factor.
• Effects of initial weight and survival times (age) are partially confounded. A scatter plot
matrix reveals that the six rats with the shortest survival times (smallest age values) have
relatively low initial weights. Some search procedures may substitute a weight factor and
weight*age interaction for the age factor. You cannot completely separate the effects of these
factors, but initial weight appears to have little effect on incidence of BDH beyond its effect
on survival time (age).
• Changing the level of exposure to PBB has little effect on incidence of BDH in male rats, but
increasing exposure to PBB appears to increase incidence of BDH in females. Using
log(PBB+1) or accounting for curvature in this trend in some other way seems to offer a slight
improvement. Using a quadratic PBB term to adjust for curvature allows the incidence rate of
BDH to eventually decline as the level of PBB increases, which may not be reasonable from a
biological point of view unless higher levels of PBB cause rats to die before they exhibit
BDH.
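One reasonable model is

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = -4.6697 - 2.1448\,(\text{sex}) + 0.1349\,(\text{age}) - 0.00094\,(\text{age})^2 - 0.1632\,\log(\text{PBB}+1) + 1.237\,(\text{sex} \times \log(\text{PBB}+1))$$

with standard errors 2.6438 (intercept), 0.6189 (sex), 0.0604 (age), 0.00034 (age squared),
0.1999 (log(PBB + 1)), and 0.3915 (sex * log(PBB + 1)).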
The local deviance plot for this model is similar to the plot for the model in part (a). Other
diagnostic plots indicate no severe problems with this model.
In this problem and in problem 2, some students relied too heavily on the Hosmer-Lemeshow test
to justify the selected model. The Hosmer-Lemeshow test has poor power against most
alternatives to the specified model. In this case, it does not reject the model in part (a), even
though there seems to be a significant interaction between sex and the level of PBB. As seen on
the previous homework assignment, the Hosmer-Lemeshow test is sometimes able to indicate that a
single case is not well predicted by the model. It sometimes provides a better check for outliers
than a test of the overall fit of the model. I did not mean to promote the use of the Hosmer-Lemeshow
test, but I decided to talk about it because it is computed by the LOGISTIC procedure
in SAS.
The binary tree produced by the tree function in S-PLUS does not summarize the data as well as a
logistic regression model in this case. The tree first splits on age. Rats that survive longer than
102 weeks have a relatively low incidence of BDH. Rats that lived less than 103 weeks are split
into a number of subgroups based on weight and age. At the end, some of these subgroups are split
according to exposure to PBB, with exposure to at least 3.5 units of PBB associated with relatively
high incidences of BDH. The tree does not split on sex and gives no indication that PBB may
have more of an effect on females than males.
2. You should start your investigation with an exploratory analysis of the data. Almost no one did this. For
example, you should look at two-way frequency tables: (BST use)*(sites), (Trimester)*(sites),
(parity)*(sites), (mastitis rates)*(sites), etc. You would learn that BST was applied only in the second
trimester at sites 2 and 11, it was applied in both the first and second trimesters at site 1, and it was applied
only in the first trimester at the other sites. Also, sites 1, 2, and 11 have the lowest mastitis incidence rates.
Hence, the trimester factor is heavily confounded with site differences. If effects for all sites are included in
the model, trimester is unlikely to add any additional information and it will not appear to be a significant
factor. Also look at scatterplot matrices to see how continuous variables like peak milk production are
associated with other variables including mastitis incidence. It is difficult to think about the role of milk
production in the model because use of BST increases milk production. It is partly an outcome of the
treatment. Should you adjust a treatment effect for a variable that is affected by the treatment?
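The two-way frequency tables suggested above are easy to produce, and they reveal confounding directly when some sites show only one level of a factor. The sketch below uses invented records and a hypothetical helper name. The site and trimester values are placeholders and do not reproduce the BST data.

```python
from collections import Counter

def two_way_table(rows, cols):
    """Count joint occurrences of two categorical variables."""
    return Counter(zip(rows, cols))

# Hypothetical records, invented only to illustrate the pattern of confounding.
site = [1, 1, 2, 2, 3, 3, 11, 11]
trimester = [1, 2, 2, 2, 1, 1, 2, 2]
table = two_way_table(site, trimester)

# Sites at which only one trimester level ever occurs.
single_level_sites = sorted(
    s for s in set(site)
    if sum(1 for (s2, _t) in table if s2 == s) == 1
)
```

When most sites appear in this list, a trimester effect cannot be separated from site differences once site effects are in the model.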
The CHAID algorithm implemented by the TREEDISC macro in SAS provides some useful information.
It first divides the sites into three groups according to mastitis incidence rates. Site 8 in the U.S. has an
extremely high incidence rate of mastitis. I examined a subsequent split with respect to the use of BST to
show that mastitis rates were not affected by use of BST at this site. For the other two groups of sites, the
next split is with respect to use of BST and for each of these two groups it indicates that use of BST nearly
doubles the mastitis incidence rate. For the middle group, there is also some indication that mastitis rates
were higher among cows (parity larger than one) than heifers (parity equal to one).
The tree() function in S-PLUS reveals a similar pattern. Here, I used the factor() function in S-PLUS to
specify the site variable as a factor so it is treated as a nominal variable by the tree() function. First, site 8 is
separated from the other sites. Then, sites (1, 2, 4, 10, 11) which had the lowest mastitis incidence rates are
separated from sites (3, 5, 6, 7, 9) which had somewhat higher incidence rates. The latter sites are further
split with respect to use of BST which reveals that use of BST is associated with about an 80% increase in
mastitis incidence at those sites. Other splits are made with respect to peak milk production, which for the
most part indicate that mastitis incidence increases as peak milk production increases.
You could obtain additional insight into how changes in explanatory variables, other than the BST
treatment, are associated with mastitis rates by performing an analysis of only the cows that were not treated
with BST (the controls). Only one student mentioned an attempt to do this.
I think it is important to include the site effects in the model as blocking effects to adjust for different
management practices and environmental and genetic conditions at different sites. Some students split the
sites into a European group and a U.S. group and either entered a group effect into the model or fit different
models to the two groups. This may only partly adjust for site differences, however. Consider, for
example, how different site 8 is from other sites in the U.S. Some students entered a site 8 effect into their
model along with the U.S. group effect. This is better, but I would enter effects for all sites to be sure that I
adjusted the BST effect for different situations at different sites. Some of the sites may not be significantly
different, but you have enough data to waste a few degrees of freedom in fitting a few parameters that may
not actually be needed. It is better to do this than to be criticized later for possibly not completely
accounting for site differences. Local deviance plots show improvement when all of the individual site
effects are taken into account. It may be a good idea to set site 8 aside and fit a model to the other sites
where the effect of using BST is reasonably homogeneous across sites. Some students included site*BST
interactions in their models, but these interactions may largely reflect the differences between site 8 and the
other sites.
After adjusting for individual site effects, a reasonably good model is obtained by adding BST and parity
factors to the model. As noted above, the trimester factor adds little additional information for predicting
incidence of mastitis. A parity*BST interaction could be considered. Alternative models that fit the data
essentially as well are obtained by including peak milk production in the model in some form. When this is
done, however, the BST treatment effect may no longer appear to be significant because peak milk
production is increased by the application of BST. Should peak milk production be used as an explanatory
variable?
An analysis of just control cows may be helpful in making this decision. Of course, interaction with
dairy science experts would also be helpful, but this resource was not available to most of you. The overall
conclusion may be that in very sanitary dairy operations where mastitis rates are low, the use of BST does
not appear to significantly increase the incidence of mastitis. At sites where mastitis rates are higher,
possibly because of less attention to sanitation, the odds of mastitis are nearly doubled by use of BST. To
prevent increased mastitis rates with the use of BST to increase milk production, greater attention must be
paid to sanitation, but this appears to be manageable.
3. (A) The deviance statistic for the first part of the model is $G^2 = 21.21$ with 6 df, and the parameter
estimates are
$\hat\alpha_1 = 0.829$ with $s_{\hat\alpha_1} = 0.316$
$\hat\beta_1 = 0.0707$ with $s_{\hat\beta_1} = 0.0154$
The deviance statistic for the second part of the model is $G^2 = 1.66$ with 6 df, and the parameter
estimates are
$\hat\alpha_2 = 3.183$ with $s_{\hat\alpha_2} = 0.610$
$\hat\beta_2 = 0.0623$ with $s_{\hat\beta_2} = 0.0162$
The deviance statistic for overall fit is $G^2 = 21.21 + 1.66 = 22.87$ with 6+6=12 df and
p-value=0.029. This model does not fit well. The estimated curves do not provide a good description
of the trends in proportions across age for either the single group or the married group.
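The overall fit statistic adds the deviances and the degrees of freedom from the two parts of the model, and the p-value is the chi-square upper-tail probability. The sketch below takes the 6 df quoted for each part in (A) and uses a closed form for the chi-square tail that is valid only when the degrees of freedom are even, which avoids any dependence on a statistics library. The function name is illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, valid for even df only.

    For df = 2k: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j  # term is now (x/2)^j / j!
        total += term
    return math.exp(-half) * total

g2_first, g2_second = 21.21, 1.66  # deviances quoted in part (A)
g2_total = g2_first + g2_second
p_value = chi2_sf_even_df(g2_total, 6 + 6)
```

With 12 df this gives a p-value of about 0.029, in agreement with the value reported for model (A).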
(B) The deviance statistic for the first part of the model is $G^2 = 6.68$ with 5 df, and the parameter
estimates are
$\hat\alpha_1 = 2.16$ with $s_{\hat\alpha_1} = 0.50$
$\hat\beta_1 = 0.236$ with $s_{\hat\beta_1} = 0.048$
$\hat\gamma_1 = 0.00829$ with $s_{\hat\gamma_1} = 0.00084$
The deviance statistic for the second part of the model is $G^2 = 1.40$ with 5 df, and the parameter
estimates are
$\hat\alpha_2 = 3.74$ with $s_{\hat\alpha_2} = 1.28$
$\hat\beta_2 = 0.10$ with $s_{\hat\beta_2} = 0.077$
$\hat\gamma_2 = 0.00055$ with $s_{\hat\gamma_2} = 0.00107$
The deviance statistic for overall fit is $G^2 = 6.68 + 1.40 = 8.08$ with 5+5=10 df and p-value=0.62.
This model is consistent with the data, but in the second equation neither $\hat\beta_2$ nor $\hat\gamma_2$ is significant at the
.05 level. Hence the quadratic term is not needed in the second equation. The deviance test indicates that
the model provides a good fit, but the plot reveals that the estimated curve representing the proportion of
single people begins to rise after about age 60. It implies that single people tend to live longer than
married people. One would need to consider if this is reasonable. It is not supported by the observed
proportion of single people in the oldest age group.
(C) The deviance statistic for the first part of the model is $G^2 = 6.68$ with 6 df, and the parameter
estimates are
$\hat\alpha_1 = 3.33$ with $s_{\hat\alpha_1} = 0.70$
$\hat\beta_1 = 1.44$ with $s_{\hat\beta_1} = 0.26$
The deviance statistic for the second part of the model is $G^2 = 0.98$ with 6 df, and the parameter
estimates are
$\hat\alpha_2 = 7.44$ with $s_{\hat\alpha_2} = 1.83$
$\hat\beta_2 = 1.86$ with $s_{\hat\beta_2} = 0.52$
The deviance statistic for overall fit is $G^2 = 6.68 + 0.98 = 7.66$ with 6+6=12 df and p-value=0.81.
The large p-value indicates that this model fits the data quite well.
(D) The deviance statistic for the first part of the model is $G^2 = 0.70$ with 5 df, and the parameter
estimates are
$\hat\alpha_1 = 8.49$ with $s_{\hat\alpha_1} = 2.67$
$\hat\beta_1 = 5.67$ with $s_{\hat\beta_1} = 2.00$
$\hat\gamma_1 = 0.80$ with $s_{\hat\gamma_1} = 0.36$
This is a significant improvement over the first part of the model in part (C).
The deviance statistic for the second part of the model is $G^2 = 0.97$ with 5 df, and the parameter
estimates are
$\hat\alpha_2 = 7.19$ with $s_{\hat\alpha_2} = 8.95$
$\hat\beta_2 = 2.72$ with $s_{\hat\beta_2} = 5.49$
$\hat\gamma_2 = 0.02$ with $s_{\hat\gamma_2} = 0.83$
The deviance statistic for overall fit is $G^2 = 0.70 + 0.97 = 1.68$ with 5+5=10 df and p-value=0.998.
(E) The difference in the $G^2$ values for models (C) and (D) is 7.66-1.68=5.98 with 2 degrees of freedom.
This yields a p-value of almost exactly 0.05. Consequently, there is some evidence that model (D) is an
improvement over model (C). In the second equation, however, none of the terms appears to be
significant. This suggests that the squared log-term can be deleted from the second part of the model.
This results in a model that provides essentially the same fit as model (D) and is a significant
improvement over model (C). $G^2 = 1.69$ with 5+6 = 11 degrees of freedom. Plots of the fitted curves
for this model match the trends in the observed proportions very well for all three marital groups.
Of the models considered, model (C) is the most parsimonious model that provides an adequate
description of how the conditional probabilities change with age, but a substantial improvement in how
well the estimated curves fit the observed proportions of single and married persons in the youngest
and oldest age groups is obtained by adding a squared log-age term to the first logistic equation of
model (C).
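The comparison of models (C) and (D) rests on the difference in deviances between nested models, referred to a chi-square distribution with degrees of freedom equal to the number of added parameters. With 2 degrees of freedom the upper-tail probability has a simple closed form, so the quoted p-value can be checked by hand:

```latex
\Delta G^2 = 7.66 - 1.68 = 5.98, \qquad
P\left(\chi^2_2 > 5.98\right) = e^{-5.98/2} = e^{-2.99} \approx 0.0503
```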