Stat 557 Solutions to Assignment 7 Fall 2002

1. (a) After adjusting for the effects of the other variables in the model, only sex (p=.014) and age (p=.0013) appear to have a significant association with the incidence of bile duct hyperplasia (BDH). This does not mean that the other variables are completely unimportant. Putting several highly correlated variables into a model can result in none of the variables being declared significant after adjusting for the effects of the other variables, even though some of the variables have significant marginal associations with the response variable. The global deviance for this model is 321.475/310 = 1.037 (the deviance divided by its degrees of freedom). This does not reveal any serious deficiency in the model. The level of concordance between the observed results and the predicted probabilities provided by the estimated model is reflected in a moderate gamma value of .416. Note that there are a substantial number of cases with estimated probabilities for BDH between 0.3 and 0.7. In such situations it is not unusual to observe a relatively large number of discordant pairs among the observed binary responses and the predicted probabilities.

(b) The deviance plot levels out below the global deviance, indicating that some of the information in the explanatory variables has not been captured by this model. Note that warning messages appear when 15 cluster effects are added to the logistic regression model. This happens because clusters 10, 12 and 15 contain either all successes or all failures. It can also happen when a combination of values for the explanatory variables corresponds to a situation consisting of either all successes or all failures. Maximum likelihood estimates do not exist in this case. You should refit the model after removing some cluster effects. Numerical problems are avoided by adding effects for clusters 1-9, 11, 13, and 14 to the original logistic regression model.
Now the deviance plot levels out only slightly below the line for the global deviance of the original model, suggesting that only slight improvements can be made to the model.

(c) The added variable partial residual plots do not reveal much about binary explanatory variables like sex and the dummy variables for the cage tier levels. A nearly horizontal line suggests that a binary explanatory variable has little association with the incidence of BDH after adjustments are made for other terms in the model. The partial residual plot for initial weight is nearly a straight line with zero slope, indicating that initial weight has essentially no association with incidence of BDH that cannot be attributed to other variables in the model. The added variable partial residual plot for PBB shows curvature at the lower levels of PBB with an asymptote as PBB increases. Given the nature of the spacing between the levels of PBB, replacing PBB with log(PBB) may slightly improve the model. The added variable partial residual plot for age indicates that after adjusting for tiers, initial weight, sex and level of exposure to PBB, the incidence of BDH gradually decreases as age increases. These plots could change if some variables are removed from the model.

(d) $\hat\beta_5 = -1.22$ with standard error 0.4964, and $\exp(\hat\beta_5) = 0.295$ is an estimate of a common conditional odds ratio, (odds of BDH in females)/(odds of BDH in males), for any combination of levels for age, initial weight, level of exposure to PBB, and cage tier. An approximate 95% confidence interval is 0.295 ± (1.96)(0.295)(.4964) ⇒ (0.008, 0.582). Taking reciprocals of the interval endpoints, the odds of BDH in males are from about 1.7 to 125 times greater than the odds of BDH in females.

(e) $\hat\beta_8 = 0.030$ with standard error 0.0405, and $\exp(\hat\beta_8) = 1.03$ is an estimate of a conditional odds ratio, (odds of BDH at level z + 1 of PBB)/(odds of BDH at level z of PBB), for any combination of levels for age, initial weight, sex, and cage tier.
It suggests that there is approximately a 3 percent increase in the odds for BDH for each 1 unit increase in the level of PBB, if the other factors in the model are held constant. This is not a statistically significant result. An approximate 95% confidence interval for this conditional odds ratio is 1.030 ± (1.96)(1.030)(.0405) ⇒ (0.95, 1.11). Consequently, there appears to be no strong association between odds of BDH and level of PBB, after adjusting for tier, age, sex and initial weight.

(f) A backward elimination procedure identified sex, age, and sex*PBB as the only significant terms. The PBB effect was included to create the hierarchical model:

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = 2.4252 - 2.7973\,(\text{sex}) - .0319\,(\text{age}) - .0326\,(\text{PBB}) + .5976\,(\text{sex} \times \text{PBB})$$

with standard errors (.9815), (.8305), (.0093), (.0858), and (.2239) for the intercept, sex, age, PBB, and sex*PBB terms, respectively.

Although the PBB term is not significant, adding it to the model does not result in large changes in the estimates of the coefficients for the other factors in the model. This model suggests that PBB has little effect on the incidence of BDH in male rats, but increasing exposure to PBB increases the incidence rate of BDH in female rats. Given any particular age, an estimate of the conditional odds ratio for female rats, (odds of BDH at level z + 1 of PBB)/(odds of BDH at level z of PBB), is exp(.5976 - .0326) = 1.76. This suggests that there is approximately a 76 percent increase in the odds for BDH in female rats for each one unit increase in the level of PBB, when age is held constant. An approximate 95% confidence interval for this conditional odds ratio is (1.04, 2.48). There appears to be a significant effect of PBB on the incidence rate of BDH, but it is not measured very precisely by this study.

(g) The deviance plot for the model in the solution to part (f) is very similar to the deviance plot for the model in part (a), suggesting that a small improvement in the model can be made by using the explanatory variables in a better way.
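As a side check on the interval arithmetic in parts (d) and (e), the following is a minimal Python sketch of the delta-method interval used there, in which the standard error of exp(β̂) is approximated by exp(β̂) times the standard error of β̂. The function name `delta_method_or_ci` is a hypothetical helper introduced only for illustration; the coefficients and standard errors are the ones quoted above.

```python
import math

def delta_method_or_ci(beta, se, z=1.96):
    """Approximate CI for the odds ratio exp(beta) via the delta method.

    The standard error of exp(beta) is approximated by exp(beta) * se,
    and the interval is exp(beta) +/- z * exp(beta) * se.
    """
    odds_ratio = math.exp(beta)
    half_width = z * odds_ratio * se
    return odds_ratio, odds_ratio - half_width, odds_ratio + half_width

# Part (d): sex coefficient and its standard error, as quoted in the text.
or_d, lo_d, hi_d = delta_method_or_ci(-1.22, 0.4964)
# Part (e): PBB coefficient and its standard error, as quoted in the text.
or_e, lo_e, hi_e = delta_method_or_ci(0.030, 0.0405)

print(f"(d) OR = {or_d:.3f}, CI = ({lo_d:.3f}, {hi_d:.3f})")
print(f"(e) OR = {or_e:.3f}, CI = ({lo_e:.2f}, {hi_e:.2f})")
```

The interval for part (f) cannot be reproduced this way from the numbers given, because the standard error of the sum of two coefficients also depends on their estimated covariance, which is not reported here.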
The following are diagnostic results for the model in part (f). Results vary with the model selected in part (f). Cases 19 and 20 have large Pearson residuals. These are the only two females not exposed to PBB that had BDH. Since there is a low incidence of BDH among females with little or no exposure to PBB, the model assigns relatively low probabilities of developing BDH to these cases. The Dfbeta values indicate that removing cases 19 and 20 from the data would substantially affect the estimated coefficients for the sex and sex*PBB factors. The estimated coefficient for the sex*PBB factor would increase to reflect a stronger effect of PBB on the incidence rate of BDH in female rats. These are valid observations, however, and I would retain them in the analysis. In deciding how strongly to believe in the model, it is worthwhile to realize that the presence or absence of two cases has a substantial influence on the estimated effect of PBB on the incidence rate of BDH in females.

Cases 93, 94 and 106 are relatively high leverage points corresponding to short survival times and exposure to the highest level of PBB. Case 1 is a high leverage point corresponding to a short survival time and no exposure to PBB. The other diagnostic measures indicate that parameter estimates are not much affected by the presence or absence of cases 1 and 106. The Dfbeta values indicate that deleting cases 93 and 94 would have moderate effects on the estimated coefficients for the age and PBB factors and the intercept. Since cases 93 and 94 did not develop BDH, deleting those cases results in a more negative coefficient for age and a stronger positive coefficient for PBB. One possible explanation is that these animals died of other causes before having enough time to develop BDH. These animals only lived about 40 days after treatment, while most other animals lived at least 100 days.
Should animals that died or left the study too early be removed from the analysis because they did not have enough time to develop BDH? We will not pursue this further at this time, but censored observations are always a concern and a potential complication in this type of study.

(h) A number of models were proposed. Most of them address the following patterns in the data:

• Cage tiers have little effect on incidence of BDH.

• Incidence of BDH is lower among rats that survive longer, especially if they live longer than 80 weeks. Most models include an age effect to account for this. Some models also include quadratic or other additional age effects to account for the complexity of this factor.

• Effects of initial weight and survival times (age) are partially confounded. A scatter plot matrix reveals that the six rats with the shortest survival times (smallest age values) have relatively low initial weights. Some search procedures may substitute a weight factor and weight*age interaction for the age factor. You cannot completely separate the effects of these factors, but initial weight appears to have little effect on incidence of BDH beyond its effect on survival time (age).

• Changing the level of exposure to PBB has little effect on incidence of BDH in male rats, but increasing exposure to PBB appears to increase incidence of BDH in females. Using log(PBB+1) or accounting for curvature in this trend in some other way seems to offer a slight improvement. Using a quadratic PBB term to adjust for curvature allows the incidence rate of BDH to eventually decline as the level of PBB increases, which may not be reasonable from a biological point of view unless higher levels of PBB cause rats to die before they exhibit BDH.
One reasonable model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = -4.6697 - 2.1448\,(\text{sex}) + .1349\,(\text{age}) - .00094\,(\text{age})^2 - .1632\,\log(\text{PBB}+1) + 1.237\,(\text{sex} \times \log(\text{PBB}+1))$$

with standard errors (2.6438), (.6189), (.0604), (.00034), (.1999), and (.3915) for the six terms in the order shown.

The local deviance plot for this model is similar to the plot for the model in part (a). Other diagnostic plots indicate no severe problems with this model.

In this problem and in problem 2, some students relied too heavily on the Hosmer-Lemeshow test to justify the selected model. The Hosmer-Lemeshow test has poor power against most alternatives to the specified model. In this case, it does not reject the model in part (a), even though there seems to be a significant interaction between sex and the level of PBB. As seen on the previous homework assignment, the Hosmer-Lemeshow test is sometimes able to indicate that a single case is not well predicted by the model. It sometimes provides a better check for outliers than a test of the overall fit of the model. I did not mean to promote the use of the Hosmer-Lemeshow test, but I decided to talk about it because it is computed by the LOGISTIC procedure in SAS.

The binary tree produced by the tree function in S-PLUS does not summarize the data as well as a logistic regression model in this case. The tree first splits on age. Rats that survive longer than 102 weeks have a relatively low incidence of BDH. Rats that lived less than 103 weeks are split into a number of subgroups based on weight and age. At the end, some of these subgroups are split according to exposure to PBB, with exposure to at least 3.5 units of PBB associated with relatively high incidences of BDH. The tree does not split on sex and gives no indication that PBB may have more of an effect on females than males.

2. You should start your investigation with an exploratory analysis of the data. Almost no one did this. For example, you should look at two-way frequency tables: (BST use)*(sites), (Trimester)*(sites), (parity)*(sites), (mastitis rates)*(sites), etc.
You would learn that BST was applied only in the second trimester at sites 2 and 11, it was applied in both the first and second trimesters at site 1, and it was applied only in the first trimester at the other sites. Also, sites 1, 2, and 11 have the lowest mastitis incidence rates. Hence, the trimester factor is heavily confounded with site differences. If effects for all sites are included in the model, trimester is unlikely to add any additional information and it will not appear to be a significant factor. Also look at scatterplot matrices to see how continuous variables like peak milk production are associated with other variables, including mastitis incidence. It is difficult to think about the role of milk production in the model because use of BST increases milk production. It is partly an outcome of the treatment. Should you adjust a treatment effect for a variable that is affected by the treatment?

The CHAID algorithm implemented by the TREEDISC macro in SAS provides some useful information. It first divides the sites into three groups according to mastitis incidence rates. Site 8 in the U.S. has an extremely high incidence rate of mastitis. I examined a subsequent split with respect to the use of BST to show that mastitis rates were not affected by use of BST at this site. For the other two groups of sites, the next split is with respect to use of BST, and for each of these two groups it indicates that use of BST nearly doubles the mastitis incidence rate. For the middle group, there is also some indication that mastitis rates were higher among cows (parity larger than one) than heifers (parity equal to one).

The tree() function in S-PLUS reveals a similar pattern. Here, I used the factor() function in S-PLUS to specify the site variable as a factor so it is treated as a nominal variable by the tree() function. First, site 8 is separated from the other sites.
Then, sites (1, 2, 4, 10, 11), which had the lowest mastitis incidence rates, are separated from sites (3, 5, 6, 7, 9), which had somewhat higher incidence rates. The latter sites are further split with respect to use of BST, which reveals that use of BST is associated with about an 80% increase in mastitis incidence at those sites. Other splits are made with respect to peak milk production, which for the most part indicate that mastitis incidence increases as peak milk production increases.

You could obtain additional insight into how changes in explanatory variables, other than the BST treatment, are associated with mastitis rates by performing an analysis of only the cows that were not treated with BST (the controls). Only one student mentioned an attempt to do this.

I think it is important to include the site effects in the model as blocking effects to adjust for different management practices and different environmental and genetic conditions at different sites. Some students split the sites into a European group and a U.S. group and either entered a group effect into the model or fit different models to the two groups. This may only partly adjust for site differences, however. Consider, for example, how different site 8 is from other sites in the U.S. Some students entered a site 8 effect into their model along with the U.S. group effect. This is better, but I would enter effects for all sites to be sure that I adjusted the BST effect for different situations at different sites. Some of the sites may not be significantly different, but you have enough data to waste a few degrees of freedom in fitting a few parameters that may not actually be needed. It is better to do this than to be criticized later for possibly not completely accounting for site differences. Local deviance plots show improvement when all of the individual site effects are taken into account.
It may be a good idea to set site 8 aside and fit a model to the other sites where the effect of using BST is reasonably homogeneous across sites. Some students included site*BST interactions in their models, but these interactions may largely reflect the differences between site 8 and the other sites.

After adjusting for individual site effects, a reasonably good model is obtained by adding BST and parity factors to the model. As noted above, the trimester factor adds little additional information for predicting incidence of mastitis. A parity*BST interaction could be considered. Alternative models that fit the data essentially as well are obtained by including peak milk production in the model in some form. When this is done, however, the BST treatment effect may no longer appear to be significant because peak milk production is increased by the application of BST. Should peak milk production be used as an explanatory variable? An analysis of just control cows may be helpful in making this decision. Of course, interaction with dairy science experts would also be helpful, but this resource was not available to most of you.

The overall conclusion may be that in very sanitary dairy operations where mastitis rates are low, the use of BST does not appear to significantly increase the incidence of mastitis. At sites where mastitis rates are higher, possibly because of less attention to sanitation, the odds of mastitis are nearly doubled by use of BST. To prevent increased mastitis rates when BST is used to increase milk production, greater attention must be paid to sanitation, but this appears to be manageable.

3.
(A) The deviance statistic for the first part of the model is $G^2 = 21.21$ with 6 df, and the parameter estimates are

$\hat\alpha_1 = 0.829$ ($s_{\hat\alpha_1} = 0.316$), $\hat\beta_1 = -0.0707$ ($s_{\hat\beta_1} = 0.0154$).

The deviance statistic for the second part of the model is $G^2 = 1.66$ with 6 df, and the parameter estimates are

$\hat\alpha_2 = -3.183$ ($s_{\hat\alpha_2} = 0.610$), $\hat\beta_2 = 0.0623$ ($s_{\hat\beta_2} = 0.0162$).

The deviance statistic for overall fit is $G^2 = 21.21 + 1.66 = 22.87$ with 6+6=12 df and p-value=0.029. This model does not fit well. The estimated curves do not provide a good description of the trends in proportions across age for either the single group or the married group.

(B) The deviance statistic for the first part of the model is $G^2 = 6.68$ with 5 df, and the parameter estimates are

$\hat\alpha_1 = 2.16$ ($s_{\hat\alpha_1} = 0.50$), $\hat\beta_1 = -0.236$ ($s_{\hat\beta_1} = 0.048$), $\hat\gamma_1 = 0.00829$ ($s_{\hat\gamma_1} = 0.00084$).

The deviance statistic for the second part of the model is $G^2 = 1.40$ with 5 df, and the parameter estimates are

$\hat\alpha_2 = -3.74$ ($s_{\hat\alpha_2} = 1.28$), $\hat\beta_2 = 0.10$ ($s_{\hat\beta_2} = 0.077$), $\hat\gamma_2 = -0.00055$ ($s_{\hat\gamma_2} = 0.00107$).

The deviance statistic for overall fit is $G^2 = 6.68 + 1.40 = 8.08$ with 5+5=10 df and p-value=0.62. This model is consistent with the data, but in the second equation neither $\hat\beta_2$ nor $\hat\gamma_2$ is significant at the .05 level. Hence the quadratic term is not needed in the second equation. The deviance test indicates that the model provides a good fit, but the plot reveals that the estimated curve representing the proportion of single people begins to rise after about age 60. It implies that single people tend to live longer than married people. One would need to consider whether this is reasonable. It is not supported by the observed proportion of single people in the oldest age group.
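The overall p-values quoted for models (A) and (B) can be checked without statistical software, because the chi-square upper-tail probability has a simple closed form when the degrees of freedom are even. The sketch below is only an illustration: the helper name `chi2_sf_even_df` is hypothetical, and the calculation for model (A) uses the 6 df reported for each of its two equations.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail probability equals
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    k = df // 2
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))

# Model (A): deviances 21.21 and 1.66, each reported on 6 df.
p_a = chi2_sf_even_df(21.21 + 1.66, 6 + 6)
# Model (B): deviances 6.68 and 1.40, each reported on 5 df.
p_b = chi2_sf_even_df(6.68 + 1.40, 5 + 5)

print(f"model (A): G2 = 22.87 on 12 df, p = {p_a:.3f}")
print(f"model (B): G2 = 8.08 on 10 df, p = {p_b:.2f}")
```

The p-value of 0.029 for model (A) corresponds to 12 degrees of freedom, the sum of the 6 df from each equation.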
(C) The deviance statistic for the first part of the model is $G^2 = 6.68$ with 6 df, and the parameter estimates are

$\hat\alpha_1 = 3.33$ ($s_{\hat\alpha_1} = 0.70$), $\hat\beta_1 = -1.44$ ($s_{\hat\beta_1} = 0.26$).

The deviance statistic for the second part of the model is $G^2 = 0.98$ with 6 df, and the parameter estimates are

$\hat\alpha_2 = -7.44$ ($s_{\hat\alpha_2} = 1.83$), $\hat\beta_2 = 1.86$ ($s_{\hat\beta_2} = 0.52$).

The deviance statistic for overall fit is $G^2 = 6.68 + 0.98 = 7.66$ with 6+6=12 df and p-value=0.81. The large p-value indicates that this model fits the data quite well.

(D) The deviance statistic for the first part of the model is $G^2 = 0.70$ with 5 df, and the parameter estimates are

$\hat\alpha_1 = 8.49$ ($s_{\hat\alpha_1} = 2.67$), $\hat\beta_1 = -5.67$ ($s_{\hat\beta_1} = 2.00$), $\hat\gamma_1 = 0.80$ ($s_{\hat\gamma_1} = 0.36$).

This is a significant improvement over the first part of the model in part (C). The deviance statistic for the second part of the model is $G^2 = 0.97$ with 5 df, and the parameter estimates are

$\hat\alpha_2 = -7.19$ ($s_{\hat\alpha_2} = 8.95$), $\hat\beta_2 = 2.72$ ($s_{\hat\beta_2} = 5.49$), $\hat\gamma_2 = 0.02$ ($s_{\hat\gamma_2} = 0.83$).

The deviance statistic for overall fit is $G^2 = 0.70 + 0.97 = 1.68$ (after rounding) with 5+5=10 df and p-value=0.998.

(E) The difference in the $G^2$ values for models (C) and (D) is 7.66-1.68=5.98 with 2 degrees of freedom. This yields a p-value of almost exactly 0.05. Consequently, there is some evidence that model (D) is an improvement over model (C). In the second equation, however, none of the terms appear to be significant. This suggests that the squared log-age term can be deleted from the second part of the model. This results in a model that provides essentially the same fit as model (D) and is a significant improvement over model (C): $G^2 = 1.69$ with 5+6 = 11 degrees of freedom. Plots of the fitted curves for this model match the trends in the observed proportions very well for all three marital groups.
Of the models considered, model (C) is the most parsimonious model that provides an adequate description of how the conditional probabilities change with age, but a substantial improvement in how well the estimated curves fit the observed proportions of single and married persons in the youngest and oldest age groups is obtained by adding a squared log-age term to the first logistic equation of model (C).
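The comparison of models (C) and (D) is a difference-in-deviance test, and its p-value can be verified directly. The sketch below uses two standard facts: with 2 degrees of freedom the chi-square upper tail is exp(-x/2), and with 1 degree of freedom it is erfc(sqrt(x/2)). The variable names are illustrative only, and the 1 df comparison of the first equations (6.68 versus 0.70) is derived here from the reported deviances, not quoted from the solutions.

```python
import math

# Overall deviances and degrees of freedom quoted for models (C) and (D).
g2_model_c, df_model_c = 7.66, 12
g2_model_d, df_model_d = 1.68, 10

lr_stat = g2_model_c - g2_model_d   # drop in deviance from (C) to (D)
lr_df = df_model_c - df_model_d     # number of added parameters (2)

# Chi-square upper tail with 2 df: P(X > x) = exp(-x / 2).
p_overall = math.exp(-lr_stat / 2.0)

# First equation only: adding the squared log-age term costs 1 df.
lr_first = 6.68 - 0.70
# Chi-square upper tail with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_first = math.erfc(math.sqrt(lr_first / 2.0))

print(f"(C) vs (D): G2 difference = {lr_stat:.2f} on {lr_df} df, p = {p_overall:.3f}")
print(f"first equation: G2 difference = {lr_first:.2f} on 1 df, p = {p_first:.4f}")
```

Under these assumptions the 1 df test for the first equation is more clearly significant than the 2 df test for the whole model, which is consistent with keeping the squared log-age term only in the first equation.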