Stat 557 Solutions to Assignment 6 Fall 2002

1. Two answers can be derived for problem 1. We first present a solution where $\pi_i$ is the probability that a female turtle emerges from a randomly selected egg incubated at the i-th temperature. Then we present a solution where $\pi_i$ is the probability that a male turtle emerges from a randomly selected egg incubated at the i-th temperature.

The data file for the Illinois turtle eggs has three sets of counts for each temperature. I made a new data file with five lines, one for each temperature, where I combined the counts at each temperature. Values of goodness-of-fit statistics were obtained by applying the GENMOD procedure to this second data file. The same parameter estimates are obtained with either of the data files.

(a) Using $\pi_i$ to represent the probability that a female turtle emerges from a randomly selected egg incubated at the i-th temperature, the estimated logistic regression model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = \underset{(12.022)}{-61.318} + \underset{(0.431)}{2.211}\,(\text{temperature}).$$

Standard errors are reported in parentheses below the estimated coefficients.

Goodness-of-fit statistics could be computed in two ways. There are only 5 temperature levels. If you total the counts for the five temperature levels before fitting the model, you obtain $G^2$ = 14.86 and $X^2$ = 14.95 with 3 d.f. (p-value < .005). This model does not fit the data well. It does not allow the probability of a female to increase quickly enough between 27.2 and 28.5 °C. If you fit the model with 15 sets of counts, three for each temperature, you obtain $G^2$ = 24.94 and $X^2$ = 26.24 with 13 d.f. (p-value = .02).

The first approach has more power for detecting that the proposed model may be inadequate because the counts are larger. Also, the chi-square approximation to the null distributions of $G^2$ and $X^2$ would give more accurate p-values and better control of the type I error level because the counts are larger.
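The p-values quoted for the two sets of goodness-of-fit statistics in part (a) can be checked without SAS. The following is a minimal sketch using only the Python standard library. The helper name `chi2_sf` is hypothetical, and the function assumes an integer number of degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df d.f. > x), integer df >= 1.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # starting value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # starting value for df = 1
    while 2 * a < df:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Deviance statistics from part (a): pooled counts (3 d.f.) and 15 sets of counts (13 d.f.)
p_pooled = chi2_sf(14.86, 3)
p_replicates = chi2_sf(24.94, 13)
print(p_pooled, p_replicates)
```

Under these assumptions the pooled statistic falls below the .005 level and the 13 d.f. statistic gives a p-value of roughly .02, in line with the values reported above.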
The second approach allows you to examine variation among replicates at the same temperature. It could detect that the probability of success (hatching a female) varies among replicates at the same temperature, which would violate the independent binomial model used to establish the likelihood function.

(b) The estimated complementary log-log regression model is

$$\log(-\log(1-\hat\pi_i)) = \underset{(4.474)}{-24.983} + \underset{(0.158)}{0.889}\,(\text{temperature}).$$

Standard errors are reported in parentheses below the estimated coefficients. If you total the counts across the three replicates at each temperature before fitting the model, the goodness-of-fit test statistics are $G^2$ = 22.91 and $X^2$ = 21.34 with 3 d.f. (p-value < .005). This model does not fit the data well. If you fit the model with 15 sets of counts, three for each temperature, you obtain $G^2$ = 32.99 and $X^2$ = 33.8 with 13 d.f. (p-value < .01).

(c) Consider a model of the form

$$\log\left(\frac{(1-\pi_i)^{-\alpha} - 1}{\alpha}\right) = \beta_0 + \beta_1\,(\text{temperature}).$$

The approximate procedure for selecting a value of $\alpha$ yields $\hat\alpha = 1 - \hat\gamma = 1 - (-1.99) \approx 3.00$. Larger values of $\alpha$ lead to models that appear to fit the data a little better. Some deviance values are shown below. These were obtained by first totaling the counts at each of the five temperature levels before fitting the model; you would get different deviance values if you used the 15 sets of counts that are posted on the data file.

    alpha        1.0     3.0    10.0    20.0   100.0
    deviance   14.86    9.82    6.81    6.44    6.37

As $\alpha$ increases, the curve rises more rapidly between 27.2 and 28.5 °C. Determination of the sex of turtle hatchlings appears to take place in a very narrow temperature range. Below 27.2 °C most hatchlings are male and above 28.5 °C most hatchlings are female. The model for $\alpha$ = 10 appears to be adequate and there seems to be no need to go beyond $\alpha$ = 20. The estimated model for $\alpha$ = 20 is

$$\log\left(\frac{(1-\hat\pi_i)^{-20} - 1}{20}\right) = \underset{(114.21)}{-763.63} + \underset{(4.18)}{28.03}\,(\text{temperature}).$$

Standard errors are reported in parentheses below the estimated coefficients. For this model $G^2$ = 6.44 and $X^2$ = 6.75 with 3 d.f. (p-value > .05).

(d) For the model in part (c), compute

$$T_{\text{Ill},0.5} = \frac{\log\left(\frac{2^{\alpha} - 1}{\alpha}\right) - \hat\beta_0}{\hat\beta_1} = 27.63\ ^{\circ}\text{C}.$$

Using the delta method, the large sample standard error for this estimate is the square root of

$$\begin{bmatrix} -0.035674 & -0.98567 \end{bmatrix} \begin{bmatrix} 13043.62 & -477.02 \\ -477.02 & 17.4478 \end{bmatrix} \begin{bmatrix} -0.035674 \\ -0.98567 \end{bmatrix} = .004387.$$

An approximate 95% confidence interval is 27.63 ± (1.96)(.06623) ⇒ (27.50, 27.76).

(e) Fitting the model in part (c) with $\alpha$ = 20 to the New Mexico data yields

$$\log\left(\frac{(1-\hat\pi_i)^{-20} - 1}{20}\right) = \underset{(139.91)}{-455.58} + \underset{(5.06)}{16.57}\,(\text{temperature}).$$

Standard errors are reported in parentheses below the estimated coefficients. For this model $G^2$ = 1.29 and $X^2$ = 1.30 with 1 d.f. (p-value > .25). This model appears to be adequate.

(f) For the model in part (e), compute

$$T_{\text{NM},0.5} = \frac{\log\left(\frac{2^{\alpha} - 1}{\alpha}\right) - \hat\beta_0}{\hat\beta_1} = 28.15\ ^{\circ}\text{C}.$$

Using the delta method, the large sample standard error for this estimate is the square root of

$$\begin{bmatrix} -0.0603453 & -1.69859 \end{bmatrix} \begin{bmatrix} 19574.33 & -707.95 \\ -707.95 & 25.6199 \end{bmatrix} \begin{bmatrix} -0.0603453 \\ -1.69859 \end{bmatrix} = .067271.$$

An approximate 95% confidence interval is 28.15 ± (1.96)(.25937) ⇒ (27.64, 28.66).

(g) Assuming that results for the Illinois eggs are completely independent of results for the New Mexico eggs,

$$\mathrm{Var}(T_{\text{Ill},0.5} - T_{\text{NM},0.5}) = \mathrm{Var}(T_{\text{Ill},0.5}) + \mathrm{Var}(T_{\text{NM},0.5})$$

and a test statistic that has an approximate standard normal distribution under the null hypothesis is

$$Z = \frac{28.15 - 27.63}{\sqrt{0.067271 + .004387}} = 1.94$$

with p-value = .052. There is some indication that the temperature that produces 50% females is higher in New Mexico. A more accurate inference could be made if more eggs from New Mexico were included in the study.

Now we present a solution where $\pi_i$ is the probability that a male turtle emerges from a randomly selected egg incubated at the i-th temperature.

(a) The estimated logistic regression model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = 61.318 - 2.211\,(\text{temperature}).$$
(12.022)  (0.431)

Standard errors are reported in parentheses below the estimated coefficients. For this model $G^2$ = 14.86 and $X^2$ = 14.95 with 3 d.f. (p-value < .005), the same values as in the female parameterization because the logistic model is symmetric in the two outcomes. This model is not adequate for these data. It does not allow the probability of a male to decrease rapidly enough between 27.2 and 28.5 °C.

(b) The estimated complementary log-log regression model is

$$\log(-\log(1-\hat\pi_i)) = \underset{(9.0407)}{55.6449} - \underset{(0.3261)}{1.8770}\,(\text{temperature}).$$

Standard errors are reported in parentheses below the estimated coefficients. For this model $G^2$ = 10.97 and $X^2$ = 11.34 with 3 d.f. (p-value = .01). This model fits better than the logistic regression model in part (a). It allows the probability of a male to decrease more quickly between 27.2 and 28.5 °C, but it does not quite bend the curve fast enough.

(c) Consider a model of the form

$$\log\left(\frac{(1-\pi_i)^{-\alpha} - 1}{\alpha}\right) = \beta_0 + \beta_1\,(\text{temperature}).$$

The procedure for selecting a value of $\alpha$ yields $\hat\alpha = 1 - \hat\gamma = 1 - (4.13) \Rightarrow 0$. The complementary log-log model is suggested.

    alpha        1.0     0.5     0.2     0.1    0.01       0
    deviance   14.86   13.08   11.86   11.42   11.02   10.97

Everyone using this setup selected the complementary log-log model from part (b). You could have explored other models, such as raising the cdf of the extreme value distribution to a power, but we will not pursue this here.

(d) For the model in part (b), compute

$$T_{\text{Ill},0.5} = \frac{\log(\log(2)) - \hat\beta_0}{\hat\beta_1} = 27.71\ ^{\circ}\text{C}.$$

Using the delta method, the large sample standard error for this estimate is the square root of

$$\begin{bmatrix} 0.532765 & 14.762875 \end{bmatrix} \begin{bmatrix} 81.73365 & -2.94815 \\ -2.94815 & 0.10637 \end{bmatrix} \begin{bmatrix} 0.532765 \\ 14.762875 \end{bmatrix} = .0064585.$$

An approximate 95% confidence interval is 27.71 ± (1.96)(.080365) ⇒ (27.55, 27.87).

(e) Fitting the complementary log-log model to the New Mexico data yields

$$\log(-\log(1-\hat\pi_i)) = \underset{(11.70)}{35.8123} - \underset{(0.417)}{1.2778}\,(\text{temperature}).$$

Standard errors are reported in parentheses below the estimated coefficients. For this model $G^2$ = 3.83 and $X^2$ = 3.63 with 1 d.f.
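The delta-method arithmetic in part (d) of the male-probability solution is a quadratic form, and it can be checked numerically. The sketch below uses only the gradient vector and covariance matrix quoted in part (d). The function name `delta_method_variance` is hypothetical and the code is an illustration, not the SAS computation used for the assignment.

```python
import math

def delta_method_variance(grad, cov):
    """Quadratic form g' V g for a gradient vector g and a covariance matrix V."""
    n = len(grad)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(n) for j in range(n))

# Gradient of T with respect to (beta0, beta1) and the covariance matrix of the
# estimates, both copied from part (d).
grad_ill = [0.532765, 14.762875]
cov_ill = [[81.73365, -2.94815],
           [-2.94815, 0.10637]]

var_t = delta_method_variance(grad_ill, cov_ill)
se_t = math.sqrt(var_t)

t_hat = 27.71  # point estimate reported in part (d)
ci = (t_hat - 1.96 * se_t, t_hat + 1.96 * se_t)
print(round(var_t, 5), round(se_t, 4), tuple(round(c, 2) for c in ci))
```

The three terms of the quadratic form nearly cancel (about 23.2 + 23.2 - 46.4), so the result is sensitive to rounding in the inputs. Agreement with the reported variance to about three significant digits is all that should be expected from the printed values.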
(f) For the model in part (e), compute

$$T_{\text{NM},0.5} = \frac{\log(\log(2)) - \hat\beta_0}{\hat\beta_1} = 28.31\ ^{\circ}\text{C}.$$

Using the delta method, the large sample standard error for this estimate is the square root of

$$\begin{bmatrix} 0.782595 & 22.15790 \end{bmatrix} \begin{bmatrix} 136.84 & -4.87178 \\ -4.87178 & 0.17356 \end{bmatrix} \begin{bmatrix} 0.782595 \\ 22.15790 \end{bmatrix} = .057494.$$

An approximate 95% confidence interval is 28.31 ± (1.96)(.23978) ⇒ (27.84, 28.78).

(g) Assuming that results for the Illinois eggs are completely independent of results for the New Mexico eggs,

$$\mathrm{Var}(T_{\text{Ill},0.5} - T_{\text{NM},0.5}) = \mathrm{Var}(T_{\text{Ill},0.5}) + \mathrm{Var}(T_{\text{NM},0.5})$$

and a test statistic that has an approximate standard normal distribution under the null hypothesis is

$$Z = \frac{28.31 - 27.71}{\sqrt{0.057494 + .0064585}} = 2.37$$

with p-value = .018. There is some indication that the temperature that produces 50% females is higher in New Mexico, although this test is based on a complementary log-log model that does not quite fit the data.

The analyses of the turtle egg data shown above were based on the assumption that each egg responds independently of any other egg. Each line of the original data file corresponds to a different box put into an incubator. The recorded temperature was taken from a thermometer inside the incubator, but temperatures may vary across different locations in an incubator, and the temperature in any particular box may have varied from the thermometer reading. Given the narrow temperature range in which the probability of females rapidly increases, a small deviation in the temperature inside a box could have a big effect on the proportion of female turtles emerging from the eggs in the box. Hence, results from the same box may exhibit positive correlation due to fluctuation in temperature within an incubator. How should you deal with this?

2. (a) Using $\pi_i$ to represent the probability that black medic is present, the estimated logistic regression model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = \underset{(0.4351)}{-1.154} + \underset{(0.1082)}{0.3652}\,(\text{mounds}).$$

Standard errors are reported in parentheses below the estimated coefficients.
In this case, we cannot reliably use large sample chi-square approximations for the null distributions of $G^2$ and $X^2$ tests of the fit of this model against the general alternative.

(b) Gamma = 0.796 indicates that this model assigns higher probabilities to most of the cases where black medic is present than to cases where it is absent. In this sense, the model in part (a) seems to be a reasonable approximation.

(c) The value of the Hosmer-Lemeshow goodness-of-fit test is 39.76 with 5 d.f. and p-value < .0001. In this case 7 categories are constructed:

                     Presence               Absence
    Group   Total    Observed   Expected    Observed   Expected
      1      21         2         5.03         19        15.97
      2       6         3         2.21          3         3.79
      3       8         6         4.16          2         3.84
      4       7         6         4.63          1         2.37
      5       6         4         4.62          2         1.38
      6       7         7         6.37          0         0.63
      7       9         8         8.97          1         0.03

The biggest absolute differences between the observed and expected counts occur in group 1, where there are no gopher mounds in the previous year and estimated probabilities of the presence of black medic are relatively low. The major contribution to the Hosmer-Lemeshow test statistic, however, comes from the presence of case 15 (where black medic is absent) in category 7 (where the model assigns very high probability to the presence of black medic). Note that $(1-.03)^2/(.03) = 31.36$ contributes the most to the Hosmer-Lemeshow test. The Hosmer-Lemeshow test is sensitive to the existence of a single case where the actual outcome does not match the predicted probability.

Several diagnostic measures (c, cbar, difdev, difchisq, and the dfbeta values for both the intercept and mound effect) indicate that case 15 might be an outlier. It is not a high leverage case. Black medic was not present, but the model gives a very high probability to the presence of black medic because there were 16 gopher mounds present in the previous year. Black medic was present in all other cases where at least 8 gopher mounds were present in the previous year. This is a valid data point, however, and it cannot be simply thrown away.
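The way a single cell dominates the Hosmer-Lemeshow statistic in part (c) can be seen by summing the Pearson-type terms $(O - E)^2/E$ over the 14 cells of the seven-group table. The sketch below uses the counts as printed. Because the expected counts are rounded to two decimals (the 0.03 in group 7 in particular), the total computed this way will only be close to the reported 39.76, not equal to it.

```python
# Observed and expected counts by group, copied from the seven-group table in part (c).
presence_obs = [2, 3, 6, 6, 4, 7, 8]
presence_exp = [5.03, 2.21, 4.16, 4.63, 4.62, 6.37, 8.97]
absence_obs = [19, 3, 2, 1, 2, 0, 1]
absence_exp = [15.97, 3.79, 3.84, 2.37, 1.38, 0.63, 0.03]

def pearson_terms(observed, expected):
    """Cell-by-cell contributions (O - E)^2 / E."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]

presence_terms = pearson_terms(presence_obs, presence_exp)
absence_terms = pearson_terms(absence_obs, absence_exp)

hl_stat = sum(presence_terms) + sum(absence_terms)
largest_term = max(presence_terms + absence_terms)

print(round(hl_stat, 2))       # approximate total from rounded expected counts
print(round(largest_term, 2))  # contribution of the group 7 absence cell
```

The group 7 absence cell, which contains case 15, accounts for most of the total, which is the point made in the discussion above.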
If you use $\pi_i$ to represent the probability that black medic is absent, the estimated logistic regression model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = \underset{(0.4351)}{1.154} - \underset{(0.1082)}{0.3652}\,(\text{mounds}).$$

This is the same model, but the value of the Hosmer-Lemeshow test (goodness-of-fit statistic = 11.364 with 5 d.f. and p-value = 0.0446) is different because slightly different categories are made:

                     Absence                Presence
    Group   Total    Observed   Expected    Observed   Expected
      1       6         0         0.00          6         6.00
      2       6         1         0.19          5         5.81
      3       7         2         1.06          5         5.94
      4      10         1         3.15          9         6.85
      5       8         2         3.84          6         4.16
      6       6         3         3.79          3         2.21
      7      21        19        15.97          2         5.03

Now case 15 is in group 2. Note that $(1-.19)^2/(.19) = 3.45$ contributes the most to the Hosmer-Lemeshow test.

3. (a) Using $\pi_i$ to represent the probability that black medic is present, the estimated logistic regression model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = \underset{(1.0634)}{-4.0389} + \underset{(0.1076)}{0.2790}\,(\text{mounds}) + \underset{(0.3136)}{1.0712}\,(\text{elevation}).$$

Standard errors are reported in parentheses below the estimated coefficients.

(b) $G^2$ = 61.857 - 44.174 = 17.683 with 1 d.f. and p-value = 0.000026. Adding the elevation term provides a significant improvement in the model.

(c) Gamma = 0.839

(d) As in problem 2, the value of the Hosmer-Lemeshow test depends on the definition of $\pi_i$. Using $\pi_i$ to represent the probability that black medic is present, the following categories are made:

                     Presence               Absence
    Group   Total    Observed   Expected    Observed   Expected
      1       6         0         0.23          6         5.77
      2       6         0         0.36          6         5.64
      3       6         1         0.69          5         5.31
      4       6         1         1.66          5         4.34
      5       6         4         2.74          2         3.26
      6       6         5         4.46          1         1.54
      7       6         5         5.02          1         0.98
      8       6         6         5.30          0         0.70
      9       6         4         5.61          2         0.39
     10       6         6         5.93          0         0.07
     11       4         4         4.00          0         0.00

Goodness-of-fit statistic = 10.319 with 9 d.f. and p-value = 0.3253. The inclusion of the linear elevation effect reduces the estimated probability for case 15 from .93 to .73. Case 15 is now in group 9. Note that $(2-.39)^2/(.39) = 7.20$ contributes the most to the Hosmer-Lemeshow test, but it is not enough to reject the fit of the model in this case.
Including the elevation term also improves the model by allowing it to give estimated probabilities closer to zero when there were no gopher mounds in the previous year.

There are no cases with high leverage that are cause for concern. Cases 1, 15, and 45 are identified as highly influential cases. Case 45 is one of two cases where black medic is present and there were no gopher mounds in the previous year. The other case (case 21) has a much higher elevation. Removing case 45 results in a smaller intercept, which allows the estimated probability of black medic to become closer to zero when mounds = 0 and elevation is low. Case 1 is the only high elevation case where black medic is absent. Removing case 1 results in an increase in the slope on elevation and a decrease in the intercept. Case 15 is the only case where there are more than 8 mounds and black medic is absent. Deleting this case results in a larger estimated slope for the mounds variable. All of these are valid data points and I would not remove any of these cases from the data. Knowing that the estimated coefficients are sensitive to the existence of a few cases in this data set, however, may affect your confidence in the estimated model and influence your decision about whether or not more data should be collected.

Using $\pi_i$ to represent the probability that black medic is absent, the following categories are made:

                     Absence                Presence
    Group   Total    Observed   Expected    Observed   Expected
      1       6         0         0.01          6         5.99
      2       6         1         0.17          5         5.83
      3       6         1         0.47          5         5.53
      4       6         0         0.82          6         5.18
      5       6         1         1.14          5         4.86
      6       6         1         2.00          5         4.00
      7       6         3         3.68          3         2.32
      8       7         6         5.60          1         1.40
      9       6         6         5.51          0         0.49
     10       9         9         8.61          0         0.39

Goodness-of-fit statistic = 7.8805 with 8 d.f. and p-value = 0.4452. Case 15 is in group 2, and $(1-.17)^2/(.17) = 4.05$ contributes the most to the Hosmer-Lemeshow statistic.

4.
(a) Using $\pi_i$ to represent the probability that black medic is present, the estimated logistic regression model is

$$\log\left(\frac{\hat\pi_i}{1-\hat\pi_i}\right) = \underset{(1.1690)}{-3.6062} + \underset{(0.2903)}{0.0937}\,(\text{mounds}) + \underset{(0.3995)}{0.8775}\,(\text{elevation}) + \underset{(0.1167)}{0.0770}\,(\text{mounds})(\text{elevation}).$$

Standard errors are reported in parentheses below the estimated coefficients.

(b) $G^2$ = 44.1740 - 43.7143 = 0.4597 with 1 d.f. and p-value = 0.498. Adding the interaction term does not provide a significant improvement in the model.

(c) The AIC and SC values are:

    Model        AIC      SC      Gamma
    Problem 2   65.86    70.17    0.796
    Problem 3   50.17    56.65    0.839
    Problem 4   51.71    60.35    0.841

These values indicate that the model from problem 3 is an improvement over the model from problem 2, and the model from problem 3 is essentially as good as the model from problem 4. Note that the value of gamma steadily increases as the model is made more complex.

(d) Predicted probabilities tend to be nearly the same for the models from problems 3 and 4. For many cases with zero gopher mounds in the previous year, where black medic tends to be absent, the model from problem 3 tends to provide probabilities much closer to zero than the model from problem 2. Standard errors and lengths of confidence intervals for estimated probabilities can be much larger for the model from problem 4 than for the model from problem 3. This is the consequence of adding an insignificant interaction term to the model.

(e) Some diagnostic results were described above. It appears that the model from problem 3 provides an adequate description, but values of parameter estimates are sensitive to the presence or absence of cases 1, 15, or 45.
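The AIC and SC values in problem 4(c) and the likelihood ratio p-values in problems 3(b) and 4(b) can be reproduced from the -2 log-likelihood values quoted in the text. The sketch below rests on two assumptions that are inferred rather than stated in the solutions: the values 61.857, 44.174 and 43.7143 are read as -2 log L for the models from problems 2, 3 and 4, and the sample size is taken to be n = 64, the sum of the group totals in the Hosmer-Lemeshow tables. The function names are hypothetical.

```python
import math

N_OBS = 64  # assumption: sum of the group totals in the Hosmer-Lemeshow tables

# (-2 log L, number of estimated parameters), read from problems 3(b) and 4(b)
models = {
    "Problem 2": (61.857, 2),
    "Problem 3": (44.174, 3),
    "Problem 4": (43.7143, 4),
}

def aic(neg2loglik, n_params):
    """Akaike information criterion: -2 log L + 2p."""
    return neg2loglik + 2 * n_params

def sc(neg2loglik, n_params, n_obs):
    """Schwarz criterion: -2 log L + p log(n)."""
    return neg2loglik + n_params * math.log(n_obs)

def lr_pvalue_1df(g2):
    """Upper-tail chi-square probability with 1 d.f., via the normal tail."""
    return math.erfc(math.sqrt(g2 / 2.0))

for name, (neg2loglik, p) in models.items():
    print(name, round(aic(neg2loglik, p), 2), round(sc(neg2loglik, p, N_OBS), 2))

g2_elevation = models["Problem 2"][0] - models["Problem 3"][0]    # problem 3(b)
g2_interaction = models["Problem 3"][0] - models["Problem 4"][0]  # problem 4(b)
print(lr_pvalue_1df(g2_elevation), lr_pvalue_1df(g2_interaction))
```

Under these assumptions the computed criteria agree with the table to two decimals. The SC penalty of log(64) per parameter is what separates problem 4 from problem 3 more sharply than AIC does.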