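Stat 557 Fall 2002
Assignment 5 Solutions

1. (a) The $Y_{ij}$'s are independent with $Y_{ij} \sim \text{Poisson}(m_i)$, so the log-likelihood is

$$\ell = \log\left( \prod_{i=1}^{5} \prod_{j=1}^{3} \frac{e^{-m_i} m_i^{Y_{ij}}}{Y_{ij}!} \right) = -3 \sum_{i=1}^{5} e^{\beta_0 + \beta_1 x_i} + \sum_{i=1}^{5} Y_{i+} (\beta_0 + \beta_1 x_i) - \sum_{i=1}^{5} \sum_{j=1}^{3} \log(Y_{ij}!)$$

(b) The mle's and corresponding standard errors are

$$\hat\beta_0 = 2.878, \quad s_{\hat\beta_0} = 0.108$$
$$\hat\beta_1 = 0.3479, \quad s_{\hat\beta_1} = 0.0593$$

In this case, $\exp(\hat\beta_0) = 17.78$ is an estimate of the mean number of colonies of TA98 salmonella at a quinoline concentration of 1.0 mg per plate. The estimate $\exp(\hat\beta_1) = 1.416$ indicates that a 10-fold increase in the concentration of quinoline results in about a 41.6 percent increase in the mean number of TA98 salmonella colonies.

(c) The mean number of colonies when the log dose of quinoline is equal to $x$ is $m = e^{\beta_0 + \beta_1 x}$. Let $\beta = (\beta_0, \beta_1)^T$. Then,

$$\frac{\partial m}{\partial \beta} = \left( \frac{\partial m}{\partial \beta_0}, \frac{\partial m}{\partial \beta_1} \right)^T = \left( e^{\beta_0 + \beta_1 x}, \; x e^{\beta_0 + \beta_1 x} \right)^T = (m, mx)^T$$

By the $\delta$-method,

$$\text{var}(\hat m) \approx \left( \frac{\partial m}{\partial \beta} \right)^T \text{var}(\hat\beta) \left( \frac{\partial m}{\partial \beta} \right) = (m, mx) \begin{pmatrix} \text{var}(\hat\beta_0) & \text{cov}(\hat\beta_0, \hat\beta_1) \\ \text{cov}(\hat\beta_0, \hat\beta_1) & \text{var}(\hat\beta_1) \end{pmatrix} \begin{pmatrix} m \\ mx \end{pmatrix} = m^2 \left[ \text{var}(\hat\beta_0) + x^2 \text{var}(\hat\beta_1) + 2x \, \text{cov}(\hat\beta_0, \hat\beta_1) \right]$$

Then

$$S_{\hat m} = \hat m \sqrt{ \widehat{\text{var}}(\hat\beta_0) + x^2 \widehat{\text{var}}(\hat\beta_1) + 2x \, \widehat{\text{cov}}(\hat\beta_0, \hat\beta_1) }$$

From the estimated model, $\hat m = 38.22$ when $x = 2.2$, and

$$\widehat{\text{var}}(\hat\beta_0) = 0.01166, \quad \widehat{\text{var}}(\hat\beta_1) = 0.003518, \quad \widehat{\text{cov}}(\hat\beta_0, \hat\beta_1) = -0.005766.$$

Then $S_{\hat m} = 2.20$, and an approximate 95 percent confidence interval is

$$(\hat m - z_{0.025} S_{\hat m}, \; \hat m + z_{0.025} S_{\hat m}) = (33.91, 42.54)$$

Alternatively, you could first construct a confidence interval for the natural logarithm of the mean at $x = 2.2$. Compute $\log(\hat m) = 2.878 + (0.3479)(2.2) = 3.6434$, and

$$S_{\log(\hat m)} = \sqrt{ \widehat{\text{var}}(\hat\beta_0) + x^2 \widehat{\text{var}}(\hat\beta_1) + 2x \, \widehat{\text{cov}}(\hat\beta_0, \hat\beta_1) } = 0.05756.$$

Then

$$\log(\hat m) \pm (1.96) S_{\log(\hat m)} \Rightarrow (3.5306, 3.7562)$$

and an approximate 95 percent confidence interval for the mean count at $x = 2.2$ is

$$(\exp(3.5306), \exp(3.7562)) \Rightarrow (34.14, 42.79).$$

There is only a small difference between these two methods in this case because the observed counts are moderately large, but the second method would generally provide a more accurate coverage probability than the first method.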
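The two interval constructions in part (c) can be checked numerically. The following is a minimal sketch using only the estimates quoted above; the function name `mean_ci` and its argument names are illustrative choices, not part of the original solution, and $z_{0.025}$ is taken as 1.96.

```python
import math

def mean_ci(b0, b1, var_b0, var_b1, cov_b01, x, z=1.96):
    """Delta-method intervals for m = exp(b0 + b1*x).

    Returns (m_hat, se_m, se_log, direct_ci, log_scale_ci).
    """
    log_m = b0 + b1 * x
    m_hat = math.exp(log_m)
    # Standard error of log(m_hat): variance of the linear predictor b0 + b1*x.
    se_log = math.sqrt(var_b0 + x**2 * var_b1 + 2 * x * cov_b01)
    # Delta method: se(m_hat) = m_hat * se(log(m_hat)).
    se_m = m_hat * se_log
    # Method 1: symmetric interval on the original scale.
    direct_ci = (m_hat - z * se_m, m_hat + z * se_m)
    # Method 2: interval on the log scale, then exponentiate the endpoints.
    log_scale_ci = (math.exp(log_m - z * se_log), math.exp(log_m + z * se_log))
    return m_hat, se_m, se_log, direct_ci, log_scale_ci

# Poisson regression estimates from parts (b) and (c); note the negative covariance.
m_hat, se_m, se_log, direct_ci, log_scale_ci = mean_ci(
    b0=2.878, b1=0.3479,
    var_b0=0.01166, var_b1=0.003518, cov_b01=-0.005766,
    x=2.2,
)
print(f"m_hat = {m_hat:.2f}, S_m = {se_m:.2f}, S_log = {se_log:.5f}")
print("direct interval:    (%.2f, %.2f)" % direct_ci)
print("log-scale interval: (%.2f, %.2f)" % log_scale_ci)
```

Small differences in the last digit relative to the values reported above are expected, since the inputs here are the rounded estimates.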
The GENMOD procedure in SAS uses the second method to compute confidence intervals for mean responses.

(d) The deviance test of the model satisfying $\log(m_{ij}) = \beta_0 + \beta_1 x_i$ against the model of independent Poisson counts with a different mean at each level of quinoline is $G^2 = 0.27$, with $5 - 2 = 3$ degrees of freedom and p-value = 0.956. Assuming independent Poisson counts, the proposed Poisson regression model cannot be rejected.

You could also consider a second test. The deviance test of the model of independent Poisson counts with a different mean at each level of quinoline against the more general alternative that the fifteen counts can all have different means is $G^2 = 16.98$, with $15 - 5 = 10$ degrees of freedom and p-value = 0.075. This alternative implies that the three counts obtained at each concentration of quinoline did not come from exact replications of the same experiment.

The sum of the $G^2$ values for the previous two tests provides a deviance test of the model satisfying $\log(m_{ij}) = \beta_0 + \beta_1 x_i$ against the alternative of fifteen independent Poisson counts with potentially fifteen different means: $G^2 = 0.27 + 16.98 = 17.25$ with $15 - 2 = 13$ degrees of freedom and p-value = 0.19. Here the null hypothesis is not rejected. If you did reject the fit of the model with this test, you would not know whether the model was rejected because $\log(m_{ij}) = \beta_0 + \beta_1 x_i$ did not provide an adequate description of the trend in the means across the quinoline concentrations, or because there were some uncontrolled background factors that prevented the three experiments at each concentration of quinoline from being exact replicates of each other.

(e) Maximum likelihood estimates and corresponding standard errors for the negative binomial model are

$$\hat\beta_0 = 2.8782, \quad s_{\hat\beta_0} = 0.2048$$
$$\hat\beta_1 = 0.3478, \quad s_{\hat\beta_1} = 0.1295$$
$$\text{dispersion estimate} = 0.0055, \quad \text{standard error} = 0.0277$$

The estimate of the dispersion parameter is much smaller than its standard error. Also, an approximate 95% confidence interval for the dispersion parameter is (0.0000, 104.9256). Hence,
a zero value for the dispersion parameter is consistent with these data, and it appears that a Poisson regression model is adequate.

(f) From the estimated model, $\log(\hat m) = 2.8782 + (0.3478)(2.2) = 3.643$ and $\hat m = 38.22$ when $x = 2.2$, and $\widehat{\text{var}}(\hat\beta_0) = 0.04196$, $\widehat{\text{var}}(\hat\beta_1) = 0.01678$, $\widehat{\text{cov}}(\hat\beta_0, \hat\beta_1) = -0.02569$. Similar to part (c),

$$S_{\log(\hat m)} = \sqrt{ \widehat{\text{var}}(\hat\beta_0) + x^2 \widehat{\text{var}}(\hat\beta_1) + 2x \, \widehat{\text{cov}}(\hat\beta_0, \hat\beta_1) } = 0.1007$$

and

$$S_{\hat m} = \hat m \sqrt{ \widehat{\text{var}}(\hat\beta_0) + x^2 \widehat{\text{var}}(\hat\beta_1) + 2x \, \widehat{\text{cov}}(\hat\beta_0, \hat\beta_1) } = 3.85$$

An approximate 95% confidence interval for the mean number of colonies when the log dose of quinoline equals 2.2 is

$$(\hat m - z_{0.025} S_{\hat m}, \; \hat m + z_{0.025} S_{\hat m}) = (30.67, 45.77)$$

An approximate 95% confidence interval with better coverage probability is constructed by evaluating

$$\log(\hat m) \pm (1.96) S_{\log(\hat m)} \Rightarrow (3.4456, 3.840)$$

and transforming back to the original scale:

$$(\exp(3.4456), \exp(3.840)) \Rightarrow (31.36, 46.54).$$

Confidence intervals based on the negative binomial regression model are wider than those based on the corresponding Poisson regression model, in part because the negative binomial regression model allows for more variation in the observed counts.

(g) Since the proposed model seems to be adequate, there is no need to search for a better model.

2. (a) The test results are in the following table. The p-value for the $X^2$ test is borderline; this test may be more reliable here than the $G^2$ test because of some small counts in the table. Although that p-value is slightly higher than .05, we might want to look for a more appropriate model.

    stat       X^2        G^2
    value      16.4680    18.8155
    d.f.       9          9
    p-value    .0577      .0268

(b) (i) Vidmar's statement implies that the probability of a not guilty response should be higher for situation A than for any other situation, and the probability of a not guilty verdict should be higher for situations B and D than for situations C, E, F, G. We want to determine if the data support this statement.
A simple thing to do is to make two 2x2 tables as shown below and use a one-sided Fisher exact test of the null hypothesis that the probabilities of not guilty are the same for the two columns of each table.

    Table 1
                  A      B & D
    guilty        11     44
    not guilty    13     4

    Table 2
                  B & D  C, E, F & G
    guilty        44     91
    not guilty    4      5

For Table 1, the p-value is less than .0001, so the null hypothesis is rejected; i.e., the probability of a not guilty response is higher for situation A than for situations B and D. For Table 2, the p-value turns out to be 0.241, which does not agree with Vidmar's statement.

(ii) Quasi-independence could be true without implying Vidmar's statement. Also, Vidmar's statement could be true without implying quasi-independence. Here is a table of probabilities that satisfy quasi-independence but whose first three columns do not conform to Vidmar's hypothesis.

                     A      B      C      D      E      F      G
    First degree     3/7    -      -      1/3    3/8    -      .3
    Second degree    -      1/3    -      2/9    -      2/7    .2
    Manslaughter     -      -      .2     -      1/8    1/7    .1
    Not guilty       4/7    2/3    4/5    4/9    1/2    4/7    .4

Here is a table of probabilities that conform to Vidmar's hypothesis but do not satisfy quasi-independence.

                     A      B      C      D      E      F      G
    First degree     .1     -      -      .1     .1     -      .1
    Second degree    -      .4     -      .3     -      .2     .3
    Manslaughter     -      -      .9     -      .8     .7     .5
    Not guilty       .9     .6     .1     .6     .1     .1     .1

3. (a) The fit of the symmetry model is tested as follows. Both tests indicate the model does not fit well.

    stat       X^2      G^2
    value      80.03    79.71
    d.f.       10       10
    p-value    .0000    .0000

Checking the residuals, it turns out that most residuals in the upper triangle are negative and most in the lower triangle are positive, indicating that when the fathers are not of the same status, the woman's father more frequently has higher status than the husband's father. Thus, the symmetry model does not seem to fit well.
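The symmetry test in 3(a) compares each off-diagonal pair $(n_{ij}, n_{ji})$ with its average, giving $I(I-1)/2$ degrees of freedom (10 for a 5x5 table). The sketch below shows one way to compute $X^2$, $G^2$ and a chi-square p-value with the Python standard library only. The 3x3 table of counts is purely hypothetical and is not the assignment data; the names `chi2_sf` and `symmetry_test` are illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi-square_df > x) for a positive integer df (closed-form series)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_r x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * x / math.pi) * math.exp(-half) * total

def symmetry_test(table):
    """Pearson X^2, deviance G^2 and d.f. for H0: m_ij = m_ji in a square table."""
    size = len(table)
    x2 = g2 = 0.0
    for i in range(size):
        for j in range(i + 1, size):
            a, b = table[i][j], table[j][i]
            if a + b == 0:
                continue  # an empty pair contributes nothing to either statistic
            expected = (a + b) / 2.0  # fitted count for both cells under symmetry
            x2 += (a - b) ** 2 / (a + b)
            for n in (a, b):
                if n > 0:
                    g2 += 2 * n * math.log(n / expected)
    df = size * (size - 1) // 2
    return x2, g2, df

# Hypothetical counts for illustration only (NOT the data from the assignment).
demo = [[20, 10, 5],
        [4, 30, 8],
        [1, 6, 25]]
x2, g2, df = symmetry_test(demo)
p_x2 = chi2_sf(x2, df)
print(f"X2 = {x2:.3f}, G2 = {g2:.3f}, d.f. = {df}, p-value (X2) = {p_x2:.4f}")
```

With the statistics reported above, `chi2_sf(80.03, 10)` is far below .0001, in line with the tabled p-value of .0000.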
(b) This hypothesis could be tested with a Wald test, or by realizing that one has just two categories (above or below the main diagonal) that correspond to a binomial distribution with .5 for the probability of success when the null hypothesis is true. Note that both tests should yield the same result. Here the latter approach is shown. Let $\pi_{up}$ be the probability that women marry up into a higher class. Test the hypothesis

$$H_0: \pi_{up} = 0.5 \quad \text{vs.} \quad H_A: \pi_{up} \neq 0.5$$

by the test statistic

$$z = \frac{\hat p - 0.50}{\sqrt{(0.5)(1 - 0.5)/N}} = \frac{4675/9627 - 0.50}{\sqrt{(0.5)(1 - 0.5)/9627}} = -2.823,$$

with p-value = 0.0048, so reject $H_0$. The probability of women marrying up is smaller than the probability of women marrying down. This result supports the conclusion in part (a).

(c) The test results are shown below.

    stat       X^2      G^2
    value      68.52    68.67
    d.f.       4        4
    p-value    .0000    .0000

Reject the hypothesis of marginal homogeneity; i.e., there are lower proportions of women than men in categories 2 and 5, and higher proportions of women than men in categories 1, 3, 4.

(d) The test for the fit of the quasi-independence model is shown as follows.

    stat       X^2       G^2
    value      344.12    377.43
    d.f.       11        11
    p-value    .0000     .0000

This model also does not fit well. Studentized residuals are shown below. The residuals exhibit some patterns: large positive residuals for the upper left corner and negative residuals for the upper right and lower left corners.

           I        II       III      IV       V
    I      -        0.76     -0.10    -0.24    -0.19
    II     0.57     -        -0.03    -0.14    -0.09
    III    -0.04    -0.08    -        0.05     0.01
    IV     -0.21    -0.11    0.10     -        0.12
    V      -0.20    -0.18    -0.07    0.22     -

There is a greater tendency to either stay within the two upper classes or stay within the three lowest classes, and less marriage between the upper two classes and lower three classes than the quasi-independence model would suggest.

(e) One plausible model fits different quasi-independence models to the upper and lower triangles, which gives the test results

    stat       X^2       G^2
    value      6.454     6.486
    d.f.       6         6
    p-value    0.3743    0.3710

No obvious pattern is observed in the residuals of this model. This suggests that, given that a woman marries a man from a higher class, the increase in status is independent of the woman's father's occupational status. Similarly, given that a woman marries a man from a lower class, the decrease in status is independent of the woman's father's occupational status.

Another popular model among students splits the table into three parts based on the residuals in (d): Group 1 for the upper left 2x2 table, Group 2 for the lower right 3x3 table, and Group 3 for the transitions between groups 1 and 2. They fit a symmetry model for Group 1, a quasi-independence model for Group 3, and the saturated model for Group 2. This gives a slightly better fit (p-value = .6810 with 5 d.f.).

Most of the other students gave the quasi-symmetry model as an appropriate model, validating their conclusion by presenting the $G^2$ (or $X^2$) value (11.5) with 12 d.f., which yields an inflated p-value close to 0.50. The degrees of freedom should be 6 instead of 12. These students may have fit the quasi-symmetry model by forming a 3-way table where the first layer is the original table and the second layer is the transpose of the original table. In that case, each cell in the original table is counted twice, and they should have divided the degrees of freedom reported by the program they were running by 2 to get 6.

4. (a) (i) Using the baseline restriction, the estimate of the intercept is $\hat\lambda = 4.9264$, and this is also an estimate for $\log(m_{222})$; i.e.,

$$\log(\hat m_{222}) = 4.926, \quad \hat m_{222} = 137.89$$

Since the standard error for $\hat\lambda$ is estimated as 0.1229, the standard error for $\hat m_{222}$ may be obtained through the delta method,

$$\text{s.e.}(\hat m_{222}) = \hat m_{222} \, \text{s.e.}(\hat\lambda) = 16.9464.$$

(ii) The following test results indicate the model does not fit well.

    stat    value    d.f.    p-value
    G^2     58.35    3       .0000
    X^2     51.06    3       .0000

(b) (i) Using the baseline restriction, $\hat\lambda = 4.9198$, and this is also an estimate for $\log(m_{222})$.
Therefore,

$$\log(\hat m_{222}) = 4.920, \quad \hat m_{222} = 136.98$$

Since the standard error for $\hat\lambda$ is estimated as 0.6248, the standard error for $\hat m_{222}$ may be obtained through the delta method,

$$\text{s.e.}(\hat m_{222}) = \hat m_{222} \, \text{s.e.}(\hat\lambda) = 85.5913.$$

Making the model more complex greatly inflates the standard error of the estimate of $m_{222}$.

(ii) We cannot test the fit of this model due to lack of degrees of freedom. The 7 parameters in this model are estimated from seven counts; thus, this is a saturated model.

(c) (i) Add each of $\lambda_{ij}^{BD}$, $\lambda_{ik}^{BM}$ and $\lambda_{jk}^{DM}$ to the independence model and check the fit of each model by $G^2$ or $X^2$ tests. It turns out that the most parsimonious model that provides an adequate fit to the data is

$$\log(m_{ijk}) = \lambda + \lambda_i^B + \lambda_j^D + \lambda_k^M + \lambda_{jk}^{DM}$$

This model allows for some association between reports from death certificates and reports from medical rehabilitation programs. In particular, the odds that a case is reported by a medical rehabilitation program are much higher when the case is not reported on a death certificate.

    stat    value    d.f.    p-value
    G^2     3.860    2       0.1452
    X^2     3.869    2       0.1445

The S-plus step procedure starting with the independence model in (a) ends up with the model

$$\log(m_{ijk}) = \lambda + \lambda_i^B + \lambda_j^D + \lambda_k^M + \lambda_{jk}^{DM} + \lambda_{ij}^{BD}$$

as the best model in terms of AIC.

(ii) In the first model in (i),

$$\log(\hat m_{222}) = 4.6533, \quad \hat m_{222} = 104.93,$$

with standard error

$$\hat m_{222} \, \text{s.e.}(\hat\lambda) = 13.5358.$$

(d) Use the first model in part (c). An estimated number of cases of spina bifida in New York between 1969 and 1974 is $\hat Y_+ = 626 + 104.93 = 730.93$.
Then, the rate of spina bifida cases per 1000 live births may be estimated as

$$\hat p = 1000 \, \hat Y_+ / N = 1000 \, (730.93 / 863143) = 0.8468.$$

Using the result in (c), the standard error for this estimate is

$$\text{s.e.}(\hat p) = 1000 \, \text{s.e.}(\hat Y_+) / N = 1000 \, \text{s.e.}(\hat m_{222}) / N = 0.01568,$$

and thus an approximate 95% confidence interval for $p$ is

$$\hat p \pm 1.96 \, \text{s.e.}(\hat p) = (0.8161, 0.8776).$$

Similarly, for the second model in (c), the estimated rate is 0.8786, and an approximate confidence interval for the rate is (0.8268, 0.9303).

Some students computed the standard error of $\hat p$ using that of a binomial proportion, $\sqrt{\hat p (1 - \hat p)/N}$, which gives a much narrower confidence interval than the intervals above. The binomial distribution is not appropriate in this situation, because the count was obtained as a prediction from a model. You must account for the variation in the parameter estimates involved in the prediction.
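The delta-method steps in problem 4 are short enough to verify directly. The sketch below reproduces the cell estimate from 4(a)(i) and the rate interval in 4(d) from the numbers quoted in the text. The helper names `cell_estimate` and `rate_ci` are illustrative, and 626 is treated as a fixed observed count, as the sum $\hat Y_+ = 626 + 104.93$ suggests.

```python
import math

def cell_estimate(log_m_hat, se_log):
    """Back-transform an estimate of log(m) and apply the delta method."""
    m_hat = math.exp(log_m_hat)
    # se(m_hat) = m_hat * se(log m_hat)
    return m_hat, m_hat * se_log

def rate_ci(observed, m_hat, se_m, births, per=1000, z=1.96):
    """Rate per `per` births with an approximate interval.

    Only the model-based cell estimate m_hat carries sampling error here;
    the observed count is treated as fixed.
    """
    total = observed + m_hat
    rate = per * total / births
    se_rate = per * se_m / births
    return rate, se_rate, (rate - z * se_rate, rate + z * se_rate)

# Part (a)(i): independence model, intercept estimate 4.9264 with s.e. 0.1229.
m_a, se_a = cell_estimate(4.9264, 0.1229)
print(f"(a) m222 = {m_a:.2f}, s.e. = {se_a:.4f}")

# Part (d): first model in (c), m222 = 104.93 with s.e. 13.5358, N = 863143 live births.
rate, se_rate, ci = rate_ci(observed=626, m_hat=104.93, se_m=13.5358, births=863143)
print(f"(d) rate = {rate:.4f}, s.e. = {se_rate:.5f}, interval = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Because the inputs are the rounded values printed in the solution, the output may differ from the reported figures in the last digit.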