Stat 557
Fall 2002
Assignment 5 Solutions
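1. (a) The $Y_{ij}$'s are independent and $Y_{ij} \sim \mathrm{Poisson}(m_i)$, where $\log(m_i) = \beta_0 + \beta_1 x_i$. So the log-likelihood is
\[
\ell = \log\left( \prod_{i=1}^{5} \prod_{j=1}^{3} \frac{e^{-m_i} m_i^{Y_{ij}}}{Y_{ij}!} \right)
= -3 \sum_{i=1}^{5} e^{\beta_0 + \beta_1 x_i} + \sum_{i=1}^{5} Y_{i+} (\beta_0 + \beta_1 x_i) - \sum_{i=1}^{5} \sum_{j=1}^{3} \log(Y_{ij}!),
\]
where $Y_{i+} = \sum_{j=1}^{3} Y_{ij}$.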
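As a rough numerical check on the simplified form of the log-likelihood, the sketch below evaluates it both ways: as a sum of log Poisson probabilities and through the collapsed expression in terms of the totals $Y_{i+}$. The log-dose values, the counts, the trial parameter values, and the function names are made-up placeholders for illustration; they are not the assignment data.

```python
import math

def loglik_simplified(beta0, beta1, x, counts):
    """Collapsed form: -n_i*exp(eta_i) + Y_i+ * eta_i - sum_j log(Y_ij!)."""
    total = 0.0
    for xi, row in zip(x, counts):
        eta = beta0 + beta1 * xi                      # eta_i = log(m_i)
        total += -len(row) * math.exp(eta) + sum(row) * eta
        total -= sum(math.lgamma(y + 1) for y in row)  # log(y!) = lgamma(y + 1)
    return total

def loglik_direct(beta0, beta1, x, counts):
    """Sum of log Poisson probabilities, one term per observation."""
    total = 0.0
    for xi, row in zip(x, counts):
        m = math.exp(beta0 + beta1 * xi)
        for y in row:
            total += -m + y * math.log(m) - math.lgamma(y + 1)
    return total

# Hypothetical data: 5 log-dose levels with 3 replicate counts each.
x = [0.0, 1.0, 1.5, 2.0, 2.5]
counts = [[17, 20, 15], [26, 22, 25], [30, 33, 28], [35, 40, 38], [44, 41, 47]]

a = loglik_simplified(2.9, 0.35, x, counts)
b = loglik_direct(2.9, 0.35, x, counts)
print(a, b)  # the two values should agree up to rounding error
```

With three replicates per dose, `len(row)` plays the role of the factor 3 in front of the first sum.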
(b) The mle's and corresponding standard errors are
\[
\hat\beta_0 = 2.878, \quad s_{\hat\beta_0} = 0.108; \qquad \hat\beta_1 = 0.3479, \quad s_{\hat\beta_1} = 0.0593 .
\]
In this case, $\exp(\hat\beta_0) = 17.78$ is an estimate of the mean number of colonies of TA98 salmonella at a quinoline concentration of 1.0 mg per plate. The estimate $\exp(\hat\beta_1) = 1.416$ indicates that a 10-fold increase in the concentration of quinoline results in about a 41.6 percent increase in the mean number of TA98 salmonella colonies.
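(c) The mean number of colonies when the log dose of quinoline is equal to $x$ is $m = e^{\beta_0 + \beta_1 x}$. Let $\beta = (\beta_0, \beta_1)^T$. Then,
\[
\frac{\partial m}{\partial \beta}
= \left( \frac{\partial m}{\partial \beta_0}, \frac{\partial m}{\partial \beta_1} \right)^T
= \left( e^{\beta_0 + \beta_1 x}, \; e^{\beta_0 + \beta_1 x} x \right)^T
= (m, mx)^T .
\]
By the $\delta$-method,
\[
\mathrm{var}(\hat m) \approx \left( \frac{\partial m}{\partial \beta} \right)^T \mathrm{var}(\hat\beta) \left( \frac{\partial m}{\partial \beta} \right)
= (m, mx)
\begin{pmatrix}
\mathrm{var}(\hat\beta_0) & \mathrm{cov}(\hat\beta_0, \hat\beta_1) \\
\mathrm{cov}(\hat\beta_0, \hat\beta_1) & \mathrm{var}(\hat\beta_1)
\end{pmatrix}
\begin{pmatrix} m \\ mx \end{pmatrix}
= m^2 \left[ \mathrm{var}(\hat\beta_0) + x^2 \mathrm{var}(\hat\beta_1) + 2x \, \mathrm{cov}(\hat\beta_0, \hat\beta_1) \right] .
\]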
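Then
\[
S_{\hat m} = \hat m \sqrt{ \widehat{\mathrm{var}}(\hat\beta_0) + x^2 \, \widehat{\mathrm{var}}(\hat\beta_1) + 2x \, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) } .
\]
From the estimated model, $\hat m = 38.22$ when $x = 2.2$, and
\[
\widehat{\mathrm{var}}(\hat\beta_0) = 0.01166, \qquad
\widehat{\mathrm{var}}(\hat\beta_1) = 0.003518, \qquad
\widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) = -0.005766 .
\]
Then, $S_{\hat m} = 2.20$, and an approximate 95 percent confidence interval is
\[
(\hat m - z_{0.025} S_{\hat m}, \; \hat m + z_{0.025} S_{\hat m}) = (33.91, 42.54) .
\]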
Alternatively, you could first construct a confidence interval for the natural logarithm of the mean at $x = 2.2$. Compute $\log(\hat m) = 2.878 + (0.3479)(2.2) = 3.6434$, and
\[
S_{\log(\hat m)} = \sqrt{ \widehat{\mathrm{var}}(\hat\beta_0) + x^2 \, \widehat{\mathrm{var}}(\hat\beta_1) + 2x \, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) } = 0.05756 .
\]
Then
\[
\log(\hat m) \pm (1.96) S_{\log(\hat m)} \;\Rightarrow\; (3.5306, \; 3.7562)
\]
and an approximate 95 percent confidence interval for the mean count at $x = 2.2$ is
\[
(\exp(3.5306), \; \exp(3.7562)) \;\Rightarrow\; (34.14, \; 42.79) .
\]
There is only a small difference between these two methods in this case because the observed counts are moderately large, but the second method would generally provide a more accurate coverage probability than the first method. The GENMOD procedure in SAS uses the second method to compute confidence intervals for mean responses.
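The two interval constructions in part (c) can be reproduced with a short sketch like the one below. The function name `mean_intervals` is hypothetical. The covariance is entered as negative, which is the sign that reproduces the reported standard errors (an assumption about a sign that may not be visible in the printed values).

```python
import math

def mean_intervals(beta0, beta1, v00, v11, v01, x, z=1.96):
    """Delta-method interval for m = exp(beta0 + beta1*x), computed two ways."""
    log_m = beta0 + beta1 * x
    m = math.exp(log_m)
    # Standard error of log(m-hat): the variance of a linear combination.
    se_log = math.sqrt(v00 + x ** 2 * v11 + 2 * x * v01)
    # Standard error of m-hat: the delta method multiplies by m.
    se_m = m * se_log
    direct = (m - z * se_m, m + z * se_m)
    via_log = (math.exp(log_m - z * se_log), math.exp(log_m + z * se_log))
    return m, se_log, se_m, direct, via_log

m, se_log, se_m, direct, via_log = mean_intervals(
    2.878, 0.3479, 0.01166, 0.003518, -0.005766, x=2.2
)
print(round(m, 2), round(se_m, 2))
print([round(v, 2) for v in direct])
print([round(v, 2) for v in via_log])
```

The interval built on the log scale is asymmetric around $\hat m$ and can never include negative values, which is one reason it tends to behave better for small means.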
(d) The deviance test of the model satisfying $\log(m_{ij}) = \beta_0 + \beta_1 X$ against the model of independent Poisson counts with a different mean at each level of quinoline is $G^2 = 0.27$, with $5 - 2 = 3$ degrees of freedom and p-value = 0.956. Assuming independent Poisson counts, the proposed Poisson regression model cannot be rejected.
You could also consider a second test. The deviance test of the model of independent Poisson counts with a different mean at each level of quinoline against the more general alternative that the fifteen counts can all have different means is $G^2 = 16.98$, with $15 - 5 = 10$ degrees of freedom and p-value = 0.075. This alternative implies that the three counts obtained at each concentration of quinoline did not come from exact replications of the same experiment.
The sum of the $G^2$ values for the previous two tests provides a deviance test of the model satisfying $\log(m_{ij}) = \beta_0 + \beta_1 X$ against the alternative of fifteen independent Poisson counts with potentially fifteen different means: $G^2 = 0.27 + 16.98 = 17.25$ with $15 - 2 = 13$ degrees of freedom and p-value = 0.19. Here the null hypothesis is not rejected. If you did reject the fit of the model with this test, you would not know whether the model was rejected because $\log(m_{ij}) = \beta_0 + \beta_1 X$ did not provide an adequate description of the trend in the means across the quinoline concentrations, or because there were some uncontrolled background factors that prevented the three experiments at each concentration of quinoline from being exact replicates of each other.
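The additivity of the deviance components and their degrees of freedom can be illustrated with a small sketch. The helper `chi2_sf` is a hypothetical name; it computes the upper tail of a chi-square distribution for integer degrees of freedom from standard closed-form series, using only the Python standard library. It is applied here to the replication test and to the combined test.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df > x) for a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in sqrt(x).
    root = math.sqrt(x)
    term = root
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total

g2_trend, df_trend = 0.27, 5 - 2      # regression model vs. five free means
g2_rep, df_rep = 16.98, 15 - 5        # five free means vs. fifteen free means
g2_total, df_total = g2_trend + g2_rep, df_trend + df_rep

p_rep = chi2_sf(g2_rep, df_rep)
p_total = chi2_sf(g2_total, df_total)
print(df_total, round(g2_total, 2), round(p_rep, 3), round(p_total, 2))
```

Splitting the overall test into its two components is what makes it possible to tell lack of fit in the trend apart from extra variation among replicates.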
(e) Maximum likelihood estimates and corresponding standard errors for the negative binomial model are
\[
\hat\beta_0 = 2.8782, \quad s_{\hat\beta_0} = 0.2048; \qquad \hat\beta_1 = 0.3478, \quad s_{\hat\beta_1} = 0.1295,
\]
and the estimate of the dispersion parameter is 0.0055 with standard error 0.0277.
The estimate of the dispersion parameter is much smaller than its standard error. Also, an approximate 95% confidence interval for the dispersion parameter is (0.0000, 104.9256). Hence, a zero value for the dispersion parameter is consistent with these data, and it appears that a Poisson regression model is adequate.
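(f) From the estimated model, $\log(\hat m) = 2.8782 + (0.3478)(2.2) = 3.643$ and $\hat m = 38.22$ when $x = 2.2$, and $\widehat{\mathrm{var}}(\hat\beta_0) = 0.04196$, $\widehat{\mathrm{var}}(\hat\beta_1) = 0.01678$, $\widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) = -0.02569$. Similar to part (c),
\[
S_{\log(\hat m)} = \sqrt{ \widehat{\mathrm{var}}(\hat\beta_0) + x^2 \, \widehat{\mathrm{var}}(\hat\beta_1) + 2x \, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) } = 0.1007
\]
and
\[
S_{\hat m} = \sqrt{ \hat m^2 \left[ \widehat{\mathrm{var}}(\hat\beta_0) + x^2 \, \widehat{\mathrm{var}}(\hat\beta_1) + 2x \, \widehat{\mathrm{cov}}(\hat\beta_0, \hat\beta_1) \right] } = 3.85 .
\]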
An approximate 95% confidence interval for the mean number of colonies when the log-dose of quinoline equals 2.2 is
\[
(\hat m - z_{0.025} S_{\hat m}, \; \hat m + z_{0.025} S_{\hat m}) = (38.22 - (1.96)(3.85), \; 38.22 + (1.96)(3.85)) = (30.67, 45.77) .
\]
An approximate 95% confidence interval with better coverage probability is constructed by evaluating
\[
\log(\hat m) \pm (1.96) S_{\log(\hat m)} \;\Rightarrow\; (3.4456, \; 3.840)
\]
and transforming back to the original scale:
\[
(\exp(3.4456), \; \exp(3.840)) \;\Rightarrow\; (31.36, \; 46.54) .
\]
Confidence intervals based on the negative binomial regression model are wider than those based on the corresponding Poisson regression model, in part because the negative binomial regression model allows for more variation in the observed counts.
(g) Since the proposed model seems to be adequate, there is no need to search for a better model.
2. (a) The test results are in the following table. The p-value for the $X^2$ test is borderline; this test might be more reliable here than the $G^2$ test because of some small counts in the table. Although that p-value is slightly higher than .05, we might want to look for a more appropriate model.

stat     value      d.f.   p-value
X^2      16.4680    9      .0577
G^2      18.8155    9      .0268
(b) (i) Vidmar's statement implies that the probability of a not guilty response should be higher for situation A than for any other situation, and the probability of a not guilty verdict should be higher for situations B and D than for situations C, E, F, G. We want to determine if the data support this statement.
A simple thing to do is to make two 2x2 tables as shown below and use a one-sided Fisher exact test of the null hypothesis that the probabilities of not guilty are the same for the two columns of each table.

Table 1
              A     B & D
guilty        11    44
not guilty    13    4

Table 2
              B & D   C, E, F & G
guilty        44      91
not guilty    4       5

For Table 1, the p-value is less than .0001, so the null hypothesis is rejected; i.e., the probability of a not guilty response is higher for situation A than for situations B and D.
For Table 2, the p-value is 0.241, so the null hypothesis is not rejected; the data do not support this part of Vidmar's statement.
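A one-sided Fisher exact test conditions on both margins of the 2x2 table and sums hypergeometric probabilities over tables at least as extreme as the one observed. The sketch below, with the hypothetical helper `fisher_one_sided`, illustrates the calculation on the Table 1 counts using only the standard library; statistical packages may define the tail slightly differently.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with both margins fixed.

    X is the count in the top-left cell; under the null hypothesis it is
    hypergeometric given the row and column totals.
    """
    row1 = a + b              # total for the first row
    col1 = a + c              # total for the first column
    n = a + b + c + d
    upper = min(row1, col1)   # largest possible top-left count
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(n, col1)

# Table 1, arranged so the top-left cell is "not guilty in situation A".
#              A    B & D
# not guilty   13   4
# guilty       11   44
p_table1 = fisher_one_sided(13, 4, 11, 44)
print(p_table1 < 0.0001)
```

Placing the not guilty count for situation A in the top-left cell makes the upper tail correspond to the one-sided alternative that the not guilty probability is higher in situation A.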
(ii) Quasi-independence could be true without implying Vidmar's statement. Also, Vidmar's statement could be true without implying quasi-independence. Here is a table of probabilities that satisfy quasi-independence, but the first three columns do not conform to Vidmar's hypothesis.

                 A     B     C     D     E     F     G
First degree     3/7   -     -     1/3   3/8   -     .3
Second degree    -     1/3   -     2/9   -     2/7   .2
Manslaughter     -     -     .2    -     1/8   1/7   .1
Not guilty       4/7   2/3   4/5   4/9   1/2   4/7   .4

Here is a table of probabilities that conform to Vidmar's hypothesis but do not satisfy quasi-independence.

                 A     B     C     D     E     F     G
First degree     .1    -     -     .1    .1    -     .1
Second degree    -     .4    -     .3    -     .2    .3
Manslaughter     -     -     .9    -     .8    .7    .5
Not guilty       .9    .6    .1    .6    .1    .1    .1
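One way to see why the first table satisfies quasi-independence is that every column can be generated from a single set of row weights, renormalized over the verdicts that appear in that column. The sketch below is only an illustration: the weights 3, 2, 1, 4 are read off column G, the cells without a probability are treated as verdicts not available in that situation (an assumption), and the names are hypothetical.

```python
from fractions import Fraction

# Row weights taken from column G (.3, .2, .1, .4), scaled to integers.
ROW_WEIGHTS = {"first": 3, "second": 2, "manslaughter": 1, "not guilty": 4}

# Verdicts with an entry in each column of the first table.
ALLOWED = {
    "A": ["first", "not guilty"],
    "B": ["second", "not guilty"],
    "C": ["manslaughter", "not guilty"],
    "D": ["first", "second", "not guilty"],
    "E": ["first", "manslaughter", "not guilty"],
    "F": ["second", "manslaughter", "not guilty"],
    "G": ["first", "second", "manslaughter", "not guilty"],
}

def quasi_independent_table(weights, allowed):
    """Column probabilities proportional to common row weights on allowed cells."""
    table = {}
    for situation, verdicts in allowed.items():
        total = sum(weights[v] for v in verdicts)
        table[situation] = {v: Fraction(weights[v], total) for v in verdicts}
    return table

table = quasi_independent_table(ROW_WEIGHTS, ALLOWED)
not_guilty = {s: table[s]["not guilty"] for s in table}
print(not_guilty["A"], not_guilty["B"], not_guilty["C"])
```

The not guilty probabilities for A, B, and C come out increasing, which is the opposite of the ordering that Vidmar's statement requires for situation A.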
3. (a) The fit of the symmetry model is tested as follows. Both tests indicate the model does not fit well.

stat     value    d.f.   p-value
X^2      80.03    10     .0000
G^2      79.71    10     .0000

Checking the residuals, it turns out that most residuals in the upper triangle are negative and most in the lower triangle are positive, indicating that when the fathers are not of the same status, the woman's father more frequently has higher status than the husband's father. Thus, the symmetry model does not seem to fit well.
(b) This hypothesis could be tested with a Wald test, or by realizing that one has just two categories (above or below the main diagonal) that correspond to a binomial distribution with .5 for the probability of success when the null hypothesis is true. Note that both tests should yield the same result. Here the latter approach is shown.
Let $\pi_{up}$ be the probability that women marry up into a higher class. Test the hypothesis
\[
H_0 : \pi_{up} = 0.5 \quad \text{vs.} \quad H_A : \pi_{up} \neq 0.5
\]
by the test statistic
\[
z = \frac{\hat\pi_{up} - 0.50}{\sqrt{\dfrac{(0.5)(1 - 0.5)}{N}}}
= \frac{4675/9627 - 0.50}{\sqrt{\dfrac{(0.5)(1 - 0.5)}{9627}}}
= -2.823,
\]
with p-value = 0.0048 $\Rightarrow$ Reject $H_0$. The probability of women marrying up is smaller than the probability of women marrying down. This result supports the conclusion in part (a).
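The binomial z test above is easy to reproduce with the standard library. The sketch below uses a hypothetical helper name, takes the standard error under the null hypothesis, and obtains the two-sided p-value from the complementary error function.

```python
import math

def binomial_z_test(successes, n, p0=0.5):
    """Two-sided z test of H0: p = p0, with the standard error evaluated at p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 4675 of the 9627 off-diagonal couples fall in the "marry up" category.
z, p_value = binomial_z_test(4675, 9627)
print(round(z, 3), round(p_value, 4))
```

The negative sign of `z` reflects an observed proportion below one half, which is the direction described in the conclusion.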
(c) The test results are shown below.

stat     value    d.f.   p-value
X^2      68.52    4      .0000
G^2      68.67    4      .0000

Reject the hypothesis of marginal homogeneity; i.e., there are lower proportions of women than men in categories 2 and 5, and higher proportions of women than men in categories 1, 3, 4.
(d) The test for the fit of the quasi-independence model is shown as follows.

stat     value     d.f.   p-value
X^2      344.12    11     .0000
G^2      377.43    11     .0000

This model also does not fit well. Studentized residuals are shown below. The residuals exhibit some patterns: large positive residuals for the upper left corner and negative residuals for the upper right and lower left corners.

        I       II      III     IV      V
I       -       0.76    -0.10   -0.24   -0.19
II      0.57    -       -0.03   -0.14   -0.09
III     -0.04   -0.08   -       0.05    0.01
IV      -0.21   -0.11   0.10    -       0.12
V       -0.20   -0.18   -0.07   0.22    -

There is a greater tendency to either stay within the two upper classes or stay within the three lowest classes, and less marriage between the upper two classes and the lower three classes, than the quasi-independence model would suggest.
(e) One plausible model fits different quasi-independence models to the upper and lower triangles, which gives the test results

stat     value    d.f.   p-value
X^2      6.454    6      0.3743
G^2      6.486    6      0.3710

No obvious pattern is observed in the residuals of this model. This suggests that given that a woman marries a man from a higher class, the increase in status is independent of the woman's father's occupational status. Similarly, given that a woman marries a man from a lower class, the decrease in status is independent of the woman's father's occupational status.
Another popular model among students is splitting the table into three parts based on the residuals in (d): Group 1 for the upper left 2x2 table, Group 2 for the lower right 3x3 corner, and Group 3 for the transitions between groups 1 and 2. They fit a symmetry model for Group 1, a quasi-independence model for Group 3, and the saturated model for Group 2. This would give a slightly better fit (p-value = .6810 with 5 d.f.).
Most of the other students gave the quasi-symmetry model as an appropriate model, validating their conclusion by presenting the $G^2$ (or $X^2$) value (11.5) with 12 d.f., which yields an inflated p-value close to 0.50. The degrees of freedom should be 6 instead of 12. These students may have fit the quasi-symmetry model by forming a 3-way table where the first layer is the original table and the second layer is the transpose of the original table. In that case, each cell in the original table is counted twice and they should have divided the degrees of freedom reported by the program they were running by 2 to get 6.
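To see how much the doubled degrees of freedom matter, the sketch below evaluates the chi-square upper tail for the reported statistic of 11.5 at 12 and at 6 degrees of freedom. It uses the closed form that holds for even degrees of freedom; `chi2_sf_even` is a hypothetical helper name.

```python
import math

def chi2_sf_even(x, df):
    """P(chi-square_df > x) for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

stat = 11.5
p_reported = chi2_sf_even(stat, 12)   # d.f. as printed by the program
p_corrected = chi2_sf_even(stat, 6)   # d.f. after dividing by 2
print(round(p_reported, 3), round(p_corrected, 3))
```

With 12 degrees of freedom the p-value is close to 0.49, whereas with 6 degrees of freedom the same statistic gives a p-value of roughly 0.07, so the conclusion about fit is far less comfortable once the degrees of freedom are corrected.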
4. (a) (i) Using the baseline restriction, the estimate of the intercept is $\hat\lambda = 4.9264$ and this is also an estimate for $\log(m_{222})$; i.e.,
\[
\log(\hat m_{222}) = 4.926, \qquad \hat m_{222} = 137.89 .
\]
Since the standard error for $\hat\lambda$ is estimated as 0.1229, the standard error for $\hat m_{222}$ may be obtained through the delta method,
\[
\mathrm{s.e.}(\hat m_{222}) = \hat m_{222} \, \mathrm{s.e.}(\hat\lambda) = 16.9464 .
\]
(ii) The following test results indicate the model does not fit well.

stat     value    d.f.   p-value
G^2      58.35    3      .0000
X^2      51.06    3      .0000
(b) (i) Using the baseline restriction, $\hat\lambda = 4.9198$ and this is also an estimate for $\log(m_{222})$. Therefore,
\[
\log(\hat m_{222}) = 4.920, \qquad \hat m_{222} = 136.98 .
\]
Since the standard error for $\hat\lambda$ is estimated as 0.6248, the standard error for $\hat m_{222}$ may be obtained through the delta method,
\[
\mathrm{s.e.}(\hat m_{222}) = \hat m_{222} \, \mathrm{s.e.}(\hat\lambda) = 85.5913 .
\]
Making the model more complex greatly inflates the standard error of the estimate of $m_{222}$.
(ii) We cannot test the fit of this model due to lack of degrees of freedom. The 7 parameters in this model are estimated from seven counts; thus, this is a saturated model.
(c) (i) Add each of $\lambda^{BD}_{ij}$, $\lambda^{BM}_{ik}$ and $\lambda^{DM}_{jk}$ to the independence model and check the fit of each model by $G^2$ or $X^2$ tests. It turns out that the most parsimonious model that provides an adequate fit to the data is
\[
\log(m_{ijk}) = \lambda + \lambda^{B}_{i} + \lambda^{D}_{j} + \lambda^{M}_{k} + \lambda^{DM}_{jk} .
\]
This model allows for some association between reports from death certificates and reports from medical rehabilitation programs. In particular, the odds that a case is reported by a medical rehabilitation program are much higher when the case is not reported on a death certificate.

stat     value    d.f.   p-value
G^2      3.860    2      0.1452
X^2      3.869    2      0.1445

The S-plus step procedure starting with the independence model in (a) ends up with the model
\[
\log(m_{ijk}) = \lambda + \lambda^{B}_{i} + \lambda^{D}_{j} + \lambda^{M}_{k} + \lambda^{DM}_{jk} + \lambda^{BD}_{ij}
\]
as the best model in terms of AIC.
(ii) In the first model in (i),
\[
\log(\hat m_{222}) = 4.6533, \qquad \hat m_{222} = 104.93,
\]
with standard error
\[
\hat m_{222} \, \mathrm{s.e.}(\hat\lambda) = 13.5358 .
\]
(d) Use the first model in part (c). An estimated number of cases of spina bifida in New York between 1969 and 1974 is $\hat Y_{+} = 626 + 104.93 = 730.93$. Then, the rate of spina bifida cases per 1000 live births may be estimated as
\[
\hat p = 1000 \, \hat Y_{+} / N = 1000 \, (730.93 / 863143) = 0.8468 .
\]
Using the result in (c), the standard error for this estimate is given by
\[
\mathrm{s.e.}(\hat p) = 1000 \, \mathrm{s.e.}(\hat Y_{+}) / N = 1000 \, \mathrm{s.e.}(\hat m_{222}) / N = 0.01568,
\]
and thus an approximate confidence interval for $p$ is
\[
\hat p \pm 1.96 \, \mathrm{s.e.}(\hat p) = (0.8161, \; 0.8776) .
\]
Similarly, for the second model in (c), the estimated rate is 0.8786, and an approximate confidence interval for the rate is computed as (0.8268, 0.9303).
Some students computed the standard error of $\hat p$ using that of a binomial proportion, $\sqrt{\hat p (1 - \hat p)/N}$, which gives a much smaller CI than the above intervals. The binomial distribution is not appropriate in this situation, because the count was obtained as a prediction from a model. You must account for the variation in the parameter estimates involved in the prediction.
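The rate calculation in part (d) for the first model can be sketched as follows. The function name is hypothetical, and the sketch treats the 626 observed cases as fixed, so that all of the uncertainty in the estimated total comes from the predicted count for the unobserved cell, as in the solution above.

```python
def rate_interval(observed, m_hat, se_m_hat, births, per=1000, z=1.96):
    """Rate per `per` births with a normal-approximation interval."""
    rate = per * (observed + m_hat) / births
    # Only the predicted count for the unobserved cell contributes variance.
    se_rate = per * se_m_hat / births
    return rate, se_rate, (rate - z * se_rate, rate + z * se_rate)

rate, se_rate, ci = rate_interval(
    observed=626, m_hat=104.93, se_m_hat=13.5358, births=863143
)
print(round(rate, 4), round(se_rate, 5), [round(v, 4) for v in ci])
```

Swapping in the estimate and standard error of $\hat m_{222}$ from another fitted model gives the corresponding interval for that model.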