Stat 557
Fall 2000
Assignment 3 Solutions
Problem 1
                  Gamma                      95% confidence interval
          estimate   standard error       lower limit     upper limit
Females     0.236        0.0385               0.160           0.311
Males      -0.125        0.0327              -0.189          -0.061
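The intervals in the table are the usual large-sample Wald intervals, $\hat\gamma \pm z_{0.975}\,SE(\hat\gamma)$, as computed in the code appendix. For example, for females,
$$0.236 \pm (1.96)(0.0385) = (0.160,\ 0.311).$$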
Problem 2
(a) The first derivative of $\lambda = \frac{1}{2}\log\frac{1+\gamma}{1-\gamma}$ with respect to $\gamma$ is
$$\frac{\partial\lambda}{\partial\gamma} = \frac{1}{1-\gamma^{2}} .$$
By the $\delta$-method,
$$\mathrm{var}(\hat\lambda) \approx \left(\frac{\partial\lambda}{\partial\gamma}\right)^{2}\mathrm{var}(\hat\gamma) = \left(\frac{1}{1-\gamma^{2}}\right)^{2}\mathrm{var}(\hat\gamma).$$
For females,
$$\hat\lambda_{f} = 0.240, \qquad \widehat{\mathrm{var}}(\hat\lambda_{f}) = 0.00166.$$
For males,
$$\hat\lambda_{m} = -0.126, \qquad \widehat{\mathrm{var}}(\hat\lambda_{m}) = 0.00110.$$
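As a check, using the female estimates $\hat\gamma_{f} = 0.236$ and $SE(\hat\gamma_{f}) = 0.0385$ from Problem 1,
$$\hat\lambda_{f} = \tfrac{1}{2}\log\frac{1.236}{0.764} = 0.240, \qquad \widehat{\mathrm{var}}(\hat\lambda_{f}) = \left(\frac{1}{1-0.236^{2}}\right)^{2}(0.0385)^{2} = 0.00166 .$$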
(b) For females, the approximate 95% confidence interval for $\lambda$ is
$$\left(\hat\lambda - z_{0.975}\sqrt{\widehat{\mathrm{var}}(\hat\lambda)},\ \ \hat\lambda + z_{0.975}\sqrt{\widehat{\mathrm{var}}(\hat\lambda)}\right) = (0.160,\ 0.320).$$
Since $\gamma = 1 - \frac{2}{\exp(2\lambda)+1}$, an approximate 95% confidence interval for $\gamma$ is $(0.159,\ 0.310)$. Similarly for males, an approximate 95% confidence interval for $\gamma$ is $(-0.189,\ -0.061)$.
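Numerically, $0.240 \pm 1.96\sqrt{0.00166} = (0.160,\ 0.320)$ on the $\lambda$ scale, and applying the back-transformation to each endpoint gives
$$\left(1 - \frac{2}{e^{0.320}+1},\ \ 1 - \frac{2}{e^{0.640}+1}\right) = (0.159,\ 0.310).$$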
(c) Since $\lambda = \frac{1}{2}\log\frac{1+\gamma}{1-\gamma}$ is a strictly increasing function of $\gamma$, the test problem
$$H_{0}: \gamma_{f} = \gamma_{m} \quad \text{versus} \quad H_{a}: \gamma_{f} \neq \gamma_{m}$$
is equivalent to the test problem
$$H_{0}: \lambda_{f} = \lambda_{m} \quad \text{versus} \quad H_{a}: \lambda_{f} \neq \lambda_{m}.$$
Consider $\hat\lambda_{f} - \hat\lambda_{m}$, the estimator of $\lambda_{f} - \lambda_{m}$. Under the assumption that the counts for females and males are independent,
$$\mathrm{var}(\hat\lambda_{f} - \hat\lambda_{m}) = \mathrm{var}(\hat\lambda_{f}) + \mathrm{var}(\hat\lambda_{m}).$$
When $H_{0}$ holds, the distribution of the test statistic
$$Z = \frac{\hat\lambda_{f} - \hat\lambda_{m}}{\sqrt{\widehat{\mathrm{var}}(\hat\lambda_{f}) + \widehat{\mathrm{var}}(\hat\lambda_{m})}}$$
is well approximated by a standard normal distribution when the sample sizes are large enough. This generally provides a more accurate p-value than approximating the null distribution of $(\hat\gamma_{f} - \hat\gamma_{m})/\sqrt{\widehat{\mathrm{var}}(\hat\gamma_{f}) + \widehat{\mathrm{var}}(\hat\gamma_{m})}$ with a standard normal distribution. For these data, $Z = 6.96$ and the standard normal approximation yields p-value $= 3.32 \times 10^{-12}$. The association between physical and psychological demands of work is not the same for males and females: the association is positive for females, but weaker and negative for males.
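Plugging in the estimates from part (a),
$$Z = \frac{0.240 - (-0.126)}{\sqrt{0.00166 + 0.00110}} = \frac{0.366}{0.0525} \approx 6.96 .$$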
Problem 3
The numerical results for parts (a), (b), (c), and (d) are presented in the following table.

                     table (a)   table (b)   table (c)   table (d)
P                      0.775       0.497       0.358       0.179
$\hat\kappa$           0.549       0.329       0.230       0.105
$\hat\rho_{s}$         0.549       0.626       0.642       0.663
$\hat\gamma$           0.844       0.684       0.630       0.564
$\hat\lambda_{R|C}$    0.547       0.326       0.287       0.228
(e) Clearly gamma and Spearman's rho are much less affected by the choice of the row and column categories than P, Kappa, or Lambda. Neither P, Kappa, nor Lambda should be used to compare levels of agreement or predictive ability across tables from different studies if the choice of categories differs across studies. If you had the actual values of the sucrose intake levels in 1980 and 1981 for each of the 173 subjects, you could compute values of gamma and Spearman's rho for the continuous data. Plot the 1981 sucrose intake values against the corresponding 1980 values. The values of gamma and Spearman's rho for the continuous data would be close to 1.0 if the points on the plot were tightly clustered about a monotone increasing curve (not necessarily a straight line). As the number of row/column categories in the contingency table is increased, the values of gamma and Spearman's rho converge to the values for the continuous data. On the other hand, the values of P and Kappa for the continuous data would be large if the points in the plot were mostly on the 45-degree line. If the points are merely tightly clustered about a monotone increasing curve or only loosely clustered about the 45-degree line, the number of cases that appear on the main diagonal of a contingency table will decrease (and P and Kappa will decrease) as the number of row/column categories increases.
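A rough numerical illustration (a hypothetical sketch, not the actual sucrose data) is to simulate one pair of monotonely related continuous measurements, discretize them at several grains, and recompute the measures. The code below follows the style of the appendix and assumes the unweight.kappa function defined there; x and y are stand-ins for the 1980 and 1981 intakes.

#-----------------------------------------------------#
# Hypothetical sketch: effect of the number of        #
# categories on P, kappa, and Spearman's rho.         #
# Uses unweight.kappa() from the code appendix.       #
#-----------------------------------------------------#
set.seed(557)                        # arbitrary seed, for reproducibility
n<-173
x<-rnorm(n)                          # stand-in for 1980 intake
y<-x+rnorm(n, sd=0.5)                # stand-in for 1981 intake, monotonely related to x
for (ncat in c(2, 4, 6, 12)) {
  cuts<-quantile(c(x, y), probs=seq(0, 1, length=ncat+1))
  xc<-cut(x, cuts, include.lowest=T)
  yc<-cut(y, cuts, include.lowest=T)
  tab<-table(xc, yc)                 # ncat by ncat contingency table
  P<-sum(diag(tab))/sum(tab)         # proportion of exact agreement
  K<-unweight.kappa(tab)             # unweighted kappa
  rho<-cor(rank(as.numeric(xc)), rank(as.numeric(yc)))   # Spearman's rho from the table
  cat(ncat, "categories: P =", round(P, 3), " kappa =", round(K, 3),
      " rho =", round(rho, 3), "\n")
}

In this sketch, P and kappa should shrink as the grid becomes finer, while Spearman's rho stays comparatively stable.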
Problem 4
(a) The estimated odds ratio for the marginal table of counts is $\hat\alpha = 0.493$. An approximate 95% confidence interval is (0.366, 0.724).
(b)
Age   Odds Ratio   95% confidence interval
18      0.664        (0.221, 1.998)
19      0.207        (0.059, 0.728)
20      0.949        (0.433, 2.079)
21      0.390        (0.169, 0.904)
22      0.811        (0.260, 2.526)
23      0.365        (0.070, 1.889)
24      0.867        (0.216, 3.482)
(c) $\hat\alpha_{MH} = 0.549$, and an approximate 95% confidence interval for the common conditional odds ratio is (0.371, 0.814).
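For reference, with the tables oriented as in part (b), the Mantel-Haenszel estimator pools the age-specific 2 x 2 tables as
$$\hat\alpha_{MH} = \frac{\sum_{k} n_{11k}\,n_{22k}/n_{++k}}{\sum_{k} n_{12k}\,n_{21k}/n_{++k}},$$
where $n_{ijk}$ is the count in cell $(i,j)$ of the table for age group $k$ and $n_{++k}$ is that table's total.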
(d) The Breslow-Day test statistic has a value of 6.23 with 6 d.f. and p-value=0.398. The
value of the T statistic is -0.220 with p-value=0.587.
(e) The counts at each age level are large enough for the chi-square approximation to the
null distribution of the Breslow-Day test to provide an accurate p-value. In this case,
the data are consistent with the null hypothesis of homogeneous conditional odds ratios.
This does not show that the conditional odds ratios are exactly all the same because
conditional odds ratios are not accurately estimated for some ages, as evidenced by
some rather wide confidence intervals.
(f) $X^{2} = 8.475$, with d.f. = 1, and p-value = 0.0027. The null hypothesis is rejected. Results from parts (b), (c), and (d) suggest that the odds of an IM case in the group that had a tonsillectomy are between 37 and 81 percent smaller than the odds of an IM case in the group of children that had no history of tonsillectomy, and this is consistent across age groups. (Note that $X^{2} = 9.023$ when the continuity correction is not used.)
Problem 5
(a) Under Mendelian theory, the expected counts are

              Tall/Cut   Tall/Potato   Dwarf/Cut   Dwarf/Potato
District 1     906.19      302.06        302.06       100.69
District 2     458.44      152.81        152.81        50.94
District 3     684.00      228.00        228.00        76.00

when the populations are in equilibrium. Here,
$$G^{2} = 2\sum_{i=1}^{3}\sum_{j=1}^{4} Y_{ij}\log(Y_{ij}/m_{ij}) = 3.144$$
with $9 - 0 = 9$ degrees of freedom and p-value $= 0.958$. The null hypothesis completely specifies the expected counts; there are no parameters to estimate.
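For example, District 1 has $n_{1} = 926 + 288 + 293 + 104 = 1611$ plants, so its expected counts under the 9:3:3:1 Mendelian ratios are
$$m_{1j} = 1611 \times \left(\tfrac{9}{16},\ \tfrac{3}{16},\ \tfrac{3}{16},\ \tfrac{1}{16}\right) = (906.19,\ 302.06,\ 302.06,\ 100.69).$$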
(b) The maximum likelihood estimates of $p_{a}$ and $p_{b}$ within the three districts are

              $\hat p_{a}$   $\hat p_{b}$
District 1      0.7567         0.7536
District 2      0.7570         0.7583
District 3      0.7500         0.7623

The corresponding expected counts are

              Tall/Cut   Tall/Potato   Dwarf/Cut   Dwarf/Potato
District 1     918.60      295.40        300.40        96.60
District 2     467.86      150.14        149.14        47.86
District 3     695.25      231.75        216.75        72.25

The value of the deviance statistic is $G^{2} = 1.133$, with $9 - 6 = 3$ degrees of freedom and p-value $= 0.769$.
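Here, for district $i$ with total $n_{i}$, the estimates are the marginal sample proportions (this is how M.b is formed in the code appendix):
$$\hat p_{a,i} = \frac{Y_{i1}+Y_{i3}}{n_{i}} \ \ (\text{cut-leaf}), \qquad \hat p_{b,i} = \frac{Y_{i1}+Y_{i2}}{n_{i}} \ \ (\text{tall}),$$
and the fitted counts are $n_{i}\hat p_{a,i}\hat p_{b,i}$, $n_{i}(1-\hat p_{a,i})\hat p_{b,i}$, $n_{i}\hat p_{a,i}(1-\hat p_{b,i})$, and $n_{i}(1-\hat p_{a,i})(1-\hat p_{b,i})$ for Tall/Cut, Tall/Potato, Dwarf/Cut, and Dwarf/Potato, respectively.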
(c) The maximum likelihood estimates of the common values of $p_{a}$ and $p_{b}$ are $\hat p_{a} = 0.7545$ and $\hat p_{b} = 0.7576$. The value of the deviance statistic is $G^{2} = 1.629$, with $9 - 2 = 7$ degrees of freedom and p-value $= 0.978$.
(d) The value of the deviance statistic is $G^{2} = 1.483$, with $(3-1)(4-1) = 6$ degrees of freedom and p-value $= 0.961$. The observed counts are consistent with the independence model.
(e)
                          Deviance   df   p-value
model (a) vs model (c)     1.515      2    0.469
model (c) vs model (b)     0.495      4    0.974
model (a) vs model (b)     2.011      6    0.919

The results from all three districts are consistent with the equilibrium model based on Mendelian theory.
(f)
                          Deviance   df   p-value
model (a) vs model (c)     1.515      2    0.469
model (c) vs model (d)     0.145      1    0.703
model (a) vs model (d)     1.660      3    0.646
(g) No. The independence model requires the same distribution of counts across the four phenotypes in each district. The model in part (b) is not a special case of the independence model because it allows the distribution of counts across the four phenotypes to differ across districts. Conversely, not every version of the independence model can be obtained from the model in part (b); only the special cases represented by the model in part (c) can.
Problem 6
(a)
Birth Order   Odds Ratio   95% confidence interval
2               1.317        (0.572, 3.030)
3-4             1.189        (0.545, 2.596)
5+              2.016        (0.844, 4.816)
(b) The value of the Breslow-Day statistic is 0.85 with 2 degrees of freedom and p-value = 0.653. The data are consistent with the hypothesis of homogeneous odds ratios across the three birth-order groups.
(c) The estimate of the odds ratio for the marginal table of counts is 1.347. An approximate 95% confidence interval is (0.851, 2.132). This is consistent with the information in the tables for the three birth-order groups, and it is not an example of Simpson's paradox.
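As a check, summing the three birth-order tables gives the marginal 2 x 2 table with counts (73, 145; 40, 107), so
$$\hat\alpha = \frac{(73)(107)}{(145)(40)} = 1.347, \qquad \exp\!\left(\log 1.347 \pm 1.96\sqrt{\tfrac{1}{73}+\tfrac{1}{145}+\tfrac{1}{40}+\tfrac{1}{107}}\right) = (0.851,\ 2.132).$$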
Problem 7
(a) From the software posted as negbin.ssc or negbin.sas, which was applied to the cavity data, we have $\hat\pi = 0.18888224$ and $\hat k = 0.5883654$ as the maximum likelihood estimates for the parameters in the negative binomial model. Then the m.l.e. of $\Pr(Y = 0) = \pi^{k}$ is $\hat\pi^{\hat k} = 0.375$.
(b) Define $g(\pi, k) = \pi^{k}$. The first partial derivatives of this function are
$$G = \left(k\,\pi^{k-1},\ \ \pi^{k}\log(\pi)\right).$$
A consistent estimator of $G$ is obtained by evaluating $G$ at the m.l.e.'s of the parameters. Then, $\hat G = \left(\hat k\,\hat\pi^{\hat k - 1},\ \ \hat\pi^{\hat k}\log(\hat\pi)\right) = (1.168553,\ -0.62574)$. The software from part (a) provided the estimated covariance matrix of $(\hat\pi, \hat k)$, the inverse of the estimated Fisher information matrix,
$$\hat V = \begin{bmatrix} 0.0009275 & 0.0024042 \\ 0.0024042 & 0.0092353 \end{bmatrix}.$$
Then, by the delta method, an estimate of the large sample variance of $\hat\pi^{\hat k}$ is
$$\hat G\,\hat V\,\hat G' = 0.001363$$
and the standard error of $\hat\pi^{\hat k}$ is 0.03692.
(c) An approximate 95% confidence interval for $\pi^{k}$ is
$$0.375 \pm (1.96)(0.03692) \;\Rightarrow\; (0.303,\ 0.447).$$
(d) You could do a simulation study of the coverage probability of the procedure for constructing confidence intervals used in part (c). Select values of $\pi$ and $k$. You could try several sets of values, using some that are close or equal to the m.l.e.'s of the parameters evaluated in the previous parts of this problem. For each set of parameter values, simulate a large number of samples (say 10,000 samples) from the corresponding negative binomial distribution with $n$ independent observations in each sample. Then, construct a confidence interval from the data for each sample, using the method from part (c). Record the proportion of the 10,000 confidence intervals that contain the true value of $\pi^{k}$. Repeat this for several choices of $n$, including the number of children in the original study.
The upper and lower confidence limits are random. To simulate the coverage probability of a method of constructing confidence intervals, you must simulate values of the random upper and lower limits and monitor how often those random limits enclose the true value of the quantity you are trying to estimate. Some students proposed simulating estimates of the probability of observing a child with no cavities and then monitoring how often those simulated estimates fell between the particular confidence limits computed in part (c). This incorrectly considers the upper and lower limits of a confidence interval as fixed (non-random) quantities.
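A minimal sketch of such a simulation, in the style of the code appendix, is given below. The function negbin.mle() is a hypothetical stand-in for the fitting routine in negbin.ssc; it is assumed to return a list with the m.l.e.'s of pi and k and their estimated covariance matrix V.

#-----------------------------------------------------#
# Sketch of the coverage simulation described in (d). #
# negbin.mle() is a hypothetical stand-in for the     #
# fitting code in negbin.ssc; assumed to return a     #
# list with components pi, k and V (2 by 2).          #
#-----------------------------------------------------#
coverage<-function(pi, k, n, nsim=10000){
  true.p0<-pi^k                         # true value of Pr(Y = 0)
  hits<-0
  for (s in 1:nsim) {
    y<-rnbinom(n, size=k, prob=pi)      # one simulated sample of size n
    fit<-negbin.mle(y)                  # hypothetical fit: m.l.e.'s and covariance
    G<-c(fit$k*fit$pi^(fit$k-1),        # gradient of g(pi, k) = pi^k
         fit$pi^fit$k*log(fit$pi))
    se<-sqrt((t(G)%*%fit$V%*%G)[1, 1])  # delta-method standard error
    est<-fit$pi^fit$k
    if (est-1.96*se <= true.p0 & true.p0 <= est+1.96*se)
      hits<-hits+1
  }
  hits/nsim                             # estimated coverage probability
}
# Example call, with values near the m.l.e.'s from part (a):
# coverage(pi=0.189, k=0.588, n=number of children in the study)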
#-----------------------------------------------------#
# Splus code for assignment 3;
#
#-----------------------------------------------------#
#-----------------------------------------------------#
#The following function calculates unweighted kappa
#
#-----------------------------------------------------#
unweight.kappa<-function(x){
  n<-sum(x)                  # total count
  xr<-apply(x, 1, sum)       # row totals
  xc<-apply(x, 2, sum)       # column totals
  e<-outer(xr, xc)/n         # expected counts under chance agreement
  k1<-sum(diag(x))/n         # observed agreement
  k2<-sum(diag(e))/n         # chance agreement
  kappa<-(k1-k2)/(1-k2)
  kappa
}
#----------------------------------------------------#
# The following function calculates the Pearson
#
# Chi-Square test and the deviance test of the
#
# independence of the row factor and the
#
# column factor for a 2-way contingency table.
#
#----------------------------------------------------#
X2G2test<-function(X){
sr<-apply(X, 1, sum)
sc<-apply(X, 2, sum)
n<-sum(sc)
m<-sr%*%t(sc)/n
X.sqr<-sum((X-m)^2/m)
df<-(length(sr)-1)*(length(sc)-1)
p.v1<-1-pchisq(X.sqr, df)
G.sqr<-2*sum(X*log(X/m))
p.v2<-1-pchisq(G.sqr, df)
list(Pearson.test=cbind(test.statistic=X.sqr, df=df, p.value=p.v1),
Deviance.test=cbind(test.statistic=G.sqr, df=df, p.value=p.v2))
}
#---------------------------------------------------#
# For a (2 by 2) contingency table, the following
#
# function estimates the odds ratio, its standard
#
# error and a 95% confidence interval.
#
#---------------------------------------------------#
odds<-function(X, correction=F)
{
if(correction == T)
X <- X + 0.5
alpha <- (X[1, 1] * X[2, 2])/(X[1, 2] * X[2, 1])
temp <- sqrt(sum(1/X))
ase <- alpha * temp
la<-log(alpha)
z975<-qnorm(0.975)
a <- la - z975 * temp
b <- la + z975 * temp
list(odds.ratio = alpha, ase = ase,
CI95 = cbind(lower = exp(a), upper = exp(b)))
}
#-----------#
# problem 1 #
#-----------#
female<-matrix(c(100, 109, 202,
                  33,  89, 179,
                 100, 179, 542), ncol=3, byrow=T)
male<-matrix(c(113, 163, 370,
                45, 106, 280,
               229, 343, 568), ncol=3, byrow=T)
#Consider the table for females;
female.g<-association(female)$Gamma
gamma.female<-female.g[1]
std.female<-female.g[2]
z975<-qnorm(0.975)
cbind(gamma=gamma.female, std=std.female,
lower=gamma.female-z975*std.female,
upper=gamma.female+z975*std.female)
#Consider the table for males;
male.g<-association(male)$Gamma
gamma.male<-male.g[1]
std.male<-male.g[2]
cbind(gamma=gamma.male, std=std.male,
lower=gamma.male-z975*std.male,
upper=gamma.male+z975*std.male)
#-----------#
# problem 2 #
#-----------#
# *** part (b) ***
#For females;
std2.female<-1/(1-gamma.female^2)*std.female
temp.f<-0.5*log((1+gamma.female)/(1-gamma.female))
l0<-temp.f-z975*std2.female
u0<-temp.f+z975*std2.female
l<-1-2/(exp(2*l0)+1)
u<-1-2/(exp(2*u0)+1)
cbind(lower=l, upper=u)
#For males;
std2.male<-1/(1-gamma.male^2)*std.male
temp.m<-0.5*log((1+gamma.male)/(1-gamma.male))
l0<-temp.m-z975*std2.male
u0<-temp.m+z975*std2.male
l<-1-2/(exp(2*l0)+1)
u<-1-2/(exp(2*u0)+1)
cbind(lower=l, upper=u)
# *** part (c) ***
Z<-(temp.f-temp.m)/sqrt(std2.female^2+std2.male^2)
p.value<-2*(1-pnorm(abs(Z)))
#--------#
# prob 3 #
#--------#
a<-matrix(c(67, 20,
            19, 67), ncol=2, byrow=T)
b<-matrix(c(24, 12,  5,  2,
            10, 21, 12,  1,
             8,  7, 14, 14,
             1,  3, 12, 27), ncol=4, byrow=T)
c<-matrix(c(17,  5,  4,  1,  0,  1,
             5,  9,  8,  5,  2,  1,
             3,  6, 10,  7,  2,  1,
             1,  4,  6,  4, 11,  3,
             2,  3,  0,  9,  7,  8,
             1,  1,  1,  2,  8, 15), ncol=6, byrow=T)
d<- matrix(c(7, 4, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
3, 3, 4, 0, 3, 0, 0, 0, 0, 0, 1, 0,
1, 0, 2, 2, 2, 3, 1, 2, 1, 0, 1, 0,
1, 3, 2, 3, 3, 0, 1, 1, 1, 0, 0, 0,
0, 2, 1, 2, 1, 5, 0, 1, 2, 0, 0, 0,
0, 1, 0, 3, 1, 3, 3, 3, 0, 0, 1, 0,
1, 0, 1, 0, 1, 2, 3, 0, 3, 1, 1, 1,
0, 0, 2, 1, 2, 1, 0, 1, 2, 5, 0, 1,
1, 1, 2, 0, 0, 0, 1, 2, 2, 1, 3, 1,
0, 0, 0, 1, 0, 0, 4, 2, 3, 1, 1, 3,
0, 1, 0, 0, 0, 1, 0, 1, 0, 5, 1, 5,
0, 0, 0, 1, 0, 0, 0, 1, 1, 2, 5, 4), ncol=12, byrow=T)
# *** part (a)(b)(c)(d) ***
#calculate P for each table;
sum(diag(a))/sum(a)
sum(diag(b))/sum(b)
sum(diag(c))/sum(c)
sum(diag(d))/sum(d)
#calculate K (unweighted kappa)
unweight.kappa(a)
unweight.kappa(b)
unweight.kappa(c)
unweight.kappa(d)
# Calculate the other measures of association.
# You need to first source in the
# S-plus code in "association.ssc".
association(a)
association(b)
association(c)
association(d)
#--------#
# prob 4 #
#--------#
# *** part (a) ***
X<-matrix(c(40, 145, 235, 420), 2, 2, byrow=T)
odds(X)
# *** part (b) ***
Y<-matrix(c( 6, 17, 17, 32,
             3, 39, 26, 70,
            12, 29, 34, 78,
             8, 38, 48, 89,
             5, 10, 45, 73,
             2,  7, 29, 37,
             4,  5, 36, 39), ncol=4, byrow=T)
M<-matrix(0, 7, 3, dimnames=list(NULL, c('odd.ratio', 'lower', 'upper')))
for (i in 1:7) {
temp<-odds(matrix(Y[i, ], 2, 2, byrow=T))
M[i, ]<-c(temp$odds.ratio, temp$CI95[1], temp$CI95[2])
}
age<-18:24
#The results;
cbind(age=age, M)
# *** part (c) ***
temp<-comodds(Y)
alpha.mh<-1/temp$mh.estimate
confidence.interval<-1/temp$ci95[2:1]
# *** part (d) ***
temp$Breslow.Day.test
temp$Liang.Self.T4test
# *** part (f) ***
Y.array<-array(0, c(2, 2, 7))
for (k in 1:7) Y.array[ , , k]<-matrix(Y[k,], 2, 2, byrow=T)
mantelhaen.test(Y.array, correct=F)
#--------#
# prob 5 #
#--------#
X<-matrix(c(926, 288, 293, 104,
467, 151, 150, 47,
693, 234, 219, 70), ncol=4, byrow=T)
# *** part (a) ***
N<-apply(X, 1, sum)
M.a<-N%*%t(c(9/16, 3/16, 3/16, 1/16))
dev.a<-2*sum(X*log(X/M.a))
df.a<-9-0
pvalue.a<-1-pchisq(dev.a, df.a)
# *** part (b) ***
pa.b<-(X[,1]+X[,3])/N
pb.b<-(X[,1]+X[,2])/N
P.b<-cbind(pa.b*pb.b, (1-pa.b)*pb.b, pa.b*(1-pb.b), (1-pa.b)*(1-pb.b))
M.b<-diag(N)%*%P.b
dev.b<-2*sum(X*log(X/M.b))
df.b<-9-6
pvalue.b<-1-pchisq(dev.b, df.b)
# *** part(c) ***
Nc<-apply(X, 2, sum)
N.tot<-sum(Nc)
pa.c<-(Nc[1]+Nc[3])/N.tot
pb.c<-(Nc[1]+Nc[2])/N.tot
P.c<-c(pa.c*pb.c, (1-pa.c)*pb.c, pa.c*(1-pb.c), (1-pa.c)*(1-pb.c))
M.c<-N%*%t(P.c)
dev.c<-2*sum(X*log(X/M.c))
df.c<-9-2
pvalue.c<-1-pchisq(dev.c, df.c)
# *** part (d) ***
temp1<-X2G2test(X)$Deviance.test
dev.d<-temp1[1]
df.d<-temp1[2]
pvalue.d<-temp1[3]
# *** part (e) ***
dev.ac<-dev.a-dev.c
dev.cb<-dev.c-dev.b
dev.ab<-dev.a-dev.b
df.ac<-df.a-df.c
df.cb<-df.c-df.b
df.ab<-df.a-df.b
pvalue.ac<-1-pchisq(dev.ac, df.ac)
pvalue.cb<-1-pchisq(dev.cb, df.cb)
pvalue.ab<-1-pchisq(dev.ab, df.ab)
dev<-c(dev.ac,
dev.cb, dev.ab)
df<-c(df.ac, df.cb, df.ab)
pvalue<-c(pvalue.ac, pvalue.cb, pvalue.ab)
name<-c("model (a) vs model (c)",
"model (c) vs model (b)",
"model (c) vs model (d)")
dev.table<-cbind(dev, df, pvalue)
dimnames(dev.table)<-list(name, c("deviance", "d.f.", "p-value"))
#print out the deviance table;
dev.table
# *** part (f) ***
dev.cd<-dev.c-dev.d
dev.ad<-dev.a-dev.d
df.cd<-df.c-df.d
df.ad<-df.a-df.d
pvalue.cd<-1-pchisq(dev.cd, df.cd)
pvalue.ad<-1-pchisq(dev.ad, df.ad)
dev<-c(dev.ac,
dev.cd, dev.ad)
df<-c(df.ac, df.cd, df.ad)
pvalue<-c(pvalue.ac, pvalue.cd, pvalue.ad)
name<-c("model (a) vs model (c)",
"model (c) vs model (d)",
"model (a) vs model (d)")
dev.table<-cbind(dev, df, pvalue)
dimnames(dev.table)<-list(name, c("deviance", "d.f.", "p-value"))
#print out the deviance table;
dev.table
#--------#
# prob 6 #
#--------#
Y<-matrix(c(20, 82, 10, 54,
26, 41, 16, 30,
27, 22, 14, 23), ncol=4, byrow=T)
# *** part (a) ***
M<-matrix(0, 3, 3, dimnames=list(NULL, c('odd.ratio', 'lower', 'upper')))
for (i in 1:3) {
temp<-odds(matrix(Y[i, ], 2, 2, byrow=T))
M[i, ]<-c(temp$odds.ratio, temp$CI95[1], temp$CI95[2])
}
M
# *** part (b) ***
comodds(Y)$Breslow.Day.test
# *** part (c) ***
collapse.table<-matrix(apply(Y, 2, sum), 2, 2, byrow=T)
#use the Splus function "odds";
odds(collapse.table)