Stat 557 Fall 2000
Assignment 3 Solutions

Problem 1

           Gamma      standard     95% confidence interval
           estimate   error        lower limit   upper limit
Females     0.236     0.0385          0.160          0.311
Males      -0.125     0.0327         -0.189         -0.061

Problem 2

(a) The first derivative of $\xi = \frac{1}{2}\log\frac{1+\gamma}{1-\gamma}$ with respect to $\gamma$ is
$\frac{\partial \xi}{\partial \gamma} = \frac{1}{1-\gamma^{2}}$. By the $\delta$-method,
\[ \mathrm{var}(\hat\xi) \approx \left(\frac{\partial \xi}{\partial \gamma}\right)^{2} \mathrm{var}(\hat\gamma)
   = \left(\frac{1}{1-\gamma^{2}}\right)^{2} \mathrm{var}(\hat\gamma). \]
For females, $\hat\xi_f = 0.240$ and $\widehat{\mathrm{var}}(\hat\xi_f) = 0.00166$. For males,
$\hat\xi_m = -0.126$ and $\widehat{\mathrm{var}}(\hat\xi_m) = 0.00110$.

(b) For females, the approximate 95% confidence interval for $\xi$ is
\[ \left(\hat\xi - z_{0.975}\sqrt{\widehat{\mathrm{var}}(\hat\xi)},\;
         \hat\xi + z_{0.975}\sqrt{\widehat{\mathrm{var}}(\hat\xi)}\right) = (0.160,\ 0.320). \]
Since $\gamma = 1 - \frac{2}{\exp(2\xi)+1}$, an approximate 95% confidence interval for $\gamma$ is
$(0.159,\ 0.310)$. Similarly for males, an approximate 95% confidence interval for $\gamma$ is
$(-0.189,\ -0.061)$.

(c) Since $\xi = \frac{1}{2}\log\frac{1+\gamma}{1-\gamma}$ is a strictly increasing function of $\gamma$,
the test problem
\[ H_0: \gamma_f = \gamma_m \quad \text{versus} \quad H_a: \gamma_f \neq \gamma_m \]
is equivalent to the test problem
\[ H_0: \xi_f = \xi_m \quad \text{versus} \quad H_a: \xi_f \neq \xi_m. \]
Consider $\hat\xi_f - \hat\xi_m$, the estimator of $\xi_f - \xi_m$. Under the assumption that the counts
for females and males are independent,
\[ \mathrm{var}(\hat\xi_f - \hat\xi_m) = \mathrm{var}(\hat\xi_f) + \mathrm{var}(\hat\xi_m). \]
When $H_0$ holds, the distribution of the test statistic
\[ Z = \frac{\hat\xi_f - \hat\xi_m}{\sqrt{\widehat{\mathrm{var}}(\hat\xi_f) + \widehat{\mathrm{var}}(\hat\xi_m)}} \]
is well approximated by a standard normal distribution when the sample sizes are large enough. This
generally provides a more accurate p-value than approximating the null distribution of
$(\hat\gamma_f - \hat\gamma_m)/\sqrt{\widehat{\mathrm{var}}(\hat\gamma_f) + \widehat{\mathrm{var}}(\hat\gamma_m)}$
with a standard normal distribution. For these data, $Z = 6.96$ and the standard normal approximation
yields p-value $= 3.32 \times 10^{-12}$. The association between physical and psychological demands of
work is not the same for males and females: the association is positive for females, but weaker and
negative for males.

Problem 3

The numerical results for parts (a), (b), (c), and (d) are presented in the following table.

                      table (a)   table (b)   table (c)   table (d)
P                       0.775       0.497       0.358       0.179
kappa                   0.549       0.329       0.230       0.105
Spearman's rho          0.549       0.626       0.642       0.663
gamma                   0.844       0.684       0.630       0.564
lambda (R|C)            0.547       0.326       0.287       0.228

(e) Clearly, gamma and Spearman's rho are much less affected by the choice of the row and column
categories than P, kappa, or lambda. None of P, kappa, or lambda should be used to compare levels of
agreement or predictive ability across tables from different studies if the choice of categories differs
across studies. If you had the actual values of the sucrose intake levels in 1980 and 1981 for each of
the 173 subjects, you could compute values of gamma and Spearman's rho for the continuous data. Plot the
1981 sucrose intake values against the corresponding 1980 values. The values of gamma and Spearman's rho
for the continuous data would be close to 1.0 if the points on the plot were tightly clustered about a
monotone increasing curve (not necessarily a straight line). As the number of row/column categories in
the contingency table is increased, the values of gamma and Spearman's rho converge to the values for
the continuous data. On the other hand, the values of P and kappa for the continuous data would be large
only if the points in the plot were mostly on the 45-degree line. If the points are merely tightly
clustered about a monotone increasing curve, or only loosely clustered about the 45-degree line, the
number of cases that appear on the main diagonal of a contingency table will decrease (and P and kappa
will decrease) as the number of row/column categories increases.

Problem 4

(a) The estimated odds ratio for the marginal table of counts is $\hat\alpha = 0.493$. An approximate
95% confidence interval is (0.366, 0.724).

(b)
     Age   Odds Ratio   95% confidence interval
      18     0.664          (0.221, 1.998)
      19     0.207          (0.059, 0.728)
      20     0.949          (0.433, 2.079)
      21     0.390          (0.169, 0.904)
      22     0.811          (0.260, 2.526)
      23     0.365          (0.070, 1.889)
      24     0.867          (0.216, 3.482)

(c) $\hat\alpha_{MH} = 0.549$, and an approximate 95% confidence interval for the common conditional
odds ratio is (0.371, 0.814).

(d) The Breslow-Day test statistic has a value of 6.23 with 6 d.f. and p-value = 0.398. The value of the
$T_4$ statistic is $-0.220$ with p-value = 0.587.

(e) The counts at each age level are large enough for the chi-square approximation to the null
distribution of the Breslow-Day test to provide an accurate p-value. In this case, the data are
consistent with the null hypothesis of homogeneous conditional odds ratios. This does not show that the
conditional odds ratios are all exactly the same, because the conditional odds ratios are not accurately
estimated for some ages, as evidenced by some rather wide confidence intervals.

(f) $X^2 = 8.475$, with d.f. = 1 and p-value = 0.0027. The null hypothesis is rejected. Results from
parts (b), (c), and (d) suggest that the odds of an IM case in the group that had a tonsillectomy are
between 37 and 81 percent of the odds of an IM case in the group of children that had no history of
tonsillectomy, and this is consistent across age groups. (Note that $X^2 = 9.023$ when the continuity
correction is not used.)

Problem 5

(a) Under Mendelian theory, the expected counts are

              Tall/Cut   Tall/Potato   Dwarf/Cut   Dwarf/Potato
District 1     906.19       302.06       302.06       100.69
District 2     458.44       152.81       152.81        50.94
District 3     684.00       228.00       228.00        76.00

when the populations are in equilibrium. Here,
\[ G^2 = 2 \sum_{i=1}^{3} \sum_{j=1}^{4} Y_{ij} \log(Y_{ij}/m_{ij}) = 3.144 \]
with $9 - 0 = 9$ degrees of freedom and p-value $= 0.958$. The null hypothesis completely specifies the
expected counts, so there are no parameters to estimate.

(b) The maximum likelihood estimates of $p_a$ and $p_b$ within the three districts are

              p_a      p_b
District 1   0.7567   0.7536
District 2   0.7570   0.7583
District 3   0.7500   0.7623

The corresponding expected counts are

              Tall/Cut   Tall/Potato   Dwarf/Cut   Dwarf/Potato
District 1     918.60       295.40       300.40        96.60
District 2     467.86       150.14       149.14        47.86
District 3     695.25       231.75       216.75        72.25

The value of the deviance statistic is $G^2 = 1.133$, with $9 - 6 = 3$ degrees of freedom and
p-value $= 0.769$.

(c) The maximum likelihood estimates of the common values of $p_a$ and $p_b$ are $\hat p_a = 0.7545$ and
$\hat p_b = 0.7576$. The value of the deviance statistic is $G^2 = 1.629$, with $9 - 2 = 7$ degrees of
freedom and p-value $= 0.978$.

(d) The value of the deviance statistic is $G^2 = 1.483$, with $(3-1)(4-1) = 6$ degrees of freedom and
p-value $= 0.961$. The observed counts are consistent with the independence model.

(e)
                           Deviance   df   p-value
model (a) vs model (c)       1.515     2     0.469
model (c) vs model (b)       0.495     4     0.974
model (a) vs model (b)       2.011     6     0.919

The results from all three districts are consistent with the equilibrium model based on Mendelian
theory.

(f)
                           Deviance   df   p-value
model (a) vs model (c)       1.515     2     0.469
model (c) vs model (d)       0.145     1     0.703
model (a) vs model (d)       1.660     3     0.646

(g) No. The independence model requires the same distribution of counts across the four phenotypes in
each district. The model in part (b) is not a special case of the independence model because it allows
the distribution of counts across the four phenotypes to differ across districts. Conversely, not every
version of the independence model can be obtained from the model in part (b); only the special cases
represented by the model in part (c) can.
Problem 6

(a)
Birth Order   Odds Ratio   95% confidence interval
     2          1.317          (0.572, 3.030)
    3-4         1.189          (0.545, 2.596)
    5+          2.016          (0.844, 4.816)

(b) The value of the Breslow-Day statistic is 0.85 with 2 degrees of freedom and p-value = 0.653. The
data are consistent with the hypothesis of homogeneous odds ratios within age groups.

(c) The estimate of the odds ratio for the marginal table of counts is 1.347. An approximate 95%
confidence interval is (0.851, 2.132). This is consistent with the information in the tables for the
three age groups, and it is not an example of Simpson's paradox.

Problem 7

(a) From the software posted as negbin.ssc or negbin.sas, which was applied to the cavity data, the
maximum likelihood estimates of the parameters in the negative binomial model are
$\hat\pi = 0.18888224$ and $\hat k = 0.5883654$. Then the m.l.e. of $\Pr(Y = 0) = \pi^{k}$ is
$\hat\pi^{\hat k} = 0.375$.

(b) Define $g(\pi, k) = \pi^{k}$. The first partial derivatives of this function are
\[ G = \left( k\pi^{k-1},\ \pi^{k}\log(\pi) \right). \]
A consistent estimator of $G$ is obtained by evaluating $G$ at the m.l.e.'s of the parameters. Then,
\[ \hat G = \left( \hat k \hat\pi^{\hat k - 1},\ \hat\pi^{\hat k}\log(\hat\pi) \right)
          = (1.168553,\ -0.62574). \]
The software from part (a) provided the estimated covariance matrix of $(\hat\pi, \hat k)$, the inverse
of the estimated Fisher information matrix,
\[ \hat V = \begin{bmatrix} 0.0009275 & 0.0024042 \\ 0.0024042 & 0.0092353 \end{bmatrix}. \]
Then, by the delta method, an estimate of the large sample variance of $\hat\pi^{\hat k}$ is
$\hat G \hat V \hat G' = 0.001363$, and the standard error of $\hat\pi^{\hat k}$ is 0.03692.

(c) An approximate 95% confidence interval for $\pi^{k}$ is
$0.375 \pm (1.96)(0.03692) \Rightarrow (0.303,\ 0.447)$.

(d) You could do a simulation study of the coverage probability of the procedure for constructing
confidence intervals used in part (c). Select values of $\pi$ and $k$. You could try several sets of
values, using some that are close or equal to the m.l.e.'s of the parameters evaluated in the previous
parts of this problem. For each set of parameter values, simulate a large number of samples (say 10,000
samples) from the corresponding negative binomial distribution with $n$ independent observations in each
sample. Then construct a confidence interval from the data for each sample, using the method from part
(c). Record the proportion of the 10,000 confidence intervals that contain the true value of $\pi^{k}$.
Repeat this for several choices of $n$, including the number of children in the original study. A sketch
of such a simulation is given below.

The upper and lower confidence limits are random. To simulate the coverage probability of a method of
constructing confidence intervals, you must simulate values of the random upper and lower limits and
monitor how often those random limits enclose the true value of the quantity you are trying to estimate.
Some students proposed simulating estimates of the probability of observing a child with no cavities and
then monitoring how often those simulated estimates fell between the particular confidence limits
computed in part (c). This incorrectly treats the upper and lower limits of a confidence interval as
fixed (non-random) quantities.
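Problem 7 is the only problem whose computations are not included in the code listing that follows; it
relies on the separately posted negbin.ssc or negbin.sas. The sketch below, written in R-style syntax,
shows one way the delta-method interval of parts (b) and (c) and the coverage simulation described in
part (d) could be programmed. The optimizer starting values, the sample size n, and the number of
simulated data sets are illustrative placeholders, not values taken from the original study or from the
posted software.

#------------------------------------------------------#
# Sketch for problem 7: delta-method interval for      #
# Pr(Y=0) = pi^k and a coverage simulation (part (d)). #
# Not the posted negbin code; R-style syntax.          #
#------------------------------------------------------#

ci.p0<-function(y){
  # negative log-likelihood in (pi, k);
  # dnbinom(size=k, prob=pi) gives Pr(Y=0) = pi^k
  negll<-function(theta) -sum(dnbinom(y, size=theta[2], prob=theta[1], log=TRUE))
  fit<-optim(c(0.2, 0.5), negll, method="L-BFGS-B",
             lower=c(1e-6, 1e-6), upper=c(1-1e-6, Inf), hessian=TRUE)
  p<-fit$par[1]
  k<-fit$par[2]
  V<-solve(fit$hessian)                  # inverse of the observed information
  G<-c(k*p^(k-1), p^k*log(p))            # gradient of g(pi, k) = pi^k
  se<-sqrt(drop(t(G)%*%V%*%G))           # delta-method standard error
  est<-p^k
  c(estimate=est, lower=est-1.96*se, upper=est+1.96*se)
}

# coverage simulation for one (pi, k, n) configuration
pi.true<-0.189                           # near the m.l.e.'s reported in part (a)
k.true<-0.588
n<-100                                   # placeholder; use the study's sample size
p0.true<-pi.true^k.true
nsim<-1000                               # 10,000 in part (d); fewer here for speed
cover<-logical(nsim)
set.seed(557)
for (s in 1:nsim) {
  y<-rnbinom(n, size=k.true, prob=pi.true)
  ci<-ci.p0(y)
  cover[s]<-(ci["lower"] <= p0.true) & (p0.true <= ci["upper"])
}
mean(cover)                              # estimated coverage probability

Repeating the loop over several (pi, k) pairs and several sample sizes n, as described in part (d),
shows how the coverage of the delta-method interval depends on the sample size.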
#------------------------------------------------------#
# Splus code for assignment 3;                          #
#------------------------------------------------------#

#------------------------------------------------------#
# The following function calculates unweighted kappa   #
#------------------------------------------------------#

unweight.kappa<-function(x){
  n<-sum(x)
  xr<-apply(x, 1, sum)
  xc<-apply(x, 2, sum)
  e<-outer(xr, xc)/n
  k1<-sum(diag(x))/n
  k2<-sum(diag(e))/n
  kappa<-(k1-k2)/(1-k2)
  kappa
}

#------------------------------------------------------#
# The following function calculates the Pearson        #
# Chi-Square test and the deviance test of the         #
# independence of the row factor and the               #
# column factor for a 2-way contingency table.         #
#------------------------------------------------------#

X2G2test<-function(X){
  sr<-apply(X, 1, sum)
  sc<-apply(X, 2, sum)
  n<-sum(sc)
  m<-sr%*%t(sc)/n
  X.sqr<-sum((X-m)^2/m)
  df<-(length(sr)-1)*(length(sc)-1)
  p.v1<-1-pchisq(X.sqr, df)
  G.sqr<-2*sum(X*log(X/m))
  p.v2<-1-pchisq(G.sqr, df)
  list(Pearson.test=cbind(test.statistic=X.sqr, df=df, p.value=p.v1),
       Deviance.test=cbind(test.statistic=G.sqr, df=df, p.value=p.v2))
}

#------------------------------------------------------#
# For a (2 by 2) contingency table, the following      #
# function estimates the odds ratio, its standard      #
# error and a 95% confidence interval.                 #
#------------------------------------------------------#

odds<-function(X, correction=F) {
  if(correction == T) X <- X + 0.5
  alpha <- (X[1, 1] * X[2, 2])/(X[1, 2] * X[2, 1])
  temp <- sqrt(sum(1/X))
  ase <- alpha * temp
  la<-log(alpha)
  z975<-qnorm(0.975)
  a <- la - z975 * temp
  b <- la + z975 * temp
  list(odds.ratio = alpha, ase = ase,
       CI95 = cbind(lower = exp(a), upper = exp(b)))
}

#-----------#
# problem 1 #
#-----------#

female<-matrix(c(100, 109, 202,
                  33,  89, 179,
                 100, 179, 542), ncol=3, byrow=T)

male<-matrix(c(113, 163, 370,
                45, 106, 280,
               229, 343, 568), ncol=3, byrow=T)

# Consider the table for females;
# association() must first be sourced in from "association.ssc".
female.g<-association(female)$Gamma
gamma.female<-female.g[1]
std.female<-female.g[2]
z975<-qnorm(0.975)
cbind(gamma=gamma.female, std=std.female,
      lower=gamma.female-z975*std.female,
      upper=gamma.female+z975*std.female)

# Consider the table for males;
male.g<-association(male)$Gamma
gamma.male<-male.g[1]
std.male<-male.g[2]
cbind(gamma=gamma.male, std=std.male,
      lower=gamma.male-z975*std.male,
      upper=gamma.male+z975*std.male)

#-----------#
# problem 2 #
#-----------#

# *** part (b) ***

# For females;
std2.female<-1/(1-gamma.female^2)*std.female
temp.f<-0.5*log((1+gamma.female)/(1-gamma.female))
l0<-temp.f-z975*std2.female
u0<-temp.f+z975*std2.female
l<-1-2/(exp(2*l0)+1)
u<-1-2/(exp(2*u0)+1)
cbind(lower=l, upper=u)

# For males;
std2.male<-1/(1-gamma.male^2)*std.male
temp.m<-0.5*log((1+gamma.male)/(1-gamma.male))
l0<-temp.m-z975*std2.male
u0<-temp.m+z975*std2.male
l<-1-2/(exp(2*l0)+1)
u<-1-2/(exp(2*u0)+1)
cbind(lower=l, upper=u)

# *** part (c) ***
Z<-(temp.f-temp.m)/sqrt(std2.female^2+std2.male^2)
p.value<-2*(1-pnorm(abs(Z)))

#--------#
# prob 3 #
#--------#

a<-matrix(c(67, 20,
            19, 67), ncol=2, byrow=T)

b<-matrix(c(24, 12,  5,  2,
            10, 21, 12,  1,
             8,  7, 14, 14,
             1,  3, 12, 27), ncol=4, byrow=T)

c<-matrix(c(17,  5,  4,  1,  0,  1,
             5,  9,  8,  5,  2,  1,
             3,  6, 10,  7,  2,  1,
             1,  4,  6,  4, 11,  3,
             2,  3,  0,  9,  7,  8,
             1,  1,  1,  2,  8, 15), ncol=6, byrow=T)

d<-matrix(c(7, 4, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
            3, 3, 4, 0, 3, 0, 0, 0, 0, 0, 1, 0,
            1, 0, 2, 2, 2, 3, 1, 2, 1, 0, 1, 0,
            1, 3, 2, 3, 3, 0, 1, 1, 1, 0, 0, 0,
            0, 2, 1, 2, 1, 5, 0, 1, 2, 0, 0, 0,
            0, 1, 0, 3, 1, 3, 3, 3, 0, 0, 1, 0,
            1, 0, 1, 0, 1, 2, 3, 0, 3, 1, 1, 1,
            0, 0, 2, 1, 2, 1, 0, 1, 2, 5, 0, 1,
            1, 1, 2, 0, 0, 0, 1, 2, 2, 1, 3, 1,
            0, 0, 0, 1, 0, 0, 4, 2, 3, 1, 1, 3,
            0, 1, 0, 0, 0, 1, 0, 1, 0, 5, 1, 5,
            0, 0, 0, 1, 0, 0, 0, 1, 1, 2, 5, 4), ncol=12, byrow=T)

# *** part (a)(b)(c)(d) ***

# calculate P for each table;
sum(diag(a))/sum(a)
sum(diag(b))/sum(b)
sum(diag(c))/sum(c)
sum(diag(d))/sum(d)

# calculate K (unweighted kappa)
unweight.kappa(a)
unweight.kappa(b)
unweight.kappa(c)
unweight.kappa(d)

# Calculate the other measures of association.
# You need to first source in the
# S-plus code in "association.ssc".
association(a)
association(b)
association(c)
association(d)

#--------#
# prob 4 #
#--------#

# *** part (a) ***
X<-matrix(c(40, 145, 235, 420), 2, 2, byrow=T)
odds(X)

# *** part (b) ***
Y<-matrix(c( 6, 17, 17, 32,
             3, 39, 26, 70,
            12, 29, 34, 78,
             8, 38, 48, 89,
             5, 10, 45, 73,
             2,  7, 29, 37,
             4,  5, 36, 39), ncol=4, byrow=T)
M<-matrix(0, 7, 3, dimnames=list(NULL, c('odd.ratio', 'lower', 'upper')))
for (i in 1:7)
{
  temp<-odds(matrix(Y[i, ], 2, 2, byrow=T))
  M[i, ]<-c(temp$odds.ratio, temp$CI95[1], temp$CI95[2])
}
age<-18:24
# The results;
cbind(age=age, M)

# *** part (c) ***
# comodds(), like association(), comes from the course code and must be sourced in first.
temp<-comodds(Y)
alpha.mh<-1/temp$mh.estimate
confidence.interval<-1/temp$ci95[2:1]

# *** part (d) ***
temp$Breslow.Day.test
temp$Liang.Self.T4test

# *** part (f) ***
Y.array<-array(0, c(2, 2, 7))
for (k in 1:7) Y.array[ , , k]<-matrix(Y[k,], 2, 2, byrow=T)
mantelhaen.test(Y.array, correct=F)

#--------#
# prob 5 #
#--------#

X<-matrix(c(926, 288, 293, 104,
            467, 151, 150,  47,
            693, 234, 219,  70), ncol=4, byrow=T)

# *** part (a) ***
N<-apply(X, 1, sum)
M.a<-N%*%t(c(9/16, 3/16, 3/16, 1/16))
dev.a<-2*sum(X*log(X/M.a))
df.a<-9-0
pvalue.a<-1-pchisq(dev.a, df.a)

# *** part (b) ***
pa.b<-(X[,1]+X[,3])/N
pb.b<-(X[,1]+X[,2])/N
P.b<-cbind(pa.b*pb.b, (1-pa.b)*pb.b, pa.b*(1-pb.b), (1-pa.b)*(1-pb.b))
M.b<-diag(N)%*%P.b
dev.b<-2*sum(X*log(X/M.b))
df.b<-9-6
pvalue.b<-1-pchisq(dev.b, df.b)

# *** part (c) ***
Nc<-apply(X, 2, sum)
N.tot<-sum(Nc)
pa.c<-(Nc[1]+Nc[3])/N.tot
pb.c<-(Nc[1]+Nc[2])/N.tot
P.c<-c(pa.c*pb.c, (1-pa.c)*pb.c, pa.c*(1-pb.c), (1-pa.c)*(1-pb.c))
M.c<-N%*%t(P.c)
dev.c<-2*sum(X*log(X/M.c))
df.c<-9-2
pvalue.c<-1-pchisq(dev.c, df.c)

# *** part (d) ***
temp1<-X2G2test(X)$Deviance.test
dev.d<-temp1[1]
df.d<-temp1[2]
pvalue.d<-temp1[3]

# *** part (e) ***
dev.ac<-dev.a-dev.c
dev.cb<-dev.c-dev.b
dev.ab<-dev.a-dev.b
df.ac<-df.a-df.c
df.cb<-df.c-df.b
df.ab<-df.a-df.b
pvalue.ac<-1-pchisq(dev.ac, df.ac)
pvalue.cb<-1-pchisq(dev.cb, df.cb)
pvalue.ab<-1-pchisq(dev.ab, df.ab)
dev<-c(dev.ac, dev.cb, dev.ab)
df<-c(df.ac, df.cb, df.ab)
pvalue<-c(pvalue.ac, pvalue.cb, pvalue.ab)
name<-c("model (a) vs model (c)", "model (c) vs model (b)",
        "model (a) vs model (b)")
dev.table<-cbind(dev, df, pvalue)
dimnames(dev.table)<-list(name, c("deviance", "d.f.", "p-value"))
# print out the deviance table;
dev.table

# *** part (f) ***
dev.cd<-dev.c-dev.d
dev.ad<-dev.a-dev.d
df.cd<-df.c-df.d
df.ad<-df.a-df.d
pvalue.cd<-1-pchisq(dev.cd, df.cd)
pvalue.ad<-1-pchisq(dev.ad, df.ad)
dev<-c(dev.ac, dev.cd, dev.ad)
df<-c(df.ac, df.cd, df.ad)
pvalue<-c(pvalue.ac, pvalue.cd, pvalue.ad)
name<-c("model (a) vs model (c)", "model (c) vs model (d)",
        "model (a) vs model (d)")
dev.table<-cbind(dev, df, pvalue)
dimnames(dev.table)<-list(name, c("deviance", "d.f.", "p-value"))
# print out the deviance table;
dev.table

#--------#
# prob 6 #
#--------#

Y<-matrix(c(20, 82, 10, 54,
            26, 41, 16, 30,
            27, 22, 14, 23), ncol=4, byrow=T)

# *** part (a) ***
M<-matrix(0, 3, 3, dimnames=list(NULL, c('odd.ratio', 'lower', 'upper')))
for (i in 1:3)
{
  temp<-odds(matrix(Y[i, ], 2, 2, byrow=T))
  M[i, ]<-c(temp$odds.ratio, temp$CI95[1], temp$CI95[2])
}
M

# *** part (b) ***
comodds(Y)$Breslow.Day.test

# *** part (c) ***
collapse.table<-matrix(apply(Y, 2, sum), 2, 2, byrow=T)
# use the Splus function "odds";
odds(collapse.table)