STAT 557                    Solutions to Assignment 1                    Fall 2002

1. (a) Adults in the State of Iowa with telephones
   (b) An adult from the population
   (c) a binary (yes, no) response
   (d) nominal
   (e) Since the sample size (n = 825) is fairly large and the point estimate of p is not too extreme (np̂ > 5 and n(1 − p̂) > 5), the normal approximation would be reasonable if the random digit dialing approximately produces a simple random sample from the population of adults with telephones in Iowa. This should be the case if they talk to only one adult on each call. Then, with p̂ = 0.4630, an approximate 95% confidence interval is (0.429, 0.497). If they talk to every adult present in the household on each call, however, it would be closer to a one-stage cluster sample than a simple random sample, and the number of "yes" responses may not have a binomial distribution. In that case, using the binomial variance formula in the construction of a confidence interval would not be appropriate.

2. (a) Patients in the national registry of records for individuals who have been diagnosed as suffering from a certain liver disease
   (b) a patient from the population
   (c) disease severity measured on a five-point scale
   (d) ordinal
   (e) Since the sample size (n = 825) is fairly large and the point estimate of p is not too extreme (np̂ > 5 and n(1 − p̂) > 5), the normal approximation would be reasonable. With p̂ = 0.171, the approximate 95% confidence interval is (0.124, 0.218).

3. (a) Iowa State students
   (b) A student and his or her parents
   (c) There are three potentially correlated binary response variables corresponding to right (or left) handedness for a student and his or her two parents. For each student/parent trio you could think of this as a multinomial response variable in which each trio belongs to exactly one of the 8 categories, if information on both parents is available.
   (d) nominal
   (e) The Pearson chi-square test can be used to compare proportions obtained from independent binomial random variables.
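The interval in 1(e) is the usual large-sample (Wald) interval p̂ ± 1.96 √(p̂(1 − p̂)/n). As a quick numerical check (not part of the original solutions; the S-Plus appendix below does not cover Problems 1-2), the arithmetic can be sketched in Python using the values n = 825 and p̂ = 0.4630 given above:

```python
from math import sqrt

# Values from Problem 1(e)
n = 825
p_hat = 0.4630

# Wald interval: p-hat +/- 1.96 * sqrt( p-hat * (1 - p-hat) / n )
se = sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

print(round(lower, 3), round(upper, 3))  # ~ (0.429, 0.497)
```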
Since the binomial variables in this problem are correlated, the use of the Pearson chi-square test statistic is not reasonable.

4. (a) Schools in Iowa with sixth grade classes
   (b) This is a two-stage sampling scheme in which schools are sampled from the population of schools in Iowa with sixth grade students. Then one class is randomly selected from the sixth grade classes in each of the selected schools. An entire sixth grade class is the unit of response. Since students in a class are influenced by the same teacher and largely share the same educational background, two students from the same class will tend to respond in a more similar way than two students from different schools. To use sixth grade students as the units of response, correlations among responses from classmates would have to be taken into account. This could be difficult because it would require a model that would allow for different types or strengths of relationships between different pairs of classmates.
   (c) For each of the 25 animals there are two response variables (each with five categories) corresponding to attitudes toward the animal before and after the visit to the wildlife center.
   (d) ordinal
   (e) The original data were used to create a binary response for each student (did their attitude toward snakes become more positive?). As mentioned in part (a), students' responses may not be independent and the total number of positive responses may not be binomial. Thus, p̂(1 − p̂)/n may underestimate the variance of the sample proportion, and p̂ ± (1.96)√(p̂(1 − p̂)/n) may produce intervals that are too short to provide 95% confidence.

5. Let π_1 and π_2 be the survival rates of the standard treatment and the new treatment, respectively. Then the testing problem is "H_0: π_1 = π_2 vs. H_A: π_1 < π_2". The alternative is one-sided because the objective of the study is to demonstrate that the new treatment is more effective. Here, α = 0.05, β = 0.20, z_α = 1.64485, z_β = 0.84162, π_1 = 0.55, π_2 = 0.75.
    π̄ = (π_1 + π_2)/2 = 0.65

    r = √( 2π̄(1 − π̄) / (π_1(1 − π_1) + π_2(1 − π_2)) ) = 1.02273

    n = (z_β + z_α r)^2 (π_1(1 − π_1) + π_2(1 − π_2)) / (π_1 − π_2)^2 = 69.27

Then, 70 subjects are needed in each treatment.

6. (a) Note that p̂ = 4/933 = 0.004287 and np̂ = 4 < 5. Thus a normal approximation would not be reasonable. The exact confidence interval is computed as (0.001169, 0.010940) using the formulas on pages 73-76 of the lecture notes.
   (b) As in part (a), the normal approximation would not be reasonable. The exact confidence interval is (0.0000585, 0.0017427).

7. (a) The table of counts would have a multinomial distribution if simple random sampling with replacement was used. Since the population is large, a multinomial distribution would provide a good approximation to the distribution of the counts if simple random sampling without replacement was used. The log-likelihood function is

    ℓ(π; Y) = log(n!) − Σ_{i=1}^{2} Σ_{j=1}^{2} log(Y_ij!) + Σ_{i=1}^{2} Σ_{j=1}^{2} Y_ij log(π_ij)

   (b) Under the independence model, π_ij = π_{i+} π_{+j}. Substituting this into the log-likelihood function shown above, we have

    ℓ(π; Y) = log(n!) − Σ_{i=1}^{2} Σ_{j=1}^{2} log(Y_ij!) + Σ_{i=1}^{2} Y_{i+} log(π_{i+}) + Σ_{j=1}^{2} Y_{+j} log(π_{+j})

Maximizing this log-likelihood subject to the constraints π_{1+} + π_{2+} = 1 and π_{+1} + π_{+2} = 1 gives the maximum likelihood estimates of the expected counts

    m̂^B_ij = Y_{i+} Y_{+j} / Y_{++},   i = 1, 2 and j = 1, 2,

and (m̂^B_11, m̂^B_12, m̂^B_21, m̂^B_22) = (799.5, 220.5, 295.5, 81.5).
   (c) Under the general alternative, the m.l.e.'s of the expected counts are the same as the observed counts; i.e., (m̂^C_11, m̂^C_12, m̂^C_21, m̂^C_22) = (784, 236, 311, 66).
   (d)

    G^2 = 2 Σ_{i=1}^{2} Σ_{j=1}^{2} Y_ij log(Y_ij / m̂^B_ij) = 5.321 on 1 d.f. with p-value = .021

    X^2 = Σ_{i=1}^{2} Σ_{j=1}^{2} (Y_ij − m̂^B_ij)^2 / m̂^B_ij = 5.15 on 1 d.f. with p-value = .023

The data suggest that opinions on gun registration are not held independently of opinions on the death penalty.
In particular, people who oppose the death penalty are more likely to favor gun registration than people who favor the death penalty.
   (e) The log-likelihood function is

    ℓ(θ; Y) = log(n!) − Σ_{i=1}^{2} Σ_{j=1}^{2} log(Y_ij!) + Σ_{i=1}^{2} Σ_{j=1}^{2} Y_ij log(π_ij)
            = log(n!) − Σ_{i=1}^{2} Σ_{j=1}^{2} log(Y_ij!) + Y_11 log(θ^2) + (Y_12 + Y_21) log(θ(1 − θ)) + Y_22 log((1 − θ)^2)
            = log(n!) − Σ_{i=1}^{2} Σ_{j=1}^{2} log(Y_ij!) + (2Y_11 + Y_12 + Y_21) log(θ) + (Y_12 + Y_21 + 2Y_22) log(1 − θ)

   (f) Solve the likelihood equation

    0 = ∂ℓ(θ; Y)/∂θ = (2Y_11 + Y_12 + Y_21)/θ − (Y_12 + Y_21 + 2Y_22)/(1 − θ)

The maximum likelihood estimate is

    θ̂ = (2Y_11 + Y_12 + Y_21)/(2Y_{++}) = (Y_{1+} + Y_{+1})/(2Y_{++}) = 0.757

   (g) Compute m.l.e.'s for the expected counts:

    m̂^A_11 = Y_{++} θ̂^2 = 800.5055
    m̂^A_12 = m̂^A_21 = Y_{++} θ̂(1 − θ̂) = 256.9944
    m̂^A_22 = Y_{++} (1 − θ̂)^2 = 82.5055

   (h) The deviance statistic is

    G^2 = 2 Σ_{i=1}^{2} Σ_{j=1}^{2} Y_ij log(Y_ij / m̂^A_ij) = 16.2823

with 2 d.f. and p-value = .0003. It is not surprising that the data do not support this model, because the independence model was rejected in a previous part of this problem and this model is a restricted form of the independence model.
   (i) An analysis of deviance table:

    Comparison             d.f.   Deviance   p-value
    Model A vs. Model B     1      10.962     .0009
    Model B vs. Model C     1       5.321     .021
    Model A vs. Model C     2      16.283     .0003

Note that the deviance statistic for "Model A vs. Model B" is given by

    G^2 = 2 Σ_{i=1}^{2} Σ_{j=1}^{2} Y_ij log(m̂^B_ij / m̂^A_ij) = 10.962

Although Model B is a significant improvement over Model A, neither Model A nor Model B is appropriate in this case.
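The S-Plus appendix below reproduces the Problem 7 calculations; as an additional cross-check (not part of the original solutions), the same quantities can be sketched in Python with numpy, using the observed table Y and the formulas from parts (b), (d), (f), (g), and (i):

```python
import numpy as np

# Observed counts (gun registration x death penalty), Problem 7
Y = np.array([[784.0, 236.0],
              [311.0,  66.0]])
n = Y.sum()

# Model B (independence): expected counts m^B_ij = Y_i+ * Y_+j / Y_++
mB = np.outer(Y.sum(axis=1), Y.sum(axis=0)) / n

# Model B vs Model C: deviance G^2 and Pearson X^2
G2_BC = 2 * (Y * np.log(Y / mB)).sum()            # ~ 5.321
X2    = ((Y - mB) ** 2 / mB).sum()                # ~ 5.150

# Model A: common probability theta of a "favor" response
theta = (2 * Y[0, 0] + Y[0, 1] + Y[1, 0]) / (2 * n)  # ~ 0.757
piA = np.array([[theta ** 2,          theta * (1 - theta)],
                [theta * (1 - theta), (1 - theta) ** 2]])
mA = n * piA

# Deviances for the analysis of deviance table in part (i)
G2_AC = 2 * (Y * np.log(Y / mA)).sum()            # ~ 16.282
G2_AB = 2 * (Y * np.log(mB / mA)).sum()           # ~ 10.961
```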
> #=========================================================================#
> # Problem 5                                                               #
> #=========================================================================#
> pi1 <- .55
> pi2 <- .75
>
> alpha <- .05
> beta <- .20
> z.alpha <- qnorm( 1 - alpha )
> z.beta <- qnorm( 1 - beta )
>
> p <- (pi1+pi2)/2
> r <- sqrt( 2*p*(1-p)/(pi1*(1-pi1)+pi2*(1-pi2)) )
> n <- (z.beta+z.alpha*r)^2*(pi1*(1-pi1)+pi2*(1-pi2))/(pi1-pi2)^2
> c(p,r,n)
[1]  0.65000  1.02273 69.27248
>
> #=========================================================================#
> # Problem 6                                                               #
> #=========================================================================#
> # The following function uses the F-distribution to construct a
> # 100*a% "exact" confidence interval for the probability of
> # success in a binomial distribution;
>
> binci.f <- function(x, n, a){
+   # x = observed number of successes;
+   # n = number of trials;
+   # a = level of confidence (e.g. 0.95);
+   p <- x/n
+   a2 <- 1-((1-a)/2)
+   if (x > 0)
+     f1 <- qf(a2, 2*(n-x+1), 2*x)
+   else
+     f1 <- 1
+   if (n > x)
+     f2 <- qf(a2, 2*(x+1), 2*(n-x))
+   else
+     f2 <- 1
+   plower <- x / (x + (n-x+1)*f1)
+   pupper <- (x+1)*f2 / ((n-x)+(x+1)*f2)
+   cbind( plower=plower, pupper=pupper )
+ }
> binci.f(x=4, n=933, a=0.95)
          plower     pupper
[1,] 0.001169328 0.01094038
> binci.f(x=2, n=4143, a=0.95)
            plower      pupper
[1,] 0.00005846764 0.001742731
>
> #=========================================================================#
> # Problem 7                                                               #
> #=========================================================================#
> #-----------------------------------------------------------------------(b)
> Y <- matrix( c( 784, 236, 311, 66), 2, 2, byrow=T )
> dimnames(Y) <- list( c("GR.Favor","GR.Oppose"), c("DP.Favor","DP.Oppose") )
> Y
          DP.Favor DP.Oppose
GR.Favor       784       236
GR.Oppose      311        66
>
> row.total <- Y["GR.Favor",] + Y["GR.Oppose",]
> col.total <- Y[,"DP.Favor"] + Y[,"DP.Oppose"]
> total <- sum(Y)
> #------------------------------------#
> # Expected counts; Independence Model#
> #------------------------------------#
> mB.hat <- (col.total %o% row.total) / total
> mB.hat
          DP.Favor DP.Oppose
GR.Favor  799.4989 220.50107
GR.Oppose 295.5011  81.49893
> #------------------------------------------------------------------------(d)
> #------------------------------------#
> # Pearson X^2 Test: ModelB vs ModelC #
> #------------------------------------#
> chisq.test(Y, correct=F)

        Pearson's chi-square test without Yates' continuity correction

data:  Y
X-square = 5.1503, df = 1, p-value = 0.0232

> # See (i) for deviance G^2 test #
> #------------------------------------------------------------------------(f)
> #------------------------------------#
> # MLE of theta                       #
> #------------------------------------#
> theta <- (2 * Y[1,1] + Y[1,2] + Y[2,1]) / ( 2 * sum(Y) )
> theta
[1] 0.7569792
> #------------------------------------------------------------------------(g)
> #------------------------------------#
> # Expected counts by Model A         #
> #------------------------------------#
> piA.hat <- matrix( c(theta^2, theta*(1-theta),
+                      theta*(1-theta), (1-theta)^2 ), 2, 2 )
> mA.hat <- total * piA.hat
> mA.hat
         [,1]      [,2]
[1,] 800.5055 256.99445
[2,] 256.9945  82.50555
> #------------------------------------------------------------------------(i)
> table2 <- matrix(0, 3, 3)
> dimnames(table2) <- list( c( "A vs B", "B vs C", "A vs C"),
+                           c("df", "deviance", "pvalue") )
> #------------------------------------#
> # MODEL A vs MODEL B                 #
> #------------------------------------#
> table2["A vs B", "deviance"] <- 2 * sum( Y * log( mB.hat/mA.hat ) )
> table2["A vs B", "df"] <- 1
> table2["A vs B", "pvalue"] <- 1 - pchisq(table2["A vs B", "deviance"],
+                                          table2["A vs B", "df"] )
> #------------------------------------#
> # MODEL B vs MODEL C                 #
> #------------------------------------#
> table2["B vs C", "deviance"] <- 2 * sum( Y * log( Y / mB.hat ) )
> table2["B vs C", "df"] <- 1
> table2["B vs C", "pvalue"] <- 1 - pchisq(table2["B vs C", "deviance"],
+                                          table2["B vs C", "df"] )
> #------------------------------------#
> # MODEL A vs MODEL C                 #
> #------------------------------------#
> table2["A vs C", "deviance"] <- 2 * sum( Y * log( Y /mA.hat ) )
> table2["A vs C", "df"] <- 2
> table2["A vs C", "pvalue"] <- 1 - pchisq(table2["A vs C", "deviance"],
+                                          table2["A vs C", "df"] )
> #------------------------------------#
> # Analysis of Deviance Table         #
> #------------------------------------#
> table2
       df deviance       pvalue
A vs B  1 10.96130 0.0009303443
B vs C  1  5.32065 0.0210741513
A vs C  2 16.28195 0.0002913527
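The F-distribution limits computed by binci.f are algebraically identical to the Clopper-Pearson "exact" limits expressed through beta quantiles. As a cross-check outside the original appendix (a sketch, assuming scipy is available), the Problem 6 intervals can be reproduced in Python:

```python
from scipy.stats import beta

def exact_ci(x, n, conf=0.95):
    """Clopper-Pearson 'exact' binomial interval via beta quantiles;
    equivalent to the F-distribution formulas used by binci.f."""
    a = 1 - conf
    lower = beta.ppf(a / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - a / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

print(exact_ci(4, 933))   # ~ (0.001169, 0.010940), as in Problem 6(a)
print(exact_ci(2, 4143))  # ~ (0.0000585, 0.0017427), as in Problem 6(b)
```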