Stat 557 Fall 2000
Assignment 1 Solutions

Problem 1

(a) population: Adults in the State of New York with telephones.
unit of response: An adult from the population.
response variable: A binary (yes, no) nominal variable.
explanatory variables: age (interval variable) and sex (nominal variable).

(b) population: Some population of patients with a certain liver disease. The population is not clearly identified.
unit of response: A patient from the population.
response variable: Disease severity measured as an ordinal variable on a five point scale.
explanatory variables: time.
Alternatively, you could view the repeated measurements on the same subject as a set of responses and not list time as an explanatory variable. In either case, an appropriate analysis of the data would have to account for correlations among repeated measurements taken on an individual patient.

(c) population: Iowa State students.
unit of response: A student and his or her parents.
response variable: There are six potentially correlated binary (nominal) response variables corresponding to right (or left) handedness and right (or left) footedness for the student and his or her parents. Alternatively, you could think of this as a single nominal variable with 64 categories.
explanatory variables: None are mentioned.
Alternatively, you could take right (or left) handedness and right (or left) footedness of the student as a pair of binary response variables and use the information from the parents as explanatory variables. The appropriate point of view would be dictated by the objectives of the study.

(d) population: Schools in Iowa with sixth grade classes.
unit of response: This is a two stage sampling scheme where schools are sampled from the population of schools in Iowa with sixth grade students. Then one class is randomly selected from the sixth grade classes in each of the selected schools. An entire sixth grade class is the unit of response.
Since students in a class are influenced by the same teacher and largely share the same educational background, two students from the same class will tend to respond in a more similar way than two students from different schools. To use sixth grade students as the units of response, correlations among responses from classmates would have to be taken into account. This could be difficult because it would require a model that would allow for different types or strengths of relationships between different pairs of classmates.
response variable: For each of the 25 animals there are two ordinal response variables (each with five categories) corresponding to attitudes toward the animal before and after the visit to the wildlife center.
explanatory variables: None are mentioned.
Alternatively, you could consider the post-visit attitudes as the response variables and the pre-visit responses as explanatory variables. This would depend on the objectives of the study, but they were not addressed in this problem.

(e) population: All orders for automobile parts filled by the warehouse center in a specific time period.
unit of response: An order, which may contain more than one line.
response variable: An interval variable corresponding to the number of errors made in filling an order.
explanatory variables: The number of lines in an order could be used as an explanatory variable. There may also be information on the types of errors or types of items ordered.

(f) population: The population from which the 200,000 potential customers were selected was not identified.
unit of response: A person on the mailing list.
response variable: There are two binary (nominal) response variables. One is whether or not a person responded to the mailing. The other is whether or not a respondent used the new credit card within one year.
You could think of this as a single nominal response variable with three categories (no initial response, responded and did not use the credit card, responded and used the credit card).
explanatory variables: sex (nominal), age (interval), income (interval), marital status (nominal), home ownership (nominal), credit history (nominal).

Problem 2

Let $\pi_1$ and $\pi_2$ be the success rate of the ACS program and the success rate of the psychologist's new program, respectively. Then the testing problem is $H_0: \pi_1 = \pi_2$ versus $H_A: \pi_1 < \pi_2$. The alternative is one-sided because the objective of the study is to demonstrate that the new program is better. Here, $\alpha = 0.05$, $\beta = 0.20$, $z_\alpha = 1.64485$, $z_\beta = 0.84162$, $\pi_1 = 0.17$, $\pi_2 = 0.24$.

\[ p = (\pi_1 + \pi_2)/2 \]

\[ r = \sqrt{\frac{2p(1-p)}{\pi_1(1-\pi_1) + \pi_2(1-\pi_2)}} = 1.0038 \]

\[ n = \frac{(z_\beta + z_\alpha r)^2 \, (\pi_1(1-\pi_1) + \pi_2(1-\pi_2))}{(\pi_1 - \pi_2)^2} = 410.2 \]

Then, 411 smokers are needed in each program.

Problem 3

(a) The observed proportions of SIDS cases in the nine weight categories are 0.00358, 0.00726, 0.01327, 0.00423, 0.00321, 0.00208, 0.00156, 0.00155, 0.00097, respectively. Using binomial models for the number of SIDS cases, conditioning on the number of births in each weight category, the standard errors for the observed proportions are 0.00358, 0.00417, 0.00381, 0.00121, 0.00053, 0.00029, 0.00027, 0.00043, 0.00069, respectively. The SIDS incidence rates in the two lowest weight categories are poorly estimated, and it is not clear if SIDS incidence rates really increase across the first three weight categories. Given the accuracy of the observed proportions, there appears to be a decreasing trend in the incidence rates of SIDS cases as birth weight increases. The analysis in problem 4 does not reject the fit of a monotonically decreasing power (or exponential) curve to this trend.

(b) Without combining any birth weight categories, $X^2 = 71.25$ with d.f. = 8, and the chi-square approximation to the distribution of the Pearson statistic, when the null hypothesis is true, yields p-value < .0001.
SIDS rates are not the same for all birth weight categories. Since two of the estimated expected counts are smaller than one and four of the estimated expected counts are smaller than 5, the large sample chi-square approximation to the distributions of the Pearson and deviance statistics may not provide accurate p-values. One option would be to combine the results for the three smallest birth weight categories. This yields a Pearson statistic that also rejects the hypothesis of equal SIDS rates across all birth weight categories.

(c) Let $\pi_{i1}$ denote the joint probability of a birth in the ith birth weight category and a SIDS case. Let $\pi_{i2}$ denote the joint probability of a birth in the ith birth weight category and a non-SIDS case. Let $X_{i1}$ and $X_{i2}$ denote the corresponding random counts. The joint distribution of $X_{11}, X_{12}, \ldots, X_{91}, X_{92}$ is a multinomial distribution with total sample size $n = \sum_{i=1}^{9}\sum_{j=1}^{2} X_{ij}$. The probability function is

\[ P(X_{11}=x_{11}, X_{12}=x_{12}, \ldots, X_{91}=x_{91}, X_{92}=x_{92}) = \frac{n!}{\prod_{i=1}^{9}\prod_{j=1}^{2} x_{ij}!} \prod_{i=1}^{9}\prod_{j=1}^{2} \pi_{ij}^{x_{ij}}. \]

Let $n_i = X_{i1} + X_{i2}$ be the number of infants in the ith birth weight category. In the sums below, $S$ denotes the set of counts $x_{11}, \ldots, x_{82}$ for the first eight categories with $\sum_{i=1}^{8}\sum_{j=1}^{2} x_{ij} = n - n_9$. Then,

\begin{align*}
P(X_{91}=x_{91} \mid X_{91}+X_{92}=n_9)
&= \frac{P(X_{91}=x_{91} \text{ and } X_{91}+X_{92}=n_9)}{P(X_{91}+X_{92}=n_9)} \\
&= \frac{P(X_{91}=x_{91} \text{ and } X_{92}=n_9-x_{91})}{\sum_{k=0}^{n_9} P(X_{91}=k, X_{92}=n_9-k)} \\
&= \frac{\sum_{S} P(X_{11}=x_{11}, \ldots, X_{82}=x_{82}, X_{91}=x_{91}, X_{92}=n_9-x_{91})}{\sum_{k=0}^{n_9}\sum_{S} P(X_{11}=x_{11}, \ldots, X_{82}=x_{82}, X_{91}=k, X_{92}=n_9-k)} \\
&= \frac{\dfrac{n!}{x_{91}!\,(n_9-x_{91})!}\, \pi_{91}^{x_{91}} \pi_{92}^{n_9-x_{91}} \sum_{S} \prod_{i=1}^{8}\prod_{j=1}^{2} \dfrac{\pi_{ij}^{x_{ij}}}{x_{ij}!}}{\sum_{k=0}^{n_9} \dfrac{n!}{k!\,(n_9-k)!}\, \pi_{91}^{k} \pi_{92}^{n_9-k} \sum_{S} \prod_{i=1}^{8}\prod_{j=1}^{2} \dfrac{\pi_{ij}^{x_{ij}}}{x_{ij}!}} \\
&= \frac{\dfrac{1}{x_{91}!\,(n_9-x_{91})!}\, \pi_{91}^{x_{91}} \pi_{92}^{n_9-x_{91}}}{\dfrac{1}{n_9!} \sum_{k=0}^{n_9} \dfrac{n_9!}{k!\,(n_9-k)!}\, \pi_{91}^{k} \pi_{92}^{n_9-k}} \\
&= \frac{\dfrac{1}{x_{91}!\,(n_9-x_{91})!}\, \pi_{91}^{x_{91}} \pi_{92}^{n_9-x_{91}}}{\dfrac{1}{n_9!} (\pi_{91}+\pi_{92})^{n_9}} \\
&= \frac{n_9!}{x_{91}!\,(n_9-x_{91})!} \left(\frac{\pi_{91}}{\pi_{91}+\pi_{92}}\right)^{x_{91}} \left(1 - \frac{\pi_{91}}{\pi_{91}+\pi_{92}}\right)^{n_9-x_{91}}.
\end{align*}

Hence if you condition on the number of infants in the 9th (heaviest weight) category, then $X_{91}$, the number of SIDS cases in that category, has a Binomial($n_9$, $\pi_{91}/(\pi_{91}+\pi_{92})$) distribution. Note that $\pi_{91}/(\pi_{91}+\pi_{92})$ is the conditional probability that a baby in the heaviest birth weight category becomes a SIDS case.

(d) The formula is $p \pm z_{.025}\sqrt{p(1-p)/n}$. Here, $n = 2061$ and $p = 2/2061$. This yields: lower limit = -0.00037, upper limit = 0.00231. The large sample normal approximation to the binomial distribution is not a good approximation because the expected number of SIDS cases is too small. Note that this is not the result produced by the prop.test function in S-PLUS.

(e) lower limit = 0.000118, upper limit = 0.0035.

Problem 4

(a) Note that

\[ \frac{\pi_{i+1}}{\pi_i} = \frac{\exp(\beta_0 + \beta_1 (i+1))}{\exp(\beta_0 + \beta_1 i)} = \exp(\beta_1). \]

Then, $\beta_1 = \log(\pi_{i+1}/\pi_i)$ is the log of the relative risk of SIDS for adjacent birth weight categories. Since $\exp(\hat\beta_1) = 0.72$ and $1/0.72 = 1.39$, the relative risk of SIDS increases by about 40 percent when the birth falls into the next lower birth weight category.

(b) The log-likelihood function is

\begin{align*}
l &= \log\left( \prod_{i=1}^{9} \frac{n_i!}{Y_i!\,(n_i-Y_i)!}\, \pi_i^{Y_i} (1-\pi_i)^{n_i-Y_i} \right) \\
&= \sum_{i=1}^{9} \log\left( \frac{n_i!}{Y_i!\,(n_i-Y_i)!} \right) + \sum_{i=1}^{9} Y_i \log \pi_i + \sum_{i=1}^{9} (n_i-Y_i) \log(1-\pi_i) \\
&= \sum_{i=1}^{9} \log\left( \frac{n_i!}{Y_i!\,(n_i-Y_i)!} \right) + \sum_{i=1}^{9} Y_i (\beta_0 + \beta_1 i) + \sum_{i=1}^{9} (n_i-Y_i) \log(1-\exp(\beta_0 + \beta_1 i)).
\end{align*}

(c) The likelihood equations are

\[ \frac{\partial l}{\partial \beta_0} = \sum_{i=1}^{9} Y_i - \sum_{i=1}^{9} (n_i-Y_i)\, \frac{\exp(\beta_0 + \beta_1 i)}{1-\exp(\beta_0 + \beta_1 i)} = 0 \]

\[ \frac{\partial l}{\partial \beta_1} = \sum_{i=1}^{9} i\,Y_i - \sum_{i=1}^{9} i\,(n_i-Y_i)\, \frac{\exp(\beta_0 + \beta_1 i)}{1-\exp(\beta_0 + \beta_1 i)} = 0. \]

The solution to these equations must be obtained numerically with an iterative algorithm. There is no convenient algebraic formula for the solution.
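As an illustration of the iterative solution mentioned in part (c), here is a minimal sketch in R-style S code. It uses Fisher scoring, which is one common choice and not necessarily the algorithm used to produce the estimates in these solutions. The names `score` and `info`, the least squares starting values, and the stopping rule are assumptions made for this sketch. The data are the SIDS counts and births from problem 3.

```r
# SIDS cases (y) and births (n) in birth weight categories 1..9 (problem 3 data)
y <- c(1, 3, 12, 12, 37, 52, 34, 13, 2)
n <- c(279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061)
i <- 1:9

# Left sides of the two likelihood equations for pi_i = exp(b0 + b1*i)
score <- function(b) {
  e <- exp(b[1] + b[2] * i)
  w <- (n - y) * e / (1 - e)
  c(sum(y) - sum(w), sum(i * y) - sum(i * w))
}

# Expected (Fisher) information matrix for (b0, b1)
info <- function(b) {
  e <- exp(b[1] + b[2] * i)
  o <- n * e / (1 - e)
  matrix(c(sum(o), sum(i * o), sum(i * o), sum(i^2 * o)), 2, 2)
}

# Starting values (an assumption of this sketch): least squares fit of the
# log observed proportions on the category index
b <- as.vector(coef(lm(log(y / n) ~ i)))

# Fisher scoring: b <- b + info(b)^{-1} score(b), repeated until the step is tiny
for (iter in 1:50) {
  step <- solve(info(b), score(b))
  b <- b + step
  if (max(abs(step)) < 1e-10) break
}
b
```

The resulting values of `b` can be compared with the estimates `beta0` and `beta1` entered in the S-PLUS code for problem 4 at the end of these solutions.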
(d) The data are consistent with the proposed model. $X^2 = 10.79$ with $9 - 2 = 7$ degrees of freedom and p-value = 0.148. $G^2 = 9.45$ with $9 - 2 = 7$ degrees of freedom and p-value = 0.222. Under the alternative model, you must estimate a different incidence rate for each of the nine binomial distributions. Under the null hypothesis you must estimate two parameters, $\beta_0$ and $\beta_1$. Hence, the difference in the dimensions of the parameter spaces is $9 - 2 = 7$.

(e) The second partial derivatives of the log-likelihood function are

\[ \frac{\partial^2 l}{\partial \beta_0^2} = -\sum_{i=1}^{9} (n_i-Y_i)\, \frac{\exp(\beta_0 + \beta_1 i)}{(1-\exp(\beta_0 + \beta_1 i))^2} \]

\[ \frac{\partial^2 l}{\partial \beta_1^2} = -\sum_{i=1}^{9} i^2 (n_i-Y_i)\, \frac{\exp(\beta_0 + \beta_1 i)}{(1-\exp(\beta_0 + \beta_1 i))^2} \]

\[ \frac{\partial^2 l}{\partial \beta_0 \partial \beta_1} = -\sum_{i=1}^{9} i\,(n_i-Y_i)\, \frac{\exp(\beta_0 + \beta_1 i)}{(1-\exp(\beta_0 + \beta_1 i))^2} \]

Since $E(Y_i) = n_i \pi_i = n_i \exp(\beta_0 + \beta_1 i)$, we have $E(n_i - Y_i) = n_i (1 - \exp(\beta_0 + \beta_1 i))$, and the negatives of the expectations of the second partial derivatives are

\[ E\left(-\frac{\partial^2 l}{\partial \beta_0^2}\right) = \sum_{i=1}^{9} n_i\, \frac{\exp(\beta_0 + \beta_1 i)}{1-\exp(\beta_0 + \beta_1 i)} \]

\[ E\left(-\frac{\partial^2 l}{\partial \beta_1^2}\right) = \sum_{i=1}^{9} i^2 n_i\, \frac{\exp(\beta_0 + \beta_1 i)}{1-\exp(\beta_0 + \beta_1 i)} \]

\[ E\left(-\frac{\partial^2 l}{\partial \beta_0 \partial \beta_1}\right) = \sum_{i=1}^{9} i\, n_i\, \frac{\exp(\beta_0 + \beta_1 i)}{1-\exp(\beta_0 + \beta_1 i)} \]

(f) The Fisher information matrix,

\[ I = -\begin{pmatrix} E\left(\dfrac{\partial^2 l}{\partial \beta_0^2}\right) & E\left(\dfrac{\partial^2 l}{\partial \beta_0 \partial \beta_1}\right) \\ E\left(\dfrac{\partial^2 l}{\partial \beta_0 \partial \beta_1}\right) & E\left(\dfrac{\partial^2 l}{\partial \beta_1^2}\right) \end{pmatrix}, \]

is obtained from the negative of the expectations of the second partial derivatives. By substituting mle's for unknown parameters, we have the estimate

\[ \hat{I} = \begin{pmatrix} 166.48 & 950.31 \\ 950.31 & 5780.38 \end{pmatrix}. \]

The inverse of this matrix,

\[ \hat{I}^{-1} = \begin{pmatrix} 0.0977 & -0.0161 \\ -0.0161 & 0.00281 \end{pmatrix}, \]

provides an estimate of the covariance matrix for the large sample normal approximation to the distribution of the mle's.

(g) The standard errors are

\[ s_{\hat\beta_0} = \sqrt{0.0977} = 0.313, \qquad s_{\hat\beta_1} = \sqrt{0.00281} = 0.053. \]

In this case, it is easy to obtain formulas for the expectations of the second partial derivatives of the log-likelihood. In situations where the expectations are not easy to derive, the values of the second partial derivatives evaluated at the mle's of the parameters are used instead of the expectations of those derivatives.
This is often called the "local" estimator of the information matrix.

#-----------------------------------------------------------------------;
#Xiao-Hu Liu used the following S-PLUS code to obtain the solutions to assignment 1.

#------------;
# problem 2  ;
#------------;
alpha <- 0.05
beta <- 0.2
pi1 <- 0.17
pi2 <- 0.24
p <- (pi1+pi2)/2
z.alpha <- qnorm(1-alpha)
z.beta <- qnorm(1-beta)
temp <- pi1*(1-pi1)+pi2*(1-pi2)
r <- sqrt(2*p*(1-p)/temp)
n <- (z.beta+z.alpha*r)^2*temp/(pi1-pi2)^2

#------------;
# problem 3  ;
#------------;
cat <- 1:9
x <- c(1,3,12,12,37,52,34,13,2)
n <- c(279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061)

#Merge the data;
cbind(cat, x, n)

#part (a);
p <- x/n
plot(cat, p)
# a decreasing trend, a power (or exponential) curve may fit
# the scatter plots;

#part (b);
#method 1: Chi-square test for two way contingency table;
x1 <- n-x
X <- cbind(x, x1)
chisq.test(X)

#method 2: use prop.test to test equal proportion of several
# independent binomial distributions;
prop.test(x, n)

#part (d);
x9 <- x[9]
n9 <- n[9]
prop.test(x9, n9)$conf.int

#part (e);
#The following function uses F-distribution to construct a
#100*a% "exact" confidence interval for the probability of
#success in a binomial distribution;
binci.f <- function(x, n, a) {
  # x = observed number of successes;
  # n = number of trials;
  # a = level of confidence (e.g. 0.95);
  p <- x/n
  a2 <- 1-((1-a)/2)
  if (x > 0) f1 <- qf(a2, 2*(n-x+1), 2*x) else f1 <- 1
  plower <- x/(x + (n-x+1)*f1)
  if (n > x) f2 <- qf(a2, 2*(x+1), 2*(n-x)) else f2 <- 1
  pupper <- (x+1)*f2/((n-x)+(x+1)*f2)
  cbind(plower=plower, pupper=pupper)
}
binci.f(x9, n9, 0.95)

#------------;
# problem 4  ;
#------------;
#The data were entered into S-PLUS in problem 3;

#part (d);
beta0 <- -4.10393
beta1 <- -0.329635
pi.hat <- exp(beta0+beta1*cat)
m <- n*pi.hat
m1 <- n-m
M <- cbind(m, m1)
X.sqr <- sum((X-M)^2/M)
df <- 7
p.value1 <- 1-pchisq(X.sqr, df)
G.sqr <- 2*sum(X*log(X/M))
p.value2 <- 1-pchisq(G.sqr, df)

#part (f);
o.r <- pi.hat/(1-pi.hat)
a11 <- sum(n*o.r)
a12 <- sum(cat*n*o.r)
a22 <- sum(cat^2*n*o.r)
I <- matrix(c(a11, a12, a12, a22), 2, 2)
I.inv <- solve(I)

#part (g);
std <- sqrt(diag(I.inv))
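
The "local" estimator of the information matrix described in problem 4(g) can be sketched in the same style. This block is an illustration added to the listing, not part of the original S-PLUS code. The names `J`, `se.local` and `se.expected` are hypothetical. It evaluates the second partial derivatives from problem 4(e) at the estimates `beta0` and `beta1` used above, and compares the resulting standard errors with those from the expected information.

```r
# SIDS cases (y) and births (n) in birth weight categories 1..9 (problem 3 data)
y <- c(1, 3, 12, 12, 37, 52, 34, 13, 2)
n <- c(279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061)
i <- 1:9

# Estimates taken from the problem 4 part (d) code
beta0 <- -4.10393
beta1 <- -0.329635
e <- exp(beta0 + beta1 * i)

# "Local" (observed) information: negative second partial derivatives at the mle's
w <- (n - y) * e / (1 - e)^2
J <- matrix(c(sum(w), sum(i * w), sum(i * w), sum(i^2 * w)), 2, 2)
se.local <- sqrt(diag(solve(J)))

# Expected information, as in part (f), for comparison
o <- n * e / (1 - e)
I <- matrix(c(sum(o), sum(i * o), sum(i * o), sum(i^2 * o)), 2, 2)
se.expected <- sqrt(diag(solve(I)))

rbind(local = se.local, expected = se.expected)
```

For these data the two sets of standard errors should be close, because the observed counts are small relative to the numbers of births in every category.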