STAT 557 FALL 1998 FINAL EXAM Name _______________

Instructions: You may use only the formula sheet provided with this exam. No other notes or books are allowed. Write your answers in the spaces provided on this exam. If you need more space, use the back of the page or attach additional sheets of paper, but clearly indicate where this is done. You should not attempt to complete complex calculations. You will receive complete credit by showing that you know how to properly solve the problem.

1. Use of herbicides to control weeds in grain fields may adversely affect the environment by inhibiting reproduction in certain organisms. To examine the reproductive toxicity of a certain herbicide, C. dubia zooplankton were exposed to five concentrations of the herbicide: 0, 0.8, 1.6, 2.35, and 3.1 mg/l. The total number of offspring born to the C. dubia at each concentration is shown below.

Herbicide concentration (mg/l)    0     0.8    1.6    2.35   3.1
Number of offspring               33    31     28     17     6

Maximum likelihood estimation was used to fit a Poisson regression model to these data. Using $Z$ to denote the herbicide concentration and $m_Z$ to denote the corresponding mean number of offspring, the estimated model is

$$\log(\hat{m}_Z) = \underset{(0.1178)}{3.5721} \; - \; \underset{(0.0340)}{0.1546}\, Z^2$$

Standard errors are given in parentheses beneath the estimated coefficients. The estimated covariance matrix for the estimated coefficients is

             intercept    Z^2
intercept     0.01387    -0.00245
Z^2          -0.00245     0.0016

(a) Write down a formula for the log-likelihood function.

(b) Explain how the estimated covariance matrix, shown above, was obtained.

(c) One quantity of interest is the reproductive index, defined as

$$RI = \frac{m_0 - m_Z}{m_0}$$

where $m_0$ is the mean number of offspring when no herbicide is present. Estimate $Z_{.50}$, the herbicide concentration at which $RI = 0.50$. This is the herbicide concentration corresponding to a 50 percent reduction in the number of offspring.

(d) Show how to construct an approximate 95% confidence interval for $Z_{.50}$ from part (c).

2.
Responses from 2500 high school students were obtained from a survey on smoking habits. These responses were cross-classified into a 2x2x3x3 contingency table with respect to the following factors.

A: Smoking status of the respondent (i=1 smokers, i=2 nonsmokers)
B: Sex of respondent (j=1 female, j=2 male)
C: Socio-economic status of parents (k=1 low, k=2 middle, k=3 high)
D: Smoking status of parents (l=1 neither smoke, l=2 one smokes, l=3 both smoke)

(a) Consider the following log-linear model

$$\log(m_{ijkl}) = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_l^D + \lambda_{ij}^{AB} + \lambda_{kl}^{CD}$$

What does this model imply about associations among the four factors?

(b) Using the constraints

$$\lambda_2^A = \lambda_2^B = \lambda_3^C = \lambda_3^D = \lambda_{12}^{AB} = \lambda_{21}^{AB} = \lambda_{22}^{AB} = \lambda_{13}^{CD} = \lambda_{23}^{CD} = \lambda_{31}^{CD} = \lambda_{32}^{CD} = 0,$$

the maximum likelihood estimates for the interaction terms in the model in part (a) are shown below:

Parameter              Estimate    Standard Error
$\lambda_{11}^{AB}$    -.3505      .0837
$\lambda_{11}^{CD}$     .0716      .3428
$\lambda_{12}^{CD}$    -1.9601     .3334
$\lambda_{21}^{CD}$     .0795      .3649
$\lambda_{22}^{CD}$    -1.3838     .3545

Assuming the model is correct, explain how the estimate for $\lambda_{12}^{CD}$ should be interpreted.

(c) Consider the deviance statistic for testing the null hypothesis that the model in part (a) is correct against the general alternative. How many degrees of freedom are associated with this test?

(d) List the conditions under which the deviance test in part (c) would approximately have a central chi-squared distribution.

(e) The following model was also fit to the data:

$$\log(m_{ijkl}) = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_l^D + \lambda_{ij}^{AB} + \lambda_{kl}^{CD} + \gamma_1 u_i v_k + \gamma_2 u_i w_l$$

where

$(u_1, u_2) = (-.5, .5)$
$(v_1, v_2, v_3) = (-1, 0, 1)$
$(w_1, w_2, w_3) = (-1, 0, 1)$

and $\gamma_1$ and $\gamma_2$ are unknown parameters. Maximum likelihood estimates are

Parameter     Estimate    Standard Error
$\gamma_1$    -.422       .040
$\gamma_2$     .329       .030

Assuming that this model is correct, interpret $\hat{\gamma}_2$ as an odds ratio.
(f) What are the degrees of freedom associated with the deviance test of the null hypothesis that the model in part (a) is correct against the alternative that the model in part (e) is correct?

(g) Suppose respondents were obtained by first taking a simple random sample of 100 high schools and then taking a simple random sample of 25 students within each high school. What effect, if any, would this have on parameter estimates and standard errors obtained by maximizing a multinomial log-likelihood (as was done in part (e) of this problem)?

3. Leukemia patients were treated with chemotherapy and examined at the end of one year to determine if the disease was in remission. Information was also recorded on the following variables:

$Z_1$: The percentage of cells undergoing DNA synthesis at the start of the chemotherapy treatment. This is called the labeling index.
$Z_2$: The highest body temperature of the patient (in °F) during the week prior to the start of chemotherapy.

There were 58 leukemia patients in this study. Let $\pi_i$ denote the conditional probability that the disease goes into remission for patients with values $(Z_{1i}, Z_{2i})$ for the two explanatory variables. Maximum likelihood estimation was used to fit the following logistic regression model to the data:

$$\log\left(\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) = \underset{(42.14)}{103.3} \; + \; \underset{(.1019)}{0.3463}\, Z_{1i} \; - \; \underset{(.4302)}{1.0844}\, Z_{2i}$$

Standard errors are shown in parentheses beneath the corresponding parameter estimates. The estimated covariance matrix for the parameter estimates is

                         intercept      labeling index (Z1)    temperature (Z2)
intercept                1776.05487      1.72931               -18.12329
labeling index (Z1)         1.72931      0.01038                -0.01859
temperature (Z2)          -18.12329     -0.01859                 0.18505

(a) Clearly explain how 0.3463, the estimated coefficient for $Z_1$, can be interpreted with respect to a conditional odds ratio for remission.

(b) Estimate the probability that a leukemia patient with $Z_1$ = 10% and $Z_2$ = 99 °F at the start of the chemotherapy treatment will experience remission.
(c) Show how to construct a 95% confidence interval for this probability (you need not complete the calculations).

(d) Some diagnostic results are shown on page 7. Explain how C and the Dfbeta values are computed.

(e) Explain what the diagnostic results from part (d) indicate about cases 23, 30, and 53.

(f) Four plots are shown on page 9. What do these plots indicate about the estimated model and how it could be improved?

4. Consider the following study of the effect of incubation temperature on the sex of turtle eggs. In this study 10 turtle eggs were collected from each of 20 different sites in Illinois and 10 turtle eggs were also collected from each of 20 different sites in New Mexico. All eggs at one site were laid by the same female. The ten eggs from one site in Illinois were placed in a box with the ten eggs from one site in New Mexico. Sites from Illinois and New Mexico were randomly matched. Four boxes were incubated at each of five temperatures: 26.5, 27, 27.5, 28, and 28.5 °C. Boxes were randomly assigned to temperatures. The numbers of male and female turtles hatching from Illinois eggs and the numbers of male and female turtles hatching from New Mexico eggs were recorded for each box incubated at each temperature.

Although an incubator can be set at a specific temperature, actual temperatures can vary across locations inside the incubator. Hence, the temperatures will not be the same in all boxes incubated at the same temperature setting, and this will affect all of the eggs in a box. A slightly higher temperature in the box, for example, would increase the probability of female turtles hatching from the eggs.

Using $\pi$ to denote the conditional probability that a female turtle hatches from an egg, a logistic regression model is

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2$$

where

$$Z_1 = \begin{cases} 0 & \text{for Illinois eggs} \\ 1 & \text{for New Mexico eggs} \end{cases}$$

and $Z_2$ is the temperature setting on the incubator.
Explain how you would estimate the parameters $\beta_0$, $\beta_1$, and $\beta_2$ in this model and obtain standard errors for the estimates.

SCORE ________ COURSE GRADE ________