STAT 557                    FALL 2000                    Assignment #2

Name ______________

Reading Assignment: Lloyd, Categorical Data Analysis: You should have already read Chapter 1 and Sections 3.1-3.3. Read Chapter 2 and finish reading Chapter 3. Also read Sections 7.1 and 7.2.

Written Assignment:
    On campus:   Due Friday, September 15, in class.
    Off campus:  Due Friday, September 22, in class.

1. The following table of counts appears as Table 2.7 on page 29 of Agresti, Categorical Data Analysis. Assume it was obtained from a simple random sample of 1397 respondents from the population of adults (people more than 18 years old) in the United States in 1982. Each respondent was cross-classified with respect to opinions expressed on the issues of gun control and imposing the death penalty on criminals convicted of certain violent acts.

                              Death Penalty
    Gun Registration        Favor      Oppose
    Favor                    784         236
    Oppose                   311          66

(a) What is the distribution of the possible 2×2 tables of counts that could be observed in such a survey of 1397 respondents?

(b) Report maximum likelihood estimates of the expected counts under the null hypothesis that gun registration opinion is held independently of the death penalty opinion.

    m̂11 = ________   m̂12 = ________   m̂21 = ________   m̂22 = ________

(c) Report values of the deviance G² and Pearson statistic X² for testing the independence hypothesis in part (b) against the general alternative. Report degrees of freedom and p-values. State your conclusion.

2. For a 2×2 contingency table

                          Column Factor
    Row Factor          j=1        j=2
    i=1                 X11        X12
    i=2                 X21        X22

obtained from a multinomial sample of size n, consider testing the following null hypothesis (call it model A)

    $H_0:\ \pi_{11} = \theta^2,\quad \pi_{12} = \pi_{21} = \theta(1-\theta),\quad \pi_{22} = (1-\theta)^2$

for some unknown 0 < θ < 1, against the general alternative (call it model C)

    $H_A:\ 0 < \pi_{ij} < 1 \ \text{ and } \ \pi_{11} + \pi_{12} + \pi_{21} + \pi_{22} = 1.$

Model A imposes both independence between the row and column factors and identical marginal distributions for the row and column factors.
(a) Assuming the null hypothesis is true, give a formula for the log-likelihood function.

(b) Give a formula for the maximum likelihood estimator for θ.

(c) Give a formula for the deviance statistic for testing the null hypothesis against the general alternative and report its degrees of freedom. This is a goodness-of-fit test for model A.

(d) Consider the model (call it model B) that only imposes independence between the row and column factors, i.e., $\pi_{ij} = \pi_{i+}\,\pi_{+j}$ for all (i,j). Complete the following analysis of deviance table for the data in the 2×2 table in Problem 1.

    Comparison              d.f.    deviance value    p-value
    Model A vs Model B
    Model B vs Model C
    Model A vs Model C

State the conclusions you would reach from this analysis of deviance table.

3. The distribution of corn borers was examined by counting the number of corn borers found at each of 120 different locations (Bliss, 1953). The following table gives the number of locations with 0, 1, 2, ... borers.

                                        Expected Counts
    Number of      Number of       Poisson      Neg. Binomial
    Corn Borers    Locations       Model        Model
         0             24
         1             16
         2             16
         3             18
         4             15
         5              9
         6              6
         7              5
         8              3
         9              4
        10              3
        11              0
        12              1

(a) Create a table like the one shown above and record maximum likelihood estimates of the expected counts for the i.i.d. Poisson model in the third column of the table. Use the Pearson chi-square test to assess the fit of the Poisson model. Combine the categories if necessary to keep estimates of expected counts larger than 5. Report values for X², the degrees of freedom, and a p-value. State your conclusion.

(b) An alternative test that is often used to assess the fit of the Poisson model is Fisher's dispersion index. Let X1, …, Xn denote the counts for the n locations, and let

    $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 .$

Since the mean of a Poisson distribution is equal to its variance, both $\bar{X}$ and S² are estimates of the variance.
The Poisson model is declared inadequate when

    $n S^2 / \bar{X} > \chi^2_{n-1,\alpha}$

This test generally has more power against mixed Poisson alternatives, like the negative binomial model, than the test in part (a). Report values for $n S^2 / \bar{X}$, the degrees of freedom, and a p-value. State your conclusion.

(c) Consider a negative binomial model for the corn borer data. Write the maximum likelihood estimates of the expected counts in the fourth column of the table. Use the Pearson chi-square statistic to assess the fit of the negative binomial model. Combine categories if necessary to keep estimates of expected counts larger than 5. Report values for X², its degrees of freedom, and a p-value. State your conclusion.

(d) Construct a 95% confidence interval for the mean number of corn borers per location. Indicate how your confidence interval was constructed. Report

    lower limit = __________     upper limit = __________

4. Vianna, et al. (1971, Lancet, 1, 431-432) considered a series of 109 patients with Hodgkin's disease. A sample of 109 "control" patients was selected from hospital records of patients with no history of Hodgkin's disease or any other malignant disease or chronic illness. The control patients were selected from a set of hospital records that generally matched the composition of the group of patients with Hodgkin's disease with respect to age, sex, race, county of residence, and date of hospital admission. Eight of the patients with Hodgkin's disease and two control patients were deleted from the analysis because their tonsillectomy history could not be obtained. The remaining 208 patients were cross-classified into the following 2×2 contingency table.

                              Had              Did not have
                              Tonsillectomy    Tonsillectomy    Totals
    Hodgkin’s Disease              67               34            101
    Controls                       43               64            107

The Pearson chi-square statistic for testing independence is 14.26 on 1 d.f. with p-value < .001. Vianna, et al.
used the odds ratio α̂ = 2.93 as an approximate measure of relative risk and concluded that tonsillectomy increases the risk of contracting Hodgkin's disease by a factor of nearly 3. They concluded that tonsillectomy removes a protective barrier against Hodgkin's disease.

(a) Compute a 95% confidence interval for the odds ratio. Report

    lower limit = _________     upper limit = _________

(b) A year later, Johnson and Johnson (1972, New England Journal of Medicine, 287, 1122-1125) reported results from a different study of 175 patients treated for Hodgkin's disease at the Radiation Branch of the National Cancer Institute. Information was available on 472 siblings of 172 patients. The authors chose the closest sibling of the same sex within five years of age of each patient. This matching reduced the data to 85 patient-sibling pairs and the following table was reported.

                              Had              Did not have
                              Tonsillectomy    Tonsillectomy    Totals
    Hodgkin’s Disease              41               44             85
    Controls                       33               52             85

The Pearson chi-square statistic for testing independence is computed as 1.53 with p-value = .22, and the estimated odds ratio is α̂ = 1.47 with 95% confidence bounds (0.80, 2.70). On the basis of this, Johnson and Johnson claimed to have refuted the contention of Vianna, et al., that tonsils constitute a lymphoid barrier to Hodgkin's disease. Which authors, if any, do you agree with? State your reasons. If you think any mistakes were made in either of the analyses, describe the mistakes and explain how the data should be analyzed, including formulas for test statistics, relative risk, and a confidence interval for relative risk.

5. The data in Table 3.11 on page 73 in Agresti's book, Categorical Data Analysis, are

                                Surgery    Radiation Therapy
    Cancer Controlled              21             15
    Cancer Not Controlled           2              3

Assume that the 41 larynx cancer patients were randomly assigned to the two treatments.
Use Fisher's exact test to test the null hypothesis that the two treatments are equally effective in controlling the cancer against the alternative that the treatments are not equally effective. Report a p-value and state your conclusion.

6. In a study of the effects of treating multiple sclerosis patients with human fibroblast interferon (IFN-B) (reported by Jacobs, O'Malley, Freeman, and Ekes (1981), Science, 214, pp. 1026-1028), 20 multiple sclerosis patients were randomly divided into a group of 10 IFN-B recipients and a group of 10 controls. At the beginning of the study the severity of each patient's symptoms was evaluated, and at the end of the study each patient was reevaluated and classified as either improved, unchanged, or worsened. The data are given in the following table.

                                  Result of Treatment
                            Improved   Unchanged   Worsened   TOTALS
    Treated with IFN-B          5          4           1         10
    Controls                    1          4           5         10

Perform an "exact" randomization test of the null hypothesis that the IFN-B treatment produces the same results as the treatment given to the controls against the alternative hypothesis that the IFN-B treatment gives better results. Using whatever criterion you think is best to order the tables, report the possible tables that are less consistent with the null hypothesis than the observed table. Compute the p-value for your test and state your conclusion.

7. Roth, et al. (1975, N. E. J. Med., 295, 386-389) report results from a study of 173 skin cancer patients. One objective was to determine if allergic reaction to a contact allergen DCNB was related to the stage of the skin cancer. Each skin cancer patient was classified into one of three stages. Each patient was exposed to DCNB and the reaction was recorded as positive or negative. The results are shown below.

                      Reaction to DCNB
                   Positive    Negative    TOTAL
    Stage I           39          13         52
    Stage II          39          19         58
    Stage III         26          37         63

(a) Describe a scenario under which the counts in this table would have a multinomial distribution.
(b) Using the multinomial model from Part (a), test the null hypothesis that reaction to DCNB is independent of the stage of the skin cancer. Report values for the Pearson X² and log-likelihood ratio G² statistics, degrees of freedom, and p-values.

(c) State your conclusion from Part (b).

(d) Instead of assuming a multinomial distribution for the entire table of counts, condition on the total numbers of patients in the three stages to obtain three independent binomial distributions. Test the null hypothesis that the probability of a positive reaction to DCNB is the same for all three binomial distributions. Do your results differ from those in Part (b)? Explain.

8. The following data are fictitious results for 116 individuals who were cross-classified with respect to lung capacity and smoking habits.

                                   Smoking Habit
    Lung Capacity      None    Occasional    Regular    Heavy
    Normal              36         24           28         4
    Impaired             4          4            8         8
    TOTALS              40         28           36        12

(a) Use the likelihood ratio G² statistic to test the null hypothesis that lung capacity is independent of smoking habit. Report values for G², degrees of freedom, and a p-value.

(b) Repeat Part (a) using the Pearson X² statistic.

(c) Perform an exact conditional test of the null hypothesis in Part (a). Report the p-value. (Use X² values to order the possible tables.)

(d) Use the SAS code stored in the file hw2p8d.sas or the S-PLUS code stored in the file hw2p8d.ssc to simulate 10,000 tables of counts. Each table contains random counts from four independent binomial distributions with common success rate π = 92/116 and sample sizes n1 = 40, n2 = 28, n3 = 36, n4 = 12, respectively. Calculate G² and X² for each simulated table of counts. Compare the results with a chi-square distribution with three degrees of freedom. Report the “true” p-values for the G² and X² tests from Parts (a) and (b) and compare these results with the result from Part (c).
Using $\chi^2_{3,.05}$ as the critical value for each test, report the simulated values of the true type I error levels for the G² and X² tests.

WARNING: The seeds for the random number generators used in these programs are taken from the computer clock. Consequently, students who run this code at different times will generate different sets of tables and obtain slightly different results.

9. How small can the expected counts be under the null hypothesis before the large sample chi-square approximation provides unreliable p-values and does not maintain the proper type I error level for the G² and X² tests? Divide each count in the table for problem 8 by 2. Then, repeat the simulation.

10. Divide each count in the table for problem 8 by 4. Then, repeat the simulation.

11. Subtract one from each count in the table for problem 10. Then, repeat the simulation. How do these results differ from those in problems 8, 9, and 10?
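
For readers without access to SAS or S-PLUS, the simulation described in Problem 8(d) can be sketched in Python along the following lines. This is not the course's hw2p8d.sas or hw2p8d.ssc code, which is not reproduced here. The function names `g2_and_x2` and `simulate`, the fixed seed (chosen for reproducibility, unlike the clock-based seeds mentioned in the warning), and the hardcoded critical value 7.815 for the upper 5% point of a chi-square distribution with 3 d.f. are all assumptions made for illustration.

```python
import math
import random


def g2_and_x2(successes, sizes):
    """Return (G2, X2) for H0: all groups share one success probability."""
    p_hat = sum(successes) / sum(sizes)  # pooled MLE of the common rate under H0
    g2 = x2 = 0.0
    for s, n in zip(successes, sizes):
        # Each group contributes a success cell and a failure cell.
        for observed, expected in ((s, n * p_hat), (n - s, n * (1 - p_hat))):
            if expected == 0:
                continue  # pooled rate is 0 or 1: this cell carries no information
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # convention: 0 * log(0) = 0
                g2 += 2 * observed * math.log(observed / expected)
    return g2, x2


def simulate(sizes, pi, reps=10_000, seed=557):
    """Draw `reps` tables of independent binomial counts; return their (G2, X2)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    stats = []
    for _ in range(reps):
        # A binomial(n, pi) count is the number of successes in n Bernoulli trials.
        successes = [sum(rng.random() < pi for _ in range(n)) for n in sizes]
        stats.append(g2_and_x2(successes, sizes))
    return stats


sizes = [40, 28, 36, 12]   # column totals from the Problem 8 table
normal = [36, 24, 28, 4]   # "Normal" lung capacity counts from the same table
g2_obs, x2_obs = g2_and_x2(normal, sizes)
sims = simulate(sizes, pi=92 / 116)

CRIT = 7.815  # approximate upper 5% point of chi-square with 3 d.f. (assumed value)
for name, obs, col in (("G2", g2_obs, 0), ("X2", x2_obs, 1)):
    values = [pair[col] for pair in sims]
    p_sim = sum(v >= obs for v in values) / len(values)   # simulated p-value
    level = sum(v > CRIT for v in values) / len(values)   # simulated type I error
    print(f"{name}: observed {obs:.2f}, simulated p-value {p_sim:.4f}, "
          f"simulated type I error {level:.4f}")
```

For Problems 9 through 11, the same two functions could be reused by passing the modified column totals as `sizes` and the corresponding pooled proportion as `pi`. The treatment of empty cells (skipping a cell whose expected count is zero and taking 0 log 0 = 0) is one common convention and may differ from what the course programs do.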