STAT 557 Assignment #2 Name ______________
FALL 2000
Reading Assignment: Lloyd, Categorical Data Analysis: You should have already read Chapter 1 and Sections 3.1-3.3. Read Chapter 2 and finish reading Chapter 3.
Also read Sections 7.1 and 7.2.
Written Assignment: On campus: Due Friday, September 15, in class.
Off campus: Due Friday, September 22, in class.
1. The following table of counts appears as Table 2.7 on page 29 of Agresti, Categorical Data Analysis. Assume it was obtained from a simple random sample of 1397 respondents from the population of adults (people more than 18 years old) in the United States in 1982. Each respondent was cross-classified with respect to opinions expressed on the issues of gun control and imposing the death penalty on criminals convicted of certain violent acts.
                     Gun Registration
Death Penalty        Favor     Oppose
Favor                 784       311
Oppose                236        66
(a) What is the distribution of the possible 2 × 2 tables of counts that could be observed in such a survey of 1397 respondents?
(b) Report maximum likelihood estimates of the expected counts under the null hypothesis that gun registration opinion is held independently of the death penalty opinion.
$\hat{m}_{11}$ = ______   $\hat{m}_{12}$ = ______   $\hat{m}_{21}$ = ______   $\hat{m}_{22}$ = ______
(c) Report values of the deviance G² and Pearson statistic X² for testing the independence hypothesis in part (b) against the general alternative. Report degrees of freedom and p-values. State your conclusion.
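For parts (b) and (c), the usual estimates of the expected counts under independence are (row total × column total)/n, and both statistics compare the observed counts with those estimates. The following is a minimal Python sketch of that calculation for a general r × c table, not a prescribed solution. The function names `chi2_sf` and `independence_test` are illustrative, and the p-value helper uses the closed-form finite sums for the chi-square tail with integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df > x) for a positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum_{k < df/2} h^k / k!
        term = math.exp(-h)
        total = term
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return total
    # Odd df: erfc(sqrt(h)) plus a finite sum of half-integer gamma terms.
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (k + 0.5)
    return total

def independence_test(table):
    """Return (expected counts, Pearson X^2, deviance G^2, df) for an r x c table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    pairs = [(o, e) for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row)]
    x2 = sum((o - e) ** 2 / e for o, e in pairs)
    g2 = 2 * sum(o * math.log(o / e) for o, e in pairs if o > 0)  # 0*log(0) = 0
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return expected, x2, g2, df

# Counts from Problem 1 (rows: death penalty, columns: gun registration).
expected, x2, g2, df = independence_test([[784, 311], [236, 66]])
print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("X2 =", round(x2, 3), " G2 =", round(g2, 3), " df =", df)
print("p-values:", chi2_sf(x2, df), chi2_sf(g2, df))
```

The same function applies unchanged to the larger tables in Problems 7 and 8, since only the margins and the cell counts enter the formulas.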
2. For a 2 × 2 contingency table

                      Column Factor
                      j=1       j=2
Row Factor    i=1     X_11      X_12
              i=2     X_21      X_22

obtained from a multinomial sample of size n, consider testing the following null hypothesis (call it model A)

$H_0: \pi_{11} = \theta^2, \quad \pi_{12} = \pi_{21} = \theta(1-\theta), \quad \pi_{22} = (1-\theta)^2$

for some unknown $0 < \theta < 1$, against the general alternative (call it model C)

$H_A: 0 < \pi_{ij} < 1$ and $\pi_{11} + \pi_{12} + \pi_{21} + \pi_{22} = 1$.
Model A imposes both independence between the row and column factors and identical marginal distributions for the row and column factors.
(a) Assuming the null hypothesis is true, give a formula for the log-likelihood function.
(b) Give a formula for the maximum likelihood estimator for $\theta$.
(c) Give a formula for the deviance statistic for testing the null hypothesis against the general alternative and report its degrees of freedom. This is a goodness of fit test for model A.
(d) Consider the model (call it model B) that only imposes independence between the row and column factors, i.e., $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all (i,j). Complete the following analysis of deviance table for the data in the 2 × 2 table in Problem 1.

Comparison              d.f.      deviance value    p-value
Model A vs Model B      ______    ______            ______
Model B vs Model C      ______    ______            ______
Model A vs Model C      ______    ______            ______
State the conclusions you would reach from this analysis of deviance table.
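A hand-derived estimator and deviance for model A can be checked numerically. The sketch below is only one way to do that: it writes the multinomial log-likelihood directly from the model A cell probabilities stated above, maximizes it over θ with a simple ternary search (valid here because the function is concave in θ), and computes the deviance against the saturated model. The names `loglik_model_a`, `maximize_concave`, and `deviance` are illustrative, and the search routine is a stand-in for any one-dimensional optimizer.

```python
import math

def loglik_model_a(theta, x11, x12, x21, x22):
    """Multinomial log-likelihood under model A, dropping the constant term."""
    return (x11 * math.log(theta ** 2)
            + (x12 + x21) * math.log(theta * (1 - theta))
            + x22 * math.log((1 - theta) ** 2))

def maximize_concave(f, lo=1e-9, hi=1 - 1e-9, iters=200):
    """Ternary search for the maximizer of a concave function on (lo, hi)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def deviance(observed, fitted):
    """G^2 = 2 * sum obs * log(obs / fitted), skipping empty cells."""
    return 2 * sum(o * math.log(o / m) for o, m in zip(observed, fitted) if o > 0)

# Counts from Problem 1 in the order (x11, x12, x21, x22).
counts = (784, 311, 236, 66)
n = sum(counts)
theta_hat = maximize_concave(lambda t: loglik_model_a(t, *counts))
fitted_a = [n * theta_hat ** 2,
            n * theta_hat * (1 - theta_hat),
            n * theta_hat * (1 - theta_hat),
            n * (1 - theta_hat) ** 2]
g2_a_vs_c = deviance(counts, fitted_a)
print("numerical MLE of theta:", round(theta_hat, 5))
print("deviance, model A vs model C:", round(g2_a_vs_c, 3))
```

Because the models are nested, the deviance for model A vs model B can be obtained as the difference between the A vs C and B vs C deviances.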
3. The distribution of corn borers was examined by counting the number of corn borers found at each of 120 different locations (Bliss, 1953). The following table gives the number of locations with 0, 1, 2, ... borers.
Number of        Number of        Expected Counts
Corn Borers      Locations        Poisson Model      Neg. Binomial Model
     0               24
     1               16
     2               16
     3               18
     4               15
     5                9
     6                6
     7                5
     8                3
     9                4
    10                3
    11                0
    12                1
(a) Create a table like the one shown above and record maximum likelihood estimates of the expected counts for the i.i.d. Poisson model in the third column of the table. Use the Pearson chi-square test to assess the fit of the Poisson model. Combine the categories if necessary to keep estimates of expected counts larger than 5. Report values for X², the degrees of freedom, and a p-value. State your conclusion.
(b) An alternative test that is often used to assess the fit of the Poisson model is Fisher's dispersion index. Let $X_1, \ldots, X_n$ denote the counts for the n locations, and let

$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$ and $S^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.

Since the mean is equal to the variance of a Poisson distribution, both $\bar{X}$ and $S^2$ are estimates of the variance. The Poisson model is declared inadequate when

$n S^2 / \bar{X} > \chi^2_{n-1,\alpha}$.

This test generally has more power against mixed Poisson alternatives, like the negative binomial model, than the test in part (a). Report values for $n S^2 / \bar{X}$, the degrees of freedom, and a p-value. State your conclusion.
(c) Consider a negative binomial model for the corn borer data. Write the maximum likelihood estimates of the expected counts in the fourth column of the table. Use the Pearson chi-square statistic to assess the fit of the negative binomial model. Combine categories if necessary to keep estimates of expected counts larger than 5. Report values for X², its degrees of freedom, and a p-value. State your conclusion.
(d) Construct a 95% confidence interval for the mean number of corn borers per location.
Indicate how your confidence interval was constructed. Report lower limit = __________ upper limit = __________
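For parts (a) and (b), the quantities involved can be computed directly from the frequency table. The sketch below is one possible arrangement in Python, with illustrative function names. It uses the sample mean as the Poisson maximum likelihood estimate, and it treats the last cell as "12 or more" so that the expected counts sum to n, which is an assumption made here rather than something the assignment specifies. Pooling of small cells and the tail probability are left to the reader.

```python
import math

# locations[k] = number of locations with k corn borers (table in Problem 3).
locations = [24, 16, 16, 18, 15, 9, 6, 5, 3, 4, 3, 0, 1]

def sample_moments(freq):
    """Return (n, mean, S^2) with S^2 using divisor n, as defined in part (b)."""
    n = sum(freq)
    mean = sum(k * f for k, f in enumerate(freq)) / n
    s2 = sum(f * (k - mean) ** 2 for k, f in enumerate(freq)) / n
    return n, mean, s2

def poisson_expected(freq):
    """Expected counts n * P(K = k) at the sample mean; last cell is 'k or more'."""
    n, mean, _ = sample_moments(freq)
    probs = [math.exp(-mean) * mean ** k / math.factorial(k)
             for k in range(len(freq) - 1)]
    probs.append(1.0 - sum(probs))  # assumed open-ended final category
    return [n * p for p in probs]

def dispersion_index(freq):
    """Return (n * S^2 / mean, n - 1): the statistic and its degrees of freedom."""
    n, mean, s2 = sample_moments(freq)
    return n * s2 / mean, n - 1

n, mean, s2 = sample_moments(locations)
stat, df = dispersion_index(locations)
print("n =", n, " mean =", round(mean, 4), " S2 =", round(s2, 4))
print("Poisson expected counts:", [round(m, 2) for m in poisson_expected(locations)])
print("dispersion index =", round(stat, 3), "on", df, "d.f.")
```

The statistic is then referred to a chi-square distribution with n − 1 degrees of freedom using any chi-square table or routine.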
4. Vianna, et al. (1971, Lancet, 1, 431-432) considered a series of 109 patients with Hodgkin's disease. A sample of 109 "control" patients was selected from hospital records of patients with no history of Hodgkin's disease or any other malignant disease or chronic illness. The control patients were selected from a set of hospital records that generally matched the composition of the group of patients with Hodgkin's disease with respect to age, sex, race, county of residence, and date of hospital admission. Eight of the patients with Hodgkin's disease and two control patients were deleted from the analysis because their tonsillectomy history could not be obtained. The remaining 208 patients were cross-classified into the following 2 × 2 contingency table.
                              Hodgkin's Disease    Controls
Had Tonsillectomy                    67               43
Did not have Tonsillectomy           34               64
Totals                              101              107
The Pearson Chi-square test for independence is 14.26 on 1 d.f. with p-value < .001.
Vianna, et al. used the odds ratio $\hat{\alpha} = 2.93$ as an approximate measure of relative risk and concluded that tonsillectomy increases the risk of contracting Hodgkin's disease by a factor of nearly 3. They concluded that tonsillectomy removes a protective barrier against Hodgkin's disease.
(a) Compute a 95% confidence interval for the odds ratio. Report lower limit = _________ upper limit = _________
(b) A year later, Johnson and Johnson (1972, New England Journal of Medicine, 287, 1122-1125) reported results from a different study of 175 patients treated for Hodgkin's disease at the Radiation Branch of the National Cancer Institute. Information was available on 472 siblings for 172 of the patients. The authors chose the closest sibling of the same sex within five years of age of each patient. This matching reduced the data to 85 patient-sibling pairs and the following table was reported.
                              Hodgkin's Disease    Controls
Had Tonsillectomy                    41               33
Did not have Tonsillectomy           44               52
Totals                               85               85
The Pearson chi-square test for independence is computed as 1.53 with p-value = .22, and the estimated odds ratio is $\hat{\alpha} = 1.47$ with 95% confidence bounds (0.80, 2.70).
On the basis of this, Johnson and Johnson claim to have refuted the contention of Vianna, et al., that tonsils constitute a lymphoid barrier to Hodgkin's disease.
Which authors, if any, do you agree with? State your reasons. If you think any mistakes were made in either of the analyses, describe the mistakes and explain how the data should be analyzed, including formulas for test statistics, relative risk, and a confidence interval for relative risk.
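The odds ratio and confidence bounds reported in part (b) appear consistent with the usual large-sample interval computed on the log odds ratio scale (often attributed to Woolf). The Python sketch below shows that calculation under the assumption that the two columns of the table are independent samples, which is how the reported analysis seems to treat them. Whether that assumption is appropriate for these data is part of what the question asks. The function name `odds_ratio_ci` is illustrative.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/(bc) for [[a, b], [c, d]] with a log-scale large-sample interval."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # std. error of log odds ratio
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Table from part (b): rows = tonsillectomy yes/no, columns = Hodgkin's / controls.
or_hat, lower, upper = odds_ratio_ci(41, 33, 44, 52)
print(f"odds ratio = {or_hat:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

Calling the same function with the counts from the first table gives the corresponding interval for part (a).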
5. The data in Table 3.11 on page 73 in Agresti's book, Categorical Data Analysis, are
                           Surgery    Radiation Therapy
Cancer Controlled             21             15
Cancer Not Controlled          2              3
Assume that the 41 larynx cancer patients were randomly assigned to the two treatments.
Use Fisher's exact test to test the null hypothesis that the two treatments are equally effective in controlling the cancer against the alternative that the treatments are not equally effective.
Report a p-value and state your conclusion.
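Fisher's exact test conditions on both margins, so the count in one cell follows a hypergeometric distribution under the null hypothesis. The Python sketch below enumerates that distribution. It uses one common two-sided convention, summing the probabilities of all tables that are no more likely than the observed one. Other two-sided conventions exist and can give different p-values, so this is an assumption rather than the only valid choice. The function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for [[a, b], [c, d]] with all margins fixed."""
    r1, r2 = a + b, c + d          # row totals
    c1 = a + c                     # first column total
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability that the (1,1) cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Table from Problem 5: rows = controlled / not controlled,
# columns = surgery / radiation therapy.
p_value = fisher_exact_two_sided(21, 15, 2, 3)
print("two-sided exact p-value:", round(p_value, 4))
```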
6. In a study of the effects of treating multiple sclerosis patients with human fibroblast interferon (IFN-B) (reported by Jacobs, O'Malley, Freeman, and Ekes (1981), Science, 214, pp. 1026-1028), 20 multiple sclerosis patients were randomly divided into a group of 10 IFN-B recipients and a group of 10 controls. At the beginning of the study the severity of each patient's symptoms was evaluated and at the end of the study each patient was reevaluated and classified as either improved, unchanged, or worsened. The data are given in the following table.
                           Result of Treatment
                         Improved    Unchanged    Worsened    TOTALS
Treated with IFN-B           5           4            1          10
Controls                     1           4            5          10
Perform an "exact" randomization test of the null hypothesis that the IFN-B treatment produces the same results as the treatment given to the controls against the alternative hypothesis that the IFN-B treatment gives better results. Using whatever criterion you think is best to order the tables, report the possible tables that are less consistent with the null hypothesis than the observed table. Compute the p-value for your test and state your conclusion.
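A randomization test of this kind enumerates every table with the same margins, assigns each its probability under random assignment, and sums the probabilities of the tables at least as extreme as the observed one. The Python sketch below illustrates the mechanics for a 2 × 3 table. The margins and the observed row in the example are made up for illustration and are not the data above, and the `score` function (improved minus worsened in the treated group) is only one hypothetical ordering criterion.

```python
from math import comb

def enumerate_tables(row1_total, col_totals):
    """Yield (first_row, probability) for every 2 x 3 table with the given margins."""
    n = sum(col_totals)
    c1, c2, c3 = col_totals
    for x1 in range(min(c1, row1_total) + 1):
        for x2 in range(min(c2, row1_total - x1) + 1):
            x3 = row1_total - x1 - x2
            if x3 <= c3:
                # Multivariate hypergeometric probability of this first row.
                prob = (comb(c1, x1) * comb(c2, x2) * comb(c3, x3)
                        / comb(n, row1_total))
                yield (x1, x2, x3), prob

def score(first_row):
    """Hypothetical ordering criterion: improved minus worsened in the treated group."""
    return first_row[0] - first_row[2]

def randomization_p_value(observed_row, col_totals):
    """One-sided p-value: total probability of tables scoring at least the observed."""
    obs = score(observed_row)
    return sum(p for row, p in enumerate_tables(sum(observed_row), col_totals)
               if score(row) >= obs)

# Made-up example: 3 treated subjects, column totals (2, 2, 2).
tables = list(enumerate_tables(3, (2, 2, 2)))
p_value = randomization_p_value((2, 1, 0), (2, 2, 2))
for row, prob in tables:
    print(row, round(prob, 3))
print("one-sided p-value:", p_value)
```

Swapping in a different `score` function changes which tables count as more extreme, which is the choice the problem leaves open.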
7. Roth, et al. (1975, N. E. J. Med., 295, 386-389) report results from a study of 173 skin cancer patients. One objective was to determine if allergic reaction to a contact allergen DCNB was related to the stage of the skin cancer. Each skin cancer patient was classified into one of three stages. Each patient was exposed to DCNB and the reaction was recorded as positive or negative. The results are shown below.
               Reaction to DCNB
               Positive    Negative    TOTAL
Stage I           39          13         52
Stage II          39          19         58
Stage III         26          37         63
(a) Describe a scenario under which the counts in this table would have a multinomial distribution.
(b) Using the multinomial model from Part (a), test the null hypothesis that reaction to DCNB is independent of the stage of the skin cancer. Report values for the Pearson X² and log-likelihood ratio G² statistics, degrees of freedom, and p-values.
(c) State your conclusion from Part (b).
(d) Instead of assuming a multinomial distribution for the entire table of counts, condition on the total numbers of patients in the three stages to obtain three independent binomial distributions. Test the null hypothesis that the probability of a positive reaction to DCNB is the same for all three binomial distributions. Do your results differ from those in Part (b)? Explain.
8. The following data are fictitious results for 116 individuals who were cross-classified with respect to lung capacity and smoking habits.
                       Lung Capacity
Smoking Habit       Normal    Impaired    TOTALS
None                  36         4          40
Occasional            24         4          28
Regular               28         8          36
Heavy                  4         8          12
(a) Use the likelihood ratio G² statistic to test the null hypothesis that lung capacity is independent of smoking habit. Report values for G², degrees of freedom, and a p-value.
(b) Repeat Part (a) using the Pearson X² statistic.
(c) Perform an exact conditional test of the null hypothesis in Part (a). Report the p-value. (Use X² values to order the possible tables.)
(d) Use the SAS code stored in the file hw2p8d.sas or the S-PLUS code stored in the file hw2p8d.ssc to simulate 10,000 tables of counts. Each table contains random counts from four independent binomial distributions with common success rate $\pi = 92/116$ and sample sizes $n_1 = 40$, $n_2 = 28$, $n_3 = 36$, $n_4 = 12$, respectively. Calculate G² and X² for each simulated table of counts. Compare the results with a chi-square distribution with three degrees of freedom. Report the “true” p-values for the G² and X² tests from Parts (a) and (b) and compare these results with the result from Part (c). Using $\chi^2_{3,.05}$ as the critical value for each test, report the simulated values of the true type I error levels for the G² and X² tests.
WARNING: The seeds for the random number generators used in these programs are taken from the computer clock. Consequently, students who run this code at different times will generate different sets of tables and obtain slightly different results.
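For readers without access to SAS or S-PLUS, the simulation described in part (d) can be sketched in Python. This is not a translation of hw2p8d.sas or hw2p8d.ssc, whose contents are not shown here. It is a minimal sketch of the procedure described in the text, with illustrative names, a fixed seed chosen here for reproducibility (unlike the clock-based seeds mentioned above), and 7.815 used as the approximate upper 5% point of the chi-square distribution with 3 degrees of freedom.

```python
import math
import random

SIZES = (40, 28, 36, 12)   # binomial sample sizes n1..n4 given in part (d)
PI0 = 92 / 116             # common success probability given in part (d)
CRITICAL = 7.815           # approximate upper 5% point of chi-square, 3 d.f.

def g2_x2(successes, sizes):
    """Deviance G^2 and Pearson X^2 for equality of several binomial proportions."""
    n, s = sum(sizes), sum(successes)
    g2 = x2 = 0.0
    for y, m in zip(successes, sizes):
        for obs, exp in ((y, m * s / n), (m - y, m * (n - s) / n)):
            if exp == 0:
                continue                     # empty column: cell contributes nothing
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:                      # 0 * log(0) is taken as 0
                g2 += 2 * obs * math.log(obs / exp)
    return g2, x2

def simulate(reps=10_000, seed=557):
    """Return a list of (G^2, X^2) pairs for tables simulated under the null."""
    rng = random.Random(seed)                # fixed seed is an assumption made here
    tables = ([sum(rng.random() < PI0 for _ in range(m)) for m in SIZES]
              for _ in range(reps))
    return [g2_x2(y, SIZES) for y in tables]

sims = simulate()
obs_g2, obs_x2 = g2_x2((36, 24, 28, 4), SIZES)   # "Normal" counts, Problem 8 table
p_g2 = sum(g >= obs_g2 for g, _ in sims) / len(sims)
p_x2 = sum(x >= obs_x2 for _, x in sims) / len(sims)
level_g2 = sum(g > CRITICAL for g, _ in sims) / len(sims)
level_x2 = sum(x > CRITICAL for _, x in sims) / len(sims)
print("observed G2, X2:", round(obs_g2, 3), round(obs_x2, 3))
print("simulated p-values (G2, X2):", p_g2, p_x2)
print("simulated type I error levels (G2, X2):", level_g2, level_x2)
```

For Problems 9 through 11, only `SIZES`, `PI0`, and the observed counts would need to change to match the rescaled table.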
9. How small can the expected counts be under the null hypothesis before the large sample chi-square approximation provides unreliable p-values and does not maintain the proper type I error level for the G² and X² tests? Divide each count in the table for problem 8 by 2. Then, repeat the simulation.
10. Divide each count in the table for problem 8 by 4. Then, repeat the simulation.
11. Subtract one from each count in the table for problem 10. Then, repeat the simulation.
How do these results differ from those in problems 8, 9, and 10?