STAT 557
Assignment #3
Fall 2002
Reading Assignment:
Lloyd: The delta method is reviewed in Section 1.6. The analysis of
several 2x2 tables is discussed in Sections 3.4-3.6.
Written Assignment:
Due Monday, October 7, in class.
1.
Freeman (1987) presents the following 2x2 table, reported by Ostfeld (1980), on the relationship
between coronary heart disease (CHD) and perceived stress in the workplace. These data came
from the Western Electric Longitudinal Study in which workers were asked a set of questions
about their work environment and then followed for ten years. The ten year follow-up permitted
the identification of major CHD outcomes.
                              Major CHD event?
Do you work under tension?     Yes      No
Yes                             97     200
No                             307    1409
(a) Since this is a prospective study, you can directly estimate the relative risk of CHD. Report
the values of your estimate, its standard error and a 95% confidence interval for the relative
risk of CHD. State your conclusion.
(b) Report the value of the estimated odds ratio, its standard error, and a 95% confidence
interval for the odds ratio. How does this compare to the results for the direct estimate of
relative risk in part (a)?
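The calculations in Problem 1 can be sketched in Python as follows. This is a minimal illustration, not course-supplied code: it assumes the rows of the table are the two tension groups and the columns are CHD event yes/no, and it uses the usual delta-method standard errors on the log scale. The function name `rr_and_or` is hypothetical.

```python
import math
from statistics import NormalDist

def rr_and_or(a, b, c, d, level=0.95):
    """a, b: events / non-events in group 1; c, d: events / non-events in group 2."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n1, n2 = a + b, c + d
    rr = (a / n1) / (c / n2)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    def summarize(est, se_log):
        return {
            "estimate": est,
            "se_log": se_log,
            # Delta method: se(exp(x)) is approximately exp(x) * se(x).
            "se": est * se_log,
            # Interval built on the log scale, then exponentiated.
            "ci": (est * math.exp(-z * se_log), est * math.exp(z * se_log)),
        }

    return summarize(rr, se_log_rr), summarize(odds_ratio, se_log_or)

# Counts from the table above (assumed layout: tension Yes/No by CHD Yes/No).
rr, odds = rr_and_or(97, 200, 307, 1409)
print(rr)
print(odds)
```

Building the interval on the log scale is one common choice because the sampling distribution of the log ratio is usually closer to normal than that of the ratio itself.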
2.
In 1974, the Danish National Institute for Social Science Research interviewed a random sample
of Danes between 20 and 69 years old in order to investigate the general welfare in Denmark.
The following two tables (Andersen, 1990) cross-classify workers with respect to the physical
and psychological demands of their employment. There are separate tables for males and
females.
Table 1: Females
                                  Work is psychologically demanding
Work is physically demanding      Usually    Sometimes    Seldom
Usually                               100          109       202
Sometimes                              33           89       179
Seldom                                100          179       542
Table 2: Males
                                  Work is psychologically demanding
Work is physically demanding      Usually    Sometimes    Seldom
Usually                               113          163       370
Sometimes                              45          106       280
Seldom                                229          343       568
(a)
Conditional on the gender of the respondents, is there any association between attitudes
toward physical and psychological demands of employment? For each table, compute
values of the Pearson $X^2$ and the $G^2$ statistics. Report degrees of freedom and p-values.
State your conclusions.
(b)
Use the Goodman-Kruskal gamma statistic to quantify the level of association between
attitudes about the physical and psychological demands of work for females and males.
Report a standard error for each estimate and use the large sample normal approximation
to the distribution of the gamma statistic to construct approximate 95% confidence
intervals.
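The statistics requested in Problem 2 can be sketched in Python as follows. This is an illustrative sketch, not course-supplied code: the function names are hypothetical, the chi-square tail probability uses a closed form that is valid only for even degrees of freedom, and the gamma standard error is the usual delta-method (multinomial) approximation. The counts are entered in the order they are listed in the tables; all three statistics are unchanged if rows and columns are interchanged.

```python
import math

def pearson_and_g2(table):
    """Pearson X^2 and likelihood-ratio G^2 for independence in an I x J table."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    x2 = g2 = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / n
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:  # a zero cell contributes 0 to G^2
                g2 += 2 * obs * math.log(obs / exp)
    df = (len(rows) - 1) * (len(cols) - 1)
    return x2, g2, df

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires even df")
    term = total = 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k
        total += term
    return math.exp(-x / 2) * total

def gamma_with_ase(table):
    """Goodman-Kruskal gamma and its delta-method asymptotic standard error."""
    I, J = len(table), len(table[0])
    P = Q = 0.0
    conc = [[0.0] * J for _ in range(I)]
    disc = [[0.0] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for a in range(I):
                for b in range(J):
                    if (a - i) * (b - j) > 0:
                        conc[i][j] += table[a][b]   # cells concordant with (i, j)
                    elif (a - i) * (b - j) < 0:
                        disc[i][j] += table[a][b]   # cells discordant with (i, j)
            P += table[i][j] * conc[i][j]  # twice the number of concordant pairs
            Q += table[i][j] * disc[i][j]  # twice the number of discordant pairs
    gamma = (P - Q) / (P + Q)
    s = sum(table[i][j] * (Q * conc[i][j] - P * disc[i][j]) ** 2
            for i in range(I) for j in range(J))
    return gamma, 4 * math.sqrt(s) / (P + Q) ** 2

females = [[100, 109, 202], [33, 89, 179], [100, 179, 542]]
males = [[113, 163, 370], [45, 106, 280], [229, 343, 568]]
for name, tab in (("females", females), ("males", males)):
    x2, g2, df = pearson_and_g2(tab)
    g, ase = gamma_with_ase(tab)
    print(name, "X2 =", x2, "G2 =", g2, "df =", df,
          "p(X2) =", chi2_sf_even_df(x2, df), "p(G2) =", chi2_sf_even_df(g2, df))
    print(name, "gamma =", g, "ASE =", ase,
          "95% CI =", (g - 1.96 * ase, g + 1.96 * ase))
```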
3. The sample size may need to be very large before the sampling distribution of the Goodman-Kruskal gamma statistic ($\hat{\gamma}$) is reasonably well approximated by its limiting normal distribution,
especially if $\gamma$ is large. A transformation that approaches its asymptotic normal distribution
more rapidly is $\hat{\varsigma} = \frac{1}{2}\log\left[(1+\hat{\gamma})/(1-\hat{\gamma})\right]$. (Note that this is a transformation R.A. Fisher
proposed for correlation coefficients.)
(a)
Use the delta method to obtain a formula for the large sample variance of $\hat{\varsigma}$ as a function
of the large sample variance for $\hat{\gamma}$.
(b)
$\hat{\varsigma}$ also has a limiting normal distribution. Use this fact and the result from Part (a) to
construct approximate 95% confidence intervals for $\gamma$ for the two tables in Problem 2.
(c)
Test the null hypothesis that the level of association between attitudes toward physical and
psychological demands of employment, as measured by gamma, is the same for females
and males. Give a formula and a value for your test statistic and a p-value. State your
conclusions. (In answering this question you may assume that the counts for females and
males have independent multinomial distributions.)
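A minimal Python sketch of the two calculations in this problem follows. It assumes an estimate of gamma and its standard error are already available for each table; the function names are hypothetical, the numeric inputs are placeholders rather than results for the assignment data, and comparing the gammas on the original scale is only one simple choice of test statistic.

```python
import math
from statistics import NormalDist

def transformed_ci(gamma, se_gamma, level=0.95):
    """CI for gamma via the transform 0.5 * log((1 + g) / (1 - g)) = atanh(g)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    zeta = math.atanh(gamma)
    # Delta method: d atanh(g) / dg = 1 / (1 - g^2).
    se_zeta = se_gamma / (1 - gamma ** 2)
    # Back-transform the interval end points with tanh, the inverse of atanh.
    return math.tanh(zeta - z * se_zeta), math.tanh(zeta + z * se_zeta)

def compare_gammas(g1, se1, g2, se2):
    """Wald z statistic and two-sided p-value for H0: gamma1 = gamma2 (independent samples)."""
    z = (g1 - g2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Placeholder inputs for illustration only.
print(transformed_ci(0.6, 0.1))
print(compare_gammas(0.6, 0.1, 0.4, 0.1))
```

Because the interval is built on the transformed scale and then mapped back, its end points always stay inside (-1, 1), which a symmetric interval around the estimate does not guarantee.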
4.
Mullins and Sites (1984) collected information on educational achievement of mothers and
fathers for a sample of eminent black Americans (persons in the publication Who’s Who Among
Black Americans). The following table shows some of the results:
                                      Mother
Father                     College Graduate    Not a College Graduate
College Graduate                         87                        35
Not a College Graduate                   51                       217
Let $\pi_{ij}$ denote the probability of selecting a person from the population of eminent black
Americans with a mother in the i-th row category and a father in the j-th column category.
(a)
How would you interpret the quantity $\alpha = \dfrac{\pi_{1+}\,\pi_{+2}}{\pi_{2+}\,\pi_{+1}}$?
(b)
Let $Y_{ij}$ denote the count in the i-th row and j-th column of the table. Use the multinomial
distribution for $(Y_{11}, Y_{12}, Y_{21}, Y_{22})$ and the delta method to find the large sample variance
of $\ln(\hat{\alpha})$, where $\ln(\hat{\alpha}) = \ln(Y_{1+}) - \ln(Y_{2+}) - \ln(Y_{+1}) + \ln(Y_{+2})$.
(c)
Use the large sample normal distribution for $\ln(\hat{\alpha})$ and the data in the table to construct a
95% confidence interval for $\ln(\alpha) = \ln(\pi_{1+}) - \ln(\pi_{2+}) - \ln(\pi_{+1}) + \ln(\pi_{+2})$. What can you
conclude from this confidence interval?
(d)
Apply the exponential function to the end points of the confidence interval in Part (c) to
obtain an approximate 95% confidence interval for $\alpha$.
(e)
Use the large sample normal approximation to the distribution of $\ln(\hat{\alpha})$ and the delta
method to obtain the large sample normal distribution for $\hat{\alpha}$.
(f)
Use the result from Part (e) to construct an approximate 95% confidence interval for $\alpha$.
(g)
The method for constructing a 95% confidence interval for $\alpha$ in Part (d) generally gives
coverage probabilities closer to 95% than the method in Part (f). Do you believe that this
is true? Explain.
(h)
Commercial statistical software packages generally use the large sample normal
distribution of the counts, or parameter estimates, and the delta method to produce
standard errors and confidence intervals. Alternatively a bootstrap procedure could be
used to construct confidence intervals. The bootstrap is another method that gives
consistent results for large samples. SAS code for computing bootstrapped confidence
intervals (using 5000 bootstrap samples) is posted on the course web page for assignments
as hw3boot.sas, and corresponding S-PLUS code is posted as hw3boot.ssc. Choose one
of these programs and run it three times to produce 3 bootstrap confidence intervals.
These data are already in the code. This will give you some idea about variation in
bootstrap results. Report your confidence intervals and compare them to those from Parts
(d) and (f). Which method is better? Explain.
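The interval constructions in this problem can be sketched in Python as follows. This is not the posted hw3boot.sas or hw3boot.ssc program: it is an independent, minimal illustration with hypothetical function names, using a percentile bootstrap with a fixed seed and 2000 resamples as an assumed configuration. Counts are entered with rows and columns in the order they are listed in the table.

```python
import math
import random
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)

def log_alpha(y):
    """ln(alpha-hat) from a 2x2 table y = [[y11, y12], [y21, y22]]."""
    (y11, y12), (y21, y22) = y
    return (math.log(y11 + y12) - math.log(y21 + y22)
            - math.log(y11 + y21) + math.log(y12 + y22))

def se_log_alpha(y):
    """Delta-method standard error of ln(alpha-hat) under multinomial sampling."""
    (y11, y12), (y21, y22) = y
    n = y11 + y12 + y21 + y22
    p = [v / n for v in (y11, y12, y21, y22)]
    r1, r2 = p[0] + p[1], p[2] + p[3]
    c1, c2 = p[0] + p[2], p[1] + p[3]
    # Partial derivatives of ln(alpha) with respect to pi11, pi12, pi21, pi22.
    g = [1 / r1 - 1 / c1, 1 / r1 + 1 / c2, -1 / r2 - 1 / c1, -1 / r2 + 1 / c2]
    mean = sum(pi * gi for pi, gi in zip(p, g))
    var = (sum(pi * gi ** 2 for pi, gi in zip(p, g)) - mean ** 2) / n
    return math.sqrt(var)

def bootstrap_ci(y, reps=2000, seed=1):
    """Percentile bootstrap CI for alpha, resampling the multinomial counts.

    Assumes no resampled margin is zero, which is safe for margins this large.
    """
    rng = random.Random(seed)
    cells = [y[0][0], y[0][1], y[1][0], y[1][1]]
    n = sum(cells)
    stats = []
    for _ in range(reps):
        draw = rng.choices(range(4), weights=cells, k=n)
        b = [draw.count(k) for k in range(4)]
        stats.append(math.exp(log_alpha([[b[0], b[1]], [b[2], b[3]]])))
    stats.sort()
    return stats[int(0.025 * reps)], stats[int(0.975 * reps) - 1]

table = [[87, 35], [51, 217]]
la, se = log_alpha(table), se_log_alpha(table)
ci_log = (la - Z * se, la + Z * se)                 # interval for ln(alpha)
ci_exp = tuple(math.exp(v) for v in ci_log)         # exponentiated end points
a = math.exp(la)
ci_delta = (a - Z * a * se, a + Z * a * se)         # se(alpha-hat) ~ alpha-hat * se(ln alpha-hat)
print(ci_log, ci_exp, ci_delta, bootstrap_ci(table))
```

Changing `seed` mimics running the bootstrap program several times and shows how much the percentile end points vary from run to run.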
5.
In a study of disparities between mother and child perceptions of ability, sixth grade
children were asked to rate their own academic ability. The mother of each child also was
asked to rate the child’s academic ability. Separate tables are reported for white and black
children. Each count in a table corresponds to one mother-child pair.
Table 1. White Children
                                  Mother's Rating
Child's Rating      Below Average    Average    Above Average
Below Average                   9         26               10
Average                        10          6               17
Above Average                   5         13               10

Table 2. Black Children
                                  Mother's Rating
Child's Rating      Below Average    Average    Above Average
Below Average                  10         31               22
Average                         5         10               18
Above Average                  10          4                9
(a)
To quantify the level of agreement between mother and child perception of academic
ability, estimate Cohen’s Kappa and compute a 95% confidence interval for Cohen’s
Kappa for each table. Is there more than random agreement in either table?
(b)
Compute a 95% confidence interval for the difference in the Kappa measures of
agreement for the two tables. (Assume the statistics for the two tables are independent.)
State your conclusion. Is the level of agreement between ratings given by mother and
child the same for white and black families?
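A minimal Python sketch of the kappa calculations follows. It is an illustration with a hypothetical function name, not course-supplied code. The standard error comes from applying the delta method directly to the multinomial cell proportions, and the counts are entered in the order they are listed in the tables (kappa is unchanged if rows and columns are interchanged).

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)

def kappa_with_se(table):
    """Cohen's kappa and a delta-method standard error for a square table of counts."""
    k = len(table)
    n = sum(map(sum, table))
    p = [[v / n for v in r] for r in table]
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    po = sum(p[i][i] for i in range(k))          # observed agreement
    pe = sum(row[i] * col[i] for i in range(k))  # agreement expected by chance
    kappa = (po - pe) / (1 - pe)
    mean = second = 0.0
    for i in range(k):
        for j in range(k):
            # Partial derivative of kappa with respect to cell probability p[i][j].
            g = (i == j) / (1 - pe) - (1 - po) * (col[i] + row[j]) / (1 - pe) ** 2
            mean += p[i][j] * g
            second += p[i][j] * g * g
    se = math.sqrt((second - mean ** 2) / n)
    return kappa, se

white = [[9, 26, 10], [10, 6, 17], [5, 13, 10]]
black = [[10, 31, 22], [5, 10, 18], [10, 4, 9]]
kw, sew = kappa_with_se(white)
kb, seb = kappa_with_se(black)
print("white:", kw, (kw - Z * sew, kw + Z * sew))
print("black:", kb, (kb - Z * seb, kb + Z * seb))

# Difference of two independent kappas: variances add.
diff, se_diff = kw - kb, math.sqrt(sew ** 2 + seb ** 2)
print("difference:", diff, (diff - Z * se_diff, diff + Z * se_diff))
```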
6. The data for this problem were collected by Tuyns et al. (1977, Bull. Cancer, 64, pp. 45-60) in
a study of oesophageal cancer at Ille-et-Vilaine, France. Cases in this study are 200 males
diagnosed with oesophageal cancer in one of the regional hospitals between January 1972 and
April 1974. Controls were obtained from a sample of adult males drawn from electoral lists, of
whom 775 provided sufficient data for analysis. (This is an example of a traditional case-control study.) Both cases and controls completed a detailed interview that provided
information on dietary habits. This question examines the effects of alcohol consumption on the
risk of developing oesophageal cancer. The following table presents the results of the study
stratified by ten-year age intervals. The data are posted as tuyns.dat. In this problem you are
asked to examine the data with simple methods such as odds ratios and the Mantel-Haenszel
estimation of a common odds ratio. If the age and alcohol consumption for each subject were
not categorized, a logistic regression analysis might provide a more informative summary of the
data.
                              Daily Alcohol Consumption
Age (in years)                At Least 80g    Less Than 80g
25-34     Cases                          1                0
          Controls                       9              106
35-44     Cases                          4                5
          Controls                      26              164
45-54     Cases                         25               21
          Controls                      29              138
55-64     Cases                         42               34
          Controls                      27              139
65-74     Cases                         19               36
          Controls                      18               88
75+       Cases                          5                8
          Controls                       0               31
(a)
For each age group, estimate the ratio of the odds for cancer for the high alcohol
consumption (at least 80 g/day) group versus the low alcohol consumption (less than 80
g/day) group. Also report a 95% confidence interval for each odds ratio. What can you
conclude from these results? Describe your method for dealing with zeros.
(b)
Compute the value of the Mantel-Haenszel estimator of the common odds ratio and also
obtain an approximate 95% confidence interval.
(c)
Of course the Mantel-Haenszel estimator in Part (b) is appropriate only when the odds
ratios are the same for all 6 age groups. Compute the value of the Breslow-Day and T4
tests for homogeneity of odds ratios. Report the value of each test statistic along with
degrees of freedom and a p-value.
(d)
Which of the tests in Part (c) is more appropriate for these data? Explain.
(e)
State your conclusion from Part (c). Do the odds ratios appear to be homogeneous? If
not, which are the same and which are different?
(f)
Compute the value of the Cochran-Mantel-Haenszel test statistic for the null hypothesis
that alcohol consumption is independent of case/control status within each age group.
Report the value of the test statistic, its degrees of freedom, and a p-value. State your
conclusions.
(g)
Consider the data in the 2x2 table for the 65-74 year olds. Can the relative risk of
oesophageal cancer for heavy and light alcohol consumption be directly estimated from
these data? Explain.
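The stratum-specific odds ratios, the Mantel-Haenszel estimator, and the Cochran-Mantel-Haenszel statistic can be sketched in Python as follows. This is an illustration with hypothetical function names, not course-supplied code. Adding 0.5 to every cell of a table that contains a zero is only one of several ways to handle zeros, the Mantel-Haenszel interval uses the Robins-Breslow-Greenland variance for the log estimator, and the CMH statistic is computed without a continuity correction. The homogeneity tests in Part (c) are not covered here.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)

# (exposed cases, unexposed cases, exposed controls, unexposed controls) by age group,
# where "exposed" means at least 80 g/day.
strata = {
    "25-34": (1, 0, 9, 106),
    "35-44": (4, 5, 26, 164),
    "45-54": (25, 21, 29, 138),
    "55-64": (42, 34, 27, 139),
    "65-74": (19, 36, 18, 88),
    "75+": (5, 8, 0, 31),
}

def stratum_or(a, b, c, d):
    """Odds ratio with a log-scale CI; adds 0.5 to every cell if any cell is zero."""
    if 0 in (a, b, c, d):
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    est = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return est, (est * math.exp(-Z * se), est * math.exp(Z * se))

def mantel_haenszel(tables):
    """Mantel-Haenszel common odds ratio with a Robins-Breslow-Greenland CI."""
    R = S = PR = PSQR = QS = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        R, S = R + r, S + s
        PR, PSQR, QS = PR + p * r, PSQR + p * s + q * r, QS + q * s
    est = R / S
    se = math.sqrt(PR / (2 * R * R) + PSQR / (2 * R * S) + QS / (2 * S * S))
    return est, (est * math.exp(-Z * se), est * math.exp(Z * se))

def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel chi-square (1 df, no continuity correction) and p-value."""
    num = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        row1, col1 = a + b, a + c
        num += a - row1 * col1 / n
        var += row1 * (n - row1) * col1 * (n - col1) / (n * n * (n - 1))
    stat = num * num / var
    return stat, math.erfc(math.sqrt(stat / 2))  # chi-square(1) upper tail

for age, cells in strata.items():
    print(age, stratum_or(*cells))
print("MH:", mantel_haenszel(strata.values()))
print("CMH:", cmh_statistic(strata.values()))
```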
7. Consider the data in Problem 3.24 on Page 173 in Lloyd’s book.
(a)
Use the odds ratio to quantify the effect of loss of a sibling on the risk of being a
“problem” child within each of the three birth order categories. Construct an
approximate 95% confidence interval for each odds ratio.
(b)
Use the Breslow-Day test to test the null hypothesis of homogeneous odds ratios in Part
(a). State your conclusion. Is there a trend in the logarithms of the odds ratios? Why is
it appropriate to use the Breslow-Day test in this case?
(c)
Obtain a 2x2 table by collapsing across the birth order categories. Analyze this table. Is
this an example of Simpson’s paradox? Explain.
8. Return to the analysis of smooth cavities in 12 year old children performed in the lecture.
(a)
Use the maximum likelihood estimates for the parameters in the negative binomial model
to compute a maximum likelihood estimate of the proportion of 12 year old children with
no cavities.
(b)
Derive a formula for a large sample approximation to the variance of your estimate in
Part (a).
(c)
Evaluate the variance formula in Part (b) and use it to obtain an approximate 95%
confidence interval for the proportion of 12 year old children with no cavities.
(d)
Describe a method for assessing the true coverage probability of the confidence interval
constructed in Part (c). Do not perform any calculations, just outline what you would do.
9. Complete problem 5 on assignment 2.