STAT 557 Assignment #3 Fall 2002

Reading Assignment:

Lloyd: The delta method is reviewed in Section 1.6. The analysis of several 2x2 tables is discussed in Sections 3.4-3.6.

Written Assignment: Due Monday, October 7, in class.

1. Freeman (1987) presents the following 2x2 table, reported by Ostfeld (1980), on the relationship between coronary heart disease (CHD) and perceived stress in the workplace. These data came from the Western Electric Longitudinal Study, in which workers were asked a set of questions about their work environment and then followed for ten years. The ten-year follow-up permitted the identification of major CHD outcomes.

                                   Major CHD event?
   Do you work under tension?      Yes        No
   Yes                              97       200
   No                              307      1409

(a) Since this is a prospective study, you can directly estimate the relative risk of CHD. Report the value of your estimate, its standard error, and a 95% confidence interval for the relative risk of CHD. State your conclusion.

(b) Report the value of the estimated odds ratio, its standard error, and a 95% confidence interval for the odds ratio. How does this compare to the results for the direct estimate of relative risk in Part (a)?

2. In 1974, the Danish National Institute for Social Science Research interviewed a random sample of Danes between 20 and 69 years old in order to investigate the general welfare in Denmark. The following two tables (Andersen, 1990) cross-classify workers with respect to the physical and psychological demands of their employment. There are separate tables for females and males.

Table 1: Females

                                         Work is physically demanding
   Work is psychologically demanding     Usually   Sometimes   Seldom
   Usually                                   100         109      202
   Sometimes                                  33          89      179
   Seldom                                    100         179      542

Table 2: Males

                                         Work is physically demanding
   Work is psychologically demanding     Usually   Sometimes   Seldom
   Usually                                   113         163      370
   Sometimes                                  45         106      280
   Seldom                                    229         343      568

(a) Conditional on the gender of the respondents, is there any association between attitudes toward physical and psychological demands of employment? For each table, compute values of the Pearson X^2 and the G^2 statistics. Report degrees of freedom and p-values. State your conclusions.

(b) Use the Goodman-Kruskal gamma statistic to quantify the level of association between attitudes about the physical and psychological demands of work for females and males. Report a standard error for each estimate and use the large sample normal approximation to the distribution of the gamma statistic to construct approximate 95% confidence intervals.

3. The sample size may need to be very large before the sampling distribution of the Goodman-Kruskal gamma statistic (γ̂) is reasonably well approximated by its limiting normal distribution, especially if γ is large. A transformation that approaches its asymptotic normal distribution more rapidly is

   ς̂ = (1/2) log[(1 + γ̂) / (1 − γ̂)].

(Note that this is a transformation R.A. Fisher proposed for correlation coefficients.)

(a) Use the delta method to obtain a formula for the large sample variance of ς̂ as a function of the large sample variance of γ̂.

(b) ς̂ also has a limiting normal distribution. Use this fact and the result from Part (a) to construct approximate 95% confidence intervals for γ for the two tables in Problem 2.

(c) Test the null hypothesis that the level of association between attitudes toward physical and psychological demands of employment, as measured by gamma, is the same for females and males. Give a formula and a value for your test statistic and a p-value. State your conclusions. (In answering this question you may assume that the counts for females and males have independent multinomial distributions.)

4.
Mullins and Sites (1984) collected information on the educational achievement of mothers and fathers for a sample of eminent black Americans (persons in the publication Who's Who Among Black Americans). The following table shows some of the results.

                                           Mother
   Father                    College Graduate   Not a College Graduate
   College Graduate                        87                       35
   Not a College Graduate                  51                      217

Let π_{ij} denote the probability of selecting a person from the population of eminent black Americans with a mother in the i-th row category and a father in the j-th column category.

(a) How would you interpret the quantity

   α = (π_{1+} π_{+2}) / (π_{2+} π_{+1}) ?

(b) Let Y_{ij} denote the count in the i-th row and j-th column of the table. Use the multinomial distribution for (Y_{11}, Y_{12}, Y_{21}, Y_{22}) and the delta method to find the large sample variance of ln(α̂), where

   ln(α̂) = ln(Y_{1+}) − ln(Y_{2+}) − ln(Y_{+1}) + ln(Y_{+2}).

(c) Use the large sample normal distribution for ln(α̂) and the data in the table to construct a 95% confidence interval for

   ln(α) = ln(π_{1+}) − ln(π_{2+}) − ln(π_{+1}) + ln(π_{+2}).

What can you conclude from this confidence interval?

(d) Apply the exponential function to the end points of the confidence interval in Part (c) to obtain an approximate 95% confidence interval for α.

(e) Use the large sample normal approximation to the distribution of ln(α̂) and the delta method to obtain the large sample normal distribution for α̂.

(f) Use the result from Part (e) to construct an approximate 95% confidence interval for α.

(g) The method for constructing a 95% confidence interval for α in Part (d) generally gives coverage probabilities closer to 95% than the method in Part (f). Do you believe that this is true? Explain.

(h) Commercial statistical software packages generally use the large sample normal distribution of the counts, or parameter estimates, and the delta method to produce standard errors and confidence intervals. Alternatively, a bootstrap procedure could be used to construct confidence intervals.
The bootstrap is another method that gives consistent results for large samples. SAS code for computing bootstrapped confidence intervals (using 5000 bootstrap samples) is posted on the course web page for assignments as hw3boot.sas, and corresponding S-PLUS code is posted as hw3boot.ssc. Choose one of these programs and run it three times to produce three bootstrap confidence intervals. The data are already in the code. This will give you some idea about the variation in bootstrap results. Report your confidence intervals and compare them to those from Parts (d) and (f). Which method is better? Explain.

5. In a study of disparities between mother and child perceptions of ability, sixth grade children were asked to rate their own academic ability. The mother of each child was also asked to rate the child's academic ability. Separate tables are reported for white and black children. Each count in a table corresponds to one mother-child pair.

Table 1. White Children

                                    Mother's Rating
   Child's Rating      Below Average   Average   Above Average
   Below Average                   9        26              10
   Average                        10         6              17
   Above Average                   5        13              10

Table 2. Black Children

                                    Mother's Rating
   Child's Rating      Below Average   Average   Above Average
   Below Average                  10        31              22
   Average                         5        10              18
   Above Average                  10         4               9

(a) To quantify the level of agreement between mother and child perception of academic ability, estimate Cohen's Kappa and compute a 95% confidence interval for Cohen's Kappa for each table. Is there more than random agreement in either table?

(b) Compute a 95% confidence interval for the difference in the Kappa measures of agreement for the two tables. (Assume the statistics for the two tables are independent.) State your conclusion. Is the level of agreement between ratings given by mother and child the same for white and black families?

6. The data for this problem were collected by Tuyns et al. (1977, Bull. Cancer, 64, pp. 45-60) in a study of oesophageal cancer at Ille-et-Vilaine, France.
Cases in this study are 200 males diagnosed with oesophageal cancer in one of the regional hospitals between January 1972 and April 1974. Controls were obtained from a sample of adult males drawn from electoral lists, of whom 775 provided sufficient data for analysis. (This is an example of a traditional case-control study.) Both cases and controls completed a detailed interview that provided information on dietary habits. This question examines the effects of alcohol consumption on the risk of developing oesophageal cancer. The following table presents the results of the study stratified by ten-year age intervals. The data are posted as tuyns.dat. In this problem you are asked to examine the data with simple methods such as odds ratios and the Mantel-Haenszel estimation of a common odds ratio. If the age and alcohol consumption for each subject were not categorized, a logistic regression analysis might provide a more informative summary of the data.

                                  Daily Alcohol Consumption
   Age (in years)                 At Least 80g   Less Than 80g
   25-34           Cases                     1               0
                   Controls                  9             106
   35-44           Cases                     4               5
                   Controls                 26             164
   45-54           Cases                    25              21
                   Controls                 29             138
   55-64           Cases                    42              34
                   Controls                 27             139
   65-74           Cases                    19              36
                   Controls                 18              88
   75+             Cases                     5               8
                   Controls                  0              31

(a) For each age group, estimate the ratio of the odds for cancer for the high alcohol consumption (at least 80 g/day) group versus the low alcohol consumption (less than 80 g/day) group. Also report a 95% confidence interval for each odds ratio. What can you conclude from these results? Describe your method for dealing with zeros.

(b) Compute the value of the Mantel-Haenszel estimator of the common odds ratio and also obtain an approximate 95% confidence interval.

(c) Of course, the Mantel-Haenszel estimator in Part (b) is appropriate only when the odds ratios are the same for all six age groups. Compute the values of the Breslow-Day and T4 tests for homogeneity of odds ratios. Report the value of each test statistic along with its degrees of freedom and a p-value.
(d) Which of the tests in Part (c) is more appropriate for these data? Explain.

(e) State your conclusion from Part (c). Do the odds ratios appear to be homogeneous? If not, which are the same and which are different?

(f) Compute the value of the Cochran-Mantel-Haenszel test statistic for the null hypothesis that alcohol consumption is independent of case/control status within each age group. Report the value of the test statistic, its degrees of freedom, and a p-value. State your conclusions.

(g) Consider the data in the 2x2 table for the 65-74 year olds. Can the relative risk of oesophageal cancer for heavy and light alcohol consumption be directly estimated from these data? Explain.

7. Consider the data in Problem 3.24 on page 173 of Lloyd's book.

(a) Use the odds ratio to quantify the effect of loss of a sibling on the risk of being a "problem" child within each of the three birth order categories. Construct an approximate 95% confidence interval for each odds ratio.

(b) Use the Breslow-Day test to test the null hypothesis of homogeneous odds ratios in Part (a). State your conclusion. Is there a trend in the logarithms of the odds ratios? Why is it appropriate to use the Breslow-Day test in this case?

(c) Obtain a 2x2 table by collapsing across the birth order categories. Analyze this table. Is this an example of Simpson's paradox? Explain.

8. Return to the analysis of smooth cavities in 12-year-old children performed in the lecture.

(a) Use the maximum likelihood estimates for the parameters in the negative binomial model to compute a maximum likelihood estimate of the proportion of 12-year-old children with no cavities.

(b) Derive a formula for a large sample approximation to the variance of your estimate in Part (a).

(c) Evaluate the variance formula in Part (b) and use it to obtain an approximate 95% confidence interval for the proportion of 12-year-old children with no cavities.
(d) Describe a method for assessing the true coverage probability of the confidence interval constructed in Part (c). Do not perform any calculations; just outline what you would do.

9. Complete Problem 5 on Assignment 2.
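
Note on Problems 2 and 3: The Goodman-Kruskal gamma is computed from the numbers of concordant pairs (C) and discordant pairs (D) in a table with ordered row and column categories, as (C − D) / (C + D). The following Python sketch is offered only as an illustration of that definition and of the transformation stated in Problem 3. The function names and the 2x2 toy table are hypothetical and are not part of the assignment or of the posted course code, and the sketch does not compute standard errors.

```python
import math


def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an I x J table
    whose row and column categories are both ordered."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair cell (i, j) with every cell in a later row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:    # later row, later column: concordant
                        concordant += table[i][j] * table[k][l]
                    elif l < j:  # later row, earlier column: discordant
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant


def gamma_statistic(table):
    """Goodman-Kruskal gamma estimate: (C - D) / (C + D)."""
    c, d = concordant_discordant(table)
    if c + d == 0:
        raise ValueError("gamma is undefined when C + D = 0")
    return (c - d) / (c + d)


def transformed_gamma(gamma_hat):
    """Transformation from Problem 3: (1/2) log[(1 + g) / (1 - g)]."""
    return 0.5 * math.log((1 + gamma_hat) / (1 - gamma_hat))


# Hypothetical toy table (NOT data from this assignment).
toy_table = [[2, 1],
             [1, 2]]
gamma_hat = gamma_statistic(toy_table)
print(gamma_hat)  # 0.6 for this toy table (C = 4, D = 1)
print(transformed_gamma(gamma_hat))
```

The same functions accept any rectangular list of lists, so a 3x3 table can be passed in the same way, provided the rows and columns are entered in their natural order.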