Solutions to Assignment 1
STAT 565, Fall 2005

1. (a) This is an observational study because the investigator did not control the assignment of the women to the exposure groups. The women made their own choices about the use of oral contraceptives.

(b) This is a case-control (or case-referent) study.

(c) The sampling units are women from two populations in New Zealand: breast cancer patients and women on the electoral rolls.

(d) Let $n_{11}$ denote the number of breast cancer patients who used oral contraceptives. Let $n_{12}$ denote the number of breast cancer patients who did not use oral contraceptives. Let $n_{21}$ denote the number of women with no history of breast cancer who used oral contraceptives. Let $n_{22}$ denote the number of women with no history of breast cancer who did not use oral contraceptives. Then
$$\widehat{OR} = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{(310)(189)}{(123)(708)} = 0.673$$

(e) The standard error of the log odds ratio is
$$SE_{\log(\widehat{OR})} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} = \sqrt{\frac{1}{310} + \frac{1}{123} + \frac{1}{708} + \frac{1}{189}} = 0.1344$$
Approximate 95% CI for log(OR):
$$\log(\widehat{OR}) \pm 1.96 \cdot SE_{\log(\widehat{OR})} \Rightarrow \log(0.6728) \pm 1.96 \cdot 0.1344 \Rightarrow (-0.660, -0.133)$$
Approximate 95% CI for OR: $(e^{-0.660}, e^{-0.133}) \Rightarrow (0.517, 0.876)$.

(f) Relative risk cannot be directly estimated from this case-control study. To estimate relative risk, we need to estimate P(cancer | OC use) and P(cancer | no OC use). By Bayes' theorem,
$$RR = \frac{P(\text{cancer} \mid \text{OC use})}{P(\text{cancer} \mid \text{no OC use})} = \frac{P(\text{OC use} \mid \text{cancer}) \, P(\text{cancer}) / P(\text{OC use})}{P(\text{no OC use} \mid \text{cancer}) \, P(\text{cancer}) / P(\text{no OC use})} = \frac{P(\text{OC use} \mid \text{cancer}) / P(\text{OC use})}{P(\text{no OC use} \mid \text{cancer}) / P(\text{no OC use})}$$
Using the data from this study, we can estimate P(OC use | cancer) and P(no OC use | cancer). To estimate relative risk, we would also need estimates of P(OC use) and P(no OC use), probabilities that cannot be estimated from the data provided by the case-control study.

(g) A possible source of bias is the use of electoral rolls to obtain the sample of women with no history of breast cancer.
Women who register to vote may differ from the larger population of women in New Zealand who have no history of breast cancer with respect to oral contraceptive use. The electoral rolls, for example, may contain a higher proportion of older women than the general population of women, and older women may have lower levels of past contraceptive use.

2. (a) The relative risk is
$$RR = \frac{P(CHD \mid SC \geq 250)}{P(CHD \mid SC < 250)}$$
and a direct estimate is
$$\widehat{RR} = \frac{49/(49 + 134)}{86/(86 + 392)} = 1.488$$

(b) Use the delta method to derive the variance of the estimate of log(RR). Let $\pi_1 = P(CHD \mid SC < 250)$, let $n$ be the total number of 50-62 year old men in the study with serum cholesterol level less than 250, let $X$ be the number of those men with coronary heart disease (CHD), and let $p = X/n$. Let $\pi_2 = P(CHD \mid SC \geq 250)$, let $m$ be the total number of 50-62 year old men in the study with serum cholesterol level of at least 250, let $Y$ be the number of those men with CHD, and let $q = Y/m$. Assuming independent binomial distributions for $X$ and $Y$, $p$ and $q$ are also independent with $Var(p) = \pi_1(1 - \pi_1)/n$ and $Var(q) = \pi_2(1 - \pi_2)/m$. Then
$$Var(p, q) = \begin{pmatrix} \dfrac{\pi_1(1 - \pi_1)}{n} & 0 \\ 0 & \dfrac{\pi_2(1 - \pi_2)}{m} \end{pmatrix}$$
Let $g(\pi_1, \pi_2) = \log(RR) = \log(\pi_2/\pi_1) = \log(\pi_2) - \log(\pi_1)$, and derive the vector of first partial derivatives of $g$:
$$D = \begin{pmatrix} -1/\pi_1 & 1/\pi_2 \end{pmatrix}$$
Using the delta method, the large-sample variance of $\log(q/p)$ is
$$Var(\log(q/p)) = D \, Var(p, q) \, D^t = \frac{1 - \pi_1}{n \pi_1} + \frac{1 - \pi_2}{m \pi_2}$$
This is estimated as
$$\frac{1 - p}{np} + \frac{1 - q}{mq} = 0.02448, \qquad SE_{\log(q/p)} = 0.156$$
An approximate 95% CI for log(RR) is
$$\log(q/p) \pm 1.96 \cdot SE_{\log(q/p)} \Rightarrow (0.091, 0.704)$$
Then an approximate 95% CI for RR is $(e^{0.091}, e^{0.704}) = (1.10, 2.02)$.

(c) The estimated odds ratio
$$\widehat{OR} = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{(49)(392)}{(134)(86)} = 1.667$$
is slightly larger than the direct estimate of relative risk from part (a).
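The interval calculations in Problems 1(d)-(e) and 2(a)-(c) can be checked numerically. The following is a minimal Python sketch, not part of the original solutions; the names `odds_ratio_ci`, `relative_risk_ci`, and `Z95` are illustrative choices. It uses the log-scale standard errors derived above, so its output may differ from hand-rounded figures in the last digit or two.

```python
import math

Z95 = 1.96  # normal quantile used in the solutions for 95% intervals


def odds_ratio_ci(n11, n12, n21, n22, z=Z95):
    """Odds ratio n11*n22/(n12*n21) with a CI built on the log scale."""
    est = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    half = z * se_log
    return est, se_log, (est * math.exp(-half), est * math.exp(half))


def relative_risk_ci(y, m, x, n, z=Z95):
    """Relative risk (y/m)/(x/n) with a delta-method CI on the log scale.

    y of m subjects in the exposed group and x of n subjects in the
    reference group have the disease; the groups are assumed to be
    independent binomial samples.
    """
    q, p = y / m, x / n
    est = q / p
    se_log = math.sqrt((1 - q) / (m * q) + (1 - p) / (n * p))
    half = z * se_log
    return est, se_log, (est * math.exp(-half), est * math.exp(half))


if __name__ == "__main__":
    # Problem 1: cases 310 users / 123 non-users, controls 708 / 189
    print(odds_ratio_ci(310, 123, 708, 189))
    # Problem 2: SC >= 250 has 49 CHD of 183; SC < 250 has 86 CHD of 478
    print(relative_risk_ci(49, 183, 86, 478))
    print(odds_ratio_ci(49, 134, 86, 392))
```

Exponentiating the endpoints of the log-scale interval, as done here, keeps the interval inside the positive range where a ratio must lie.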
(d) The standard error of the log odds ratio is
$$SE_{\log(\widehat{OR})} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} = \sqrt{\frac{1}{49} + \frac{1}{134} + \frac{1}{86} + \frac{1}{392}} = 0.20506$$
An approximate 95% CI for log(OR) is
$$\log(\widehat{OR}) \pm 1.96 \cdot SE_{\log(\widehat{OR})} \Rightarrow \log(1.667) \pm 1.96 \cdot 0.20506 \Rightarrow (0.109, 0.913)$$
and an approximate 95% CI for the odds ratio is $(e^{0.109}, e^{0.913}) = (1.11, 2.49)$.

(e) Given the width of the 95% CI for relative risk obtained in part (b), the odds ratio appears to be a moderately good approximation to relative risk. The OR is somewhat larger than the RR in this case because CHD occurs in a substantial part (about 20%) of the population of 50-62 year old males in Framingham. The odds ratio is generally a good approximation to relative risk if disease incidence is below 10%.

3. (a) Using $\pi_1$ to denote the conditional probability that a 50-62 year old male with serum cholesterol less than 250 (SC < 250) has coronary heart disease (CHD), and using $\pi_2$ to denote the conditional probability that a 50-62 year old male with serum cholesterol of at least 250 (SC ≥ 250) has CHD, we have $\beta_1 = \text{logit}(\pi_2) - \text{logit}(\pi_1)$. Consequently,
$$e^{\beta_1} = \frac{\pi_2/(1 - \pi_2)}{\pi_1/(1 - \pi_1)}$$
is the odds ratio obtained by dividing the odds of CHD for 50-62 year old males with serum cholesterol of at least 250 by the odds of CHD for 50-62 year old males with serum cholesterol less than 250.

(b) Maximum likelihood estimates are
$$\hat{\beta}_1 = 0.5109, \qquad e^{\hat{\beta}_1} = e^{0.5109} = 1.667$$
Since no matching was done and the individuals can reasonably be assumed to have responded independently of each other, these estimates can be obtained using either the LOGISTIC or GENMOD procedures in SAS or the glm( ) function in S-PLUS or R.

(c) An approximate 95% CI for $\beta_1$ is
$$\hat{\beta}_1 \pm 1.96 \cdot SE_{\hat{\beta}_1} \Rightarrow (0.109, 0.913)$$
Then an approximate 95% CI for $e^{\beta_1}$ is $(e^{0.109}, e^{0.913}) = (1.115, 2.491)$. This is the same as the result in 2(d); fitting the logistic regression model is an equivalent way to make inferences about an odds ratio.

4.
(a) Maximum likelihood estimates can be obtained from either the LOGISTIC or GENMOD procedures in SAS, the glm( ) function in S-PLUS or R, or other software that fits logistic regression models. Note that the $\alpha_i$ parameters are not identifiable, and the following estimates are based on the constraint $\alpha_4 = 0$. Different software packages may impose different constraints, which changes the interpretation of the parameter estimates.

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -3.6434           0.1169          972.0106       <.0001
chol 1       1    -0.4493           0.1245           13.0319       0.0003
chol 2       1    -0.2968           0.1077            7.5916       0.0059
chol 3       1     0.1449           0.0950            2.3257       0.1273
sex          1     1.1090           0.1156           92.1047       <.0001
age          1     1.1600           0.1105          110.1591       <.0001

(b) Conditional on any combination of the sex and age groups, $e^{\alpha_2 - \alpha_1}$ is the odds ratio of developing CHD for patients with serum cholesterol level between 190 and 290 compared to those with cholesterol below 190. Regardless of the constraint imposed to obtain the point estimates in part (a), the mle for $e^{\alpha_2 - \alpha_1}$ is 1.1648 and an approximate 95% CI is (0.7887, 1.7202).

(c) This model assumes additivity of the effects of the risk factors on the log-odds of developing CHD. This means that the effect of one risk factor on the odds ratio is consistent across all combinations of levels of the other risk factors in the model. These assumptions could be checked by including interaction terms like AGE*SEX, AGE*CHOL, SEX*CHOL, or AGE*SEX*CHOL in the model and testing the significance of the interaction terms with chi-square tests. The counts in the table on the second page of the assignment are large enough to use a large-sample chi-square test comparing the model in problem 4 to the saturated model that has a different conditional probability of CHD for each combination of levels of factors in that table.
Do that analysis if you want to determine whether the model proposed in part (a) of this problem provides an adequate description of the data.

5. You should consider different ways in which women might be included in the study. Suppose the sub-populations have the following distribution:

                       Probability of    Probability of
                       hip fracture      no hip fracture    Total
low calcium intake     π1                1 − π1             N1
high calcium intake    π2                1 − π2             N2

where π1 is the conditional probability of hip fracture for low calcium intake women and π2 is the conditional probability of hip fracture for high calcium intake women. If the women participating in the study are randomly selected from these sub-populations, choosing independent random samples of 90% of the high calcium intake women and 60% of the low calcium intake women, the sample data would have the following distribution:

                       Probability of    Probability of
                       hip fracture      no hip fracture    Total
low calcium intake     π1                1 − π1             0.6 N1
high calcium intake    π2                1 − π2             0.9 N2

Random sampling within each intake group changes the group totals but not the conditional probabilities of hip fracture. Consequently, a consistent estimate of the population odds ratio is obtained, regardless of the difference in the proportion of women sampled from the low and high calcium intake groups. If the women are not randomly selected from the high and low calcium intake groups and women are allowed to volunteer, the women who choose to participate in the study could differ in some systematic way from women who do not choose to participate. In this situation the estimated odds ratio could be severely biased.

6. (a) The value of the McNemar test statistic is 8.0667 with one degree of freedom and a p-value of 0.0045. The value of the McNemar test statistic with the continuity correction can also be calculated by the following formula:
$$X^2 = \frac{(|n_{12} - n_{21}| - 1)^2}{n_{12} + n_{21}} = \frac{(|41 - 19| - 1)^2}{41 + 19} = 7.35$$
There is a significant (p-value = 0.0067) association between being a driver and the occurrence of acute herniated lumbar intervertebral discs.
The odds of developing herniated lumbar discs are about twice as great for drivers as for non-drivers.

(b) The exact binomial test had a p-value of 0.0062. The result of this test is similar to the result for McNemar's test with continuity correction. This was expected because the observed counts are large enough for a normal distribution to provide a good approximation to the binomial distribution.

(c) Using software that constructs the conditional likelihood function for matched case-control pairs, the estimate of the odds ratio is 2.157. An approximate 95% CI is (1.252, 3.717).

(d) A consistent estimate of the marginal odds ratio is
$$\widehat{OR} = \frac{x_{1+} x_{+2}}{x_{+1} x_{2+}} = \frac{185 \cdot 54}{163 \cdot 32} = 1.915$$
Considering the correlation between matched case/control pairs, the variance of the natural logarithm of the estimate of the marginal odds ratio is estimated by
$$\widehat{Var}(\log(\widehat{OR})) = n \left( \frac{1}{x_{1+} x_{2+}} + \frac{1}{x_{+1} x_{+2}} - \frac{2 (x_{11} x_{22} - x_{12} x_{21})}{x_{1+} x_{2+} x_{+1} x_{+2}} \right) = 0.0522$$
where $n$ is the number of matched pairs, so that $SE_{\log(\widehat{OR})} = 0.228$. An approximate 95% CI for log(OR) is
$$\log(\widehat{OR}) \pm 1.96 \cdot SE_{\log(\widehat{OR})} \Rightarrow \log(1.915) \pm 1.96 \cdot 0.228 \Rightarrow (0.202, 1.097)$$
An approximate 95% CI for the marginal odds ratio is $(e^{0.202}, e^{1.097}) \Rightarrow (1.22, 3.00)$.

In this case the results for the marginal odds ratio, where you average across age groups and all other uncontrolled risk factors, are close to the results from part (c), where you obtained a "pooled" estimate of a common within age group odds ratio. For other data sets, the marginal odds ratio can be quite different from the within-group odds ratio. If you incorrectly assume independence in obtaining an estimate of the natural logarithm of the marginal odds ratio, you will obtain
$$\log(1.915) \pm 1.96 \sqrt{\frac{1}{185} + \frac{1}{32} + \frac{1}{163} + \frac{1}{54}} \Rightarrow (0.165, 1.135)$$
The resulting incorrect 95% CI for the marginal odds ratio is $(e^{0.165}, e^{1.135}) \Rightarrow (1.18, 3.11)$.
This is slightly wider than the "correct" approximate confidence interval given above, but it is substantially shorter than the confidence interval for the conditional odds ratio obtained in part (c).

Note: The following lines of SAS code, using the FREQ procedure,

    data set1;
      input x y count;
      datalines;
    1 1 185
    1 2 32
    2 1 163
    2 2 54
    ;
    run;

    proc freq data=set1;
      tables x*y / chisq all;
      weight count;
    run;

provide the same value of the estimate of the odds ratio and a confidence interval (1.179, 3.112) which is too wide because it ignores possible correlation between responses for the matched pairs.

7. (a) The estimated odds ratio is 7.955 with approximate 95% confidence interval (3.49, 18.15). Hence, the odds of endometrial cancer for women using estrogen are estimated to be about 3.5 to 18 times the odds for women not using estrogen.

(b) The model in part (a) assumes the odds ratio is homogeneous across all age groups. You are estimating a homogeneous within age group odds ratio. This model does not account for the effects of other possible risk factors.

(c) After accounting for additive effects of obesity, high blood pressure, gall bladder disease, and non-estrogen drug use in a proportional hazards Weibull model, the odds of endometrial cancer for those using estrogen are estimated to be about 7.8 times the odds for those not using estrogen. Inclusion of the other risk factors resulted in a small decrease in the odds ratio that is of little practical consequence. Very little of the association between estrogen use and increased incidence of endometrial cancer was explained by including those additional risk factors in the model.

(d) Gall bladder disease has a significant association with endometrial cancer, with an odds ratio of 3.75. There is no significant evidence of any association between risk of endometrial cancer and either presence of hypertension, use of non-estrogen drugs, or obesity.
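The matched-pairs calculations in Problem 6 can also be checked numerically. The following Python sketch is an illustration under stated assumptions, not the software used for the original solutions. The function names are hypothetical. The concordant pair counts 144 and 13 do not appear in the solutions; they are inferred here from the margins 185, 32, 163, 54 and the discordant counts 41 and 19. The conditional odds ratio is taken as the ratio of discordant counts with a Wald interval on the log scale, which is the standard result for 1:1 matched pairs.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def mcnemar(n12, n21, correct=False):
    """McNemar statistic from the discordant counts, optionally corrected."""
    diff = abs(n12 - n21) - (1 if correct else 0)
    stat = diff ** 2 / (n12 + n21)
    return stat, chi2_1df_pvalue(stat)


def exact_binomial_pvalue(n12, n21):
    """Two-sided exact test that discordant pairs split 50/50."""
    n, k = n12 + n21, max(n12, n21)
    tail = sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def conditional_or_ci(n12, n21, z=1.96):
    """Conditional odds ratio n12/n21 with a log-scale Wald interval."""
    est = n12 / n21
    half = z * math.sqrt(1 / n12 + 1 / n21)
    return est, (est * math.exp(-half), est * math.exp(half))


def marginal_or_ci(x11, x12, x21, x22, z=1.96):
    """Marginal odds ratio with a variance that allows for pair correlation."""
    n = x11 + x12 + x21 + x22          # number of matched pairs
    r1, r2 = x11 + x12, x21 + x22      # row totals x1+, x2+
    c1, c2 = x11 + x21, x12 + x22      # column totals x+1, x+2
    est = (r1 * c2) / (c1 * r2)
    var = n * (1 / (r1 * r2) + 1 / (c1 * c2)
               - 2 * (x11 * x22 - x12 * x21) / (r1 * r2 * c1 * c2))
    half = z * math.sqrt(var)
    return est, var, (est * math.exp(-half), est * math.exp(half))


if __name__ == "__main__":
    print(mcnemar(41, 19))
    print(mcnemar(41, 19, correct=True))
    print(exact_binomial_pvalue(41, 19))
    print(conditional_or_ci(41, 19))
    # Inferred pair table: x11 = 185 - 41 = 144, x22 = 54 - 41 = 13
    print(marginal_or_ci(144, 41, 19, 13))
```

Dropping the covariance term in `marginal_or_ci` gives the independence-based variance, which is why the interval from PROC FREQ is wider than the one that accounts for matching.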