STAT 565 Fall 2005 Solutions to Assignment 1

1. (a) This is an observational study because the investigator did not control the assignment
of the women to the exposure groups. The women made their own choices about the use
of oral contraceptives.
(b) This is a case-control (or case-referent) study.
(c) The sampling units are women from two populations in New Zealand: breast cancer
patients and women on the electoral rolls.
(d) Let n11 denote the number of breast cancer patients who used oral contraceptives. Let
n12 denote the number of breast cancer patients who did not use oral contraceptives.
Let n21 denote the number of women with no history of breast cancer who used oral
contraceptives. Let n22 denote the number of women with no history of breast cancer
who did not use oral contraceptives. Then
$$\widehat{OR} = \frac{n_{11}\, n_{22}}{n_{12}\, n_{21}} = \frac{(310)(189)}{(123)(708)} = 0.673$$
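(e)
$$SE_{\log(\widehat{OR})} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} = \sqrt{\frac{1}{310} + \frac{1}{123} + \frac{1}{708} + \frac{1}{189}} = 0.1344$$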
Approximate 95% CI for log(OR):
$$\log(\widehat{OR}) \pm 1.96 \times SE_{\log(\widehat{OR})} \Rightarrow \log(0.6728) \pm 1.96 \times 0.1344 \Rightarrow (-0.660, -0.133)$$
Approximate 95% CI for OR: $(e^{-0.660}, e^{-0.133}) \Rightarrow (0.517, 0.876)$.
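The arithmetic in parts (d) and (e) can be checked with a few lines of Python. This is only a sketch: the function name `odds_ratio_wald_ci` is made up for illustration, and the counts are entered in the same positions (n11, n12, n21, n22) used in the formulas above.

```python
import math

def odds_ratio_wald_ci(n11, n12, n21, n22, z=1.96):
    """Return (OR, SE of log OR, (lower, upper)) for a 2x2 table of counts."""
    or_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_or = math.log(or_hat)
    # Build the interval on the log scale, then exponentiate the end points.
    ci = (math.exp(log_or - z * se_log), math.exp(log_or + z * se_log))
    return or_hat, se_log, ci

# Counts as they appear in the odds ratio formula of part (d).
or_hat, se_log, ci = odds_ratio_wald_ci(310, 123, 708, 189)
print(round(or_hat, 3), round(se_log, 4), tuple(round(x, 3) for x in ci))
```

The interval is symmetric on the log scale but not on the odds ratio scale, which is why the end points are exponentiated rather than computed directly.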
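(f) Relative risk cannot be directly estimated from this case-control study. To estimate
relative risk, we need to estimate P(cancer | OC use) and P(cancer | no OC use). By Bayes' rule,
$$RR = \frac{P(\text{cancer} \mid \text{OC use})}{P(\text{cancer} \mid \text{no OC use})} = \frac{P(\text{OC use} \mid \text{cancer}) \times P(\text{cancer}) \,/\, P(\text{OC use})}{P(\text{no OC use} \mid \text{cancer}) \times P(\text{cancer}) \,/\, P(\text{no OC use})} = \frac{P(\text{OC use} \mid \text{cancer}) \,/\, P(\text{OC use})}{P(\text{no OC use} \mid \text{cancer}) \,/\, P(\text{no OC use})}$$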
Using the data from this study, we can estimate P(OC use|cancer) and P(no OC use|cancer).
To estimate relative risk, we would also need estimates of P(OC use) and P (no OC use),
probabilities that cannot be estimated from the data provided by the case-control study.
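A small numerical sketch may help show why the sampling design matters. The population values below (30% OC use, risks of 0.010 and 0.015, one million women) are hypothetical and chosen only for illustration. The code builds the expected 2x2 table when all cases but only a fraction of the non-cases are sampled, then compares the odds ratio with a "naive" relative risk that treats the sample as a cohort.

```python
def expected_case_control_table(p_exposed, risk_exposed, risk_unexposed,
                                n_pop, control_fraction):
    """Expected counts when all cases and a fraction of non-cases are sampled."""
    exposed = n_pop * p_exposed
    unexposed = n_pop - exposed
    cases_exp = exposed * risk_exposed
    cases_unexp = unexposed * risk_unexposed
    controls_exp = (exposed - cases_exp) * control_fraction
    controls_unexp = (unexposed - cases_unexp) * control_fraction
    return cases_exp, cases_unexp, controls_exp, controls_unexp

def odds_ratio(a, b, c, d):
    # a, b = exposed and unexposed cases; c, d = exposed and unexposed controls
    return (a * d) / (b * c)

def naive_relative_risk(a, b, c, d):
    # Treats the sample as if it were a cohort; not valid for case-control data.
    return (a / (a + c)) / (b / (b + d))

true_rr = 0.010 / 0.015  # hypothetical population relative risk
results = []
for fraction in (1.0, 0.1, 0.01):
    table = expected_case_control_table(0.3, 0.010, 0.015, 1_000_000, fraction)
    results.append((fraction, odds_ratio(*table), naive_relative_risk(*table)))
    print(fraction, round(results[-1][1], 4), round(results[-1][2], 4))
```

In this sketch the odds ratio is the same for every control sampling fraction, while the naive relative risk matches the population value only when every non-case is sampled.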
(g) A possible source of bias is the use of electoral rolls to obtain the sample of women
with no history of breast cancer. Women who register to vote may differ from the larger
population of women in New Zealand who have no history of breast cancer with respect to
oral contraceptive use. The electoral rolls, for example, may contain a higher proportion
of older women than the general population of women and older women may have lower
levels of past contraceptive use.
2. (a)
$$RR = \frac{P(CHD \mid SC \ge 250)}{P(CHD \mid SC < 250)}$$
and a direct estimate is
$$\widehat{RR} = \frac{49/(49 + 134)}{86/(86 + 392)} = 1.49$$
(b) Use the delta method to derive the variance of the estimate of log(RR). Let π1 =
P(CHD | SC < 250), let n be the total number of 50-62 year old men in the study
with serum cholesterol level less than 250, let X be the number of those men with
coronary heart disease (CHD), and let p = X/n. Let π2 = P(CHD | SC ≥ 250), let
m be the total number of 50-62 year old men in the study with serum cholesterol level
≥ 250, let Y be the number of those men with coronary heart disease (CHD), and let
q = Y/m. Assuming independent binomial distributions for X and Y, p and q are also
independent with Var(p) = π1(1 − π1)/n and Var(q) = π2(1 − π2)/m. Then
$$Var(p, q) = \begin{pmatrix} \dfrac{\pi_1(1-\pi_1)}{n} & 0 \\ 0 & \dfrac{\pi_2(1-\pi_2)}{m} \end{pmatrix}$$
Let g(π1, π2) = log(RR) = log(π2/π1) = log(π2) − log(π1), and derive the vector of first
partial derivatives of g:
$$D = \begin{pmatrix} -1/\pi_1 & 1/\pi_2 \end{pmatrix}$$
Using the delta method, the large sample variance of log(q/p) is
$$Var(\log(q/p)) = D \, Var(p, q) \, D^{T} = \frac{1 - \pi_1}{n \pi_1} + \frac{1 - \pi_2}{m \pi_2}$$
This is estimated as (1 − p)/(np) + (1 − q)/(mq) = 0.02448, so $SE_{\log(q/p)} = 0.156$. An
approximate 95% CI for log(RR) is
$$\log(q/p) \pm 1.96 \times SE_{\log(q/p)} \Rightarrow (0.091, 0.704)$$
Then, an approximate 95% CI for RR is
$$(e^{0.091}, e^{0.704}) = (1.10, 2.02)$$
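The delta-method calculation can be reproduced directly from the four Framingham counts. The sketch below is an illustration only; the function name `relative_risk_ci` is an assumption, and the variance used is the expression (1 − p)/(np) + (1 − q)/(mq) derived above.

```python
import math

def relative_risk_ci(y, m, x, n, z=1.96):
    """Delta-method CI for RR = (y/m) / (x/n).

    y of m subjects in the SC >= 250 group have CHD,
    x of n subjects in the SC < 250 group have CHD.
    """
    q = y / m
    p = x / n
    rr = q / p
    # Large-sample variance of log(q/p) for independent binomial proportions.
    var_log = (1 - p) / (n * p) + (1 - q) / (m * q)
    se_log = math.sqrt(var_log)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, var_log, se_log, (lower, upper)

rr, var_log, se_log, rr_ci = relative_risk_ci(49, 49 + 134, 86, 86 + 392)
print(round(rr, 3), round(var_log, 5), round(se_log, 4),
      tuple(round(v, 3) for v in rr_ci))
```

Because the variance is symmetric in the two groups, it is the same whether the ratio is written as q/p or p/q; only the sign of the log changes.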
(c) The estimated odds ratio
$$\widehat{OR} = \frac{n_{11}\, n_{22}}{n_{12}\, n_{21}} = \frac{(49)(392)}{(134)(86)} = 1.667$$
is slightly larger than the direct estimate of relative risk from part (a).
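(d)
$$SE_{\log(\widehat{OR})} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} = \sqrt{\frac{1}{49} + \frac{1}{134} + \frac{1}{86} + \frac{1}{392}} = 0.20506$$
An approximate 95% CI for log(OR) is
$$\log(\widehat{OR}) \pm 1.96 \times SE_{\log(\widehat{OR})} \Rightarrow \log(1.667) \pm 1.96 \times 0.20506 \Rightarrow (0.109, 0.913)$$
and an approximate 95% CI for the odds ratio is $(e^{0.109}, e^{0.913}) = (1.11, 2.49)$.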
(e) Given the width of the 95% CI for relative risk (1.10, 2.02), the odds ratio appears to be
a moderately good approximation to relative risk. The OR is somewhat larger than the RR
in this case because CHD occurs in a substantial part (about 20%) of the population of
50-62 year old males in Framingham. The odds ratio is generally a good approximation
to relative risk if disease incidence is below 10%.
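The rare-disease approximation can be illustrated numerically. In the sketch below the relative risk is held fixed at a hypothetical value of 1.5 and the baseline risk is varied; none of these numbers come from the Framingham data.

```python
def odds(p):
    return p / (1 - p)

def odds_ratio_from_risks(baseline_risk, relative_risk):
    """Odds ratio implied by a baseline risk and a relative risk."""
    exposed_risk = baseline_risk * relative_risk
    return odds(exposed_risk) / odds(baseline_risk)

rr = 1.5  # hypothetical relative risk, held fixed
table = [(b, odds_ratio_from_risks(b, rr)) for b in (0.01, 0.05, 0.10, 0.20)]
for baseline, or_value in table:
    print(f"baseline risk {baseline:.2f}: OR = {or_value:.3f} (RR = {rr})")
```

In this illustration the odds ratio stays close to the relative risk when the baseline risk is small and moves away from it as the baseline risk grows.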
3. (a) Using π1 to denote the conditional probability that a 50-62 year-old male with serum
cholesterol less than 250 (SC<250) has coronary heart disease (CHD) and using π2 to
denote the conditional probability that a 50-62 year-old male with serum cholesterol of at
least 250 (SC≥250) has coronary heart disease (CHD), we have β1 = logit(π2) − logit(π1).
Consequently,
$$e^{\beta_1} = \frac{\pi_2/(1-\pi_2)}{\pi_1/(1-\pi_1)}$$
is the odds ratio obtained by dividing the odds of CHD for 50-62 year-old males with
serum blood cholesterol of at least 250 by the odds of CHD for 50-62 year-old males with
a serum blood cholesterol less than 250.
(b) Maximum likelihood estimates are
$$\hat{\beta}_1 = 0.5109, \qquad e^{\hat{\beta}_1} = e^{0.5109} = 1.667$$
Since no matching was done and the individuals can reasonably be assumed to have
responded independently of each other, these estimates can be obtained using either the
LOGISTIC or GENMOD procedures in SAS or the glm( ) function in SPLUS or R.
(c) An approximate 95% CI for β1 is
$$\hat{\beta}_1 \pm 1.96 \times SE_{\hat{\beta}_1} \Rightarrow (0.109, 0.913)$$
Then, an approximate 95% CI for $e^{\beta_1}$ is $(e^{0.109}, e^{0.913}) = (1.115, 2.491)$.
This result is the same as the result in 2(d). Fitting the logistic regression model is an
equivalent way to make inferences about an odds ratio.
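With a single binary covariate the logistic model is saturated, so the maximum likelihood estimates have a closed form: the intercept is the sample logit in the SC<250 group and the slope is the difference of the two sample logits. The sketch below uses that fact instead of an iterative fit; it assumes the model logit(π) = β0 + β1·I(SC ≥ 250), and the variable names are illustrative.

```python
import math

# Framingham counts from problem 2: CHD and no CHD in each cholesterol group.
chd_high, no_chd_high = 49, 134   # SC >= 250
chd_low, no_chd_low = 86, 392     # SC < 250

# Closed-form MLEs for the saturated model logit(pi) = beta0 + beta1 * I(SC >= 250).
beta0_hat = math.log(chd_low / no_chd_low)
beta1_hat = math.log(chd_high / no_chd_high) - beta0_hat

# Large-sample standard error of beta1_hat for this saturated model.
se_beta1 = math.sqrt(1 / chd_high + 1 / no_chd_high + 1 / chd_low + 1 / no_chd_low)

beta1_ci = (beta1_hat - 1.96 * se_beta1, beta1_hat + 1.96 * se_beta1)
or_ci = tuple(math.exp(v) for v in beta1_ci)
print(round(beta1_hat, 4), round(math.exp(beta1_hat), 3),
      tuple(round(v, 3) for v in or_ci))
```

A general-purpose fitting routine such as `glm()` in R should agree with these values up to convergence tolerance.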
4. (a) Maximum likelihood estimates can be obtained from either the LOGISTIC or GENMOD
procedures in SAS or the glm( ) function in SPLUS or R or other software that fits logistic
regression models. Note that the αi parameters are not identifiable, and the following
estimates are based on the constraint α4 = 0. Different software packages may impose
different constraints, which changes the interpretation of the parameter estimates.
Analysis of Maximum Likelihood Estimates

                               Standard         Wald
Parameter    DF    Estimate       Error   Chi-Square    Pr > ChiSq
Intercept     1     -3.6434      0.1169     972.0106        <.0001
chol 1        1     -0.4493      0.1245      13.0319        0.0003
chol 2        1     -0.2968      0.1077       7.5916        0.0059
chol 3        1      0.1449      0.0950       2.3257        0.1273
sex           1      1.1090      0.1156      92.1047        <.0001
age           1      1.1600      0.1105     110.1591        <.0001
(b) Conditional on any combination of the sex and age groups, $e^{\alpha_2 - \alpha_1}$ is the odds ratio of
developing CHD for patients with serum cholesterol level between 190 and 290 compared
to those with cholesterol below 190. Regardless of the constraint imposed to obtain the
point estimates in part (a), the MLE of $e^{\alpha_2 - \alpha_1}$ is 1.1648 and an approximate 95% CI is
(0.7887, 1.7202).
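The point estimate can be recovered from the rounded table entries, and the same lines show why the choice of constraint does not matter for a contrast between two levels. This is a sketch that assumes the rows labelled chol 1, chol 2, and chol 3 correspond to α1, α2, and α3. The confidence interval is not reproduced because it needs the covariance between the two estimates, which the table does not report.

```python
import math

# Estimates under the constraint alpha4 = 0, taken from the table in part (a).
alpha = {1: -0.4493, 2: -0.2968, 3: 0.1449, 4: 0.0}

def odds_ratio_between(levels, i, j):
    """Odds ratio comparing cholesterol level i with level j."""
    return math.exp(levels[i] - levels[j])

or_21 = odds_ratio_between(alpha, 2, 1)

# Re-express the same fit under the constraint alpha1 = 0: every level shifts
# by the same constant, and the intercept absorbs the shift.
shifted = {level: value - alpha[1] for level, value in alpha.items()}
or_21_shifted = odds_ratio_between(shifted, 2, 1)

print(round(or_21, 4), round(or_21_shifted, 4))
```

The small difference from 1.1648 in the last digit comes from using estimates rounded to four decimals.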
(c) This model assumes additivity of the effects of the risk factors on the relative odds
of developing CHD. This means that the effect of one risk factor on the odds ratio is
consistent across all combinations of levels of the other risk factors in the model. These
assumptions could be checked by including interaction terms like AGE*SEX, AGE*CHOL,
SEX*CHOL, or AGE*SEX*CHOL in the model and testing the significance of the interaction
terms with chi-square tests. The counts in the table on the second page of the assignment
are large enough to use a large sample chi-squared test comparing the model in problem
4 to the saturated model that has a different conditional probability of CHD for each
combination of levels of factors in that table. Do that analysis if you want to determine
if the model proposed in part (a) of this problem provides an adequate description of the
data.
5. You should consider different ways in which women might be included in the study.
Suppose the sub-populations have the following distribution:
                        Probability of    Probability of
                        hip fracture      no hip fracture    Total
low calcium intake      π1                1 − π1             N1
high calcium intake     π2                1 − π2             N2
where π1 is the conditional probability of hip fracture for low calcium intake women and
π2 is the conditional probability of hip fracture for high calcium intake women.
If the women participating in the study are randomly selected from these subpopulations,
choosing independent random samples of 90% of the low calcium intake women and 60%
of the high calcium intake women, the sample data would have the following distribution:
                        Probability of    Probability of
                        hip fracture      no hip fracture    Total
low calcium intake      π1                1 − π1             0.9 N1
high calcium intake     π2                1 − π2             0.6 N2
Consequently, a consistent estimate of the population odds ratio is obtained, regardless
of the difference in the proportion of women sampled from the low and high calcium
intake groups.
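The invariance of the odds ratio to the sampling fractions can be checked with expected counts. The probabilities and group sizes below are hypothetical and serve only to illustrate the argument; the function names are also illustrative.

```python
def expected_counts(pi, group_size, fraction):
    """Expected (fracture, no fracture) counts when a random fraction of a group is sampled."""
    sampled = fraction * group_size
    return sampled * pi, sampled * (1 - pi)

def odds_ratio(low, high):
    """Odds of fracture in the low-intake group divided by the odds in the high-intake group."""
    return (low[0] / low[1]) / (high[0] / high[1])

# Hypothetical values, chosen only for illustration.
pi1, pi2 = 0.08, 0.05
N1, N2 = 10_000, 20_000

population_or = (pi1 / (1 - pi1)) / (pi2 / (1 - pi2))
sampled_ors = [
    odds_ratio(expected_counts(pi1, N1, f_low), expected_counts(pi2, N2, f_high))
    for f_low, f_high in [(1.0, 1.0), (0.9, 0.6), (0.6, 0.9)]
]
print(round(population_or, 4), [round(v, 4) for v in sampled_ors])
```

Each group's sampling fraction cancels within that group's odds, so the ratio of the two odds does not depend on how heavily either group is sampled, provided the sampling within each group is random.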
If the women are not randomly selected from the high and low calcium intake groups
and women are allowed to volunteer, the women who choose to participate in the study
could differ in some systematic way from women who do not choose to participate. In
this situation the estimated odds ratio could be severely biased.
6. (a) The value of the McNemar test statistic is 8.0667 with one degree of freedom and a p-value
of 0.0045. The value of the McNemar test statistic with the continuity correction can
also be calculated by the following formula:
$$X^2 = \frac{(|n_{12} - n_{21}| - 1)^2}{n_{12} + n_{21}} = \frac{(|41 - 19| - 1)^2}{41 + 19} = 7.35$$
There is a significant (p-value = 0.0067) association between being a driver and occurrence
of acute herniated lumbar intervertebral discs. The odds of developing herniated lumbar
discs are about twice as great for drivers as for non-drivers.
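Both versions of the McNemar statistic depend only on the two discordant counts, so they are easy to verify. The sketch below computes the chi-square p-value with one degree of freedom through the complementary error function, using the identity P(χ²₁ > x) = erfc(√(x/2)). The function name `mcnemar` is illustrative.

```python
import math

def mcnemar(n12, n21, continuity=False):
    """McNemar statistic and chi-square(1) p-value from the discordant counts."""
    diff = abs(n12 - n21) - (1 if continuity else 0)
    stat = diff ** 2 / (n12 + n21)
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p = mcnemar(41, 19)
stat_cc, p_cc = mcnemar(41, 19, continuity=True)
print(round(stat, 4), round(p, 4), round(stat_cc, 2), round(p_cc, 4))
```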
(b) The exact binomial test had a p-value of 0.0062. The result of this test is similar to the
result for McNemar's test with continuity correction. This was expected because the
observed counts are large enough for a normal distribution to provide a good approximation
to the binomial distribution.
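The exact test conditions on the 60 discordant pairs and asks how unusual a 41/19 split is when each pair is equally likely to fall either way. The sketch below assumes the two-sided p-value is defined as twice the smaller tail probability, capped at 1; other two-sided conventions exist, although they coincide for a symmetric null such as this one.

```python
from math import comb

def exact_binomial_two_sided(k, n):
    """Two-sided exact test of p = 0.5: twice the smaller tail, capped at 1."""
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))

# 19 of the 60 discordant pairs fall in the smaller cell.
p_exact = exact_binomial_two_sided(19, 60)
print(round(p_exact, 4))
```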
(c) Using software that constructs the conditional likelihood function for matched
case-control pairs, the estimate of the odds ratio is 2.157. An approximate 95% CI is (1.252,
3.717).
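For 1:1 matched pairs with a binary exposure, the conditional maximum likelihood estimate of the odds ratio reduces to the ratio of the discordant counts, with large-sample variance 1/n12 + 1/n21 on the log scale. The sketch below uses that closed form rather than a conditional logistic regression routine, so the last digit may differ slightly from software output. The function name is illustrative.

```python
import math

def conditional_odds_ratio_ci(n12, n21, z=1.96):
    """Conditional MLE of the odds ratio for matched pairs, with a Wald CI."""
    or_hat = n12 / n21
    se_log = math.sqrt(1 / n12 + 1 / n21)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, (lower, upper)

cond_or, cond_ci = conditional_odds_ratio_ci(41, 19)
print(round(cond_or, 3), tuple(round(v, 3) for v in cond_ci))
```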
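(d) A consistent estimate of the marginal odds ratio is
$$\widehat{OR} = \frac{x_{1+}\, x_{+2}}{x_{+1}\, x_{2+}} = \frac{185 \times 54}{163 \times 32} = 1.915$$
Considering the correlation between matched case/control pairs, the variance of the
natural logarithm of the estimate of the marginal odds ratio is
$$Var(\log(\widehat{OR})) = n \left( \frac{1}{x_{1+}\, x_{2+}} + \frac{1}{x_{+1}\, x_{+2}} - 2\, \frac{x_{11} x_{22} - x_{12} x_{21}}{x_{1+}\, x_{2+}\, x_{+1}\, x_{+2}} \right)$$
where n is the number of matched pairs. This gives
$$Var(\log(\widehat{OR})) = 0.0522, \qquad SE_{\log(\widehat{OR})} = 0.228$$
An approximate 95% CI for log(OR):
$$\log(\widehat{OR}) \pm 1.96 \times SE_{\log(\widehat{OR})} \Rightarrow \log(1.915) \pm 1.96 \times 0.228 \Rightarrow (0.202, 1.098)$$
An approximate 95% CI for the marginal odds ratio is $(e^{0.202}, e^{1.098}) \Rightarrow (1.22, 3.00)$.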
In this case the results for the marginal odds ratio, where you average across age groups
and all other uncontrolled risk factors, are close to the results from part (c), where you
obtained a "pooled" estimate of a common within age group odds ratio. For other data
sets, the marginal odds ratio can be quite different from the within-group odds ratio.
If you incorrectly assume independence in obtaining an estimate of the natural logarithm
of the marginal odds ratio, you will obtain
$$\log(1.915) \pm 1.96 \sqrt{\frac{1}{185} + \frac{1}{32} + \frac{1}{163} + \frac{1}{54}} \Rightarrow (0.165, 1.135)$$
The resulting incorrect 95% CI for the marginal odds ratio is $(e^{0.165}, e^{1.135}) \Rightarrow
(1.18, 3.11)$. This is slightly wider than the "correct" approximate confidence interval
given above, but it is substantially shorter than the confidence interval for the
conditional odds ratio obtained in part (c).
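The two variance calculations can be compared side by side. The sketch below assumes the paired table has x12 = 41 and x21 = 19 (the discordant counts from part (a)); the concordant counts x11 = 144 and x22 = 13 are then implied by the margins 185, 32, 163, and 54. The paired variance is the dependent-samples expression n[1/(x1+ x2+) + 1/(x+1 x+2) − 2(x11 x22 − x12 x21)/(x1+ x2+ x+1 x+2)].

```python
import math

def marginal_or_matched_pairs(x11, x12, x21, x22, z=1.96):
    """Marginal odds ratio for a paired 2x2 table, with paired and naive CIs."""
    n = x11 + x12 + x21 + x22
    r1, r2 = x11 + x12, x21 + x22   # x1+, x2+
    c1, c2 = x11 + x21, x12 + x22   # x+1, x+2
    or_hat = (r1 * c2) / (c1 * r2)
    # Variance of log(OR) that accounts for the pairing.
    var_paired = n * (1 / (r1 * r2) + 1 / (c1 * c2)
                      - 2 * (x11 * x22 - x12 * x21) / (r1 * r2 * c1 * c2))
    # Variance that (incorrectly) treats the two margins as independent samples.
    var_indep = 1 / r1 + 1 / r2 + 1 / c1 + 1 / c2

    def interval(var):
        half = z * math.sqrt(var)
        return math.exp(math.log(or_hat) - half), math.exp(math.log(or_hat) + half)

    return or_hat, var_paired, interval(var_paired), var_indep, interval(var_indep)

m_or, v_paired, ci_paired, v_indep, ci_indep = marginal_or_matched_pairs(144, 41, 19, 13)
print(round(m_or, 3), round(v_paired, 4), tuple(round(v, 3) for v in ci_paired))
print(round(v_indep, 4), tuple(round(v, 3) for v in ci_indep))
```

The covariance term is positive here because x11 x22 exceeds x12 x21, which is why the paired variance is smaller than the independence variance.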
Note: The following lines of SAS code, using the FREQ procedure,
data set1;
  input x y count;   /* x = row, y = column, count = cell frequency */
  datalines;
1 1 185
1 2 32
2 1 163
2 2 54
;
run;
proc freq data=set1;
  table x*y / chisq all;   /* request chi-square tests and measures of association */
  weight count;            /* each data line is a cell count, not one observation */
run;
provide the same value of the estimate of the odds ratio and a confidence interval (1.179,
3.112) that is too wide because it ignores possible correlation between responses for the
matched pairs.
7. (a) The estimated odds ratio is 7.955 with approximate 95% confidence interval (3.49, 18.15).
Hence, the odds of endometrial cancer for those using estrogen are estimated to be about
3.5 to 18 times the odds for those not using estrogen.
(b) The model in part (a) assumes the odds ratio is homogeneous across all age groups. You
are estimating a homogeneous within age group odds ratio. This model does not account
for the effects of other possible risk factors.
(c) After accounting for additive effects of obesity, high blood pressure, gall bladder disease,
and non-estrogen drug use in a proportional hazards Weibull model, the risk of endometrial
cancer for those using estrogen is estimated to be about 7.8 times higher than
for those not using estrogen. Inclusion of the other risk factors resulted in a small decrease
in the odds ratio that is of little practical consequence. Very little of the association between
estrogen use and increased incidence of endometrial cancer was explained by including
those additional risk factors in the model.
(d) Gallbladder disease has a significant association with endometrial cancer, with an odds ratio
of 3.75. There is no significant evidence of any association between risk of endometrial
cancer and presence of hypertension, use of non-estrogen drugs, or obesity.