Statistical Methods - HUMIS

Case-Control Studies: Statistical Analysis
Greg Stoddard
December 16, 2010
University of Utah School of Medicine
Rothman claims, “Properly carried out, case-control studies provide
information that mirrors what could be learned from a cohort study,
usually at considerably less cost and time.”
[Rothman KJ, Epidemiology: An Introduction, 2002, p.73]
Goal: contrast the statistical approaches of
the two study designs to verify Rothman’s
claim.
Diagrammatically,
Cohort Study
  E      ->  D or not-D
  not-E  ->  D or not-D
Case-Control Study
  D      ->  E or not-E
  not-D  ->  E or not-E
Data Layout,
Cohort Study
            E        Not-E
D           a        b          n_D
Not-D       c        d          n_not-D
            N_E      N_not-E
Case-Control Study
            E        Not-E
D           a        b          N_D
Not-D       c        d          N_not-D
            n_E      n_not-E
N = fixed, n = free to vary
Cohort Study: incidence proportion = disease cases / persons at risk
Case-Control Study: incidence proportion = (not estimable)
The incidence proportion not being
estimable is not much of a shortcoming.
Given a study’s inclusion/exclusion criteria,
the incidence proportion does not actually
apply to a very wide patient population,
anyway.
The goal is not to estimate incidence, but
rather to assess an exposure-disease
association.
We can do that just fine with relative
measures of effect, the risk ratio and odds
ratio.
Cohort Study (cell counts a, b, c, d as in the data layout)
risk ratio = (a/N_E)/(b/N_not-E)
odds ratio = odds(D|E)/odds(D|not-E)
           = (a/c)/(b/d) = (ad)/(bc)
Case-Control Study (cell counts a, b, c, d as in the data layout)
exposure odds ratio = odds(E|D)/odds(E|not-D)
                    = (a/b)/(c/d) = (ad)/(bc)
So, as long as either E or D is free to vary, you get the same
relative effect measure, the odds ratio, with both study designs.
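As a quick numerical check of that equality, here is a minimal sketch. It borrows the cell counts from the nickel-refinery 2 x 2 table used later in this talk; the variable names are illustrative only.

```python
# Cell counts taken from the nickel-refinery 2 x 2 table later in the talk
a, b = 46, 10     # diseased:     exposed, not exposed
c, d = 343, 280   # not diseased: exposed, not exposed

# Cohort view: odds of disease within each exposure group
disease_or = (a / c) / (b / d)

# Case-control view: odds of exposure within each disease group
exposure_or = (a / b) / (c / d)

# Both reduce to the cross-product ratio (ad)/(bc)
cross_product = (a * d) / (b * c)

print(round(disease_or, 2), round(exposure_or, 2), round(cross_product, 2))
```

All three quantities are the same number, which is why the design that fixes E and the design that fixes D can share one effect measure.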
Cohort Study
$$RR = \frac{a/N_E}{b/N_{\text{not-}E}} = \frac{a/(a+c)}{b/(b+d)}$$
If the disease is rare (<10% in both E and Not-E groups), so that a is
small relative to c and b is small relative to d, then a + c ≈ c and
b + d ≈ d.
Substituting,
$$RR = \frac{a/(a+c)}{b/(b+d)} \approx \frac{a/c}{b/d} = \frac{ad}{bc} = OR$$
So, OR from case-control study approximates RR from
cohort study, when the rare disease assumption is met.
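To see how quickly the approximation degrades, the sketch below holds the risk ratio fixed and varies the risk in the unexposed group. The function name and the chosen baseline risks are illustrative assumptions, not values from the slides.

```python
def odds_ratio_from_risks(risk_unexposed, risk_ratio):
    """Odds ratio implied by a baseline risk and a fixed risk ratio."""
    risk_exposed = risk_unexposed * risk_ratio
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed

# Hypothetical baseline risks, with the true risk ratio held at 2.0
for p0 in (0.01, 0.05, 0.10, 0.30):
    print(p0, round(odds_ratio_from_risks(p0, 2.0), 2))
```

With a 1% baseline risk the OR stays very close to 2; by a 30% baseline risk it has drifted to 3.5.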
Why the 10%, or 0.10, incidence proportion
is a good cutpoint for “rare disease” is
illustrated nicely in a figure published in:
Zhang J, Yu KF. What’s the relative risk? A
method of correcting the odds ratio in cohort
studies of common outcomes. JAMA 1998;
280(19):1690-91.
Aside:
The formula in Zhang and Yu (1998) for
converting an odds ratio to a risk ratio in
cohort studies has been convincingly
criticized as unreliable (Zou, 2004), so you
should avoid using it.
[Zou G. A modified Poisson regression approach to
prospective studies with binary data. Am J Epidemiol
2004;159(7):702-706.]
Checking our progress
How far have we gotten in verifying
that a case-control study can mirror what
can be learned in a cohort study?
We have seen that the OR is the same in
both study designs.
We have seen that the OR approximates the
RR under the rare disease assumption,
and so it has a straightforward
interpretation.
However, cohort studies rarely use the odds
ratio, nor do they use the risk ratio.
Instead, cohort studies use survival analysis.
Why?
Risk Ratio Analysis
This type of analysis ignores time-at-risk.
That is, it assumes an equal follow-up time
for every study subject.
              Exposed                        Non-Exposed
Follow-up     Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day           N      Cases    Risk           N      Cases    Risk           Risk Ratio
1             50     5        0.10           50     2        0.04           2.5
2             30     10       0.33           40     8        0.20           1.7
3             10     10       1.00           20     10       0.50           2.0
Total         90     25                      110    20
The risk ratio uses only partial information from the complete data in
the life table (highlighted in blue on the original slide): the day-1
beginning N and the total disease cases.
Risk ratio analysis data
              Exposed     Not-Exposed
Disease       25 (50%)    20 (40%)
Not-Disease   25          30
N             50          50
Risk Ratio = (25/50)/(20/50) = 1.25
Chi-square test, p = 0.31
Analyzing these data in this way, we do not
demonstrate a significant effect. In fact, this
crude RR underestimates each of the day-specific RR estimates.
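The crude analysis can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes an uncorrected Pearson chi-square test with 1 degree of freedom (the slide does not say which variant was used); the function name is illustrative.

```python
import math

def risk_ratio_and_chi2(a, b, c, d):
    """a, b = diseased (exposed, unexposed); c, d = not diseased (exposed, unexposed)."""
    n_exposed, n_unexposed = a + c, b + d
    rr = (a / n_exposed) / (b / n_unexposed)
    n = n_exposed + n_unexposed
    # Pearson chi-square for a 2 x 2 table, no continuity correction (assumption)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * n_exposed * n_unexposed)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return rr, chi2, p_value

rr, chi2, p_value = risk_ratio_and_chi2(25, 20, 25, 30)
print(round(rr, 2), round(chi2, 2), round(p_value, 2))
```

Under these assumptions the result should land near the RR of 1.25 and p = 0.31 quoted above.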
Rate Ratio Analysis
Let’s see if we can do better with a rate ratio
analysis. It uses a person-time
denominator, so in that sense, it relaxes
the equal time-at-risk assumption of the
risk ratio analysis.
              Exposed                        Non-Exposed
Follow-up     Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day           N      Cases    Risk           N      Cases    Risk           Risk Ratio
1             50     5        0.10           50     2        0.04           2.5
2             30     10       0.33           40     8        0.20           1.7
3             10     10       1.00           20     10       0.50           2.0
Total         90     25                      110    20
The rate ratio uses only partial information from the complete data in
the life table (highlighted in blue on the original slide): the Total
row of person-days and disease cases.
Rate ratio analysis data
              Exposed     Not-Exposed
Disease       25 (50%)    20 (40%)
Person-Days   90          110
Rate Ratio = (25/90)/(20/110) = 1.53
Binomial probability mid-p exact test for person-time data,
p = 0.080
Analyzing these data in this way, we almost
demonstrate a significant effect. Again, this
crude rate ratio underestimates each of the day-specific risk ratio (rate ratio) estimates.
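A sketch of the person-time calculation follows. It conditions on the total number of cases and treats the exposed case count as binomial, with success probability equal to the exposed share of person-time. The slide does not say whether its mid-p value is one- or two-sided; the one-sided version computed here appears to come out close to the 0.080 reported, and doubling it would give a two-sided value. The function name is illustrative.

```python
from math import comb

def rate_ratio_midp(cases_exp, cases_unexp, pt_exp, pt_unexp):
    """Rate ratio and one-sided (upper-tail) mid-p for person-time data."""
    rate_ratio = (cases_exp / pt_exp) / (cases_unexp / pt_unexp)
    total_cases = cases_exp + cases_unexp
    # Under the null, each case falls in the exposed group with this probability
    share = pt_exp / (pt_exp + pt_unexp)

    def pmf(k):
        return comb(total_cases, k) * share**k * (1 - share) ** (total_cases - k)

    upper_tail = sum(pmf(k) for k in range(cases_exp + 1, total_cases + 1))
    mid_p = upper_tail + 0.5 * pmf(cases_exp)  # half weight on the observed count
    return rate_ratio, mid_p

rate_ratio, mid_p = rate_ratio_midp(25, 20, 90, 110)
print(round(rate_ratio, 2), round(mid_p, 3))
```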
Inefficient Use of Time in Rate Ratio Analysis
The rate ratio analysis failed to convey the
information in the life table because it
considers only the ratio of cases to
average person-time, without
distinguishing times to event from times to
censoring.
person-time = total time for subjects
= mean time x N
Suppose the individual times-at-risk for a
sample are 10, 20, and 30. The person-time is computed as:
PT = total time for subjects
= 10+20+30 = 60
which is equivalent to:
PT = mean time x N
= (10+20+30)/3 x 3 = 20 x 3 = 60
So, a rate ratio analysis would find the
following two scenarios equal (even
though Group B outperforms Group A)
(let x----x denote time)
x-------------------------------------x (censored)
x-----x (died)
x--------x (died)
x--------------------------------------------x (censored)
Group A
x-------------------------------------x (died)
x-----x (censored)
x--------x (censored)
x--------------------------------------------x (died)
Group B
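The two scenarios can be put into numbers. In this sketch the follow-up times are hypothetical stand-ins for the line lengths drawn above, and the Kaplan-Meier helper is a minimal illustration rather than a full survival routine.

```python
def event_rate(subjects):
    """Events per unit of person-time; subjects = [(time, died), ...]."""
    events = sum(1 for _, died in subjects if died)
    person_time = sum(time for time, _ in subjects)
    return events / person_time

def km_survival(subjects, t):
    """Kaplan-Meier survival probability at time t."""
    surv = 1.0
    event_times = sorted({time for time, died in subjects if died and time <= t})
    for et in event_times:
        at_risk = sum(1 for time, _ in subjects if time >= et)
        deaths = sum(1 for time, died in subjects if died and time == et)
        surv *= 1 - deaths / at_risk
    return surv

# Hypothetical times: same durations in both groups, opposite outcomes
group_a = [(37, False), (5, True), (8, True), (44, False)]
group_b = [(37, True), (5, False), (8, False), (44, True)]

print(event_rate(group_a), event_rate(group_b))            # identical rates
print(km_survival(group_a, 10), km_survival(group_b, 10))  # very different survival
```

Both groups have 2 deaths over the same person-time, so their rates match, yet survival at time 10 is one half in Group A and still 1 in Group B.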
Hazard Ratio Analysis (Survival Analysis)
This analysis uses time-at-risk in a very
complete way, using all of the information
in the life table.
              Exposed                        Non-Exposed
Follow-up     Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day           N      Cases    Risk           N      Cases    Risk           Risk Ratio
1             50     5        0.10           50     2        0.04           2.5
2             30     10       0.33           40     8        0.20           1.7
3             10     10       1.00           20     10       0.50           2.0
Total         90     25                      110    20
From Cox regression, HR = 1.92, p = 0.032
The HR is identically the Mantel-Haenszel
summary risk ratio.
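The Mantel-Haenszel summary risk ratio can be computed directly from the life table, treating each follow-up day as a stratum. This is a minimal sketch using the standard Mantel-Haenszel weights; the function and variable names are illustrative.

```python
# One tuple per follow-up day:
# (exposed cases, exposed at risk, unexposed cases, unexposed at risk)
life_table = [
    (5, 50, 2, 50),
    (10, 30, 8, 40),
    (10, 10, 10, 20),
]

def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio across strata."""
    numerator = sum(a * n0 / (n1 + n0) for a, n1, b, n0 in strata)
    denominator = sum(b * n1 / (n1 + n0) for a, n1, b, n0 in strata)
    return numerator / denominator

mh_rr = mantel_haenszel_rr(life_table)
print(round(mh_rr, 2))
```

The result rounds to 1.92, in line with the hazard ratio quoted from the Cox regression.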
Aside
Showing a life table like this and pointing out
that the HR is just the weighted average of
the day-specific risk ratios, and so is a
relative risk estimate, is a very clear way
to explain the HR to a researcher.
Checking our progress
Recall, we are trying to verify that a case-control study can mirror what can be
learned in a cohort study.
It appears, then, that we need to incorporate
survival analysis into the case-control
framework in order to keep up with what a
cohort study can do.
It turns out we can do this, use survival
analysis in the case-control framework, if
we tweak the study design slightly.
The slight variant is called the case-cohort
design (also called the density case-control design).
While presenting this design, I am going to
show some simulation results. In this way,
I can demonstrate that the case-cohort
design really does perform as well as a
cohort study design.
Dataset
The dataset comes from Breslow and Day
[Breslow NE, Day NE. (1987). Statistical Methods in
Cancer Research, Vol II: The Design and Analysis of
Cohort Studies, Lyon, France, IARC, 1987.]
Men (n=679) employed in a nickel refinery in South Wales
were investigated to determine whether the risk of
developing carcinoma of the bronchi and nasal sinuses
(ICD = 160), which had been associated with the refining
of nickel from previous studies in the 1930s, was present
in this cohort.
Modified Dataset
I also modified the dataset, to create a
second dataset that does not meet the
rare disease assumption, by duplicating
the cases five times.
Treating each dataset as the “population”
and analyzing it in full, we know the
answer that a case-control design
sampling from that cohort is supposed to
recover.
The population relative measures are:
Population Relative      Actual Dataset with       Augmented Dataset with
Effect Measure           almost rare disease       frequent disease
                         (3% in unexposed,         (15% in unexposed,
                         12% in exposed)           40% in exposed)
Odds Ratio               3.76                      3.76
Risk Ratio               3.43                      2.65
Rate Ratio               4.76                      3.87
Hazard Ratio             5.02                      4.19
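The odds ratio and risk ratio rows can be checked from the 2 x 2 counts of the nickel cohort (46 and 10 tumor cases, 343 and 280 non-cases). The sketch assumes that “duplicating the cases five times” means each case appears five times in total, an assumption that reproduces the risk ratio of 2.65.

```python
def odds_ratio(a, b, c, d):
    """a, b = cases (exposed, unexposed); c, d = non-cases (exposed, unexposed)."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    return (a / (a + c)) / (b / (b + d))

actual = (46, 10, 343, 280)
# Assumption: every case is present five times; non-cases are unchanged
augmented = (5 * 46, 5 * 10, 343, 280)

for label, table in (("actual", actual), ("augmented", augmented)):
    print(label, round(odds_ratio(*table), 2), round(risk_ratio(*table), 2))
```

Multiplying the case row by a constant cancels in the cross-product, so the OR is unchanged, while the RR moves because the risk denominators change.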
Classical Case-Control Study (controls are
sampled from the population controls only)
Using a 2:1 sampling ratio
            Exposed      Not exposed
            to nickel    to nickel      Total
Tumor       46           10             56      use all 56 cases
No Tumor    343          280            623     sample 56 x 2 controls
Total       389          290            679
Monte Carlo simulation, computing the OR from
1,000 samples, to get the long-run average of
the OR.
(Each sample keeps all 56 subjects from
the tumor row of the population 2 x 2 table,
and then randomly samples 112 subjects
from the no-tumor row of the population
2 x 2 table.)
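The sampling scheme just described can be sketched as follows. This is an illustration only, not the code used for the talk: the seed, function name, and use of Python's `random` module are assumptions, so the long-run average will differ slightly from the reported one.

```python
import random

def simulate_classical_cc(n_sims=1000, seed=2010):
    """Mean OR when controls are drawn from the no-tumor row only."""
    rng = random.Random(seed)
    cases_exposed, cases_unexposed = 46, 10
    non_cases = [1] * 343 + [0] * 280   # 1 = exposed to nickel
    n_controls = 2 * (cases_exposed + cases_unexposed)   # 2:1 ratio
    ors = []
    for _ in range(n_sims):
        controls = rng.sample(non_cases, n_controls)     # without replacement
        ctrl_exposed = sum(controls)
        ctrl_unexposed = n_controls - ctrl_exposed
        ors.append((cases_exposed * ctrl_unexposed) / (cases_unexposed * ctrl_exposed))
    return sum(ors) / len(ors)

mean_or = simulate_classical_cc()
print(round(mean_or, 2))
```

The average should sit close to the population OR of 3.76, slightly above it because the OR is a ratio with a random denominator.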
The simulation results are:
Classical case-control design (sample
controls from no-tumor subjects only)
Population Relative      Actual Dataset with       Augmented Dataset with
Effect Measure           almost rare disease       frequent disease
                         (3% in unexposed,         (15% in unexposed,
                         12% in exposed)           40% in exposed)
Odds Ratio               3.76 (OR=3.81)            3.76 (OR=3.77)
Risk Ratio               3.43                      2.65
Rate Ratio               4.76                      3.87
Hazard Ratio             5.02                      4.19
Case-Cohort Study Design
- In this design, we keep the cases. Then,
we sample our controls from the total
row of the population 2 x 2 table.
- For those cases that get mixed in with
the controls, we set their status variable to
0, the control value.
- We then calculate the OR in the usual
way.
Case-Cohort Study (controls are sampled
from the population row totals, which
include both cases and controls)
Using a 2:1 sampling ratio
            Exposed      Not exposed
            to nickel    to nickel      Total
Tumor       46 (a)       10 (b)         56      use all 56 cases
No Tumor    343          280            623
Total       389 (c)      290 (d)        679     sample 56 x 2 controls
The odds ratio is then a direct calculation of the risk ratio.
OR = (a x kd)/(b x kc) = (kad)/(kbc) = (ad)/(bc), where c and d are the
column totals and k = (56 x 2)/679 is the sampling fraction
RR = (a/c)/(b/d) = (ad)/(bc) = OR
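A sketch of the same Monte Carlo idea for the case-cohort design follows, with controls drawn from the whole cohort (total row) instead of the no-tumor row. The seed and function name are assumptions, not the original simulation code.

```python
import random

def simulate_case_cohort(n_sims=1000, seed=2010):
    """Mean OR when controls are drawn from the total row (cases included)."""
    rng = random.Random(seed)
    cases_exposed, cases_unexposed = 46, 10
    cohort = [1] * 389 + [0] * 290      # 1 = exposed to nickel, all 679 men
    n_controls = 2 * (cases_exposed + cases_unexposed)   # 2:1 ratio
    ors = []
    for _ in range(n_sims):
        # Any case drawn here simply counts as a control
        controls = rng.sample(cohort, n_controls)
        ctrl_exposed = sum(controls)
        ctrl_unexposed = n_controls - ctrl_exposed
        ors.append((cases_exposed * ctrl_unexposed) / (cases_unexposed * ctrl_exposed))
    return sum(ors) / len(ors)

mean_or = simulate_case_cohort()
print(round(mean_or, 2))
```

Here the average should track the population risk ratio of 3.43 and not the odds ratio of 3.76.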
The simulation results are:
Case-cohort design (sample controls
from total row of population 2 x 2 table)
Population Relative      Actual Dataset with       Augmented Dataset with
Effect Measure           almost rare disease       frequent disease
                         (3% in unexposed,         (15% in unexposed,
                         12% in exposed)           40% in exposed)
Odds Ratio               3.76                      3.76
Risk Ratio               3.43 (OR=3.48)            2.65 (OR=2.67)
Rate Ratio               4.76                      3.87
Hazard Ratio             5.02                      4.19
Case-Cohort Study Design
For the case-cohort design, the rare-disease
assumption is not required for the OR to
be an estimate of RR (Rothman and
Greenland, 1998, p.110). We have
demonstrated that to be the case.
[Rothman KJ, Greenland S. (1998). Modern Epidemiology,
2nd ed. Philadelphia, PA.]
It is nice to be able to use the OR to directly
estimate RR, and not worry about the rare
disease assumption at all.
It comes with a price, however. Since your
controls are now “messy”, with cases mixed in,
you do not have as clear a signal for the
effect, so statistical power is reduced. You need
to sample additional controls to make up the
difference (to bring power back to that of the
classical case-control study).
Case-Cohort Study Design With Risk Set Sampling
In this design, you again keep all of the cases.
You again sample controls from the total row
of the population 2 x 2 table (that is, from cases and
controls combined). This time, however, you sample only from
total-row subjects who have the same or longer
time-at-risk as the case. This is called risk set sampling.
              Exposed                        Non-Exposed
Follow-up     Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day           N      Cases    Risk           N      Cases    Risk           Risk Ratio
1             50     5        0.10           50     2        0.04           2.5
2             30     10       0.33           40     8        0.20           1.7
3             10     10       1.00           20     10       0.50           2.0
In this design, we also use a type of “total row”
sampling. That is, we select our controls from
the “Beginning N” columns of the life table.
For the 5+2 cases that occurred on day 1, we
sample our controls from the 50+50 persons still
at risk on day 1.
For the 10+8 cases that occurred on day 2, we
sample our controls from the 30+40 persons still
at risk on day 2. …and so on.
We do this by forming risk sets. For every case,
we form a risk set that includes all subjects with
an equal or longer follow-up time. Then, if we are
using a 2:1 sampling ratio, we sample 2 controls
from that risk set and match them with
that case.
This is identical to sampling on the correct row
from the Beginning N column, like we did above.
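The risk set construction can be sketched in a few lines. The toy cohort, function name, and seed below are hypothetical. The sketch excludes the case itself from its own risk set, and a subject who becomes a case later remains eligible as a control for earlier cases.

```python
import random

def sample_risk_sets(subjects, controls_per_case=2, seed=0):
    """subjects: list of (id, follow_up_time, is_case).
    Returns {case_id: [sampled control ids]}."""
    rng = random.Random(seed)
    matched = {}
    for sid, time, is_case in subjects:
        if not is_case:
            continue
        # Risk set: everyone else with an equal or longer follow-up time
        risk_set = [oid for oid, otime, _ in subjects if oid != sid and otime >= time]
        k = min(controls_per_case, len(risk_set))
        matched[sid] = rng.sample(risk_set, k)
    return matched

# Hypothetical toy cohort: (id, follow-up day, became a case?)
cohort = [
    (1, 1, True), (2, 1, False), (3, 2, True), (4, 2, False),
    (5, 3, True), (6, 3, False), (7, 3, False),
]
matched_sets = sample_risk_sets(cohort)
print(matched_sets)
```

Each case and its sampled controls form one matched set, which is the grouping that a matched analysis conditions on.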
We have already seen that the OR from a case-cohort study design directly estimates the RR.
We are now doing a version of the case-cohort
approach for each row of the life table.
We know that the HR is just the summary RR
across the rows of the life table.
If we use conditional logistic regression, then, to
account for the row-specific matching, it would
seem the OR should directly estimate the HR.
Let’s see if that is true.
This time in the simulation, we will take the OR
from the conditional logistic regression, rather
than calculate it from a 2 x 2 table as we did for
the previous simulations.
The mean of the 1,000 conditional logistic
regression ORs will be our estimate of the HR.
The simulation results are:
Case-cohort design with risk set
sampling.
Population Relative      Actual Dataset with       Augmented Dataset with
Effect Measure           almost rare disease       frequent disease
                         (3% in unexposed,         (15% in unexposed,
                         12% in exposed)           40% in exposed)
Odds Ratio               3.76                      3.76
Risk Ratio               3.43                      2.65
Rate Ratio               4.76                      3.87
Hazard Ratio             5.02 (OR=5.42)            4.19 (OR=4.43)
We were close, but the estimates appear to be biased.
The way it is really done is to use risk set sampling
followed by an actual Cox regression.
To adjust the standard error for the way the sampling was
done, there are three approaches:
Prentice
Self and Prentice
Barlow
In Stata,
Prentice:
    stcascoh, alpha(.18)    // risk set sampling
    stcox nickel, robust
Self and Prentice:
    stcascoh, alpha(.18)    // risk set sampling with log weights (_wSelPre)
    stcox nickel, robust offset(_wSelPre)
Barlow:
    stcascoh, alpha(.18)    // risk set sampling with log weights (_wBarlow)
    stcox nickel, robust offset(_wBarlow)
The simulation results are:
Case-cohort design with risk set
sampling (Prentice Method)
Population Relative      Actual Dataset with       Augmented Dataset with
Effect Measure           almost rare disease       frequent disease
                         (3% in unexposed,         (15% in unexposed,
                         12% in exposed)           40% in exposed)
Odds Ratio               3.76                      3.76
Risk Ratio               3.43                      2.65
Rate Ratio               4.76                      3.87
Hazard Ratio             5.02 (HR=5.08)            4.19
Estimates appear unbiased using this approach.