Conditional Logistic Regression

Chapter 5-17. Conditional Logistic Regression--Analyzing Dichotomous
Outcomes from Matched Case-Control Study Designs
Matching is a design strategy to eliminate, or control for, confounding. When the data are from a
matched case-control study design, conditional logistic regression is commonly used for the
analysis. It could also be used in a matched cohort study; however, cohort study data are
usually analyzed using a Cox regression or Poisson regression approach, since a time-at-risk
variable is available.
Review of Confounding
A confounding variable must have two associations (Rothman, 2002, p.108):
1) A confounder must be associated with the disease (either as a cause or as a proxy for a
cause, but not as an effect of the disease).
2) A confounder must be associated with exposure (imbalanced between the exposure
groups).
Diagrammatically, the two necessary associations for confounding are:

                   Confounder
                  /          \
       association            association
                /              \
        Exposure --------------> Disease
                confounded effect
There is also a third requirement.
A factor that is an effect of the exposure and an intermediate step in the causal pathway
from exposure to disease will have the above associations, but causal intermediates are
not confounders; they are part of the effect that we wish to study.
Thus, the third property of a confounder is as follows:
3) A confounder must not be an effect of the exposure.
_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Chapter 5-17 (revision 16 May 2010)
Motivation for Matching
The motivation for conducting a matched study is to control for confounding variables.
Matching removes one of the association arrows in the diagram.

                   Confounder
                  /          \
       association            association
                /              \
        Exposure --------------> Disease
                confounded effect

In a cohort study, an exposed subject is matched with one or more unexposed subjects on the
potential confounder (age, for example). Creating this balance removes the confounder-exposure association.
In a case-control study, a diseased subject is matched with one or more non-diseased subjects.
Creating this balance removes the confounder-disease association.
It was originally thought that matching enhanced validity, where validity is obtaining the correct
risk ratio or odds ratio. (Miettinen, 1970).
Later it was shown that the real advantage is greater efficiency in the statistical analysis. That is,
the matching may lead to a smaller p value and tighter confidence interval around the risk ratio or
odds ratio than would be achieved without matching. (Kupper et al, 1981).
Today, it is understood that for both cohort and case-control studies, matching can lead to either
a gain or a loss in efficiency. (Rothman and Greenland, 1998, p.148; Kupper et al, 1981; Smith
and Day, 1981; Thompson et al, 1982; Thomas and Greenland, 1983; Howe and Choi, 1983;
Greenland and Morgenstern, 1990).
Matching is clearly useful in certain well-defined circumstances, and clearly not useful in other
well-defined circumstances. In between, it is probably not worth the effort. A well-written
chapter on matching can be found in Rothman and Greenland (1998, Chapter 10 “Matching”)
and in Rothman, Greenland, and Lash (2008, Chapter 11, “Design Strategies to Improve Study
Accuracy”).
A Matched Design Requires a Matched Analysis
If matching was used, for either a cohort study or case-control study, one should perform a
matched analysis as well. This is particularly important in a matched case-control study, where if
not used, a selection bias can be introduced into the study, leading to invalid OR estimates.
(Rothman and Greenland, 1998, p. 147). This will be illustrated below.
Matched Analysis Using Stratification
Matched sample data can be analyzed using stratification, where the individual strata are each a
matched set.
For example, if we had a paired matched case-control study (1 case matched with 1 control)
involving 100 matched pairs (total n=200), the analysis would involve 100 strata, each stratum
containing the two subjects forming the matched pair. These data could be analyzed using the
Cochran-Mantel-Haenszel chi-square test for association based on the 100 strata. The summary
Mantel-Haenszel odds ratio would be computed as the estimate of effect.
Example
We will use the mi2.dta dataset (see box).
Dataset: mi2.dta (source: Kleinbaum and Klein, 2002, Chapter 8)
This file is a 1:1 matched case-control study in which n=78 subjects are formed into 39 matched
strata. Each stratum contains two subjects, one of whom is a case diagnosed with myocardial
infarction and the other is a matched control. Matching was done on age, race, sex, and hospital
status.
Codebook (mi2.dta)
outcome
    mi      myocardial infarction (1=presence, 0=absence)
predictors
    smk     smoker (1=current smoker, 0=not current smoker)
    sbp     systolic blood pressure (continuous)
    ecg     electrocardiogram abnormality (1=presence, 0=absence)
data management
    match   variable indicating subject’s matched stratum (range 1 to 39)
    person  subject identifier (unique #, one observation per subject)
Reading the data in,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on mi2.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\mi2.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use mi2, clear
Listing the first six lines of data with a separator between every two lines.
Data
Describe data
List data
by/if/in tab: Use a range of observations: From 1 to 6
Options tab: Separators
Place separators every N lines: 2
OK
list in 1/6 , sep(2)
     +---------------------------------------+
     | match   person   mi   smk   sbp   ecg |
     |---------------------------------------|
  1. |     1        1    1     0   160     1 |
  2. |     1        2    0     0   140     0 |
     |---------------------------------------|
  3. |     2        4    1     0   160     1 |
  4. |     2        5    0     0   140     0 |
     |---------------------------------------|
  5. |     3        7    1     0   160     0 |
  6. |     3        8    0     0   140     0 |
     +---------------------------------------+
Each pair of consecutive lines represents a matched pair (a matched stratum). A matched-pair
identifier, called “match” in this dataset, identifies each unique pair. This variable will
be needed by Stata.
We notice there is 1 case (mi = 1) and 1 control (mi = 0) in each matched stratum. The variable
smk is our exposure variable of interest. The two covariates, sbp and ecg, are potential
confounding variables.
Looking at the 2 × 2 table for each stratum (each value of match),
Statistics
Summaries, tables, & tests
Tables
Twoway tables with measures of association
Main tab: row variable: mi
column variable: smk
by/if/in tab: Repeat command by groups:
variables that define groups: match
OK
bysort match: tab mi smk
-> match = 1

           |    smk
        mi |         0 |     Total
-----------+-----------+----------
         0 |         1 |         1
         1 |         1 |         1
-----------+-----------+----------
     Total |         2 |         2

...

-> match = 17

           |          smk
        mi |         0          1 |     Total
-----------+----------------------+----------
         0 |         1          0 |         1
         1 |         0          1 |         1
-----------+----------------------+----------
     Total |         1          1 |         2

...
We find four possible permutations in the data:

                        smk        smk
                      present     absent
                         1           0       Total
    mi present  1        0           1         1
       absent   0        0           1         1
    * have n = 19 of this pattern (matches 1-16, 26-27, 31)

                        smk        smk
                      present     absent
                         1           0       Total
    mi present  1        1           0         1
       absent   0        1           0         1
    * have n = 3 of this pattern (matches 33, 38, 39)

                        smk        smk
                      present     absent
                         1           0       Total
    mi present  1        1           0         1
       absent   0        0           1         1
    * have n = 12 of this pattern (matches 17-25, 32, 34, 35)

                        smk        smk
                      present     absent
                         1           0       Total
    mi present  1        0           1         1
       absent   0        1           0         1
    * have n = 5 of this pattern (matches 28-30, 36, 37)
These tables follow the data layout for stratified case-control studies:

                Exposed    Unexposed    Total
    Cases         a_i         b_i        M_1i
    Controls      c_i         d_i        M_0i
    Total         N_1i        N_0i       T_i

where i denotes a specific stratum
To test for an association between exposure and disease (between smoking and MI), we can use
the Cochran-Mantel-Haenszel chi-square test (also called the Mantel-Haenszel chi-square test).
The formula is:

$$\chi^2_{CMH} = \frac{\left[\sum_i \left(a_i - \frac{N_{1i} M_{1i}}{T_i}\right)\right]^2}{\sum_i \frac{N_{1i} N_{0i} M_{1i} M_{0i}}{T_i^2 (T_i - 1)}}$$

which is a chi-square statistic with 1 degree of freedom.
Rothman’s formula (2002, p.162) is the square root of this, as he chose to use the standard
normal distribution for computing the p value (from the identity $\chi^2_{df=1} = z^2$).
The CMH chi-square combines the strata by summing each stratum's observed-minus-expected count of exposed cases, and its variance, before forming a single test statistic.
The Mantel-Haenszel summary odds ratio (also called pooled odds ratio) is given by (Rothman,
2002, p.156):

$$OR_{MH} = \frac{\sum_i a_i d_i / T_i}{\sum_i b_i c_i / T_i}$$

The MH odds ratio is simply a weighted average of all the stratum-specific odds ratios.
Computing the Mantel-Haenszel summary odds ratio and testing it for significance with the
Cochran-Mantel-Haenszel chi-square test,
...I can’t find mhodds in the menus...
mhodds mi smk match
Mantel-Haenszel estimate of the odds ratio
Comparing smk==1 vs. smk==0, controlling for match

note: only 17 of the 39 strata formed in this analysis contribute
      information about the effect of the explanatory variable

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.400000       2.88        0.0896         0.845521   6.812364
    ----------------------------------------------------------------
The note in this output states that only 17 of the 39 strata contributed to the analysis. The
other 22 strata (the 19 + 3 pairs in which the case and control have the same smoking status)
have a column of zeros in their 2 × 2 table, so there is no variability to be explained
(everyone in the stratum was a smoker, or everyone was not).
We see that the summary odds ratio is 2.4, with a p value of 0.0896.
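These numbers can be cross-checked by hand from the four matched-pair patterns counted earlier. The following is a minimal sketch in Python (not part of the Stata workflow; the variable names are illustrative assumptions) that applies the CMH chi-square and Mantel-Haenszel odds ratio formulas stratum by stratum.

```python
from math import erfc, sqrt

# Each pattern is ((a, b, c, d), number of pairs), where for one matched pair
# a = exposed case, b = unexposed case, c = exposed control, d = unexposed control.
patterns = [
    ((0, 1, 0, 1), 19),  # neither the case nor the control smokes
    ((1, 0, 1, 0), 3),   # both smoke
    ((1, 0, 0, 1), 12),  # only the case smokes
    ((0, 1, 1, 0), 5),   # only the control smokes
]

num_or = den_or = 0.0    # numerator and denominator of the MH odds ratio
dev_sum = var_sum = 0.0  # summed (observed - expected) and summed variance
informative = 0          # strata that contribute a nonzero variance

for (a, b, c, d), n_pairs in patterns:
    t = a + b + c + d
    n1, n0 = a + c, b + d  # exposed and unexposed totals
    m1, m0 = a + b, c + d  # case and control totals
    num_or += n_pairs * a * d / t
    den_or += n_pairs * b * c / t
    dev_sum += n_pairs * (a - n1 * m1 / t)
    var = n1 * n0 * m1 * m0 / (t ** 2 * (t - 1))
    var_sum += n_pairs * var
    if var > 0:
        informative += n_pairs

or_mh = num_or / den_or
chi2_cmh = dev_sum ** 2 / var_sum
p_value = erfc(sqrt(chi2_cmh / 2))  # upper tail of a chi-square with 1 df

print(informative, round(or_mh, 4), round(chi2_cmh, 2), round(p_value, 4))
```

Only the 12 + 5 = 17 discordant pairs add anything to the sums, which is the point of the note in the mhodds output.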
We can also get these statistics using
Statistics
Observational/Epi. analysis
Tables for epidemiologists
Tabulate odds of failure by category
Main tab: Case exposed variable: mi
Control exposed variable: smk
Report odds ratios adjusted for variables: match
OK
tabodds mi smk ,adjust(match)
Mantel-Haenszel odds ratios adjusted for match

---------------------------------------------------------------------------
         smk |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
-------------+-------------------------------------------------------------
           0 |    1.000000          .            .             .          .
           1 |    2.400000       2.88       0.0896      0.845521   6.812364
---------------------------------------------------------------------------
Score test for trend of odds:  chi2(1)  =     2.88
                               Pr>chi2  =   0.0896
McNemar Test
The McNemar test can be found in any introductory statistics textbook.
It turns out that for a 1:1 matched design, when we do not adjust for covariates beyond the
matching variables, the Cochran-Mantel-Haenszel chi-square test is identical to the McNemar
test.
The McNemar test (also called the McNemar test for the significance of changes) is popularly
used to analyze categorical data in a “before and after” design, where each subject is used as its
own control (Siegel and Castellan, 1988, p.75).
The data layout for the McNemar test is

Data Layout for McNemar Test
                     After
                   +       -
    Before   +     A       B
             -     C       D

The McNemar test statistic is

$$\chi^2 = \frac{(B - C)^2}{B + C}$$

with df = 1
We see that it only uses information in the “discordant pairs” (+ on one, - on the other), or cells
where the Before and After differ, and ignores the other cells. This was the case with the
Cochran-Mantel-Haenszel chi-square test above, as well, where it ignored the permutations for
which both the case and control were smokers, or both were non-smokers.
The odds ratio is computed from this data layout as B/C.
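Why the two tests coincide in a 1:1 design can be seen by substituting a matched pair into the CMH formula. The sketch below assumes the case-control reading of the layout (B = pairs in which only the case is exposed, C = pairs in which only the control is exposed) and uses the stratum notation a_i, N_1i, N_0i, M_1i, M_0i, T_i from the stratified layout.

```latex
% Each matched pair is a stratum with T_i = 2 and M_{1i} = M_{0i} = 1.
% Concordant pairs have N_{1i} N_{0i} = 0 and a_i = N_{1i} M_{1i} / T_i,
% so they add nothing to either sum. Discordant pairs have N_{1i} = N_{0i} = 1:
\begin{align*}
a_i - \frac{N_{1i} M_{1i}}{T_i} &= \pm\tfrac{1}{2}, \qquad
\frac{N_{1i} N_{0i} M_{1i} M_{0i}}{T_i^2 (T_i - 1)} = \tfrac{1}{4}, \\
\chi^2_{CMH} &= \frac{\left[\tfrac{1}{2}(B - C)\right]^2}{\tfrac{1}{4}(B + C)}
             = \frac{(B - C)^2}{B + C}, \\
OR_{MH} &= \frac{B \cdot \tfrac{1}{2}}{C \cdot \tfrac{1}{2}} = \frac{B}{C}.
\end{align*}
```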
Small Expected Frequencies
The chi-square test requires a sufficiently large sample size to provide an accurate p value. The
rule-of-thumb for the McNemar test version of the chi-square test is that when (B + C) < 10, the
exact form of the test should be used (Siegel and Castellan, 1988, p.79). Since the data are
paired, the Fisher’s exact test is not appropriate, and so the binomial test is used. In Stata, this
binomial test is labeled “Exact McNemar”.
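As a rough illustration of both versions of the test, the sketch below (Python, outside the Stata workflow) uses the discordant-pair counts from the mi2 example, B = 12 and C = 5. The exact p value is computed with one common two-sided convention, doubling the smaller binomial tail; that convention is an assumption here, not something taken from the Stata documentation.

```python
from math import comb, erfc, sqrt

b, c = 12, 5  # discordant pairs: only the case exposed, only the control exposed

# Large-sample McNemar chi-square with 1 df, and its p value.
chi2 = (b - c) ** 2 / (b + c)
p_chi2 = erfc(sqrt(chi2 / 2))

# Exact version: under the null, each discordant pair is equally likely to be
# of either type, so the smaller count is Binomial(b + c, 0.5).
n = b + c
k = min(b, c)
tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
p_exact = min(1.0, 2 * tail)

odds_ratio = b / c

print(round(chi2, 2), round(p_chi2, 4), round(p_exact, 4), odds_ratio)
```

With B + C = 17 here, the rule of thumb would accept the chi-square version, but the exact p value is noticeably larger than the large-sample one.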
To compute the McNemar test, Stata expects to find the “before” and “after”, the case and
control pair, on the same line. We can get Stata to reshape the dataset into this form using the
following commands in a do-file:
* -- re-format for McNemar test
use mi2, clear
keep match mi smk                   // easier to follow if limit to needed variables
list in -10/l                       // look at 10th from last to last
reshape wide smk , i(match) j(mi)   // side by side, instead of line by line
list in -5/l                        // look at 5th from last to last
Before the reshape, the data are in “long” format, with cases and controls on separate consecutive
lines.
     +------------------+
     | match   mi   smk |
     |------------------|
 69. |    35    1     1 |
 70. |    35    0     0 |
 71. |    36    1     0 |
 72. |    36    0     1 |
 73. |    37    1     0 |
     |------------------|
 74. |    37    0     1 |
 75. |    38    1     1 |
 76. |    38    0     1 |
 77. |    39    1     1 |
 78. |    39    0     1 |
     +------------------+
After the reshape, the data are in “wide” format, with each case and its control on the same line
(as if it were the same person), the smk0 variable being the smoking status of the control and the
smk1 variable being the smoking status of the case (as if it were pretest and posttest data).
     +---------------------+
     | match   smk0   smk1 |
     |---------------------|
 35. |    35      0      1 |
 36. |    36      1      0 |
 37. |    37      1      0 |
 38. |    38      1      1 |
 39. |    39      1      1 |
     +---------------------+
Calculating the McNemar test on these reformatted data,
Statistics
Observational/Epi. analysis
Tables for epidemiologists
Matched case-control studies
Main tab: Exposed case variable: smk1
Exposed control variable: smk0
OK
mcc smk1 smk0
                 | Controls               |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
         Exposed |         3          12  |         15
       Unexposed |         5          19  |         24
-----------------+------------------------+------------
           Total |         8          31  |         39

McNemar's chi2(1) =      2.88    Prob > chi2 = 0.0896
Exact McNemar significance probability       = 0.1435

Proportion with factor
        Cases       .3846154
        Controls    .2051282     [95% Conf. Interval]
                   ---------     --------------------
        difference  .1794872     -.0455585   .4045329
        ratio          1.875      .8966452   3.920865
        rel. diff.  .2258065      -.003563   .4551759

        odds ratio       2.4      .7870459   8.695981   (exact)
Comparing this with the output from above,
Mantel-Haenszel estimate of the odds ratio
Comparing smk==1 vs. smk==0, controlling for match

note: only 17 of the 39 strata formed in this analysis contribute
      information about the effect of the explanatory variable

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.400000       2.88        0.0896         0.845521   6.812364
    ----------------------------------------------------------------
we see that the odds ratios from the two approaches are identical, the chi-square statistics are
identical, and the p values are identical.
This verifies that the CMH chi-square test using the matched pairs as the stratification variable is
identical to the McNemar test, and that the MH summary odds ratio is identical to the odds ratio
computed from the discordant pairs (the off diagonal) of the McNemar 2 × 2 table.
Conditional Logistic Regression
If the data were not matched, we could test the smoker-MI association using ordinary logistic
regression (also called unconditional logistic regression).
Since they are matched, we must use the matched version, which is called conditional logistic
regression.
Now, let’s compute the conditional logistic regression model for comparison.
Bringing the original data back in, which are in long format,
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use mi2, clear
Requesting a conditional logistic regression,
Statistics
Categorical outcomes
Conditional logistic regression
Model: Dependent variable: mi
Independent variables: smk
Group variable: match
OK
clogit mi smk, group(match)
Conditional (fixed-effects) logistic regression   Number of obs   =         78
                                                  LR chi2(1)      =       2.97
                                                  Prob > chi2     =     0.0848
Log likelihood = -25.547795                       Pseudo R2       =     0.0549

------------------------------------------------------------------------------
          mi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         smk |   .8754687   .5322906     1.64   0.100    -.1678018    1.918739
------------------------------------------------------------------------------
Unlike the “logistic” command, clogit reports the regression coefficient by default, rather
than the odds ratio.
This coefficient has to be exponentiated to convert it to an odds ratio, exp(coef)=OR. This is
done by specifying the “or” option.
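For a single binary exposure in a 1:1 matched design, the conditional estimate has a closed form, which makes the exponentiation easy to check by hand. The sketch below (Python, outside the Stata workflow) assumes the discordant-pair counts B = 12 and C = 5 from the mi2 example and a Wald interval on the log odds scale; the variable names are illustrative.

```python
from math import exp, log, sqrt

b, c = 12, 5  # pairs with only the case exposed, only the control exposed

coef = log(b / c)         # conditional log odds ratio
se = sqrt(1 / b + 1 / c)  # its large-sample standard error
z = coef / se

# 95% Wald limits on the log scale, then exponentiated to the odds ratio scale.
z_crit = 1.959964
lo, hi = coef - z_crit * se, coef + z_crit * se
odds_ratio, or_lo, or_hi = exp(coef), exp(lo), exp(hi)

print(round(coef, 7), round(se, 7), round(z, 2))
print(round(odds_ratio, 4), round(or_lo, 6), round(or_hi, 6))
```

The confidence limits are exponentiated in the same way as the coefficient, which is why the interval for the odds ratio is not symmetric around 2.4.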
Let’s try again.
Requesting a conditional logistic regression with the OR display option,
Statistics
Categorical outcomes
Conditional logistic regression
Model tab: Dependent variable: mi
Independent variables: smk
Group variable: match
Reporting tab: Report odds ratio.
OK
clogit mi smk, group(match) or
Conditional (fixed-effects) logistic regression   Number of obs   =         78
                                                  LR chi2(1)      =       2.97
                                                  Prob > chi2     =     0.0848
Log likelihood = -25.547795                       Pseudo R2       =     0.0549

------------------------------------------------------------------------------
          mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         smk |        2.4   1.277498     1.64   0.100     .8455214    6.812364
------------------------------------------------------------------------------
The odds ratios and CIs are identical to previous approaches, but a different test statistic
(Likelihood Ratio Chi-Square) is used, which is slightly different than the CMH Chi-Square.
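The likelihood ratio statistic can be reproduced from the pair counts. The sketch below (Python, outside the Stata workflow) writes the conditional log likelihood for 1:1 pairs with one binary exposure. It assumes that pairs with the same exposure contribute log(1/2) each regardless of the coefficient, a convention that reproduces the log likelihood printed above; the function and variable names are illustrative.

```python
from math import erfc, exp, log, sqrt

concordant, case_only, control_only = 22, 12, 5  # pair counts in the mi2 example

def cond_loglik(beta):
    """Conditional log likelihood for 1:1 matched pairs, one binary exposure."""
    log_denominator = log(1 + exp(beta))
    return (concordant * log(0.5)
            + case_only * (beta - log_denominator)  # only the case exposed
            + control_only * (-log_denominator))    # only the control exposed

ll_fit = cond_loglik(log(case_only / control_only))  # at the estimate, log(2.4)
ll_null = cond_loglik(0.0)                           # coefficient fixed at 0

lr_chi2 = 2 * (ll_fit - ll_null)
p_value = erfc(sqrt(lr_chi2 / 2))  # upper tail of a chi-square with 1 df
pseudo_r2 = 1 - ll_fit / ll_null

print(round(ll_fit, 6), round(lr_chi2, 2), round(p_value, 4), round(pseudo_r2, 4))
```

The concordant pairs add the same constant to both log likelihoods, so they cancel out of the likelihood ratio, just as they drop out of the CMH and McNemar statistics.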
The conditional logistic regression is similar to CMH Chi-square, in that it stratifies on the
matched pairs (or more generally, matched sets 1:1 match, 1:2 match, etc). It then uses an
approach called conditional maximum likelihood estimation.
The conditional maximum likelihood approach is necessary, since the model includes a large
number of dummy variables (one less than the number of matched sets) relative to the sample
size. For the example dataset, there are 39-1, or 38, dummy variables included in the model
(behind the scenes) in addition to the smk main effect term. The conditional maximum likelihood
estimation method is able to handle this without “overfitting”. (Kleinbaum and Klein, 2002, pp.
235-238)
Something similar can be done with unconditional logistic regression, which uses
“unconditional” maximum likelihood estimation. This is illustrated in the following model,
where dummy variables are included to form the matched strata.
We will use the “xi:” facility, which creates dummy variables for each variable that is preceded
by “i.”.
xi: logistic mi smk i.match // version 10
* <or>
logistic mi smk i.match // version 11
Logistic regression                               Number of obs   =         78
                                                  LR chi2(39)     =       5.94
                                                  Prob > chi2     =     1.0000
Log likelihood = -51.095591                       Pseudo R2       =     0.0549

------------------------------------------------------------------------------
          mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         smk |   5.759999    4.33597     2.33   0.020     1.317229    25.18742
             |
       match |
          2  |          1          2     0.00   1.000     .0198425    50.39681
          3  |          1          2     0.00   1.000     .0198425    50.39681
          4  |          1          2     0.00   1.000     .0198425    50.39681
          5  |          1          2     0.00   1.000     .0198425    50.39681
          6  |          1          2     0.00   1.000     .0198425    50.39681
          7  |          1          2     0.00   1.000     .0198425    50.39681
          8  |          1          2     0.00   1.000     .0198425    50.39681
          9  |          1          2     0.00   1.000     .0198425    50.39681
         10  |          1          2     0.00   1.000     .0198425    50.39681
         11  |          1          2     0.00   1.000     .0198425    50.39681
         12  |          1          2     0.00   1.000     .0198425    50.39681
         13  |          1          2     0.00   1.000     .0198425    50.39681
         14  |          1          2     0.00   1.000     .0198425    50.39681
         15  |          1          2     0.00   1.000     .0198425    50.39681
         16  |          1          2     0.00   1.000     .0198425    50.39681
         17  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         18  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         19  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         20  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         21  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         22  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         23  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         24  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         25  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         26  |          1          2     0.00   1.000     .0198425    50.39681
         27  |          1          2     0.00   1.000     .0198425    50.39681
         28  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         29  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         30  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         31  |          1          2     0.00   1.000     .0198425    50.39681
         32  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         33  |   .1736111   .3710028    -0.82   0.413     .0026338    11.44392
         34  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         35  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         36  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         37  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
         38  |   .1736111   .3710028    -0.82   0.413     .0026338    11.44392
         39  |   .1736111   .3710028    -0.82   0.413     .0026338    11.44392
------------------------------------------------------------------------------
We see that the odds ratio for smk is much larger. In fact, when pair matching (1:1 matching)
is used, it is exactly the square of the conditional odds ratio (Kleinbaum and Klein, 2002,
p.236):
unconditional OR = (conditional OR)^2
display 2.4^2
which returns
5.76
exactly the odds ratio in the unconditional logistic model, where an indicator variable is included
for each matched pair, except one left out as the referent.
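The squaring can be traced to how the pair-specific intercepts are estimated. The following is a sketch of the argument in notation that is assumed here rather than taken from the source: $\alpha_j$ is the intercept for pair $j$, $\beta$ is the log odds ratio for smk, $\operatorname{expit}(x) = e^x / (1 + e^x)$, and B and C are the numbers of pairs in which only the case or only the control is exposed.

```latex
% A pair with only the case exposed contributes
%   L_j = expit(alpha_j + beta) * [1 - expit(alpha_j)],
% which is maximized over alpha_j at alpha_j = -beta/2.
\begin{align*}
\max_{\alpha_j} L_j &= \operatorname{expit}(\beta/2)^2
  && \text{(only the case exposed, } B \text{ pairs)} \\
\max_{\alpha_j} L_j &= \operatorname{expit}(-\beta/2)^2
  && \text{(only the control exposed, } C \text{ pairs)} \\
\max_{\alpha_j} L_j &= \tfrac{1}{4}
  && \text{(concordant pairs, free of } \beta\text{)} \\
\ell_{\text{profile}}(\beta) &= 2\left[ B \log \operatorname{expit}(\beta/2)
  + C \log \operatorname{expit}(-\beta/2) \right] + \text{const} \\
\hat\beta_{\text{uncond}} &= 2 \log(B/C)
  \;\Longrightarrow\; \widehat{OR}_{\text{uncond}} = (B/C)^2 = 2.4^2 = 5.76
\end{align*}
```

The same doubling shows up in the two outputs: the unconditional log likelihood, -51.095591, is twice the conditional one, -25.547795.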
The unconditional OR is an overestimate. The conditional OR is the correct result. (Kleinbaum
and Klein, 2002, p.236). In other words, putting the indicator variables into an ordinary logistic
regression is an incorrect analysis (only shown here for illustration).
The advantage of using conditional logistic regression over the McNemar test or the CMH chi-square test is that covariates that are not among the matching variables can be included in the model.
Statistics
Categorical outcomes
Conditional logistic regression
Model tab: Dependent variable: mi
Independent variables: smk sbp ecg
Group variable: match
Reporting tab: Report odds ratio.
OK
clogit mi smk sbp ecg, group(match) or
Conditional (fixed-effects) logistic regression   Number of obs   =         78
                                                  LR chi2(3)      =      12.56
                                                  Prob > chi2     =     0.0057
Log likelihood = -20.752435                       Pseudo R2       =     0.2323

------------------------------------------------------------------------------
          mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         smk |   2.666869    1.76445     1.48   0.138     .7291737    9.753765
         sbp |   1.036602   .0173276     2.15   0.032     1.003191    1.071126
         ecg |   4.369953   3.953285     1.63   0.103     .7420546    25.73461
------------------------------------------------------------------------------
R-to-1 matching
We can also use any number of controls for each case, the R-to-1 matches simply being their own
strata in the conditional logistic regression model. It is not even necessary that the same number
of controls be used for every matched set (for some it could be 3:1, for others 2:1 or 1:1). This
occurs in practice, since it is not always possible to find the full complement of R controls in the
same matching category for some cases. (Kleinbaum and Klein, 2002, p.231)
A 2:1 match is illustrated with the mi.dta file.
Reading the data in,
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use mi, clear
Listing the first nine lines of data with a separator between every three
lines.
Data
Describe data
List data
by/if/in tab: Use a range of observations: From 1 to 9
Options tab: Separators
Place separators every N lines: 3
OK
list in 1/9 , sep(3)
* <or>
list in 1/9 , sepby(match) // better if not always have 2 controls
     +---------------------------------------+
     | match   person   mi   smk   sbp   ecg |
     |---------------------------------------|
  1. |     1        1    1     0   160     1 |
  2. |     1        2    0     0   140     0 |
  3. |     1        3    0     0   120     0 |
     |---------------------------------------|
  4. |     2        4    1     0   160     1 |
  5. |     2        5    0     0   140     0 |
  6. |     2        6    0     0   120     0 |
     |---------------------------------------|
  7. |     3        7    1     0   160     0 |
  8. |     3        8    0     0   140     0 |
  9. |     3        9    0     0   120     0 |
     +---------------------------------------+
We cannot analyze these data with the McNemar test (which requires a 1:1 match). We can use
either the CMH chi-square or conditional logistic regression; either is appropriate. (Kleinbaum and
Klein, 2002, p.231)
However, the stratified approach (CMH chi-square) is known not to agree exactly with
conditional logistic regression, except for the 1:1 matched design.
mhodds mi smk match
clogit mi smk , group(match) or
Mantel-Haenszel estimate of the odds ratio
Comparing smk==1 vs. smk==0, controlling for match

note: only 21 of the 39 strata formed in this analysis contribute
      information about the effect of the explanatory variable

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.200000       3.43        0.0641         0.934342   5.180115
    ----------------------------------------------------------------

Conditional (fixed-effects) logistic regression   Number of obs   =        117
                                                  LR chi2(1)      =       3.37
                                                  Prob > chi2     =     0.0665
Log likelihood = -41.162776                       Pseudo R2       =     0.0393

------------------------------------------------------------------------------
          mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         smk |   2.324173   1.083284     1.81   0.070     .9322409    5.794406
------------------------------------------------------------------------------
Consistent with what is known from statistical theory, we no longer get identical odds ratios
between the Mantel-Haenszel approach and conditional logistic regression, for this 2:1 matched
design. That’s not a problem—it is correct to use either approach.
How a Nonmatched Analysis in a Matched Case-Control Study Can Introduce Selection
Bias
We started out with Rothman and Greenland’s claim (1998, p.147):
If matching was used, for either a cohort study or case-control study, one should perform
a matched analysis as well. This is particularly important in a matched case-control
study, where if not used, a selection bias can be introduced into the study, leading to
invalid OR estimates.
We will now see how this occurs with Rothman and Greenland’s (1998, p.152) hypothetical data
example:
This example shows how a crude analysis provides a biased estimate of effect in a case-control
study when the matching variable is correlated with the exposure. Being
correlated, the matching variable is basically a surrogate for exposure. When we match,
we make the cases and controls look similar on the matching variable, which
simultaneously makes the cases and controls look similar on the exposure variable (since
we matched on a surrogate for the exposure variable).
In the population, the data look like:

                        Males                      Females
                 Exposed   Unexposed        Exposed   Unexposed
    Disease Yes      450          10             50          90
            No    89,550       9,990          9,950      89,910
    N             90,000      10,000         10,000      90,000
                       RR = 5                     RR = 5
The population crude risk ratio is:
RR = (Disease/Exposed) / (Disease/Unexposed)
= [(450+50)/(90,000+10,000)] / [(10+90)/(10,000+90,000)]
=5
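The stratum-specific and crude measures can be checked directly from the table. The sketch below (Python, outside the Stata workflow; the dictionary layout and function names are illustrative assumptions) computes the risk ratio and the odds ratio for each sex and for the collapsed table, using exact fractions to avoid rounding.

```python
from fractions import Fraction

# (diseased, total) among exposed and unexposed, by sex, from the population table
population = {
    "males":   {"exposed": (450, 90_000), "unexposed": (10, 10_000)},
    "females": {"exposed": (50, 10_000),  "unexposed": (90, 90_000)},
}

def risk_ratio(exposed, unexposed):
    (d1, n1), (d0, n0) = exposed, unexposed
    return Fraction(d1, n1) / Fraction(d0, n0)

def odds_ratio(exposed, unexposed):
    (d1, n1), (d0, n0) = exposed, unexposed
    return Fraction(d1 * (n0 - d0), d0 * (n1 - d1))

# Collapse over sex to get the crude table.
crude = {
    group: tuple(sum(population[sex][group][k] for sex in population) for k in (0, 1))
    for group in ("exposed", "unexposed")
}

rr = {sex: risk_ratio(**cells) for sex, cells in population.items()}
rr["crude"] = risk_ratio(**crude)
odds = {sex: odds_ratio(**cells) for sex, cells in population.items()}
odds["crude"] = odds_ratio(**crude)

for name in rr:
    print(name, float(rr[name]), round(float(odds[name]), 6))
```

Because the disease is rare, each odds ratio is only slightly larger than the corresponding risk ratio of 5.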
Reading the data in,
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use RothmanP152, clear
and listing
list, abbrev(15)
count
     +----------------------------------------+
     | female   disease   exposed   cellcount |
     |----------------------------------------|
  1. |      0         1         1         450 |
  2. |      0         1         0          10 |
  3. |      0         0         1       89550 |
  4. |      0         0         0        9990 |
  5. |      1         1         1          50 |
     |----------------------------------------|
  6. |      1         1         0          90 |
  7. |      1         0         1        9950 |
  8. |      1         0         0       89910 |
     +----------------------------------------+
. count
8
We see that there are 8 observations, from the count command. The variable, cellcount, is the
cell count from the table from the Rothman and Greenland text.
Expanding to create one observation per subject,
expand cellcount
drop cellcount
count
. expand cellcount
(199992 observations created)
. count
200000
The count command informs us that there are now 200,000 observations in the data editor.
Verifying the data were entered correctly,
bysort female: tab disease exposed
-> female = 0

           |        exposed
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |     9,990     89,550 |    99,540
         1 |        10        450 |       460
-----------+----------------------+----------
     Total |    10,000     90,000 |   100,000

-> female = 1

           |        exposed
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |    89,910      9,950 |    99,860
         1 |        90         50 |       140
-----------+----------------------+----------
     Total |    90,000     10,000 |   100,000
which agrees with the table from the text.
Verifying that the stratum-specific RR’s are 5, and that the crude RR is 5 as well,
cs disease exposed , by(female)
          female |       RR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |          5     2.672822    9.353411            9
               1 |          5     3.540793    7.060566            9
-----------------+-------------------------------------------------
           Crude |          5     4.034626    6.196361
    M-H combined |          5     3.496972    7.149042
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =    0.000  Pr>chi2 = 1.0000
We will be fitting logistic regression models to these data, which compute odds ratios, ORs,
rather than relative risks, RRs. To see what these are,
cc disease exposed , by(female)
          female |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |   5.020101     2.703308    10.54258        8.955 (exact)
               1 |   5.020101      3.47729    7.176783        8.955 (exact)
-----------------+-------------------------------------------------
           Crude |   5.020101     4.041642    6.285267              (exact)
    M-H combined |   5.020101     3.508934    7.182071
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.00  Pr>chi2 = 1.0000

                   Test that combined OR = 1:
                                Mantel-Haenszel chi2(1) =     96.37
                                                Pr>chi2 =    0.0000
Fitting a logistic regression model to these data, to see the stratum-specific estimates
logistic disease exposed female
Logistic regression                               Number of obs   =     200000
                                                  LR chi2(2)      =     291.91
                                                  Prob > chi2     =     0.0000
Log likelihood = -3938.6321                       Pseudo R2       =     0.0357

------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   5.020101   .7764691    10.43   0.000      3.70728    6.797817
      female |          1   .1363785    -0.00   1.000     .7654457    1.306428
------------------------------------------------------------------------------
Obtaining the crude estimates (not stratified by gender)
logistic disease exposed
Logistic regression                               Number of obs   =     200000
                                                  LR chi2(1)      =     291.91
                                                  Prob > chi2     =     0.0000
Log likelihood = -3938.6321                       Pseudo R2       =     0.0357

------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   5.020101   .5503828    14.72   0.000     4.049396    6.223498
------------------------------------------------------------------------------
Let these data be the population from which we will sample for this illustration.

                         Males                      Females
Disease          Exposed   Unexposed        Exposed   Unexposed
Yes                  450          10             50          90
No                89,550       9,990          9,950      89,910
N                 90,000      10,000         10,000      90,000
                        RR = 5                     RR = 5
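As a quick check of the population table, the stratum-specific risk ratios and odds ratios can be reproduced by hand. The following is only a sketch of the arithmetic, using Stata's display command on the cell counts given in the table.

```stata
* Risk ratio = risk in exposed / risk in unexposed
display "RR males   = " %6.4f (450/90000)/(10/10000)
display "RR females = " %6.4f (50/10000)/(90/90000)

* Odds ratio = (exposed cases x unexposed noncases) /
*              (unexposed cases x exposed noncases)
display "OR males   = " %6.4f (450*9990)/(10*89550)
display "OR females = " %6.4f (50*89910)/(90*9950)
```

Both risk ratios equal 5 and both odds ratios are about 5.02. The OR is slightly larger than the RR because the disease is rare in every cell but not vanishingly so.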
Now, we will conduct a case-control study, where we do not match.
                         Males                      Females
Disease          Exposed   Unexposed        Exposed   Unexposed
Yes                  450          10             50          90
No                89,550       9,990          9,950      89,910
N                 90,000      10,000         10,000      90,000
                        RR = 5                     RR = 5
Displaying the “population” 2 × 2 table,
tab disease exposed

           |        exposed
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |    99,900     99,500 |   199,400
         1 |       100        500 |       600
-----------+----------------------+----------
     Total |   100,000    100,000 |   200,000
We will randomly sample 600 controls, equal to the number of available cases (600).
set seed 777                        // so can replicate sample
sample 600 if disease==0, count     // observations with disease==1 are all kept
Seeing what our sample now looks like,
tab disease exposed

           |        exposed
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |       321        279 |       600    <- a sample of first row above
         1 |       100        500 |       600    <- same as second row above
-----------+----------------------+----------
     Total |       421        779 |     1,200
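The crude odds ratio can already be read off this 2 × 2 table by hand, as the cross-product ratio. The sketch below does the arithmetic for both the population table and the sample, and then passes the sample counts to the immediate command cci (argument order: exposed cases, unexposed cases, exposed controls, unexposed controls).

```stata
* Crude OR in the population: (500 x 99,900) / (100 x 99,500)
display "Population crude OR = " %8.6f (500*99900)/(100*99500)

* Crude OR in the unmatched sample: (500 x 321) / (100 x 279)
display "Sample crude OR     = " %8.6f (500*321)/(100*279)

* Same sample table via the immediate form of cc
cci 500 100 279 321
```

The two display lines give 5.020101 and 5.752688.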
Looking at the odds ratios,
cc disease exposed , by(female)

          female |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |    5.60241      2.616844    13.00143    3.364865 (exact)
               1 |    5.37037      3.127288    9.271097    5.869565 (exact)
-----------------+-------------------------------------------------
           Crude |   5.752688       4.36561     7.59901             (exact)
    M-H combined |   5.454921      3.585552    8.298908
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.01  Pr>chi2 = 0.9255

                   Test that combined OR = 1:
                                Mantel-Haenszel chi2(1) =     73.12
                                                Pr>chi2 =    0.0000
Or, fitting logistic regression models to these data to obtain the same estimates,
logistic disease exposed if female==0    // male stratum
logistic disease exposed if female==1    // female stratum
logistic disease exposed                 // crude
logistic disease exposed female          // combined

. logistic disease exposed if female==0    // male stratum
Logistic regression                               Number of obs   =        740
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |    5.60241   2.084942     4.63   0.000     2.701465    11.61851
------------------------------------------------------------------------------

. logistic disease exposed if female==1    // female stratum
Logistic regression                               Number of obs   =        460
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |    5.37037   1.399316     6.45   0.000     3.222651    8.949427
------------------------------------------------------------------------------

. logistic disease exposed    // crude
Logistic regression                               Number of obs   =       1200
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   5.752688   .7866572    12.80   0.000       4.4002     7.52089
------------------------------------------------------------------------------

. logistic disease exposed female    // combined
Logistic regression                               Number of obs   =       1200
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   5.446019   1.160892     7.95   0.000     3.586198    8.270355
      female |   .9336106   .1920174    -0.33   0.738      .623875    1.397121
------------------------------------------------------------------------------
We get a result close to the population estimate (OR = 5.02), differing only because of sampling
variability. (A Monte Carlo simulation would be required to verify that the long-run average ORs
match the population ORs.)
That is, when a matched case-control study design is NOT used, the result does not appear to be biased.
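A Monte Carlo check of this kind might be sketched as follows. This is not part of the original analysis: the program name ccsim, the choice of 200 replications, and rebuilding the population from the cell counts with input (rather than loading the dataset) are all assumptions made for illustration.

```stata
* Rebuild the population from the cell counts in the population table
clear all
input female exposed disease cellcount
0 1 1   450
0 0 1    10
0 1 0 89550
0 0 0  9990
1 1 1    50
1 0 1    90
1 1 0  9950
1 0 0 89910
end
expand cellcount
drop cellcount
tempfile pop
save `pop'

* One replication: draw 600 unmatched controls, fit the adjusted model
program define ccsim, rclass          // hypothetical helper name
    use "`1'", clear
    sample 600 if disease==0, count   // all 600 cases are kept
    logistic disease exposed female
    return scalar lnor = _b[exposed]  // log odds ratio for exposure
end

set seed 777
simulate lnor = r(lnor), reps(200) nodots: ccsim "`pop'"
generate or = exp(lnor)
summarize lnor or
display "Population ln(OR) = " %6.4f ln(5.020101)
```

Because the sampling distribution of an OR is right-skewed, it is probably more informative to compare the mean of lnor with the population log odds ratio than to compare the mean of or with 5.02.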
Next we will conduct a matched case-control study, matching on gender.
Starting with the full dataset,
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use RothmanP152, clear
expand cellcount
drop cellcount
First, take a close look at the data. Notice that gender is a close proxy for exposure in this dataset
(almost all males are exposed, while almost all females are unexposed).
                         Males                      Females
Disease          Exposed   Unexposed        Exposed   Unexposed
Yes                  450          10             50          90
No                89,550       9,990          9,950      89,910
N                 90,000      10,000         10,000      90,000
                        RR = 5                     RR = 5
This strong association can be revealed with a Pearson correlation coefficient.
corr
(obs=200000)

             |   female  disease  exposed
-------------+---------------------------
      female |   1.0000
     disease |  -0.0293   1.0000
     exposed |  -0.8000   0.0366   1.0000
We see that the female-exposed association is strong (r = -0.80).
When we match on gender, then, we will introduce a selection bias if a nonmatched analysis is
performed, but the bias will not exist if a matched analysis is performed (the point of this
illustration).
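The size of this selection bias can be anticipated from the population table alone. The sketch below assumes controls are drawn to mirror the gender split of the cases (450 + 10 = 460 males and 50 + 90 = 140 females) and computes the exposure mix we would expect among such controls. The scalar names are arbitrary.

```stata
* Exposure prevalence among noncases, by gender (population table)
scalar p_male   = 89550/(89550 + 9990)
scalar p_female = 9950/(9950 + 89910)

* Expected exposed controls when 460 male and 140 female controls are drawn
scalar exp_ctrl = 460*p_male + 140*p_female
display "Expected exposed controls   = " %6.1f exp_ctrl
display "Expected unexposed controls = " %6.1f (600 - exp_ctrl)

* Approximate expected crude OR, using all 600 cases (500 exposed, 100 unexposed)
display "Approx. expected crude OR   = " %5.2f (500*(600 - exp_ctrl))/(100*exp_ctrl)
```

Matching on gender pulls the controls' exposure prevalence up toward that of the cases (about 71% exposed, versus about 50% among all noncases). A crude analysis is therefore expected to give an OR of roughly 2 rather than 5.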
We will conduct a “frequency match”, rather than a one-to-one match, on gender.
Seeing what the gender distribution is for cases,
tab disease female, row

           |        female
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |    99,540     99,860 |   199,400
           |     49.92      50.08 |    100.00
-----------+----------------------+----------
         1 |       460        140 |       600
           |     76.67      23.33 |    100.00
-----------+----------------------+----------
     Total |   100,000    100,000 |   200,000
           |     50.00      50.00 |    100.00
We see 140 females and 460 males in the diseased group.
Sampling 140 females and 460 males from the control pool (the non-disease row of the
population table)
set seed 777
sample 140 if disease==0 & female==1, count    // female controls
sample 460 if disease==0 & female==0, count    // male controls
tab disease female

           |        female
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |       460        140 |       600
         1 |       460        140 |       600
-----------+----------------------+----------
     Total |       920        280 |     1,200
Performing a nonmatched analysis of these data,
cc disease exposed
logistic disease exposed
. cc disease exposed
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |       500         100  |        600       0.8333
        Controls |       436         164  |        600       0.7267
-----------------+------------------------+------------------------
           Total |       936         264  |       1200       0.7800
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.880734       |    1.409024    2.515956 (exact)
 Attr. frac. ex. |         .4682927       |    .2902887    .6025367 (exact)
 Attr. frac. pop |         .3902439       |
                 +-------------------------------------------------
                               chi2(1) =    19.89  Pr>chi2 = 0.0000
. logistic disease exposed

Logistic regression                               Number of obs   =       1200
                                                  LR chi2(1)      =      20.05
                                                  Prob > chi2     =     0.0000
Log likelihood = -821.75147                       Pseudo R2       =     0.0121
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   1.880734   .2685639     4.42   0.000     1.421602    2.488151
------------------------------------------------------------------------------
We see that the crude OR of 1.88 is a severely biased estimate of the population crude OR of 5.02.
Next, performing a matched analysis of these data,
cc disease exposed, by(female)
logistic disease exposed female
. cc disease exposed, by(female)

          female |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |   4.403341      2.132179    9.972196    4.554348 (exact)
               1 |   4.019608      2.105171    7.908589    5.464286 (exact)
-----------------+-------------------------------------------------
           Crude |   1.880734      1.409024    2.515956             (exact)
    M-H combined |   4.194048      2.639437    6.664316
-------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.04  Pr>chi2 = 0.8479

                   Test that combined OR = 1:
                                Mantel-Haenszel chi2(1) =     41.22
                                                Pr>chi2 =    0.0000
. logistic disease exposed female

Logistic regression                               Number of obs   =       1200
                                                  LR chi2(2)      =      43.41
                                                  Prob > chi2     =     0.0000
Log likelihood = -810.07334                       Pseudo R2       =     0.0261
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   4.183013   .9848099     6.08   0.000     2.636879    6.635723
      female |   2.833092   .6508466     4.53   0.000     1.805984     4.44434
------------------------------------------------------------------------------
Alternatively, performing a matched analysis of these data using conditional logistic regression,
clogit disease exposed, group(female) or
Conditional (fixed-effects) logistic regression   Number of obs   =       1200
                                                  LR chi2(1)      =      43.30
                                                  Prob > chi2     =     0.0000
Log likelihood = -803.44316                       Pseudo R2       =     0.0262
------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |   4.168447   .9801334     6.07   0.000     2.629238    6.608739
------------------------------------------------------------------------------
The OR of 4.2 is reasonably close to the population OR of 5.02, differing because of sampling
variability. Monte Carlo simulation could verify that the long-run average of the sample ORs is
the same as the population OR.
This illustrates that a matched analysis eliminates the selection bias that is introduced by a
matching factor correlated with the exposure, and that is present if a nonmatched analysis is
used.
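Such a simulation might be sketched as follows for the frequency-matched design, comparing the nonmatched and matched analyses across repeated samples. It is an illustration only: the program name matchsim, the 200 replications, and rebuilding the population from the cell counts with input are assumptions, not part of the original analysis.

```stata
* Rebuild the population from the cell counts in the population table
clear all
input female exposed disease cellcount
0 1 1   450
0 0 1    10
0 1 0 89550
0 0 0  9990
1 1 1    50
1 0 1    90
1 1 0  9950
1 0 0 89910
end
expand cellcount
drop cellcount
tempfile pop
save `pop'

* One replication: frequency match controls on gender, then analyze both ways
program define matchsim, rclass       // hypothetical helper name
    use "`1'", clear
    sample 140 if disease==0 & female==1, count
    sample 460 if disease==0 & female==0, count
    logistic disease exposed                   // nonmatched analysis
    return scalar ln_crude = _b[exposed]
    clogit disease exposed, group(female)      // matched analysis
    return scalar ln_clogit = _b[exposed]
end

set seed 777
simulate ln_crude = r(ln_crude) ln_clogit = r(ln_clogit), ///
    reps(200) nodots: matchsim "`pop'"
generate or_crude  = exp(ln_crude)
generate or_clogit = exp(ln_clogit)
summarize ln_crude ln_clogit or_crude or_clogit
display "Population ln(OR) = " %6.4f ln(5.020101)
```

If the argument in the text holds, the ln_clogit values should center near the population log odds ratio, while the ln_crude values should center well below it.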
Notice that the ordinary logistic regression, which included the matching variable as a predictor,
gave a result nearly identical to that of the conditional logistic regression (OR = 4.18 versus 4.17).
When the matching is done on only a few distinct levels of a matching variable, such as with sex
in this example, the data can be analyzed with ordinary logistic regression as long as the
variables that were used to form the match are included in the model (Jewell, 2004, p.258).
Consistent with Jewell’s statement, Cheung (2003) points out that frequency-matched data do not
require conditional logistic regression, as long as the match variables are included in the
unconditional model as covariates:
“Frequency-matched data do not require conditional logistic regression; individually
matched data do. As Rahman et al. themselves pointed out, the unconditional logistic
regression is a suitable analytic method if the number of parameters is small in relation to
the number of subjects. This is usually the case in the analysis of frequency-matched
case-control data. As long as the matching variables are included in the unconditional
logistic regression model as covariates, the odds ratios will not be biased by the procedure
of frequency matching. To avoid suspicion, frequency-matched case-control studies
should always report whether they have included the matching variables in the analysis.”
When you report the model, however, you would not show the lines for the match variables.
Their ORs are not interpretable, since the design forced balance on these variables between cases
and controls (in the example above, the OR for female was 2.83 even though gender is unrelated to
disease within exposure levels in the population). Instead, merely mention in a footnote that they
were included in the model. Frequency-matched studies, which generally have only a few distinct
levels of a matching variable, are often analyzed this way.
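For individually matched data, which the Cheung quotation says do require conditional logistic regression, the call has the same form, but the group() variable identifies each matched set. The sketch below uses a made-up toy dataset of 70 one-to-one matched pairs. The counts and the variable names caseexp, ctrlexp, npairs, and pairid are hypothetical and chosen only to make the example self-contained.

```stata
* Toy data: pair types by exposure of the case and of its matched control
clear
input caseexp ctrlexp npairs
1 1 10
1 0 30
0 1 10
0 0 20
end
expand npairs
generate pairid = _n                  // one row per matched pair
expand 2                              // two rows per pair: case and control
bysort pairid: generate disease = (_n == 1)
generate exposed = cond(disease, caseexp, ctrlexp)

* Conditional logistic regression, conditioning on the matched pair
clogit disease exposed, group(pairid) or
```

With a single binary exposure and one-to-one matching, the conditional odds ratio equals the ratio of discordant pairs, so this call should report an odds ratio of 30/10 = 3. The concordant pairs contribute nothing to the estimate.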
Analysis of Matched Cohort Studies
Usually cohort studies are analyzed with Cox regression, or with Poisson regression if the data
are in aggregated form or simply counts of events. Both models allow for specification of one or
more stratification, or matching, variables and are available in Stata.
References
Cheung Y-B. (2003). Analysis of matched case-control data. J Clin Epidemiol 56:814.
Cochran WG. (1954). Some methods of strengthening the common chi-square test. Biometrics
10:417-51.
Gart JJ. (1976). Letter: Contingency tables. The American Statistician 30:204.
Greenland S, Morgenstern H. Matching and efficiency in cohort studies. Am J Epidemiol
131:151-159.
Howe GR, Choi BCK. (1983). Methodological issues in case-control studies: validity and power
of various design/analysis strategies. Int J Epidemiol 12:238-245.
Jewell NP. (2004). Statistics for Epidemiology. New York, Chapman & Hall/CRC.
Kleinbaum DG, Klein M. (2002). Logistic Regression: A Self-Learning Text. 2nd ed.
New York, Springer-Verlag.
Kupper LL, Karon JM, Kleinbaum DG et al. (1981). Matching in epidemiologic studies: validity
and efficiency considerations. Biometrics 37:271-292.
Mantel N. (1977). Letter: Contingency tables—a reply. The American Statistician 31:135.
Mantel N, Haenszel W. (1959). Statistical aspects of the analysis of data from the retrospective
studies of disease. J. National Cancer Inst. 22:719-48.
Miettinen OS. (1970). Matching and design efficiency in retrospective studies. Am J Epidemiol
91:111-118.
Monson RR. (1980). Occupational Epidemiology. Boca Raton, FL, CRC Press, Inc.
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.
Rothman KJ. (2002). Epidemiology: An Introduction. Oxford, Oxford University Press.
Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA,
Lippincott-Raven Publishers.
Rothman KJ, Greenland S, Lash TL. (2008). Design Strategies to Improve Study Accuracy. In
Rothman KJ, Greenland S, Lash TL eds. Modern Epidemiology, 3rd ed. Philadelphia, PA,
Lippincott Williams & Wilkins, 2008, pp. 168-182.
Siegel S and Castellan NJ Jr (1988). Nonparametric Statistics for the Behavioral
Sciences, 2nd ed. New York, McGraw-Hill.
Smith PG, Day NE. (1981). Matching and confounding in the design and analysis of
epidemiological case-control studies. In: Bithell JF, Coppi R, eds. Perspectives in Medical
Statistics. New York, Academic Press.
Thomas DC, Greenland S. (1983). The efficiency of matching case-control studies of risk factor
interaction. J Chronic Dis 38:569-574.
Thompson WD, Kelsey JL, Walter SD. (1982). Cost and efficiency in the choice of matched and
unmatched case-control studies. Am J Epidemiol 116:840-851.