Lecture 8: Selection Bias,
Matching, & Control Selection
Matthew Fox
Advanced Epidemiology
What is selection bias?
Which studies can have selection
bias: cohort or case control?
Selection bias or confounding?
Comparison of mortality from MI among office workers and longshoremen
Comparison is biased because those who self-select into longshoremen are fitter, which leads to less MI
What is the bias?

In a case control study, can we
match cases to controls based on
exposure?
If we match, do we need to adjust
for the matched factor?
What is overmatching?
Misclassification Summary I
#1 Non-differential and independent
misclassification of dichotomous exposure or
disease (usually) creates an expectation that
estimates of effect are biased towards the null.
#2 Non-differential and independent
misclassification of a covariate creates an
expectation that the relative risk due to
confounding is biased towards the null, yielding
residual confounding.
Misclassification Summary II
#3 Errors due to misclassification can be
corrected algebraically
#4 Differential misclassification yields an
unpredictable bias of the estimates of effect
(still correctable).
#5 There are important exceptions to the mantra
that “non-differential misclassification biases
towards the null.”
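Points #1 and #3 can be illustrated with a small calculation. The sketch below is only an illustration: the counts, sensitivity, and specificity are hypothetical, and the helper names (`misclassify`, `correct`, `odds_ratio`) are invented here. It applies the same sensitivity and specificity to cases and controls (non-differential), shows the expected observed odds ratio moving towards the null, and then back-calculates the true counts algebraically.

```python
def misclassify(exposed, unexposed, se, sp):
    """Expected observed counts after exposure misclassification."""
    obs_exposed = se * exposed + (1 - sp) * unexposed
    obs_unexposed = (1 - se) * exposed + sp * unexposed
    return obs_exposed, obs_unexposed


def correct(obs_exposed, obs_unexposed, se, sp):
    """Back-calculate expected true counts from observed counts."""
    total = obs_exposed + obs_unexposed
    true_exposed = (obs_exposed - (1 - sp) * total) / (se + sp - 1)
    return true_exposed, total - true_exposed


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio for a 2x2 table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical values, chosen only for illustration.
SE, SP = 0.8, 0.9
true_cases = (200, 800)      # exposed, unexposed
true_controls = (100, 900)   # exposed, unexposed

or_true = odds_ratio(*true_cases, *true_controls)

# Same SE and SP in cases and controls = non-differential misclassification.
obs_cases = misclassify(*true_cases, SE, SP)
obs_controls = misclassify(*true_controls, SE, SP)
or_obs = odds_ratio(*obs_cases, *obs_controls)

# Algebraic correction using the (assumed known) SE and SP.
fixed_cases = correct(*obs_cases, SE, SP)
fixed_controls = correct(*obs_controls, SE, SP)
or_corrected = odds_ratio(*fixed_cases, *fixed_controls)

print(f"true OR = {or_true:.2f}, observed OR = {or_obs:.2f}, "
      f"corrected OR = {or_corrected:.2f}")
```

The correction assumes the sensitivity and specificity are known; in practice they would come from validation data.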
This Session
Selection bias
  – Definition & control
Matching
  – Cohort vs. case-control studies
  – When to adjust, when not to adjust
Control selection
Adjustment
  – Is it possible?
Selection bias — definition

Distortions of the estimate of effect arising from procedures to select subjects and from factors that influence participation
  – Common element is that the exposure-disease relation is different among participants than among those theoretically eligible
Observed estimate of effect reflects a mixture of forces affecting participation and forces affecting disease occurrence
Separate from Confounding

Cohort studies don’t have selection bias at entry even if subjects self-select
  – Selection into the cohort can create confounding, but this can be undone by adjustment
  – Or it becomes an issue of generalizability
Cohort studies/RCTs can have selection bias at the end through differential loss to follow-up (LTFU)
  – Some can be undone if we know enough about the selection mechanism
Selection bias — Fallacy

Formerly, selection bias was frequently viewed as arising from disease-dependent selection forces
  – Exposure-dependent selection forces were thought to be confounders or part of the population definition.
Sometimes selection factors can be controlled as if they were confounders
  – For example, matched factors in case-control studies and two-stage studies.
However, not all selection factors related to exposure can be so treated
Selection bias
Adjust for selection proportions
Selection bias — Simple method
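Selection probabilities    Exposure = 1    Exposure = 0
Cases (A)                  S_{A,1}         S_{A,0}
Non-cases (B)              S_{B,1}         S_{B,0}

$$\widehat{OR} = OR \times \frac{S_{A,0} S_{B,1}}{S_{A,1} S_{B,0}}$$
(OR = observed odds ratio; \widehat{OR} = odds ratio adjusted for selection)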
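One way to see where the adjustment comes from is to write each observed cell as the true cell multiplied by its selection probability. In the sketch below, A_1, A_0, B_1, and B_0 are notation introduced here (not on the slide) for the true numbers of exposed cases, unexposed cases, exposed non-cases, and unexposed non-cases.

```latex
\begin{align*}
OR_{\text{obs}}
  &= \frac{(S_{A,1} A_1)(S_{B,0} B_0)}{(S_{A,0} A_0)(S_{B,1} B_1)}
   = \frac{A_1 B_0}{A_0 B_1} \times \frac{S_{A,1} S_{B,0}}{S_{A,0} S_{B,1}} \\[4pt]
\frac{A_1 B_0}{A_0 B_1}
  &= OR_{\text{obs}} \times \frac{S_{A,0} S_{B,1}}{S_{A,1} S_{B,0}}
\end{align*}
```

The second line is the first line solved for the true odds ratio, so the observed estimate is multiplied by the inverse of the selection-probability ratio.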
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selection probabilities      Cases (A)    Non-cases (B)
Occupational radiation       100%         40%
No occupational radiation    40%          40%

OR = [50/4000] / [40/8000] = 2.5
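The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch: the dictionary layout and function names are invented here, and it follows the slide in applying the non-case selection probability to the "Total" column.

```python
# Source-population counts from the slide, by exposure group.
# As on the slide, the "Total" column stands in for the non-cases.
truth = {
    "exposed": {"cases": 50, "noncases": 10_000},
    "unexposed": {"cases": 100, "noncases": 20_000},
}

# Selection probabilities from the slide.
selection = {
    "exposed": {"cases": 1.00, "noncases": 0.40},
    "unexposed": {"cases": 0.40, "noncases": 0.40},
}


def apply_selection(counts, probs):
    """Expected counts among those selected into the study."""
    return {
        group: {cell: counts[group][cell] * probs[group][cell]
                for cell in counts[group]}
        for group in counts
    }


def odds_ratio(table):
    """Ratio of (cases / non-cases) in exposed vs. unexposed."""
    exp, unexp = table["exposed"], table["unexposed"]
    return (exp["cases"] / exp["noncases"]) / (unexp["cases"] / unexp["noncases"])


selected = apply_selection(truth, selection)
or_truth = odds_ratio(truth)
or_selected = odds_ratio(selected)

print(f"OR in source population: {or_truth:.2f}")
print(f"OR among selected:       {or_selected:.2f}")
```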
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selected data                Cases (A)    Non-cases (B)
Occupational radiation       50           4,000
No occupational radiation    40           8,000
Selection bias
Observation                  Cases (A)    Total
Occupational radiation       50           4,000
No occupational radiation    40           8,000

$$\widehat{OR} = OR \times \frac{S_{A,0} S_{B,1}}{S_{A,1} S_{B,0}}$$
$$1 = 2.5 \times \frac{40\% \times 40\%}{100\% \times 40\%}$$
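The adjustment above is a one-line calculation. The function below is a hedged sketch rather than code from the course materials; its name and argument names are invented here, with the arguments mirroring the S subscripts on the slide.

```python
def adjust_or_for_selection(or_observed, s_a1, s_b1, s_a0, s_b0):
    """Observed OR multiplied by (S_A0 * S_B1) / (S_A1 * S_B0).

    s_a1, s_a0: selection probabilities for exposed and unexposed cases
    s_b1, s_b0: selection probabilities for exposed and unexposed non-cases
    """
    for p in (s_a1, s_b1, s_a0, s_b0):
        if not 0 < p <= 1:
            raise ValueError("selection probabilities must be in (0, 1]")
    return or_observed * (s_a0 * s_b1) / (s_a1 * s_b0)


# Numbers from the occupational radiation example.
or_observed = (50 / 4000) / (40 / 8000)
or_adjusted = adjust_or_for_selection(
    or_observed, s_a1=1.0, s_b1=0.4, s_a0=0.4, s_b0=0.4
)

print(f"observed OR = {or_observed:.2f}, adjusted OR = {or_adjusted:.2f}")
```

The calculation is only as good as the selection probabilities supplied, which usually have to be estimated or assumed.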
https://sites.google.com/site/biasanalysis/
Structure of Selection Bias
Selection forces don’t create bias
if they are not related to both
exposure and disease
Selection bias — Simple method
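Selection probabilities    Exposure = 1    Exposure = 0
Cases (A)                  S_{A,1}         S_{A,0}
Non-cases (B)              S_{B,1}         S_{B,0}

$$\widehat{OR} = OR \times \frac{S_{A,0} S_{B,1}}{S_{A,1} S_{B,0}}$$
(OR = observed odds ratio; \widehat{OR} = odds ratio adjusted for selection)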
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selection probabilities      Cases (A)    Non-cases (B)
Occupational radiation       100%         40%
No occupational radiation    100%         40%

OR = [50/4000] / [100/8000] = 1
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selected data                Cases (A)    Non-cases (B)
Occupational radiation       50           4,000
No occupational radiation    100          8,000
Selection bias
Observation                  Cases (A)    Total
Occupational radiation       50           4,000
No occupational radiation    100          8,000

$$1 = 1 \times \frac{100\% \times 40\%}{100\% \times 40\%}$$
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selection probabilities      Cases (A)    Non-cases (B)
Occupational radiation       100%         40%
No occupational radiation    50%          20%

OR = [50/4000] / [50/4000] = 1
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selected data                Cases (A)    Non-cases (B)
Occupational radiation       50           4,000
No occupational radiation    50           4,000
Selection bias
Observation                  Cases (A)    Total
Occupational radiation       50           4,000
No occupational radiation    50           4,000

$$1 = 1 \times \frac{50\% \times 40\%}{100\% \times 20\%}$$
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selection probabilities      Cases (A)    Non-cases (B)
Occupational radiation       100%         100%
No occupational radiation    40%          40%

OR = [50/10000] / [40/8000] = 1
Selection bias
Truth                        Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    100          20,000

Selected data                Cases (A)    Non-cases (B)
Occupational radiation       50           10,000
No occupational radiation    40           8,000
Selection bias
Observation                  Cases (A)    Total
Occupational radiation       50           10,000
No occupational radiation    40           8,000

$$1 = 1 \times \frac{40\% \times 100\%}{100\% \times 40\%}$$
Selection Bias Occurs When
Selection is Related to Both the
Exposure and the Outcome
Sounds like confounding,
but this time E and D
affect Selection
Remember back to common causes
and common effects (Hernán 2004)
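The four sets of selection probabilities used in the occupational radiation examples can be compared side by side. In this sketch the scenario labels and the function name are my own wording; the probabilities are the ones from the slides. The factor computed is the ratio of the observed to the true odds ratio.

```python
def selection_bias_factor(s_a1, s_b1, s_a0, s_b0):
    """Observed OR divided by true OR, implied by the selection probabilities."""
    return (s_a1 * s_b0) / (s_a0 * s_b1)


# Selection probabilities from the four worked examples.
scenarios = {
    "exposed cases over-selected": dict(s_a1=1.0, s_b1=0.4, s_a0=0.4, s_b0=0.4),
    "selection depends on disease only": dict(s_a1=1.0, s_b1=0.4, s_a0=1.0, s_b0=0.4),
    "disease and exposure act separately": dict(s_a1=1.0, s_b1=0.4, s_a0=0.5, s_b0=0.2),
    "selection depends on exposure only": dict(s_a1=1.0, s_b1=1.0, s_a0=0.4, s_b0=0.4),
}

factors = {name: selection_bias_factor(**probs) for name, probs in scenarios.items()}

for name, factor in factors.items():
    print(f"{name:38s} observed OR / true OR = {factor:.2f}")
```

Only the first scenario, where the selection probability for exposed cases is out of line with what exposure and disease would each imply on their own, gives a factor different from 1.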
Selection Bias in a
Case Control Study:

Case-control study of the relationship between estrogens and myocardial infarction
  – Cases are those hospitalized for MI
  – Controls are those hospitalized for hip fracture
Could this cause selection bias?
Selection Bias in a
Case Control Study:


E= estrogens
F= hip fracture
D = myocardial infarction
C = selection into study
Selection bias occurs because we condition on a
common effect of both E and D
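A small expected-count calculation shows how conditioning on selection can create an association where none exists. Everything numeric below is hypothetical: the population size, prevalences, and risks were chosen only for illustration, and the sketch assumes that estrogen use lowers hip fracture risk, has no effect on MI, and that fracture and MI are independent given estrogen use.

```python
# Hypothetical inputs; only the structure (E -> F -> C <- D) comes from the slide.
N = 100_000
P_E = 0.30                  # prevalence of estrogen use
P_D = 0.01                  # MI risk, identical in users and non-users (true OR = 1)
P_F = {1: 0.01, 0: 0.02}    # hip fracture risk, assumed lower among users


def expected_counts(exposed):
    """Expected cases (MI) and controls (hip fracture without MI) in one group."""
    n = N * (P_E if exposed else 1 - P_E)
    cases = n * P_D
    controls = n * (1 - P_D) * P_F[exposed]
    return cases, controls


cases_e, controls_e = expected_counts(1)
cases_u, controls_u = expected_counts(0)

# OR as it would be estimated from the hospital-based case-control study.
or_observed = (cases_e * controls_u) / (cases_u * controls_e)

print(f"exposed:   {cases_e:.0f} cases, {controls_e:.0f} controls")
print(f"unexposed: {cases_u:.0f} cases, {controls_u:.0f} controls")
print(f"observed OR = {or_observed:.2f} (true OR = 1)")
```

Because hip fracture controls under-represent estrogen users relative to the source population, the study would show an apparent harmful effect of estrogens on MI under these assumptions.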
Selection Bias in a
Cohort Study:

Cohort study of the relationship between HAART and progression to AIDS
  – LTFU occurs more among those with low CD4
  – LTFU occurs more among those with AIDS
  – But now selection out occurs before AIDS
Could this cause selection bias?
Selection Bias in a
Cohort Study: Differential LTFU



E = ART, D = AIDS, L = vector of symptoms
U = True immunosuppression (unmeasured)
C= Drop out (LTFU)
Selection bias occurs because we condition on C, a common effect of both E and U, where U is a common cause of C and D
Selection Bias in a
Cohort Study: Differential LTFU



E = ART, D = AIDS, L = vector of symptoms
U = True immunosuppression (unmeasured)
C= Drop out
Selection Bias vs. Confounding

Bias is a systematic difference between the truth and the observed
  – $\Pr[Y^{a=1}=1] - \Pr[Y^{a=0}=1] \neq \Pr[Y=1 \mid a=1] - \Pr[Y=1 \mid a=0]$
  – Separate from random error, which is not structural
Using DAGs we can see the common structures
  – Confounding = common causes (directly or through other mechanisms)
  – Selection bias = conditioning on common effects
To see the difference



Comparison of mortality from MI among office workers and longshoremen
Comparison is biased because those who self-select into longshoremen are fitter, which leads to less MI
What is the DAG?
[Figure: DAG with nodes Occupation, Fitness, and MI]
Adjustment for Selection Bias
Adjustment for loss to follow up
through weighting

Because selection bias means we are only looking at those included in the study, we can’t adjust through stratification
  – We don’t have the data on those not included
Can use weighting, because this does not require us to have data on those missing
  – Inverse probability of censoring weighting
Assumes we have enough data to predict the drop out
Now we ask, what if the censored
were not censored?
Complete data    E+      E-
D+               300     20
D-               1200    180
Total            1500    200
Risk             0.2     0.1
RR               2.0

Censored         E+      E-
D+               ?       ?
D-               ?       ?
Total            100     200
Risk             ?       ?
RR               ?
Now we ask, what if the censored
were not censored?
Complete data    E+      E-
D+               300     20
D-               1200    180
Total            1500    200
Risk             0.2     0.1
RR               2.0

Censored         E+      E-
D+               ?       ?
D-               ?       ?
Total            100     200
Risk             0.2     0.1
RR               2
Now we ask, what if the censored
were not censored?
Complete data    E+      E-
D+               300     20
D-               1200    180
Total            1500    200
Risk             0.2     0.1
RR               2.0

Censored         E+      E-
D+               20      20
D-               80      180
Total            100     200
Risk             0.2     0.1
RR               2.0
Now we ask, what if the censored
were not censored?
Total data       E+      E-
D+               320     40
D-               1280    360
Total            1600    400
Risk             0.2     0.1
RR               2.0

Complete data    E+      E-
D+               300     20
D-               1200    180
Total            1500    200
Risk             0.2     0.1
RR               2.0

Censored         E+      E-
D+               20      20
D-               80      180
Total            100     200
Risk             0.2     0.1
RR               2.0
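The step from the complete data to the total data can be written as a weighting calculation. This is a sketch using the counts from the slides; the function and variable names are invented here. Each uncensored subject is weighted by the inverse of the probability of remaining uncensored in their exposure group.

```python
# Uncensored (complete) data from the slide, by exposure group.
complete = {
    "E+": {"D+": 300, "D-": 1200},
    "E-": {"D+": 20, "D-": 180},
}
# Enrolled at baseline = complete total + censored total, from the slide.
enrolled = {"E+": 1500 + 100, "E-": 200 + 200}


def ipc_weighted(complete_data, enrolled_n):
    """Weighted cases, weighted total, and risk for each exposure group."""
    out = {}
    for group, cells in complete_data.items():
        n_complete = cells["D+"] + cells["D-"]
        p_uncensored = n_complete / enrolled_n[group]
        weight = 1 / p_uncensored               # inverse probability of censoring weight
        weighted_cases = cells["D+"] * weight
        weighted_total = n_complete * weight    # recovers the enrolled group size
        out[group] = {
            "cases": weighted_cases,
            "total": weighted_total,
            "risk": weighted_cases / weighted_total,
        }
    return out


weighted = ipc_weighted(complete, enrolled)
rr = weighted["E+"]["risk"] / weighted["E-"]["risk"]

for group, row in weighted.items():
    print(f"{group}: {row['cases']:.0f} cases / {row['total']:.0f} -> risk {row['risk']:.2f}")
print(f"weighted RR = {rr:.1f}")
```

With weights that depend only on exposure, the weighted risks equal the complete-data risks, which is why this version relies on those lost having the same risk as those retained.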
Further stratify IPC weights for
predictors of censoring

As shown, this assumes those lost are the same as those retained
  – Not likely to be true
Calculate weights within levels of predictors of censoring
  – Valid if we can produce conditional exchangeability between those lost and those not lost
Weights can be multiplied by inverse probability of treatment weights (IPTW) to simultaneously adjust for confounding
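Calculating weights within levels of a predictor of censoring can be sketched with a made-up cohort. All counts below are hypothetical. L is a binary predictor measured at baseline that raises the risk of disease, exposure has no effect (true RR = 1), and half of the exposed subjects with L = 1 are lost, with loss unrelated to the outcome within each exposure-by-L stratum.

```python
# Hypothetical baseline enrollment: (exposure, L) -> number enrolled.
enrolled = {("E+", 1): 400, ("E+", 0): 600, ("E-", 1): 400, ("E-", 0): 600}

# Hypothetical uncensored subjects: (exposure, L) -> (cases, number observed).
# Half of the exposed with L = 1 were lost; nobody else was.
observed = {
    ("E+", 1): (80, 200),
    ("E+", 0): (60, 600),
    ("E-", 1): (160, 400),
    ("E-", 0): (60, 600),
}


def risk(exposure, weighted):
    """Risk in one exposure group, with or without stratum-specific IPC weights."""
    cases = total = 0.0
    for (group, level), (d, n) in observed.items():
        if group != exposure:
            continue
        # Weight = 1 / Pr[uncensored | exposure, L], estimated within the stratum.
        w = enrolled[(group, level)] / n if weighted else 1.0
        cases += w * d
        total += w * n
    return cases / total


rr_complete_case = risk("E+", weighted=False) / risk("E-", weighted=False)
rr_weighted = risk("E+", weighted=True) / risk("E-", weighted=True)

print(f"complete-case RR = {rr_complete_case:.2f}")
print(f"IPC-weighted RR  = {rr_weighted:.2f}")
```

In this constructed example the complete-case analysis suggests a protective effect, while the stratum-specific weights recover the null, because those lost are exchangeable with those retained once exposure and L are taken into account.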
Matching
Matching

Matching in follow-up studies
  – Controls confounding by the matched factor
Matching in case-control studies
  – Introduces selection bias to gain efficiency
  – The bias and confounding must be controlled for in the analysis to get unbiased results
Matching may be necessary to control for certain finely divided confounders
  – Sibship, neighborhood, occupation
Matching differs by study design
         Cohort studies          Case-control studies
         index    reference      index    reference
D+       a        b              a        b
D-       c        d              c        d
N        n1       n0
Matching in a cohort study (1)
          Males                   Females
          E+         E-           E+         E-
D+        4500       50           100        90
D-        895500     99950        99900      899910
N         900000     100000       100000     900000
Risk      0.005      0.0005       0.001      0.0001
RD        0.0045                  0.0009
RR        10                      10

Crude RR = (4600/1,000,000) / (140/1,000,000) = 32.9
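The contrast between the stratum-specific and crude risk ratios can be checked directly from the counts on this slide. The data layout and helper names below are my own; the numbers are the slide's.

```python
# (cases, persons) from the slide, by sex and exposure.
cohort = {
    "males": {"E+": (4500, 900_000), "E-": (50, 100_000)},
    "females": {"E+": (100, 100_000), "E-": (90, 900_000)},
}


def risk_ratio(exposed, unexposed):
    """Risk ratio from two (cases, persons) pairs."""
    return (exposed[0] / exposed[1]) / (unexposed[0] / unexposed[1])


def collapse(table, exposure):
    """Sum cases and persons over sex for one exposure group."""
    cases = sum(stratum[exposure][0] for stratum in table.values())
    persons = sum(stratum[exposure][1] for stratum in table.values())
    return cases, persons


stratum_rr = {sex: risk_ratio(s["E+"], s["E-"]) for sex, s in cohort.items()}
crude_rr = risk_ratio(collapse(cohort, "E+"), collapse(cohort, "E-"))

for sex, rr in stratum_rr.items():
    print(f"{sex}: RR = {rr:.1f}")
print(f"crude RR = {crude_rr:.1f}")
```

The crude estimate is far from the stratum-specific value of 10 because most exposed people are male and males have the higher baseline risk, so sex confounds the crude comparison.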
Matching in a cohort study (2)
10% sample of exposed, because we don’t have enough unexposed
For each exposed, match one unexposed on sex

          Males                   Females
          E+         E-           E+         E-
D+        450        45           10         1
D-        89550      89955        9990       9999
N         90000      90000        10000      10000
Risk      0.005      0.0005       0.001      0.0001
RD        0.0045                  0.0009
RR        10                      10

No change in risks
This should remind you of standardization
Matching in a cohort study (3)


In fact, we can collapse across the matched factor and
the estimates of effect will be unconfounded.
After matching, we can ignore the matched factor in the
analysis (so long as the matched factor does not affect
loss to follow-up).
          E+         E-
D+        460        46
D-        99540      99954
N         100000     100000
Risk      0.0046     0.00046
RD        0.00414
RR        10
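The matched cohort can be rebuilt from the source population in expectation. This sketch takes the 10% sample of the exposed and the 1:1 match on sex from the preceding slides; the variable names are invented here, and expected counts are used instead of random sampling.

```python
# (cases, persons) in the source population, by sex and exposure.
source = {
    "males": {"E+": (4500, 900_000), "E-": (50, 100_000)},
    "females": {"E+": (100, 100_000), "E-": (90, 900_000)},
}
SAMPLE_FRACTION = 0.10   # 10% sample of the exposed

exposed_cases = exposed_n = unexposed_cases = unexposed_n = 0.0
for sex, stratum in source.items():
    cases_e, n_e = stratum["E+"]
    cases_u, n_u = stratum["E-"]
    n_sampled = SAMPLE_FRACTION * n_e
    # Sampled exposed keep the stratum-specific exposed risk.
    exposed_n += n_sampled
    exposed_cases += n_sampled * cases_e / n_e
    # One unexposed of the same sex per exposed, at the unexposed risk.
    unexposed_n += n_sampled
    unexposed_cases += n_sampled * cases_u / n_u

risk_exposed = exposed_cases / exposed_n
risk_unexposed = unexposed_cases / unexposed_n
matched_rr = risk_exposed / risk_unexposed

print(f"exposed:   {exposed_cases:.0f} / {exposed_n:.0f}")
print(f"unexposed: {unexposed_cases:.0f} / {unexposed_n:.0f}")
print(f"collapsed RR after matching = {matched_rr:.1f}")
```

Matching gives the unexposed the same sex distribution as the exposed, so the collapsed comparison is no longer confounded by sex.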
Matched case-control study (1)
              Males                Females
              E+        E-         E+        E-
D+ (cases)    4500      50         100       90
Controls      4095      455        19        171
OR            10                   10

Same study, same cases, just a sample of the base
4550 male cases, so choose 4550 male controls. Expect 90% to be exposed (4095) and the remainder to be unexposed (go back to matching in a cohort study to see the full population)
190 female cases, so take 190 female controls. Expect 10% to be exposed (19) and the remainder to be unexposed
Distribution of exposure is very different in males and females
Distribution of D+ is very different in males and females among E-
Stratum-specific estimates are correct
Matched case-control study (2):
Now collapse
              E+        E-
D+ (cases)    4600      140
Controls      4114      626
OR            5.0

The odds ratio obtained after collapsing across strata is biased, in the direction opposite to the original confounding.
We cannot ignore the matched factor in the analysis when matching in a case-control study.
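The effect of collapsing, and of keeping the matched factor in the analysis, can be checked from the counts on the preceding slides. The Mantel-Haenszel estimator used below is one standard way to summarize over strata; the slides do not name a specific method, so treat it as an illustrative choice, and the helper names as invented here.

```python
# Per stratum: (exposed cases, unexposed cases, exposed controls, unexposed controls).
strata = {
    "males": (4500, 50, 4095, 455),
    "females": (100, 90, 19, 171),
}


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio for a single 2x2 table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio across 2x2 tables."""
    num = den = 0.0
    for case_exp, case_unexp, ctrl_exp, ctrl_unexp in tables:
        n = case_exp + case_unexp + ctrl_exp + ctrl_unexp
        num += case_exp * ctrl_unexp / n
        den += case_unexp * ctrl_exp / n
    return num / den


stratum_or = {sex: odds_ratio(*table) for sex, table in strata.items()}

# Ignore sex: add the two tables cell by cell.
collapsed = tuple(sum(cells) for cells in zip(*strata.values()))
collapsed_or = odds_ratio(*collapsed)

mh_or = mantel_haenszel_or(strata.values())

print(f"stratum ORs: {stratum_or}")
print(f"collapsed OR = {collapsed_or:.1f}")
print(f"Mantel-Haenszel OR = {mh_or:.1f}")
```

Stratifying on sex returns the stratum-specific value of 10, while the collapsed table gives about 5.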
Selection Bias by Design


In a case-control study, we cannot match based on exposure, as this is selection bias
Matching in a case-control study is selection bias by design
  – We match on a factor related to exposure
  – Means we match on exposure to an extent
If S perfectly predicted E, it would be matching on E
Why the dependence on study
design?


Cohort studies
  – Match unexposed to exposed
  – Matching affects the distribution of the confounder in both diseased and undiseased
Case-control studies
  – Match controls to cases
  – Matching affects the distribution of the confounder only in the controls, not in the diseased (cases)
Advantages of matching:



Cohort studies
  – Control of confounding
  – Efficient if information on the matched factors is inexpensive
  – Control for special variables
Case-control studies
  – Control of confounding in the analysis
  – Statistically efficient analytic control of confounding by some factors
  – Control for special variables
Disadvantages of matching:



Cohort studies
  – Expensive, difficult
  – Seldom cost-efficient compared with analytic control
  – Must control in analysis if associated with censoring or competing risks (loss to follow-up)
Case-control studies
  – Expensive, difficult
  – Seldom cost-efficient
  – Exclude cases with no match
  – Can’t examine effect of matched factor
Over-matching

Definition 1: matching that harms statistical efficiency
  – On a variable associated with E but not D
Definition 2: matching that harms validity
  – On an intermediate between E and D
Definition 3: matching that harms cost efficiency
  – Preferred definition
When to match: cohort studies




When the matched factor is strongly
associated with both exposure and disease
When information on the matched factor is
easily obtained (e.g., large databases)
When follow-up period is short and the
exposure and matched factor are not likely
associated with censoring or competing risks
Special variables (finely divided)
When to match: case-control studies



When the matched factor is strongly associated
with both exposure and disease
When information on the matched factor is
easily obtained
Special variables (finely divided, like sibship,
neighborhood, occupation)