Lecture 8: Selection Bias, Matching, & Control Selection
Matthew Fox
Advanced Epidemiology

What is selection bias?
Which studies can have selection bias: cohort or case-control?

Selection bias or confounding?
Comparison of mortality from MI among office workers and longshoremen
Comparison is biased because those who self-select into longshoremen are fitter, which leads to less MI
What is the bias?

In a case-control study, can we match cases to controls based on exposure?
If we match, do we need to adjust for the matched factor?
What is overmatching?

Misclassification Summary I
#1 Non-differential and independent misclassification of a dichotomous exposure or disease (usually) creates an expectation that estimates of effect are biased towards the null.
#2 Non-differential and independent misclassification of a covariate creates an expectation that the relative risk due to confounding is biased towards the null, yielding residual confounding.

Misclassification Summary II
#3 Errors due to misclassification can be corrected algebraically.
#4 Differential misclassification yields an unpredictable bias of the estimates of effect (still correctable).
#5 There are important exceptions to the mantra that “non-differential misclassification biases towards the null.”

This Session
Selection bias
  – Definition & control
Matching
  – Cohort vs. case-control studies
  – When to adjust, when not to adjust
Control selection
Adjustment
  – Is it possible?
Selection bias: definition
Distortions of the estimate of effect arising from procedures to select subjects and from factors that influence participation
  – Common element is that the exposure-disease relation is different among participants than among those theoretically eligible
Observed estimate of effect reflects a mixture of forces affecting participation and forces affecting disease occurrence

Separate from Confounding
Cohort studies don’t have selection bias at entry, even if subjects self-select
  – Selection into the cohort can create confounding, but this can be undone by adjustment
  – Or it becomes an issue of generalizability
Cohort studies/RCTs can have selection bias at the end through differential loss to follow-up (LTFU)
  – Some can be undone if we know enough about the selection mechanism

Selection bias: Fallacy
Formerly frequently viewed as disease-dependent selection forces
  – Exposure-dependent selection forces were thought to be confounders or part of the population definition.
Sometimes selection factors can be controlled as if they were confounders
  – For example, matched factors in case-control studies and two-stage studies.
However, not all selection factors related to exposure can be so treated

Selection bias
Adjust for selection proportions

Selection bias: Simple method
Selection probabilities
                  Cases (A)    Non-cases (B)
Exposure = 1      S_{A,1}      S_{B,1}
Exposure = 0      S_{A,0}      S_{B,0}

$OR_{adj} = \widehat{OR} \times \frac{S_{A,0}\, S_{B,1}}{S_{A,1}\, S_{B,0}}$
where $\widehat{OR}$ is the odds ratio observed in the selected data.

Selection bias
Truth
                             Cases (A)    Total
Occupational Radiation           50       10,000
No Occupational Radiation       100       20,000

Selection Probabilities
                             Cases (A)    Non-cases (B)
Occupational Radiation         100%         40%
No Occupational Radiation       40%         40%

Selected Data
                             Cases (A)    Non-cases (B)
Occupational Radiation           50        4,000
No Occupational Radiation        40        8,000

Observation
OR = [50/4000] / [40/8000] = 2.5
$2.5 \times \frac{40\% \times 40\%}{100\% \times 40\%} = 1$

https://sites.google.com/site/biasanalysis/

Structure of Selection Bias
Selection forces don’t create bias if they are not related to both exposure and disease

Selection bias
Truth: same as the table above (50 of 10,000 exposed and 100 of 20,000 unexposed are cases)

Selection Probabilities
                             Cases (A)    Non-cases (B)
Occupational Radiation         100%         40%
No Occupational Radiation      100%         40%

Selected Data
                             Cases (A)    Non-cases (B)
Occupational Radiation           50        4,000
No Occupational Radiation       100        8,000

Observation
OR = [50/4000] / [100/8000] = 1
$1 \times \frac{100\% \times 40\%}{100\% \times 40\%} = 1$

Selection bias
Selection Probabilities
                             Cases (A)    Non-cases (B)
Occupational Radiation         100%         40%
No Occupational Radiation       50%         20%

Selected Data
                             Cases (A)    Non-cases (B)
Occupational Radiation           50        4,000
No Occupational Radiation        50        4,000

Observation
OR = [50/4000] / [50/4000] = 1
$1 \times \frac{50\% \times 40\%}{100\% \times 20\%} = 1$

Selection bias
Selection Probabilities
                             Cases (A)    Non-cases (B)
Occupational Radiation         100%        100%
No Occupational Radiation       40%         40%

Selected Data
                             Cases (A)    Non-cases (B)
Occupational Radiation           50       10,000
No Occupational Radiation        40        8,000

Observation
OR = [50/10000] / [40/8000] = 1
$1 \times \frac{40\% \times 100\%}{100\% \times 40\%} = 1$

Selection Bias Occurs When Selection is Related to Both the Exposure and the Outcome
Sounds like confounding, but this time E and D affect selection
Remember back to common causes and common effects (Hernán 2004)

Selection Bias in a Case-Control Study:
Case-control study of the relationship between estrogens and myocardial infarction
  – Cases are those hospitalized for MI
  – Controls are those hospitalized for hip fracture
Could this cause selection bias?
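Returning to the simple method for the occupational radiation example, the correction can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function names and the `scenario` labels are assumptions made here, while the counts and selection probabilities are the four sets shown in the slides.

```python
# Truth from the slides: (exposed cases, exposed total, unexposed cases, unexposed total).
TRUTH = (50, 10_000, 100, 20_000)

# Selection probabilities per scenario, in the order (S_A1, S_B1, S_A0, S_B0).
# The "scenario N" labels are illustrative; they follow the order of the slides.
SCENARIOS = {
    "scenario 1": (1.0, 0.4, 0.4, 0.4),
    "scenario 2": (1.0, 0.4, 1.0, 0.4),
    "scenario 3": (1.0, 0.4, 0.5, 0.2),
    "scenario 4": (1.0, 1.0, 0.4, 0.4),
}


def odds_ratio(a, b, c, d):
    """(a / b) / (c / d), following the slides' OR = [a/b] / [c/d] layout."""
    return (a / b) / (c / d)


def apply_selection(truth, probs):
    """Expected selected counts: each cell times its selection probability."""
    return tuple(n * s for n, s in zip(truth, probs))


def adjust_for_selection(or_observed, s_a1, s_b1, s_a0, s_b0):
    """Simple method: OR_adj = OR_obs * (S_A0 * S_B1) / (S_A1 * S_B0)."""
    return or_observed * (s_a0 * s_b1) / (s_a1 * s_b0)


results = {}
for label, probs in SCENARIOS.items():
    or_obs = odds_ratio(*apply_selection(TRUTH, probs))
    or_adj = adjust_for_selection(or_obs, *probs)
    results[label] = (or_obs, or_adj)
    print(f"{label}: observed OR = {or_obs:.2f}, adjusted OR = {or_adj:.2f}")
```

Only the first scenario yields an observed OR away from 1; in the other three the cross-product of selection probabilities equals 1, so the observed and adjusted values coincide.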
Selection Bias in a Case-Control Study:
E = estrogens, F = hip fracture, D = myocardial infarction, C = selection into study
[DAG figure: E → F → C ← D, with C conditioned on]
Selection bias occurs because we condition on a common effect of both E and D

Selection Bias in a Cohort Study:
Cohort study of the relationship between HAART and progression to AIDS
  – LTFU occurs more among those with low CD4
  – LTFU occurs more among those with AIDS
  – But now selection out occurs before AIDS
Could this cause selection bias?

Selection Bias in a Cohort Study: Differential LTFU
E = ART, D = AIDS, L = vector of symptoms, U = true immunosuppression (unmeasured), C = drop out (LTFU)
[DAG figure with nodes E, L, U, C, D]
Selection bias occurs because we condition on C, a common effect of both E and U, where U is a common cause of C and D

Selection Bias vs. Confounding
Bias is a systematic difference between the truth and the observed
  – $\Pr[Y^{a=1}=1] - \Pr[Y^{a=0}=1] \neq \Pr[Y=1 \mid A=1] - \Pr[Y=1 \mid A=0]$
  – Separate from random error, which is not structural
Using DAGs we can see the common structures
  – Confounding = common causes (directly or through other mechanisms)
  – Selection bias = conditioning on common effects

To see the difference
Comparison of mortality from MI among office workers and longshoremen
Comparison is biased because those who self-select into longshoremen are fitter, which leads to less MI
What is the DAG?
[DAG figure with nodes: Occupation, Fitness, MI]

Adjustment for Selection Bias
Adjustment for loss to follow-up through weighting
Because selection bias means we are only looking at those included in the study, we can’t adjust through stratification
  – We don’t have the data on those not included
Can use weighting, because this does not require us to have data on those missing
  – Inverse probability of censoring weighting
Assumes we have enough data to predict the drop out

Now we ask, what if the censored were not censored?
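One way to make this question concrete is a minimal Python sketch of inverse probability of censoring weighting, using the counts from this lecture's worked example (1,500 exposed and 200 unexposed subjects with complete follow-up; 100 and 200 censored). The function names are illustrative assumptions. The weights here depend on exposure only, which amounts to assuming that, within each exposure group, those lost have the same risk as those retained.

```python
# Subjects with complete follow-up, from the lecture's worked example.
retained = {
    "E+": {"D+": 300, "D-": 1200},
    "E-": {"D+": 20, "D-": 180},
}
# Subjects lost to follow-up (outcome unknown), by exposure.
censored = {"E+": 100, "E-": 200}


def ipc_weights(retained, censored):
    """Weight = 1 / Pr(uncensored | exposure), estimated from the counts."""
    weights = {}
    for exposure, cells in retained.items():
        n_retained = sum(cells.values())
        pr_uncensored = n_retained / (n_retained + censored[exposure])
        weights[exposure] = 1 / pr_uncensored
    return weights


def weighted_counts(retained, weights):
    """Pseudo-population in which retained subjects also stand in for those lost."""
    return {
        exposure: {outcome: n * weights[exposure] for outcome, n in cells.items()}
        for exposure, cells in retained.items()
    }


def risk_ratio(table):
    """Risk in E+ divided by risk in E-."""
    risk = {e: cells["D+"] / sum(cells.values()) for e, cells in table.items()}
    return risk["E+"] / risk["E-"]


weights = ipc_weights(retained, censored)
pseudo = weighted_counts(retained, weights)
print("weights:", weights)
print("weighted counts:", pseudo)
print("weighted RR:", round(risk_ratio(pseudo), 2))
```

Because the weights vary only by exposure, the weighted risk ratio equals the complete-data risk ratio in this example; the weighted counts simply reproduce the full cohort size. Weights estimated within levels of predictors of censoring would be needed for the weighting to change the estimate.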
Outcomes of the censored are unknown:
            Complete data          Censored
            E+        E-           E+       E-
D+          300       20           ?        ?
D-          1200      180          ?        ?
Total       1500      200          100      200
Risk        0.2       0.1          ?        ?
RR          2.0                    ?

Censored filled in with the risks seen in the complete data:
            Total data             Complete data          Censored
            E+        E-           E+        E-           E+       E-
D+          320       40           300       20           20       20
D-          1280      360          1200      180          80       180
Total       1600      400          1500      200          100      200
Risk        0.2       0.1          0.2       0.1          0.2      0.1
RR          2.0                    2.0                    2.0

Further stratify IPC weights for predictors of censoring
As shown, assumes those lost are the same as those retained
  – Not likely to be true
Calculate weights within levels of predictors of censoring
  – Valid if we can produce conditional exchangeability between those lost and those not lost
Weights can be multiplied by IPTW weights to simultaneously adjust for confounding

Matching
Matching in follow-up studies
  – Controls confounding by the matched factor
Matching in case-control studies
  – Introduces selection bias to gain efficiency
  – The bias and confounding must be controlled for in the analysis to get unbiased results
Matching may be necessary to control for certain finely divided confounders
  – Sibship, neighborhood, occupation

Matching differs by study design
Cohort studies
            index     reference
D+          a         b
D-          c         d
N           n1        n0

Case-control studies
            index     reference
D+          a         b
D-          c         d

Matching in a cohort study (1)
            Males                    Females
            E+          E-           E+          E-
D+          4500        50           100         90
D-          895500      99950        99900       899910
N           900000      100000       100000      900000
Risk        0.005       0.0005       0.001       0.0001
RD          0.0045                   0.0009
RR          10                       10
Crude RR = (4600/1,000,000) / (140/1,000,000) = 32.9

Matching in a cohort study (2)
10% sample of exposed, because we don’t have enough unexposed
For each exposed, match one unexposed on sex
            Males                    Females
            E+          E-           E+          E-
D+          450         45           10          1
D-          89550       89955        9990        9999
N           90000       90000        10000       10000
Risk        0.005       0.0005       0.001       0.0001
RD          0.0045                   0.0009
RR          10                       10
This should remind you of standardization
No change in risks

Matching in a cohort study (3)
In fact, we can collapse across the matched factor and the estimates of effect will be unconfounded. After matching, we can ignore the matched factor in the analysis (so long as the matched factor does not affect loss to follow-up).
            E+          E-
D+          460         46
D-          99540       99954
N           100000      100000
Risk        0.0046      0.00046
RD          0.00414
RR          10

Matched case-control study (1)
            Males                    Females
            E+          E-           E+          E-
D+          4500        50           100         90
Controls    4095        455          19          171
OR          10                       10
Same study, same cases, just a sample of the base
4550 male cases, so choose 4550 male controls. Expect 90% to be exposed (4095) and the remainder to be unexposed (go back to matching in cohort to see the full population)
190 female cases, so take 190 female controls. Expect 10% to be exposed (19) and the remainder to be unexposed
Distribution of exposure is very different in males and females
Distribution of D+ is very different in males and females among E-
Stratum-specific estimates are correct

Matched case-control study (2): Now collapse
            E+          E-
D+          4600        140
Controls    4114        626
OR          5.0
The odds ratio obtained after collapsing across strata is biased, in the opposite direction of the original confounding. We cannot ignore the matched factor in the analysis when matching in a case-control study.
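The contrast between the two designs can be checked numerically. The sketch below, in Python, collapses the matched cohort and the matched case-control tables from the slides, then re-analyzes the case-control data with the matched factor retained. The Mantel-Haenszel estimator is one common way to control for a stratification factor and is used here as an assumption; the slides do not name a specific method. Function names are illustrative.

```python
# Matched cohort, per sex: (exposed cases, exposed N, unexposed cases, unexposed N).
matched_cohort = {
    "males": (450, 90_000, 45, 90_000),
    "females": (10, 10_000, 1, 10_000),
}

# Matched case-control, per sex:
# (exposed cases, unexposed cases, exposed controls, unexposed controls).
matched_case_control = {
    "males": (4500, 50, 4095, 455),
    "females": (100, 90, 19, 171),
}


def collapse(strata):
    """Sum each cell across strata, i.e. ignore the matched factor."""
    return tuple(sum(cells) for cells in zip(*strata.values()))


def risk_ratio(cases_e, n_e, cases_u, n_u):
    return (cases_e / n_e) / (cases_u / n_u)


def odds_ratio(a, b, c, d):
    """Cross-product ratio: a, b = exposed/unexposed cases; c, d = exposed/unexposed controls."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Pooled OR across strata: sum(a*d/n) / sum(b*c/n)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
    return num / den


cohort_rr = risk_ratio(*collapse(matched_cohort))
collapsed_or = odds_ratio(*collapse(matched_case_control))
stratified_or = mantel_haenszel_or(matched_case_control)

print(f"matched cohort, collapsed RR:       {cohort_rr:.1f}")
print(f"matched case-control, collapsed OR: {collapsed_or:.1f}")
print(f"matched case-control, MH OR:        {stratified_or:.1f}")
```

Collapsing leaves the cohort estimate at the stratum-specific value, while the collapsed case-control odds ratio falls to about half of it; keeping sex in the analysis recovers the stratum-specific odds ratio.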
Selection Bias by Design
In a case-control study, we cannot match based on exposure, as this is selection bias
Matching in a case-control study is selection bias by design
  – We match on a factor related to exposure
  – Means we match on exposure to an extent
If S perfectly predicted E, it would be matching on E

Why the dependence on study design?
Cohort studies match unexposed to exposed
  – matching affects the distribution of the confounder in both diseased and undiseased
Case-control studies match controls to cases
  – matching affects the distribution of the confounder only in the controls, not in the diseased (cases)

Advantages of matching:
Cohort studies
  – control of confounding
  – efficient if information on the matched factors is inexpensive
  – control for special variables
Case-control studies
  – control of confounding in the analysis
  – statistically efficient analytic control of confounding by some factors
  – control for special variables

Disadvantages of matching:
Cohort studies
  – expensive, difficult
  – seldom cost-efficient compared with analytic control
  – must control in analysis if associated with censoring or competing risks (loss to follow-up)
Case-control studies
  – expensive, difficult
  – seldom cost-efficient
  – exclude cases with no match
  – can’t examine effect of matched factor

Over-matching
Definition 1: matching that harms statistical efficiency
  – On a variable associated with E but not D
Definition 2: matching that harms validity
  – On an intermediate between E and D
Definition 3: matching that harms cost efficiency
  – Preferred definition

When to match: cohort studies
When the matched factor is strongly associated with both exposure and disease
When information on the matched factor is easily obtained (e.g., large databases)
When the follow-up period is short and the exposure and matched factor are not likely associated with censoring or competing risks
Special variables (finely divided)

When to match: case-control studies
When the matched factor is strongly associated with both exposure and disease
When information on the matched factor is easily obtained
Special variables (finely divided, like sibship, neighborhood, occupation)