PowerPoint Presentation, April 2 2014

PROPENSITY SCORES
MAKING SENSE OF NON-RANDOMIZED OBSERVATIONAL DATA
Atul Sharma MD, MSc, FRCPC(ret)
Biostatistical Consulting Unit
April 2014
Propensity score 1996 - 2013
RCT – the gold standard
R.A. Fisher: The Design of Experiments, 1925
• Introduced the concept of randomization
Neyman-Ruben Counterfactual Framework
• Ruben and Rosenbaum, 1983
• Formalizes conditions for valid RCT
Goal: estimate treatment effect in ith individual ti = Yi1 – Yi0, where
• Yi1 = outcome of ith patient with treatment, Yi0 = outcome with control
• Usually, we observe only one outcome i.e. “counterfactual” is hypothetical
• This is the “fundamental problem of causal inference “ -> randomization
Neyman-Ruben Theory Counterfactual Framework
• With randomization, average treatment effect estimated by difference in
means or proportions between treatment & control groups i.e.
• Randomization ensures groups balanced for all important covariates ->
unbiased estimate of average treatment effect. Treatment assignment is
‘strongly ignorable’
• 1:1 randomization allows direct comparison of groups with simple
statistics e.g t-tests for means; RR, OR, NNT for proportions
• With more complex study designs e.g. 1:k randomization, weighted
averages still permit unbiased estimates of average treatment effect
• All such estimates rely implicitly on knowing the probability of
assignment to treatment group = propensity score (e.g. ½ )
Neyman-Ruben Theory Counterfactual Framework
Sometimes randomization fails:
• Selection biases arise when there is an imbalance between groups for
observed or unobserved covariate, and probability of treatment now depends
on covariates e.g. older or sicker patients more likely to receive treatment
• Confounding occurs there is an imbalance between treatment groups on
observed or unobserved covariates that influence outcome e.g. older or sicker
patients more likely to receive treatment and more likely to die
• Potential confounders: Covariates correlated with treatment and outcomes
Neyman-Ruben Theory Counterfactual Framework
• Imbalance may be due to failure of randomization, non-compliance,
unblinded treatment assignment.
• Imbalance also arises as a result of non-randomized treatment assignment
chosen by patient/physician or institutional/social policies i.e.
• ‘natural experiments’ or ‘quasi-experimental designs’
• A non-randomized study can yield unbiased estimates of treatment effects
Conventional methods for adjusting for imbalance:
• Stratification (limited to a few key covariates e.g. age, gender). For example,
matching on 10 binary factors e.g disease history -> 1024 strata
• Matching case-control (also limited to a few key covariates). Given
covariates, treatment assignment is ‘strongly ignorable’
• Regression adjustment : Models effects of confounders on outcome directly
• Can adjust for multiple confounders, but effectiveness depends on
correctly specifying the model e.g. non-linear terms, interactions, etc.
Misspecification may actually worsen bias
• Overfitting may improve fit with current data (‘internal validity’), but
limit generalizability to larger population (‘external validity’). Principle
of parsimony limits model complexity
• Almost impossible to assess effectiveness of ‘bias control’ (unlike PS)
Balancing Property of the Propensity Score:
• PS is probability of assignment to treatment given confounders, summarizes
all observed confounders in a single number (scalar)
• The ‘balancing property’ tells us that conditioning on the PS by stratifying,
matching or regression adjustment will improve balance between groups
• Alternatively,
• The ‘fundamental theorem’: as sample size increases, treatment is
assignment independent of observed covariates after conditioning on
propensity score (Ruben and Rosenbaum, 1983)
• Given PS, treatment assignment is said to be ‘strongly ignorable’
Propensity scores in the real world:
• Advantages:
• PS estimated by familiar methods (e.g. logistic regression)
• Balances multiple covariates simultaneously
• Model of treatment assignment applies only to current data, no external validity
• Similarly, less need for parsimony. Over-fitting is not a problem if balancing works
• After conditioning on PS, balance can be assessed easily and directly
• Caveats:
• Unlike randomization, no assurance that unobserved covariates will balance
• You can only balance observed/ measured covariates
• Works best with large N, as imbalance may be unavoidable with small N
• Take care to avoid covariates that are really treatment outcomes (mediators).
Adjusting for covariates that are effected by treatment may worsen bias
A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial
Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97
Objective
Examine impact of RHC use during 1st 24h of ICU care
& outcome i.e. survival, length of stay, intensity of
care, cost of care
Design
Prospective cohort study
Setting
5 US teaching hospitals, 1989 – 1994.
Subjects
5735 critically ill adult patients in ICU for 1 of 9 prespecified disease categories: acute respiratory failure,
COPD, CHF, cirrhosis, coma, metastatic colon cancer,
non-small cell lung cancer, multi-organ failure with
malignancy or sepsis
Treatment
RHC at discretion of physician, possibly confounded
with patient factors related to the outcome e.g. BP
Outcome(s)
Survival, costs, intensity of care, hospital stay
Dataset and codebook available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets
A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial
Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97
•
RCT is often impractical for ethical, logistic, financial reasons e.g. “ the popularity of
this procedure and the widespread belief that it is beneficial makes the
performance of an RCT difficult”
•
Treatment selection at discretion of MD, likely to be confounded with patient
characteristics related to the outcome e.g. patients with low BP more likely to
receive RHC and more likely to die
•
7 critical care specialists identified 50 variable likely to influence treatment

Dependent variable is RHC in first 24h (2184 cases vs 3551 controls)

Independent variables are age, sex, race, education, income, medical
insurance, 9 primary disease categories, 10 admission diagnoses, 12
comorbid conditions, pre-admission ADL and DASI, day 1 DNR, cancer, 2 mo.
survival probability, APACHE III APS score, Glasgow Coma Score, weight,
temperature, BP, respiratory rate, heart rate, PaO2/FiO2, PaCO2, pH, WBC,
hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output

Great care taken to identify all confounders that might bias outcomes
through literature review, expert panels
A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial
Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97
Pre-treatment balance: 8 select covariates
Covariate
Age (yrs)
Gender % male
APACHE III APS
Weight (kg)
Mean BP mmHg
Respiratory Rate min-1
WBC
Creatinine
RHC+
60.8
58.5
60.7
72.4
68.2
26.7
16.3
218
RHC61.8
53.9
50.9
65.0
84.9
29.0
15.3
169
p-value
= 0.026
< 0.001
< 0.001
< 0.001
< 0.001
< 0.001
= 0.002
< 0.001
How do we compare groups that are fundamentally different?
A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial
Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97
100
Kaplan Meier Survival, death by day 30
swang1
80
70
60
% Surviving
90
No RHC
RHC
0
5
10
15
20
25
Survival Time in Days
RHC
RHC
No RHC
N
2184
3551
Events
830
1088
% death
38.0
30.6
Log-rank
p< 0.0001
30
A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial
Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97
Cox proportional hazards regression
• lt = hazard function = probability of death at time t given survival until t
• log(lt) = l0 + b1 x RHC
• e b1 = hazard ratio for RHC
Unadjusted Cox PH model:
• coxph(survival~swang1, data=rhc))
Cox Model Hazard Ratio 95% CI Log-rank
Unadjusted
1.30
1.19-1.43 p<0.0001
Given selection biases, patients with RHC sicker (lower BP, higher creatinine,
higher APACHE scores), HR likely overestimate of true risk of treatment
0.8
0.6
0.4
0.2
0.0
p = Prob{ RHC }
1.0
Estimating propensity scores
• Logistic regression for binary outcome e.g. RCH vs No RHC (swang1)
• LHS p = probability of treatment assignment = propensity score
• RHS covariates may be numeric or categorical (0/1 dummy variables)
−6
−4
−2
0
2
4
linear predictor
• b’s or OR = eb measure impact of each predictor on probability of RHC
• For each subject, fitted value of p or linear predictor may be used as PS
Estimating propensity scores
Logistic regression: RHC+/RHC- adjusted for 50 risk factors
model1 = glm(swang1~age + sex + race + edu + income + ninsclas + cat1 + das2d3pc +
dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + meanbp1 + resp1 + hrt1 +
pafi1 + paco21 + ph1 +wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp +
card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx +
chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx +
immunhx + transhx + amihx, data=rhc, family=binomial)
prop_score = fitted.values(model1)
• 30 of 50 were significant independent predictors of treatment (p < 0.05)
• OR = eb measures impact of predictors on probability of RHC
Covariates
Predictor
CHF
Day 1 DNR Status
10 mmHg drop BP
0.1U drop in arterial pH
10 meq/l drop sodium
100mmol/l increase in creatinine
OR
p value
2.19
<0.0001
0.51
1.07
<0.0001
<0.0001
1.18
1.13
<0.0001
0.016
1.06
0.014
• in general, sicker patients more likely to undergo RHC
Estimating propensity scores
Does LR model for PS adequately discriminate treatment groups? Check ROC
Sensitivity
0.2
0.6
1.0
• Data: 3551 controls < 2184 cases
• Area under the curve: 0.80
• 95% CI: 0.79-0.81
−0.2
Optimal Model
Threshold Specificity Sensitivity
0.41
0.73
0.71
1.0
0.6
0.2
Specificity
If discrimination inadequate, refine model:
• Add covariates, nonlinear terms, interactions
• Alternate PS estimation procedure
• Probit regression
• Discriminant analysis
Estimating propensity scores:
• Compare frequency distributions
• PS should discriminate, but overlap is essential!
Frequency Histogram
No RHC
Propensity
11
33
109
180
287
360
467
558
839
900
600
300
N
67
0.95
0.85
0.75
0.65
0.55
0.45
0.35
0.25
0.15
0.05
707
0
RHC
225
305
344
322
308
243
189
139
42
0
300
N
600
900
Using Propensity Scores to balance covariates in observational data:
1. Stratify by quintile (Ruben and Rosenbaum, 1983):
• Each quintile analyzed separately as if from RCT (~meta-analysis)
• With perfect stratification, average treatment effect across strata is unbiased
estimate of true treatment effect
• 5 groups eliminate 90% of bias from observed confounders
2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile
• Continuous numeric outcomes: ordinary linear regression
• Binary categorical outcomes: logistic regression
• Survival analysis: Cox proportional hazards regression
3. Match cases and controls by PS
• Compare groups as if from RCT
• Almost always reduces sample size (unlike 1 ,2)
Stratify by Propensity Score:
•
As few as 5 strata can eliminate 90% bias (Ruben and Rosenbaum 1983)
centiles= quantile(prop_score, probs=(0,0.2,0.4,0.6,0.8,1))
Probability
0
0.2
0.4
0.6
0.8
1
Percentile 0.0027 0.1299 0.2734 0.4353 0.6235 0.9781
ps_quintiles = cut(prop_score, centiles, include.lowest=TRUE);
Quintile
1
2
3
4
5
Sample N 1147 1147 1147 1147 1147
No RHC 1073 908 737 551 282
RHC
74
239 410 596 865
Balancing property of propensity score:
• Like randomization, treated and control patients in each PS quintile have similar
distribution of measured covariates
• Unlike randomization, does not mean they will be similar for unobserved covariates
• Does not imply that pairs matched by PS will be similar on covariates
Assess balance by quintile:
250
Creatinine
0
50
100
150
200
No RHC
RHC
Q1
Q2
Q3
Q4
Q5
ps_quintiles
Quintile
Mean (mmol/l)
1
129.4
2
157.3
3
182.1
4
211.3
5
258.5
Assess balance by quintile:
100
Mean BP
0
20
40
60
80
No RHC
RHC
Q1
Q2
Q3
Q4
Q5
ps_quintiles
Quintile
Mean (mmHg)
1
102.8
2
87.4
3
4
77.2 67.7
5
57.4
Assess balance by quintile:
PO2/FiO2
0
50
100
200
No RHC
RHC
Q1
Q2
Q3
Q4
Q5
ps_quintiles
Quintile
Mean
1
287.7
2
246.7
3
4
220.2 203.0
5
153.8
Are covariates related to RHC after PS adjustment?
• Covariate balance after stratification must be assessed formally
• Simple t-tests affected by sample-size differences (N=5735 vs 1147)
• Differences between RHC+/RHC- should be tested using methods unaffected
by sample size
• Recommended : Regress each covariate on RHC ± PS quintile
•
covariate ~ treatment ± ps_quintile
• If covariate unrelated to RHC after adjustment for PS, strata balanced
Are covariates related to RHC after PS adjustment?
Linear regression: age vs RHC status ± PS quintile (categorical, 5 levels):
•
•
lm(age~swang1, data=rhc)
lm(age~swang1 + ps_quintiles, data=rhc)
Logistic regression: sex vs RHC status ± PS quintile:
•
•
glm(sex~swang1, family = binomial, data=rhc)
glm(sex~swang1 + ps_quintiles, family = binomial, data=rhc)
PS-adjusted p-values
Covariate
Before
Age (yrs)
0.026
Gender % male
<0.001
APACHE III APS
<0.001
Weight (kg)
<0.001
Mean BP mmHg
<0.001
Respiratory Rate min-1
<0.001
WBC
0.002
Creatinine
<0.001
After
0.947
0.730
0.100
0.530
0.260
0.529
0.610
0.470
After adjusting for PS quintile, cases and controls are well balanced.
• May need to refine treatment model by adding covariates, non-linear terms,
interactions - iteratively until balanced
Stratify by Quintile : Analyze each separately
• Unadjusted Cox PH model applied to each stratum
Quintile
1
2
3
4
5
Hazard Ratio
1.21
1.30
1.24
1.21
1.22
Sometimes meaningful to report strata individually (distinct profiles, outcomes)
Average across strata (HR=1.24) is an ‘unbiased estimate of true treatment effect’
For most analyses, formulae exist for calculating pooled significance tests adjusted
for sample sizes and variances in each stratum
•
coxph(survival ~ swang1+strata(ps_quintiles), data=rhc)
Cox Model Hazard Ratio 95% CI Log-rank
Unadjusted
1.30
1.19-1.43 p<0.0001
Pooled
1.24
1.11-1.37 P<0.0001
Stratify by Quintile : Analyze each separately
• Unadjusted Cox model applied to each stratum
• Individual strata can be reported separately (sometimes meaningful)
•
coxph(survival~ swang1, data=rhc3)
PS Quintile 3
RHC
No RHC
RHC
N
737
410
Deaths
222
148
% deaths
30.1
36.1
HR
95% CI
Log-rank
1.24
1.01-1.53
p= 0.03
Using Propensity Scores to balance covariates in observational data:
1. Stratify by quintile (Ruben and Rosenbaum, 1983):
• Each quintile analyzed separately as if from RCT (~meta-analysis)
• With perfect stratification, average treatment effect across strata is unbiased
estimate of true treatment effect
• 5 groups eliminate 90% of bias from observed confounders
2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile
• Continuous numeric outcomes: ordinary linear regression
• Binary categorical outcomes: logistic regression
• Survival analysis: Cox proportional hazards regression
3. Match cases and controls by PS
• Compare groups as if from RCT
• Almost always reduces sample size
Regression Adjustment using propensity scores
• Regression of outcome on treatment with adjustment for PS quintile
Cox proportional hazards models:
Unadjusted : coxph(survival~swang1, data=rhc)
Quintile PS : coxph(survival~swang1+ps_quintile, data=rhc)
Linear PS : coxph(survival~swang1+prop_score, data=rhc)
Cox Model
Unadjusted
PS quintiles
PS linear
Hazard Ratio
1.30
1.24
1.23
95% CI
1.19-1.43
1.12-1.37
1.04-1.58
Log-rank
p<0.0001
p<0.0001
P<0.0001
Regression Adjustment using propensity scores
In weighted regression, underrepresented subjects given greater weight
so that weighted sample more representative
• Inverse-Probability-Treatment Weights (IPTW)
• treatment effect in population whose risk factor distribution equals all study subjects
• regression weights : 1/PS for controls and 1/(1-PS) for treated
• iptw=ifelse(swang1==TRUE,1/(1-prop_score), 1/prop_score))
• Standardized Mortality Estimator (SMR) Weights
• treatment effect in population whose risk factor distribution equals treated subjects*
• regression weights : PS/(1-PS) for controls and 1 for treated
• smr=ifelse(swang1==TRUE,1, prop_score/(1-prop_score))
• Weighted regression
•
coxph(survival ~ swang1, weights=smr, data=rhc)
• Results disappointing:
• Freedman, Berk 2008: better just to fit the model without weights
Regression Adjustment without propensity scores
Adjusted Cox PH model: model effect of confounders on outcome directly
coxph(survival~ swang1 + age + sex + race + edu + income + ninsclas + cat1 +
das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + + meanbp1
+ resp1 + hrt1 + pafi1 + paco21 + ph1 + wblc1 + hema1 + sod1 + pot1 + crea1 +
bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps +
trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx +
liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc)
Cox Model Hazard Ratio 95% CI Log-rank
Unadjusted
1.30
1.19-1.43 p<0.0001
Adjusted
1.24
1.12-1.38 P<0.0001
• Misspecification and over-fitting may limit generalizability (external validity)
• Difficult to assess adequacy of balance when modeling outcome directly
• May provide reassuring confirmation of bias adjustment by PS
Using Propensity Scores to balance covariates in observational data:
1. Stratify by quintile (Ruben and Rosenbaum, 1983):
• Each quintile analyzed separately as if from RCT (~meta-analysis)
• With perfect stratification, average treatment effect across strata is unbiased
estimate of true treatment effect
• 5 groups eliminate 90% of bias from observed confounders
2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile
• Continuous numeric outcomes: ordinary linear regression
• Binary categorical outcomes: logistic regression
• Survival analysis: Cox proportional hazards regression
3. Match cases and controls by PS
• Compare groups as if from RCT
• Almost always reduces sample size
Propensity score matching:
• Create matched case-control pairs by matching on PS
• Confirm balance before and after matching by comparing treatment
groups with procedure independent of sample size (e.g. standardized
differences). If inadequate, refine PS model or matching procedure
iteratively until it works
• Compare groups as if from randomized study
• Caveats:
• ‘Balancing property’ of PS does not ensure matched pairs will
match on covariates, only groups on average
• Matching constraints usually reduce sample size (stratification
and regression adjustment use all cases and controls)
• Requires specialized software
Propensity score matching:
• Matching on PS:
• Nearest neighbour: matches each case to closest control
• Caliper matching: only if PS within a pre-specified caliper
distance, typically 0.2 sd’s
• Matching on original covariates
• Minimize sum of distances between cases and controls
i.
Euclidean distance:
ii.
Mahalanobis distance weights the contributions by inverse
covariance, to avoid over-counting correlated measures,
generally preferable
i.
Exact matching for categorical variables e.g. gender, disease
status
Propensity score matching:
Options include:
• matching ± replacement (re-use controls; less bias, more analysis)
• one:many, many:one or variable ratio matching
• greedy (closest) vs optimal matching (minimizes sum of distances)
• almost exact matching, fine balance (without matching)
Suggested approach:
• start with nearest neighbor matching without replacement
• ± caliper ≤ 0.20 SD’s
• ± exact matching for priority covariates (gender, disease)
• check balance using standardized differences
• consider optimal matching (effective, but intensive)
• experiment with other options until cases and controls balanced
• may need to use specialized packages (Stata, R)
Propensity score matching in R:
Matching http://sekhon.berkeley.edu/matching
Sekhon, J. S. (2011). Multivariate and propensity score matching software with
automated balance optimization: The Matching package for R. Journal of Statistical
Software 42(7).
• Uses automated procedure to select matches, based on univariate and
multivariate balance diagnostics. Primarily 1:M matching (where M is a positive
integer), allows matching with or without replacement, caliper, exact
MatchIt http://gking.harvard.edu/matchit
Ho, D.E., Imai, K., King, G., and Stuart, E.A. (2011). MatchIt: Nonparametric
preprocessing for parameteric causal inference. Journal of Statistical Software 42(8)
• Two-step process: does matching, then user does outcome analysis. Wide array
of estimation procedures and matching methods available: nearest neighbor,
Mahalanobis, caliper, exact, full, optimal, subclassification
optmatch http://cran.r-project.org/web/packages/optmatch/index.html
Hansen, B.B., and Fredrickson, M. (2009).
• optmatch: Functions for optimal matching. Variable ratio, optimal, and full
matching.
Propensity score matching in Stata:
psmatch2 http://ideas.repec.org/c/boc/bocode/s432001.html
Leuven, E. and Sianesi, B. (2003). Stata module to perform full Mahalanobis and
propensity score matching, common support graphing, and covariate imbalance testing.
• Allows k:1 matching, kernel weighting, Mahalanobis matching
pscore http://www.lrz-muenchen.de/~sobecker/pscore.html
Becker, S.O. and Ichino, A. (2002). Estimation of average treatment effects based on
propensity scores (2002) The Stata Journal 2(4): 358-377.
• k:1 matching, radius (caliper) matching, and stratification (subclassification)
match http://www.economics.harvard.edu/faculty/imbens/software_imbens
Abadie, A., Drukker, D., Herr, J. L., and Imbens, G. W. (2004). Implementing matching
estimators for average treatment effects in Stata. The Stata Journal 4(3): 290-311.
• Primarily k:1 matching (with replacement)
cem http://gking.harvard.edu/cem/
Iacus, S.M., King, G., and Porro, G. (2008). Matching for Causal Inference Without
Balance Checking.
• Implements coarsened exact matching
Propensity score matching in SAS:
Local and global optimal propensity score matching
Coca-Perraillon, M. (2007). Local and global optimal propensity score matching. In SAS
Global Forum 2007. Paper 185-2007.
• Variety of matching methods. No built in diagnostics.
Greedy matching (1:1 nearest neighbor)
Parsons, L. S. (2001). Reducing bias in a propensity score matched-pair sample using
greedy matching techniques. In SAS SUGI 26, Paper 214-26.
Parsons, L.S. (2005). Using SAS software to perform a case-control match on propensity
score in an observational study. In SAS SUGI 30, Paper 225-25.
Kosanke, J., and Bergstralh, E. (2004): Match 1 or more controls to cases using the
GREEDY algorithm
1:1 Mahalanbois matching within propensity score calipers
Feng, W.W., Jun, Y., and Xu, R. (2005). A method/macro based on propensity score and
Mahalanobis distance to reduce bias in treatment comparison in observational study.
www.lexjansen.com/pharmasug/2006/publichealthresearch/pr05.pdf
Propensity Score Matching
Using R Match() function to implement
• Nearest neighbour match on PS with caliper = 0.1 (PS within 0.025)
• Exact matching on gender, admission diagnosis (9 disease categories)
Original number of observations..............
Original number of treated obs...............
Matched number of observations...............
Matched number of observations (unweighted).
5735
2184
1461
1461
Number of obs dropped by 'exact' or 'caliper'
2090 controls
723 cases,
Balance checks before and after matching are mandatory
• Test for balance must be independent of sample size
• Given numeric covariate x, standardized mean difference =
• For proportions p,
• Most authors equate balance with a standardized mean difference < 10%
• Full range pre- and post- match test statistics output by MatchBalance()
Propensity Score Matching
Output of MatchBalance() function includes tests for equality, treatment vs controls
• means, standardized mean differences (< 10%)
• variances (treatment vs controls) (0.5-2.0, Rubin 2001)
• empiric quantiles (quantile-quantile plot)
• empiric cumulative distribution function
***** (V5) meanbp1 *****
Before Matching
Matching
mean treatment........
68.198
mean control..........
84.869
std mean diff.........
-48.685
After
73.523
73.673
-0.42029
mean raw eQQ diff.....
med raw eQQ diff.....
max raw eQQ diff.....
16.673
11
49
1.6547
1
74
mean eCDF diff........
med eCDF diff........
max eCDF diff........
0.092555
0.06949
0.21172
0.0092787
0.0058179
0.039014
var ratio (Tr/Co).....
0.77589
T-test p-value........ < 2.22e-16
KS Naive p-value...... < 2.22e-16
KS Statistic..........
0.21172
1.0215
0.90308
0.21612
0.039014
Propensity Score Matching: Assessing matching balance
dot chart effective for display of many simultaneous covariates
Treatment and control groups balance when Standardized Mean Differences < 10%
Propensity Score Matching
Output of MatchBalance() function includes tests for equality, treatment vs controls
• means, standardized mean differences (< 10%)
• variances (treatment vs controls) (0.5-2.0, Rubin 2001)
• empiric quantiles (quantile-quantile plot)
• empiric cumulative distribution function
***** (V5) meanbp1 *****
Before Matching
Matching
mean treatment........
68.198
mean control..........
84.869
std mean diff.........
-48.685
After
73.523
73.673
-0.42029
mean raw eQQ diff.....
med raw eQQ diff.....
max raw eQQ diff.....
16.673
11
49
1.6547
1
74
mean eCDF diff........
med eCDF diff........
max eCDF diff........
0.092555
0.06949
0.21172
0.0092787
0.0058179
0.039014
var ratio (Tr/Co).....
0.77589
T-test p-value........ < 2.22e-16
KS Naive p-value...... < 2.22e-16
KS Statistic..........
0.21172
1.0215
0.90308
0.21612
0.039014
Propensity Score Matching: Assessing matching balance
Equality of distributions may be assessed by QQ plots, eCDF, KS statistic
Quantile – quantile plot: Creatinine (mmol/l)
Empiric CDF: Creatinine (mmol/l)
Formally compared via KS statistic
Propensity Score Matching
RHC
RHC
No RHC
Cox PH regression: 1461 matched pairs
coxph(survival ~ swang1, data = rhcm)
Cox Model
N
Hazard Ratio
Unadjusted
5735
1.30
PS quintiles
5735
1.24
PS matched pairs 2922
1.23
95% CI
Log-rank
1.19-1.43 p<0.0001
1.12-1.37 p<0.0001
1.09-1.41† p=0.001
N
1461
1461
Deaths
536
454
% Deaths
36.7
31.1
Log-rank
p = 0.001
Testing the fundamental assumption:
“Estimated treatment effects are unbiased if there are no unobserved
confounders i.e. all relevant covariates have been included”
Sensitivity Analysis seeks to test robustness of conclusions to hidden biases
from unobserved confounders
Connors. 1996:
• Asked 13 practicing clinicians to identify 10 variables with the greatest
influence on their decision to use RHC in practice. None missing in study
• From logistic regression, identified 4 physiologic variables with the
largest effect on the decision to treat: PaO2/FiO2, mean BP, pulse, and
resp. rate. Individually omitted from PS model -> hazard ratio ± 0.01
• Formal ‘Rosenbaum bounds’ to estimate how much hidden bias would
need to be present to invalidate conclusions, implemented in rbounds
libraries for R, Stata.
Rosenbaum Sensitivity Test for matched pairs:
• Matched patients have same PS and same probability of treatment (RHC) if no
unmeasured confounders
• Assume true probabilities differ due to an unmeasured confounder that correlates
perfectly with outcome (worst-case scenario). Let G = odds ratio for its effect on
treatment assignment (in logistic regression for PS)
• How large does G need to be to invalidate the conclusion RCH increased mortality?
• For binary outcome: 2 x 2 table for survival in 1461 matched pairs
•
Control dies
Control lives
RHC dies
651
356
RHC lives
274
180
Discordant outcomes in 274+356 = 730 of 1461 matched pairs
McNemar test for matched pairs: Under Ho , how likely is it that two discordant
cells (274 vs 356 ) arose from same binomial distribution by chance?
Rosenbaum Sensitivity Test for matched pairs:
• For binary outcome: 2 x 2 table for survival in 1461 matched pairs
Control dies
Control lives
RHC dies
651
356
RHC lives
274
180 ← McNemar p= 0.001
binarysens(274,356,Gamma=1.2,GammaInc=0.02) ← varies G -> effect on McNemar p-value
Rosenbaum Sensitivity Test
p-value of estimate .... 0.0006
Gamma Lower bound Upper bound
1.00
0.00062
0.00062 ← no confounding by unobserved covariate
1.02
0.00025
0.00143
1.04
0.00010
0.00308
1.06
0.00004
0.00619
1.08
0.00001
0.01170
1.10
0.00000
0.02082
1.12
0.00000
0.03503
1.14
0.00000
0.05590 ← McNemar p > 0.05
Unmeasured confounder needs to increase odds of RHC by 1.14 to nullify the
increase in mortality with treatment
Sensitivity analysis: Interpreting Rosenbaum bounds
• Becker et al, 2007: “These are worst case scenarios…a critical value of G =
1.15 does not mean there is no effect of treatment on the outcome.
However, the results are sensitive to deviations from the unconfoundedness
assumption, and we advise some caution when interpreting”
• Connors et al used a ‘less worst-case’ scenario to estimate G needed to
‘invert’ HR i.e. reduce it from 1.2 to 0.8 and cited a 3-fold increase in odds of
RHC due to unobserved confounders.
• rbounds now the standard measure of sensitivity to hidden biases in
matched studies
• psens() performs similar analysis for numeric outcomes based on Wilcoxon
rank-sum test (≠ McNemar)
• On Friday, we’ll demonstrate both binarysens() and psens() to perform
sensitivity analysis for different outcomes
Devices to increase robustness
• Multiple control groups:
• consistency between control groups that differ in unobserved confounders
• e.g. different ways to escape treatment e.g. declined vs denied treatment
• Coherence among outcomes:
• e.g. length or cost of hospitalization rather than survival, looking for consistency
• subgroup analysis: looking for consistency or plausible variations e.g. elderly,
women, whites, shock/sepsis/ post-op comparable
• Dose response relationships: looking for a consistent effect
• e.g. SVC pressure line vs. full RHC
• Known effects
• e.g. outcomes known to be unaffected by treatment
A brief review
Quasi-experimental designs require as much care and attention as an RCT
• Carefully identify all potential confounders by
• Clinical judgment
• Literature review
• Expert panels
• Prioritize confounders to identify those which must be balanced (a priori)
• Avoid controlling for post-treatment covariates
After estimating PS, make sure treatment model discriminates. Add covariates, nonlinear terms, or interactions until it does. Make sure that group scores overlap
After conditioning on PS by stratification, regression, or matching, make sure that
groups are balanced using measures independent of sample size. If needed, revise the
treatment model or tighten up matching criteria
Assess robustness of conclusions to hidden bias due to unobserved confounders:
• Have external panel of experts review choice of predictors
• Re-analyze after omitting important confounders
• Estimate Rosenbaum bounds i.e. G = how much hidden bias needed to
invalidate conclusions
Useful References:
•
Rosenbaum, P.R., & Rubin, D.B. The central role of the propensity score in observational
studies for causal effects. Biometrika 1983; 70(1), 41-55
•
A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of
Critically Ill Patients. JAMA 1996; 276(11): 889-97
•
Rosenbaum, P.R. (2010). Design of Observational Studies. New York: Springer
•
Pattanayaka, C.W. et al. Propensity Score Methods for Creating Covariate Balance in
Observational Studies. Rev Esp Cardiol. 2011; 64(10):897–903
•
Stuart, E. A. Matching methods for causal inference: A review and a look forward. Statistical
Science 20. 2010; 25: 1-21.
•
Stukel, T.A. et al. Analysis of Observational Studies in the Presence of Treatment Selection
Bias. JAMA 2007; 297(3) 278-85
•
J.S. Sekhon. Multivariate and Propensity Score Matching Software with Automated Balance
Optimization: The Matching Package for R. J Stat Software. 2011; 42(7): 1-52
Thank-you for your patience:
QUESTIONS?