A Practical Guide to Propensity Score Models
Paul L. Hebert, PhD
Investigator, VA HSR&D Puget Sound, and Research Associate Professor, Department of Health Services
Paul.Hebert2@va.gov
Heberp@u.washington.edu
June 29, 2009
Funding provided by NIH/NIDDK

Motivation
- The researcher is using observational data to compare outcomes among two or more treatments.
- Observed covariates differ substantially between treatment groups.
- Propensity score models attempt to achieve balance in observed covariates between treatment groups:
  - Create a single variable, a propensity score, that captures how differences in these covariates contribute to a patient's probability of receiving treatment A versus treatment B.
  - Use this propensity score to create groups of treatment A and treatment B patients that look similar to each other.
  - Compare outcomes between these groups of well-matched patients.

Motivation, continued
Especially useful in two situations:
- Substantial non-overlap of treatment groups on important covariates: some people in the treatment A group don't look like anybody in the treatment B group.
- Rare outcomes with common treatments: multivariate models would have too few events per right-hand-side variable.

Using propensity score models requires five steps
1. Estimate the propensity score
2. Use the propensity score to create a balance in observed covariates across treatment groups
3. Evaluate the quality of the balance
4. Estimate differences in outcomes between balanced treatment groups
5. Perform sensitivity analyses

This talk focuses on the first three steps.

Data for Today's Presentation
Are all angiotensin-converting enzyme inhibitors (ACEIs) alike?
- The HOPE randomized controlled trial showed that the ACEI ramipril was good for high-risk cardiovascular patients.
- Are other, cheaper ACEIs just as effective?
  - Ramipril: $53.03 per 30-day supply
  - Captopril: $12.99
- No RCT has been, or ever will be, conducted to answer this. It's either causal models or we guess.

Problem: ACEI users differ on a number of important variables
California Medicaid/Medicare beneficiaries prescribed ACEIs in 1997

| Characteristic | Ramipril | Captopril | Enalapril | Benazepril | P-value |
|---|---|---|---|---|---|
| N | 1,269 | 1,521 | 4,935 | 3,412 | |
| Mean age | 70.3 | 71.4 | 71.4 | 71.0 | 0.000 |
| (SD) | 6.6 | 7.2 | 6.6 | 6.9 | |
| Female | 65% | 70% | 69% | 71% | 0.002 |
| Race: White | 53% | 59% | 54% | 54% | |
| Race: African American | 11% | 14% | 14% | 13% | |
| Race: Hispanic | 11% | 8% | 7% | 7% | |
| Race: Asian | 7% | 5% | 8% | 8% | |
| Race: Other/Unknown | 19% | 15% | 17% | 18% | |
| Long-term care | 4% | 14% | 7% | 6% | |
| Disabled | 26% | 32% | 24% | 27% | |
| Per-capita income in ZIP | 17,519 | 17,606 | 19,051 | 18,858 | |
| Hospitalized in 1996 | 13% | 20% | 23% | 22% | |
| Charlson comorbidity score | 1.57 | 1.74 | 1.71 | 1.64 | |
| Median number of meds in 1996 | 7 (3, 12) | 12 (7, 20) | 9 (5, 14) | 9 (5, 14) | |

P-values listed for the rows after Female, in source order: 0.024, 0.002, 0.000, 0.016, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000

What is a propensity score?
- Usually a logit (probit if you're an econometrician) model of treatment ($W$) as a function of the $X$'s:
  $W_i = f(X_i; \beta)$, where $W_i = 1$ if patient $i$ takes ramipril and $W_i = 0$ if captopril.
- For each patient, calculate the propensity score from the estimated $\beta$'s and the $X$'s:
  $\hat{P}_i = \dfrac{\exp(\hat{\beta} X_i)}{1 + \exp(\hat{\beta} X_i)}$

Step 1: Estimate PS equation
- What X's do you use?
- What should you do with the X's? E.g., transformations, interactions
- How do you know you have a good model?

What X's do you use?
[Diagram: risk factors, instruments, and confounders in relation to the treatment (e.g., ramipril) and the outcome]

What X's do you use?
- Rubin (2007) suggests an expansive definition of X.
  - To estimate the effect of smoking on costs, smoking was modeled as a function of:
    - Age, education, occupation, etc.
    - Seatbelt use, arthritis, number of friends, frequency of having friends over for dinner, membership in clubs, etc.
- Variables that should not be included "…are effectively known to have no possible connection to the outcomes, such as random numbers…, or the weather half-way around the world" (Rubin 2007; p 29)

Rubin DB, Statist Med 2007; 26:20-36

What X's do you use?
Monte Carlo simulations by Austin (2006, 2008) and Brookhart (2006)
- DO include all variables related to the outcome
  - You could get biased results (Brookhart) or imbalance in covariates in matched samples (Austin, 2006) if you include only known confounders.
- DO NOT include variables related to the treatment but not the outcome (i.e., instrumental variables)
  - Including these variables increases the variance of the estimated treatment effect but doesn't decrease the bias (Brookhart)
  - Including these variables reduces the ability to match (Austin, 2006)

Austin PC, et al Statist Med 2006; 26(4) 734-753
Brookhart, et al, Am J Epi 2006; 163:1149-56
Austin PC, J Clinical Epi 2008; 26: 537-545

What X's do you use?
[Diagram, repeated: risk factors, instruments, and confounders in relation to the treatment (e.g., ramipril) and the outcome]

Step 1: Estimate Propensity Score
- What X's do you use?
- What should you do with these X's?
  - Rubin (2007) suggests an expansive use of transformations and interaction terms.
    - The smoking equation includes log(weight)*log(height), log(weight)^2, etc.
    - "Other, unspecified non-linear terms"
    - Should not include "five-way interactions"
  - Austin (2006): modeled statin use using 257 variables: 24 main variables and 233 transformations and two-way interactions of those variables.
  - Dehejia and Wahba (1999, 2000): add higher-order terms if there is imbalance in covariates within quintiles of the propensity score (more later)

Austin PC et al, Statist Med 2006; 25:2084-2106
Rubin D, Statist Med 2007; 26:20-36

Step 1: Estimate PS equation
- What X's do you use?
- What should you do with the X's?
- How do you know you have a good model?
  - Should you care about the significance of variables in the model (i.e., p-values)?
  - Should you care about the overall fit or predictive properties (e.g., c-statistic)?
  - Answer: NO. It's all about achieving balance in the X's.

[Figure: kernel density plots of the linear propensity score (logit of the propensity score) for ramipril users versus captopril users, before propensity score matching]

Step 2: Creating balanced treatment groups
Three basic options:
- Conditioning on the propensity score
  $g(Y_i) = b_0 + b_1\,\mathrm{Ramipril}_i + b_2 F(\beta X_i) + e_i$
- Stratification on the propensity score
  $Y_{i,j} = \beta_{0,j} + \beta_{1,j} W_{i,j}$, with $\beta_1 = \dfrac{1}{5} \sum_{j=0}^{4} \beta_{1,j}$
- Matching on the propensity score

What techniques are researchers using?
Weitzen (2004) reviewed 47 articles in Medline in 2001:
- Conditional adjustment: 25
- Stratification: 9
- Matching: 7
- Stratified covariate: 4
- Unspecified: 2

Weitzen, S et al Pharmacoepi Drug Safety. 2004; 13:841-853

Which should you use?
- Conditioning is inappropriate for odds ratios and hazard ratios
  - Conditioning on the propensity score results in biased (toward the null) estimates of odds ratios and hazard ratios, but not rate ratios (Austin 2007), risk ratios (Austin, 2008), or differences in means or proportions (Rosenbaum and Rubin, 1985)
- Stratifying on the quintiles of any propensity score model resulted in residual imbalance between treated and untreated subjects in the upper and lower quintiles.
  (Austin, 2006)

Rosenbaum and Rubin (1983) Biometrika; 70:41-45
Austin PC, et al Stat Med 2006; 26(4): 734-753
Austin PC, et al Stat Med 2007; 26(4): 754-768
Austin PC, et al J Clinical Epi 2008; 26: 537-545

Which should you use? (continued)
- Matching on the propensity score resulted in the least bias when estimating relative risks, whereas stratifying resulted in the greatest bias (Austin, 2008)

If you choose to match, there are several techniques
- Nearest neighbor
  - Match a subject in the treatment A group to the subject in the treatment B group with the closest propensity score
- 5→1 digit match
  - Match on the propensity score to the 5th digit first
  - Of the resulting unmatched sample, match to the 4th digit of the propensity score
  - Repeat until a 1-digit match
- Mahalanobis matching
  - Match on the basis of the Mahalanobis distance between subjects: the distance between vectors of characteristics, including the propensity score (more later).

Options for use with each type of matching procedure
- 1-1 matching versus 1-many
  - Gain efficiency if you have many treatment B subjects that match one treatment A subject.
  - Efficiency gain of 1-many is surprisingly small (Rosenbaum, 1985)
- Matching with replacement
  - Allow a subject in treatment group B to serve as a match for multiple subjects in treatment group A
- Matching with calipers
  - Match treatment B subjects within a specified caliper, or range, of a treatment A subject's score (e.g., +/-0.01 of the propensity score)
  - E.g., discard a nearest neighbor match if the matched treatment B patient's propensity score differs from the treatment A patient's score by more than the caliper

Rosenbaum and Rubin (1985) Biometrics 41, 103-116

What matching techniques are researchers using?
Austin (2008) reviewed 47 articles published in medical journals 1996-2003
- Matching technique cited
  - 19 used calipers of various sizes
  - 9 used 5→1 matching
  - 3 used nearest neighbor
  - 1 matched within quintiles of the propensity score
  - 15 gave no information
- 1-1 versus 1-many
  - 39 used 1-1
  - 4 used 1-many
  - 4 gave no information
- With/without replacement
  - 14 without
  - 33 gave no information

Austin PC, Stat Medicine. 2008; 27:2037-2049

What should you do?
[Diagram: bias versus efficiency tradeoffs in matching techniques, on an axis running from bias to inefficiency: 1-1 matching versus 1-N matching; with replacement versus without replacement; with calipers versus without calipers]

How do you know when you have a good match?
- Look at differences in X's between the matched (or stratified) treatment groups
- T-tests and chi-square tests could misrepresent balance because large N's give small p-values, and vice versa
- Better: calculate standardized differences for each variable ($j$) in your analysis
  $D_j = 100 \times \dfrac{\bar{x}_{\text{treatment}} - \bar{x}_{\text{control}}}{\sqrt{\left(s^2_{\text{treatment}} + s^2_{\text{control}}\right)/2}}$
- Standardized differences > 10 in absolute value could be a problem

Before propensity score matching, ramipril users versus captopril users
[Figure: bar chart of standardized differences for selected variables, ramipril versus captopril. Variables: NSAID, Statin, Diuretic, Calcium Channel Blocker, Beta Blocker, COPD, Diabetes, Previous Hosp, Disabled, Longterm Care, Other Race, Asian, Hispanic, Black, Female, Number Meds, Income, Age. X-axis: standardized differences, -70 to 20]

Standardized differences after 5→1 digit matching, without replacement, 1-1 matching
[Figure: bar chart of standardized differences after propensity score matching, ramipril versus captopril; same variables and x-axis as the previous figure]

Compare standardized differences within quintiles of the estimated propensity score
Standardized differences overall and within propensity score quintiles; ramipril users versus captopril users, BEFORE propensity score matching

| Variable | Overall |
|---|---|
| Age | -16.9 |
| Per Capita Income | -1.2 |
| Number Meds | -68.6 |
| Female | -11.1 |
| Black | -9.8 |
| Hispanic | 10.5 |
| Asian | 8.5 |
| Other/Unk Race | 11.2 |
| Longterm Care | -37.9 |
| Disabled | -14.2 |
| Previous Hosp | -12.2 |
| Diabetes | -17.0 |
| COPD | -2.4 |
| Beta Blocker | -12.6 |
| Calcium Channel Blocker | -21.1 |
| Diuretic | -34.9 |
| Statin | 5.1 |
| NSAID | 3.8 |

Good 38%, Uncertain 41%, Bad 22%

Quintiles 0, 1, 2, 3 of the propensity score (values in source order):
-18.9 -4.6 -21.6 -6.2 -7.5 9.2 2.1 -7.6 -30.6 2.5 7.8 -1.4 6.7 1.2 22.5 16.2 -7.1 14.9 -1.3 5.0 -2.3 13.7 -8.1 -8.8 8.2 6.8 -8.0 -13.1 -0.9 -4.7 2.1 21.6 24.0 9.1 -3.3 -2.4 -0.2 -6.5 -9.3 3.0 2.7 1.7 1.4 -1.4 5.0 -3.9 4.6 12.9 -1.8 -5.0 7.3 6.0
6.7 -1.5 7.5 -1.0 -11.2 -8.5 1.8 -2.4 -12.1 0.7 7.9 10.8 -9.5 -1.1 -7.9 -3.8 -19.4 -13.2 -3.8 -1.8

| Variable | Quintile 4 |
|---|---|
| Age | 0.9 |
| Per Capita Income | -9.1 |
| Number Meds | -4.1 |
| Female | -12.1 |
| Black | 8.9 |
| Hispanic | 5.2 |
| Asian | 1.5 |
| Other/Unk Race | 2.1 |
| Longterm Care | -7.3 |
| Disabled | 5.8 |
| Previous Hosp | -7.5 |
| Diabetes | -6.0 |
| COPD | 3.5 |
| Beta Blocker | -18.4 |
| Calcium Channel Blocker | -35.9 |
| Diuretic | -18.9 |
| Statin | 14.9 |
| NSAID | 0.7 |

After matching
Standardized differences overall and within propensity score quintiles; ramipril users versus captopril users

| Variable | Overall | Quintile 4 |
|---|---|---|
| Age | -1.7 | 4.3 |
| Per Capita Income | 0.2 | -0.3 |
| Number Meds | -0.2 | -0.8 |
| Female | -0.7 | -9.2 |
| Black | -2.8 | 4.3 |
| Hispanic | -2.4 | -8.3 |
| Asian | -1.8 | 1.8 |
| Other/Unk Race | 0.9 | -8.1 |
| Longterm Care | -3.4 | -10.6 |
| Disabled | 2.5 | 16.6 |
| Previous Hosp | -2.0 | -18.7 |
| Diabetes | 2.0 | -1.8 |
| COPD | -1.4 | -5.8 |
| Beta Blocker | -3.2 | -24.0 |
| Calcium Channel Blocker | -0.2 | -27.7 |
| Diuretic | -0.5 | -10.7 |
| Statin | 1.1 | 6.5 |
| NSAID | 1.4 | -9.7 |

Good 94%, Uncertain 6%, Bad 0%

Quintiles 0, 1 of the propensity score (values in source order):
-19.9 4.8 7.7 5.5 8.4 -5.3 2.2 11.0 -4.1 -14.9 0.1 -9.2 -2.7 10.0 5.6 13.8 -16.5 11.3 7.4 -20.3 1.7 4.1 -3.0 -11.5 6.6 11.0 0.3 12.7 20.7 16.9 25.0 -5.9 -11.6 7.7 25.9 -29.7

Quintiles 2, 3 of the propensity score (values in source order):
5.5 -1.2 -7.6 -3.3 -11.3 3.1 5.9 -11.8 4.2 -2.2 4.4 1.6 -11.2 -7.8 3.7 -6.6 3.5 10.6 -5.0 16.0 -1.3 -2.2 22.5 4.5 -10.1 -11.1 -13.9 2.8 -3.1 -7.8 6.1 -19.0 -0.5 3.4 4.9 15.1

After matching, treatment groups look balanced on propensity score
[Figure: kernel density plots of the linear propensity score for ramipril and captopril users, in two panels: before propensity score matching and after propensity score matching]

After propensity score matching, ramipril vs.
captopril

Model characteristics

| | Captopril | Ramipril |
|---|---|---|
| Initial sample | 1521 | 1269 |
| % matched (N) | 56% (859) | 68% (859) |

C-statistic: 0.7569

Characteristics of matched samples

| Characteristic | Matched group (first listed) | Matched group (second listed) | p-value |
|---|---|---|---|
| Mean age | 70.6 | 70.6 | 0.944 |
| Female | 69% | 69% | 0.917 |
| Black | 12% | 13% | 0.557 |
| Hispanic | 8% | 9% | 0.34 |
| Asian | 6% | 5% | 0.605 |
| Other/Unknown race | 17% | 17% | 0.949 |
| Long-term care | 5% | 5% | 0.736 |
| Disabled | 28% | 29% | 0.748 |
| Ramipril equivalent dose (mg) | 5.4 | 5.6 | 0.103 |
| Per capita income in ZIP code ($) | 17,663 | 17,487 | 0.662 |
| Charlson comorbidity score | 1.6 | 1.6 | 0.678 |
| Hospitalized 1996 | 17% | 17% | 0.893 |
| Number of medications 1996 | 10.4 | 10.6 | 0.481 |

Medication rows (nine labels, eight values per column in the source): Beta blockers, Calcium channel blockers, Diuretics, Lipid-lowering agents, NSAIDs, Anti-arrhythmics, Anti-platelet, Hydralazine, Nitrates
- First listed group: 18%, 50%, 46%, 24%, 6%, 7%, 1%, 5%
- Second listed group: 17%, 50%, 45%, 23%, 6%, 7%, 1%, 5%
- p-values: 0.658, 0.923, 0.663, 0.651, 0.92, 0.707, 1, 0.747

Using propensity score models requires five steps
1. Estimate propensity score
2. Use the propensity score to create a balance in observed covariates across treatment groups
3. Evaluate the quality of the balance
4. Estimate differences in outcomes between matched treatment groups
   - If you matched, consider using an appropriate technique for matched samples (Austin, 2008, and associated editorials).
5. Perform sensitivity analyses (Rosenbaum 2002)

Austin PC, Statist Med 2008; 27(12) 2037-49
Rosenbaum (2002) Observational Studies. New York: Springer

Example: Rubin's suggested method
1. Anti-parsimonious logit for the propensity score equation
2. For each treatment A patient ($i$), define a "donor pool" of treatment B patients ($j$) with linear propensity score within 0.2 SD of patient $i$'s
3. Calculate the Mahalanobis distance ($M$) to each patient in the donor pool
   $M_{ij} = (x_i - x_j)^\top \Omega^{-1} (x_i - x_j)$
   - $x$ = vector of the linear propensity score and a few other important covariates
   - $\Omega$ = covariance matrix for the X's

Rubin, continued
4. 1-1 matching
5. Match without replacement
6.
   Use a greedy matching algorithm
   - Match the hardest-to-match patient to the best match in the donor pool first, based on Mahalanobis distance
   - Repeat with the next hardest-to-match patient until all treatment A patients are matched or there are no treatment B patients left in the donor pool.

Summary
- The goal of propensity score modeling is to create treatment groups that are balanced on observed covariates
- To estimate the propensity score:
  - Include all variables believed to be related to the outcome, not just the confounders. Be anti-parsimonious.
  - Do not include variables that are related only to the treatment
  - Use interactions and transformations liberally to achieve balance
- Condition, stratify, or match on the propensity score
  - Do not condition on the propensity score if you are estimating a logit or Cox model
- If you match:
  - There is a tradeoff between bias and efficiency regarding matching with replacement, calipers, and 1-many matching
  - Evaluate the quality of the match using standardized differences
  - Consider using statistics appropriate for matched samples

Look at "good" versus "bad" matches
- For a given variable (say, age), calculate the variance that is left in this variable after controlling for the propensity score.
- A good variable is one that has about as much unexplained variance in the treatment group as in the control group
- Using OLS, estimate
  $\mathrm{Age}_i = \alpha_0 + \alpha_1 \hat{W}_i + u_i$
  where $\hat{W}$ is the estimated propensity score
- Get the residuals ($u$) from this equation and calculate
  $B_{\mathrm{Age}} = s^2_{u,\text{treatment}} \,/\, s^2_{u,\text{control}}$
- $B$ is good if it is between 4/5 and 5/4 (i.e., around 1.0)
- $B$ is bad if it is > 2 or < 1/2
- $B$ is uncertain otherwise
- Do this for each variable in the propensity score model
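
The residual-variance check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the function names `residual_variance_ratio` and `classify_ratio`, the variable names, and the synthetic data are all assumptions made for the example. It regresses one covariate on the estimated propensity score by OLS, takes the ratio of residual variances (treated over control), and applies the good/uncertain/bad cutoffs from the slide.

```python
import numpy as np


def residual_variance_ratio(x, pscore, treated):
    """Return B: residual variance of x among treated over that among controls,
    after an OLS regression of x on the estimated propensity score."""
    x = np.asarray(x, dtype=float)
    pscore = np.asarray(pscore, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    # OLS of x on an intercept and the propensity score.
    design = np.column_stack([np.ones_like(pscore), pscore])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ coef

    # Ratio of the sample variances of the residuals in the two groups.
    return resid[treated].var(ddof=1) / resid[~treated].var(ddof=1)


def classify_ratio(b):
    """Apply the slide's cutoffs: good in [4/5, 5/4], bad if > 2 or < 1/2."""
    if 0.8 <= b <= 1.25:
        return "good"
    if b > 2.0 or b < 0.5:
        return "bad"
    return "uncertain"


# Synthetic data, for illustration only (hypothetical, not the ACEI cohort).
rng = np.random.default_rng(0)
n = 2000
pscore = rng.uniform(0.1, 0.9, size=n)
treated = rng.random(n) < pscore

# 'age' has the same noise in both groups; 'n_meds' is much noisier among the treated.
age = 70 + 5 * pscore + rng.normal(0, 6, size=n)
n_meds = 8 + 4 * pscore + rng.normal(0, 1, size=n) * np.where(treated, 3.0, 1.0)

for name, values in {"age": age, "n_meds": n_meds}.items():
    b = residual_variance_ratio(values, pscore, treated)
    print(f"{name}: B = {b:.2f} ({classify_ratio(b)})")
```

In practice the loop would run over every variable in the propensity score model, and the share of variables falling in each category could be reported alongside the standardized differences.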