Confounding adjustment: Ideas in Action - a case study

Xiaochun Li, Ph.D.
Associate Professor
Division of Biostatistics
Indiana University School of Medicine

Outline
• Description of the data set
• Quantity to be estimated
• Summary of baseline characteristics
• Approaches to data analyses
• Results
• Discussion

Simulation Setup
• Lindner Center data described and analyzed in Kereiakes et al. (2000)
  6-month follow-up data on 996 patients who underwent an initial Percutaneous Coronary Intervention (PCI) and were treated with “usual care” alone or usual care plus a relatively expensive blood thinner (IIb/IIIa cascade blocker)
• Has 10 variables
  Y: 2 outcomes, mort6mo (efficacy) and cardcost (cost)
  X: 1 treatment variable and 7 baseline covariates: stent, height, female, diabetic, acutemi, ejecfrac and ves1proc

Baseline characteristics
stent      coronary stent deployment
female     patient sex
diabetic   diabetes mellitus
acutemi    acute myocardial infarction
ves1proc   number of vessels involved in initial PCI
height     in centimeters
ejecfrac   left ventricular ejection fraction (%)

The “LSIM10K” dataset
• Simulated data set based on the Lindner Center data
  17 copies of the clustered Lindner data, with fudge factors added to ejfract and hgt, and some clipping
  Same correlation among covariates, same clustering patterns
• Contains the values of 10 simulated variables for 10,325 hypothetical patients
• To simplify analyses, the data contain no missing values
• Details and dataset available from Bob’s website

What do we want to estimate?
The population average treatment effect (ATE), i.e., E(Y1) - E(Y0)
Y1 and Y0 are counterfactual outcomes
In plain words: “what if” scenarios
The expected response if treatment had been assigned to the entire study population minus the expected response if control had been assigned to the entire study population

Baseline covariate balance assessment
Variable      C (Usual care alone)   T (Usual care + Abciximab)   P value
stent         63%                    69%                          <0.001
female        33%                    34%                          0.36
diabetic      23%                    19%                          <0.001
acutemi       7%                     15%                          <0.001
ves1proc      1.4 (±0.6)             1.3 (±0.6)                   <0.001
height (cm)   172.5 (±10)            171.5 (±10)                  <0.001
ejfract       53 (±8)                50 (±10)                     <0.001

Visualizing overall imbalance
[Figure: covariate values by treatment group (C, T); deep blue = high values]

Analytical Methods for confounding adjustment
The following methods were applied to lsim10k
• Outcome regression adjustment (OR)
• Propensity score (PS) stratification
• Inverse-probability-of-treatment weighting (IPTW)
• Doubly robust (DR) estimation
• Matching by
  Mahalanobis distance
  PS only

ANALYSIS OF MORT6MO

OR model for mort6mo:
• treatment indicator (trtm)
• main effect terms for all seven covariates
• quadratic terms for both height and ejfract
• Residual deviance: 2410.4 on 10323 degrees of freedom

PS model:
• saturated model for the five categorical covariates (main effects and interaction terms up to fifth-order)
• main effects and quadratic terms for height and ejfract

Covariate Balance Evaluations Based on PS Quintiles
[Figures, one per covariate: stent, female, diabetic, acutemi, ves1proc]
[Figure: height, with imbalance in strata 2 (0.95 cm) and 3 (-1.50 cm)]

Height
• Existence of residual confounding after adjusting for PS quintiles
• The within-stratum between-group height difference
             mean     s.d.   p-value
Stratum 2:   0.949    0.44   0.032
Stratum 3:   -1.497   0.43   0.0005

[Figure: ejfract, with imbalance in strata 1 (0.81), 2 (-1.32) and 3 (-0.72)]

Ejfract
• Existence of residual confounding after adjusting for PS quintiles
• The within-stratum between-group ejfract difference
             mean     s.d.   p-value
Stratum 1:   0.812    0.41   0.0475
Stratum 2:   -1.322   0.33   7.38e-5
Stratum 3:   -0.721   0.32   0.025

PS Stratification
• Residual confounding within strata
• In the PS stratification method, height and ejfract are further adjusted
  Stratum-specific treatment effect
  Height, ejfract main effects and their quadratic terms

Results – mort6mo
True Δ = -0.036
Method               u1      u0      Δ        SE
Outcome Regression   0.010   0.043   -0.032   0.0038
PS strat.            0.012   0.044   -0.033   0.0039
IPTW1                0.011   0.045   -0.034   0.0038
IPTW2                0.011   0.045   -0.034   0.0037
DR                   0.011   0.043   -0.032   0.0037
Match: Mahalanobis   NA      NA      -0.037   0.0044
Match: PS            NA      NA      -0.036   0.0039

Results of all methods are consistent, providing evidence of treatment effectiveness at preventing death at 6 months.

ANALYSIS OF CARDCOST

PS model: same as before

cardcost model:
• treatment indicator (trtm)
• main effect terms for all seven covariates
• quadratic terms for both height and ejfract

cardcost model of CA with PS stratification:
• stratum-specific treatment effect
• height, ejfract main effects and their quadratic terms

Model checking – OR
Adjusted R-squared: 0.0386

Model checking – OR (log transformed)
Adjusted R-squared: 0.0693

Results – cardcost
Method                u1      u0      Δ      SE
OR: original scale    15308   15300   8      210
OR: log transformed   13536   13702   -166   111
PS strat.             13580   13639   -59    119
IPTW1                 15545   15226   -319   409
IPTW2                 15408   15303   -105   229
DR                    15393   15292   -101   226
Match: Mahalanobis    NA      NA      150    178
Match: PS             NA      NA      -3     215

Discussion
• All methods give consistent results on the 2 outcomes
• All PS-based results have similar variance except IPTW1
• IPTWs depend on an approximately correct PS model
• OR depends on an approximately correct outcome model
• DR is a fortuitous combination of OR and IPTW: it depends on one of the two models being right
• Nonparametric estimation of either model may be an alternative to parametric models

Double Robustness
• Wrong PS model: adjust for one covariate, ‘acutemi’, only
• Wrong OR model for cardcost: adjust for the treatment indicator ‘trtm’ and the ‘acutemi’ covariate
Method   PS      Outcome   Δ      SE
IPTW2    wrong   NA        464    214
DR       wrong   wrong     463    217
DR       wrong   right     166    214
DR       right   wrong     -131   233
By “right”, we mean approximately right.

Propensity score estimation
• The majority of applications in the literature use a parametric logistic regression model that assumes covariates are linear and additive on the log-odds scale
  May include selected interactions and polynomial terms
• Accurate PS estimation is impeded by
  High-dimensional covariates – which ones should we deconfound?
  Unknown functional form – how do they relate to the treatment selection?
• PS model misspecification can substantially bias the estimated treatment effect
• A nonparametric approach is flexible enough to accommodate nonlinear/non-additive relationships of covariates to treatment assignment, e.g., trees

Nonparametric regression techniques
• Generalized Boosted Models (GBM) to estimate the propensity score function
  Friedman, 2001; Madigan and Ridgeway, 2004; McCaffrey, Ridgeway, and Morral, 2004
  R package: twang
• Regression tree model to predict cardcost
  Ripley, 1996; Therneau and Atkinson, 1997
  R package: rpart

Generalized Boosted Models (GBM)
• A multivariate nonparametric regression technique
• Sum of a large set of simple regression trees modelling the log-odds
  gbm finds the MLE of g(x) = log(p(x)/(1 - p(x))), p(x) = P(T=1|x)
• Predicts treatment assignment from a large number of pretreatment covariates, choosing them adaptively
• Nonlinear
• No need to select variables
• Can model complex interactions
• Invariant to monotone transformations of x
  E.g., same PS estimates whether using age, log(age) or age^2
• Outperforms alternative methods in prediction error

Results – cardcost nonparametric approach
Method                       u1      u0      Δ      SE
DR: parametric models        15393   15292   -101   226
DR: GBM + parametric model   15303   15213   -90    210
DR: GBM + tree               15233   15356   123    172

Future research
• People try quintiles or deciles for propensity score stratification – a data-driven approach (based on the bias-variance tradeoff) is needed for the number of strata
• Model selection: PS model and outcome model
• Nonparametric estimation of models may be intuitive, but the properties of the resulting causal estimates are not clear
• Nonparametric caveat: still need to define a set of “confounders” based on knowledge of the causal relationships among treatment, outcome and covariates, rather than conditioning indiscriminately on all covariates that have associations with treatment and outcome
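
Illustration: double robustness on toy data
The right/wrong model comparison on the Double Robustness slide can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not the models used in the talk: the data are hypothetical (one binary confounder x, treatment t, binary outcome y, true ATE fixed at -0.05), the “right” models are saturated in x and the “wrong” models ignore x, IPTW uses normalized weights, and DR uses the augmented-IPTW form. All function and variable names are made up for this sketch.

```python
import random

TRUE_ATE = -0.05  # fixed by construction in simulate(); hypothetical value


def simulate(n=50000, seed=1):
    """Hypothetical data: binary confounder x, treatment t, binary outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0
        t = 1 if rng.random() < (0.7 if x else 0.3) else 0  # x drives treatment
        p_y = 0.10 + 0.20 * x + TRUE_ATE * t                # x drives outcome
        y = 1 if rng.random() < p_y else 0
        rows.append((x, t, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fit_ps(rows, right=True):
    """Estimate e(x) = P(T=1 | x); the 'wrong' model ignores x."""
    if not right:
        p_all = mean(t for _, t, _ in rows)
        return lambda x: p_all
    p = {k: mean(t for x, t, _ in rows if x == k) for k in (0, 1)}
    return lambda x: p[x]


def fit_or(rows, right=True):
    """Estimate m(t, x) = E(Y | T=t, x); the 'wrong' model ignores x."""
    if not right:
        m_t = {a: mean(y for _, t, y in rows if t == a) for a in (0, 1)}
        return lambda t, x: m_t[t]
    m = {(a, k): mean(y for x, t, y in rows if t == a and x == k)
         for a in (0, 1) for k in (0, 1)}
    return lambda t, x: m[(t, x)]


def ate_or(rows, m):
    """Outcome regression: average predicted difference over all subjects."""
    return mean(m(1, x) - m(0, x) for x, _, _ in rows)


def ate_iptw(rows, e):
    """IPTW with normalized weights (one common variant)."""
    w1 = [t / e(x) for x, t, _ in rows]
    w0 = [(1 - t) / (1 - e(x)) for x, t, _ in rows]
    mu1 = sum(w * y for w, (_, _, y) in zip(w1, rows)) / sum(w1)
    mu0 = sum(w * y for w, (_, _, y) in zip(w0, rows)) / sum(w0)
    return mu1 - mu0


def ate_dr(rows, e, m):
    """Doubly robust (augmented IPTW) estimate of E(Y1) - E(Y0)."""
    mu1 = mean(t * y / e(x) - (t - e(x)) / e(x) * m(1, x) for x, t, y in rows)
    mu0 = mean((1 - t) * y / (1 - e(x)) + (t - e(x)) / (1 - e(x)) * m(0, x)
               for x, t, y in rows)
    return mu1 - mu0


rows = simulate()
results = {
    "OR, right model": ate_or(rows, fit_or(rows, True)),
    "OR, wrong model": ate_or(rows, fit_or(rows, False)),
    "IPTW, right PS": ate_iptw(rows, fit_ps(rows, True)),
    "IPTW, wrong PS": ate_iptw(rows, fit_ps(rows, False)),
    "DR, right PS + wrong OR": ate_dr(rows, fit_ps(rows, True), fit_or(rows, False)),
    "DR, wrong PS + right OR": ate_dr(rows, fit_ps(rows, False), fit_or(rows, True)),
    "DR, both wrong": ate_dr(rows, fit_ps(rows, False), fit_or(rows, False)),
}
for name, estimate in results.items():
    print(f"{name:26s} {estimate:+.4f}")
```

In this toy setting the DR estimate should stay close to the true ATE whenever at least one of the two working models is right, and drift toward the unadjusted difference when both ignore the confounder. The numbers it prints are unrelated to the lsim10k results in the tables.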