Honest Inference from Observational Studies in Healthcare

David Madigan, Columbia University
Patrick Ryan, Janssen
http://www.omop.org
http://www.ohdsi.org

“The sole cause and root of almost every defect in the sciences is this: that whilst we falsely admire and extol the powers of the human mind, we do not search for its real helps.”
- Novum Organum: Aphorisms [Book One], 1620, Sir Francis Bacon

141 patients exposed in pivotal randomized clinical trial for metformin
>1,000,000 new users of metformin in one administrative claims database

Patient profiles from observational data

Major Use-Cases
• Population-level estimation
  – Effect estimation: Does metformin cause lactic acidosis?
  – Comparative effectiveness: Does metformin cause lactic acidosis more than glyburide?
• Patient-level prediction/Precision medicine
  – Given everything you know about me and my medical history, if I start taking metformin, what is the chance that I am going to have lactic acidosis in the next year?
• Clinical characterization
  – Natural history: Who are the patients that take metformin? What happens to them?
  – Quality improvement: What proportion of patients with diabetes experience disease-related complications?

How well do we do estimation?

What is the quality of the current evidence from observational analyses?

August 2010: “Among patients in the UK General Practice Research Database, the use of oral bisphosphonates was not significantly associated with incident esophageal or gastric cancer”
Sept 2010: “In this large nested case-control study within a UK cohort [General Practice Research Database], we found a significantly increased risk of oesophageal cancer in people with previous prescriptions for oral bisphosphonates”

April 2012: “Patients taking oral fluoroquinolones were at a higher risk of developing a retinal detachment”
Dec 2013: “Oral fluoroquinolone use was not associated with increased risk of retinal detachment”

BJCP, May 2012: “In this study population, pioglitazone does not appear to be significantly associated with an increased risk of bladder cancer in patients with type 2 diabetes.”
BMJ, May 2012: “The use of pioglitazone is associated with an increased risk of incident bladder cancer among people with type 2 diabetes.”

Nov 2012: FDA released a risk communication about the bleeding risk of dabigatran, based on an unadjusted cohort analysis performed within Mini-Sentinel
Dec 2013: “This analysis shows that the RCTs and Mini-Sentinel Program show completely opposite results”
Aug 2013: “However, the absence of any adjustment for possible confounding and the paucity of actual data made the analysis unsuitable for informing the care of patients”

2010-2014 OMOP Research Experiment
• Open-source
• Standards-based

OMOP Methods Library (pictured: inception cohort, case control, logistic regression)
• 14 methods
• Epidemiology designs
• Statistical approaches adapted for longitudinal data

Common Data Model
• 10 data sources
• Claims and EHRs
• 200M+ lives

[Figure: grid of test cases, drug by outcome]
Outcomes: Angioedema; Aplastic Anemia; Acute Liver Injury; Bleeding; Hip Fracture; Hospitalization; Myocardial Infarction; Mortality after MI; Renal Failure; GI Ulcer Hospitalization
Drugs: ACE Inhibitors; Amphotericin B; Antibiotics: erythromycins, sulfonamides, tetracyclines; Antiepileptics: carbamazepine, phenytoin; Benzodiazepines; Beta blockers; Bisphosphonates: alendronate; Tricyclic antidepressants; Typical antipsychotics; Warfarin

Lesson 1: Empirical performance: Most observational methods do not have nominal statistical operating characteristics
• Applying the cohort design to MDCR against 34 negative controls for acute liver injury:
  – If the 95% confidence interval were properly calibrated, then 95% × 34 ≈ 32 of the estimates should cover RR = 1
  – Only 17 of the 34 negative controls actually covered RR = 1
  – Estimated coverage probability = 17 / 34 = 50%
  – Estimates on both sides of the null suggest high variability in the bias
Ryan PB, Stang PE, Overhage JM et al, Drug Safety, 2013: “A Comparison of the Empirical Performance of Methods for a Risk Identification System”

Lesson 2: Database heterogeneity: Holding analysis constant, different data may yield different estimates
• When applying a propensity score adjusted new user cohort design to 10 databases for 53 drug-outcome pairs:
  – 43% had substantial heterogeneity (I² > 75%), where pooling would not be advisable
  – 21% of pairs had at least 1 source with a significant positive effect and at least 1 source with a significant negative effect
Madigan D, Ryan PB, Schuemie MJ et al, American Journal of Epidemiology, 2013: “Evaluating the Impact of Database Heterogeneity on Observational Study Results”

Test cases from OMOP 2011/2012 experiment

Lesson 3: Parameter sensitivity: Holding data constant, different analytic design choices may yield different estimates
Sertraline-GI Bleed: RR = 2.45 (2.06 – 2.92) with the following case-control settings (CC: 2000195):
• Controls per case: up to 10 controls per case
• Required observation time prior to outcome: 180d
• Time-at-risk: 30d from exposure start
• Include index date in time-at-risk: No
• Case-control matching strategy: Age and sex
• Nesting within indicated population: No
• Exposures to include: First occurrence
• Metric: Odds ratio with Mantel-Haenszel adjustment by age and gender
Holding all parameters constant, except:
• Matching on age, sex and visit (within 30d) (CC: 2000205) yields RR = 0.73 (0.65 – 0.81)
[Figure: axis label: Relative risk]
Madigan D, Ryan PB, Schuemie MJ, Therapeutic Advances in Drug Safety, 2013: “Does design matter? Systematic evaluation of the impact of analytical choices on effect estimates in observational studies”

Lesson 4: Empirical calibration can help restore interpretation of study findings
• Type I error rate typically 40-60%
• Negative controls can be used to estimate the empirical null distribution: how much bias and variance exist when no effect should be observed
Schuemie MJ, Ryan PB, DuMouchel W, et al, Statistics in Medicine, 2013: “Interpreting observational studies: why empirical calibration is needed to correct p-values”

All is Not Well
• Unknown operating characteristics
• Type 1 error rate? “95%” confidence interval?
• Like early days of routine laboratory testing
  – “trust me, I measured it myself”

Large-scale analytics can help reframe the patient-level prediction problem
Given a patient’s clinical observations in the past…
…can we predict outcomes for that patient in the future?
[Figure: patient-by-feature matrix of 0/1 indicators. Column groups: Outcome: Stroke; Demographics (Age, Gender, Race, Location); All drugs (Drug 1, Drug 2, …, Drug n); All conditions (Condition 1, Condition 2, …, Condition n); All procedures (Procedure 1, Procedure 2, …, Procedure n); All lab values (Lab 1, Lab 2, …, Lab n). Example demographic rows (age, gender, race, location): 76 M B 441; 77 F W 521; 96 F B 215; 76 F B 646; 64 M B 379; 74 M W 627; 68 M B 348]

Which Atrial Fibrillation Patients Should Take Warfarin?
[Diagram labels: Atrial Fibrillation, Stroke Risk, Warfarin, Stroke Risk, Bleed Risk]
Goal: Identify patients with sufficiently low stroke risk to be spared warfarin.
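
The framing on the prediction slide above (past observations encoded as 0/1 indicators for drugs, conditions, procedures and labs, with a later outcome as the label) could be sketched roughly as follows. This is only an illustration: the patient records, the event names and the helper `build_feature_matrix` are hypothetical and do not come from the talk or from any OHDSI tool.

```python
# Minimal sketch: turn per-patient event histories into a 0/1 feature matrix.
# All records and event names below are made up for illustration.
patients = [
    {"events": {"drug:warfarin", "condition:atrial_fibrillation"}, "stroke": 0},
    {"events": {"condition:atrial_fibrillation", "condition:diabetes"}, "stroke": 1},
    {"events": {"condition:atrial_fibrillation", "procedure:ecg", "drug:metformin"}, "stroke": 0},
]


def build_feature_matrix(records):
    """Return (sorted feature names, list of 0/1 rows, list of outcome labels)."""
    # One column per distinct event seen anywhere in the data.
    features = sorted(set().union(*(r["events"] for r in records)))
    # One row per patient: 1 if the event appears in that patient's history.
    rows = [[1 if f in r["events"] else 0 for f in features] for r in records]
    labels = [r["stroke"] for r in records]
    return features, rows, labels


features, X, y = build_feature_matrix(patients)
for row, label in zip(X, y):
    print(row, "-> stroke =", label)
```

Any standard classifier (the slides that follow compare logistic regression and random forest) would then be trained on `X` and `y`; real data would have many thousands of columns rather than five.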

Standard Machine Learning Stroke Results

Method                 AUC
CHADS2                 .72
Random Forest          .79
Logistic Regression    .78

Not great discrimination at the low risk end

Health History Motifs
Amit and Murua (2001)
Shahn, Ryan, and Madigan (2015)

Random Relational Forest (RRF) Approach
• Build decision trees with graphs at the nodes
• Each graph is a set of labeled edges.
• A labeled edge is a triplet [ei, ej, Relation], where ei and ej are each health events (such as “Diabetes diagnosis” or “Atorvastatin prescription”) and ‘Relation’ labels the temporal relationship between the two events.

Example Tree
Each node of a tree is defined by two sets of labeled edges, call them I and E for “included” and “excluded”. A patient belongs to a node if the patient's history contains every edge in I and none of the edges in E. The edges in I form a connected graph.
Nodes shown:
• I = {[Diabetes, Asthma, d=20]}, E = {}
• I = {[Diabetes, Asthma, d=20], [Asthma, Dementia, d=40]}, E = {}
• I = {}, E = {[Diabetes, Asthma, d=20]}
• I = {[Diabetes, Asthma, d=20]}, E = {[Asthma, Dementia, d=40]}

[Figure: example RRF trees. Nodes split on age (< 59 vs. >= 59) and on labeled edges written as pairs of event codes, e.g. (1,13), (6,16), (4,6), (9,17), (14,8); leaves are marked + or -. Event codes:]
1. Vascular disorders NEC
2. Central nervous system vascular disorders NEC
3. Total fluid volume increased
4. Vascular disorders
5. Cardiac failure congestive
6. Nervous system disorders
7. Respiratory system disorders
8. Eye disorders
9. Coronary artery disorders NEC
10. Haematological and lymphoid tissue therapeutic procedures
11. Anti-inflammatory and antirheumatic products
12. Non-steroidal drugs for obstructive airway disease
13. Blood and blood forming organs
14. Antithrombotic agents
15. Opioids
16. Myocardial disorders NEC
17. Arteriosclerosis, stenosis, vascular insufficiency and necrosis

RRF Stroke Results

Method                 AUC
CHADS2                 .72
Logistic Regression    .78
Random Forest          .79
RRF                    .79

Improved discrimination at the low risk end

Standardized large-scale analytics tools under development within OHDSI
(all operating on patient-level data in OMOP CDM; http://github.com/OHDSI)
• ACHILLES: Database profiling
• CIRCE: Cohort definition
• HERACLES: Cohort characterization
• OHDSI Methods Library: CYCLOPS, CohortMethod
• HERMES: Vocabulary exploration
• LAERTES: Drug-AE evidence base
• PLATO: Patient-level predictive modeling
• HOMER: Population-level causality assessment

Large-scale analytics example: ACHILLES
http://ohdsi.org/web/ACHILLES
>12 databases from 5 countries across 3 different platforms:
• Janssen (Truven, Optum, Premier, CPRD, NHANES, HCUP)
• Columbia University
• Regenstrief Institute
• Ajou University
• IMEDS Lab (Truven, GE)
• UPMC Nursing Home
• Erasmus MC
• Cegedim
[Figure: ACHILLES example, Atopic Dermatitis]

Treatment pathways for diabetes
[Figure: T2DM: All databases. Legend: Only drug, First drug, Second drug]

Treatment pathways for HTN
[Figure: HTN: All databases]

Treatment pathways for depression
[Figure: Depression: All databases]

Population-level heterogeneity
[Figure: treatment pathways by condition (Type 2 Diabetes Mellitus, Hypertension, Depression) and by database (CCAE, CUMC, CPRD, INPC, JMDC, MDCR, MDCD, GE, OPTUM)]
• Differences by country
• Differences by medical center

Concluding thoughts
• An international community and global data network can be used to generate real-world evidence in a secure, reliable and efficient manner
• Common data model critically important
• Much work remains on establishing (and improving) actual operating characteristics of current approaches to causal inference

“I would rather discover one cause than gain the kingdom of Persia”
- Democritus, 400 BCE

OHDSI: Join the journey
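
As a note on what "operating characteristics" means in practice, the coverage arithmetic from Lesson 1 and the negative-control idea from Lesson 4 can be sketched as below. This is a deliberately simplified illustration, not the published calibration procedure: it fits the empirical null with a plain sample mean and standard deviation of the negative-control log relative risks, which assumes their standard errors are small compared with the spread of the bias. The toy estimates and all function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96


def covers_null(log_rr, se):
    """True if the 95% CI log_rr +/- Z95*se contains log(RR) = 0, i.e. RR = 1."""
    return abs(log_rr) <= Z95 * se


def empirical_coverage(controls):
    """Fraction of negative controls whose 95% CI covers RR = 1."""
    return sum(covers_null(b, s) for b, s in controls) / len(controls)


def fit_empirical_null(controls):
    """Crude empirical null: sample mean and SD of negative-control log RRs.
    ASSUMPTION: ignores each control's own standard error (a simplification)."""
    logs = [b for b, _ in controls]
    return mean(logs), stdev(logs)


def traditional_p(log_rr, se):
    """Two-sided p-value against the theoretical null (no bias assumed)."""
    return 2 * (1 - NormalDist().cdf(abs(log_rr / se)))


def calibrated_p(log_rr, se, mu, sigma):
    """Two-sided p-value against the empirical null N(mu, sigma^2 + se^2)."""
    z = (log_rr - mu) / math.sqrt(sigma ** 2 + se ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Coverage figure quoted on the Lesson 1 slide: 17 of 34 negative controls.
slide_coverage = 17 / 34  # 0.5, versus the nominal 0.95

# Hypothetical negative controls as (log RR, standard error) pairs.
controls = [(0.40, 0.10), (-0.30, 0.10), (0.05, 0.10),
            (0.50, 0.15), (-0.10, 0.20), (0.25, 0.10)]
mu, sigma = fit_empirical_null(controls)

print("toy coverage:", empirical_coverage(controls))
print("traditional p:", traditional_p(0.45, 0.10))
print("calibrated p:", calibrated_p(0.45, 0.10, mu, sigma))
```

With biased negative controls like these toy values, the calibrated p-value for a new estimate comes out much larger than the traditional one, which is the qualitative behaviour Lesson 4 describes.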