Discovering Novel Adverse Drug Events Using Natural Language Processing and Mining of Electronic Health Records

Carol Friedman, PhD
Department of Biomedical Informatics
Columbia University
July 21 - AIME 2009

Motivation: Severity of Problem
• Clinical trials do not test a broad population
• Adverse Drug Events (ADEs) are a world-wide problem
• *Expense from ADEs is $5.6 billion annually
• *Estimated that over 2 million patients are hospitalized due to ADEs
• *ADEs are the fourth leading cause of death
*In US alone

Motivation: Limitations of Approaches
• Manual review of case reports (Venulet J 1988)
• Spontaneous reporting to designated agency (Evans JM 2001; Eland IA 1999; Wysowski DK 2005)
  – Serious ADEs reported less than 1-10% of time
  – Reporting is voluntary for physicians/patients
  – Recognition of ADEs is highly subjective
  – Difficult to determine cause of ADE
  – Biased by length of time on market and other factors
  – Cannot determine number of patients on drug or percent at risk
• Drug prescribing/claims data (Hershman D 2007; Ray WA 2009)

Severity of Under Reporting
Study showed that 87% of the time physicians ignored patient reports of known ADEs (Golomb et al. Physicians' response to patient reports of adverse drug effects. Drug Safety 2007)

Related Work
• Automated methods mainly based on spontaneous reporting databases
  – Most methods use (Evans SJ 2001; Szarfman A 2002)
    • Surrogate observed-to-expected ratios
    • Incidence of drug-event reporting compared to background reporting across all drugs and events
• Some research aimed at improving effectiveness of SPR databases
  – Create ontology of higher order adverse events
    • MedDRA
  – Avoid fragmentation of signal

Related Work (continued)
• Pharmacoepidemiology databases used to confirm suspicions
  – General practice research database (GPRD) (Wood & Martinez 2004)
  – New Zealand Intensive Medicines Monitoring (IMMP) (Coulter 1998)
  – Medicine Monitoring Unit (MEMO) (Evans et al. 2001)
• EHR databases used to find signals (Brown JS et al. 2007; Berlowitz DR et al. 2006; Wang X et al. 2009)
  – Mainly coded data used
  – Has potential for active real time surveillance
  – Should reduce biased reporting

Related Work (continued)
• Consortiums involving multiple EHRs
  – EU-ADR project (http://www.alert-project.org/)
  – eHealth initiative (http://www.ehealthinitiative.org/drugSafety/)
• Related work using EHR to detect known ADEs – not aimed at discovering novel ADEs (Bates DW 2003; Honigman B 2001)

Exploiting the Electronic Health Record
[Figure: text notes (primary care, inpatient progress, specialties, admit history), labs (bun 83, inr 1.3, hct 22, …), and orders (lasix, pepcid, …) make up the centralized data; NLP + integration produces executable data for applications]
Applications:
• Decision support
• Patient safety
• Acquire knowledge
• Discovery
• Guidelines
• Surveillance
• Patient management
• Clinical trial recruitment
• Improved documentation
• Quality assurance

The Electronic Health Record (EHR)
• Rich source of patient information
• Mostly untapped
• Primary use for EHR
  – Documenting care in multi-provider environment
  – Manual review by providers
• More complete than ICD-9 coded data
  – Symptoms
  – Clinical conditions not beneficial for billing
• Fragmented
• Heterogeneous
• Noisy

Research Opportunities: NLP Issues
• Occurrence of clinical events in natural language
  – Drugs, diseases, symptoms
  – Temporal information is critical
• Irregularity of reports
  – Section headings important but abbreviated/missing
  – Use of indentation, lists, run-on sentences
  – Tables & semi-structured data in reports
• Abbreviations
  – 2/2 meaning secondary to
  – co meaning cardiac output or complaining of
• Mapping terms in text to an ontology/controlled vocabulary
  – infiltrate in chest x-ray means chest infiltrate
  – ontology terms more limited than language

Research Opportunities: Statistical Issues
• Find associations between drug, symptoms, and diseases
  – Not explicit in EHR
• Large volumes of data
  – Statistical significance vs. clinical significance
• Statistical associations – not relationships
  – Drug treats condition / Drug causes condition
• Integrating time sequences is important
  – For treats: condition must precede drug event
  – For causes: drug event must precede condition

Research Opportunities: Statistical Issues (continued)
• Confounding (indirect associations)
  – Metolazone treats heart failure (HF)
  – HF is manifested by shortness of breath (SOB)
  – Metolazone and SOB indirectly related
• Higher order associations
  – Drug interactions: Drug1, drug2, condition
  – Drug-contraindications: Drug, disease, condition
• Rare ADEs

Other Research Opportunities: Knowledge Acquisition
• Structured knowledge bases
  – UMLS relations (may_be_treated_by)
  – Proprietary ones – usually unavailable
• Text/semi-structured knowledge (need NLP)
  – Spontaneous reporting databases: indications, drugs, adverse events
  – Literature (Medline)
  – Web sites (WebMD, Micromedex)
  – Online medical textbooks
  – Claims data (Health IT payors)

Text Mining for Knowledge Acquisition
• Statistical methods: co-occurrences
  – Discovered associations between diseases and diets from literature (Weeber M 2002)
  – Identified disease candidate genes (Hristovski D 2005)
• NLP systems
  – Trends in medications based on the literature and narrative clinical reports (Chen ES 2007, 2008)
  – Semantic relations in the literature (Hristovski D 2006)

Overview of Our NLP-EHR based Pharmacovigilance System
[Figure: pipeline diagram with boxes Narrative records, Coded data, MedLEE NLP, Standardize & integrate EHR, Selecting & filtering, Detect associations, Medical knowledge, Eliminate confounding, and ADE Signals]

Natural Language Processing of EHR
[Figure: pipeline diagram, as in the overview]

Meds: Tegretol xr
      Zocor
All: Several sz meds
PMHx: sz d/o - well controlled on tegretol
      high chol - on zocor
      CAD - 60% lesion in LADM by cath
      MR - secondary to mitral prolapse
PSHx: rib fx in 2001, shoulder fx secondary to trauma
Vitals: 130/80 12 80
A/P: 54 y/o m with mult med problems, all relatively well controlled. Pt sz free, not anemic as of 2/2003. Concerned of MR and its possible long term effects.

Coded Output from NLP
med:tegretol xr
  sectname>> report medication item
  code>> UMLS:C0592163_Tegretol XR
med:zocor
  sectname>> report medication item
  code>> UMLS:C0678181_Zocor
.........
problem:mitral valve regurgitation
  sectname>> report past history item
  code>> UMLS:C0026266_Mitral Valve Insufficiency
.........
problem:rib fracture
  date>> 2001
  sectname>> report past history item

Coding Issues
• Not all conditions have codes
  – Non-communicative
• Some conditions are combinations of codes
  – Difficulty sleeping
  – Vascular injury
• Granularity of coding system
  – Many different codes for a concept
    Asthma: asthma exacerbation, asthma disturbing sleep, moderate asthma, suspected asthma, …

Standardizing Coded Data
[Figure: pipeline diagram, as in the overview, with an example at the Standardize & integrate step: HCT:20 becomes C0744727: low hematocrit]

Standardizing Coded EHR Data: Laboratory Tests and Medications
• Lab values denoting normal/abnormal vary
  – Abnormal range may depend on age, sex, ethnicity, weight
  – Change in lab values and duration must be considered
• Standardizing medications is complex & requires additional knowledge
  – Tradename to generic (Avandia → rosiglitazone)
  – Handling of combination medications
    • 1.5% Lidocaine with 1:200,000 Epinephrine
  – Handling of dose & route
    • Diazepam 2 MG Oral Tablet

Selecting and Filtering
[Figure: pipeline diagram, as in the overview, annotated at the Selecting & filtering step]
• Select using UMLS classes (diseases, medications)
• Filter out:
  – negations, past info, …
  – wrong time order

Selecting and Filtering (continued)
• Dependence on accuracy of semantic classification
  – UMLS classification errors
    - Finding: birth history, cardiac output, divorce
    + Finding: cardiomegaly, fever
• Temporal information difficult to obtain
  – An adverse drug event should only follow drug event
  – Processing of explicit time information is complex and vague
    • Yesterday, last admission, 2/5
  – Information typically occurs in reports without dates

Detect Associations
[Figure: pipeline diagram, as in the overview, annotated at the Detect associations step]
• Obtain event frequencies
• Co-occurrence frequencies
• Form 2x2 tables
• Calculate associations

Detect Associations (continued)
• Correct temporal sequence is critical
  – Drug event should precede adverse event
  – Dates are not usually stated along with events
  – Section of reports is a helpful surrogate
• Statistical associations correspond to different clinical relations
  – For pharmacovigilance:
    • Want drug causes adverse event
• Confounding caused by dependencies in data

Confounding Interdependencies
[Figure: Drug, Disease, and Adverse Event nodes with edges labeled Treats, Manifested by, and Cause_ADE]

Confounding Interdependencies (continued)
[Figure: the same diagram with nodes ML, HD, and SOB]
ML: Metolazone; HD: Hypertensive Disease; SOB: Shortness of Breath

Drug Associations Network
[Figure: network of Rx, Sx, and Dx nodes (with Rx1-n, Sx1-n, Dx1-n) and edges labeled ADE, treatment, process, and association]

Reduce Confounding
[Figure: pipeline diagram, as in the overview]

Reduce Confounding (continued)
• Collect knowledge from external sources and associations
  – Drug-treat-disease
  – Disease-manifested by-symptom
  – Drug-interacts with-drug
• Use information theory
  – Mutual Information (MI)
  – Data processing inequality: MI3 < min(MI1, MI2)
[Figure: Disease, Drug, and Adverse Event nodes with edges labeled MI1, MI2, and MI3]

Initial Study: Methods
• 6 drugs chosen
  – Ibuprofen, Morphine, Warfarin: long time on market with known ADEs
  – Bupropion, Paroxetine, Rosiglitazone: ADEs discovered after 2004
  – 1 drug class: ACE inhibitors
• 25,074 textual discharge summaries in 2004 from NYPH processed using MedLEE NLP
• Reference standard created using expert knowledge sources
• Drug-potential ADE pairs determined
• Recall/precision calculated
• Qualitative analysis performed to classify drug-potential ADE pairs detected

Initial Study: Results
• Quantitative – recall (.75), precision (.30)
• Qualitative analysis: potential drug-ADE pairs
  a. Known drug-ADEs: 30%
  b. Drug-indication pairs: 30%
  c. Remote drug-indication pair: 33%
  d. Unknown clinical associations: 6%

Confounding Interdependencies
[Figure: Drug, Disease, Disease2, and Adverse Event nodes with edges labeled Treats, Manifested by, and Cause_ADE]

Study 2: Reduction of Confounding
• Evaluation set
  – 14 associations related to 2 drugs from Study 1
• Reference standard
• Drug-ADE associations determined, and MI and the data processing inequality (DPI) used to automatically classify them

Drug-ADE Relation
  Direct:   Side effects of the drug (Rosiglitazone-headache)
  Indirect: Conditions related to the disease/symptoms the drug treats (Metolazone-shortness of breath)
  Either:   Conditions in both ‘direct’ and ‘indirect’ categories (Rosiglitazone-chest pain)

Results
• Precision
  – 0.86 when handling confounding
  – 0.31 without handling confounding

Discussion: Limitations & Future Directions
• Mutual information is the only strategy used to handle confounding
  – More complex MI strategy will be explored
  – Other statistical/knowledge based methods will be explored
• Inpatient data only/sicker patient population
  – The same methods could be used for outpatient data as well, though it is possibly noisier
• Drug dosage, drug-drug and more complex interactions should be explored

Discussion: Limitations & Future Directions (continued)
• Small evaluation data set
  – More comprehensive evaluation
• Limitations inherent from NLP, coding, association detection
• Limitations due to fragmented/incomplete patient data

Summary
• Need for more pharmacovigilance research
  – Based on the EHR
  – Using available databases and text
• Studies demonstrated promising results
• Many interesting research opportunities
  – Natural language processing
  – Statistical methods
  – Integrating different sources of data
  – Gathering knowledge from different sources
  – Automated knowledge acquisition for evidence based medicine

Acknowledgement
• NLP Data Mining group at DBMI at Columbia
  – George Hripcsak
  – Marianthi Markatou
  – Herb Chase
  – Xiaoyan Wang
  – David Albers
  – Jung-wei Fan
  – Lyudmila Shagina
  – Noemie Elhadad
• Grants
  – R01 LM007659 from NLM
  – R01 LM008635 from NLM
  – R01 LM06910 from NLM
  – 5T15LM007079 from NLM training grant

QUESTIONS
THANK YOU!
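
Supplementary sketch: the "Detect Associations" and "Reduce Confounding" slides describe forming 2x2 tables from co-occurrence frequencies, computing mutual information (MI), and applying the data processing inequality (DPI). The Python below is a minimal illustration of that idea under stated assumptions, not the system's actual implementation. The function names, the report representation (one set of event strings per report), and the toy records are all hypothetical, and MI3 is assumed to be the drug-adverse event edge of the triangle.

```python
import math
from collections import Counter


def mutual_information(records, a, b):
    """Mutual information (bits) between presence of event a and event b.

    records: list of sets, one set of extracted events per report
    (an assumed representation, chosen only for this sketch).
    """
    n = len(records)
    # 2x2 contingency table keyed by (a present, b present).
    table = Counter((a in r, b in r) for r in records)
    mi = 0.0
    for (has_a, has_b), count in table.items():
        p_ab = count / n
        # Marginals are row and column sums of the 2x2 table.
        p_a = sum(c for (x, _), c in table.items() if x == has_a) / n
        p_b = sum(c for (_, y), c in table.items() if y == has_b) / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def is_indirect(records, drug, disease, event):
    """Apply the DPI test MI3 < min(MI1, MI2) to one triangle.

    Assumption: MI3 is the drug-event edge; MI1 and MI2 are the two
    edges that pass through the disease.
    """
    mi_drug_disease = mutual_information(records, drug, disease)
    mi_disease_event = mutual_information(records, disease, event)
    mi_drug_event = mutual_information(records, drug, event)
    return mi_drug_event < min(mi_drug_disease, mi_disease_event)


# Made-up toy reports, loosely following the metolazone example in the slides.
toy_records = (
    [{"metolazone", "heart failure", "shortness of breath"}] * 3
    + [{"metolazone", "heart failure"}]
    + [{"heart failure", "shortness of breath"}] * 2
    + [set()] * 4
)

flagged = is_indirect(
    toy_records, "metolazone", "heart failure", "shortness of breath"
)
```

In this toy data the drug-symptom edge carries less information than either edge through the disease, so the pair is flagged as a likely indirect association. Real reports would first need the filtering and temporal ordering steps described in the slides, which this sketch omits.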