Discovering Novel Adverse Drug
Events Using Natural Language
Processing and Mining of
Electronic Health Records
Carol Friedman, PhD
Department of Biomedical Informatics
Columbia University
July 21 - AIME 2009
Motivation: Severity of Problem
• Clinical trials do not test a broad population
• Adverse Drug Events (ADEs) are a worldwide problem
• *Expense from ADEs is $5.6 billion annually
• *Estimated that over 2 million patients
hospitalized due to ADEs
• *ADEs are fourth leading cause of death
*In US alone
Motivation: Limitations of Approaches
• Manual review of case reports (Venulet J 1988)
• Spontaneous reporting to designated agency
(Evans JM 2001; Eland IA 1999; Wysowski DK 2005)
– Serious ADEs reported less than 1-10% of the time
– Reporting is voluntary for physicians/patients
– Recognition of ADEs is highly subjective
– Difficult to determine cause of ADE
– Biased by length of time on market and other factors
– Cannot determine number of patients on drug or percent at risk
• Drug prescribing/claims data (Hershman D 2007; Ray
WA 2009)
Severity of Under Reporting
Study showed 87% of time physicians
ignored patient reports of known ADEs
(Golomb et al. Physician response to patient reports of
adverse drug effects. Drug Safety 2007)
Related Work
• Automated methods mainly based on
spontaneous reporting databases
– Most methods use (Evans SJ 2001; Szarfman A 2002)
• Surrogate observed-to-expected ratios
• Incidence of drug-event reporting compared to background
reporting across all drugs and events
• Some research aimed at improving
effectiveness of SPR databases
– Create ontology of higher order adverse events
• MedDRA
– Avoid fragmentation of signal
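The observed-to-expected idea above can be illustrated with the proportional reporting ratio (PRR), one common disproportionality measure for spontaneous reporting data. This is a minimal sketch and not necessarily the exact measure used in the cited work; the function name and the counts are hypothetical.

```python
def proportional_reporting_ratio(a: int, b: int, c: int, d: int) -> float:
    """PRR from a 2x2 table of spontaneous reports.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with other drugs and the event
    d: reports with other drugs, without the event
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    rate_drug = a / (a + b)            # event rate among reports for this drug
    rate_background = c / (c + d)      # event rate among reports for all other drugs
    return rate_drug / rate_background


# Made-up counts, for illustration only.
prr = proportional_reporting_ratio(a=20, b=980, c=100, d=98900)
print(round(prr, 1))
```

A ratio well above 1 means the event is reported more often with this drug than with the background of other drugs, which is a signal to review rather than proof of causation.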
Related Work
• Pharmacoepidemiology databases used to
confirm suspicions
– General practice research database (GPRD) (Wood
& Martinez 2004)
– New Zealand Intensive Medicines Monitoring
Programme (IMMP) (Coulter 1998)
– Medicine Monitoring Unit (MEMO) (Evans et al. 2001)
• EHR databases used to find signals (Brown JS et
al. 2007; Berlowitz DR et al. 2006; Wang X et al. 2009)
– Mainly coded data used
– Has potential for active real time surveillance
– Should reduce biased reporting
Related Work
• Consortiums involving multiple EHRs
– EU-ADR project (http://www.alert-project.org/)
– eHealth initiative
(http://www.ehealthinitiative.org/drugSafety/)
• Related work using EHR to detect known
ADEs – not aimed at discovering novel ADEs
(Bates DW 2003; Honigman B 2001)
Exploiting the Electronic Health Record
[Figure: EHR data sources feed a centralized data store; NLP + integration turns it into executable data that supports applications]
• Data sources
– Text notes: primary care, inpatient, progress, specialties, admit, history
– Labs: bun 83, inr 1.3, hct 22, …
– Orders: lasix, pepcid, …
• Applications
– Decision support
– Patient safety
– Acquire knowledge
– Discovery
– Guidelines
– Surveillance
– Patient management
– Clinical trial recruitment
– Improved documentation
– Quality assurance
The Electronic Health Record (EHR)
• Rich source of patient information
• Mostly untapped
• Primary use for EHR
– Documenting care in multi-provider environment
– Manual review by providers
• More complete than coded ICD-9 codes
– Symptoms
– Clinical conditions not beneficial for billing
• Fragmented
• Heterogeneous
• Noisy
Research Opportunities: NLP Issues
• Occurrence of clinical events in natural language
– Drugs, diseases, symptoms
– Temporal information is critical
• Irregularity of reports
– Section headings important but abbreviated/missing
– Use of indentation, lists, run on sentences
– Tables & semi-structured data in reports
• Abbreviations
– 2/2 meaning secondary to
– co meaning cardiac output or complaining of
• Mapping terms in text to an ontology/controlled
vocabulary
– infiltrate in chest x-ray means chest infiltrate
– ontology terms more limited than language
Research Opportunities: Statistical Issues
• Find associations between drug, symptoms,
and diseases
– Not explicit in EHR
• Large volumes of data
– Statistical significance vs. clinical significance
• Statistical associations – not relationships
– Drug treats condition / Drug causes condition
• Integrating time sequences is important
– For treats: condition must precede drug event
– For causes: drug event must precede condition
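The time-order rule above can be sketched as a small check on two dated events. This is only an illustration under the assumption that both events carry usable dates; the function name is hypothetical, and time order alone only tells which relation remains possible.

```python
from datetime import date
from typing import Optional


def candidate_relation(drug_date: date, condition_date: date) -> Optional[str]:
    """Return the relation that the time order is compatible with."""
    if condition_date < drug_date:
        return "treats"   # condition precedes the drug event
    if drug_date < condition_date:
        return "causes"   # drug event precedes the condition
    return None           # same day: the order cannot be decided from dates alone


print(candidate_relation(date(2004, 3, 1), date(2004, 3, 10)))
```

A pair labeled "causes" here is still only a candidate ADE; the confounding issues on the next slide apply to it unchanged.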
Research Opportunities: Statistical Issues
• Confounding (indirect associations)
– Metolazone treats heart failure (HF)
– HF is manifested by shortness of breath (SOB)
– Metolazone and SOB indirectly related
• Higher order associations
– Drug interactions: Drug1, drug2, condition
– Drug-contraindications: Drug, disease, condition
• Rare ADEs
Other Research Opportunities:
Knowledge Acquisition
• Structured Knowledge bases
– UMLS relations (may_be_treated_by)
– Proprietary ones – usually unavailable
• Text/Semi-Structured Knowledge (need NLP)
– Spontaneous reporting databases: indications,
drugs, adverse events
– Literature (Medline)
– Web sites (WebMD, Micromedex)
– Online medical textbooks
– Claims Data (Health IT payors)
Text Mining for Knowledge Acquisition
• Statistical methods: co-occurrences
– Discovered associations between diseases and
diets from literature (Weeber M 2002)
– Identified disease candidate genes (Hristovski D 2005)
• NLP systems
– Trends in medications based on the literature and
narrative clinical reports (Chen ES 2007, 2008)
– Semantic relations in the literature (Hristovski D 2006)
Overview of Our NLP-EHR based
Pharmacovigilance System
[Figure: pipeline. Narrative records are processed by MedLEE NLP and, together with coded data, pass through Standardize & integrate to form the EHR; Selecting & filtering, Detect associations, and Eliminate confounding (supported by medical knowledge) then produce ADE Signals]
Natural Language Processing of EHR
[Figure: pipeline overview repeated for the MedLEE NLP step]
Example narrative record:
Meds:
Tegretol xr
Zocor
All:
Several sz meds
PMHx:
sz d/o - well controlled on tegretol
high chol - on zocor
CAD - 60% lesion in LADM by cath
MR - secondary to mitral prolapse
PSHx:
rib fx in 2001, shoulder fx secondary to trauma
Vitals: 130/80 12 80
A/P: 54 y/o m with mult med problems, all relatively well controlled. Pt sz
free, not anemic as of 2/2003. Concerned of MR and its possible long
term effects.
Coded Output from NLP
med:tegretol xr
sectname>> report medication item
code>> UMLS:C0592163_Tegretol XR
med:zocor
sectname>> report medication item
code>> UMLS:C0678181_Zocor
.........
problem:mitral valve regurgitation
sectname>> report past history item
code>> UMLS:C0026266_Mitral Valve Insufficiency
……..
problem:rib fracture
date>> 2001
sectname>> report past history item
Coding Issues
• Not all conditions have codes
– Non-communicative
• Some conditions are combinations of codes
– Difficulty sleeping
– Vascular injury
• Granularity of coding system
– Many different codes for a concept
Asthma: asthma exacerbation, asthma disturbing
sleep, moderate asthma, suspected asthma, …
Standardizing Coded Data
[Figure: pipeline overview repeated for the Standardize & integrate step. Example: the coded lab value HCT:20 is standardized to C0744727: low hematocrit]
Standardizing Coded EHR Data:
Laboratory Tests and Medications
• Lab values denoting normal/abnormal vary
– Abnormal range may depend on age, sex, ethnicity, weight
– Change in lab values and duration must be considered
• Standardizing medications is complex & requires
additional knowledge
– Tradename to generic (Avandia → rosiglitazone)
– Handling of combination medications
• 1.5% Lidocaine with 1:200,000 Epinephrine
– Handling of dose & Route
• Diazepam 2 MG Oral Tablet
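The medication points above can be sketched as a small normalizer that maps a trade name to its generic and splits off dose and route. This is a simplified sketch: the lookup tables, the `Medication` type, and the assumed string layout are all hypothetical, and combination products such as the lidocaine/epinephrine example are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical lookup tables; a real system would draw on a drug knowledge source.
BRAND_TO_GENERIC = {"avandia": "rosiglitazone"}
KNOWN_ROUTES = {"oral", "topical", "intravenous"}


class Medication(NamedTuple):
    ingredient: str
    dose: Optional[float]
    unit: Optional[str]
    route: Optional[str]


# Assumed layout: "<name> [<dose> <unit>] [<route>] ...", e.g. "Diazepam 2 MG Oral Tablet".
_MED_PATTERN = re.compile(
    r"\s*([A-Za-z]+)"                          # drug name (single word)
    r"(?:\s+(\d+(?:\.\d+)?)\s*([A-Za-z]+))?"   # optional dose and unit
    r"(?:\s+([A-Za-z]+))?"                     # optional route
)


def normalize_medication(text: str) -> Medication:
    match = _MED_PATTERN.match(text)
    if match is None:
        # Strings that do not start with a name (e.g. combination products) are rejected.
        raise ValueError(f"unrecognized medication string: {text!r}")
    name, dose, unit, route = match.groups()
    ingredient = BRAND_TO_GENERIC.get(name.lower(), name.lower())
    route = route.lower() if route and route.lower() in KNOWN_ROUTES else None
    return Medication(
        ingredient,
        float(dose) if dose else None,
        unit.lower() if unit else None,
        route,
    )


print(normalize_medication("Diazepam 2 MG Oral Tablet"))
print(normalize_medication("Avandia"))
```

Keeping ingredient, dose, and route as separate fields lets later steps count events at the ingredient level while leaving dose available for follow-up analysis.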
Selecting and Filtering
[Figure: pipeline overview repeated for the Selecting & filtering step]
• Select using UMLS classes (diseases, medications)
• Filter out:
– negations, past info, …
– wrong time order
Selecting and Filtering
• Dependence on accuracy of semantic
classification
– UMLS classification errors
- Finding: birth history, cardiac output, divorce
+ Finding: cardiomegaly, fever
• Temporal information difficult to obtain
– An adverse drug event should only follow drug event
– Processing of explicit time information is complex and vague
• Yesterday, last admission, 2/5
– Information typically occurs in reports without dates
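The select-then-filter step can be sketched as follows. The `Event` fields are hypothetical stand-ins for modifiers an NLP system might attach to each extracted term; they are not MedLEE's actual output schema, and the time-order filter is left out because it needs dates that the reports often lack.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    term: str
    semantic_class: str    # e.g. "disease", "medication", "finding"
    negated: bool = False  # e.g. "not anemic"
    past: bool = False     # e.g. "rib fx in 2001"


# Assumed selection: keep only classes of interest for drug-ADE pairs.
SELECTED_CLASSES = {"disease", "medication"}


def select_and_filter(events: Iterable[Event]) -> List[Event]:
    """Keep events of a selected class that are neither negated nor past."""
    return [
        e for e in events
        if e.semantic_class in SELECTED_CLASSES and not e.negated and not e.past
    ]


events = [
    Event("zocor", "medication"),
    Event("mitral valve regurgitation", "disease"),
    Event("anemia", "disease", negated=True),
    Event("rib fracture", "disease", past=True),
    Event("divorce", "finding"),
]
kept = [e.term for e in select_and_filter(events)]
print(kept)
```

Because selection depends on the semantic class, a classification error (such as "divorce" tagged as a finding) silently changes what reaches the association step.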
Detect Associations
[Figure: pipeline overview repeated for the Detect associations step]
• Obtain event frequencies
• Co-occurrence frequencies
• Form 2x2 tables
• Calculate associations
Detect Associations
• Correct temporal sequence is critical
– Drug event should precede adverse event
– Dates are not usually stated along with events
– Section of reports helpful surrogate
• Statistical associations correspond to different
clinical relations
– For pharmacovigilance:
• Want drug causes adverse event
• Confounding caused by dependencies in data
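The counting steps (event frequencies, co-occurrence frequencies, 2x2 tables, association score) can be sketched as below. The slides do not name the association statistic, so the chi-square used here is an assumption; the function names and the toy reports are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def two_by_two(reports: Iterable[Set[str]], drug: str, event: str) -> Tuple[int, int, int, int]:
    """Count reports by presence of the drug and the event.

    Returns (a, b, c, d): both, drug only, event only, neither.
    """
    a = b = c = d = 0
    for terms in reports:
        has_drug, has_event = drug in terms, event in terms
        if has_drug and has_event:
            a += 1
        elif has_drug:
            b += 1
        elif has_event:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no association information
    return n * (a * d - b * c) ** 2 / denom


# Toy reports: each set holds the terms kept after selecting and filtering.
reports = [
    {"warfarin", "bleeding"}, {"warfarin", "bleeding"}, {"warfarin"},
    {"ibuprofen"}, {"ibuprofen", "bleeding"}, {"ibuprofen"},
]
table = two_by_two(reports, "warfarin", "bleeding")
score = chi_square(*table)
print(table, round(score, 3))
```

The score only says that the drug and the event co-occur more than chance would predict; it does not say whether the drug treats, causes, or is merely confounded with the event.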
Confounding Interdependencies
[Figure: triangle of relations. Drug Treats Disease; Disease is Manifested by Adverse Event; Drug Cause_ADE Adverse Event]
Confounding Interdependencies
[Figure: the confounding triangle with nodes ML, HD, and SOB]
ML: Metolazone; HD: Hypertensive Disease; SOB: Shortness of Breath
Drug Associations Network
[Figure: network linking drugs (Rx, Rx1-n), symptoms (Sx, Sx1-n), and diagnoses (Dx, Dx1-n) through edges labeled ADE, treatment, process, and association]
Reduce Confounding
[Figure: pipeline overview repeated for the Eliminate confounding step]
Reduce Confounding
• Collect knowledge from external sources and
associations
– Drug-treat-disease
– Disease-manifested by-symptom
– Drug-interacts with-drug
• Use Information theory
– Mutual Information (MI)
– Data processing inequality
MI3 < min(MI1, MI2)
[Figure: triangle with nodes Disease, Drug, and Adverse Event; MI1 and MI2 label the two edges to Disease, and MI3 labels the Drug-Adverse Event edge]
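The mutual information and data processing inequality (DPI) test can be sketched as follows, assuming each pair of variables is summarized by a 2x2 table of report counts. This is an illustrative sketch, not the authors' implementation; the function names are hypothetical.

```python
import math


def mutual_information(a: int, b: int, c: int, d: int) -> float:
    """Mutual information (in bits) of two binary variables from 2x2 counts.

    a: both present, b: first only, c: second only, d: neither.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty table")
    # Each cell paired with its row total and column total.
    cells = [(a, a + b, a + c), (b, a + b, b + d), (c, c + d, a + c), (d, c + d, b + d)]
    mi = 0.0
    for joint, row, col in cells:
        if joint > 0:  # zero cells contribute nothing
            mi += (joint / n) * math.log2(joint * n / (row * col))
    return mi


def is_indirect(mi_drug_event: float, mi_drug_disease: float, mi_disease_event: float) -> bool:
    """DPI test: flag the drug-event edge as likely indirect when it is
    weaker than both edges that pass through the disease."""
    return mi_drug_event < min(mi_drug_disease, mi_disease_event)


print(mutual_information(25, 25, 25, 25))   # independent variables
print(mutual_information(50, 0, 0, 50))     # perfectly dependent variables
print(is_indirect(0.1, 0.4, 0.3))
```

When the test fires, the drug-event association is treated as explained by the disease (as in the metolazone and shortness of breath example) rather than reported as an ADE signal.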
Initial Study: Methods
• 6 drugs chosen
– Ibuprofen, Morphine, Warfarin: long time on market with known ADEs
– Bupropion, Paroxetine, Rosiglitazone: ADEs discovered after 2004
– 1 drug class: ACE inhibitors
• 25,074 textual discharge summaries in 2004 from NYPH processed using MedLEE NLP
• Reference standard created using expert knowledge sources
• Drug-potential ADE pairs determined
• Recall/precision calculated
• Qualitative analysis performed to classify drug-potential ADE pairs detected
Initial Study: Results
• Quantitative
– recall (.75), precision (.30)
• Qualitative analysis: potential drug-ADE pairs
a. Known drug-ADEs: 30%
b. Drug-indication pairs: 30%
c. Remote drug-indication pair: 33%
d. Unknown clinical associations: 6%
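Recall and precision against a reference standard can be computed as in this sketch. The pairs below are placeholders made up for illustration; they are not the study's data and do not reproduce its figures.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (drug, potential ADE)


def precision_recall(detected: Set[Pair], reference: Set[Pair]) -> Tuple[float, float]:
    """Precision and recall of detected pairs against a reference standard."""
    true_positives = len(detected & reference)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Placeholder pairs, for illustration only.
detected = {("drug_a", "event_1"), ("drug_a", "event_2"), ("drug_b", "event_3"), ("drug_c", "event_4")}
reference = {("drug_a", "event_1"), ("drug_b", "event_3"), ("drug_c", "event_5")}
precision, recall = precision_recall(detected, reference)
print(precision, round(recall, 2))
```

Low precision with high recall, as reported above, means most known ADEs are found but many detected pairs are something else, which is what the qualitative breakdown examines.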
Confounding Interdependencies
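[Figure: the confounding triangle extended with a second disease node, Disease2, alongside Disease; edges labeled Treats, Manifested by, and Cause_ADE connect Drug, the disease nodes, and Adverse Event]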
Study 2: Reduction of Confounding
• Evaluation set
• 14 associations related to 2 drugs from Study 1
• Reference standard
• Drug-ADE associations determined and MI, DPI used
to automatically classify them
Drug-ADE Relation
• Direct: side effects of the drug (Rosiglitazone-headache)
• Indirect: conditions related to the disease/symptoms the drug treats (Metolazone-shortness of breath)
• Either: conditions in both ‘direct’ and ‘indirect’ categories (Rosiglitazone-chest pain)
Results
• Precision
• 0.86 when handling confounding
• 0.31 without handling confounding
Discussion: Limitations
& Future Directions
• Mutual information was the only strategy used to handle confounding
– More complex MI strategy will be explored
– Other statistical/knowledge based methods will be explored
• Inpatient data only/sicker patient population
– The same methods could be used for outpatient data, which may be noisier
• Drug dosage, drug-drug and more complex
interactions should be explored
Discussion: Limitations
& Future Directions
• Small evaluation data set
– More comprehensive evaluation
• Limitations inherent in NLP, coding, and association detection
• Limitations due to fragmented/incomplete
patient data
Summary
• Need for more pharmacovigilance research
– Based on the EHR
– Using available databases and text
• Studies demonstrated promising results
• Many interesting research opportunities
– Natural language processing
– Statistical methods
– Integrating different sources of data
– Gathering knowledge from different sources
– Automated knowledge acquisition for evidence based medicine
Acknowledgement
• NLP Data Mining group at DBMI at Columbia
– George Hripcsak
– Marianthi Markatou
– Herb Chase
– Xiaoyan Wang
– David Albers
– Jung-wei Fan
– Lyudmila Shagina
– Noemie Elhadad
• Grants
– R01 LM007659 from NLM
– R01 LM008635 from NLM
– R01 LM06910 from NLM
– 5T15LM007079 from NLM training grant
QUESTIONS
THANK YOU!