Extracting Epidemiological Information from Free Text

Text Mining for Surveillance II: Extracting Epidemiological Information from Free Text
Lynette Hirschman
Chief Scientist
Information Technology Center
MITRE
© 2002 The MITRE Corporation. ALL RIGHTS RESERVED.
Outline
• Why text mining for surveillance?
• What kinds of text to mine?
• What is text mining?
• Some examples
  - Prodromic “binning” in RODS
  - Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
• Open research issues and conclusions
Text Mining for Surveillance
[Figure: text mining for surveillance. Data streams (sample: de-identified hospital pediatric discharge summaries) are processed by Text Classification (key words to document classes) into Document Classes (e.g., diarrheal, respiratory), and by Information Extraction (documents to entities, relations) into Extracted Information and Summary Views (e.g., ICD9: 465.9, upper respiratory infection).]
Documents contain useful information for
tracking outbreaks – if free text can be
converted into structured data
Patient Encounter Data
• Useful information is contained in patient records
  - Clinic visits, emergency room visits, hot lines
  - Data usually occurs as stylized free text
• What to extract?
  - Information useful for prodromic or syndromic surveillance
    = Without text mining, systems often just track fluctuation in number of admissions
    = New systems (e.g., RODS) can bin text data into prodromes or syndromes
• Time is critical in detecting an outbreak
  - Delays in collecting, processing and aggregating information lead to delays in response
  - Moral: grab what you can (Chief Complaint)
Example of Clinical Data:
Triage Chief Complaint (TCC)*
NVD
COUGH SOB
DIZZY NAUSEA
VOMITITING
TCC is:
- short (20-50 characters, 1-10 words)
- timely (available upon patient admission)
- errorful (typos and abbreviations)
*From R. Olszewski, “Bayesian Classification of Triage Diagnoses for the Early Detection of Epidemics,” Recent Advances in Artificial Intelligence: Proc. of the 16th International FLAIRS Conf., pp. 412-416, AAAI Press, 2003.
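A minimal sketch of how the typos and abbreviations in chief complaints might be normalized before binning. The lookup tables are hypothetical (in particular, reading NVD as nausea, vomiting, diarrhea is an assumption); a deployed system would build them from local triage data.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "nvd": "nausea vomiting diarrhea",  # assumed expansion
}
SPELLING_FIXES = {
    "vomititing": "vomiting",
}

def normalize_complaint(text):
    """Lowercase, tokenize, fix known typos, and expand known abbreviations."""
    tokens = re.findall(r"[a-z]+", text.lower())
    normalized = []
    for token in tokens:
        token = SPELLING_FIXES.get(token, token)
        # An abbreviation may expand to several words.
        normalized.extend(ABBREVIATIONS.get(token, token).split())
    return normalized

print(normalize_complaint("COUGH SOB"))
print(normalize_complaint("VOMITITING"))
```

Unknown tokens pass through unchanged, so the tables can grow incrementally as new abbreviations are observed.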
Clinical Data: Radiology Report*
Mention of a condition (hilar adenopathy) is not equivalent to assertion: Patient does not have hilar adenopathy
Telegraphic style
Extensive jargon (sublanguage)
Fields vary depending on report type
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
Global Disease Tracking from News
• Capture of global outbreak information
  - The recent SARS outbreak underscores the importance of global monitoring for outbreaks
  - These are often first reported in (local) news media or by informal communication (web chat rooms)
• Global outreach to capture local news is critical
  - Local news sources tend to be in local languages (requiring translation)
  - Local news may be by radio, requiring capture from broadcast news sources (radio, TV)
  - This requires more advanced text processing technology (speech transcription, translation)
Example: Global Disease Tracking
Message from Feb. 10 in ProMED* on SARS, displayed in MiTAP
*ProMED: Program for Monitoring Emerging Diseases: http://www.promedmail.org
The Components of Text Mining
[Figure: components of text mining. Inputs: collections of news reports, patient records (sample: a hospital pediatric discharge summary) and MEDLINE. Components: Information Retrieval and Text Classification (key words to document classes); Information Extraction (documents to entities, relations); Question Answering (question to answer). Outputs: document classes; facts (e.g., “SARS traced to civet cat”); summaries and tables.]
Text Mining Modules
Text mining takes free text as input and distills some “value added” information
• Binning documents into coherent sets (e.g., prodromes)
• Extracting key entities (symptoms, diseases, locations) and relations (time, severity, frequency) from narrative text
• Summarizing the findings (in a single record or across multiple records)
• Visualizing the data: create tables from textual data for display (e.g., charts, maps)
• Finding answers to natural language questions (the “nuggets” in a collection of documents)
Text Mining (1)
• Binning technology (classification)
  - Shallow and fast;
  - Usually uses “bag of words” approach;
  - Must be trained to classify free text into desired set of bins
• Extraction
  - Relies on words in context (statistical or linguistic);
  - Is designed to get at content/details, including negated or qualified conditions (deeper, slower)
  - Requires either hand-tailored rules or application of machine learning algorithms based on extensive annotated training data
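To illustrate why the “bag of words” approach is shallow, the sketch below (an illustration, not taken from any of the systems discussed) shows two invented complaints with opposite meanings that collapse to the same bag once word order is discarded. Separating them requires words in context, which is what extraction supplies.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as unordered word counts."""
    return Counter(text.lower().split())

first = bag_of_words("no fever but cough")
second = bag_of_words("no cough but fever")

# Both texts yield identical features, although one negates fever
# and the other negates cough.
print(first == second)
```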
Text Mining (2)
• Summarization
  - Provides distillation of information across multiple records;
  - Relies on a mix of pattern recognition and semantic analysis
• Visualization
  - Takes as input values extracted from free text
  - Useful for interpreting complex spatio-temporal data (graphs, maps)
• Question answering
  - Can return “nuggets” of information
  - Works by analyzing the question, locating specific documents and extracting the right type of fact
Example #1: RODS (Real-Time Outbreak
and Disease Surveillance)*
[Figure: RODS system. Inputs: admission records from emergency departments. Components shown: preprocessor, CoCo, database, detection algorithms, geographic information system, web server. Outputs: graphs and maps.]
*Slide courtesy of Wendy Chapman, U. Pittsburgh
Example System #1: RODS*
• RODS created at the University of Pittsburgh
  - Used in multiple deployments, including for health monitoring during the 2002 Olympics, Western Pennsylvania health surveillance
• RODS captures electronic medical records
  - Applies natural language processing to bin “chief complaint” into a set of syndromes, e.g., respiratory, diarrheal, rash, …
  - Detects “out of ordinary” occurrences
  - Provides temporal and geospatial visualization
*http://www.health.pitt.edu/rods/
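The slides do not describe the detection algorithms RODS uses. As a hedged illustration of what detecting “out of ordinary” occurrences can mean once complaints are binned, the sketch below flags any day whose syndrome count is far above the mean of a trailing window. The window length, threshold and counts are all assumptions.

```python
from statistics import mean, stdev

def flag_anomalies(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing-window mean
    by more than `threshold` standard deviations."""
    flagged = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu, sigma = mean(history), stdev(history)
        # Skip windows with no variation to avoid dividing by zero.
        if sigma > 0 and (daily_counts[day] - mu) / sigma > threshold:
            flagged.append(day)
    return flagged

# Invented daily counts for one syndrome bin; day 7 is a spike.
respiratory_counts = [10, 12, 11, 9, 10, 11, 12, 30, 11]
print(flag_anomalies(respiratory_counts))
```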
Text Mining:
Binning Into Syndromes
• Naïve Bayesian classifier used to bin triage diagnoses into 8 syndromes
  - Unigram model gave good results: .80 to .97 area under ROC curve, depending on syndrome, compared to human experts
  - Bigram and mixture models did worse than unigram model (due to sparse data)
• Correcting spelling mistakes and expanding abbreviations improved results by about a percentage point
• Some syndromes harder than others (“botulinic” syndrome was only around 78% AUC)
*R. Olszewski, “Bayesian Classification of Triage Diagnoses for the Early Detection of Epidemics,” Recent Advances in Artificial Intelligence: Proc. of the 16th International FLAIRS Conf., pp. 412-416, AAAI Press, 2003.
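A minimal sketch of a unigram naïve Bayes classifier of the general kind described above, written from scratch with add-one smoothing. It is not the implementation from the cited paper; the training complaints and the two syndrome labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class UnigramNaiveBayes:
    def __init__(self):
        self.class_counts = Counter()            # complaints seen per syndrome
        self.word_counts = defaultdict(Counter)  # word counts per syndrome
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training data.
TRAINING = [
    ("cough sob", "respiratory"),
    ("cough fever", "respiratory"),
    ("nausea vomiting", "diarrheal"),
    ("diarrhea vomiting", "diarrheal"),
]
model = UnigramNaiveBayes()
model.train(TRAINING)
print(model.predict("cough"))
print(model.predict("vomiting diarrhea"))
```

Because each word is scored independently, the model needs only word counts per bin, which is consistent with the observation that sparse data hurt the bigram models more than the unigram model.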
Text Mining for Chief Complaint:
Conclusions*
• Naïve Bayes works “well enough” for Chief Complaint
• Triage complaints are well-suited to “bag of words”
  - They are short with little syntax, few modifiers
  - They present positive complaints (no negation)
• Some issues
  - Entries may be too brief to provide adequate data to separate certain syndromes
  - There may be insufficient training data for some bins (“botulinic” bin had the fewest cases)
  - Triage complaints may lack sufficient detail for certain applications
More advanced linguistic techniques may be overkill for Chief Complaint reports
*Ivanov, O. et al., Accuracy of Three Classifiers of Acute Gastrointestinal Syndrome for Syndromic Surveillance. AMIA 2002 Ann. Symp. Proc. 345-349
RODS
Mining syndromic information with time and geospatial coordinates allows effective monitoring for anomalous events
Example System #2: MedLEE*
(Carol Friedman, Columbia)
[Figure: MedLEE overview. A Medical Language Processor shown against three labels: DOMAINS, INSTITUTIONS, APPLICATIONS. Items shown: Radiology, Pathology, Discharge Summaries; Error Tracking, Patient Record Access, Surveillance. Sample inputs: “no mass or calcification noted on this xray”; “lungs are clear; no gallops or rubs”; “this echo shows some thickening of mitral valve”.]
*Friedman, C., et al. A General Natural-Language Text Processor for Clinical Radiology. JAMIA 1(2) 161-174. 1994
MedLEE Processes Complex Narrative*
HL7 format,
with codes
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
MedLEE Provides Multiple Output Formats
Mark-up version, showing only positive findings: conditions are in RED, procedures in GREEN
Indented version, with findings and modifiers
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
MedLEE Architecture¹
[Figure: MedLEE architecture. Text (sample: a hospital pediatric discharge summary) passes through the Pre Processor, Parser, Phrase Regularization and Encoder stages, with Error Recovery, to yield a structured form. Knowledge components: Abbrevs, WSD Rules*, Lexicon, Grammar, Mappings, Coding Table.]
*Word sense disambiguation
¹Slide courtesy of Carol Friedman
MedLEE Processing Pipeline
• Pre-processor does lexical look-up to assign semantic classes, abbreviation expansion and other disambiguation:
  - HR = heart rate or hour?
  - Discharge = patient status or sign/symptom?
• Parsing done with a semantic grammar, e.g.,
  - DEGREE + CHANGE + FINDING handles: mild increase in congestion, mildly increased congestion, …
• Phrase regularization maps semantically equivalent phrases into structured controlled vocabulary:
  - Heart appears to be slightly enlarged => enlarged heart
• Encoding maps phrases into appropriate format
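The DEGREE + CHANGE + FINDING pattern above can be sketched as follows. This is a toy illustration of a semantic grammar rule, not MedLEE's actual lexicon, grammar or output format: the lexicon entries, the function name and the dictionary output are all assumptions.

```python
# Hypothetical toy lexicon: surface word -> (semantic class, canonical form).
LEXICON = {
    "mild": ("DEGREE", "mild"),
    "mildly": ("DEGREE", "mild"),
    "increase": ("CHANGE", "increase"),
    "increased": ("CHANGE", "increase"),
    "congestion": ("FINDING", "congestion"),
}

def parse_finding(phrase):
    """Match DEGREE + CHANGE + FINDING, skipping words not in the lexicon."""
    tagged = [LEXICON[w] for w in phrase.lower().split() if w in LEXICON]
    classes = [semantic_class for semantic_class, _ in tagged]
    if classes != ["DEGREE", "CHANGE", "FINDING"]:
        return None
    degree, change, finding = (canonical for _, canonical in tagged)
    return {"finding": finding, "change": change, "degree": degree}

# Both surface variants regularize to the same structured form.
print(parse_finding("mild increase in congestion"))
print(parse_finding("mildly increased congestion"))
```

Matching on semantic classes rather than on surface words is what lets one rule cover both variants.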
MedLEE Captures Language Complexity¹
• Synonyms and ambiguities
• Modification and predication relations
  - “enlarged heart” = “cardiac enlargement” = “heart appears to be enlarged”
• Mapping into (several) standard forms via coding tables:
  - ‘abdominal pain’ with no modifiers: C0000737 (UMLS code)
  - ‘abdominal pain’ modified by ‘no’: C0423651 (‘no abdominal pain’), C0518732 (‘abdominal pain not present’)
¹Slide courtesy of Carol Friedman
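The coding-table step can be pictured as a lookup keyed on the finding together with its modifiers. In the sketch below the codes are copied from the slide, while the table layout and the function name are hypothetical and do not reflect MedLEE's internal format.

```python
# Codes are taken from the slide; the table structure is an assumption.
CODING_TABLE = {
    ("abdominal pain", False): ["C0000737"],
    ("abdominal pain", True): ["C0423651", "C0518732"],
}

def encode(finding, negated=False):
    """Return the codes for a finding, taking its negation status into account."""
    return CODING_TABLE.get((finding, negated), [])

print(encode("abdominal pain"))
print(encode("abdominal pain", negated=True))
```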
MedLEE Applications
• Framework is general enough so that it has been applied to different medical domains
  - Discharge summaries, radiology, mammography, pathology, …
  - And to the biological domain as well (GENIES)
• It has been evaluated in applications to:
  - Generate alerts to isolate patients suspicious for tuberculosis
  - Detect patients who have positive mammograms
  - Compare comorbidities for community-acquired pneumonia using MedLEE encoding vs administrative data (ICD-9 codes)
Extending MedLEE¹
• Type of expertise required
  - Grammar requires NLP expertise
  - 450 rules (CXR) – 730 rules (DSUM)
  - Little change for remaining domains
• Lexicon, abbreviations, WSD, coding table – domain expertise
• Compositional mappings - automated
[Figure: bar chart of domain-specific lexical entries (axis 2000 to 12000) for the domains path, echo, ekg, rad, dsum, pe, mammo, cxr.]
¹Slide courtesy of Carol Friedman
MedLEE Evaluations
• Chest radiographic reports assessed for 24 conditions¹
  - System processed ~900,000 reports on 250,000 patients
  - 150 reports compared to manual coding, with sensitivity of 0.88, specificity of 0.99
• Data extracted by MedLEE used to calculate severity scores for community-acquired pneumonia²
  - Discharge summaries: sensitivity 92%; specificity 93% compared to human coders
  - Chest x-rays: sensitivity 87%; specificity 96%
¹Hripcsak et al., Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports. Radiology, 157-163, July 2002.
²Friedman et al., Automating a Severity Score Guideline for Community-Acquired Pneumonia Employing Medical Language Processing of Discharge Summaries. AMIA 256-260. 1999.
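For readers less familiar with the metrics, sensitivity is the fraction of truly positive reports the system finds, and specificity is the fraction of truly negative reports it correctly leaves unflagged. The sketch below shows the arithmetic; the counts are invented so that the results match the chest radiograph figures above, and they are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were detected."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were correctly not flagged."""
    return true_neg / (true_neg + false_pos)

# Invented counts, chosen only to illustrate the formulas.
print(round(sensitivity(true_pos=88, false_neg=12), 2))
print(round(specificity(true_neg=99, false_pos=1), 2))
```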
Information Extraction: Conclusions
• MedLEE has been demonstrated to provide automated extraction of detailed information, with high correlation to human coding
• NLP can extract more detail and finer-grained information than, e.g., ICD-9 codings
• System must be manually tailored to new reports
  - E.g., radiology has a different vocabulary and style, compared to pathology or chief complaint or discharge summary
  - One hospital may differ from another in its report format and even vocabulary
With tailoring, Information Extraction can be used to extract data for retrospective studies and detailed tracking of course of disease
MITRE Text & Audio Processing (MiTAP)¹
• Prototype for monitoring infectious disease outbreaks & other global threats
• Delivers information on demand
  - In real time, 24x7
  - From live, on-line sources
  - Global news, at local level
  - In multiple languages
• Part of DARPA* TIDES† program
• Available to qualified users via registration at the MiTAP web site: http://mitap.sdsu.edu/p/
* Defense Advanced Research Projects Agency
† Translingual Information Detection, Extraction & Summarization
¹Damianos et al., "MiTAP, Text and Audio Processing for BioSecurity: A Case Study." In Proc. of IAAI-2002: The 14th Innovative Applications of Artificial Intelligence Conf., 2002.
System Overview
[Figure: MiTAP system overview. Capture: 90+ sources, ~4K msgs/day; 8 languages, with MT. Messages are grouped into news groups by source, disease, person, region, organization.]
SARS: Severe Acute Respiratory Syndrome
• First record in MiTAP
  - ProMED Feb 10 9PM
• MiTAP finds in US press
  - Miami Herald, Feb 11
  - Other countries: Jakarta Post, Feb 12
Tracing the Record for SARS
• Searching the MiTAP archives for
  - “SARS” OR “pneumonia” OR “acute respiratory infection”
• 655 hits from Feb 1 to March 22
• Sample from search page from http://mitap.sdsu.edu/search
News Reader Interface
Stories can be sorted by subject, source, date
News is categorized by disease, source, region, and custom categories
Messages are cross-posted to relevant newsgroups
System is accessible via standard news reader or web-based search engine
MiTAP Interface
• Each article is indexed for search
• Routed to one or more newsgroups
• Tagged (via color code) for relevant entities, e.g.,
  - Disease
  - Location
  - Time
• Translated if appropriate (currently disabled)
Top locations pop up
Summarization: Daily Top 10 Diseases
Diseases in today’s news, ranked by # articles
# MiTAP articles today
Click to view extracts
Compare to yesterday’s news
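A ranking of diseases by number of articles can be sketched as a simple count over the disease tags attached to each article. This is an assumed illustration, not MiTAP's code; the article tags below are invented.

```python
from collections import Counter

def top_diseases(article_tags, n=10):
    """Rank diseases by the number of articles that mention them."""
    counts = Counter()
    for tags in article_tags:
        counts.update(set(tags))  # count each article once per disease
    return counts.most_common(n)

# Invented tags for four articles.
todays_articles = [
    ["SARS", "influenza"],
    ["SARS"],
    ["SARS", "SARS"],
    ["cholera", "influenza"],
]
print(top_diseases(todays_articles, n=2))
```

Running the same function over yesterday's articles would give the comparison shown on the slide.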
Thumbnail of March 20 Top Stories on SARS
Multi-Document Summarization
Summary of clustered documents
Links to MiTAP docs
TIDES World Press Update (WPU):
• Daily newsletter prepared by consultant
• Collated from ~50 mostly foreign news sources in MiTAP
• Review of 800-1000 articles in <2 hours
• Designed to improve understanding of the forces shaping public perceptions globally
• Operation handed off to SPAWAR
Machine Translation
MT in 7 languages with tagging of translations (currently disabled)
Access to original, foreign language document
Information Extraction: Proteus-BIO*
Automatic extraction of disease, location, date, number of victims and status from ProMED
Accuracy of extraction: precision 79%, recall 41%
*Grishman, R. et al., Information Extraction for Enhanced Access to Disease Outbreak Reports, J. Biomedical Informatics 35 (2002) 236-246.
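Precision and recall are often combined into a single F-measure, the weighted harmonic mean of the two. As a worked example (the combination is computed here and is not reported on the slide), the balanced F-measure for the figures above is roughly 0.54:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 weights them equally)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 79% and recall 41%, as quoted for extraction from ProMED.
print(round(f_measure(0.79, 0.41), 2))
```

The harmonic mean is pulled toward the lower of the two values, so the low recall dominates the combined score.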
Visualizing Epidemics: SARS
Given extracted information (here from tabular WHO reports), data can be graphed or mapped
SARS: Rate of Change
MiTAP Usage: ~600 Current Users
• Government, military and government contractors
• Medical groups: analysts, physicians
• Researchers & collaborators
  - Universities
  - DARPA/NSF contractors
• United Nations
• Non-Government Organizations
  - American Red Cross
  - ProMED
• Non-US organizations
  - European Disaster Center
Key Research Issues
• Techniques must be tailored to document type:
  - Chief complaint handled well by “bag of words”
  - More complex patient records require capture of linguistic relations (modifiers, negation)
  - Other encounters may be in form of speech (hotline) and foreign languages
• Tasks vary:
  - “Binning” for prodromic surveillance
  - Extraction for careful tracking of particular condition and outcomes
  - Browsing and clustering are useful for global disease tracking
  - Summarization useful for following patient or outbreak over time
Extracting Meaning from Language is Hard
• Meaning may depend on context, e.g.,
  - Discharge of patient vs. bloody discharge
  - Chest negative => chest [x-ray] negative
• One meaning can be expressed in many ways:
  - Enlarged heart = cardiomegaly
• Complex syntactic relations
  - Enlarged heart = heart is enlarged
  - Severe pain and fever = (severe pain) + fever
  - Pain in left arm and wrist = left (arm and wrist)
• Language varies from domain to domain
  - New vocabulary and phrases required for every new specialty
  - If a disease isn’t named, it is hard to find it!
Language is Hard: Negation
• Chapman is developing a negation detection tool, NegEx*, to detect negation and scope
• Distinguish:
  - Patient denies pain and shortness of breath => pain (negated); sob (negated)
  - Patient denies pain but has shortness of breath => pain (negated); sob (positive)
• Negative expressions include:
  - No, not, deny, ruled out, no complaint of, absence of, free of, without, fails to reveal, …
*http://omega.cbmi.upmc.ed/~chapman/NegEx.html
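The two sentences above can be handled by a very simple trigger-and-scope rule: a negation trigger negates the findings that follow it, until a terminating word such as "but" ends its scope. The sketch below is a simplified illustration in that spirit and is not the NegEx algorithm itself. The trigger list follows the slide (with the inflected form "denies" added), while the terminator list and the findings lexicon are assumptions.

```python
import re

# Triggers from the slide, plus the inflected form "denies" (added here).
NEGATION_TRIGGERS = [
    "no complaint of", "fails to reveal", "ruled out", "absence of",
    "free of", "without", "denies", "deny", "not", "no",
]
SCOPE_TERMINATORS = ["but"]  # assumed: words that end a negation's scope
FINDINGS = {"pain": "pain", "shortness of breath": "sob"}  # toy lexicon

def _find(phrase, text):
    """Return the start offset of a whole-word match of phrase, or None."""
    match = re.search(r"\b" + re.escape(phrase) + r"\b", text)
    return match.start() if match else None

def tag_findings(sentence):
    """Label each known finding in the sentence as 'negated' or 'positive'."""
    splitter = r"\b(?:" + "|".join(map(re.escape, SCOPE_TERMINATORS)) + r")\b"
    results = {}
    for clause in re.split(splitter, sentence.lower()):
        starts = [s for s in (_find(t, clause) for t in NEGATION_TRIGGERS)
                  if s is not None]
        trigger = min(starts) if starts else None
        for phrase, label in FINDINGS.items():
            position = _find(phrase, clause)
            if position is not None:
                negated = trigger is not None and trigger < position
                results[label] = "negated" if negated else "positive"
    return results

print(tag_findings("Patient denies pain and shortness of breath"))
print(tag_findings("Patient denies pain but has shortness of breath"))
```

Splitting on the terminator first is what keeps "denies" from reaching across "but" in the second sentence.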
Key Research Issues: Some Problems
• Input format for medical records is irregular
  - No uniform medical record format
  - Records may be truncated, with idiosyncratic abbreviations and typos
• Desired output format is also non-standard:
  - Multiple nomenclatures and encodings, e.g.,
    = ICD9
    = SNOMED
    = UMLS
• There are many subdomains:
  - Tools needed to help automatic tailoring to new domains
  - Training data and resources (lexicons with synonym lists) are key
Conclusions
• Text mining and information extraction have been successfully applied in several relevant domains
  - Binning into syndromes
  - Extraction of complex information from patient records
  - Capture, binning and mark-up of global news on infectious disease
• There are still major challenges:
  - There is no standardized input or output
  - Systems are cumbersome to port to new subdomains and tasks
  - Evaluation is difficult: hard to evaluate quality of extraction given noisy data, complex tasks
    = Standard benchmark test sets would help
Acknowledgements
• I would like to thank
  - Wendy Chapman for materials on RODS
  - Carol Friedman for materials on MedLEE
  - Cmd Eric Rasmussen, US Navy, whose vision has guided MiTAP
  - Mark Prutsalis, MiTAP’s most productive user
  - Bob Younger and Sue Ellen Moore for MiTAP technology transfer to SPAWAR Systems Center
  - And my MITRE colleagues for the work on the MiTAP system:
    = Laurie Damianos (PI), Steve Wohlever, George Wilson, Marc Ubaldino, Andy Chisholm, Janet Hitzeman, Conrad Chang, Andy Shen
  - DARPA for its funding of the MiTAP work
Backup
Information Extraction Evaluations For Newswire
[Figure: line chart of F-measure (accuracy, 0 to 100) by year (1991, 1992, 1993, 1995, 1998, 1999) for Names: English, Names: Japanese, Names: Chinese, Relations, and Events. Results show “best of show” each year.]
• Name extraction > 90% in English, Japanese; improving in Chinese
• Relation extraction now at over 80%
• Event extraction less than 60%, improving slowly
• Commercial name taggers exist for news reports in multiple languages
Question Answering
(MITRE’s QANDA System)
[Figure: question answering in context. Collections: gigabytes (PIR, Genbank, MEDLINE); Documents: megabytes; Lists, Tables: kilobytes; Phrases: bytes. Question Answering: question to answer.]
Disease  Source  Country  City_name  Date         Cases  New_cases  Dead
Ebola    PROMED  Uganda   Gula       26-Oct-2000  182    17         64
Ebola    PROMED  Uganda   Gula       5-Nov-2000   280    14         89
Ebola    PROMED  Uganda   Gulu       13-Oct-2000  42     9          30
Ebola    PROMED  Uganda   Gulu       15-Oct-2000  51     7          31
Ebola    PROMED  Uganda   Gulu       16-Oct-2000  63     12         33
Ebola    PROMED  Uganda   Gulu       17-Oct-2000  73     2          35
Ebola    PROMED  Uganda   Gulu       18-Oct-2000  94     21         39
Ebola    PROMED  Uganda   Gulu       19-Oct-2000  111    17         41
Where did Dylan Thomas die?
1. Swansea: In “Dylan: the Nine Lives of Dylan Thomas,” Fryer makes a virtue of not coming from Swansea
2. Italy: Dylan Thomas’s widow Caitlin, who died last week in Italy aged 81,
3. New York: Dylan Thomas died in New York 40 years ago next Tuesday
What diseases are caused by prions?
1. Both CJD and BSE are caused by mysterious particles of infectious protein called prions
2. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion which does not follow the normal rules of microbiology.
3. These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...
Protease-resistant prion protein interacts with...
Question Answering
• Stage 1: Question analysis
  - Find type of object that answers the question: “when” needs time, “which proteins” needs protein
• Stage 2: Document retrieval
  - Using (augmented) question, retrieve set of possibly relevant documents via information retrieval
• Stage 3: Document processing
  - Search documents for entities of the desired type using information extraction
  - Search for entities in appropriate relations
• Stage 4: Rank answer candidates
• Stage 5: Present the answer (N bytes, or a phrase or a sentence or a summary)
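The five stages can be sketched end to end on the Dylan Thomas example. This is a toy illustration and not how QANDA works: the answer-type table, gazetteer, stopword list, substring matching and the proximity-based scoring are all assumptions made for the sketch.

```python
import re

# Hypothetical resources, for illustration only.
ANSWER_TYPE = {"where": "LOCATION", "when": "TIME"}
GAZETTEER = {"LOCATION": ["Swansea", "Italy", "New York"]}
STOPWORDS = {"where", "when", "did", "the", "a", "of"}

def answer(question, documents):
    # Stage 1: question analysis (expected answer type, content keywords).
    words = re.findall(r"[a-z]+", question.lower())
    answer_type = ANSWER_TYPE[words[0]]
    keywords = [w for w in words if w not in STOPWORDS]

    # Stage 2: document retrieval by crude substring match on any keyword.
    retrieved = [d for d in documents
                 if any(k in d.lower() for k in keywords)]

    # Stage 3: find entities of the expected type and score each by
    # (number of keywords matched, closeness of keywords to the entity).
    scored = []
    for doc in retrieved:
        text = doc.lower()
        hits = [text.find(k) for k in keywords if k in text]
        for entity in GAZETTEER.get(answer_type, []):
            pos = text.find(entity.lower())
            if pos >= 0:
                span = max(hits + [pos]) - min(hits + [pos])
                scored.append(((len(hits), -span), entity))

    # Stage 4: rank candidates. Stage 5: present them as short phrases.
    scored.sort(reverse=True)
    return [entity for _, entity in scored]

DOCS = [
    "In Dylan: the Nine Lives of Dylan Thomas, Fryer makes a virtue of not coming from Swansea",
    "Dylan Thomas's widow Caitlin, who died last week in Italy aged 81,",
    "Dylan Thomas died in New York 40 years ago next Tuesday",
]
print(answer("Where did Dylan Thomas die?", DOCS))
```

Keyword overlap alone cannot separate the Italy and New York snippets, since both contain "died"; the proximity term stands in, very crudely, for the "appropriate relations" check in Stage 3.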
TREC Q&A 2000 Results (250-byte)
Harabagiu and Moldovan, Southern Methodist University
Mean Reciprocal Rank: 76%
First Answer Correct: 69%
Correct Answer in Top 5: 86%
Lessons: question answering works - at least for simple factual questions
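Mean reciprocal rank scores each question by 1/rank of the first correct answer among the returned candidates (0 if none of the top five is correct) and averages over all questions. A minimal sketch with invented ranks:

```python
def mean_reciprocal_rank(first_correct_ranks, cutoff=5):
    """Average of 1/rank of the first correct answer; questions with no
    correct answer within `cutoff` (rank None or too deep) contribute 0."""
    total = 0.0
    for rank in first_correct_ranks:
        if rank is not None and rank <= cutoff:
            total += 1.0 / rank
    return total / len(first_correct_ranks)

# Invented results for four questions: correct at rank 1, rank 2, never, rank 1.
ranks = [1, 2, None, 1]
print(mean_reciprocal_rank(ranks))
```

On these invented ranks, half the questions have a correct first answer and three quarters have a correct answer in the top five, which are the other two measures reported on the slide.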
State of NLP: Metrics
• Automated systems exist now that can:
  - Return classes of documents relevant to a subject (information retrieval: IR)
  - Identify entities (90-95% accuracy) or relations among entities (70-80% accuracy) in text (information extraction: IE)
  - Answer factual questions using large document collections at 75-85% accuracy (question answering: QA)
  - Provide translations good enough for skimming
• But... these results are for news stories
• How do these results translate to medical data?
Negatives from Chapman’s NegEx site
Example of Clinical Data (1)
Discharge Summary
Medical records have internal structure and rely heavily on specialized terminology and abbreviations
HOSPITAL-PEDIATRIC DISCHARGE SUMMARY
NAME – #####
DATE OF ADMISSION – ####
LOCATION – #####
BIRTH DATE - #####
(REASON FOR ADMISSION)
SWOLLEN, PAINFUL HANDS. VOMITING. SYMPTOMS OF 18
HOURS DURATION.
(ABSTRACT)
PATIENT, 1 YEAR OLD. IS KNOWN TO HAVE SICKLE CELL
DISEASES AND 2 EPISODES OF MENINGITIS. DEVELOPED
SWOLLEN, PAINFUL AND WARM HANDS. HAD SEVERAL
EPISODES OF VOMIINT PRIOR TO ADMISSION.
LABORATORY STUDIES DID NOT REVEAL ANEMIA OR
SYSTEMIC INFECTION. HYDRATION THERAPY AND BED
REST WERE PROVIDED, WITH IMPORVEMENT IN 48 HOURS.
WAS DISCHARGED IMPROVED. TO BE FOLLOWED IN
HEMATOLOGY CLINIC.
MedLEE: Discharge Summary
MedLEE: Discharge Summary w Mark-Up
MedLEE: Discharge Summary in HL7
First SARS Message in MiTAP: Feb 10, 2003