Text Mining for Surveillance II: Extracting Epidemiological Information from Free Text
Lynette Hirschman
Chief Scientist, Information Technology Center
MITRE
© 2002 The MITRE Corporation. ALL RIGHTS RESERVED.

Outline
- Why text mining for surveillance?
- What kinds of text to mine?
- What is text mining?
- Some examples
  - Prodromic “binning” in RODS
  - Processing patient records: MedLEE
  - Tracking outbreaks in the news: MiTAP
- Open research issues and conclusions

Text Mining for Surveillance
Figure labels: Data Streams; Document Classes. Sample document from the data stream:
HOSPITAL-PEDIATRIC DISCHARGE SUMMARY
NAME – ##### DATE OF ADMISSION – #### LOCATION – ##### BIRTH DATE – #####
(REASON FOR ADMISSION) SWOLLEN, PAINFUL HANDS. VOMITING. SYMPTOMS OF 18 HOURS DURATION.
(ABSTRACT) PATIENT, 1 YEAR OLD.
IS KNOWN TO HAVE SICKLE CELL DISEASES AND 2 EPISODES OF MENINGITIS. DEVELOPED SWOLLEN, PAINFUL AND WARM HANDS. HAD SEVERAL EPISODES OF VOMIINT PRIOR TO ADMISSION. LABORATORY STUDIES DID NOT REVEAL ANEMIA OR SYSTEMIC INFECTION. HYDRATION THERAPY AND BED REST WERE PROVIDED, WITH IMPORVEMENT IN 48 HOURS. WAS DISCHARGED IMPROVED. TO BE FOLLOWED IN HEMATOLOGY CLINIC.
Text Classification: key words to document classes (classes shown: diarrheal, respiratory)
Information Extraction: documents to entities, relations
Extracted Information, Summary Views (shown: ICD9: 465.9 upper respiratory infection)
Documents contain useful information for tracking outbreaks, if free text can be converted into structured data

Patient Encounter Data
- Useful information is contained in patient records
  - Clinic visits, emergency room visits, hot lines
  - Data usually occurs as stylized free text
- What to extract?
  - Information useful for prodromic or syndromic surveillance
    = Without text mining, systems often just track fluctuation in number of admissions
    = New systems (e.g., RODS) can bin text data into prodromes or syndromes
- Time is critical in detecting an outbreak
  - Delays in collecting, processing and aggregating information lead to delays in response
  - Moral: grab what you can (Chief Complaint)

Example of Clinical Data: Triage Chief Complaint (TCC)*
Sample entries: NVD COUGH SOB DIZZY NAUSEA VOMITITING
TCC is
- short (20-50 characters, 1-10 words)
- timely (available upon patient admission)
- errorful (typos and abbreviations)
*From R. Olszewski, “Bayesian Classification of Triage Diagnoses for the Early Detection of Epidemics,” Recent Advances in Artificial Intelligence: Proc. of the 16th Intl. FLAIRS Conf., pp. 412-416, AAAI Press, 2003.

Clinical Data: Radiology Report*
- Mention of a condition (hilar adenopathy) is not equivalent to assertion: Patient does not have hilar adenopathy
- Telegraphic style
- Extensive jargon (sublanguage)
- Fields vary depending on report type
*http://cat.cpmc.columbia.edu/MedLEExml/demo/

Global Disease Tracking from News
- Capture of global outbreak information
  - The recent SARS outbreak underscores the importance of global monitoring for outbreaks
  - These are often first reported in (local) news media or by informal communication (web chat rooms)
- Global outreach to capture local news is critical
  - Local news sources tend to be in local languages (requiring translation)
  - Local news may be by radio, requiring capture from broadcast news sources (radio, TV)
  - This requires more advanced text processing technology (speech transcription, translation)

Example: Global Disease Tracking
Message from Feb.
10 in ProMED* on SARS
*ProMED: Program for Monitoring Emerging Diseases: http://www.promedmail.org; displayed in MiTAP

The Components of Text Mining
Diagram labels:
- Collections: News Reports, Patient Records, MEDLINE
- Information Retrieval, Text Classification: key words to document classes (output: Document Classes)
- Information Extraction: documents to entities, relations (output: Facts, e.g., SARS traced to civet cat)
- Question Answering: question to answer
- Summaries, Tables
Sample document: the pediatric discharge summary shown above
Text Mining Modules
Text mining takes free text as input and distills some “value added” information
- Binning documents into coherent sets (e.g., prodromes)
- Extracting key entities (symptoms, diseases, locations) and relations (time, severity, frequency) from narrative text
- Summarizing the findings (in a single record or across multiple records)
- Visualizing the data: create tables from textual data for display (e.g., charts, maps)
- Finding answers to natural language questions (the “nuggets” in a collection of documents)

Text Mining (1)
- Binning technology (classification)
  - Shallow and fast
  - Usually uses “bag of words” approach
  - Must be trained to classify free text into the desired set of bins
- Extraction
  - Relies on words in context (statistical or linguistic)
  - Is designed to get at content/details, including negated or qualified conditions (deeper, slower)
  - Requires either hand-tailored rules or application of machine learning algorithms based on extensive annotated training data

Text Mining (2)
- Summarization
  - Provides distillation of information across multiple records
  - Relies on a mix of pattern recognition and semantic analysis
- Visualization
  - Takes as input values extracted from free text
  - Useful for interpreting complex spatio-temporal data (graphs, maps)
- Question answering
  - Can return “nuggets” of information
  - Works by analyzing the question, locating specific documents and extracting the right type of fact
Example #1: RODS (Real-Time Outbreak and Disease Surveillance)*
Diagram labels: Admission Records from Emergency Departments; Emergency Department (three shown); RODS System; Preprocessor; CoCo; Database; Detection Algorithms; Geographic Information System; Web Server; Graphs and Maps
*Slide courtesy of Wendy Chapman, U. Pittsburgh

Example System #1: RODS*
- RODS created at the University of Pittsburgh
  - Used in multiple deployments, including health monitoring during the 2002 Winter Olympics and Western Pennsylvania health surveillance
- RODS captures electronic medical records
  - Applies natural language processing to bin “chief complaint” into a set of syndromes, e.g., respiratory, diarrheal, rash, …
  - Detects “out of ordinary” occurrences
  - Provides temporal and geospatial visualization
*http://www.health.pitt.edu/rods/

Text Mining: Binning Into Syndromes*
- Naïve Bayesian classifier used to bin triage diagnoses into 8 syndromes
  - Unigram model gave good results: .80 to .97 area under ROC curve, depending on syndrome, compared to human experts
  - Bigram and mixture models did worse than unigram model (due to sparse data)
- Correcting spelling mistakes and expanding abbreviations improved results by about a percentage point
- Some syndromes harder than others (“botulinic” syndrome was only around 78% AUC)
*R. Olszewski, “Bayesian Classification of Triage Diagnoses for the Early Detection of Epidemics,” Recent Advances in Artificial Intelligence: Proc. of the 16th Intl. FLAIRS Conf., pp. 412-416, AAAI Press, 2003.
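To make the unigram “bag of words” binning concrete, here is a minimal sketch of a Naïve Bayes classifier over chief-complaint tokens. It is not the RODS or CoCo implementation: the training pairs, the syndrome names, the add-one smoothing and the function names `train` and `classify` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples, alpha=1.0):
    """Fit a unigram Naive Bayes model with add-alpha smoothing (an assumed choice)."""
    class_counts = Counter()            # number of complaints seen per syndrome
    word_counts = defaultdict(Counter)  # token counts per syndrome
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}

def classify(model, text):
    """Return the syndrome with the highest posterior log-probability."""
    tokens = [t for t in text.lower().split() if t in model["vocab"]]  # ignore unseen words
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        n_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:                     # log likelihood, one term per token
            score += math.log((counts[tok] + alpha) / (n_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data; real systems train on thousands of coded complaints.
TRAIN = [
    ("cough sob", "respiratory"),
    ("cough fever", "respiratory"),
    ("sob wheezing", "respiratory"),
    ("nvd", "gastrointestinal"),
    ("nausea vomiting", "gastrointestinal"),
    ("vomiting diarrhea", "gastrointestinal"),
]
model = train(TRAIN)
print(classify(model, "COUGH SOB"))
print(classify(model, "dizzy nausea vomiting"))
```

Because each token contributes independently, word order and syntax are ignored, which is why the approach suits short, positive chief complaints and why misspellings such as VOMITITING simply fall out of the vocabulary unless they are corrected first.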
Text Mining for Chief Complaint: Conclusions*
- Naïve Bayes works “well enough” for Chief Complaint
- Triage complaints are well-suited to “bag of words”
  - They are short with little syntax, few modifiers
  - They present positive complaints (no negation)
- Some issues
  - Entries may be too brief to provide adequate data to separate certain syndromes
  - There may be insufficient training data for some bins (“botulinic” bin had the fewest cases)
  - Triage complaints may lack sufficient detail for certain applications
More advanced linguistic techniques may be overkill for Chief Complaint reports
*Ivanov, O., et al. Accuracy of Three Classifiers of Acute Gastrointestinal Syndrome for Syndromic Surveillance. AMIA 2002 Ann. Symp. Proc. 345-349.

RODS
Mining syndromic information with time and geospatial coordinates allows effective monitoring for anomalous events

Example System #2: MedLEE* (Carol Friedman, Columbia)
Diagram labels: DOMAINS (Radiology, Pathology, Discharge Summaries); INSTITUTIONS; APPLICATIONS (Error Tracking, Patient Record Access, Surveillance); Medical Language Processor
Sample inputs:
- no mass or calcification noted on this xray
- lungs are clear; no gallops or rubs
- this echo shows some thickening of mitral valve
*Friedman, C., et al. A General Natural-Language Text Processor for Clinical Radiology. JAMIA 1(2): 161-174, 1994.

MedLEE Processes Complex Narrative*
HL7 format, with codes
*http://cat.cpmc.columbia.edu/MedLEExml/demo/
MedLEE Provides Multiple Output Formats*
- Mark-up version, showing only positive findings: conditions are in RED, procedures in GREEN
- Indented version, with findings and modifiers
*http://cat.cpmc.columbia.edu/MedLEExml/demo/

MedLEE Architecture¹
- Knowledge components: Lexicon, Abbrevs, WSD Rules*, Grammar, Mappings, Coding Table
- Processing stages (Text to Structured form): Pre-Processor, Parser, Phrase Regularization, Encoder, with Error Recovery
- Sample input text: the pediatric discharge summary shown earlier
*Word sense disambiguation
¹Slide courtesy of Carol Friedman

MedLEE Processing Pipeline
- Pre-processor does lexical look-up to assign semantic classes, abbreviation expansion and other disambiguation:
  - HR = heart rate or hour?
  - Discharge = patient status or sign/symptom?
- Parsing done with a semantic grammar, e.g.,
  - DEGREE + CHANGE + FINDING handles: mild increase in congestion, mildly increased congestion, …
- Phrase regularization maps semantically equivalent phrases into structured controlled vocabulary:
  - Heart appears to be slightly enlarged => enlarged heart
- Encoding maps phrases into appropriate format
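The semantic-grammar idea can be illustrated with a toy version of the DEGREE + CHANGE + FINDING pattern. This is only a sketch and not MedLEE's grammar or lexicon: the mini-lexicon, the filler-word list and the function name `parse_phrase` are hypothetical. Each word is looked up to get a semantic class and a normalized value, and a phrase is accepted only if its class sequence matches the pattern.

```python
import re

# Hypothetical mini-lexicon: surface form -> (semantic class, normalized value).
LEXICON = {
    "mild": ("DEGREE", "mild"),
    "mildly": ("DEGREE", "mild"),
    "slight": ("DEGREE", "slight"),
    "slightly": ("DEGREE", "slight"),
    "increase": ("CHANGE", "increase"),
    "increased": ("CHANGE", "increase"),
    "decrease": ("CHANGE", "decrease"),
    "decreased": ("CHANGE", "decrease"),
    "congestion": ("FINDING", "congestion"),
    "effusion": ("FINDING", "effusion"),
}
FILLERS = {"in", "of", "the"}  # function words the toy grammar skips (an assumption)

def parse_phrase(phrase):
    """Return a structured form if the phrase matches DEGREE + CHANGE + FINDING, else None."""
    tokens = [t for t in re.findall(r"[a-z]+", phrase.lower()) if t not in FILLERS]
    tagged = [LEXICON.get(t) for t in tokens]
    if None in tagged:  # a word outside the lexicon: this sketch gives up
        return None
    if [cls for cls, _ in tagged] != ["DEGREE", "CHANGE", "FINDING"]:
        return None
    return {"finding": tagged[2][1], "change": tagged[1][1], "degree": tagged[0][1]}

print(parse_phrase("mild increase in congestion"))
print(parse_phrase("mildly increased congestion"))
```

Both phrasings map to the same structured form, which is the point of regularization: downstream coding sees one representation regardless of surface wording. A real system also needs word sense disambiguation (HR, discharge) and error recovery, which this sketch omits.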
MedLEE Captures Language Complexity¹
- Synonyms and ambiguities
- Modification and predication relations
  - “enlarged heart” = “cardiac enlargement” = “heart appears to be enlarged”
- Mapping into (several) standard forms via coding tables:
  - ‘abdominal pain’ with no modifiers: C0000737 (UMLS code)
  - ‘abdominal pain’ modified by ‘no’: C0423651 (‘no abdominal pain’), C0518732 (‘abdominal pain not present’)
¹Slide courtesy of Carol Friedman

MedLEE Applications
- Framework is general enough that it has been applied to different medical domains
  - Discharge summaries, radiology, mammography, pathology, …
  - And to the biological domain as well (GENIES)
- It has been evaluated in applications to:
  - Generate alerts to isolate patients suspicious for tuberculosis
  - Detect patients who have positive mammograms
  - Compare comorbidities for community-acquired pneumonia using MedLEE encoding vs administrative data (ICD-9 codes)

Extending MedLEE¹
- Type of expertise required
  - Grammar requires NLP expertise
  - 450 rules (CXR) to 730 rules (DSUM)
  - Little change for remaining domains
- Lexicon, abbreviations, domain-specific WSD, coding table: domain expertise
- Compositional mappings: automated
Chart: Lexical Entries by domain (y-axis 2000 to 12000; domains: path, echo, ekg, rad, dsum, pe, mammo, cxr)
¹Slide courtesy of Carol Friedman

MedLEE Evaluations
- Chest radiographic reports assessed for 24 conditions¹
  - System processed ~900,000 reports on 250,000 patients
  - 150 reports compared to manual coding, with sensitivity of 0.88, specificity of 0.99
- Data extracted by MedLEE used to calculate severity scores for community-acquired pneumonia²
  - Discharge summaries: sensitivity 92%; specificity 93% compared to human coders
  - Chest x-rays: sensitivity 87%; specificity 96%
¹Hripcsak et al., Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports. Radiology, 157-163, July 2002.
²Friedman et al., Automating a Severity Score Guideline for Community-Acquired Pneumonia Employing Medical Language Processing of Discharge Summaries. AMIA 256-260. 1999.

Information Extraction: Conclusions
- MedLEE has been demonstrated to provide automated extraction of detailed information, with high correlation to human coding
- NLP can extract more detail and finer-grained information than, e.g., ICD-9 codings
- System must be manually tailored to new reports
  - E.g., radiology has a different vocabulary and style, compared to pathology or chief complaint or discharge summary
  - One hospital may differ from another in its report format and even vocabulary
With tailoring, Information Extraction can be used to extract data for retrospective studies and detailed tracking of course of disease
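The sensitivity and specificity figures quoted for the MedLEE evaluations compare system output against manual coding. As a reminder of how those two numbers are derived, the sketch below computes them from paired binary labels. The label lists are hypothetical and do not come from the cited studies; the function name is an illustration only.

```python
def sensitivity_specificity(system, reference):
    """Compare system labels to reference (manual) labels; both are lists of 0/1."""
    tp = fn = tn = fp = 0
    for sys_label, ref_label in zip(system, reference):
        if ref_label == 1:
            if sys_label == 1:
                tp += 1   # condition present and detected
            else:
                fn += 1   # condition present but missed
        else:
            if sys_label == 0:
                tn += 1   # condition absent and correctly not reported
            else:
                fp += 1   # condition absent but reported
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference needs at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels for ten reports (1 = condition present).
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
system    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(system, reference)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sensitivity is the share of truly present conditions the system found; specificity is the share of absent conditions it correctly left unreported.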
MITRE Text & Audio Processing (MiTAP)¹
- Prototype for monitoring infectious disease outbreaks & other global threats
- Delivers information on demand
  - In real time, 24x7
  - From live, on-line sources
  - Global news, at local level
  - In multiple languages
- Part of DARPA* TIDES† program
- Available to qualified users via registration at the MiTAP web site: http://mitap.sdsu.edu/p/
* Defense Advanced Research Projects Agency
† Translingual Information Detection, Extraction & Summarization
¹Damianos et al., "MiTAP, Text and Audio Processing for BioSecurity: A Case Study." In Proc. of IAAI-2002: The 14th Innovative Applications of Artificial Intelligence Conf., 2002.

System Overview
- Capture: 90+ sources, ~4K msgs/day
- 8 languages, with MT
- Grouped into news groups by source, disease, person, region, organization

SARS: Severe Acute Respiratory Syndrome
- First record in MiTAP
  - ProMED Feb 10 9PM
- MiTAP finds in US press
  - Miami Herald, Feb 11
  - Other countries: Jakarta Post Feb 12

Tracing the Record for SARS
- Searching the MiTAP archives for
  - “SARS” OR “pneumonia” OR “acute respiratory infection”
- 655 hits from Feb 1 to March 22
- Sample from search page from http://mitap.sdsu.edu/search

News Reader Interface
- Stories can be sorted by subject, source, date
- News is categorized by disease, source, region, and custom categories
- Messages are cross-posted to relevant newsgroups
- System is accessible via standard news reader or web-based search engine

MiTAP Interface
- Each article is indexed for search
- Routed to one or more newsgroups
- Tagged (via color code) for relevant entities, e.g.,
  - Disease
  - Location
  - Time
- Translated if appropriate (currently disabled)
- Callout: top locations pop up

Summarization: Daily Top 10 Diseases
- Diseases in today’s news, ranked by # articles
- # MiTAP articles today
- Click to view extracts
- Compare to yesterday’s news

Thumbnail of March 20 Top Stories on SARS

Multi-Document Summarization
- Summary of clustered documents
- Links to MiTAP docs

TIDES World Press Update (WPU)
- Daily newsletter prepared by consultant
- Collated from ~50 mostly foreign news sources in MiTAP
- Review of 800-1000 articles in <2 hours
- Designed to improve understanding of the forces shaping public perceptions globally
- Operation handed off to SPAWAR

Machine Translation
- MT in 7 languages with tagging of translations (currently disabled)
- Access to original, foreign language document

Information Extraction: Proteus-BIO*
- Automatic extraction of disease, location, date, number of victims and status from ProMED
- Accuracy of extraction: precision 79%, recall 41%
*Grishman, R., et al. Information Extraction for Enhanced Access to Disease Outbreak Reports. J. Biomedical Informatics 35 (2002) 236-246.

Visualizing Epidemics: SARS
Given extracted information (here from tabular WHO reports), data can be graphed or mapped

SARS: Rate of Change
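The Proteus-BIO slide reports precision and recall separately. A common way to combine the two into one score is the F-measure; the slides do not state an F-measure for Proteus-BIO, so the value computed here is derived from the reported 79% and 41% under the standard balanced definition. The function name `f_measure` is an illustration only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 weights them equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported for Proteus-BIO on the slide.
f1 = f_measure(0.79, 0.41)
print(round(f1, 2))  # 0.54
```

The harmonic mean is pulled toward the lower of the two numbers, so a system with high precision but low recall scores well below its precision.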
MiTAP Usage: ~600 Current Users
- Government, military and government contractors
- Medical groups: analysts, physicians
- Researchers & collaborators
  - Universities
  - DARPA/NSF contractors
- United Nations
- Non-Government Organizations
  - American Red Cross
  - ProMED
- Non-US organizations
  - European Disaster Center

Key Research Issues
- Techniques must be tailored to document type:
  - Chief complaint handled well by “bag of words”
  - More complex patient records require capture of linguistic relations (modifiers, negation)
  - Other encounters may be in the form of speech (hotline) and foreign languages
- Tasks vary:
  - “Binning” for prodromic surveillance
  - Extraction for careful tracking of particular condition and outcomes
  - Browsing and clustering are useful for global disease tracking
  - Summarization useful for following patient or outbreak over time

Extracting Meaning from Language is Hard
- Meaning may depend on context, e.g.,
  - Discharge of patient vs. bloody discharge
  - Chest negative => chest [x-ray] negative
- One meaning can be expressed in many ways:
  - Enlarged heart = cardiomegaly
- Complex syntactic relations
  - Enlarged heart = heart is enlarged
  - Severe pain and fever = (severe pain) + fever
  - Pain in left arm and wrist = left (arm and wrist)
- Language varies from domain to domain
  - New vocabulary and phrases required for every new specialty
  - If a disease isn’t named, it is hard to find it!
Language is Hard: Negation
- Chapman is developing a negation detection tool, NegEx*, to detect negation and its scope
- Distinguish:
  - Patient denies pain and shortness of breath => pain (negated); sob (negated)
  - Patient denies pain but has shortness of breath => pain (negated); sob (positive)
- Negative expressions include:
  - No, not, deny, ruled out, no complaint of, absence of, free of, without, fails to reveal, …
*http://omega.cbmi.upmc.ed/~chapman/NegEx.html

Key Research Issues: Some Problems
- Input format for medical records is irregular
  - No uniform medical record format
  - Records may be truncated, with idiosyncratic abbreviations and typos
- Desired output format is also non-standard:
  - Multiple nomenclatures and encodings, e.g.,
    = ICD9
    = SNOMED
    = UMLS
- There are many subdomains:
  - Tools needed to help automatic tailoring to new domains
  - Training data and resources (lexicons with synonym lists) are key

Conclusions
- Text mining and information extraction have been successfully applied in several relevant domains
  - Binning into syndromes
  - Extraction of complex information from patient records
  - Capture, binning and mark-up of global news on infectious disease
- There are still major challenges:
  - There is no standardized input or output
  - Systems are cumbersome to port to new subdomains and tasks
  - Evaluation is difficult: hard to evaluate quality of extraction given noisy data, complex tasks
    = Standard benchmark test sets would help
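The two “Patient denies pain …” examples on the negation slide can be reproduced with a very small trigger-and-scope sketch. This is not the NegEx algorithm itself: the forward-only scope, the use of “but” as the only scope terminator, the finding list and the function name `tag_findings` are assumptions for illustration. Phrases such as “ruled out” can follow the finding they negate and are left out of this sketch.

```python
import re

# Triggers that precede what they negate (taken from the slide's list, plus "denies").
NEGATION_TRIGGERS = ["no complaint of", "fails to reveal", "absence of", "free of",
                     "without", "denies", "deny", "not", "no"]
SCOPE_TERMINATORS = ["but"]                 # assumed: a contrast word ends the negation scope
FINDINGS = ["shortness of breath", "pain"]  # hypothetical target findings

def _alternation(phrases):
    # Longest phrases first so "no complaint of" wins over "no".
    ordered = sorted(phrases, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"

TRIGGER_RE = re.compile(_alternation(NEGATION_TRIGGERS))
TERMINATOR_RE = re.compile(_alternation(SCOPE_TERMINATORS))

def tag_findings(sentence):
    """Label each known finding in the sentence as 'negated' or 'positive'."""
    text = sentence.lower()
    scopes = []
    for trig in TRIGGER_RE.finditer(text):
        stop = TERMINATOR_RE.search(text, trig.end())
        scopes.append((trig.end(), stop.start() if stop else len(text)))
    labels = {}
    for finding in FINDINGS:
        for m in re.finditer(r"\b" + re.escape(finding) + r"\b", text):
            in_scope = any(start <= m.start() < end for start, end in scopes)
            labels[finding] = "negated" if in_scope else "positive"
    return labels

print(tag_findings("Patient denies pain and shortness of breath"))
print(tag_findings("Patient denies pain but has shortness of breath"))
```

The second sentence shows why scope matters: the same trigger word appears in both, but the contrast word cuts the negation off before the second finding.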
Acknowledgements
- I would like to thank
  - Wendy Chapman for materials on RODS
  - Carol Friedman for materials on MedLEE
  - Cmd Eric Rasmussen, US Navy, whose vision has guided MiTAP
  - Mark Prutsalis, MiTAP’s most productive user
  - Bob Younger and Sue Ellen Moore for MiTAP technology transfer to SPAWAR Systems Center
  - And my MITRE colleagues for the work on the MiTAP system:
    = Laurie Damianos (PI), Steve Wohlever, George Wilson, Marc Ubaldino, Andy Chisholm, Janet Hitzeman, Conrad Chang, Andy Shen
  - DARPA for its funding of the MiTAP work

Backup

Information Extraction Evaluations For Newswire
Chart: F-measure (Accuracy), 0 to 100, by year (1991, 1992, 1993, 1995, 1998, 1999); series: Names: English, Names: Japanese, Names: Chinese, Relations, Events. Results show “best of show” each year.
- Name extraction > 90% in English, Japanese; improving in Chinese
- Relation extraction now at over 80%
- Event extraction less than 60%, improving slowly
- Commercial name taggers exist for news reports in multiple languages

Question Answering (MITRE’s QANDA System)
Diagram labels: Collections: Gigabytes (PIR, Genbank, MEDLINE); Documents: Megabytes; Lists, Tables: Kilobytes; Phrases: Bytes; Question Answering: question to answer
Sample extracted table:
Disease  Source  Country  City_name  Date         Cases  New_cases  Dead
Ebola    PROMED  Uganda   Gula       26-Oct-2000  182    17         64
Ebola    PROMED  Uganda   Gula       5-Nov-2000   280    14         89
Ebola    PROMED  Uganda   Gulu       13-Oct-2000  42     9          30
Ebola    PROMED  Uganda   Gulu       15-Oct-2000  51     7          31
Ebola    PROMED  Uganda   Gulu       16-Oct-2000  63     12         33
Ebola    PROMED  Uganda   Gulu       17-Oct-2000  73     2          35
Ebola    PROMED  Uganda   Gulu       18-Oct-2000  94     21         39
Ebola    PROMED  Uganda   Gulu       19-Oct-2000  111    17         41

Where did Dylan Thomas die?
1. Swansea: In “Dylan: the Nine Lives of Dylan Thomas”, Fryer makes a virtue of not coming from Swansea
2. Italy: Dylan Thomas’s widow Caitlin, who died last week in Italy aged 81,
3. New York: Dylan Thomas died in New York 40 years ago next Tuesday

What diseases are caused by prions?
1. Both CJD and BSE are caused by mysterious particles of infectious protein called prions
2. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion which does not follow the normal rules of microbiology.
3. These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...
Protease-resistant prion protein interacts with...

Question Answering
- Stage 1: Question analysis
  - Find type of object that answers the question: “when” needs time, “which proteins” needs protein
- Stage 2: Document retrieval
  - Using (augmented) question, retrieve set of possibly relevant documents via information retrieval
- Stage 3: Document processing
  - Search documents for entities of the desired type using information extraction
  - Search for entities in appropriate relations
- Stage 4: Rank answer candidates
- Stage 5: Present the answer (N bytes, or a phrase or a sentence or a summary)

TREC Q&A 2000 Results (250-byte)
Harabagiu and Moldovan, Southern Methodist University
- Mean Reciprocal Rank: 76%
- First Answer Correct: 69%
- Correct Answer in Top 5: 86%
Lessons: question answering works, at least for simple factual questions
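The three TREC figures are all functions of one quantity per question: the rank of the first correct answer in the returned list. The sketch below computes them under the standard definition of mean reciprocal rank. The rank list is hypothetical, except that its first entry mirrors the Dylan Thomas example, where the correct answer (New York) appears third; the function names are illustrations only.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct answer per question, or None if none was returned."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def fraction_within(ranks, k):
    """Share of questions whose first correct answer appears at rank k or better."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

# Hypothetical results for four questions.
ranks = [3, 1, None, 1]
mrr = mean_reciprocal_rank(ranks)
first_correct = fraction_within(ranks, 1)
top5 = fraction_within(ranks, 5)
print(f"MRR={mrr:.3f} first={first_correct:.2f} top5={top5:.2f}")
```

A question answered correctly at rank 3 contributes only 1/3 to the mean, which is why the mean reciprocal rank falls between the first-answer and top-5 percentages.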
State of NLP: Metrics
- Automated systems exist now that can:
  - Return classes of documents relevant to a subject (information retrieval: IR)
  - Identify entities (90-95% accuracy) or relations among entities (70-80% accuracy) in text (information extraction: IE)
  - Answer factual questions using large document collections at 75-85% accuracy (question answering: QA)
  - Provide translations good enough for skimming
- But... these results are for news stories
- How do these results translate to medical data?

Negatives from Chapman’s NegEx site

Example of Clinical Data (1): Discharge Summary
Medical records
- have internal structure
- rely heavily on specialized terminology and abbreviations
Sample document: the pediatric discharge summary shown earlier

MedLEE: Discharge Summary

MedLEE: Discharge Summary w Mark-Up

MedLEE: Discharge Summary in HL7

First SARS Message in MiTAP: Feb 10, 2003