A Knowledge-based Approach to Retrieve Scenario Specific Free-text in a Medical Digital Library
Wesley W. Chu
Computer Science Dept, UCLA
wwc@cs.ucla.edu

NIH Program Project Grant
A 5-year, $10M joint interdisciplinary project between Medical School and CS faculty
• Project 1: teleradiology infrastructure
• Project 2: neuroradiology workstation
• Project 3: multimedia information architecture
• Project 4: natural language processing for medical reports
• Project 5: medical digital library

Project 5 Personnel
• Project leader: Wesley W. Chu
• Graduate students: Victor Z. Liu, Wenlei Mao, Qinghua Zou
• Consultants: Hooshang Kangarloo, M.D.; Denise Aberle, M.D.

Data in a Medical Digital Library
• Structured data (patient lab data, demographic data, …): CoBase
• Images (X rays, MRI, CT scans): KMeD
• Free-text
  • Patient reports
  • Teaching files
  • Literature
  • News articles

System Overview
• Input: ad-hoc query, or patient report for content correlation
• Medical Digital Library (MDL): patient reports, medical literature, teaching materials, news articles
• Output: query results

A Sample Patient Report
…
Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE)
…
FINAL DIAGNOSIS:
- LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION):
- LUNG CANCER, SMALL CELL, STAGE II.
…

Scenario Specific Retrieval
Given the sample patient report above, which articles should be retrieved?
• How to diagnose the disease: diagnosis-related articles
• How to treat the disease: treatment-related articles

Challenge I: Indexing
Extracting domain-specific key concepts in the free text for indexing
• Free-text: Lung cancer, small cell, stage II
• Concept term in knowledge source: stage II small cell lung cancer
• Conventional methods use NLP
  • Not scalable
  • Cannot adapt to various forms of word permutation

Challenge II: Terms used in the query are too general
Expanding the general terms in the query to the specific terms that are used in the documents
• Query: lung cancer, diagnosis options
• Expanded to: chest x-ray, bronchography, …
• Document: … the effectiveness of chest x-ray and bronchography on patients with lung cancer …

Challenge III: Mismatch between terms used in the query and the documents
Example
• Query: … lung cancer, …
• Document 1: … lung carcinoma …
• Document 2: … lung neoplasm …
• Document 3: anti-cancer drug combinations …

Challenge I: Indexing
Challenge II: Terms in the query are too general
Challenge III: Mismatch between terms in the query and the documents

IndexFinder: Extracting domain-specific key concepts
• Technique
  • Permute words from the text to generate concept candidates.
  • Use the knowledge base to select the valid candidates.
• Problem
  • Valid candidates may be irrelevant to domain-specific indexing.

Eliminating irrelevant concepts
• Syntactic filter: limit permutation of words to within a sentence.
• Semantic filter:
  • Use the semantic type (e.g. body part, disease, treatment, diagnosis) to filter out irrelevant concepts.
  • Use the ISA relationship to filter out general concepts and yield specific concepts.
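The candidate generation and the three filters above can be sketched roughly as follows. This is a minimal illustration, not the IndexFinder implementation: the toy knowledge base, its `K1`..`K5` identifiers, and the function names are all hypothetical, and order-independent word matching through an inverted index is assumed as a stand-in for explicit word permutation.

```python
import re
from collections import defaultdict

# Toy knowledge base (hypothetical IDs, not real UMLS identifiers).
# concept id -> (term, semantic type, ISA parent id or None)
CONCEPTS = {
    "K1": ("lung cancer", "disease", None),
    "K2": ("small cell lung cancer", "disease", "K1"),
    "K3": ("stage ii small cell lung cancer", "disease", "K2"),
    "K4": ("lung", "body part", None),
    "K5": ("fine needle aspiration", "diagnosis", None),
}


def build_word_index(concepts):
    """Inverted index: word -> ids of the concepts whose term contains it."""
    index = defaultdict(set)
    for cid, (term, _sem_type, _parent) in concepts.items():
        for word in term.split():
            index[word].add(cid)
    return index


def extract_concepts(text, concepts, allowed_types):
    index = build_word_index(concepts)
    found = set()
    # Syntactic filter: words are only combined within one sentence.
    for sentence in re.split(r"[.\n]+", text):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        hits = defaultdict(int)
        for word in words:  # one pass over the distinct words
            for cid in index.get(word, ()):
                hits[cid] += 1
        for cid, count in hits.items():
            term, sem_type, _parent = concepts[cid]
            # Valid candidate: every word of the term occurs, in any order.
            # Semantic filter: keep only the wanted semantic types.
            if count == len(set(term.split())) and sem_type in allowed_types:
                found.add(cid)
    # ISA filter: drop concepts that are ancestors of another match.
    ancestors = set()
    for cid in found:
        parent = concepts[cid][2]
        while parent is not None:
            ancestors.add(parent)
            parent = concepts[parent][2]
    return sorted(found - ancestors)


report = "FINAL DIAGNOSIS: LUNG CANCER, SMALL CELL, STAGE II."
print(extract_concepts(report, CONCEPTS, {"disease"}))
```

On the sample diagnosis line, the general concepts `K1` and `K2` match but are removed by the ISA filter, leaving only the most specific concept; `K4` ("lung") matches but is removed by the semantic-type filter.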
IndexFinder Performance
• Two orders of magnitude faster than conventional approaches
  • No NLP
  • Knowledge base (UMLS) and index files reside in main memory
  • Time complexity is linear in the number of distinct words in the text
• Preliminary evaluation
  • IndexFinder generates 4% more concepts than conventional approaches (using a single noun phrase)
  • All concepts are relevant

Query Expansion (QE)
Queries in the following form benefit from expansion:
• Before expansion: <key concept> + <general supporting concept(s)>, e.g. lung cancer + diagnosis options
• After expansion: <key concept> + <specific supporting concept(s)>, e.g. lung cancer + chest x-ray, bronchography

Traditional QE
• Appends all terms that statistically co-occur with the key terms in the query
• Not semantically focused
• Original query: lung cancer, diagnosis options
• Expanded query: lung cancer, radiotherapy, chemotherapy, antineoplastic agents, survival rate

Knowledge-based QE
• Knowledge source: UMLS, by the NLM
• Semantic Network: semantic types (Disease or Syndrome, Diagnostic Procedure, Sign or Symptom, Pharmacologic Substance, Body Parts, Injury or Poisoning) linked by relations such as "diagnoses"
• Metathesaurus: concepts; each semantic type covers a class of concepts that belong to it
• Key concept: lung cancer
• Specific supporting concept: chest x-ray

Phrase-based Vector Space Model (VSM)
• Query: … lung cancer, …
• Knowledge source: lung cancer = lung carcinoma
• Figure annotations: "√", "?", "missing!!!"
• Knowledge source: parent_of relation; concepts shown include lung neoplasm and anti-cancer drug combinations
• Document: … anti-cancer drug combinations …, … lung neoplasm …, … carcinoma …

Phrase-based VSM Examples
• Query: “lung cancer …”
  • Phrases: [(C0242379); “lung” “cancer”] …
• Document: “anti-cancer drug combinations …”
  • Phrases: [(C0003393); “anti” “cancer” “drug” “combin”] …

Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS)
[Figure: precision-recall plot. x-axis: recall (0 to 1); y-axis: average precision over 105 queries (0 to 1); legend entry: Stems; plot annotation: "16% 100 queries vs. 5% 50 queries"]

Application: Query Answering via Templates
• Sample templates: “<disease>, treatment”, “<disease>, diagnosis”
• Flow: lung cancer -> IndexFinder -> template “<disease>, treatment” -> lung cancer, treatment -> Query Expansion -> lung cancer, radiotherapy, chemotherapy, cisplatin, … -> Phrase-based VSM -> relevant documents

Applications (cont’d)
Scenario-specific content correlation
• Output: relevant documents
• Query templates, e.g. treatment, diagnosis, etc.
• Components shown: Patient Report, IndexFinder, Scenario Selection, Query Expansion, Phrase-based VSM

Conclusion
A knowledge-based (UMLS) approach provides scenario-specific medical free-text retrieval:
• IndexFinder: uses word permutation together with syntactic and semantic filtering to extract domain-specific key concepts from the free text for indexing
• Knowledge-based query expansion: transforms general terms in the query into the scenario-specific terms used in the documents, giving the query a higher probability of matching the relevant documents
• Phrase-based indexing: transforms document indexing into a phrase paradigm (a concept and its word stems) to improve retrieval effectiveness

Acknowledgement
This research is supported in part by NIC/NIH Grant#4442511-33780

Demo
http://fargo.cs.ucla.edu/umls/search.aspx

Test Texts
• Technically successful left lower lobe nodule biopsy.
• Preliminary localization CT images again demonstrate a left lower lobe nodule adjacent to the posterior segmental bronchus.
• CT scans obtained during biopsy demonstrate the coaxial cannula adjacent to the proximal aspect of the nodule.
• Surrounding pulmonary parenchymal hemorrhage as a result of the biopsy is also noted.
• There may be a tiny left apical air collection in the pleural space lateral to the apical bulla.
• Formal cytologic evaluation of the withdrawn specimen is pending at this time, although abnormal appearing "spindle" cells were identified during on-site cytopathologic evaluation of specimen adequacy.
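As a rough illustration of the last two techniques named in the Conclusion, the sketch below expands a template query through a toy semantic network and then compares phrases by concept and by stems. Everything in it is an assumption made for illustration: the dictionaries are hand-written toy data rather than an extract of UMLS, the function names are hypothetical, and the similarity rule (an identical concept scores 1, otherwise cosine over stem sets) is a simplification, not the weighting used in the actual phrase-based VSM. The two concept identifiers are the ones shown in the Phrase-based VSM Examples slide, and "lung carcinoma" is mapped to the lung cancer concept following the slide's "lung cancer = lung carcinoma".

```python
import math

# Toy knowledge source (illustrative data, not an extract of UMLS).
SEMANTIC_TYPE = {
    "lung cancer": "Disease or Syndrome",
    "chest x-ray": "Diagnostic Procedure",
    "bronchography": "Diagnostic Procedure",
    "radiotherapy": "Therapeutic or Preventive Procedure",
    "chemotherapy": "Therapeutic or Preventive Procedure",
    "cisplatin": "Pharmacologic Substance",
}
# Semantic-network edges: (source type, relation, target type).
NETWORK = {
    ("Diagnostic Procedure", "diagnoses", "Disease or Syndrome"),
    ("Therapeutic or Preventive Procedure", "treats", "Disease or Syndrome"),
    ("Pharmacologic Substance", "treats", "Disease or Syndrome"),
}
# Concepts linked to each key concept in the toy knowledge source.
RELATED = {
    "lung cancer": ["chest x-ray", "bronchography", "radiotherapy",
                    "chemotherapy", "cisplatin"],
}
# Hypothetical mapping from a template's scenario word to a relation.
SCENARIO_RELATION = {"diagnosis": "diagnoses", "treatment": "treats"}


def expand_query(key_concept, scenario):
    """Replace the general scenario term with specific supporting concepts."""
    relation = SCENARIO_RELATION[scenario]
    key_type = SEMANTIC_TYPE[key_concept]
    specific = [c for c in RELATED.get(key_concept, [])
                if (SEMANTIC_TYPE[c], relation, key_type) in NETWORK]
    return [key_concept] + specific


def phrase_similarity(p, q):
    """A phrase is (concept id, set of stems). Simplified rule (assumption):
    same concept -> 1.0, otherwise cosine similarity of the stem sets."""
    (cid_p, stems_p), (cid_q, stems_q) = p, q
    if cid_p == cid_q:
        return 1.0
    return len(stems_p & stems_q) / math.sqrt(len(stems_p) * len(stems_q))


def rank(query_phrase, doc_phrases):
    """Order document names by decreasing similarity to the query phrase."""
    scored = [(phrase_similarity(query_phrase, phrase), name)
              for name, phrase in doc_phrases.items()]
    return [name for _score, name in sorted(scored, reverse=True)]


QUERY_PHRASE = ("C0242379", frozenset({"lung", "cancer"}))
DOC_PHRASES = {
    "doc_lung_carcinoma": ("C0242379", frozenset({"lung", "carcinoma"})),
    "doc_drug_combinations":
        ("C0003393", frozenset({"anti", "cancer", "drug", "combin"})),
}

print(expand_query("lung cancer", "diagnosis"))
print(expand_query("lung cancer", "treatment"))
print(rank(QUERY_PHRASE, DOC_PHRASES))
```

In this toy setting the synonym document ranks first because it shares the query's concept, while the drug-combinations document only shares the stem "cancer" and scores lower.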