A Knowledge-based Approach to Retrieve Scenario Specific Free-text in a Medical Digital Library
Wesley W. Chu
Computer Science Dept,
UCLA
wwc@cs.ucla.edu
NIH Program Project Grant






A 5-year, $10M joint interdisciplinary project between Medical School and CS faculty
Project 1 -- teleradiology infrastructure
Project 2 -- neuroradiology workstation
Project 3 -- multimedia information architecture
Project 4 -- natural language processing for medical reports
Project 5 -- medical digital library
Project 5 Personnel


Project leader: Wesley W. Chu
Graduate students:
Victor Z. Liu
Wenlei Mao
Qinghua Zou
Consultants:
Hooshang Kangarloo, M.D.
Denise Aberle, M.D.
Data in a Medical Digital Library

Structured data (patient lab data, demographic data, …) -- CoBase
Images (X-rays, MRI, CT scans) -- KMeD
Free-text:
  Patient reports
  Teaching files
  Literature
  News articles
System Overview

Inputs: ad-hoc query; patient report for content correlation
Medical Digital Library (MDL): patient reports, medical literature, teaching materials, news articles
Output: query results
A Sample Patient Report

…
Tissue Source:
LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE)
…
FINAL DIAGNOSIS:
- LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION):
- LUNG CANCER, SMALL CELL, STAGE II.
…
Scenario Specific Retrieval

Patient report:
…
Tissue Source:
LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE)
…
FINAL DIAGNOSIS:
- LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION):
- LUNG CANCER, SMALL CELL, STAGE II.
…
Scenarios:
??? How to diagnose the disease -> Diagnosis-related articles
??? How to treat the disease -> Treatment-related articles
Challenge I: Indexing

Extracting domain-specific key concepts in the free text for indexing
  Free-text: Lung cancer, small cell, stage II
  Concept terms in knowledge source: stage II small cell lung cancer
Conventional methods use NLP
  Not scalable
  Cannot adapt to various forms of word permutation
Challenge II: Terms used in the query are too general

Expanding the general terms in the query to the specific terms that are used in the document
Query: lung cancer, diagnosis options
  lung cancer: matches the document (√)
  diagnosis options: no match (?) until expanded to chest x-ray, bronchography, …
Document: … the effectiveness of chest x-ray and bronchography on patients with lung cancer …
Challenge III: Mismatch between terms used in the query and the documents

Example
Query: … lung cancer, …
Document 1: … lung carcinoma … (?)
Document 2: … lung neoplasm … (?)
Document 3: anti-cancer drug combinations … (?)



Challenge I: Indexing
Challenge II: Terms in the query are too general
Challenge III: Mismatch between terms in the query and the documents
IndexFinder: Extracting domain-specific key concepts

Technique
  Permute words from the text to generate concept candidates.
  Use the knowledge base to select the valid candidates.
Problem
  Valid candidates may be irrelevant to domain-specific indexing.
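The technique above can be illustrated with a minimal sketch. It assumes a toy knowledge base keyed by order-insensitive word sets; `TOY_KB`, `tokenize`, and `find_concepts` are hypothetical names, and the brute-force enumeration is for illustration only, not a description of how IndexFinder is actually implemented.

```python
import re
from itertools import combinations

# Hypothetical toy knowledge base: word set (order ignored) -> concept name.
# A real system would load such entries from UMLS.
TOY_KB = {
    frozenset({"stage", "ii", "small", "cell", "lung", "cancer"}):
        "stage II small cell lung cancer",
    frozenset({"small", "cell", "lung", "cancer"}): "small cell lung cancer",
    frozenset({"lung", "cancer"}): "lung cancer",
    frozenset({"lung"}): "lung",
}

def tokenize(sentence):
    """Lowercase the sentence and split it on non-alphanumeric characters."""
    return [w for w in re.split(r"[^a-z0-9]+", sentence.lower()) if w]

def find_concepts(sentence, kb=TOY_KB, max_words=6):
    """Return the KB concepts whose words all occur in the sentence, in any order."""
    words = sorted(set(tokenize(sentence)))
    found = set()
    # Enumerate word subsets as concept candidates; the KB selects the valid ones.
    for size in range(1, min(max_words, len(words)) + 1):
        for combo in combinations(words, size):
            concept = kb.get(frozenset(combo))
            if concept is not None:
                found.add(concept)
    return found

concepts = find_concepts("Lung cancer, small cell, stage II")
```

Because word order is ignored, the free text "Lung cancer, small cell, stage II" still reaches the knowledge-source term "stage II small cell lung cancer". The sketch also returns more general candidates such as "lung", which is the problem the slide points out.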
Eliminating irrelevant concepts

Syntactic filter:
  Limit word permutation to the words within one sentence.
Semantic filter:
  Use the semantic type (e.g. body part, disease, treatment, diagnosis) to filter out irrelevant concepts.
  Use the ISA relationship to filter out general concepts and keep the specific ones.
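The two semantic filters can be sketched as follows, assuming each concept carries a semantic type and at most one ISA parent. The `CONCEPTS` table and both function names are hypothetical and only illustrate the idea.

```python
# Hypothetical concept records: name -> (semantic type, ISA parent or None).
CONCEPTS = {
    "lung": ("body part", None),
    "cell": ("cell", None),
    "cancer": ("disease", None),
    "lung cancer": ("disease", "cancer"),
    "small cell lung cancer": ("disease", "lung cancer"),
}

def ancestors(name, concepts):
    """Yield every ISA ancestor of a concept, nearest first."""
    seen = set()
    parent = concepts[name][1]
    while parent is not None and parent not in seen:
        seen.add(parent)
        yield parent
        parent = concepts[parent][1] if parent in concepts else None

def semantic_filter(candidates, allowed_types, concepts=CONCEPTS):
    """Keep candidates of an allowed semantic type, then drop general concepts."""
    # Step 1: semantic-type filter.
    typed = {c for c in candidates
             if c in concepts and concepts[c][0] in allowed_types}
    # Step 2: ISA filter. A concept is "general" if a more specific
    # surviving candidate has it as an ancestor.
    general = set()
    for c in typed:
        general.update(ancestors(c, concepts))
    return typed - general

kept = semantic_filter(
    {"lung", "cell", "cancer", "lung cancer", "small cell lung cancer"},
    allowed_types={"disease", "body part"},
)
```

In this toy run "cell" is dropped by its semantic type, while "cancer" and "lung cancer" are dropped as ancestors of "small cell lung cancer".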
IndexFinder Performance

Two orders of magnitude faster than conventional approaches
  No NLP
  Knowledge base (UMLS) and index files reside in main memory
  Time complexity is linear in the number of distinct words in the text
Preliminary Evaluation
  IndexFinder generates 4% more concepts than conventional approaches (using a single noun phrase)
  All concepts are relevant
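One way to get a running time that grows with the number of distinct words, rather than with the number of word permutations, is an in-memory inverted index from words to phrases. The slides do not say that IndexFinder works exactly this way; the sketch below is an assumption, and `PHRASES`, `build_index`, and `match_phrases` are hypothetical names.

```python
import re
from collections import defaultdict

# Hypothetical phrase table standing in for knowledge-base terms.
PHRASES = [
    "lung cancer",
    "small cell lung cancer",
    "stage ii small cell lung cancer",
    "lung nodule",
    "chest x ray",
]

def build_index(phrases):
    """Map each word to the ids of phrases containing it; record phrase sizes."""
    index = defaultdict(list)
    lengths = []
    for pid, phrase in enumerate(phrases):
        words = set(phrase.split())
        lengths.append(len(words))
        for w in words:
            index[w].append(pid)
    return index, lengths

def match_phrases(text, phrases, index, lengths):
    """Return phrases whose words all appear in the text, in any order."""
    hits = defaultdict(int)
    # One pass over the distinct words; each word touches only its postings.
    for w in set(re.findall(r"[a-z0-9]+", text.lower())):
        for pid in index.get(w, ()):
            hits[pid] += 1
    # A phrase matches when every one of its words was seen.
    return {phrases[pid] for pid, n in hits.items() if n == lengths[pid]}

index, lengths = build_index(PHRASES)
matched = match_phrases("Lung cancer, small cell, stage II", PHRASES, index, lengths)
```

The work per text is proportional to the postings visited for its distinct words, so no permutation is ever enumerated explicitly.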



Query Expansion (QE)

Queries in the following form benefit from expansion:
<key concept> + <general supporting concept(s)>
  e.g. lung cancer + diagnosis options
expands to
<key concept> + <specific supporting concept(s)>
  e.g. lung cancer + chest x-ray, bronchography
Traditional QE

Appends all terms that statistically co-occur with the key terms in the query
Not semantically focused
Original Query: lung cancer, diagnosis options
Expanded Query: lung cancer, radiotherapy, chemotherapy, antineoplastic agents, survival rate
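A minimal sketch of co-occurrence-based expansion shows why the result is not semantically focused. The document term sets in `DOCS` are made up for illustration, and `cooccurrence_expand` is a hypothetical helper, not a specific published method.

```python
from collections import Counter

# Hypothetical corpus: each document reduced to its set of index terms.
DOCS = [
    {"lung cancer", "radiotherapy", "chemotherapy"},
    {"lung cancer", "chemotherapy", "survival rate"},
    {"lung cancer", "chest x-ray"},
    {"asthma", "chest x-ray"},
]

def cooccurrence_expand(query_terms, docs, top_k=3):
    """Append the top_k terms that most often co-occur with any query term."""
    counts = Counter()
    for doc in docs:
        if any(t in doc for t in query_terms):
            counts.update(doc - set(query_terms))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:top_k]]

expanded = cooccurrence_expand(["lung cancer", "diagnosis options"], DOCS)
```

Although the query asks about diagnosis options, treatment terms such as chemotherapy and radiotherapy enter the expanded query simply because they co-occur with "lung cancer".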
Knowledge-based QE

Knowledge source: UMLS, by the NLM
[Figure: two layers of the knowledge source.
 Semantic Network: semantic types (Disease or Syndrome, Diagnostic Procedure, Sign or Symptom, Pharmacologic Substance, Body Parts, Injury or Poisoning) connected by "diagnoses" relations.
 Metathesaurus: concepts, e.g. lung cancer (key concept) and chest x-ray (specific supporting concept).
 Each semantic type covers a class of concepts that belong to it.]
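The two-layer lookup in the figure can be sketched as below. All tables are small hand-written stand-ins rather than real UMLS content, and the names `SEMANTIC_NETWORK`, `SEMANTIC_TYPE`, `RELATED`, `SCENARIO_RELATION`, and `expand` are hypothetical.

```python
# Hypothetical type-level relations: (source type, relation, target type).
SEMANTIC_NETWORK = {
    ("Diagnostic Procedure", "diagnoses", "Disease or Syndrome"),
    ("Pharmacologic Substance", "treats", "Disease or Syndrome"),
    ("Therapeutic or Preventive Procedure", "treats", "Disease or Syndrome"),
}

# Hypothetical concept -> semantic type assignments.
SEMANTIC_TYPE = {
    "lung cancer": "Disease or Syndrome",
    "chest x-ray": "Diagnostic Procedure",
    "bronchography": "Diagnostic Procedure",
    "radiotherapy": "Therapeutic or Preventive Procedure",
    "chemotherapy": "Therapeutic or Preventive Procedure",
    "cisplatin": "Pharmacologic Substance",
}

# Hypothetical concept-level associations for the key concept.
RELATED = {
    "lung cancer": {"chest x-ray", "bronchography", "radiotherapy",
                    "chemotherapy", "cisplatin"},
}

# The general supporting concept in the query selects a relation.
SCENARIO_RELATION = {"diagnosis": "diagnoses", "treatment": "treats"}

def expand(key_concept, scenario):
    """Return related concepts whose semantic type fits the scenario relation."""
    relation = SCENARIO_RELATION[scenario]
    key_type = SEMANTIC_TYPE[key_concept]
    wanted_types = {src for (src, rel, dst) in SEMANTIC_NETWORK
                    if rel == relation and dst == key_type}
    return sorted(c for c in RELATED.get(key_concept, ())
                  if SEMANTIC_TYPE.get(c) in wanted_types)

diagnosis_terms = expand("lung cancer", "diagnosis")
treatment_terms = expand("lung cancer", "treatment")
```

Unlike pure co-occurrence expansion, the semantic-type check keeps only the supporting concepts that fit the requested scenario.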



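Phrase-based Vector Space Model (VSM)

Query: … lung cancer, …
Documents: … lung carcinoma …; … lung neoplasm …; … anti-cancer drug combinations …
[Figure: links from the query term to document terms through the knowledge source.
 lung cancer = lung carcinoma (√)
 lung cancer and lung neoplasm: linked by parent_of
 lung cancer and anti-cancer drug combinations: relation missing!!! in the knowledge source (?)]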
Phrase-based VSM Examples

Query: “lung cancer …”
Query phrases: [(C0242379); “lung” “cancer”] …
Document: “anti-cancer drug combinations …”
Document phrases: [(C0003393); “anti” “cancer” “drug” “combin”] …
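A simplified sketch of how a phrase, stored as a concept identifier plus word stems, could be compared is given below. The scoring rule (full credit for the same concept, partial credit for shared stems) and the `stem_weight` value are assumptions made for illustration; the slides do not give the actual similarity formula. Treating "lung carcinoma" as the same concept as "lung cancer" follows the equivalence shown on the previous slide.

```python
import math

def phrase_sim(p, q, stem_weight=0.5):
    """Similarity of two phrases, each given as (concept_id, stems)."""
    (cid_p, stems_p), (cid_q, stems_q) = p, q
    if cid_p == cid_q:
        return 1.0  # same concept in the knowledge source
    sp, sq = set(stems_p), set(stems_q)
    shared = len(sp & sq)
    if shared == 0:
        return 0.0
    # Fall back to word stems: cosine of the two binary stem vectors.
    return stem_weight * shared / math.sqrt(len(sp) * len(sq))

def score(query_phrases, doc_phrases):
    """Sum, over query phrases, the best similarity to any document phrase."""
    return sum(max((phrase_sim(q, d) for d in doc_phrases), default=0.0)
               for q in query_phrases)

query = ("C0242379", ("lung", "cancer"))
doc_same_concept = ("C0242379", ("lung", "carcinoma"))
doc_shared_stem = ("C0003393", ("anti", "cancer", "drug", "combin"))

same_concept_score = score([query], [doc_same_concept])
shared_stem_score = score([query], [doc_shared_stem])
```

The stem fallback lets the query still reach "anti-cancer drug combinations" through the shared stem "cancer", even though the knowledge source has no relation between the two concepts.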
Retrieval Effectiveness Comparison
(Corpus: OHSUMED, KB: UMLS)
[Figure: precision-recall curves. y-axis: average precision over 105 queries (0 to 1); x-axis: recall (0 to 1). One curve is labeled "Stems". Annotations: 16% (100 queries) vs. 5% (50 queries).]
Application: Query Answering via Templates

Sample templates:
“<disease>, treatment”
“<disease>, diagnosis”
Example flow:
lung cancer -> IndexFinder -> Template: “<disease>, treatment” -> lung cancer, treatment -> Query Expansion -> lung cancer, radiotherapy, chemotherapy, cisplatin, … -> Phrase-based VSM -> relevant documents
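The template step of this flow can be sketched in a few lines. `TEMPLATES`, `TOY_EXPANSION`, and the helper functions are hypothetical; the expansion table simply hard-codes the example terms from the slide in place of a real query-expansion component.

```python
# Templates from the slide, keyed by scenario.
TEMPLATES = {
    "treatment": "<disease>, treatment",
    "diagnosis": "<disease>, diagnosis",
}

# Hypothetical stand-in for query expansion, using the slide's example terms.
TOY_EXPANSION = {
    ("lung cancer", "treatment"): ["radiotherapy", "chemotherapy", "cisplatin"],
}

def fill_template(scenario, disease):
    """Instantiate the scenario template with an extracted disease concept."""
    return TEMPLATES[scenario].replace("<disease>", disease)

def parse_query(query):
    """Split '<key concept>, <general concept>' into its two parts."""
    key, _, general = query.partition(",")
    return key.strip(), general.strip()

def query_terms(disease, scenario):
    """Build the template query, then swap the general term for specific ones."""
    key, general = parse_query(fill_template(scenario, disease))
    # Fall back to the general term when no expansion is available.
    return [key] + TOY_EXPANSION.get((key, general), [general])

terms = query_terms("lung cancer", "treatment")
```

The resulting term list is what would be handed to the phrase-based VSM for ranking.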
Applications (cont’d)

Scenario-specific content correlation
Flow:
Patient Report -> IndexFinder -> Scenario Selection (Query Templates, e.g. treatment, diagnosis, etc.) -> Query Expansion -> Phrase-based VSM -> relevant documents
Conclusion

A knowledge-based (UMLS) approach provides scenario-specific medical free-text retrieval:
IndexFinder – uses word permutation together with syntactic and semantic filtering to extract domain-specific key concepts from the free text for indexing
Knowledge-based query expansion – transforms general terms in the query into the scenario-specific terms used in the documents, giving the query a higher probability of matching the relevant documents
Phrase-based indexing – transforms document indexing into a phrase paradigm (a concept and its word stems) to improve retrieval effectiveness
Acknowledgement
This research is supported in part by NIC/NIH Grant #4442511-33780
Demo
http://fargo.cs.ucla.edu/umls/search.aspx
Test Texts
• Technically successful left lower lobe nodule biopsy.
• Preliminary localization CT images again demonstrate a left lower lobe nodule adjacent to the posterior segmental bronchus.
• CT scans obtained during biopsy demonstrate the coaxial cannula adjacent to the proximal aspect of the nodule.
• Surrounding pulmonary parenchymal hemorrhage as a result of the biopsy is also noted.
• There may be a tiny left apical air collection in the pleural space lateral to the apical bulla.
• Formal cytologic evaluation of the withdrawn specimen is pending at this time, although abnormal appearing "spindle" cells were identified during on-site cytopathologic evaluation of specimen adequacy.