CS245A8

advertisement
KMeD: A Knowledge-Based Multimedia
Medical Database System
Wesley W. Chu
Computer Science Department
University of California, Los Angeles
http://www.cobase.cs.ucla.edu
1
KMeD
A Knowledge-Based Multimedia Medical
Distributed Database System
October 1, 1991 to
September 30, 1993
A Cooperative, Spatial, Evolutionary
Medical Database System
July 1, 1993 to
June 30, 1997
Knowledge-Based Image Retrieval with
Spatial and Temporal Constructs
May 1, 1997 to
April 30, 2001
Wesley W. Chu
Alfonso F. Cardenas
Ricky K. Taira
Computer Science Department
Computer Science Department
Department of Radiological Sciences
2
Research Team
Students
John David N. Dionisio
Chih-Cheng Hsu
David Johnson
Christine Chih
Collaborators
Computer Science
Department
Alfonso F. Cardenas
UCLA Medical School
Denise Aberle, MD
Robert Lufkin, MD
Ricky K. Taira, MD
3
Significance
Query multimedia data based on image content
and spatial predicates
Use domain knowledge to relax and interpret
medical queries
Present integrated view of multiple temporal and
evolutionary data in a timeline metaphor
4
Overview
Image retrieval by feature and content
Query relaxation
Spatial query answering
Similarity query answering
Visual query interface
Timeline interface
Sample cases
5
Image Retrieval by Content
Features

size, shape, texture, density, histology
Spatial Relations

angle of coverage, shortest distance, overlapping ratio,
contact ratio, relative direction
Evolution of Object Growth

fusion, fission
6
7
8
9
10
Characteristics of Medical Queries
Multimedia
Temporal
Evolutionary
Spatial
Imprecise
11
01
O
O’
01
O
Om
Evolution: Object O evolves
into a new object O’
Fusion: Object 01, …, Om
fuse into a new object
O
On
Fission: Object O splits
into object 01, …, On
12
Case a:
The object exists with its supertype
or aggregated type.
Case c:
The life span of the object starts with
and ends before its supertype or
aggregated type.
Case b:
The life span of the object starts after
and ends with its supertype or
aggregated type.
Case d:
The life span of the object starts after
and ends before its supertype or
aggregated type.
13
Lesion
MicroLesion
MicroLesion
14
15
16
17
18
Query Modification Techniques
Relaxation


Generalization
Specialization
Association
19
Generalization and Specialization
More Conceptual Query
Generalization
Conceptual Query
Generalization
Specific Query
Specialization
Conceptual Query
Specialization
Specific Query
20
Type Abstraction Hierarchy
Presents abstract view of





Types
Attribute values
Image features
Temporal and evolutionary behavior
Spatial relationships among objects
Provides multi-level knowledge representation
21
TAH Generation for Numerical Attribute Values
Relaxation Error


Difference between the exact value and the returned
approximate value
The expected error is weighted by the probability of
occurrence of each value
DISC (Distribution Sensitive Clustering) is based on
the attribute values and frequency distribution of the
data
22
TAH Generation for Numerical Attribute Values (cont.)
2
Computation Complexity: O(n ), where n is the
number of distinct value in a cluster
DISC performs better than Biggest Cap (value only)
or Max Entropy (frequency only) methods
MDISC is developed for multiple attribute TAHs.
Computation Complexity: O(mn2), where m is the
number of attributes
23
Query Relaxation
Query
Relax
Attribute
Display
Yes
Database
Answers
No
TAHs
Query
Modification
24
Cooperative Querying for Medical Applications
Query

Find the treatment used for the tumor similar-to (loc, size)
X1 on 12 year-old Korean males.
Relaxed Query

Find the treatment used for the tumor Class X on preteen
Asians.
Association

The success rate, side effects, and cost of the treatment.
25
Type Abstraction Hierarchies for Medical Domain
Tumor (location, size)
Age
Ethnic Group
Class X
Preteens
9
10
12
Teen
Adult
[loc1 loc3]
[s1 s3]
Class Y
[locY sY]
11
Asian
Korean
African
Chinese Japanese
European
Filipino
X3
X1
X2
[loc1 s1] [loc2 s2] [loc3 s3]
26
Knowledge-Based Image Model
TAH
TAH
TAH
SR(t,b)
Tumor Size
SR(t,l)
Lateral
Ventricle
Knowledge Level
SR(t,l)
SR(t,b)
Brain
TAH
Tumor
Lateral
Ventricle
SR: Spatial Relation
b: Brain
t: Tumor
l: Lateral Ventricle
Schema Level
Representation Level
(features and contents)
27
Queries
Knowledge
Based
Query
Processing
Query Analysis and
Feature Selection
Knowledge-Based
Content Matching
Via TAHs
Query Relaxation
Query Answers
28
User Model
To customize query conditions and knowledge-based
query processing
User type
Default Parameter Values
Feature and Content Matching Policies


Complete Match
Partial Match
29
User Model (cont.)
Relaxation Control Policies

Relaxation Order

Unrelaxable Object

Preference List
Measure for Ranking
30
31
32
Query Preprocessing
Segment and label contours for objects of interest
Determine relevant features and spatial relationships
(e.g., location, containment, intersection) of the selected
objects
Organize the features and spatial relationships of objects
into a feature database
Classify the feature database into a Type Abstraction
Hierarchy (TAH)
33
Similarity Query Answering
Determine relevant features based on query input
Select TAH based on these features
Traverse through the TAH nodes to match all the images
with similar features in the database
Present the images and rank their similarity (e.g., by
mean square error)
34
Spatial Query Answering
Preprocessing




Draw and label contours for objects of interest
Determine relevant features and spatial relationships (e.g.,
location, containment, intersection) of the selected objects
Organize the features and spatial relationships of objects
into a feature database
Classify the feature database into a type abstraction
hierarchy (TAH)
35
Spatial Query Answering (cont.)
Processing



Select TAH based on t he query conditions and context
Search nodes to match the query conditions
Return images linked to the TAH node
36
Similarity Query Answering
Preprocessing



Select objects and specify features of interest in the
image
Create a feature database of the selected objects for all
images
Classify the feature databases as type abstraction
hierarchies
37
Similarity Query Answering (cont.)
Processing




Determine relevant features based on query input
Select TAH based on these features (interact with user to
resolve ambiguity)
Traverse through the TAH nodes to match all the images
with similar features in the databases
Present the images and rank their similarity (e.g., by mean
square error)
38
39
Visual Query Language and Interface
Point-click-drag interface
Objects may be represented iconically
Spatial relationships among objects are
represented graphically
40
Visual Query Example
Retrieve brain tumor cases where
a tumor is located in the region as
indicated in the picture
41
42
43
44
45
46
47
48
49
50
51
Implementation
Sun Sparc 20 workstations (128 MB RAM, 24-bit
frame buffer)
Oracle Database Management System
X/Motif Development Environment, C++
Mass Storage of Images (9 GB)
52
53
54
55
56
Conclusions
Image retrieval by feature and content
Matching and relaxation images based on features
Processing of queries based on spatial relationships
among objects
Answering of imprecise queries
Expression of queries via visual query language
Integrated view of temporal multimedia data in a
timeline metaphor
57
58
59
Semi-Automatic Segmentation of Lung Tumors
interesting
area
classification
seed
estimation
region
growing
tumor
segment
adaptive fusion
60
61
62
63
Medical Digital Library to Support
Scenario Specific Information Retrieval
Wesley W. Chu
wwc@cs.ucla.edu
Computer Science Department
University of California
Los Angeles, California
64
A Project of the NIH Grant at UCLA
A Digital File Room for Patient Care,
Education, and Research
Wesley W. Chu, PhD
Hooshang Kangarloo, MD
Usha Sinha, PhD
David B. Johnson, PhD
Bernard Churchill, MD
John D. N. Dionisio, PhD
Richard Johnson, MD
Osman Ratib, MD, PhD
65
Background
Current file rooms managing patient records have
limited functionality

Main goal of mapping patient ID to patient records
• PACS implementations are an electronic version
of the traditional file room
66
Background
Finding relevant
information for a
particular user is time
consuming and
labor intensive
Lack of structure makes...
Poorly structured and
incomplete results, which
may affect patient
management
Current search tools
limited for general use and
not tailored to specific
users or tasks
67
Digital File Room Requirements
A navigable information space providing:
 Relevant and reputable information
 Access to similar patient records
 Content-based cross referencing
 Dynamically updated data repository
 Tailored access for specific users and devices
68
Hypotheses
A digital file room (digital library) that delivers
relevant and structured answers to specific query
can be developed from existing medical databases
Such a digital file room will increase user
satisfaction and improve patient management
69
Specific Aims
SA1
Develop a system that identifies and provides access
to reputable information sources
SA2
Provide users with greater query capability (e.g.
similar-to, approximate)
SA3
Extract knowledge from patient data, medical
literature and radiology teaching files to support
content-based cross-referencing
SA4
Provide access to dynamically updated collections
based on patient data
SA5
Adapt information retrieval to user and device
characteristics
70
Significance
Extend patient record to provide tailored and
timely access to a broader array of reputable
medical information
71
Approach and Innovations
Intelligent information registration

Provide access to multiple, related data sources through a single
access point
Content-based navigation and matching


Develop similarity matching based on medical concepts & patterns
Content correlation
User and device modeling

Adaptive information retrieval based on user and device models
Scenario-based information web (proxies)

Develop information web linking clustered data sources for a
given set of related tasks (i.e., scenario)
72
Intelligent Information Registration
Registers multiple information sources to provide transparent
access through a single point (proxy object).


Information requests are routed to appropriate data sources based on
query characteristics
Data sources are hierarchically clustered according to a four-layer data
model
proxy-object
(access point)
Patient
meta-data
summarization
Procedure database data:
billing, cpt
Procedures
Ortho
Incontinence
Labs
Neurological Incontinence
Ortho
Laboratory
databases
73
Content-Based Navigation & Matching
Two types of navigation
 Navigation of the information space using
proxies and content correlation
 Pattern/similarity navigation using type
abstraction hierarchies (TAHs)
74
Pattern-Based Type Abstraction Hierarchies
Incontinence
Scalable, hierarchical
knowledge structures
that facilitate
similarity matching
Adequate
holding
Poor
holding
Type IV
poor holding,
poor storage,
poor emptying
Type V
adequate holding,
poor storage,
poor emptying
Type II
adequate holding,
adequate storage,
poor emptying
6 day
M
7 mo
F
12 yr
M
25 yr
F
28 day
M
Type III
poor holding,
adequate storage,
poor emptying
24 mo
F
15 yr
M
20 yr
F
Adaptive Information Retrieval
Tailors query processing and query results according to:


Particular user
Characteristics of their device
Examples:


Doctors prefer JAMA or Lancet while patients prefer Time or CNN.
High resolution workstations support large, detailed imaging studies
while portable devices need lower-bandwidth data.
Allows the system to retrieve appropriate data for a particular
query, user, and device
76
Scenario-Based Proxy
A framework that defines, for a particular domain and set of
tasks, the access methods to and the relationships between
information sources.
Patient
– intelligent information registration
Procedures
UCLA
HFC
Labs
MD Office
Adequate holding
HFC Blood
UCLA Blood
Inadequate
holding
– pattern-based similarity matching
Type II
Type V
Type III
Type IV
– adaptive information retrieval
– information web
77
Scenario-Based Information Web
A directed graph that defines access paths for navigation
among proxy objects
correlated-to
similar-to
Literature
Patient
correlated-to
similar-to
Teaching File
78
Scenario-Based Information Web
Similar-to links relate objects based on their similarity

patients similar by age, sex, and disease
• Correlated-to links relate objects based on related content
– disease can be correlated to relevant literature
correlated-to
similar-to
Patient
Literature
similar-to
correlated-to
Teaching File
Extends the scope of the digital file room into a digital medical
library
79
Research Progress
Phrase Indexing
Phrase generated from a n-word combination in a
sentence.
 Domain Specific Retrieval
 Document Summarization
Content Correlation

Linking of relevant documents via patterns
80
Domain Specific Retrieval
Document are grouped into domain-specific
collections


Medical patient reports
Web sites are often tailored to specific subject areas
Phrases can capture content better than single word,
thus improve retrieval performance
81
Problem With Longer Phrases
Large combinatorial problem
1.00E+12
1.00E+11
1.00E+10
100 word
document
125 word
document
1.00E+09
1.00E+08
1.00E+07
150 word
document
100^n
1.00E+06
1.00E+05
1.00E+04
14-word
sentence
1.00E+03
1.00E+02
1.00E+01
1.00E+00
1
2
3
4
5
6
To process longer phrases it is necessary to partition
documents into smaller segments
82
Phrase Analysis
A phrase is defined as
any 2, 3 or 4 words
co-occurring in a
sentence (word
combination)
Very large number of
possible phrases


Use a stoplist to
remove “useless”
words
Normalize words to a
common stem
The
right upper
lobe
mass
is
seen again.
the
right upper
lobe
mass
is
seen
again
right upper lobe
mass
seen
again
stemming
right
upp
lob
mass
seen
again
sorting
again
lob
mass
right
seen
upp
mass
right
again
lob
sentence
case
normalization
stop word
removal
candidate
2-word
combinations
lob
mass
mass
seen
lob
right
mass
upp
lob
seen
right
seen
lob
upp
right
upp
seen
upp
again mass
again right
again
seen
again
upp
83
Document Retrieval Evaluation
Preliminary evaluation



A domain specific collection of documents
Can phrase analysis limited to sentences improve retrieval
effectiveness?
SMART system (single word terms) used as baseline
Data



Thoracic radiology patient reports
Dictated reports
Describe anatomy and abnormal findings such as enlarged
lymph nodes and cancer masses
84
Domain Specific Document Retrieval
Query: “right upper lobe mass”
85
Automatic Text Summarization
Salton Method
• Given a text file with n paragraphs
• A paragraph can be represented by Di=(di1, di2, …, dim)
– dik is the weight to represent the importance for term Tk(word
or phrase)
P1
P2
• The pair-wise similarity of two paragraphs
Sim(Di,Dj) =  dik * djk , k = 1..m
Pn
P3
Text relationship map:
P5
• Nodes = paragraph
• Links = pair-wise similarity of the connected nodes
• Links are created if Sim(Di, Dj) > threshold
P4
Bushiness of a node = # of links of a node
Text Summarization derived from the Bushy nodes.
86
Performance Comparison of Sultan’s Summarization
Method Based on Phrase and Single Word
Aspirin.txt
Threshold
Paragraphs
Ranking
Based on
Bushiness
No.1
No.2
No.3
No.4
No.5
No.6
No.7
No.8
No.9
0.1
4
6
8
1
5
2
3
9
7
words
0.2
6
8
3
4
5
1
2
9
7
2W phrases
3W phrases
0.3 0.1 0.2 0.3 0.1 0.2 0.3
8
2
2
2
2
2
2
2
3
3
3
3
3
3
3
6
6
6
8
8
8
4
1
4
4
4
4
4
5
8
5
5
6
6
6
6
4
1
1
5
5
5
1
5
8
8
7
7
7
9
7
7
7
1
1
1
7
9
9
9
9
9
9
Summarization based on Phrases are less sensitive to
Threshold setting than Single Words.
87
N-words Distribution
500
Aspirin1
450
Aspirin2
Number
400
Elian04
350
LAPD06
300
CNN-Bush
250
CNN-Florida
200
150
100
50
0
1
2
3
4
5
6
7
8
9
N-Word
88
Number Distinct Freq Words
Number of Distinct Frequent Words
180
Aspirin1
Aspirin2
150
Elian04
LAPD06
120
CNN-Bush
90
CNN-Florida
60
30
0
1
2
3
4
5
6
7
8
9
N-Word
89
Number of Valid Sentences
Number of Valid Sentences
90
Aspirin1
80
Aspirin2
70
Elian04
60
LAPD06
CNN-Bush
50
CNN-Florida
40
30
20
10
0
1
2
3
4
5
6
7
8
9
N-Word
90
Performance Comparison
Saton
0.1
Apirin01
13 sent
Apirin02
68 sent
Elian04
92 sent
David
0.2
df
0.3
1
2,12,9,3,7
3,9,1,4,7
1,2,3,7,9
9,2,3,12,7
0
2
9,2,3,4,7
2,3,4,9,7
2,1,3,7,9
9,12,2,3,4
1
3
2,9,3,12,7
2,9,12,7,1
12,2,7,9,3
12,9,2,1,7
0
4
2,9,12,4,7
9,2,12,4,7
9,12,3,4,7
12,9,4,2,7
0
5
12,4,9
12,4,9
12,4,9
12,4,9
0
1
14,12,22,66,20
12,22,36,66,1
1,12,14,15,20
14,12,66,22,20
0
2
12,14,66,15,20
36,15,20,22,66
66,1,12,14,20
14,12,66,22,15
0
3
12,14,66,22,21
14,22,66,12,36
66,12,14,21,22
14,12,66,22,18
0
4
14,66,12,21,22
14,66,22,12,21
12,14,22,66,68
14,22,12,66,68
0
5
14,12,15,18,22
14,12,15,18,22
14,12,18,22,36
14,12,18,15,22
0
1
26,76,33,59,2
26,33,76,2,59
26,76,2,44,33
26,76,2,33,7
1
2
26,76,7,33,29
26,76,82,7,29
6,76,26,27,29
26,76,7,2,59
1
3
6,26,7,76,2
6,2,27,26,44
6,27,26,2,29
26,7,76,2,59
1
4
26,2,76,6,7
26,6,7,24,28
7,24,26,84,85
26,7,84,2,85
0
5
26,7,76,84,6
26,7,76,84,85
7,24,26,28,85
7,26,85,84,2
1
91
Comparison (cont’d)
Saton
0.1
LAPD06
27 sent
CNNbush
14 sent
Florida
49 sent
0.2
David
df
0.3
1
6,7,20,25,5
6,19,20,25,7
6,7,14,19,25
6,7,20,25,19
0
2
18,20,6,19,25
6,18,19,24,20
19,18,24,9,20
5,7,1,20,6
3
3
19,6,7,18,25
7,25,5,6,1
1,5,7,14,19
1,5,7,20,12
2
4
7,5,6,1,8
7,5,1,6,8
1,5,6,7,9
1,5,7,12,17
2
5
1,5,7,12,17
1,5,7,12,17
1,5,7,12,17
1,5,7,12,17
0
1
12,5,6,8,11
12,5,6,11,8
5,6,11,12,1
5,12,8,11,6
0
2
12,5,8,3,11
5,8,12,3,7
5,12,8,7,3
5,12,8,11,9
1
3
5,12,3,8,10
5,8,3,9,10
5,8,10,12,6
5,12,11,9,8
1
4
5,12,8,7,9
5,12,6,8,9
5,12,6,8,9
5,11,12,9,8
1
5
5,8,12,6,7
5,12,6,8,9
5,12,6,8,9
5,11,12,9,8
1
29,11,2,41,40
29,41,11,26,2
29,41,26,11,14
29,11,17,40,48
2
2
11,29,40,17,48
17,29,40,22,11
17,20,35,40,22
17,11,20,40,22
0
3
17,29,20,40,48
17,6,22,29,11
28,22,26,4,6
17,20,11,22,40
0
4
17,11,22,6,2
2,11,20,6,17
2,11,20,6,17
17,20,11,22,2
0
5
2,11,20,17,22
2,11,20,17,22
2,11,17,20,22
17,11,22,2,20
0
92
Content Correlation
Given a document in one collection, content
correlation links relevant documents in another
document collection
Time
CNN
Patient
Records
New England
Journal of Medicine
93
Document Cluster By Pattern
A pattern is a set of unique terms that characterize
some features in the data set
Patterns can be found in a collection of documents
by data mining
Documents are grouped into clusters based on
patterns via clustering technique
94
Cluster Signature
Every cluster can be classified according to the
occurrence frequency of the patterns
Looking to answer:


The set of patterns summarize a given cluster?
How the patterns related among the clusters ?
Literature
Patient Records
95
Deriving Cluster Signature
Metrics


Local Cluster Certainty (LCC) measures the coverage of a
pattern in a given cluster (Popularity)
The Global Cluster Certainty (GCC) measures the coverage
of a pattern among clusters (Exclusiveness)
The Cluster Signature is the set of those patterns that have
both high LCC and GCC
Documents from one collection (source) can be linked to
relevant clusters in another collection (target)
Literature
Patient Records
96
Preliminary Results
A collection of 69 pediatric urology literature abstracts taken from
Medline were clustered using the complete link clustering algorithm

3 large clusters, each with 2 or more sub-clusters
GCC and LCC were calculated for patterns found in several subclusters
Data from one sub-cluster is reported here
Document #
Title
1
Complications in pediatric urological laparoscopy: results of a survey
2
Laparoscopic surgery in pediatric urology
3
[Laparoscopic interventions in pediatric urology]
4
Role of laparoscopic surgery in pediatric urology
5
[Laparoscopic interventions in urology]
6
Laparoscopic heminephroureterectomy in pediatric patients
97
GCC
LCC
Term/Phrase
Term/Phrase
Cg
Cl
Laparoscop
0.1887
Pediatr
1.0
Compl
0.0817
Result
1.0
Child Laparoscop
1.0
Patient
1.0
Laparoscop patient
1.0
Perform
1.0
Compl Laparoscop
1.0
Compl
1.0
Comple techn
1.0
Laparoscop
1.0
<MEAS> compl
1.0
Urolog
0.34
Laparoscop perform
0.6088
Laparoscop pediatr
1.0
Compl rate
0.4564
Laparoscop perform
1.0
Laparoscop patient perform
1.0
Diagnost laparoscop
0.35
Laparoscop perform procedur
1.0
Laparoscop operat
0.35
<MEAS> compl rate
1.0
Compl rate
0.35
Laparoscop pediatr perform
1.0
Laparoscop patient
0.35
Compl laparoscop techn
1.0
Laparoscop operat perform
0.0817
Laparoscop patient perform
0.0817
98
Project Summary
A system that provides:





relevant and reputable information,
access to similar patient records,
content-based cross referencing,
a dynamically updated data repository,
and
tailored access for specific users and
devices
will:
 augment
the patient
record to provide tailored
and timely access to a
broader array of
reputable information
and
 extend the digital file
room into a digital
medical library.
99
Research Results
Phrase Indexing


Developed an efficient algorithm for extracting n-word
features from textual documents
Phrase index provide better results than single word
index in document retrieval and summarization
Content Correlation via Cluster Signature (LCC &
GCC)

Preliminary results reveal the feasibility using cluster
signature for linking relevant documents
Work begun on proxy for information navigation
100
Future Work
Develop Ontology for Intelligent Information
Registration
User Model for Information Retrieval
101
Download