KMeD: A Knowledge-Based Multimedia Medical Database System

Wesley W. Chu
Computer Science Department
University of California, Los Angeles
http://www.cobase.cs.ucla.edu

KMeD
• A Knowledge-Based Multimedia Medical Distributed Database System: October 1, 1991 to September 30, 1993
• A Cooperative, Spatial, Evolutionary Medical Database System: July 1, 1993 to June 30, 1997
• Knowledge-Based Image Retrieval with Spatial and Temporal Constructs: May 1, 1997 to April 30, 2001

Wesley W. Chu, Computer Science Department
Alfonso F. Cardenas, Computer Science Department
Ricky K. Taira, Department of Radiological Sciences

Research Team
• Students
  – John David N. Dionisio
  – Chih-Cheng Hsu
  – David Johnson
  – Christine Chih
• Collaborators
  – Computer Science Department: Alfonso F. Cardenas
  – UCLA Medical School: Denise Aberle, MD; Robert Lufkin, MD; Ricky K. Taira, MD

Significance
• Query multimedia data based on image content and spatial predicates
• Use domain knowledge to relax and interpret medical queries
• Present an integrated view of multiple temporal and evolutionary data in a timeline metaphor

Overview
• Image retrieval by feature and content
• Query relaxation
• Spatial query answering
• Similarity query answering
• Visual query interface
• Timeline interface
• Sample cases

Image Retrieval by Content
• Features: size, shape, texture, density, histology
• Spatial Relations: angle of coverage, shortest distance, overlapping ratio, contact ratio, relative direction
• Evolution of Object Growth: fusion, fission

Characteristics of Medical Queries
• Multimedia
• Temporal
• Evolutionary
• Spatial
• Imprecise

• Evolution: Object O evolves into a new object O'
• Fusion: Objects O1, …, Om fuse into a new object O
• Fission: Object O splits into objects O1, …, On

• Case a: The object exists with its supertype or aggregated type.
• Case b: The life span of the object starts after and ends with its supertype or aggregated type.
• Case c: The life span of the object starts with and ends before its supertype or aggregated type.
• Case d: The life span of the object starts after and ends before its supertype or aggregated type.

[Figure labels: Lesion, MicroLesion, MicroLesion]

Query Modification Techniques
• Relaxation
  – Generalization
  – Specialization
• Association

Generalization and Specialization
• Generalization: Specific Query → Conceptual Query → More Conceptual Query
• Specialization: More Conceptual Query → Conceptual Query → Specific Query

Type Abstraction Hierarchy
• Presents an abstract view of
  – Types
  – Attribute values
  – Image features
  – Temporal and evolutionary behavior
  – Spatial relationships among objects
• Provides multi-level knowledge representation

TAH Generation for Numerical Attribute Values
• Relaxation Error
  – Difference between the exact value and the returned approximate value
  – The expected error is weighted by the probability of occurrence of each value
• DISC (Distribution Sensitive Clustering) is based on the attribute values and frequency distribution of the data

TAH Generation for Numerical Attribute Values (cont.)
• Computation complexity: $O(n^2)$, where n is the number of distinct values in a cluster
• DISC performs better than the Biggest Cap (value only) or Max Entropy (frequency only) methods
• MDISC is developed for multiple-attribute TAHs. Computation complexity: $O(mn^2)$, where m is the number of attributes

Query Relaxation
[Flowchart: the query is run against the database. If there are answers (Yes), they are displayed. If not (No), attributes are relaxed using the TAHs and the modified query is run again.]

Cooperative Querying for Medical Applications
• Query: Find the treatment used for the tumor similar-to (loc, size) X1 on 12 year-old Korean males.
• Relaxed Query: Find the treatment used for the tumor Class X on preteen Asians.
• Association: The success rate, side effects, and cost of the treatment.
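
The relaxation loop and the cooperative query above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KMeD's implementation: the child-to-parent dictionaries, the record fields, the treatment labels `T1` and `T2`, and the round-robin relaxation policy are all hypothetical. Only the Preteen and Asian groupings are taken from the slides.

```python
# Hypothetical TAHs stored as child -> parent maps (a value with no entry is a root).
AGE_TAH = {9: "Preteen", 10: "Preteen", 11: "Preteen", 12: "Preteen"}
ETHNIC_TAH = {"Korean": "Asian", "Chinese": "Asian",
              "Japanese": "Asian", "Filipino": "Asian"}
TAHS = {"age": AGE_TAH, "ethnic_group": ETHNIC_TAH}

# Made-up patient records; the treatment labels are placeholders.
RECORDS = [
    {"age": 10, "ethnic_group": "Chinese", "treatment": "T1"},
    {"age": 30, "ethnic_group": "European", "treatment": "T2"},
]


def leaves(parent_of, node):
    """Return every leaf value under `node` in a child -> parent map."""
    children = [child for child, parent in parent_of.items() if parent == node]
    if not children:
        return {node}
    found = set()
    for child in children:
        found |= leaves(parent_of, child)
    return found


def relax_query(query, records, tahs, order):
    """Run `query`; while it returns nothing, generalize one attribute per round.

    Attributes are generalized one TAH level at a time, taking turns in the
    given relaxation order (an assumed policy, chosen for simplicity).
    """
    current = dict(query)
    turn = 0
    while True:
        allowed = {a: leaves(tahs[a], v) for a, v in current.items()}
        answers = [r for r in records if all(r[a] in allowed[a] for a in current)]
        if answers:
            return answers, current
        for offset in range(len(order)):
            attr = order[(turn + offset) % len(order)]
            parent = tahs[attr].get(current[attr])
            if parent is not None:
                current[attr] = parent          # move one level up the TAH
                turn = (turn + offset + 1) % len(order)
                break
        else:
            return [], current                  # nothing left to generalize


answers, relaxed = relax_query(
    {"age": 12, "ethnic_group": "Korean"}, RECORDS, TAHS, ["age", "ethnic_group"]
)
print(relaxed)
print([r["treatment"] for r in answers])
```

With this toy data the exact query finds nothing, relaxing age to Preteen still finds nothing, and relaxing the ethnic group to Asian returns the one matching record. A real system would also apply the user model's relaxation controls (unrelaxable objects, preference lists) before picking an attribute.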

Type Abstraction Hierarchies for Medical Domain
• Tumor (location, size)
  – Class X [loc1 loc3] [s1 s3]
    X1 [loc1 s1], X2 [loc2 s2], X3 [loc3 s3]
  – Class Y [locY sY]
• Age
  – Preteens: 9, 10, 11, 12
  – Teen
  – Adult
• Ethnic Group
  – Asian: Korean, Chinese, Japanese, Filipino
  – African
  – European

Knowledge-Based Image Model
[Figure: three-level image model.
 Knowledge Level: TAHs (labels shown: SR(t,b), Tumor Size, SR(t,l), Lateral Ventricle).
 Schema Level: Brain, Tumor, and Lateral Ventricle, linked by SR(t,b) and SR(t,l).
 Representation Level: features and contents.]
SR: Spatial Relation; b: Brain; t: Tumor; l: Lateral Ventricle

Knowledge Based Query Processing
Queries → Query Analysis and Feature Selection → Knowledge-Based Content Matching via TAHs → Query Relaxation → Query Answers

User Model
To customize query conditions and knowledge-based query processing
• User type
• Default Parameter Values
• Feature and Content Matching Policies
  – Complete Match
  – Partial Match

User Model (cont.)
• Relaxation Control Policies
  – Relaxation Order
  – Unrelaxable Object
  – Preference List
  – Measure for Ranking

Query Preprocessing
• Segment and label contours for objects of interest
• Determine relevant features and spatial relationships (e.g., location, containment, intersection) of the selected objects
• Organize the features and spatial relationships of objects into a feature database
• Classify the feature database into a Type Abstraction Hierarchy (TAH)

Similarity Query Answering
• Determine relevant features based on query input
• Select TAH based on these features
• Traverse through the TAH nodes to match all the images with similar features in the database
• Present the images and rank their similarity (e.g., by mean square error)

Spatial Query Answering
Preprocessing
• Draw and label contours for objects of interest
• Determine relevant features and spatial relationships (e.g., location, containment, intersection) of the selected objects
• Organize the features and spatial relationships of objects into a feature database
• Classify the feature database into a type abstraction hierarchy (TAH)
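
The preprocessing steps above (label objects, derive spatial relationships, store them in a feature database) might look roughly like the following. This is only a sketch under simplifying assumptions: axis-aligned bounding boxes stand in for segmented contours, the coordinates are invented, and the `overlap_ratio` definition (intersection area divided by the first object's area) is an assumed reading of "overlapping ratio", not a definition given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box standing in for a labeled contour."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self):
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def intersection_area(a, b):
    """Area shared by boxes a and b (0.0 when they do not overlap)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def spatial_features(a, b):
    """Build one feature-database row describing how a relates to b."""
    shared = intersection_area(a, b)
    return {
        "pair": (a.label, b.label),
        "intersects": shared > 0,
        # True when a lies entirely inside b
        "contained": a.x0 >= b.x0 and a.y0 >= b.y0 and a.x1 <= b.x1 and a.y1 <= b.y1,
        # assumed definition: fraction of a's area that overlaps b
        "overlap_ratio": shared / a.area,
    }


# Invented coordinates for the three objects named in the image model.
brain = Box("brain", 0, 0, 10, 10)
tumor = Box("tumor", 4, 4, 6, 6)
ventricle = Box("lateral ventricle", 5, 3, 9, 7)

feature_db = [spatial_features(tumor, other) for other in (brain, ventricle)]
for row in feature_db:
    print(row)
```

Rows like these are what a clustering step would later organize into a TAH. Real contours would need polygon or pixel-mask operations instead of box arithmetic.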

Spatial Query Answering (cont.)
Processing
• Select TAH based on the query conditions and context
• Search nodes to match the query conditions
• Return images linked to the TAH node

Similarity Query Answering
Preprocessing
• Select objects and specify features of interest in the image
• Create a feature database of the selected objects for all images
• Classify the feature databases as type abstraction hierarchies

Similarity Query Answering (cont.)
Processing
• Determine relevant features based on query input
• Select TAH based on these features (interact with user to resolve ambiguity)
• Traverse through the TAH nodes to match all the images with similar features in the databases
• Present the images and rank their similarity (e.g., by mean square error)

Visual Query Language and Interface
• Point-click-drag interface
• Objects may be represented iconically
• Spatial relationships among objects are represented graphically

Visual Query Example
Retrieve brain tumor cases where a tumor is located in the region as indicated in the picture

Implementation
• Sun Sparc 20 workstations (128 MB RAM, 24-bit frame buffer)
• Oracle Database Management System
• X/Motif Development Environment, C++
• Mass Storage of Images (9 GB)

Conclusions
• Image retrieval by feature and content
• Matching and relaxation of images based on features
• Processing of queries based on spatial relationships among objects
• Answering of imprecise queries
• Expression of queries via visual query language
• Integrated view of temporal multimedia data in a timeline metaphor

Semi-Automatic Segmentation of Lung Tumors
[Figure labels: interesting area, classification, seed estimation, region growing, tumor segment, adaptive fusion]

Medical Digital Library to Support Scenario Specific Information Retrieval

Wesley W. Chu
wwc@cs.ucla.edu
Computer Science Department
University of California
Los Angeles, California

A Project of the NIH Grant at UCLA
A Digital File Room for Patient Care, Education, and Research
• Wesley W. Chu, PhD
• Hooshang Kangarloo, MD
• Usha Sinha, PhD
• David B. Johnson, PhD
• Bernard Churchill, MD
• John D. N. Dionisio, PhD
• Richard Johnson, MD
• Osman Ratib, MD, PhD

Background
• Current file rooms managing patient records have limited functionality
  – Main goal of mapping patient ID to patient records
• PACS implementations are an electronic version of the traditional file room

Background
• Finding relevant information for a particular user is time consuming and labor intensive
• Lack of structure makes...
• Poorly structured and incomplete results, which may affect patient management
• Current search tools are limited to general use and not tailored to specific users or tasks

Digital File Room Requirements
A navigable information space providing:
• Relevant and reputable information
• Access to similar patient records
• Content-based cross referencing
• Dynamically updated data repository
• Tailored access for specific users and devices

Hypotheses
• A digital file room (digital library) that delivers relevant and structured answers to specific queries can be developed from existing medical databases
• Such a digital file room will increase user satisfaction and improve patient management

Specific Aims
• SA1 Develop a system that identifies and provides access to reputable information sources
• SA2 Provide users with greater query capability (e.g. similar-to, approximate)
• SA3 Extract knowledge from patient data, medical literature and radiology teaching files to support content-based cross-referencing
• SA4 Provide access to dynamically updated collections based on patient data
• SA5 Adapt information retrieval to user and device characteristics

Significance
Extend the patient record to provide tailored and timely access to a broader array of reputable medical information

Approach and Innovations
• Intelligent information registration
  – Provide access to multiple, related data sources through a single access point
• Content-based navigation and matching
  – Develop similarity matching based on medical concepts & patterns
  – Content correlation
• User and device modeling
  – Adaptive information retrieval based on user and device models
• Scenario-based information web (proxies)
  – Develop an information web linking clustered data sources for a given set of related tasks (i.e., scenario)

Intelligent Information Registration
• Registers multiple information sources to provide transparent access through a single point (proxy object).
• Information requests are routed to appropriate data sources based on query characteristics
• Data sources are hierarchically clustered according to a four-layer data model
[Figure: four-layer data model with layers proxy-object (access point), meta-data, summarization, and data. Labels shown: Patient; Procedures, Labs; Ortho, Incontinence, Neurological; Procedure database (billing, cpt), Laboratory databases.]

Content-Based Navigation & Matching
Two types of navigation
• Navigation of the information space using proxies and content correlation
• Pattern/similarity navigation using type abstraction hierarchies (TAHs)

Pattern-Based Type Abstraction Hierarchies
Scalable, hierarchical knowledge structures that facilitate similarity matching
• Incontinence
  – Adequate holding
    Type II: adequate holding, adequate storage, poor emptying
    Type V: adequate holding, poor storage, poor emptying
  – Poor holding
    Type III: poor holding, adequate storage, poor emptying
    Type IV: poor holding, poor storage, poor emptying
[Figure: leaf nodes are patients: 6 day M, 7 mo F, 12 yr M, 25 yr F, 28 day M, 24 mo F, 15 yr M, 20 yr F]

Adaptive Information Retrieval
• Tailors query processing and query results according to:
  – Particular user
  – Characteristics of their device
• Examples:
  – Doctors prefer JAMA or Lancet while patients prefer Time or CNN.
  – High resolution workstations support large, detailed imaging studies while portable devices need lower-bandwidth data.
• Allows the system to retrieve appropriate data for a particular query, user, and device

Scenario-Based Proxy
A framework that defines, for a particular domain and set of tasks, the access methods to and the relationships between information sources.
  – intelligent information registration
  – pattern-based similarity matching
  – adaptive information retrieval
  – information web
[Figure labels: Patient; Procedures (UCLA, HFC); Labs (MD Office, HFC Blood, UCLA Blood); Adequate holding, Inadequate holding; Type II, Type V, Type III, Type IV]

Scenario-Based Information Web
A directed graph that defines access paths for navigation among proxy objects
[Figure: Patient, Literature, and Teaching File nodes connected by similar-to and correlated-to links]

Scenario-Based Information Web
• Similar-to links relate objects based on their similarity
  – patients similar by age, sex, and disease
• Correlated-to links relate objects based on related content
  – disease can be correlated to relevant literature
• Extends the scope of the digital file room into a digital medical library
[Figure: Patient, Literature, and Teaching File nodes connected by similar-to and correlated-to links]

Research Progress
• Phrase Indexing
  – Phrases are generated from n-word combinations in a sentence.
  – Domain Specific Retrieval
  – Document Summarization
• Content Correlation
  – Linking of relevant documents via patterns

Domain Specific Retrieval
• Documents are grouped into domain-specific collections
  – Medical patient reports
  – Web sites are often tailored to specific subject areas
• Phrases can capture content better than single words and thus improve retrieval performance

Problem With Longer Phrases
• Large combinatorial problem
• To process longer phrases it is necessary to partition documents into smaller segments
[Chart: log-scale y-axis from 1.00E+00 to 1.00E+12, x-axis from 1 to 6; series: 100 word document, 125 word document, 150 word document, 100^n, 14-word sentence]

Phrase Analysis
• A phrase is defined as any 2, 3 or 4 words co-occurring in a sentence (word combination)
• Very large number of possible phrases
  – Use a stoplist to remove "useless" words
  – Normalize words to a common stem

Example
  sentence:            The right upper lobe mass is seen again.
  case normalization:  the right upper lobe mass is seen again
  stop word removal:   right upper lobe mass seen again
  stemming:            right upp lob mass seen again
  sorting:             again lob mass right seen upp
  candidate 2-word combinations:
    again lob, again mass, again right, again seen, again upp,
    lob mass, lob right, lob seen, lob upp,
    mass right, mass seen, mass upp,
    right seen, right upp, seen upp

Document Retrieval Evaluation
• Preliminary evaluation
  – A domain specific collection of documents
  – Can phrase analysis limited to sentences improve retrieval effectiveness?
  – SMART system (single word terms) used as baseline
• Data
  – Thoracic radiology patient reports
  – Dictated reports
  – Describe anatomy and abnormal findings such as enlarged lymph nodes and cancer masses

Domain Specific Document Retrieval
Query: "right upper lobe mass"

Automatic Text Summarization
Salton Method
• Given a text file with n paragraphs
• A paragraph can be represented by $D_i = (d_{i1}, d_{i2}, \ldots, d_{im})$
  – $d_{ik}$ is the weight representing the importance of term $T_k$ (word or phrase)
• The pair-wise similarity of two paragraphs: $\mathrm{Sim}(D_i, D_j) = \sum_{k=1}^{m} d_{ik} \, d_{jk}$
Text relationship map:
• Nodes = paragraphs
• Links = pair-wise similarity of the connected nodes
• Links are created if $\mathrm{Sim}(D_i, D_j) > \text{threshold}$
Bushiness of a node = number of links of the node
The text summarization is derived from the bushy nodes.

Performance Comparison of Salton's Summarization Method Based on Phrase and Single Word
File: Aspirin.txt. Paragraph ranking based on bushiness.

  Terms        Threshold  No.1  No.2  No.3  No.4  No.5  No.6  No.7  No.8  No.9
  words        0.1        4     6     8     1     5     2     3     9     7
  words        0.2        6     8     3     4     5     1     2     9     7
  words        0.3        8     2     3     4     5     6     1     9     7
  2W phrases   0.1        2     3     6     1     8     4     5     7     9
  2W phrases   0.2        2     3     6     4     5     1     8     7     9
  2W phrases   0.3        2     3     6     4     5     1     8     7     9
  3W phrases   0.1        2     3     8     4     6     5     7     1     9
  3W phrases   0.2        2     3     8     4     6     5     7     1     9
  3W phrases   0.3        2     3     8     4     6     5     7     1     9

Summarization based on phrases is less sensitive to the threshold setting than summarization based on single words.
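
The text relationship map and bushiness ranking described above can be illustrated with a short Python sketch. It is a toy version under stated assumptions rather than Salton's or the authors' implementation: terms are single lowercase words split on whitespace, term weights are term frequencies normalized to unit length (so similarities fall between 0 and 1), and the four sample paragraphs are invented.

```python
import math
from collections import Counter


def unit_vector(paragraph):
    """Map a paragraph to term weights d_ik (term frequency, unit length)."""
    counts = Counter(paragraph.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def similarity(a, b):
    """Sim(D_i, D_j): sum over shared terms of d_ik * d_jk."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def bushy_summary(paragraphs, threshold, k):
    """Return (bushiness per paragraph, the k bushiest paragraphs in text order)."""
    vectors = [unit_vector(p) for p in paragraphs]
    n = len(vectors)
    bushiness = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # a link exists only when the similarity exceeds the threshold
            if similarity(vectors[i], vectors[j]) > threshold:
                bushiness[i] += 1
                bushiness[j] += 1
    # rank by link count, breaking ties by position in the text
    ranked = sorted(range(n), key=lambda i: (-bushiness[i], i))
    chosen = sorted(ranked[:k])
    return bushiness, [paragraphs[i] for i in chosen]


# Invented sample text, one short "paragraph" per string.
PARAGRAPHS = [
    "aspirin reduces pain",
    "aspirin reduces fever",
    "aspirin thins blood",
    "weather is sunny",
]

print(bushy_summary(PARAGRAPHS, threshold=0.5, k=2))
print(bushy_summary(PARAGRAPHS, threshold=0.3, k=2))
```

In this toy text the link structure changes as the threshold moves from 0.5 to 0.3, which is the kind of threshold sensitivity the comparison table measures. Swapping the word tokenizer for a phrase extractor would give the phrase-based variant.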

N-words Distribution
[Chart: Number (0 to 500) versus N-Word (1 to 9); series: Aspirin1, Aspirin2, Elian04, LAPD06, CNN-Bush, CNN-Florida]

Number of Distinct Frequent Words
[Chart: Number of Distinct Frequent Words (0 to 180) versus N-Word (1 to 9); series: Aspirin1, Aspirin2, Elian04, LAPD06, CNN-Bush, CNN-Florida]

Number of Valid Sentences
[Chart: Number of Valid Sentences (0 to 90) versus N-Word (1 to 9); series: Aspirin1, Aspirin2, Elian04, LAPD06, CNN-Bush, CNN-Florida]

Performance Comparison

  Document             N  Salton 0.1       Salton 0.2       Salton 0.3       David            df
  Aspirin01 (13 sent)  1  2,12,9,3,7       3,9,1,4,7        1,2,3,7,9        9,2,3,12,7       0
                       2  9,2,3,4,7        2,3,4,9,7        2,1,3,7,9        9,12,2,3,4       1
                       3  2,9,3,12,7       2,9,12,7,1       12,2,7,9,3       12,9,2,1,7       0
                       4  2,9,12,4,7       9,2,12,4,7       9,12,3,4,7       12,9,4,2,7       0
                       5  12,4,9           12,4,9           12,4,9           12,4,9           0
  Aspirin02 (68 sent)  1  14,12,22,66,20   12,22,36,66,1    1,12,14,15,20    14,12,66,22,20   0
                       2  12,14,66,15,20   36,15,20,22,66   66,1,12,14,20    14,12,66,22,15   0
                       3  12,14,66,22,21   14,22,66,12,36   66,12,14,21,22   14,12,66,22,18   0
                       4  14,66,12,21,22   14,66,22,12,21   12,14,22,66,68   14,22,12,66,68   0
                       5  14,12,15,18,22   14,12,15,18,22   14,12,18,22,36   14,12,18,15,22   0
  Elian04 (92 sent)    1  26,76,33,59,2    26,33,76,2,59    26,76,2,44,33    26,76,2,33,7     1
                       2  26,76,7,33,29    26,76,82,7,29    6,76,26,27,29    26,76,7,2,59     1
                       3  6,26,7,76,2      6,2,27,26,44     6,27,26,2,29     26,7,76,2,59     1
                       4  26,2,76,6,7      26,6,7,24,28     7,24,26,84,85    26,7,84,2,85     0
                       5  26,7,76,84,6     26,7,76,84,85    7,24,26,28,85    7,26,85,84,2     1

Performance Comparison (cont'd)

  Document             N  Salton 0.1       Salton 0.2       Salton 0.3       David            df
  LAPD06 (27 sent)     1  6,7,20,25,5      6,19,20,25,7     6,7,14,19,25     6,7,20,25,19     0
                       2  18,20,6,19,25    6,18,19,24,20    19,18,24,9,20    5,7,1,20,6       3
                       3  19,6,7,18,25     7,25,5,6,1       1,5,7,14,19      1,5,7,20,12      2
                       4  7,5,6,1,8        7,5,1,6,8        1,5,6,7,9        1,5,7,12,17      2
                       5  1,5,7,12,17      1,5,7,12,17      1,5,7,12,17      1,5,7,12,17      0
  CNNbush (14 sent)    1  12,5,6,8,11      12,5,6,11,8      5,6,11,12,1      5,12,8,11,6      0
                       2  12,5,8,3,11      5,8,12,3,7       5,12,8,7,3       5,12,8,11,9      1
                       3  5,12,3,8,10      5,8,3,9,10       5,8,10,12,6      5,12,11,9,8      1
                       4  5,12,8,7,9       5,12,6,8,9       5,12,6,8,9       5,11,12,9,8      1
                       5  5,8,12,6,7       5,12,6,8,9       5,12,6,8,9       5,11,12,9,8      1
  Florida (49 sent)    1  29,11,2,41,40    29,41,11,26,2    29,41,26,11,14   29,11,17,40,48   2
                       2  11,29,40,17,48   17,29,40,22,11   17,20,35,40,22   17,11,20,40,22   0
                       3  17,29,20,40,48   17,6,22,29,11    28,22,26,4,6     17,20,11,22,40   0
                       4  17,11,22,6,2     2,11,20,6,17     2,11,20,6,17     17,20,11,22,2    0
                       5  2,11,20,17,22    2,11,20,17,22    2,11,17,20,22    17,11,22,2,20    0

Content Correlation
Given a document in one collection, content correlation links relevant documents in another document collection
[Figure labels: Time, CNN, Patient Records, New England Journal of Medicine]

Document Cluster By Pattern
• A pattern is a set of unique terms that characterize some features in the data set
• Patterns can be found in a collection of documents by data mining
• Documents are grouped into clusters based on patterns via a clustering technique

Cluster Signature
• Every cluster can be classified according to the occurrence frequency of the patterns
• Looking to answer:
  – Which set of patterns summarizes a given cluster?
  – How are the patterns related among the clusters?
[Figure labels: Literature, Patient Records]

Deriving Cluster Signature Metrics
• Local Cluster Certainty (LCC) measures the coverage of a pattern in a given cluster (Popularity)
• Global Cluster Certainty (GCC) measures the coverage of a pattern among clusters (Exclusiveness)
• The Cluster Signature is the set of those patterns that have both high LCC and GCC
• Documents from one collection (source) can be linked to relevant clusters in another collection (target)
[Figure labels: Literature, Patient Records]

Preliminary Results
• A collection of 69 pediatric urology literature abstracts taken from Medline was clustered using the complete link clustering algorithm
  – 3 large clusters, each with 2 or more sub-clusters
• GCC and LCC were calculated for patterns found in several sub-clusters
• Data from one sub-cluster is reported here

  Document #  Title
  1           Complications in pediatric urological laparoscopy: results of a survey
  2           Laparoscopic surgery in pediatric urology
  3           [Laparoscopic interventions in pediatric urology]
  4           Role of laparoscopic surgery in pediatric urology
  5           [Laparoscopic interventions in urology]
  6           Laparoscopic heminephroureterectomy in pediatric patients

GCC

  Term/Phrase                     Cg
  Laparoscop                      0.1887
  Compl                           0.0817
  Child Laparoscop                1.0
  Laparoscop patient              1.0
  Compl Laparoscop                1.0
  Comple techn                    1.0
  <MEAS> compl                    1.0
  Laparoscop perform              0.6088
  Compl rate                      0.4564
  Laparoscop patient perform      1.0
  Laparoscop perform procedur     1.0
  <MEAS> compl rate               1.0
  Laparoscop pediatr perform      1.0
  Compl laparoscop techn          1.0

LCC

  Term/Phrase                     Cl
  Pediatr                         1.0
  Result                          1.0
  Patient                         1.0
  Perform                         1.0
  Compl                           1.0
  Laparoscop                      1.0
  Urolog                          0.34
  Laparoscop pediatr              1.0
  Laparoscop perform              1.0
  Diagnost laparoscop             0.35
  Laparoscop operat               0.35
  Compl rate                      0.35
  Laparoscop patient              0.35
  Laparoscop operat perform       0.0817
  Laparoscop patient perform      0.0817

Project Summary
A system that provides:
• relevant and reputable information,
• access to similar patient records,
• content-based cross referencing,
• a dynamically updated data repository, and
• tailored access for specific users and devices
will:
• augment the patient record to provide tailored and timely access to a broader array of reputable information and
• extend the digital file room into a digital medical library.

Research Results
• Phrase Indexing
  – Developed an efficient algorithm for extracting n-word features from textual documents
  – Phrase indexes provide better results than single-word indexes in document retrieval and summarization
• Content Correlation via Cluster Signature (LCC & GCC)
  – Preliminary results reveal the feasibility of using cluster signatures for linking relevant documents
• Work has begun on a proxy for information navigation

Future Work
• Develop Ontology for Intelligent Information Registration
• User Model for Information Retrieval