Using Domain Ontologies to Improve Information Retrieval in Scientific Publications
Kincho H. Law, Siddharth Taduri, Gloria T. Lau
Engineering Informatics Lab at Stanford University

Motivation
[Figure: excerpt of a PubMed abstract shown next to a list of synonyms for ESRD]
PMID: 12897095. Regional variability in the incidence of end-stage renal disease: an epidemiological approach.
…. Regional variability in the incidence of end-stage renal disease (ESRD) in Austria is reported. Our aim was …. low rates in the state of Tyrol. …. ESRD incidence data were obtained from …. Between 1995 and 1999, 4811 new cases of ESRD were recorded; …. the state of Tyrol (T) …. incidence of ESRD patients with type 2 diabetes mellitus …. the difference in the overall ESRD incidence …. prevalence of DM, a highly significant correlation was found between ESRD incidence and DM. …. variability in the ESRD incidence in Austria is explained mainly by regional differences in DM-2. Data from similar studies …. allocation for ESRD ….
Synonyms for ESRD: End Stage Kidney Disease; Renal Disease, End Stage; Renal Failure, End Stage; Kidney Disease, Chronic; Renal Failure, Chronic; End-Stage Kidney Disease; ESRD; Renal Disease, End-Stage; Renal Failure, End-Stage; Chronic Kidney Failure; Chronic Renal Failure

05/01/2012 Engineering Informatics Lab at Stanford University

Data Set and Knowledge
TREC 2007 Genomics Data Set
• Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine
• Metadata available through MEDLINE
• Tasks involve passage, document, and feature retrieval
• Methodologies are evaluated on their response to 36 topics (‘queries’)
• The topics are categorized based on 13 entity types (Proteins, Genes, etc.)
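The evaluation over topics is reported later in the deck as Document MAP (mean average precision). The slides do not spell out the computation, so the following is only a minimal sketch of the standard definition; the function names, document IDs, and relevance judgments are illustrative assumptions and are not taken from the TREC tooling or data.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one topic.

    Precision is sampled at every rank where a relevant document appears,
    and the sum is divided by the total number of relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of the per-topic average precision over all judged topics.

    runs:      topic id -> ranked list of retrieved document ids
    judgments: topic id -> collection of relevant document ids
    """
    topics = list(judgments)
    if not topics:
        return 0.0
    total = sum(average_precision(runs.get(t, []), judgments[t]) for t in topics)
    return total / len(topics)


# Toy data (hypothetical, for illustration only).
toy_runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
toy_judgments = {"t1": {"d1", "d3"}, "t2": {"d6"}}
toy_map = mean_average_precision(toy_runs, toy_judgments)
```

A topic with no retrieved documents contributes an average precision of zero, so topics where expansion finds nothing pull the mean down rather than being skipped.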
Domain Knowledge
• Over 250 biomedical ontologies from BioPortal

XML Representation of Scientific Publications in PubMed
  <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID>10022466</PMID>
      <DateCreated>
        <Year>1999</Year>
        <Month>02</Month>
        <Day>25</Day>
      </DateCreated>
      ….
      <Article PubModel="Print">
        <Journal>
          ….
          <JournalIssue CitedMedium="Print">
            <Volume>84</Volume>
            <Issue>2</Issue>
            ….
          </JournalIssue>
          <Title>The Journal of clinical endocrinology and metabolism</Title>
          <ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>
        </Journal>
        <ArticleTitle>About the use … of an ACTH 1-39 ….</ArticleTitle>
        ….

Domain Knowledge Integration
(1) Annotating Documents prior to indexing
  – Response time is fast
  – Not flexible: the entire index has to be updated if a new ontology needs to be added
  – Indexes can grow very large
(2) Query Expansion
  – Response time is slower
  – Very flexible: ontologies can be dynamically chosen

Query Expansion
[Figure: MeSH entry for Tumor with synonyms (Cancer, Neoplasm, …) and related terms Melanoma, Adenocarcinoma, Leukemia (synonyms: Leucocythaemias, Leucocythemia), Nerve Sheath Neo]
• The pre-processed query is expanded using BioPortal’s API automatically
  [Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …}

Choosing Domain Knowledge
• The use of synonymy results in inconsistent performance (2007 TREC genomics track)
• Common reasons include:
  – Relevant terms may not be classified as expected
  – Some relevant terms may not be classified in a particular ontology
  – Incomplete information (such as synonyms)
• Selection of the appropriate domain ontology is important

Enriching Existing Ontologies
• Existing ontologies can be enriched to complete some missing
information

  Ontology            | NDF
  Concept             | Pamidronate
  Synonyms from NDF   | APD, Amidronate, ...
  Synonyms from MeSH  | pamidronate calcium, pamidronate monosodium, aredia
  Synonyms from NCI   | Pamidronic acid, pamidronate disodium, …

• Multiple ontologies can be used to provide different classifications

Evaluations
• Baseline
• With Query Expansion (Suggested Sources)
• Using Enriched Ontologies
• Multiple Query Expansions per query

Summary of Document MAP scores in 2007 TREC genomics track
  Max    | 0.3286
  Min    | 0.0329
  Mean   | 0.1862
  Median | 0.1897

Queries
  Topic Number | Query | Suggested Sources for Terms (TREC) | Selected Domain Knowledge (Our Methodology)
  205 | What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease? | Wikipedia | Symptom Ontology
  206 | What [TOXICITIES] are associated with zoledronic acid? | Wikipedia + Aaron | NCI Thesaurus
  207 | What [TOXICITIES] are associated with etidronate? | Wikipedia + Aaron | NCI Thesaurus
  211 | What [ANTIBODIES] have been used to detect protein PSD-95? | MeSH | MeSH
  229 | What [SIGNS OR SYMPTOMS] are caused by human parvovirus infection? | Wikipedia | Symptom Ontology
  231 | What [TUMOR TYPES] are found in zebrafish? | Aaron | MeSH

Baseline
• Queries are used without modification, e.g.,
  – “What [ANTIBODIES] have been used to detect protein PSD-95?”
  – “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?”
• Document MAP: 0.277

Query Expansion
• Original Query: What [TUMOR TYPES] are found in zebrafish?
• Queries are formulated in ‘AND’ clauses:
  “[Tumor][MeSH] AND zebrafish” => (Tumor, Neoplasm, Carcinoma, Leukemia …) AND zebrafish
• Document MAP: 0.347

Multiple Query Expansion Terms
• Expansion can be performed on multiple terms in the query
• Example: Coronary Artery Disease => {Coronary heart disease, coronary disease, CAD, …}
  [Tumor][MeSH] AND [zebrafish][MeSH] => (tumor, neoplasm, …) AND (zebrafish, danio rerio, …)
• Document MAP: 0.352

Enriched Ontology – Current Status
• Marginal improvement over basic enhanced models
• Document MAP: 0.352 (marginal improvement from 0.347)
• Issues:
  – The framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing from the ontology are still not included
  – Relevant terms that are classified differently are never included in the search

IR Tool
• Expert knowledge is valuable
• Developed a search tool which automatically integrates with knowledge sources and searches documents
• We extend MINOE, a co-occurrence based visualization tool, originally designed for exploring marine ecosystems
• Users can browse (or search) documents through ontologies and visualize interactions between concepts

Snapshots of the Tool
I. Enter Query Terms
II. Domain Knowledge Integration
III. Shows Expanded Query, and other filters that are added to the search

TREC Topic 220
• Query: What [PROTEINS] are involved in the activation or recognition mechanism for PmrD?
• Domain Knowledge: MeSH

  Depth of Hierarchical Expansion to Child Nodes | Document MAP
  Level 1 | 0.2
  Level 2 | 0.8
  Level 3 | 0.0

[Tool screenshots; surviving labels: Changed; MeSH Descriptors; (>1500 Documents); Stronger Association: ~270 Documents; Weaker Association: ~57 Documents; CHILD CONCEPTS]

Retrieving Information Across Multiple Diverse Information Sources
[Figure: information sources. Patent System; Issued Patents and Applications; File Wrappers; Court Cases; Regulations and Laws; Technical Publications]
Technology Firms’ Concerns
• Can I get patent protection for my innovation?
• Do I build or do I buy related technologies?
• What are my competitors doing?
• How strong are their patents?
• Am I perhaps infringing on someone else’s patents?
• If so, are those patents valid?
• Have they been enforced in court?
• Has their validity been challenged in court?

Cross-Referencing between Information Sources
REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P.
…

COURT CASE
314 F.3d 1313 (2003)
AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.
… Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), … alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.

BIOPORTAL: DOMAIN KNOWLEDGE
Publication Database

PATENT
United States Patent, 5,955,422
September 21, 1999
Production of erythropoietin
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2, 1993.
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") …

FILE WRAPPER
U.S. Patent 5,955,422
… Claims 61-63 are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R) …
In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of …

Solution: Patent System Ontology

Patent System Ontology
I. Facilitate information integration across multiple diverse information sources
  • This requires a standardized representation (a formal semantic model): Patent System Ontology
II.
Integrate Domain Semantics into existing Information Retrieval and Text mining methodologies to improve retrieval of information

Information Retrieval Framework
[Figure: information retrieval framework; surviving label: Patent System Ontology]

Future Work
• Using multiple enriched ontologies may provide the necessary terms
• MeSH Descriptors are provided for every publication during indexing and can potentially improve results
• Implement Okapi model for scoring documents

Thank You

Backup Slides

Motivation
• Scientific literature is an important source of information
• Retrieving relevant information from scientific publications is challenging
• Domain terminology is used inconsistently in scientific publications
• Increasing amounts of information amplify the problem
• Improved methodologies based on semantics are required

Background
• Text REtrieval Conference (TREC) organized by NIST has showcased many successful methods
• The Genomics track focused on full-text scientific publications from 49 prominent journals
• Methodologies involved:
  – Use of Synonymy from ontologies
  – Language based models
  – Query expansion and annotations
  – Okapi scoring model

Goals
• Understand how domain ontologies can be leveraged
• Understand which domain ontologies can be leveraged
• Develop a knowledge-based approach to integrate domain knowledge with search mechanism

Query Expansion
• TREC Queries are first manually pre-processed
  “What [TUMOR TYPES] are found in zebrafish?” => “[Tumor][MeSH] AND zebrafish”
• [Tumor] indicates term
that has to be expanded
• [MeSH] indicates the ontology that should be used

Summary
• Search methodologies must be based on semantics in order to tackle terminology inconsistency
• Domain ontologies provide these semantics
• Domain ontologies need to be modified (or enriched) in order to fulfill information needs
• User interaction is important

BioPortal
• BioPortal is an integrated resource for biomedical ontologies
• Currently indexes over 300 ontologies including Medical Subject Headings and Gene Ontology
• Provides a comprehensive web service, abstracting the formats and APIs of all underlying ontologies
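The bracket notation from the Query Expansion slides ("[Tumor][MeSH] AND zebrafish") can be sketched as a small expansion routine. This is only an illustration under stated assumptions: the synonym table is a hand-filled stand-in for what the deck obtains through BioPortal's web service, the function names are hypothetical, and joining the expanded terms with OR is one reading of the parenthesised lists shown on the slides.

```python
import re

# Hypothetical in-memory stand-in for ontology lookups. In the deck the
# expansion terms come from BioPortal; these entries are copied from the slides.
SYNONYMS = {
    ("MeSH", "tumor"): ["Tumor", "Neoplasm", "Carcinoma", "Leukemia"],
    ("MeSH", "zebrafish"): ["zebrafish", "danio rerio"],
}

# Matches an annotated term such as "[Tumor][MeSH]".
TAGGED = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def expand_clause(clause, synonyms=SYNONYMS):
    """Return the alternative terms for one AND clause."""
    clause = clause.strip()
    match = TAGGED.fullmatch(clause)
    if match is None:
        return [clause]  # plain term, used without modification
    term, ontology = match.groups()
    # Fall back to the bare term when the ontology has no entry for it.
    return synonyms.get((ontology, term.lower()), [term])


def quote(term):
    """Quote multi-word terms so they stay together as a phrase."""
    return f'"{term}"' if " " in term else term


def expand_query(query, synonyms=SYNONYMS):
    """Expand every tagged term in a query made of AND-joined clauses."""
    parts = []
    for clause in query.split(" AND "):
        alternatives = [quote(t) for t in expand_clause(clause, synonyms)]
        if len(alternatives) == 1:
            parts.append(alternatives[0])
        else:
            # Assumption: alternatives within one clause are OR-ed together.
            parts.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(parts)


single = expand_query("[Tumor][MeSH] AND zebrafish")
multiple = expand_query("[Tumor][MeSH] AND [zebrafish][MeSH]")
```

Because the ontology name is part of the lookup key, the same term can be expanded against a different source simply by changing the second bracket, which mirrors the flexibility the slides attribute to query expansion over pre-indexing annotation.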