Driving the Terminology Hub
RDF Triplets as a means to express lexical and referential data

Therese Vachon, NIBR, Unit Head UltraLink Technologies
W3C Workshop on RDF Access to Relational Databases
25-26 October 2007, Boston, MA, USA

Requirements

Cross-linking database information (e.g. on genes, proteins, metabolic pathways, compounds, ligands) to the original sources is a key issue.

Productivity in accessing, sharing, searching, navigating, cross-linking and analyzing internal and external data relevant to the pharmaceutical industry should be increased.

2 | Driving the Terminology Hub | Therese Vachon | 25.10.2007

Strategy

At NIBR, we have been developing a semantic integration layer on top of knowledge resources. It has been implemented within various services and applications. It uses
• A rich domain-specific terminology (biology, chemistry and medicine) containing 1.6 million terms
• A Terminology Hub containing 8 GB of referential data (cross-references between data repositories)

Using that knowledge, the scientist can access all data at hand with just a single mouse-click.

Application Areas for Terminologies

• Categorization of documents (via associated taxonomies)
• Search for concepts
• Semantic expansion of queries using synonyms and related terms
• Identification and extraction of relevant concepts (e.g. targets, genes, diseases, products) from texts
• Annotation of textual data with controlled terms as referential anchors
• Construction of a semantic layer on top of information sources allowing context-sensitive navigation (UltraLink)

Application Areas for the Terminology Hub

• Coherent mapping between terminologies and coding systems (e.g. UniProt Accession Number for a protein)
• Coherent mapping between internal knowledge repositories (e.g. Biological Assays and Chemical Compounds)
• Coherent mapping between external knowledge repositories (e.g. HUGO and OMIM)
• Coherent mapping between internal and external knowledge repositories (e.g. Internal Project Code and Product Name)

UltraLink makes use of both the terminologies (entity recognition) and the Terminology Hub (cross-referencing).

UltraLink

[Figure labels: UltraLink plug-in icon, Activation, Concept Types Frame]

[Figure labels: Activation, UltraLink]

The Landscape of Knowledge - Rooting the UltraLink in Data Sources/Terminologies

The UltraLink makes use of a broad range of knowledge sources, both internal to Novartis and external. The linkage of these terminologies provides the routes along which you can navigate when using the UltraLink. The linkage between the resources is created automatically via a rule-based mapping procedure and manually by annotation. The latter is extremely important for connecting internal knowledge sources to each other and to external ones.
The annotations built on the fly by the UltraLink could be stored as RDF annotations associated with a document and accessed by other computer programs, in the spirit of the Semantic Web.

[Figure labels: Concepts and Terminology, Concepts and Data]

Underlying terminologies used at NIBR

• > 15’000 Companies with > 35’000 terms
• > 2’000 Diseases with > 19’000 terms
• > 150’000 Genes with about 400’000 terms
• > 5’000 Modes of Action with > 12’000 terms
• > 95’000 Products with > 380’000 terms
• > 170’000 Targets with > 250’000 terms
• > 310’000 Species with > 435’000 terms
• + complete MeSH and EMTREE
• More than 1’600’000 terms
• The terminology consists of terms and relations between terms (main entry: normalized terms, synonyms, broader terms, narrower terms)

Principles used for the construction of the terminology and organization of terms

To create the terminology of reference, terms are extracted from available terminologies (e.g. UniProt, EntrezGene, HGNC) and the references to the source systems are preserved.

Terms specific to a database are referred to as local terms. These local terms are stored in a dedicated data structure, the Metastore.

Besides the flat set of terms, thesaurus relations such as synonymy, broader term and narrower term are extracted as well, which allows a thesaurus to be built.

For each entry in the terminology (e.g. a gene name or a product), one term is chosen from the list of synonyms and declared the “normalized term”.

Normalized (global) terms, synonyms (local terms), and broader and narrower terms, together with their sources of reference, constitute the terminology content behind the UltraLink and are used by the Terminology Hub.
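The construction principles above can be sketched as a small data structure. This is an illustrative Python sketch only, not the Metastore schema: the names `LocalTerm`, `TermEntry` and `build_entry`, the field names, and the placeholder source identifiers (`EXAMPLE-1` etc.) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LocalTerm:
    """A term as found in one source terminology (hypothetical structure)."""
    text: str
    source: str      # source terminology, e.g. "HGNC"
    source_id: str   # reference to the source system is preserved


@dataclass
class TermEntry:
    """One terminology entry: a normalized term plus thesaurus relations."""
    normalized: str
    concept_type: str
    local_terms: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)

    def synonyms(self):
        """Distinct surface forms other than the normalized term, sorted."""
        return sorted({t.text for t in self.local_terms} - {self.normalized})

    def sources(self):
        """Names of the source terminologies that contributed terms, sorted."""
        return sorted({t.source for t in self.local_terms})


def build_entry(concept_type, local_terms, normalized):
    """Declare one of the collected terms as the normalized term."""
    if normalized not in {t.text for t in local_terms}:
        raise ValueError("normalized term must be one of the collected terms")
    return TermEntry(normalized, concept_type, list(local_terms))


# Placeholder identifiers: they are NOT real accession numbers.
ccr5 = build_entry(
    "Gene",
    [
        LocalTerm("CCR5", "HGNC", "EXAMPLE-1"),
        LocalTerm("CKR5", "EntrezGene", "EXAMPLE-2"),
        LocalTerm("CMKBR5", "UniProt", "EXAMPLE-3"),
    ],
    normalized="CCR5",
)
ccr5.broader.append("chemokine receptor")

print(ccr5.normalized, ccr5.synonyms(), ccr5.sources())
```

Keeping the source and source identifier on every local term is what lets a later step trace a synonym back to the system it came from.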

Creating Reference - the Terminology Hub

Different knowledge repositories have different ways to encode a concept:
• Registry Number
• Unique Internal ID
• Concept Identifier
• Enumerating terms
• Just using different terms without any constraints

More than 8 GB of cross-referencing information

Searching for a term T in both source A and source B may lead to different results because of different naming/referencing conventions (false negatives in IR).

The Terminology Hub ensures coherent mapping
• Between coding systems
• Between different representation levels (e.g. ID vs. Concept)
• Between local terms and global terms

Classes of objects covered by the Terminology Hub

Coding systems
• A coding system provides a predefined set of (sometimes hierarchical) codes to represent a classification, a nomenclature, a controlled vocabulary, a thesaurus or chemical structures. For example, you can use the MeSH® Tree number C06.405.205.697 to refer to Gastritis in a specific sub-tree of MeSH®.

References
• Unique and unequivocal identifiers based on a coding system create references in their corresponding data repository. By nature, they are technical artifacts and not part of our scientific natural language (e.g. FTY720). Nevertheless, most of them deserve to be identified because they are used in the scientific literature.

Pointers and cross-referencing information
• The Metastore contains pointers that allow knowledge sources and applications to be cross-referenced.

Terms
• A term is the smallest meaningful linguistic unit on which our domains of discourse (biology, chemistry, medicine) are based. A term differs from a word because a term can consist of multiple meaningful words, such as “chronic obstructive pulmonary disease”.

Concepts
• A concept is an abstraction based on properties of individuals that we observe in the world. Individuals that belong to the same concept share a set of common properties. For example, “targets” share the property that they should be druggable.

Data Repositories (also named Knowledge Sources)
• For all kinds of different data, we use the general notion of a data repository. By using the term “data repository” we emphasize that there is a source where some data resides, without making any commitment about physical representation (e.g. database or text file) or format of representation (e.g. structured or free text).

Classes of objects covered by the Terminology Hub

[Diagram; recoverable labels:]
• Encoding: IUPAC, Structures, IDs, GIF, Symbols, Formulas, Registry Numbers, ...
• Data Repositories: Internal Chemistry DB, CI sources, Literature, Patents, ...
• Terms: spinal cord, vascular endothelial growth factor, CCR5, Glivec, ovarian cancer, Novartis, Cytomegalovirus, ...
• Reference: Compound nos, Project codes, Competitor codes, PMID 9683255, EntrezGene 450128, CAS 439-14-5, Patent numbers
• Concepts: Species, Products, Companies, Diseases, Genes, Targets, Mammalian Genes, ...
• Relation labels: synonym-of, broader, narrower, encodes, has-type, points-to, is-a

Achievements and Improvements

All information about terminologies and cross-references is stored in a relational database (Oracle 10.2.0.2).

The data in the database can be accessed through Web Services that allow users to find normalized terms, pointers for a specific concept type, etc.
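The kind of lookup such services offer (resolving any term to its normalized form, then listing the pointers of a given type) can be sketched in memory as follows. This is an illustration under stated assumptions, not the actual Oracle schema or web service interface: the class `TerminologyHub`, its method names, the pointer type labels and the identifier values (`CAS-EXAMPLE` etc.) are all hypothetical placeholders.

```python
from collections import defaultdict


class TerminologyHub:
    """In-memory stand-in for the hub (hypothetical; the real data is in a database)."""

    def __init__(self):
        # lower-cased term (normalized or synonym) -> normalized term
        self._normalized = {}
        # normalized term -> pointer type -> set of identifiers
        self._pointers = defaultdict(lambda: defaultdict(set))

    def add_term(self, normalized, synonyms=()):
        """Register a normalized term together with its synonyms."""
        for term in (normalized, *synonyms):
            self._normalized[term.lower()] = normalized

    def normalize(self, term):
        """Map a local term or synonym to its normalized (global) term."""
        try:
            return self._normalized[term.lower()]
        except KeyError:
            raise KeyError(f"unknown term: {term!r}") from None

    def add_pointer(self, term, pointer_type, identifier):
        """Attach a cross-reference to the normalized form of `term`."""
        self._pointers[self.normalize(term)][pointer_type].add(identifier)

    def pointer_types(self, term):
        """All pointer types available for a term, sorted."""
        return sorted(self._pointers.get(self.normalize(term), {}))

    def pointers(self, term, pointer_type):
        """All identifiers of one pointer type for a term, sorted."""
        by_type = self._pointers.get(self.normalize(term), {})
        return sorted(by_type.get(pointer_type, ()))


hub = TerminologyHub()
hub.add_term("Glivec", synonyms=["imatinib", "STI571"])
# Placeholder identifiers: they are NOT real registry, project or patent numbers.
hub.add_pointer("Glivec", "CAS Registry Number", "CAS-EXAMPLE")
hub.add_pointer("imatinib", "Project code", "PROJ-EXAMPLE")
hub.add_pointer("STI571", "Patent number", "PAT-EXAMPLE")

print(hub.normalize("sti571"))
print(hub.pointer_types("imatinib"))
print(hub.pointers("Imatinib", "Project code"))
```

Because every pointer is stored under the normalized term, a query by any synonym reaches the same cross-references, which is the coherence the hub is meant to provide.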

Metastore Web Service

Get all synonyms for a normalized form

UltraLink Web Services

Get all accessible pointer types for a normalized form …

Achievements and Improvements

We intend to improve the semantic representation of the data in order to facilitate reuse, interoperability and exchange.

RDF notation and RDF coding standards provide an adequate means for a richer semantic representation.

We use SKOS, Dublin Core and other RDF-based coding standards and supplement them with our own RDF vocabulary.

Simple Knowledge Organisation System (example)

Terminology for Diseases (SKOS fragment)

Converting Terminologies to RDF

• Clear separation of terminologies from ontologies. We assign a type (rdf:type) to the URI of a term as a reference to a concept in an ontology.
• Conversion to RDF increased the amount of data roughly by a factor of 3.
• We obtained more than 5 million RDF triples as a preliminary representation of our terminologies.
• We are currently setting up the entire workflow for generating, storing and querying RDF.

Conclusion

• The first phase, transforming the terminology to RDF/XML, is completed.
• We are currently developing a model for representing the Terminology Hub in RDF.
• We expect that an RDF notation of the Terminology Hub will comprise approximately 50 million RDF triples.
• We intend to test the framework thoroughly (performance, effective semantic gain compared to the current technology).
• Closer collaboration with the W3C Healthcare group

Acknowledgements

Thanks to the ULT team
• Semantic & Text Analytics Layer: Martin Romacker, Pierre Parisot, Nicolas Grandjean
• Data Integration & Services Layer: Alexander Fromm, Laurent Mentek
• Application Layer: Daniel Cronenberger, Olivier Kreim

Thanks to Manuel Peitsch