MED267 Modeling Clinical Data and Knowledge for Computation
Week 9 – Using Ontologies in Biomedical Research
Amarnath Gupta

A BRIEF RECAP
SOME PRELIMINARIES AND SOME NOT-SO-PRELIMINARIES

Ontology
• A formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts. Often expressed in a language called OWL-DL.
◦ Classes: sets, collections, concepts, classes in programming, types of objects, or kinds of things
◦ Attributes: aspects, properties, features, characteristics, or parameters that objects (and classes) can have
◦ Relations: ways in which classes and individuals can be related to one another
◦ Individuals: instances or objects (the basic or "ground level" objects)
◦ Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input
◦ Rules: statements in the form of an antecedent-consequent sentence that describe the logical inferences that can be drawn from an assertion in a particular form
◦ Axioms: assertions (including rules) in a logical form that together comprise the overall theory in its domain of application
An ontology can be viewed as a graph with an acyclic backbone and a logical interpretation

Querying ontologies
An ontology is a 2-graph system
◦ Class graph
◦ Instance graph
Reasoner Queries
◦ Inferencing
◦ Classification
◦ Consistency
Data Queries
◦ Binding retrieval
◦ Subgraph retrieval
SPARQL 1.0
◦ An edge query language
SPARQL 1.1
◦ An edge query language with regular expressions on edges
OWL-QL
◦ DL query language
Rule Language
◦ SWRL
Emerging trends
◦ Keyword query languages
◦ Subgraph query languages
We will revisit the query language issue as we go forward

Upper Ontologies
• An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies.
It employs a core glossary that contains the terms, associated object properties and relationships as they are used in various relevant domain sets.
◦ We have used the Basic Formal Ontology (BFO, http://www.ifomis.org/bfo/publications) and the Relation Ontology (RO) for our work
Example definition: a plasma membrane is a cell component that has as its parts a maximal phospholipid bilayer in which instances of two or more types of protein are embedded.
Topics: Classification and Differentiation; Continuants and Occurrents; Standardizing Relationships; Temporal parameter
Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall CJ, Neuhaus F, Rector A, Rosse C. Relations in Biomedical Ontologies. Genome Biology, 2005.

BFO
◦ Continuant
    Independent Continuant (thing)
    Dependent Continuant (quality), e.g., temperature depends on its bearer
◦ Occurrent (process, event)

The Open Biomedical Ontologies (OBO) Foundry
Relation to time: Continuant (Independent, Dependent) and Occurrent
Granularity: organ and organism
◦ Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
◦ Dependent continuant: Organ Function (FMP, CPRO)
Granularity: cell and cellular component
◦ Independent continuant: Cell (CL); Cellular Component (FMA, GO)
◦ Dependent continuant: Cellular Function (GO)
Granularity: molecule
◦ Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
◦ Dependent continuant: Molecular Function (GO)
◦ Occurrent: Molecular Process (GO)
Across granularities
◦ Dependent continuant: Phenotypic Quality (PaTO)
◦ Occurrent: Biological Process (GO)

Merging Ontologies
◦ Any real application needs to make use of multiple ontologies
◦ Often the strategy is to construct a specific ontology by assembling elements of multiple ontologies
◦ What happens if
    one ontology uses an upper ontology (say BFO) and another doesn't?
    one ontology uses a fixed set of relationships and another doesn't?

OBI – An Ontology for Biomedical Investigations
OBI models experiments
Some material entities in OBI
◦ material entity
◦ processed material
◦ device
◦ organization
◦ cell culture
◦ chemical entities in solution
◦ PCR product
◦ molecular entity (ChEBI)
◦ protein complex (Gene Ontology)
◦ cell (Cell Ontology)
◦ anatomical entity (FMA, CARO)
◦ organism (NCBI taxonomy)

MIREOT
Minimal Information to Reference
External Ontology Terms
The idea
◦ The minimal set of information that allows a term to be identified unambiguously
    URI of the class
    URI of the source ontology
    Superclass of the term in the source ontology
    Position in the target ontology
◦ Additional useful information
    Label, definition, other annotations: adding "human-readable" information
    Superclasses: for example, NCBI taxonomy
Problem
◦ Lose complete inference
    But because the imported ontology might not be commensurate with the base ontology, the inferences are questionable

Modularization of Ontologies
A set of principles for
◦ Decomposing a larger ontology into smaller meaningful components
◦ Assimilating a set of component ontologies into a larger ontology
◦ Modules must
    Have semantic locality
    Preserve loose coupling and autonomy
    Enable partial reuse of knowledge
    Preserve directionality of knowledge import
    Ensure scalability

Uberon – an integrated multi-species anatomy ontology
Using taxonomic constraints
◦ Over 6,500 classes representing anatomical entities
◦ Represents structures in a species-neutral way and includes extensive associations to existing species-centric anatomical ontologies
◦ Allows integration of model organism and human data
◦ Uses novel methods for representing taxonomic variation
◦ Used for translational phenotype analyses

A BIOMEDICAL RESEARCH PROBLEM
Finding Drugs for Rare/Orphan Diseases

Orphan Diseases
◦ Diseases affecting fewer than 200,000 people in the U.S.
◦ Approx. 7000 rare diseases affecting 25 million globally (1)
◦ Orphan Drug Act – 1983
    Incentives for orphan drug development
◦ Around 355 with approved orphan drug therapies
◦ Recent interest of pharma giants in orphan drug R&D
[Image: Child with Tay-Sachs disease. Image source: http://www.ntsad.org/index.php/the-diseases]
Source (1): Rados, FDA Consumer, 2003

Orphan disease information space - a need for systematic analysis
Genetic causes
• 80% of rare diseases have a genetic origin
Underlying mechanism of disease processes
• Intricate/complex
• May involve single or multiple loci, QTLs, multiple genes
• Time-varying
• Unknown for many diseases
Drugs
• Exact mechanism of action complex
• Not completely documented due to proprietary reasons

Approaches to drug repositioning
Drug-centric approach
◦ Hypothesis: 'similar drugs' have the same therapeutic effects and are equally effective for a disease.
Disease-centric approach
◦ Hypothesis: 'similar diseases' need the same therapies and can be treated with the same drugs.
[Figure: Repositioned drugs, reached from the drug-centric and disease-centric sides. Source (5): Adapted from Liu et al., Sept 2012]

Finding Drugs as an Exploratory Problem
◦ If a Genetic Variant (GV) is associated with disease progression, then a drug/chemical which suppresses the GV or its gene product is a possible treatment option for the associated disease.
[Diagram labels: Gene product; ↑ or ↓ production]
◦ If a GV is associated with disease remission, then a drug/chemical which increases activity of the GV is a possible treatment option for the associated disease.
◦ If a disease-associated GV causes increased expression of certain receptors, a drug which suppresses this receptor will be a possible treatment option for the disease.
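The three hypotheses above can be read as simple if-then rules over variant, disease, and drug facts. The following is a minimal sketch of that reading, not a method from the lecture: the class names, the `candidate_drugs` function, and all entries (`GV1`, `DrugA`, `ReceptorR`, and so on) are hypothetical placeholders, and the first rule is simplified so that a drug targets the variant directly rather than its gene product.

```python
from dataclasses import dataclass

# Drug action each rule looks for, keyed by the variant's effect on the disease.
WANTED_ACTION = {"progression": "suppresses", "remission": "increases"}


@dataclass(frozen=True)
class Association:
    variant: str   # genetic variant (GV)
    disease: str
    effect: str    # "progression" or "remission"


@dataclass(frozen=True)
class DrugAction:
    drug: str
    target: str    # a variant or a receptor
    action: str    # "suppresses" or "increases"


def candidate_drugs(associations, upregulated_receptors, drug_actions):
    """Apply the three rules and return sorted (drug, disease, reason) triples."""
    candidates = set()
    for assoc in associations:
        wanted = WANTED_ACTION.get(assoc.effect)
        receptors = upregulated_receptors.get(assoc.variant, ())
        for act in drug_actions:
            # Rules 1 and 2: suppress a progression-linked GV,
            # or increase the activity of a remission-linked GV.
            if wanted and act.target == assoc.variant and act.action == wanted:
                candidates.add((act.drug, assoc.disease, f"{wanted} {assoc.variant}"))
            # Rule 3: suppress a receptor whose expression the GV increases.
            if act.action == "suppresses" and act.target in receptors:
                candidates.add((act.drug, assoc.disease, f"suppresses {act.target}"))
    return sorted(candidates)


# Hypothetical toy facts, for illustration only.
associations = [
    Association("GV1", "DiseaseX", "progression"),
    Association("GV2", "DiseaseY", "remission"),
]
upregulated_receptors = {"GV1": {"ReceptorR"}}
drug_actions = [
    DrugAction("DrugA", "GV1", "suppresses"),
    DrugAction("DrugB", "GV2", "increases"),
    DrugAction("DrugC", "ReceptorR", "suppresses"),
    DrugAction("DrugD", "GV1", "increases"),  # wrong direction, so no rule fires
]

result = candidate_drugs(associations, upregulated_receptors, drug_actions)
for row in result:
    print(row)
```

In practice the facts would come from resources such as those listed on the next slide, and each match is only a hypothesis to explore, not evidence of efficacy.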
[Figure: drug, genetic variant (SNP), receptor/enzyme and disease connected by "↑ or ↓ expression" and "↑ or ↓ progression" edges; the diagram also mentions biomarker discovery.]

Resources for drug repositioning
[Figure: resources grouped as gene expression based, drug-centric, disease-centric, and text-based, arranged around "Network & Computational Modeling". Source (5): Adapted from Liu et al., Sept 2012]
◦ GEO: 834,730 samples
◦ DrugBank: 6711 drugs, 4227 drug targets
◦ cMAP: 1307 compounds
◦ OMIM: 20,000 genes & phenotypes
◦ CTD: 18,414,321 toxicogenomic relationships
◦ Orphanet: 6500 rare diseases
◦ EMRs
◦ PubMed: 22 million citations

HIBM – a rare disease
Autosomal recessive disease
Clinical/diagnostic features
◦ Proximal and distal muscle weakness (starts with distal)
◦ Onset during late teens
◦ Mild elevation of serum CK
◦ Progression of muscular weakness continues for 10-20 years
◦ Spares the quadriceps
◦ Detection of "inclusion bodies" in muscle biopsies
    Rimmed vacuoles (clusters of autophagic vacuoles (AVs) and myeloid bodies) in muscle tissue
    Accumulation of beta-amyloid, accumulation of NCAM1 in muscle (hyposialylation)
    Intracellular deposition of Congo red-positive materials (such as β-amyloid and α-synuclein)
◦ No loss of cognitive function

HIBM – a rare disease
Genetic characteristics
◦ Caused by mutations in GNE at locus 9p13-p12
    homozygous or compound heterozygous
    bi-functional enzyme, UDP-N-Acetylglucosamine 2-epimerase/N-Acetylmannosamine Kinase
    catalyzes two adjacent steps in the sialic acid biosynthetic pathway
    feedback regulated
    phosphorylated (PKC) and ubiquitinated
◦ Associated with
    abnormal phosphorylation of tau
    activation of the ubiquitin proteasome system
    activation of the lysosomal system

Using Ontology Recommenders
Using Ontology Annotators
HIBM (is-a myopathy) (myopathy abnormality-of muscle-tissue)
Autosomal recessive (is-a genetic inheritance) disease
Clinical/diagnostic features
◦ Proximal muscle weakness (OMIM), distal muscle weakness (OMIM) (starts with distal)
◦ Onset during late teens
◦ Mild elevation of
(PATO) serum CK – (elevated creatine phosphokinase is-a elevated-enzyme-activity)
◦ Progression of (PATO) muscular weakness continues for 10-20 years
◦ Spares (not affects) the quadriceps (Uberon)
◦ Detection of "inclusion bodies" in muscle biopsies
    Rimmed vacuoles (clusters of autophagic vacuoles (AVs) and myeloid bodies) in muscle tissue
    Accumulation of beta-amyloid, accumulation of NCAM1 in muscle (hyposialylation) (decreased occurrence of sialic acid in)
    Intracellular deposition of Congo red-positive materials (such as β-amyloid and α-synuclein)
◦ No loss of cognitive function (CogPO)
Why is it hard to create ontologies for cognitive functions?

Organizing information with ontologies
HIBM – a rare disease
Genetic characteristics
◦ Caused by mutations in GNE at locus 9p13-p12
    homozygous or compound heterozygous
    bi-functional enzyme, UDP-N-Acetylglucosamine 2-epimerase/N-Acetylmannosamine Kinase
    catalyzes two adjacent steps in the sialic acid biosynthetic pathway (BioPAX)
    feedback regulated
    phosphorylated (PKC) and ubiquitinated (ubiquitination – GO)
◦ Associated with
    abnormal phosphorylation (GO) of tau protein (PRO)
    activation of the ubiquitin proteasome system
    activation of the lysosomal system

[Figure: map of reported GNE mutations along the protein sequence (residue axis 100 to 700), covering the UDP-GlcNAc 2-epimerase domain and the ManNAc 6-kinase domain. Some variants carry the labels "Japanese", "middle eastern", or "(homozygous)". Black: mutations in UniProt; grey: mutations in papers. Legend: ATP binding site; allosteric site; substrate binding site; nuclear export signal; enzymatic active site; Zn binding site; -p: phosphorylation site; -u: ubiquitination site.]
[Figure: human GNE mutations on the same domain layout, with their reported effects (+/−) on kinase activity, epimerase activity, oligomerization, and the feedback inhibition process.]
[Figure: rat, mouse, and CHO GNE mutations and mouse knockouts (tm1Rhk, tm1Sngi; insert: human GNE*D176V) on the same domain layout.]

Ontological Mapping of Findings to Sequences
Sequence Types and Features Ontology

Ontological Model of Pathways using BioPAX
Pathway: a set or series of interactions, often forming a network

Exploring for related information
What genes are related to inclusion body myopathies?

Enrichment analysis using ontologies
What are the relevant phenotypes?

The Human Phenotype Ontology
◦ Arranged as a directed acyclic graph (DAG)
    A given phenotypic feature can be considered a more specific aspect of more than one parent term.
    Terms that are located close to the root of the graph are less specific than terms that are farther away from it.
    Specificity is quantified as the information content (IC) of a term: −log p_i, where p_i is the frequency of the phenotypic manifestation i among all diseases in the database.
    Example: mental retardation, which is a common phenotypic manifestation of many hereditary diseases, is less clinically specific (has less information content) than a feature such as calcific stippling.

Comparing phenotypes
Figure 3.
Analysis of the phenotypic similarity of the Human Phenotype Ontology (HPO) terms downward slanting palpebral fissures and hypertelorism to annotations of (a) Greig cephalopolysyndactyly syndrome [GCPS (MIM 175700)] and (b) type II orofaciodigital syndrome [OFD2 (MIM 252100)]. The most specific common ancestor of hypertelorism and telecanthus is the term abnormality of the eye, and the similarity between hypertelorism and telecanthus is calculated as the information content of the term abnormality of the eye. Therefore, a search with the query terms downward slanting palpebral fissures and hypertelorism yields a higher score for GCPS than for OFD2.

Phenotypic similarity using EQ
Recall phenotype description using EQ description

Phenotypic similarity
◦ IC of the node, which is the negative log of the probability of that description being used to annotate a gene, allele, or genotype (collectively called a feature)
◦ Phenotypic Profile: multiple EQ descriptions annotated to a genotype
◦ Phenotypes annotated to genotypes are propagated to their allele(s), and in turn to the gene, indicated with upward arrows.
◦ Similarity is analyzed between any two nodes of the same type:
    gene A-vs-B, allele A3-vs-B1, genotypes A1/A1-vs-A3/A3, or A3/A3-vs-B1/B1.
◦ The common subsuming phenotypes between A1/A1-vs-A3/A3 and gene A-vs-B are itemized in white boxes. Some individual phenotypic descriptions can have two common subsumers.
◦ For each phenotypic description (EQ), the calculated IC is shown.
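The information content definition (−log p) and the "IC of the most specific common ancestor" similarity described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the HPO tooling: the `PARENTS` fragment only borrows term labels from the slides and does not reproduce the real HPO hierarchy, and the disease annotations `D1` to `D4` are made up.

```python
import math

# Toy DAG fragment: term -> list of parent terms (hypothetical, not real HPO structure).
PARENTS = {
    "phenotypic abnormality": [],
    "abnormality of the eye": ["phenotypic abnormality"],
    "hypertelorism": ["abnormality of the eye"],
    "telecanthus": ["abnormality of the eye"],
    "calcific stippling": ["phenotypic abnormality"],
}

# Made-up disease annotations: disease -> directly annotated terms.
ANNOTATIONS = {
    "D1": {"hypertelorism"},
    "D2": {"telecanthus"},
    "D3": {"hypertelorism", "telecanthus"},
    "D4": {"calcific stippling"},
}


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content(annotations):
    """IC(t) = -log p_t, with p_t the fraction of diseases annotated to t or a descendant."""
    counts = {term: 0 for term in PARENTS}
    for terms in annotations.values():
        implied = set().union(*(ancestors(t) for t in terms))
        for term in implied:
            counts[term] += 1
    total = len(annotations)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def term_similarity(a, b, ic):
    """Similarity of two terms = IC of their most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max((ic[t] for t in common if t in ic), default=0.0)


ic = information_content(ANNOTATIONS)
sim_eye = term_similarity("hypertelorism", "telecanthus", ic)
sim_far = term_similarity("hypertelorism", "calcific stippling", ic)
print(f"IC(abnormality of the eye) = {ic['abnormality of the eye']:.3f}")
print(f"sim(hypertelorism, telecanthus) = {sim_eye:.3f}")
print(f"sim(hypertelorism, calcific stippling) = {sim_far:.3f}")
```

In this toy data the rarely used term (calcific stippling) gets a higher IC than the broad term (abnormality of the eye), and two terms that share only the root have similarity 0, which matches the intuition stated in the slides.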
◦ When comparing two items, four scores are determined:
    maxIC, the maximum IC score for the common subsuming EQ, which may be a direct (in the case of A1/A1-vs-A3/A3) or inferred (in the case of gene A-vs-gene B) phenotype
    avgICCS, the average of all common subsuming IC scores
    simIC, the similarity score which computes the ratio of the sum of IC values for EQ descriptions (including subsuming descriptions) held in common (intersection) to that of the total set (union)
    simJ, a non-IC-based similarity score calculated with the Jaccard algorithm, which is the ratio of the count of all nodes in common (intersection) to the count of all nodes in either set (union)

Phenoclustering
Phenotype and genotype information can be viewed as a network
◦ Graph clustering techniques with suitable similarity metrics can be used to define node proximity
Phenoclustering: online mining of cross-species phenotypes. Groth et al., Bioinformatics 2010 26(15): 1924.

Investigating the hypothesis
Exploratory Search
◦ A specialization of information exploration which represents the activities carried out by searchers who are
    Unfamiliar with the domain of their goals
    Unsure about the ways to achieve their goals
    Possibly even unsure about their exact goals
◦ Hypothesis investigation can be viewed as an exploratory search over a semantically connected graph
    Find entities of type drug that relate to one or more of these genes, possibly through these pathways, and possibly through these phenotypes
    Distinct from finding statistically correlated information and thresholding on p-values

Role of ontologies in exploratory graph search
Ontologies serve as indices to data
◦ Semantic labels as indices
◦ Relationships as join indices
◦ Ontological neighborhoods as multi-join indices
    Helps to construct "semantic neighborhoods" between data nodes that are far apart
Ontologies as (implicit) query filters
◦ Find connections in the data graph only when the corresponding ontology entities satisfy a connectivity pattern
Node/Node Type distances can denote node similarities
◦ Can be a function of graph distances in the ontology
◦ Can be extended to define relatedness measures between data neighborhoods

Example Exploratory Query
◦ Find drug:* related-to gene:GNE, through some pathways, and optionally through some muscular dystrophy
◦ A potential exploration path
    GNE → missense mutations of GNE → reduced GNE-epimerase activities → GNE/MNK pathway → ManNAc kinase → clinical trials → drug DEX-M4
Exercise: how can ontologies contribute to finding this path?

Conclusions
◦ Upper ontologies are needed to organize concepts and relationships for a domain and application
◦ Principled methods of modularizing component ontologies help avoid large monolithic ontologies and potential inconsistencies
◦ Ontologies are not only used for conceptualizing a domain but also for tasks like data integration, enrichment analysis and (exploratory) search
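As a starting point for the exercise, the example exploratory query ("find drug:* related-to gene:GNE, through some pathways") can be sketched as a type-constrained path search, where the node types play the role of the ontology acting as a query filter. This is only an illustrative sketch: the graph below hard-codes the exploration path from the slide plus one dead-end edge, and the type labels in `NODE_TYPE`, the edge directions in `EDGES`, and the function `find_paths` are assumptions, not part of any real dataset or tool.

```python
from collections import deque

# Hypothetical node typing; in practice these labels would come from ontology annotations.
NODE_TYPE = {
    "GNE": "gene",
    "missense mutations of GNE": "variant",
    "reduced GNE-epimerase activities": "activity change",
    "GNE/MNK pathway": "pathway",
    "ManNAc kinase": "enzyme",
    "clinical trials": "study",
    "DEX-M4": "drug",
    "HIBM": "disease",
}

# Hypothetical directed edges following the exploration path on the slide.
EDGES = {
    "GNE": ["missense mutations of GNE", "HIBM"],
    "missense mutations of GNE": ["reduced GNE-epimerase activities"],
    "reduced GNE-epimerase activities": ["GNE/MNK pathway"],
    "GNE/MNK pathway": ["ManNAc kinase"],
    "ManNAc kinase": ["clinical trials"],
    "clinical trials": ["DEX-M4"],
}


def find_paths(start, target_type, required_types=(), max_hops=8):
    """Breadth-first search for simple paths from start to any node of target_type.

    A path is kept only if every type in required_types occurs on it,
    which mimics using ontology types as an implicit query filter.
    """
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if len(path) > 1 and NODE_TYPE[node] == target_type:
            types_on_path = {NODE_TYPE[n] for n in path}
            if set(required_types) <= types_on_path:
                paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in EDGES.get(node, ()):
            if neighbor not in path:  # keep paths simple (no revisits)
                queue.append(path + [neighbor])
    return paths


paths = find_paths("GNE", "drug", required_types=("pathway",))
for p in paths:
    print(" -> ".join(p))

# Requiring a disease node on the path filters this toy path out.
via_disease = find_paths("GNE", "drug", required_types=("disease",))
```

On a real data graph the search space grows quickly, which is where ontology-derived indices and type distances, as listed under "Role of ontologies in exploratory graph search", would be used to prune or rank candidate paths.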