The CROP (Common Reference Ontologies for Plants) Initiative
Barry Smith
September 13, 2013
http://ontology.buffalo.edu/smith

Agenda
• The OBO Foundry Principles
• Reference ontologies vs. application ontologies
• Other ontology consortia
• The CROP Initiative
• Examples of ontologies within CROP

On June 22, 1799, in Paris, everything changed

International System of Units

How to find data?
How to find other people’s data?
How to reason with data when you find it?
How to work out what data does not yet exist?

How to solve the problem of making the data we find queryable and reusable by others?
Part of the solution must involve: standardized terminologies and coding schemes

But there are multiple kinds of standardization for biological data, and they do not work well together
Proposed solution: Ontology-based annotation of data

ontologies = standardized labels designed for use in annotations to make the data cognitively accessible to human beings and algorithmically accessible to computers

ontologies = high quality controlled structured vocabularies for the annotation (description) of data, images, journal articles …

Ramirez et al., “Linking of Digital Images to Phylogenetic Data Matrices Using a Morphological Ontology”, Syst. Biol. 56(2):283–294, 2007

ontologies used in curation of literature
• what cellular component?
• what molecular function?
• what biological process?

Proposed framework: the Semantic Web
• HTML demonstrated the power of the Web to allow sharing of information
• can we use semantic technology to create a Web 2.0 which would allow algorithmic reasoning with online information based on a common Web Ontology Language (OWL)?
• can we use netcentricity, common URLs, to break down silos, and create useful integration of on-line data and information?

Ontology success stories, and some reasons for failure
• A fragment of the “Linked Open Data” in the biomedical domain

http://bioportal.bioontology.org/

The more ontology-building is successful, the more it fails
OWL breaks down data silos via controlled vocabularies for the description of data dictionaries
Unfortunately the very success of this approach led to the creation of multiple, new, semantic silos, because multiple ontologies are being created in ad hoc ways

http://bioportal.bioontology.org/
Many ontologies in BioPortal are created by importing content from existing ontologies and giving the imported terms new names and new IDs
The result is chaos, with bits and pieces of the same ontologies chopped up and scattered across multiple different places. This leads to massively redundant effort, forking and doom

A standard engineering methodology
• It is easier to write useful software if one works with a simplified model
• (“…we can’t know what reality is like in any case; we only have our concepts…”)
• This looks like a useful model to me
• (One week goes by:) This other thing looks like a useful model to him
• Data in Pittsburgh does not interoperate with data in Vancouver
• Science is siloed

A good solution to this silo problem must be:
• modular
• incremental
• independent of hardware and software
• bottom-up
• evidence-based
• revisable
• incorporate a strategy for motivating potential developers and users

Uses of ‘ontology’ in PubMed abstracts

main reason for GO’s success
Gene Ontology and associated databases “make it possible to systematically dissect large gene lists in an attempt to assemble a summary of the most enriched and pertinent biology” (PMC2615629)

GO provides a controlled system of terms for use in annotating (describing, tagging) data
• multi-species, multi-disciplinary, open source
• contributing to the cumulativity of scientific results obtained by distinct research communities
• compare use of kilograms, meters, seconds in formulating experimental results

GO is 3 ontologies
• cellular component
• molecular function
• biological process

Top-Level Architecture
[Diagram labels: Continuant, Independent Continuant, Dependent Continuant, Occurrent (Process, Event); universals, instances]

Problem with the GO
• it covers only three types of entities
• no diseases
• no laboratory artifacts
• no anatomy (above the cell)
• only species-terms for development
• no phenotypes

The Open Biomedical Ontologies (OBO) Foundry
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT), OCCURRENT
GRANULARITY: ORGAN AND ORGANISM
– independent continuant: Organism (NCBI Taxonomy), Anatomical Entity (FMA, CARO)
– dependent continuant: Organ Function (FMP, CPRO), Phenotypic Quality (PaTO)
GRANULARITY: CELL AND CELLULAR COMPONENT
– independent continuant: Cell (CL), Cellular Component (FMA, GO)
– dependent continuant: Cellular Function (GO)
GRANULARITY: MOLECULE
– independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– dependent continuant: Molecular Function (GO)
OCCURRENT: Biological Process (GO), Molecular Process (GO)

rationale of OBO Foundry coverage
[Figure: the same coverage matrix, with the occurrent column split into Organism-Level Process (GO), Cellular Process (GO) and Molecular Process (GO)]

First step (2001)
a shared portal for (so far) 58 ontologies (low regimentation)
http://obo.sourceforge.net
NCBO BioPortal

OBO builds on the principles successfully implemented by the GO, recognizing that ontologies need to be developed in tandem

Second step (2006)
The OBO Foundry
http://obofoundry.org/

Building out from the original GO
[Figure: the OBO Foundry coverage matrix, with Biological Process (GO) and Molecular Process (GO) as occurrents]

initial OBO Foundry coverage
[Figure: the OBO Foundry coverage matrix, with the occurrent column split into Organism-Level Process (GO), Cellular Process (GO) and Molecular Process (GO)]

OBO Foundry Principles
• common formal architecture
• clearly delineated content (redundant: overlaps with orthogonality)
• the ontology is well-documented (overlaps with rules for definitions; needs expanding: for developers, for users, minimal metadata)
• plurality of independent users
• single locus of authority, trackers, help desk

OBO Foundry Principles
• textual definitions plus formal definitions
• all definitions should be of the genus-species form
  A =def. a B which Cs
  where B is the parent term of A in the ontology hierarchy
• formal definitions use OBO format or OWL

Orthogonality
• For each domain, there should be convergence upon a single ontology that is recommended for use by those who wish to become involved with the Foundry initiative
• Part of the goal here is to avoid the need for mappings, which are in any case too expensive, too fragile, and too difficult to keep up-to-date as mapped ontologies change
• Orthogonality means:
– everyone knows where to look to find out how to annotate each kind of data
– everyone knows where to look to find content for application ontologies

Orthogonality = non-redundancy for the reference ontologies inside the Foundry
• application ontologies can overlap, but then only in those areas where common coverage is supplied by a reference ontology

PRINCIPLES
COMMON FORMAL ARCHITECTURE: The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the Basic Formal Ontology (BFO)
http://www.ifomis.uni-saarland.de/bfo/
‘formal’ = domain neutral

Basic Formal Ontology
• Continuant
– Independent Continuant: cell component
– Dependent Continuant: molecular function
• Occurrent: biological process

OBO Foundry provides guidelines (traffic laws) to new groups of ontology developers in ways which can counteract current dispersion of effort

New principle: Employ the methodology of cross-products
Compound terms in ontologies are to be defined as cross-products of simpler terms.
E.g. elevated blood glucose is a cross-product of PATO: increased concentration with FMA: blood and ChEBI: glucose.
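The cross-product example above can be sketched in code. This is a minimal illustration, not the OBO or OWL representation itself: the `Term` and `CrossProduct` classes and the relation names `inheres_in` and `towards` are assumptions introduced here, and the terms are identified by label only, without real ontology IDs. The definition is rendered in the genus-species form "A =def. a B which Cs".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Term:
    """A term taken from a single reference ontology (label only, no real ID)."""
    ontology: str
    label: str


@dataclass(frozen=True)
class CrossProduct:
    """A compound term defined from simpler terms in other ontologies."""
    label: str
    genus: Term                                  # the parent term B
    differentia: Tuple[Tuple[str, Term], ...]    # (relation, term) pairs: the "which Cs" part

    def definition(self) -> str:
        # Render the genus-species form: A =def. a B which Cs
        parts = [f"{rel} {t.ontology}:{t.label}" for rel, t in self.differentia]
        genus = f"{self.genus.ontology}:{self.genus.label}"
        return f"{self.label} =def. {genus} which " + " and ".join(parts)

    def sources(self) -> set:
        # The reference ontologies this compound term depends on
        return {self.genus.ontology} | {t.ontology for _, t in self.differentia}


# Relation names below are hypothetical placeholders for illustration.
elevated_blood_glucose = CrossProduct(
    label="elevated blood glucose",
    genus=Term("PATO", "increased concentration"),
    differentia=(
        ("inheres_in", Term("FMA", "blood")),
        ("towards", Term("ChEBI", "glucose")),
    ),
)

print(elevated_blood_glucose.definition())
print(sorted(elevated_blood_glucose.sources()))
```

Because the compound term stores only references to terms owned by other ontologies, nothing from PATO, FMA or ChEBI is copied or renamed, which is the point of the cross-product methodology.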
Cross-products = factoring out of ontologies into discipline-specific modules (orthogonality)

The methodology of cross-products
Enforcing use of common relations in linking terms drawn from Foundry ontologies serves
• to ensure that the ontologies are maintained and revised in tandem
• to bind terms in different ontologies together, via logically defined relations, to create a network

Building out from the original GO
[Figure: the OBO Foundry coverage matrix (granularity levels ORGAN AND ORGANISM, CELL AND CELLULAR COMPONENT, MOLECULE against CONTINUANT and OCCURRENT)]

Population-level ontologies
[Figure: the coverage matrix extended with a new granularity level, COMPLEX OF ORGANISMS: Family, Community, Deme, Population (independent continuant); Population Phenotype (dependent continuant); Population Process (occurrent)]

Environment Ontology
[Figure: the coverage matrix with environments added alongside the granularity levels]

Extension Strategy + Modular Organization
top level: Basic Formal Ontology (BFO)
mid-level: Ontology for Biomedical Investigations (OBI), Information Artifact Ontology (IAO)
domain level:
• Anatomy Ontology (FMA*, CARO)
• Cell Ontology (CL)
• Cellular Component Ontology (FMA*, GO*)
• Environment Ontology (EnvO)
• Subcellular Anatomy Ontology (SAO)
• Sequence Ontology (SO*)
• Protein Ontology (PRO*)
• Spatial Ontology (BSPO)
• Infectious Disease Ontology (IDO*)
• Phenotypic Quality Ontology (PaTO)
• Biological Process Ontology (GO*)
• Molecular Function (GO*)

Third step: Creation of new ontology consortia, modeled on the OBO Foundry
• OBO Foundry: Open Biological and Biomedical Ontologies
• NIF Standard: Neuroscience Information Framework
• eagle-i: Ontologies used by VIVO and CTSAconnect
• IDO Consortium: Infectious Disease Ontology

A good solution to the silo problem must be:
• modular
• incremental
• independent of software and hardware
• bottom-up
• evidence-based
• revisable
• incorporate a strategy for motivating potential developers and users

Because the ontologies in the Foundry are built as orthogonal modules which form an incrementally evolving network
• scientists are motivated to commit to developing ontologies because they will need in their own work ontologies that fit into this network
• users are motivated by the assurance that the ontologies they turn to are maintained by experts

More benefits of orthogonality
• helps those new to ontology to find what they need
• to find models of good practice
• ensures mutual consistency of ontologies (trivially)
• and thereby ensures additivity of annotations

More benefits of orthogonality
• it rules out the sorts of simplification and partiality which may be acceptable under more pluralistic regimes
• thereby brings an obligation on the part of ontology developers to commit to scientific accuracy and domain-completeness

More benefits of orthogonality
• No need to reinvent the wheel for each new domain
• Can profit from storehouse of lessons learned
• Can more easily reuse what is made by others
• Can more easily reuse training
• Can more easily inspect and criticize results of others’ work
• Leads to innovations (e.g. MIREOT, OntoFox) in strategies for combining ontologies

Reference Ontologies vs. Application Ontologies
Reference ontology = an ontology that captures generic content and is designed for aggressive reuse in multiple different types of context.
Our assumption is that most reference ontologies will be created manually on the basis of explicit assertion of the taxonomical and other relations between their terms.

Reference Ontologies vs. Application Ontologies
By ‘application ontology’ we mean an ontology that is tied to specific local applications.
Each application ontology is created by using ontology merging software to combine new, local content with generic content taken over from relevant reference ontologies.
Xiang et al., “OntoFox: Web-Based Support for Ontology Reuse”, BMC Research Notes, 2010, 3:175.

Normalization of the ontology space
• content from reference ontologies is maximally re-used, e.g. in formulation of compound terms and of cross-product definitions
(Compare normalization of a vector space)
(Compare, again, SI System of Units)

International System of Units

Infectious Disease Ontology (IDO)

We have data, e.g.:
• TBDB: Tuberculosis Database, including Microarray data
• VFDB: Virulence Factor DB
• TropNetEurop Dengue Case Data
• ISD: Influenza Sequence Database at LANL
• MPD/MRD/CPP: Protein Data of PIR Resource Center for Biodefense Proteomics Research
• PathPort: Pathogen Portal Project

Purpose of Infectious Disease Ontology (IDO)
• Retrieval and integration of infectious disease relevant data
– Sequence and protein data for pathogens
– Case report data for patients
– Clinical trial data for drugs, vaccines
– Epidemiological Data for surveillance, prevention
– ...
• Goal: to make data deriving from different sources comparable and computable

IDO Strategy
• Reference ontology (IDO Core) with terms relevant to any infectious disease
• Disease- and organism-specific application ontologies
– for different types of host, types of vector, types of pathogen, types of disease

Infectious Disease Ontology (IDO)
• Member of the OBO Foundry
• A suite of ontologies
– IDO Core:
  • General terms in the ID domain.
  • A hub for all IDO extensions.
– IDO Extensions:
  • Disease specific.
  • Developed by subject matter experts.
• Provides:
– Clear, precise, and consistent natural language definitions
– Computable logical representations (OWL, OBO)

How IDO evolves
CORE and SPOKES: Domain ontologies
SEMI-LATTICE: By subject matter experts in different communities of interest.
[Diagram nodes: IDOCore, IDOMAL, IDOFLU, IDOHIV, IDOStrep, IDOSa, IDOMRSa, IDOHumanSa, IDOHumanStrep, IDOHumanBacterial, IDORatSa, IDORatStrep, IDOAntibioticResistant]

IDO Process Model

Sample Application: A lattice of infectious disease application ontologies from NARSA isolate data
• Expose value of Genotype-Phenotype Linked Data by converting a free-text database from NARSA (Network on Antimicrobial Resistance in Staphylococcus aureus) into a computational resource

Ways of differentiating Staphylococcus aureus infectious diseases
• Infectious Disease
– By host type
– By (sub-)species of pathogen
– By antibiotic resistance
– By anatomical site of infection
• Bacterial Infectious Disease
– By PFGE (Strain)
– By MLST (Sequence Type)
– By BURST (Clonal Complex)
• Sa Infectious Disease
– By SCCmec type
  • By ccr type
  • By mec class
– By spa type
http://www.sccmec.org/Pages/SCC_ClassificationEN.html

NRS701’s resistance to clindamycin
[Diagram labels: ido.owl, narsa.owl, narsa-isolates.owl, ndf-rt]

Further extensions of IDO
• Vaccine (Vaccine Ontology)
• Plant IDO

from ICBO 2012: Founding CROP

The ontologies in CROP
General ontologies taken over from OBO Foundry
• ChEBI Chemistry ontology
• GO Gene Ontology
• PRO Protein Ontology
• ENVO Environment Ontology + GAZ Gazetteer built on ontological principles
• PATO Phenotype Ontology

Plant specific ontologies to be developed by CROP group
• PO Plant Ontology
• TO Trait Ontology
• EO Plant Environment Ontology
• Plant IDO Plant Disease
Action items:
• fix relation between EnvO and EO
• fix relation between PATO and TO
Taxonomy resource (for diseases of host and causal organisms + vectors/secondary hosts)
NCBI Taxonomy has most of the hosts, but not the viruses

Examples of CROP actions
1. ontology training
2. ontology hub-spokes formations (e.g. for plant development)
3. treaty negotiation meetings

Next steps in CROP: PRO-PO-GO Meeting
Buffalo, Spring 2013
PRO = protein ontology
PO = plant ontology
GO = gene ontology

The Environment Ontology
• OBO Foundry
• Genomic Standards Consortium
• Natural Environment Research Council (UK)
• USDA, Gramene, J. Craig Venter Institute ...

Applications of EnvO in biology

How EnvO currently works for information retrieval
Retrieve all experiments on organisms obtained from:
– deep-sea thermal vents
– arctic ice cores
– rainforest canopy
– alpine melt zone
Retrieve all data on organisms sampled from:
– hot and dry environments
– cold and wet environments
– a height above 5,000 meters
Retrieve all the omic data from soil organisms subject to:
– moderate heavy metal contamination

extending EnvO to clinical and translational research
• we have public health, community and population data
• we need to make this data available for search and algorithmic processing
• we create a consensus-based ontology which can interoperate with ontologies for neighboring domains of medicine and basic biology

Environment = totality of circumstances external to a living organism or group of organisms
– pH
– evapotranspiration
– turbidity
– available light
– predominant vegetation
– predatory pressure
– nutrient limitation
…

extend EnvO to the clinical domain
– dietary patterns (Food Ontology: FAO, USDA) ... allergies
– neighborhood patterns
  • built environment, living conditions
  • climate
  • social networking
  • crime, transport
  • education, religion, work
  • health, hygiene
– disease patterns
  • bio-environment (bacteriological, ...)
  • patterns of disease transmission (links to IDO)

Aligning EnvO to the Basic Formal Ontology
[Diagram labels: continuant, system, ecosystem, biome, habitat, object, organism, pond, environmental feature, site, mountain slope, spatial region, …]
• Habitat =def. An ecosystem which can support the life of a given organism, population, or community
• Realized niche =def. An ecosystem which is that part of a habitat which supports the life of a given organism, population or community

Aligning EnvO to the Basic Formal Ontology
[Diagram repeated: continuant, system, ecosystem, biome, habitat, object, organism, pond, environmental feature, site, mountain slope, spatial region, …]
Hutchinsonian niche (niche as volume in a functionally defined hyperspace)
• =def. an n-dimensional hyper-volume whose dimensions correspond to resource gradients over which species are distributed
– degree of slope, exposure to sunlight, soil fertility, foliage density, salinity...
G.E. Hutchinson (1957, 1965)

Aligning EnvO to the Basic Formal Ontology
[Diagram repeated, with niche added and linked to habitat by part_of]
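
The Hutchinsonian definition of niche as an n-dimensional hyper-volume can be made concrete with a small sketch. It assumes, as a simplification, that the hyper-volume is an axis-aligned box with one closed interval per resource gradient; real niches need not have this shape. The gradient names, units and numeric ranges below are hypothetical values chosen only for illustration, and `in_niche` is not part of EnvO or any existing library.

```python
from typing import Dict, Tuple

# One closed interval (low, high) per resource gradient
Range = Tuple[float, float]


def in_niche(niche: Dict[str, Range], conditions: Dict[str, float]) -> bool:
    """Return True if the measured conditions fall inside the hyper-volume.

    A site counts as inside only when every niche dimension has been
    measured and its value lies within that dimension's interval.
    """
    for dimension, (low, high) in niche.items():
        value = conditions.get(dimension)
        if value is None or not (low <= value <= high):
            return False
    return True


# Hypothetical niche over three of the gradients named on the slide
niche = {
    "slope_degrees": (0.0, 15.0),
    "sunlight_hours_per_day": (6.0, 12.0),
    "salinity_ppt": (0.0, 2.0),
}

gentle_sunny_site = {
    "slope_degrees": 5.0,
    "sunlight_hours_per_day": 8.0,
    "salinity_ppt": 0.5,
}
steep_site = {
    "slope_degrees": 30.0,
    "sunlight_hours_per_day": 8.0,
    "salinity_ppt": 0.5,
}

print(in_niche(niche, gentle_sunny_site))
print(in_niche(niche, steep_site))
```

A query such as "retrieve all data on organisms sampled from hot and dry environments" has the same shape: a set of ranges over annotated environmental dimensions, tested against each sample.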