How to build cross-species interoperable ontologies
Chris Mungall, LBNL
Melissa Haendel, OHSU

The challenge
• There are many fun and interesting issues involved in building and using cross-species ontologies
  – homology
  – evo-devo
  – reasoning using ontologies
  – connecting genomics databases to phenotypes
• But unfortunately, there are many more prosaic issues with unsatisfying solutions
  – multiple ontologies already exist
  – limited cooperation between the developers of these ontologies
  – they differ widely in every aspect imaginable
  – they are heavily embedded in existing databases and applications and slow to change
  – tools and infrastructure support falls short of what we need
• FORTUNATELY, solutions are emerging

Outline
• Anatomy Ontologies: Background
• Case studies
  – GO: A unified cross-species ontology
  – CL: Cell Ontology: Unifying multiple existing efforts
• Building interoperable gross anatomy ontologies
  – (Melissa)

Ontologies
• Computable qualitative representations of some part of the world
• Relationships with computable properties
  – e.g. transitivity
  – languages and formats like OWL and OBO have a formal semantics
• Entities are grouped into classes
• Relationships are statements about all the members of a class
  – the most common form is the all-some statement

Ontologies are not smart
• Deductive logic is not flexible
• Example
  – Human knowledge:
    • chromosomes are found in the nucleus
  – Naïve ontology encoding:
    • every chromosome part_of some nucleus
  – But this is wrong
    • Ontologies don’t make exceptions!
  – Solution:
    • (1) create location-specific subclasses
      – nuclear chromosome
      – mitochondrial chromosome
    • (2) invert the statement: every nucleus has_part some chromosome

Existing Anatomy Ontologies
• Human AOs
• Model Organism AOs
• Domain-specific AOs
• Cross-species AOs

FMA: Foundational Model of Anatomy
• Domain: adult human
  – no develops_from relationships, few embryonic structures
• Size: large (70k+ classes)
• Language: frames
• Approach
  – Formal, strict single inheritance, purely structural perspective
  – No computable definitions
  – Heavily pre-coordinated
    • “Trunk of communicating branch of zygomatic branch of right facial nerve with zygomaticofacial branch of right zygomatic nerve”
    • “Distal epiphysis of distal phalanx of right little toe”
  – Extensive spatial relationships in selected areas
    • e.g. veins, arteries
• Uses
  – not designed for one particular use

FMA Example
/ FMA:62955 ! Anatomical entity
 is_a FMA:61775 ! Physical anatomical entity
  is_a FMA:67165 ! Material anatomical entity
   is_a FMA:67135 ! Anatomical structure
    is_a FMA:67498 ! Organ
     is_a FMA:55670 ! Solid organ
      is_a FMA:55661 ! Parenchymatous organ
       is_a FMA:55662 ! Lobular organ
        is_a FMA:13889 ! Pituitary gland
        is_a FMA:20020 ! Vestibular gland
        is_a FMA:55533 ! Accessory thyroid gland
        is_a FMA:58090 ! Areolar gland
        is_a FMA:59101 ! Lacrimal gland
        is_a FMA:62088 ! Lactiferous gland
        is_a FMA:7195 ! Lung
        is_a FMA:7197 ! Liver
        is_a FMA:7198 ! Pancreas
        is_a FMA:7210 ! Testis
        is_a FMA:76835 ! Accessory pancreas
        is_a FMA:9597 ! Salivary gland
        is_a FMA:9599 ! Bulbo-urethral gland
        is_a FMA:9600 ! Prostate
        is_a FMA:9603 ! Thyroid gland

Model Organism Anatomy Ontologies
• Typically species-centric
  – FBbt: Drosophila melanogaster
  – WBbt: C. elegans
  – ZFA: Danio rerio
  – XAO: Xenopus
  – MA: Adult Mouse (no develops_from)
  – EMAP/EMAPA: developing mouse
• Uses
  – primarily gene expression, also phenotype description
  – others: Virtual Fly Brain, Phenoscape
• Approach:
  – use-case driven
  – practicality over formality
  – No computable definitions
    • (exception: FBbt)

Other anatomy ontologies
• Developing human
  – EHDAA2
• Vectors
  – TGMA: mosquito
  – TADS: tick
• Upper ontologies
  – CARO
  – AEO
• Domain-specific anatomy ontologies
  – NIF_Anatomy, NIF_Cell: neuroscience
• Phylogenetic or multi-taxon AOs
  – HAO: Hymenoptera
  – PO: plant
  – TAO: teleost
  – AAO: amphibian
  – SPD: spider
  – …
  – we will return to these later

Problem
• These AOs are not developed in a coordinated fashion
  – use of a shared upper ontology does not buy us much
  – even the 3 mammalian AOs are massively different
• Data annotated using these ontologies effectively becomes siloed
• There is redundancy of effort in areas of shared biology
• Are there lessons from existing ontologies?

Building ontologies that are interoperable across species
• Case Studies
  – GO
  – Cell Ontology

Gene Ontology
• Covers all kingdoms of life
  – viruses, bacteria, archaea
  – fungi, metazoans, plants
• Covers biology at different scales
• Issues
  – terminological confusion (e.g. “blood”)
  – large, difficult to maintain

How does GO deal with taxonomic variation?
• What GO says:
  – every nucleus is part_of some cell
• What GO does not say:
  – every cell has_part some nucleus
    • wrong for bacteria (and mammalian erythrocytes)
• Take home:
  – Logical quantifiers are essential to understanding the ontology
  – Saying what something is part of is safer than saying what its parts are

Principle: avoidance of taxonomic differentia
• Not in GO:
  – vertebrate eye development
  – insect eye development
  – cephalopod eye development
• In GO (no implication of homology):
  – eye development
    • camera-type eye development
    • compound eye development
• Exceptions for usability:
  – cell wall
    • fungal-type cell wall [differentia: cross-linked glycoproteins and carbohydrates, chitin / beta-glucan …]
    • plant-type cell wall [differentia: cellulose, pectin, …]

The problem of vagueness in GO
• “limb development”
• “wing development”

Adding taxonomic constraints to GO
• GO now includes two additional relations
  – only_in_taxon
  – never_in_taxon
  – See:
    • Kusnierczyk, W: Taxonomy-based partitioning of the Gene Ontology, JBI 2008
    • Deegan et al: Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development, BMC Bioinformatics 2010

Examples
• lactation only_in_taxon Mammalia (NCBITaxon:40674)
  – OWL: lactation in_taxon only Mammalia
• odontogenesis never_in_taxon Aves (NCBITaxon:8782)
  – OWL: odontogenesis in_taxon only (not Aves)
• chloroplast only_in_taxon (Viridiplantae or Euglenozoa) (NCBITaxon:33090 or NCBITaxon:33682)

Uses of taxon relationships
1. Clarifying meaning of GO terms
2. Detection of errors in electronic and manual annotation
   • Automated reasoners
   • GO previously had chicken genes involved in lactation, slime mold genes involved in fin regeneration…
3. Providing views over GO
   • e.g. subset of GO excluding terms that are never in Drosophila

Scalability of single-ontology approach: GO
• How does GO cope with wide taxonomic diversity?
  – conservation at molecular level, wide diversity of phenotypes at level of gross anatomical development, physiology, and organismal behavior
• GO Development
  – Focused on model systems
    • “beak development” added only recently
• GO Behavior
  – Very broad coverage
  – Some specific terms, e.g. Drosophila courtship

Proposal: outsource portions of the ontology

Ontology Views
• Ontologies, traditional
  – independent standalone resources
• Ontologies, new
  – interconnected resources
  – multiple views possible
    • Subsetting
    • Aggregation
    • Subsetting + Aggregation
  – views can be manually specified (e.g. GO slims) or automatically constructed
  – Limited re-writing possible
    • e.g. names

Views
[Figure: types of view: “slim” subset, scattered subset, domain/taxon-specific cut, aggregate, aggregate+subset]

Subset of GO
[Figure: vertebrate subset]

Outline
• Case studies
  – GO: A unified cross-species ontology
  – Cell Ontology: Unifying multiple existing efforts
• Gross Anatomy

Cell types
• GO-Cell Component
  – cell parts
• CL
  – cell ontology
• Anatomical Ontologies
  – Includes cell types:
    • FBbt (Drosophila)
    • WBbt (C. elegans)
    • ZFA, TAO (Danio rerio, Teleost)
    • FMA (Human)
    • PO (Plant)
    • FAO (Fungi)
  – Excludes cell types:
    • MA (adult mouse)
    • EMAPA (developing mouse)
    • EHDAA2 (developing human)

Overlap (simplified view)
[Figure: overlapping coverage of CL, ZFA, NIF cell, PO, FMA and MA; sample terms shown: brain, plant spore, alveolar macrophage, neuron, lung]

The Problem
• Duplicated work
• No unified view
• Confusion for users
• Confusion for annotators

Alternative proposals
1. LUMP: Combine into one monolithic CL ontology
2. SPLIT: Taxon-specific cell types in taxon-centric ontologies
   a) Obsolete generic cell types currently in tcAOs
   -vs-
   b) Taxon-specific subclasses of generic cell types

LUMP
[Figure: a single CL covering all cells (fish, plants, human, mouse); sample terms shown: plant spore, alveolar macrophage, neuron]

Lumping proposal
• Advantages:
  – one-stop shopping for CL
    • (but this can be done with aggregate views)
• Disadvantages
  – tcAO IDs well-established
  – Little advantage to lumping plant cells with animal cells
  – Harder to manage editorially
  – Cross-granular relationships

(Partial) Splitting proposal
• Advantages:
  – Easier to manage
  – Sensible subdivision of labor:
    • Common cell types in shared common cell ontology
      – e.g. shared definition of “neuron”
    • Taxon-specific subtypes in taxon-centric ontologies
• Disadvantages
  – Aggregate view is problematic
    • union of ontologies contains multiple classes labeled “neuron”
  – Can be solved by obsoleting existing generic cell classes in tcAOs and replacing by CL IDs
    • problem: cross-granular relationships

Current solution for CL: split and retain IDs
• Any cell type shared by two model taxa should be in CL
• tcAOs retain both generic and specific cell type classes
  – Formally connected to CL via subclass relationships
    • or even stronger: taxon-specific equivalent

Example aggregate view
[Figure: CL-metazoa aggregate of FMA, CL and FBbt; nodes shown: cell, muscle cell, muscle organ, frontal pulsatile organ muscle; edges labelled i (is_a) and p (part_of)]

Example aggregate+subset view
[Figure: CL-metazoa aggregate+subset of FMA, CL and FBbt; nodes shown: cell, muscle cell, frontal pulsatile organ muscle; edges labelled i (is_a)]

Who maintains the connections and how?
• How:
  – maintained as xrefs for convenience
• Who:
  – either tcAO or CL
• Synchronization?
  – hard
  – reasoning over aggregate view

Who maintains the connections?

cl.obo (cl’s responsibility)
[Term]
id: CL:0000584
name: enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:mhl]
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell

zfa.obo (zfa’s responsibility)
[Term]
id: ZFA:0009269
name: enterocyte
namespace: zebrafish_anatomy
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:curator]
synonym: "enterocytes" EXACT PLURAL []
xref: CL:0000584
xref: TAO:0009269
xref: ZFIN:ZDB-ANAT-070308-209
is_a: ZFA:0009143 ! brush border epithelial cell
relationship: end ZFS:0000044 ! Adult
relationship: part_of ZFA:0005124 ! intestinal epithelium
relationship: start ZFS:0000000 ! Unknown

Issues with aggregate view
• lattices = hairballs
• duplicate names
[Figure: aggregate of FMA, CL and FBbt in which “cell” and “muscle cell” each appear several times, above frontal pulsatile organ muscle; edges labelled i (is_a)]

Duplicate names
• Searching for “muscle cell” returns
  – CL:0000187 ! muscle cell
  – FBbt:00005074 ! muscle cell
  – FMA:67328 ! muscle cell
  – ZFA:0009114 ! muscle cell
  – NIF_Cell:sao519252327 ! Muscle Cell
• Proposed solutions
  1. rename in source ontology
     • yuck
  2. make end-user applications smarter
     • not practical for n applications
  3. auto-rename in ontology view
     • best solution

Aggregate view

cl-metazoa.obo
[Term]
id: CL:0000584
name: enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:mhl]
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell

[Term]
id: ZFA:0009269
name: zebrafish enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:curator]
synonym: "enterocytes" EXACT PLURAL []
xref: CL:0000584
xref: TAO:0009269
xref: ZFIN:ZDB-ANAT-070308-209
is_a: CL:0000584 ! enterocyte
is_a: ZFA:0009143 ! brush border epithelial cell
relationship: end ZFS:0000044 ! Adult
relationship: part_of ZFA:0005124 ! intestinal epithelium
relationship: start ZFS:0000000 ! Unknown

Notes on the aggregate view
• name: rewritten name (or syn, TBD)
• is_a: CL:0000584 is generated from the xref
• the two is_a parents make the result a lattice
• FMA class not shown, but it would also subclass

Summary: taxon variation in CL
• Current solution is a compromise
  – Constraints
    • integrate with pre-existing tcAO ontologies
    • these ontologies have links to gross anatomy
  – tcAOs loosely integrated with CL
  – plant cell types should be left to PO
  – Synchronization remains a challenge

Lessons for gross anatomy
[Figure: layered anatomy ontologies by taxon, joined by import links and sample cross-ontology links. Levels shown: caro / all, metazoa, mollusca, cephalopod, arthropoda, drosophila, vertebrata, teleost, amphibia, mammalia, mouse, zebrafish, human. Terms shown: cell, tissue, nervous system, gut, circulatory system, gland, gonad, mesoderm, respiratory airway, muscle tissue, larva, skeletal tissue, skeleton, appendage, mantle, shell, foot, tentacle, brachial lobe, mushroom body, neuron types XYZ, cuticle, antenna, trachea, limb, vertebra, bone, fin, tibia, vertebral column, parietal bone, mesonephros, weberian ossicle, tibiafibula, mammary gland, pons (marked NO)]

Conclusions
• Historically anatomy ontologies have been developed by different groups largely in isolation
• The Phenotype RCN should coordinate these efforts
• Dynamic Views
• Explicit taxonomic relationships
• (end; Melissa here)

Idealized model (M0)
• A single ontology for ontology editors and consumers
• Different editors have editing rights to different ontology partitions
  – by taxon
  – by domain (e.g. neuroscience, skeletal anatomy)
• No taxon-specific subtypes
  – use structure, function etc. as differentia
• Users obtain dynamic views according to their needs

Example M0
[Figure: one ontology with a user/editor view and overlapping mollusc, neuro, mammalian and skeletal views, with a small sample of links. Terms shown: ventral nerve cord, cell, gut, larva, mollusc foot, appendage, respiratory airway, nervous system, mantle, tissue, mesoderm, circulatory system, gonad, gland, muscle tissue, trachea, limb, vertebra, vertebral column, pons, mushroom body, metencephalon, mesonephros, mammary gland, antenna, tentacle, brachial lobe, fin, pupal DN3 period neuron, weberian ossicle, tibiafibula, skeletal tissue, bone, tibia, parietal bone]

Slightly less idealized model (M1)
• Maintain series of ontologies at different taxonomic levels
  – euk, plant, metazoan, vertebrate, mollusc, arthropod, insect, mammal, human, drosophila
• Each ontology imports/MIREOTs relevant subset of ontology “above” it
  – this is recursive
• Subtypes are only introduced as needed
• Work together on commonalities at appropriate level above your ontology

Example M1
[Figure: the same layered diagram as under “Lessons for gross anatomy”: ontologies at each taxonomic level, joined by import links and sample cross-ontology links]

Objections to M1
• Biological
  – homology vs analogy
  – functional grouping classes
    • e.g. respiratory airway, eye
• Practical
  – tools
  – what about existing AOs?
    • new AOs should be designed for integration from the ground up

Protocol for new AOs
1. Collect draft list of terms
2. Subdivide roughly into applicability at taxonomic levels
3. Request new terms from existing AOs above you
4. Is a new mid-level AO required?
   • yes: collaborate and create, go to 1.
5. Import subset from next AO above
6. Build your ontology

Example: the octopus ontology
• Collect and subdivide terms:
  – cephalopod: tentacle, brachial lobe, subesophageal mass, beak, visceropericardial coelom, swim bladder
  – mollusc: mantle
  – metazoan: nervous system, muscle tissue
• Mollusc anatomy ontology does not exist
  – either: (i) find collaborators and create one
  – or: (ii) keep mollusc terms in your ontology for now, but mark them as possibly migrating upwards
• Import terms from the mollusc AO in case (i), or from the metazoan AO in case (ii), where there is no mollusc AO

How are things organized now?
• 3 examples:
  – PO
  – TAO/ZFA
  – Uberon
• In Melissa’s talk

Some AOs are cross-granular
[Figure: FMA fragment spanning subcellular, cell, and tissue and gross anatomy levels; nodes shown: cell, muscle cell, muscle organ, muscle cell protoplasm; edges labelled i (is_a) and p (part_of)]

Cross-granular relationships
[Figure: FMA fragment (cell, muscle cell, muscle organ) shown first alone and then alongside CL (cell, muscle cell); edges labelled i (is_a) and p (part_of)]

Obsoleting generic classes in tcAOs
[Figure: FMA fragment (cell, muscle cell, muscle organ) alongside CL (cell, muscle cell); edges labelled i (is_a) and p (part_of)]

Migrating cross-granular relationships
[Figure: FMA fragment (cell, muscle cell, muscle organ) alongside CL (cell, muscle cell); edges labelled i (is_a) and p (part_of)]

“true path” violations
[Figure: FMA fragment (cell, muscle cell, muscle organ), CL (cell, muscle cell) and FBbt (frontal pulsatile organ muscle); edges labelled i (is_a) and p (part_of)]

fix
[Figure: same FMA, CL and FBbt fragment, with the FMA muscle cell defined as: muscle cell AND part of some human]

PO: Plants
• Single unified ontology for all plants
  – cell types and gross anatomy
• Generalized from ontology of flowering plants

TAO and ZFA
• Teleost and Zebrafish

Uberon
• Designed to unify existing tcAOs
• Uses modern ontology development techniques
  – heavily axiomatized = less work for humans, leave it to reasoners
    • automated QC
    • automated classification
• Current size: 5k+ classes
• Multiple relationship types
• Links to and from GO, CL
• Aggregate views possible using xrefs maintained in Uberon

Uberon lessons
• Original Design Goals
  – Unify metazoan tcAOs for cross-species phenotype queries
  – Seed initial version from text matching
• Was this a good idea?
  – metazoans are fairly diverse
    • many original dubious grouping classes have been eliminated or split
    • functional grouping classes remain
    • tissues, germ layers, etc. are less controversial
    • Uberon is really a vertebrate AO in which we’ve added placeholder metazoan terms
  – labels are misleading
    • high false-positive and false-negative rates from text matching
    • starting from textbook comparative anatomy knowledge would have been better (given time)
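
Worked sketch: checking annotations against taxon constraints
The only_in_taxon and never_in_taxon relations from the GO case study can be checked mechanically against annotations. The Python sketch below illustrates the idea using the two constraints quoted in the examples (lactation, odontogenesis). It is a minimal sketch under stated assumptions, not GO's actual tooling: the lineage table is a hand-written toy fragment rather than the NCBI Taxonomy, the gene names are hypothetical, and constraints are not propagated to subclasses of a term as a reasoner would do.

```python
# Toy is_a lineage (child -> parent). ASSUMPTION: hand-written fragment for
# illustration only; a real check would load lineages from the NCBI Taxonomy.
PARENT = {
    "Gallus gallus": "Aves",
    "Mus musculus": "Mammalia",
    "Aves": "Vertebrata",
    "Mammalia": "Vertebrata",
    "Vertebrata": "Metazoa",
}

# Constraints quoted on the "Examples" slide.
ONLY_IN = {"lactation": {"Mammalia"}}
NEVER_IN = {"odontogenesis": {"Aves"}}

# Hypothetical annotations: (gene, GO term, species). Gene names are made up.
ANNOTATIONS = [
    ("geneA", "lactation", "Gallus gallus"),
    ("geneB", "lactation", "Mus musculus"),
    ("geneC", "odontogenesis", "Gallus gallus"),
    ("geneD", "odontogenesis", "Mus musculus"),
]


def ancestors(taxon):
    """Return the taxon itself plus all of its ancestors in the toy lineage."""
    seen = set()
    while taxon is not None and taxon not in seen:
        seen.add(taxon)
        taxon = PARENT.get(taxon)
    return seen


def violations(annotations):
    """Yield (gene, term, species, relation) for each violated constraint."""
    for gene, term, species in annotations:
        lineage = ancestors(species)
        allowed = ONLY_IN.get(term)
        # only_in_taxon: the species must fall under one of the allowed taxa.
        if allowed is not None and not (lineage & allowed):
            yield gene, term, species, "only_in_taxon"
        # never_in_taxon: the species must not fall under any excluded taxon.
        if lineage & NEVER_IN.get(term, set()):
            yield gene, term, species, "never_in_taxon"


if __name__ == "__main__":
    for row in violations(ANNOTATIONS):
        print(row)
```

With this toy data the chicken lactation annotation and the chicken odontogenesis annotation are reported, while the mouse annotations pass.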