Building the Ontology Landscape for Cancer Big Data Research
Barry Smith
May 12, 2015

Addressing cancer big data challenges
Session 1: through imaging ontologies (BS)
Session 2: by capturing metadata for data integration and analysis (Chris Stoeckert)
Session 3: through the Ontology of Disease (Lynn Schriml and Lindsay Cowell)
Public Session: Cancer Big Data to Knowledge (BS)

National Center for Biomedical Ontology (NCBO)
NIH Roadmap Center 2005-2015
Gene Ontology
Semantic Web
NCBO

Old biology data

New biology data
[Slide filled with raw amino-acid sequence data, beginning MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSF …]

How to do biology across the genome?
[Slide filled with the same kind of raw sequence data]

how to link the kinds of phenomena represented here

to data like this?
[Slide filled with the same kind of raw sequence data]

Answer
Tag the data with meaningful labels which together form an ontology
~ Semantic enhancement
An ontology is a controlled structured vocabulary to support annotation of
data

Questions
• How to build an ontology?
• How to bring it about that all scientists in each domain use the same ontology to annotate their data?
• How to bring it about that scientists in neighboring domains use ontologies that are interoperable?

By far the most successful: GO (Gene Ontology)

GO provides a controlled vocabulary of terms for use in annotating (describing, tagging) data
• multi-species, multi-disciplinary, open source
• built by biologists, maintained and improved by biologists
• contributes to the cumulativity of scientific results obtained by distinct research communities

International System of Units (SI)

Gene products involved in cardiac muscle development in humans

Prerequisites for ontology success
• Aggressive use in tagging data across multiple communities
• Feedback cycle between ontology editors and ontology users to ensure continuous update
• Logically and biologically coherent definitions
– logical = to allow computational reasoning and quality assurance
– biological = to ensure consistency between ontologies

GO is amazingly successful but it covers only generic biological entities of three sorts:
– cellular components
– molecular functions
– biological processes
and it does not provide representations of diseases, symptoms, anatomy, pathways, experiments …

Ontology success stories, and some reasons for failure
• So people started building the needed extra ontologies more or less at random

Definition: Reaching a decision through the application of an algorithm designed to weigh the different factors involved.
• Confuses an algorithm with an act of reaching a decision
• Defines ‘algorithm’ as a special kind of application of an algorithm. (This is worse than circular.)
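The circularity problem in the definition above is the kind of defect a simple automated quality check can catch. The following is a minimal sketch, not part of the original slides: the function name `is_circular` is hypothetical, and the second definition (for "cell") is an illustrative example added here for contrast.

```python
import re


def is_circular(label: str, definition: str) -> bool:
    """Return True if the definition uses the very term it is meant to define."""
    # Match the label as a whole word, allowing a simple plural "s".
    pattern = r"\b" + re.escape(label.lower()) + r"s?\b"
    return re.search(pattern, definition.lower()) is not None


# Definition quoted on the slide above, for the term 'algorithm'.
algorithm_definition = (
    "Reaching a decision through the application of an algorithm "
    "designed to weigh the different factors involved."
)

# Illustrative (assumed) non-circular definition, used only for contrast.
cell_definition = "The basic structural and functional unit of all organisms."

print(is_circular("algorithm", algorithm_definition))  # True
print(is_circular("cell", cell_definition))            # False
```

A check like this only detects the crudest form of circularity. It would not catch the category mistake also noted on the slide, where an algorithm is confused with an act of reaching a decision.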
John Fox (Director, OpenClinical)
As a user and teacher of ontological methods in medicine and engineering I have for years warned my students that the design of domain ontologies is a black art with no theoretical foundations and few practical principles.

Ontology success stories, and some reasons for failure
• Linked Open Data, from MusicBrainz to Mouse Genome Informatics

What are the criteria of success for ontologies in supporting reasoning over Big Data?
1. logically and biologically correct subsumption hierarchies
– correct: Beta cell is_a cell
– incorrect: allergy is_a allergy record in Microsoft HealthVault

John Fox, again
As a user and teacher of ontological methods in medicine and engineering I have for years warned my students that the design of domain ontologies is a black art with no theoretical foundations and few practical principles. … I now have a much more positive story for my students. … In the journey from black art to a truly scientific theory for ontology design this book is an important milestone.
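The first criterion above (correct subsumption, as in Beta cell is_a cell) matters because queries over tagged data rely on it. The sketch below shows the idea under stated assumptions: the toy hierarchy, the intermediate term "endocrine cell", and the record identifiers are all illustrative and not taken from any real ontology release.

```python
def ancestors(term, is_a):
    """Collect every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def records_tagged_with(term, annotations, is_a):
    """Return records tagged with `term` or with any subtype of `term`."""
    return sorted(
        record
        for record, tag in annotations.items()
        if tag == term or term in ancestors(tag, is_a)
    )


# Toy is_a hierarchy (assumed for illustration).
IS_A = {
    "beta cell": {"endocrine cell"},
    "endocrine cell": {"cell"},
    "neuron": {"cell"},
}

# Hypothetical data records, each tagged with one ontology term.
ANNOTATIONS = {
    "sample-1": "beta cell",
    "sample-2": "neuron",
    "sample-3": "cell",
}

print(records_tagged_with("cell", ANNOTATIONS, IS_A))
# ['sample-1', 'sample-2', 'sample-3']
print(records_tagged_with("endocrine cell", ANNOTATIONS, IS_A))
# ['sample-1']
```

If the hierarchy contained a link such as allergy is_a allergy record, the same query mechanism would return patients' allergies when asked for records, which is why the is_a links have to be correct before the reasoning is of any use.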
Original OBO Foundry ontologies (Gene Ontology in yellow)
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RNAO, PRO)
Dependent continuant: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
Occurrent: Biological Process (GO); Molecular Process (GO)

http://obofoundry.org
– CHEBI: Chemical Entities of Biological Interest
– CL: Cell Ontology
– GO: Gene Ontology
– OBI: Ontology for Biomedical Investigations
– PATO: Phenotypic Quality Ontology
– PO: Plant Ontology
– PRO: Protein Ontology
– XAO: Xenopus Anatomy Ontology
– ZFA: Zebrafish Anatomy Ontology

Extension Strategy + Modular Organization
top level: Basic Formal Ontology (BFO)
mid-level: INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
domain level: Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO); Protein Ontology (PRO); Disease Ontology (OGMS, IDO, HDO, HPO); Phenotypic Quality Ontology (PATO); Biological Process Ontology (GO); Molecular Function Ontology (GO)

Example: The Cell Ontology

rationale of OBO Foundry coverage
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RNAO, PRO)
Dependent continuant: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
Occurrent: Organism-Level Process (GO); Cellular Process (GO); Molecular Process (GO)

Environment Ontology (ENVO)
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RNAO, PRO)
Dependent continuant: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
Occurrent: Biological Process (GO); Molecular Process (GO)
Environments: Environment Ontology (ENVO)

examples of OBO Foundry approach extended into other domains
NIF Standard: Neuroscience Information Framework
IDO Consortium: Infectious Disease Ontology Suite
cROP: Common Reference Ontologies for Plants
UNEP Ontology Framework: United Nations Environment Program Ontologies

Common Reference Ontologies for Plants (cROP)

The second important criterion of ontology success in supporting reasoning over Big Data is:
keeping track of provenance = recording how data was generated and processed in a way external users can understand, to enhance
• combinability
• reproducibility

Recognizing a new family of protocol-driven processes (investigation, assay, …)
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RNAO, PRO); Environment Ontology (ENVO)
Dependent continuant: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
Occurrent: Biological Process (GO); Molecular Process (GO); Ontology for Biomedical Investigations (OBI)

Extension Strategy + Modular Organization
top level: Basic Formal Ontology (BFO)
mid-level: INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
domain level: Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO); Protein Ontology (PRO); Disease Ontology (OGMS, IDO, HDO, HPO); Phenotypic Quality Ontology (PATO); Molecular Function Ontology (GO); Biological Process; Protocol-driven process (OBI)

The Ontology for Biomedical Investigations
Structure of a typical investigation as viewed by OBI (from http://obi-ontology.org/page/Investigation)

Recognizing a new family of information entities: data, publications, images, algorithms …
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT, INFORMATION ARTIFACT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RNAO, PRO); Environment Ontology (ENVO)
Dependent continuant: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
Information artifact (IAO): Software, Algorithms, …; Sequence Data, EHR Data …; Images, Image Data, …; OBI: Flow Cytometry Imaging Data, …
Occurrent: Biological Process (GO); Molecular Process (GO); OBI

Extension Strategy + Modular Organization
top level: Basic Formal Ontology (BFO)
mid-level: INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); INFORMATION ARTIFACT (~DATA); OCCURRENT (~PROCESS)
domain level: Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO); Protein Ontology (PRO); Disease Ontology (OGMS, IDO, HDO, HPO); Phenotypic Quality Ontology (PATO); Molecular Function Ontology (GO); Data; Biological Process; Assays

Even here, things are not as bad as they seem

http://purl.obolibrary.org/obo/IAO_0000064: algorithm

IAO = Information Artifact Ontology: https://code.google.com/p/information-artifact-ontology/

http://bioportal.bioontology.org/ontologies/IAO

A list of ontologies using IAO
Adverse Event Reporting Ontology (AERO); Bioinformatics Web Service Ontology; Biological Collections Ontology (BCO); Chemical Methods Ontology (CHMO); Cognitive Paradigm Ontology (COGPO); Comparative Data Analysis Ontology; Computational Neuroscience Ontology; Core Clinical Protocol Ontology (C2PO); Document Act Ontology; Eagle-I Research Resource Ontology (ERO); The Email Ontology; Emotion Ontology (MFOEM); Experimental Factor Ontology (EFO); Exposé Ontology; IAO-Intel; Infectious Disease Ontology (IDO); Influenza Research Database (IRD); Information Entity Ontology; Mental Functioning Ontology (MF); Ontology for
Biomedical Investigations; Ontology for Drug Discovery Investigations; Ontology for General Medical Science (OGMS); Ontology for Newborn Screening Followup and Translational Research (ONSTR); Ontology of Clinical Research (OCRE); Ontology of Data Mining (OntoDM); Ontology of Medically Related Social Entities (OMRSE); Ontology of Vaccine Adverse Events; Oral Health and Disease Ontology (OHDO); Population and Community Ontology; Proper Name Ontology; Semanticscience Integrated Ontology; Software Ontology (SWO); Translational Medicine Ontology (TMO); Twitter Ontology; Vaccine Ontology (VO)

aboutness
top level: Basic Formal Ontology (BFO)
mid-level: INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
domain level: Patient Demographics; Phenotype (Disease, …); Disease processes; Anatomy; Histology; Chemistry; Genotype (GO); Biological processes (GO)
IAO: Data about all of these things including image data …, algorithms, software, protocols, …
OBI: Instruments, Biomaterials, Functions, Parameters, Assay types, Statistics …

biomedical imaging ontology
[Same diagram as the previous slide]

The third important criterion of ontology success in supporting reasoning over Big Data is:
use the framework of modular, general-purpose reference ontologies as starting points for creating families of purpose-specific application ontologies in ever widening circles (scalability)

BFO
Ontology for General Medical Science (OGMS)
Cardiovascular Disease Ontology; Genetic Disease Ontology; Cancer Disease Ontology; Immune Disease Ontology; Environmental Disease Ontology; Oral Disease Ontology; Infectious Disease Ontology
IDO Staph Aureus
IDO MRSA
IDO Australian MRSA
IDO Australian Hospital MRSA
…

Problems with: Denys-Drash syndrome is_a rare nonneoplastic disorder
1. Denys-Drash syndrome involves nephroblastoma and is therefore neoplastic
2. X is_a rare Y does not track biology

What are the criteria of success for ontologies in supporting reasoning over Big Data?
– correct: Beta cell is_a cell
– incorrect: rare disease is_a disease
If the ontology hierarchy is to support biologically useful reasoning it must track biology
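The Denys-Drash problem above can be found mechanically once the relevant biology is itself recorded in the ontology. The sketch below is a toy illustration under stated assumptions: the `involves` relation, the link from "rare nonneoplastic disorder" to "nonneoplastic disorder", and the function names are hypothetical and are not drawn from any published ontology.

```python
def ancestors(term, is_a):
    """Collect every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def neoplasm_conflicts(is_a, involves):
    """Find disorders classified as nonneoplastic that involve a neoplasm."""
    found = []
    for disorder, parts in involves.items():
        if "nonneoplastic disorder" not in ancestors(disorder, is_a):
            continue
        for part in parts:
            if part == "neoplasm" or "neoplasm" in ancestors(part, is_a):
                found.append((disorder, part))
    return sorted(found)


# Toy hierarchy: the first link is the assertion criticized on the slide,
# the second is an assumed parent added for illustration.
IS_A = {
    "Denys-Drash syndrome": {"rare nonneoplastic disorder"},
    "rare nonneoplastic disorder": {"nonneoplastic disorder"},
    "nephroblastoma": {"neoplasm"},
}

# Hypothetical relation recording what a disorder involves.
INVOLVES = {"Denys-Drash syndrome": {"nephroblastoma"}}

print(neoplasm_conflicts(IS_A, INVOLVES))
# [('Denys-Drash syndrome', 'nephroblastoma')]
```

The check works only because nephroblastoma is_a neoplasm tracks biology. A parent such as "rare Y" contributes nothing that a reasoner could use in this way, which is the second problem listed on the slide.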