Building the Ontology Landscape
for Cancer Big Data Research
Barry Smith
May 12, 2015
Addressing cancer big data
challenges
Session 1: through imaging ontologies (BS)
Session 2: by capturing metadata for data
integration and analysis (Chris Stoeckert)
Session 3: through the Ontology of Disease (Lynn
Schriml and Lindsay Cowell)
Public Session: Cancer Big Data to Knowledge (BS)
2
National Center for Biomedical
Ontology (NCBO)
NIH Roadmap Center 2005-2015
Gene Ontology
Semantic Web
NCBO
3
Old biology data
4
New biology data
MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSF
YEDEKSGLIKVVKFRTGAMDRKRSFEKVVISVMVGKNVKKFLTFV
EDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLF
YLNRGYYNELSFRVLERCHEIASARPNDSSTMRTFTDFVSGAPIV
RSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDT
ERLKRDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNF
GAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCKLRSPNTPRRL
RKTLDAVKALLVSSCACTARDLDIFDDNNGVAMWKWIKILYHEVA
QETTLKDSYRITLVPSSDGISLLAFAGPQRNVYVDDTTRRIQLYTD
YNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFN
HDPWMDVVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYAT
FRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRFETDLYES
ATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQ
WLGLESDYHCSFSSTRNAEDVDISRIVLYSYMFLNTAKGCLVEYA
TFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRFETDLYE
SATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWI
5
How to do biology across the genome?
MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVIS
VMVGKNVKKFLTFVEDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLER
CHEIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERL
KRDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNFGAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVC
KLRSPNTPRRLRKTLDAVKALLVSSCACTARDLDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGIS
LLAFAGPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFNHDPWM
DVVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSR
FETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVM
KVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVISV
MVGKNVKKFLTFVEDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLERC
HEIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERLK
RDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNFGAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCK
LRSPNTPRRLRKTLDAVKALLVSSCACTARDLDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLL
AFAGPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFNHDPWMD
VVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRF
ETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVMK
VSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVISVM
VGKNVKKFLTFVEDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLERCH
EIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERLKR
DLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNFGAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCKL
RSPNTPRRLRKTLDAVKALLVSSCACTARDLDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLL
AFAGPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFNHDPWMD
VVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRF
ETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVMK
VSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVISVM
VGKNVKKFLTFVEDEPDFQGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLERCH
EIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERLKR
6
how to link the kinds of phenomena
represented here
7
to data like this?
MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYEDEKSGLIKVVKFRTGAMDRK
RSFEKVVISVMVGKNVKKFLTFVEDEPDFQGGPIPSKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSL
FYLNRGYYNELSFRVLERCHEIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLL
HVDELSIFSAYQASLPGEKKVDTERLKRDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNF
GAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCKLRSPNTPRRLRKTLDAVKALLVSSCACTARDLD
IFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLLAFAGPQRNVYVDDTTRRIQLYTDY
NKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALGNSYDAFNHDPWMDVVGFEDPNQVTNRDIS
RIVLYSYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGRHCVGSRFETDLYESA
TSELMANHSVQTGRNIYGVDSFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVV
AGEAASSNHHQKISRVTRKRPREPKSTNDILVAGQKLFGSSFEFRDLHQLRLCYEIYMADTPSVAVQA
PPGYGKTELFHLPLIALASKGDVEYVSFLFVPYTVLLANCMIRLGRRGCLNVAPVRNFIEEGYDGVTDL
YVGIYDDLASTNFTDRIAAWENIVECTFRTNNVKLGYLIVDEFHNFETEVYRQSQFGGITNLDFDAFEK
AIFLSGTAPEAVADAALQRIGLTGLAKKSMDINELKRSEDLSRGLSSYPTRMFNLIKEKSEVPLGHVHKI
RKKVESQPEEALKLLLALFESEPESKAIVVASTTNEVEELACSWRKYFRVVWIHGKLGAAEKVSRTKE
FVTDGSMQVLIGTKLVTEGIDIKQLMMVIMLDNRLNIIELIQGVGRLRDGGLCYLLSRKNSWAARNRKG
ELPPKEGCITEQVREFYGLESKKGKKGQHVGCCGSRTDLSADTVELIERMDRLAEKQATASMSIVAL
PSSFQESNSSDRYRKYCSSDEDSNTCIHGSANASTNASTNAITTASTNVRTNATTNASTNATTNASTN
ASTNATTNASTNATTNSSTNATTTASTNVRTSATTTASINVRTSATTTESTNSSTNATTTESTNSSTNA
TTTESTNSNTSATTTASINVRTSATTTESTNSSTSATTTASINVRTSATTTKSINSSTNATTTESTNSNT
NATTTESTNSSTNATTTESTNSSTNATTTESTNSNTSAATTESTNSNTSATTTESTNASAKEDANKDG
NAEDNRFHPVTDINKESYKRKGSQMVLLERKKLKAQFPNTSENMNVLQFLGFRSDEIKHLFLYGIDIYF
CPEGVFTQYGLCKGCQKMFELCVCWAGQKVSYRRIAWEALAVERMLRNDEEYKEYLEDIEPYHGDP
VGYLKYFSVKRREIYSQIQRNYAWYLAITRRRETISVLDSTRGKQGSQVFRMSGRQIKELYFKVWSNL
8
Answer
Tag the data with meaningful labels which
together form an ontology
~ Semantic enhancement
An ontology is a controlled structured
vocabulary to support annotation of data
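As a minimal sketch of what tagging data with ontology labels can look like in practice, the Python fragment below attaches controlled-vocabulary term identifiers to otherwise opaque sequence records. The record names, the `Record` class and the `annotate` and `find_by_term` helpers are hypothetical, and the pairing of records with terms is invented for illustration; only the GO identifiers and labels are real.

```python
from dataclasses import dataclass, field

# A tiny controlled vocabulary: term ID -> preferred label.
# The IDs are real Gene Ontology identifiers; the selection is illustrative.
VOCABULARY = {
    "GO:0003677": "DNA binding",
    "GO:0005634": "nucleus",
    "GO:0006915": "apoptotic process",
}


@dataclass
class Record:
    """A raw data item (hypothetical) plus the ontology terms it is tagged with."""
    name: str
    sequence: str
    annotations: set = field(default_factory=set)


def annotate(record, term_id):
    """Tag a record with a term, rejecting anything outside the vocabulary."""
    if term_id not in VOCABULARY:
        raise KeyError(f"{term_id} is not a term in the controlled vocabulary")
    record.annotations.add(term_id)


def find_by_term(records, term_id):
    """Return the names of all records tagged with the given term."""
    return [r.name for r in records if term_id in r.annotations]


# Hypothetical records; the sequences are fragments used as placeholders.
records = [
    Record("protein_A", "MKVSDRRKFEKANFDEF"),
    Record("protein_B", "AGEAASSNHHQKISRV"),
]
annotate(records[0], "GO:0003677")
annotate(records[0], "GO:0005634")
annotate(records[1], "GO:0005634")

print(find_by_term(records, "GO:0005634"))  # ['protein_A', 'protein_B']
```

Because free-text labels are rejected, two groups annotating with the same vocabulary produce tags that can be compared directly, which is the point of a controlled vocabulary.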
9
Questions
How to build an ontology?
How to bring it about that all scientists in
each domain use the same ontology to
annotate their data?
How to bring it about that scientists in
neighboring domains use ontologies that
are interoperable?
10
By far the most successful: GO (Gene Ontology)
11
GO provides a controlled vocabulary of terms
for use in annotating (describing, tagging) data
• multi-species, multi-disciplinary, open
source
• built by biologists, maintained and
improved by biologists
• contributes to the cumulativity of scientific
results obtained by distinct research
communities
12
International System of Units (SI)
13
Gene products involved in cardiac muscle
development in humans
14
Prerequisites for ontology success
• Aggressive use in tagging data across multiple
communities
• Feedback cycle between ontology editors and
ontology users to ensure continuous update
• Logically and biologically coherent definitions
– logical = to allow computational reasoning and
quality assurance
– biological = to ensure consistency between
ontologies
15
GO is amazingly successful
but it covers only generic biological entities of
three sorts:
– cellular components
– molecular functions
– biological processes
and it does not provide representations of
diseases, symptoms, anatomy, pathways,
experiments …
16
Ontology success stories, and
some reasons for failure
• So people started building the needed extra ontologies more or less at random
17
Definition: Reaching a decision through the
application of an algorithm designed to
weigh the different factors involved.
27
Definition: Reaching a decision
through the application of an algorithm
designed to weigh the different factors
involved.
Confuses an algorithm with an act of
reaching a decision
Defines ‘algorithm’ as a special kind of
application of an algorithm. (This is
worse than circular.)
28
John Fox (Director, OpenClinical)
As a user and teacher of ontological methods in
medicine and engineering I have for years
warned my students that the design of domain
ontologies is a black art with no theoretical
foundations and few practical principles.
29
Ontology success stories, and
some reasons for failure
• Linked Open Data, from MusicBrainz to Mouse Genome Informatics
30
What are the criteria of success for
ontologies in supporting reasoning
over Big Data?
1. logically and biologically correct
subsumption hierarchies
– correct: Beta cell is_a cell
– incorrect: allergy is_a allergy record in Microsoft HealthVault
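To illustrate why a correct subsumption hierarchy matters for reasoning over data, here is a small sketch of is_a inference in Python. The slide's example is "Beta cell is_a cell"; the intermediate classes and the sample names below are toy assumptions, not content taken from the Cell Ontology.

```python
# Direct is_a assertions (child -> parents). A toy fragment: the
# intermediate classes are illustrative assumptions.
IS_A = {
    "beta cell": ["endocrine cell"],
    "endocrine cell": ["secretory cell"],
    "secretory cell": ["cell"],
    "neuron": ["cell"],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_subsumed_by(term, candidate, is_a=IS_A):
    """True if `term` is `candidate` or one of its (transitive) subtypes."""
    return term == candidate or candidate in ancestors(term, is_a)


# Data tagged with specific terms can be retrieved by a more general query.
tagged = {
    "sample_1": "beta cell",
    "sample_2": "neuron",
    "sample_3": "secretory cell",
}
hits = sorted(s for s, t in tagged.items() if is_subsumed_by(t, "secretory cell"))
print(hits)  # ['sample_1', 'sample_3']
```

The query only returns sensible results if every is_a link holds for all instances; a link such as "allergy is_a allergy record" would silently pull the wrong data into every query on the parent term.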
31
John Fox, again
As a user and teacher of ontological methods in
medicine and engineering I have for years
warned my students that the design of domain
ontologies is a black art with no theoretical
foundations and few practical principles. … I
now have a much more positive story for my
students. … In the journey from black art to a
truly scientific theory for ontology design this
book is an important milestone.
32
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
INDEPENDENT CONTINUANT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RnaO, PrO)
DEPENDENT CONTINUANT: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO); Cellular Function (GO); Molecular Function (GO)
OCCURRENT: Biological Process (GO); Molecular Process (GO)
Original OBO Foundry ontologies
(Gene Ontology in yellow)
34
http://obofoundry.org
– CHEBI: Chemical Entities of Biological Interest
– CL: Cell Ontology
– GO: Gene Ontology
– OBI: Ontology for Biomedical Investigations
– PATO: Phenotypic Quality Ontology
– PO: Plant Ontology
– PRO: Protein Ontology
– XAO: Xenopus Anatomy Ontology
– ZFA: Zebrafish Anatomy Ontology
35
top level: Basic Formal Ontology (BFO)
mid-level: INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
domain level: Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO); Protein Ontology (PRO); Disease Ontology (OGMS, IDO, HDO, HPO); Phenotypic Quality Ontology (PATO); Biological Process Ontology (GO); Molecular Function Ontology (GO)
Extension Strategy + Modular Organization
36
Example: The Cell Ontology
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
INDEPENDENT CONTINUANT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RNAO, PRO)
DEPENDENT CONTINUANT: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO); Cellular Function (GO); Molecular Function (GO)
OCCURRENT: Organism-Level Process (GO); Cellular Process (GO); Molecular Process (GO)
rationale of OBO Foundry coverage
38
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
INDEPENDENT CONTINUANT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RnaO, PrO)
DEPENDENT CONTINUANT: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO); Cellular Function (GO); Molecular Function (GO)
OCCURRENT: Biological Process (GO); Molecular Process (GO)
Environments: Environment Ontology (EnvO)
39
examples of OBO Foundry approach
extended into other domains
NIF Standard: Neuroscience Information Framework
IDO Consortium: Infectious Disease Ontology Suite
cROP: Common Reference Ontologies for Plants
UNEP Ontology Framework: United Nations Environment Program Ontologies
42
Common Reference Ontologies for Plants (cROP)
The second important criterion of
ontology success in supporting
reasoning over Big Data is:
keeping track of provenance
= recording how data was generated
and processed in a way external users
can understand, to enhance
• combinability
• reproducibility
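A provenance record of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the `provenance_record` helper and the example assay description are hypothetical, and a real project would take its vocabulary for such fields from an ontology such as OBI rather than from ad hoc strings.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, generated_by: str, steps: list) -> dict:
    """Describe how a dataset was generated and processed (hypothetical schema)."""
    return {
        # The checksum ties the description to one exact version of the data.
        "sha256": hashlib.sha256(data).hexdigest(),
        "generated_by": generated_by,
        # Ordered list of processing steps, so others can repeat them.
        "processing_steps": list(steps),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


raw = b"MKVSDRRKFEKANFDEFESALNNK"  # placeholder data
record = provenance_record(
    raw,
    generated_by="sequencing assay (hypothetical run 1)",
    steps=["base calling", "quality filtering"],
)
print(json.dumps(record, indent=2))
```

Recording the steps in order supports reproducibility, and the checksum supports combinability by letting an external user confirm that two descriptions refer to the same data.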
44
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
INDEPENDENT CONTINUANT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RnaO, PrO); Environment Ontology (ENVO)
DEPENDENT CONTINUANT: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
OCCURRENT: Biological Process (GO); Ontology for Biomedical Investigations (OBI); Molecular Process (GO)
Recognizing a new family of protocol-driven
processes (investigation, assay, …)
45
Basic Formal Ontology (BFO)
INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO); Protein Ontology (PRO); Disease Ontology (OGMS, IDO, HDO, HPO); Phenotypic Quality Ontology (PATO); Molecular Function Ontology (GO); Biological Process; Protocol-driven process (OBI)
Extension Strategy + Modular Organization
46
The Ontology for Biomedical
Investigations
Structure of a typical investigation as viewed by OBI
(from http://obi-ontology.org/page/Investigation)
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT, INFORMATION ARTIFACT); OCCURRENT
GRANULARITY: ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE
INDEPENDENT CONTINUANT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO); Cell (CL); Cellular Component (FMA, GO); Molecule (ChEBI, SO, RnaO, PrO); Environment Ontology (ENVO)
DEPENDENT CONTINUANT: Organ Function (FMP, CPRO); Cellular Function (GO); Molecular Function (GO); Phenotypic Quality (PATO)
INFORMATION ARTIFACT: IAO; Software, Algorithms, …; Sequence Data, EHR Data, …; Images, Image Data, Flow Cytometry Data, …
OCCURRENT: Biological Process (GO); Molecular Process (GO)
Also shown: OBI; OBI: Imaging
Recognizing a new family of information entities:
data, publications, images, algorithms …
48
Basic Formal Ontology (BFO)
INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); INFORMATION ARTIFACT (~DATA); OCCURRENT (~PROCESS)
Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO); Protein Ontology (PRO); Disease Ontology (OGMS, IDO, HDO, HPO); Phenotypic Quality Ontology (PATO); Molecular Function Ontology (GO); Data; Biological Process; Assays
Extension Strategy + Modular Organization
49
Even here, things are
not as bad as they
seem
50
http://purl.obolibrary.org/obo/IAO_0000064: algorithm
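The identifier above follows the OBO Foundry PURL pattern, in which a compact ID such as IAO:0000064 maps to a resolvable URL. A minimal sketch of the conversion in Python follows; the function names are hypothetical, and the code assumes a prefix that contains no underscore.

```python
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_purl(curie: str) -> str:
    """Turn a compact ID like 'IAO:0000064' into its OBO PURL."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_PURL_BASE}{prefix}_{local_id}"


def purl_to_curie(purl: str) -> str:
    """Turn an OBO PURL back into a compact ID (assumes no '_' in the prefix)."""
    if not purl.startswith(OBO_PURL_BASE):
        raise ValueError(f"not an OBO PURL: {purl}")
    prefix, local_id = purl[len(OBO_PURL_BASE):].split("_", 1)
    return f"{prefix}:{local_id}"


print(curie_to_purl("IAO:0000064"))
# http://purl.obolibrary.org/obo/IAO_0000064
print(purl_to_curie("http://purl.obolibrary.org/obo/IAO_0000064"))
# IAO:0000064
```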
54
IAO = Information Artifact Ontology:
https://code.google.com/p/information-artifact-ontology/
55
http://bioportal.bioontology.org/ontologies/IAO
56
A list of ontologies using IAO
Adverse Event Reporting Ontology (AERO)
Bioinformatics Web Service Ontology
Biological Collections Ontology (BCO)
Chemical Methods Ontology (CHMO)
Cognitive Paradigm Ontology (COGPO)
Comparative Data Analysis Ontology
Computational Neuroscience Ontology
Core Clinical Protocol Ontology (C2PO)
Document Act Ontology
Eagle-I Research Resource Ontology (ERO)
The Email Ontology
Emotion Ontology (MFOEM)
Experimental Factor Ontology (EFO)
Exposé Ontology
IAO-Intel
Infectious Disease Ontology (IDO)
Influenza Research Database (IRD)
Information Entity Ontology
Mental Functioning Ontology (MF)
Ontology for Biomedical Investigations
Ontology for Drug Discovery Investigations
Ontology for General Medical Science (OGMS)
Ontology for Newborn Screening Followup and Translational Research (ONSTR)
Ontology of Clinical Research (OCRE)
Ontology of Data Mining (OntoDM)
Ontology of Medically Related Social Entities (OMRSE)
Ontology of Vaccine Adverse Events
Oral Health and Disease Ontology (OHDO)
Population and Community Ontology
Proper Name Ontology
Semanticscience Integrated Ontology
Software Ontology (SWO)
Translational Medicine Ontology (TMO)
Twitter Ontology
Vaccine Ontology (VO)
Basic Formal Ontology (BFO)
INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
Patient Demographics; Phenotype (Disease, …); Disease processes
Anatomy; Histology; Chemistry; Genotype (GO); Biological processes (GO)
IAO: Data about all of these things, including image data …; algorithms, software, protocols, …
OBI: Instruments, Biomaterials, Functions; Parameters, Assay types, Statistics, …
aboutness
58
Basic Formal Ontology (BFO)
INDEPENDENT CONTINUANT (~THING); DEPENDENT CONTINUANT (~ATTRIBUTE); OCCURRENT (~PROCESS)
Patient Demographics; Phenotype (Disease, …); Disease processes
Anatomy; Histology; Chemistry; Genotype (GO); Biological processes (GO)
IAO: Data about all of these things, including image data …; algorithms, software, protocols, …
OBI: Instruments, Biomaterials, Functions; Parameters, Assay types, Statistics, …
biomedical imaging ontology
59
The third important criterion of
ontology success in supporting
reasoning over Big Data is:
use the framework of modular,
general-purpose reference
ontologies as starting points for
creating families of purpose-specific
application ontologies in ever
widening circles (scalability)
60
BFO
Ontology for General Medical
Science (OGMS)
Cardiovascular Disease Ontology
Genetic Disease Ontology
Cancer Disease Ontology
Immune Disease Ontology
Environmental Disease Ontology
Oral Disease Ontology
Infectious Disease Ontology
IDO Staph Aureus
IDO MRSA
IDO Australian MRSA
IDO Australian Hospital MRSA
…
61
Problems with:
Denys-Drash syndrome is_a rare nonneoplastic disorder
1. Denys-Drash syndrome involves
nephroblastoma and is therefore
neoplastic
2. X is_a rare Y does not track biology
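The first problem is the kind of error a simple consistency check can surface. The sketch below is a toy illustration in Python, not an actual ontology reasoner: the `involves` relation name is borrowed from the wording above, and the link from "rare nonneoplastic disorder" to "nonneoplastic disorder" and from "nephroblastoma" to "neoplasm" are assumptions made for the example.

```python
# Toy assertions. The parent classes "nonneoplastic disorder" and
# "neoplasm" are assumptions added for this illustration.
IS_A = {
    "Denys-Drash syndrome": ["rare nonneoplastic disorder"],
    "rare nonneoplastic disorder": ["nonneoplastic disorder"],
    "nephroblastoma": ["neoplasm"],
}
INVOLVES = {"Denys-Drash syndrome": ["nephroblastoma"]}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a links."""
    seen, stack = set(), list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def conflicts(is_a, involves):
    """Find disorders classified as nonneoplastic that involve a neoplasm."""
    found = []
    for disorder, parts in involves.items():
        if "nonneoplastic disorder" not in ancestors(disorder, is_a):
            continue
        for part in parts:
            if part == "neoplasm" or "neoplasm" in ancestors(part, is_a):
                found.append((disorder, part))
    return found


print(conflicts(IS_A, INVOLVES))
# [('Denys-Drash syndrome', 'nephroblastoma')]
```

A check like this is only possible when the hierarchy carries biological content; a qualifier such as "rare" gives the check nothing to work with.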
What are the criteria of success for
ontologies in supporting reasoning
over Big Data?
correct: Beta cell is_a cell
incorrect: rare disease is_a disease
If the ontology hierarchy is to support
biologically useful reasoning it must
track biology
66