CROP_Initiative_Sep2.. - Buffalo Ontology Site

The CROP (Common Reference Ontologies for Plants) Initiative
Barry Smith
September 13, 2013
http://ontology.buffalo.edu/smith
1
Agenda
The OBO Foundry
Principles
Reference ontologies vs.
application ontologies
Other ontology consortia
The CROP Initiative
Examples of ontologies within
CROP
2
On June 22, 1799, in Paris,
everything changed
3
International System of Units
4
How to find data?
How to find other people’s data?
How to reason with data when you find it?
How to work out what data does not yet
exist?
5
How to solve the problem of making
the data we find queryable and reusable by others?
Part of the solution must involve:
standardized terminologies and
coding schemes
6
But there are multiple kinds of
standardization for biological data, and
they do not work well together
Proposed solution: Ontology-based
annotation of data
7
ontologies = standardized labels
designed for use in annotations
to make the data cognitively
accessible to human beings
and algorithmically accessible
to computers
8
ontologies = high quality controlled
structured vocabularies for the
annotation (description) of data,
images, journal articles …
9
Ramirez et al.
Linking of Digital Images to Phylogenetic Data Matrices Using a
Morphological Ontology
Syst. Biol. 56(2):283–294, 2007
ontologies used in curation of literature
what cellular component?
what molecular function?
what biological process?
11
Proposed framework: the
Semantic Web
• HTML demonstrated the power of the Web to
allow sharing of information
• can we use semantic technology to create a
Web 2.0 which would allow algorithmic
reasoning with online information based on a
common Web Ontology Language (OWL)?
• can we use netcentricity, common URLs, to
break down silos, and create useful integration
of online data and information?
12/24
Ontology success stories, and
some reasons for failure
A fragment of the “Linked Open
Data” in the biomedical domain
13
http://bioportal.bioontology.org/
14
The more ontology-building is
successful, the more it fails
OWL breaks down data silos via controlled
vocabularies for the description of data
dictionaries
Unfortunately the very success of this
approach led to the creation of multiple,
new, semantic silos – because multiple
ontologies are being created in ad hoc ways
19/24
http://bioportal.bioontology.org/
Many ontologies in bioportal are created by
importing content from existing ontologies and
giving the terms imported new names and new
IDs
The result is chaos, with bits and pieces of the
same ontologies scattered across multiple different
places.
This leads to massively redundant effort, forking
and doom
20
A standard engineering methodology
• It is easier to write useful software if one works
with a simplified model
• (“…we can’t know what reality is like in any
case; we only have our concepts…”)
• This looks like a useful model to me
• (One week goes by:) This other thing looks like
a useful model to him
• Data in Pittsburgh does not interoperate with
data in Vancouver
• Science is siloed
A good solution to this silo problem
must be:
• modular
• incremental
• independent of hardware and software
• bottom-up
• evidence-based
• revisable
It must also incorporate a strategy for motivating
potential developers and users
22
Uses of ‘ontology’ in PubMed abstracts
23
24
main reason for GO’s success
Gene Ontology and associated databases
“make it possible to systematically dissect
large gene lists in an attempt to assemble a
summary of the most enriched and
pertinent biology”
PMC2615629
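Enrichment of a GO term in a gene list is often scored with a hypergeometric upper-tail test. The sketch below is a minimal illustration under that assumption. It is not taken from the cited paper, whose tools may use other statistics, and the function name and example counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(X >= hits) when `sample` genes are drawn from `population` genes,
    of which `annotated` carry the GO term (hypergeometric upper tail)."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical counts: 20,000 genes, 200 annotated to a term,
# and a list of 100 genes of which 8 carry the term.
p = enrichment_p_value(20000, 200, 100, 8)
print(f"p = {p:.3g}")
```

A small p suggests the term is over-represented in the list relative to chance. Such a comparison is only meaningful because every gene is annotated with the same controlled terms.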
GO provides a controlled system of
terms for use in annotating
(describing, tagging) data
• multi-species, multi-disciplinary, open
source
• contributing to the cumulativity of
scientific results obtained by distinct
research communities
• compare use of kilograms, meters,
seconds in formulating experimental
results
26
GO is 3 ontologies:
• cellular component
• molecular function
• biological process
Top-Level Architecture
[Figure: top-level division into Continuant (Independent Continuant, Dependent Continuant) and Occurrent (Process, Event), drawn with universals above and instances below]
28
Problem with the GO
• it covers only three types of entities
• no diseases
• no laboratory artifacts
• no anatomy (above the cell)
• only species-terms for development
• no phenotypes
29
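[Figure: coverage grid. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)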
The Open Biomedical Ontologies (OBO) Foundry
30
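[Figure: coverage grid. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Organism-Level Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
  Occurrent: Cellular Process (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RNAO, PRO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)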
rationale of OBO Foundry coverage
31
First step (2001)
a shared portal for (so far) 58 ontologies
(low regimentation)
http://obo.sourceforge.net  NCBO BioPortal
32
33
OBO builds on the principles
successfully implemented by the GO
recognizing that ontologies need to
be developed in tandem
34
Second step
(2006)
The OBO Foundry
http://obofoundry.org/
35
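[Figure: coverage grid. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)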
Building out from the original GO
36
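[Figure: coverage grid. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Organism-Level Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
  Occurrent: Cellular Process (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)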
initial OBO Foundry coverage
37
OBO Foundry Principles
• common formal architecture
• clearly delineated content (redundant: overlaps with orthogonality)
• the ontology is well-documented (overlaps with rules for definitions; needs expanding, for developers, for users, minimal metadata)
• plurality of independent users
• single locus of authority, trackers, help desk
38
OBO Foundry Principles
• textual definitions plus formal definitions
• all definitions should be of the genus-species form
A =def. a B which Cs
where B is the parent term of A in the ontology hierarchy
• formal definitions use OBO format or OWL
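The pattern "A =def. a B which Cs" can be sketched as a small data structure in which every term stores its parent (the genus B) and its differentiating clause (Cs). This is an illustrative sketch only: the class, field names and example terms are hypothetical and are not quoted from any Foundry ontology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    label: str
    parent: "Term | None" = None  # B: the parent term in the hierarchy
    differentia: str = ""         # the "which Cs" clause

    def definition(self) -> str:
        """Render the textual definition from the parent and the differentia."""
        if self.parent is None:
            return f"{self.label} =def. (root term, no genus)"
        article = "an" if self.parent.label[0].lower() in "aeiou" else "a"
        return f"{self.label} =def. {article} {self.parent.label} which {self.differentia}"

# Hypothetical example terms.
cell = Term("cell")
neuron = Term("neuron", parent=cell, differentia="transmits electrical signals")
print(neuron.definition())
```

Because the genus is read from the hierarchy rather than typed by hand, the textual definition cannot drift away from the term's asserted parent.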
39
Orthogonality
• For each domain, there should be convergence upon
a single ontology that is recommended for use by
those who wish to become involved with the
Foundry initiative
• Part of the goal here is to avoid the need for
mappings – which are in any case too expensive, too
fragile, too difficult to keep up-to-date as mapped
ontologies change
• Orthogonality means:
– everyone knows where to look to find out how to annotate
each kind of data
– everyone knows where to look to find content for
application ontologies
40
Orthogonality = non-redundancy
for the reference ontologies inside
the Foundry
• application ontologies can overlap, but then
only in those areas where common coverage
is supplied by a reference ontology
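Non-redundancy can be checked mechanically, at least in a crude way. The sketch below flags any label claimed by more than one reference ontology. It assumes that each ontology is reduced to a set of labels, which is a simplification (a real check would compare definitions, not just labels), and the ontology names and labels are hypothetical.

```python
from collections import defaultdict

def find_overlaps(ontologies: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each label claimed by more than one ontology to the ontologies claiming it."""
    owners: dict[str, list[str]] = defaultdict(list)
    for name in sorted(ontologies):
        for label in ontologies[name]:
            owners[label.lower()].append(name)
    return {label: names for label, names in owners.items() if len(names) > 1}

# Hypothetical toy ontologies; labels are illustrative only.
reference = {
    "ONT_A": {"cell", "nucleus"},
    "ONT_B": {"nucleus", "glucose"},
}
print(find_overlaps(reference))
```

An empty result is what orthogonality asks for among reference ontologies. A non-empty result points to the terms whose ownership has to be negotiated.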
41
PRINCIPLES
• COMMON FORMAL ARCHITECTURE: The
ontology uses relations which are unambiguously
defined following the pattern of definitions laid
down in the Basic Formal Ontology (BFO)
http://www.ifomis.uni-saarland.de/bfo/
‘formal’ = domain neutral
42
Basic Formal Ontology
[Figure: Continuant divides into Independent Continuant (e.g. cell component) and Dependent Continuant (e.g. molecular function); Occurrent (e.g. biological process)]
OBO Foundry
provides guidelines (traffic laws) to
new groups of ontology developers in
ways which can counteract current
dispersion of effort
New principle: Employ the
methodology of cross-products
compound terms in ontologies are to be
defined as cross-products of simpler terms:
E.g., elevated blood glucose is a cross-product of
PATO: increased concentration with FMA: blood and
ChEBI: glucose.
= factoring out of ontologies into discipline-specific modules (orthogonality)
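The elevated blood glucose example can be sketched as a compound term that stores references to its component terms instead of redefining them. The class names, the registry and the helper function are hypothetical, and the registry lists only the three ontologies named in the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermRef:
    ontology: str  # source ontology, e.g. "PATO"
    label: str     # term label within that ontology

@dataclass(frozen=True)
class CrossProduct:
    label: str
    components: tuple[TermRef, ...]

# Subset of reference ontologies, for illustration only.
REFERENCE_ONTOLOGIES = {"PATO", "FMA", "ChEBI"}

def is_factored(term: CrossProduct, registry: set[str]) -> bool:
    """True if every component is drawn from a registered reference ontology."""
    return all(ref.ontology in registry for ref in term.components)

elevated_blood_glucose = CrossProduct(
    label="elevated blood glucose",
    components=(
        TermRef("PATO", "increased concentration"),
        TermRef("FMA", "blood"),
        TermRef("ChEBI", "glucose"),
    ),
)
print(is_factored(elevated_blood_glucose, REFERENCE_ONTOLOGIES))
```

Each discipline keeps authority over its own component, so a revision to the chemistry term reaches the compound term without any copy having to be updated by hand.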
45
The methodology of cross-products
enforcing use of common relations in linking terms
drawn from Foundry ontologies serves
• to ensure that the ontologies are maintained and
revised in tandem
• logically defined relations serve to bind terms in
different ontologies together to create a network
46
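[Figure: coverage grid. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)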
Building out from the original GO
47
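[Figure: coverage grid extended upward. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
COMPLEX OF ORGANISMS
  Independent continuant: Family, Community, Deme, Population
  Dependent continuant: Population Phenotype
  Occurrent: Population Process
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)
Population-level ontologies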
48
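[Figure: coverage grid with environments added. Rows: GRANULARITY. Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)
Added element: environments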
Environment Ontology
49
[Figure: layered organization of the ontologies]
top level: Basic Formal Ontology (BFO)
mid-level: Ontology for Biomedical Investigations (OBI); Information Artifact Ontology (IAO); Spatial Ontology (BSPO)
domain level: Anatomy Ontology (FMA*, CARO); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Environment Ontology (EnvO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Protein Ontology (PRO*); Infectious Disease Ontology (IDO*); Phenotypic Quality Ontology (PaTO); Biological Process Ontology (GO*); Molecular Function (GO*)
Extension Strategy + Modular Organization
50
Third step:
Creation of new ontology consortia,
modeled on the OBO Foundry
• OBO Foundry: Open Biological and Biomedical Ontologies
• NIF Standard: Neuroscience Information Framework
• eagle-I Ontologies: used by VIVO and CTSAconnect
• IDO Consortium: Infectious Disease Ontology
51
A good solution to the silo problem must
be:
• modular
• incremental
• independent of software and hardware
• bottom-up
• evidence-based
• revisable
It must also incorporate a strategy for motivating
potential developers and users
52
Because the ontologies in the
Foundry
are built as orthogonal modules which form an
incrementally evolving network
• scientists are motivated to commit to
developing ontologies because they will need in
their own work ontologies that fit into this
network
• users are motivated by the assurance that the
ontologies they turn to are maintained by
experts
53
More benefits of orthogonality
• helps those new to ontology to find what they
need
• to find models of good practice
• ensures mutual consistency of ontologies
(trivially)
• and thereby ensures additivity of annotations
54
More benefits of orthogonality
• it rules out the sorts of simplification and
partiality which may be acceptable under
more pluralistic regimes
• thereby brings an obligation on the part of
ontology developers to commit to scientific
accuracy and domain-completeness
55
More benefits of orthogonality
• No need to reinvent the wheel for each new
domain
• Can profit from storehouse of lessons learned
• Can more easily reuse what is made by others
• Can more easily reuse training
• Can more easily inspect and criticize results of
others’ work
• Leads to innovations (e.g. MIREOT, OntoFox) in
strategies for combining ontologies
56
Reference Ontologies vs.
Application Ontologies
Reference ontology = an ontology that
captures generic content and is designed
for aggressive reuse in multiple different
types of context. Our assumption is that
most reference ontologies will be created
manually on the basis of explicit assertion
of the taxonomical and other relations
between their terms.
Reference Ontologies vs.
Application Ontologies
By ‘application ontology’ we mean an
ontology that is tied to specific local
applications. Each application ontology is
created by using ontology merging software
to combine new, local content with generic
content taken over from relevant reference
ontologies
Xiang, et al., “OntoFox: Web-Based Support for Ontology
Reuse”, BMC Research Notes. 2010, 3:175.
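The idea of combining local content with generic content taken over from reference ontologies can be sketched as follows. This is not OntoFox's actual algorithm. It is a minimal illustration in which an ontology is reduced to a mapping from ID to label, and all IDs, labels and function names are hypothetical. The point it shows is that imported terms keep their original IDs instead of being given new names and new IDs.

```python
def build_application_ontology(
    reference: dict[str, str],
    wanted_ids: list[str],
    local_terms: dict[str, str],
) -> dict[str, str]:
    """Copy selected reference terms unchanged (same ID, same label) and add local terms."""
    missing = [tid for tid in wanted_ids if tid not in reference]
    if missing:
        raise KeyError(f"not in reference ontology: {missing}")
    clash = set(wanted_ids) & set(local_terms)
    if clash:
        raise ValueError(f"local terms reuse reference IDs: {sorted(clash)}")
    imported = {tid: reference[tid] for tid in wanted_ids}
    return {**imported, **local_terms}

# Hypothetical IDs and labels, for illustration only.
reference = {"REF:0001": "infectious disease", "REF:0002": "pathogen", "REF:0003": "host"}
local = {"APP:0001": "clinic visit record"}
app = build_application_ontology(reference, ["REF:0001", "REF:0002"], local)
print(app)
```

Because the reference IDs survive the import, data annotated with the application ontology remains comparable with data annotated directly with the reference ontology.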
Normalization of the ontology space
– content from reference ontologies
is maximally re-used, e.g. in
formulation of compound terms and
of cross-product definitions
(Compare normalization of a vector
space)
(Compare, again, SI System of Units)
International System of Units
60
Infectious Disease Ontology
(IDO)
61
We have data, e.g.:
• TBDB: Tuberculosis Database, including
Microarray data
• VFDB: Virulence Factor DB
• TropNetEurop Dengue Case Data
• ISD: Influenza Sequence Database at LANL
• MPD/MRD/CPP: Protein Data of PIR Resource
Center for Biodefense Proteomics Research
• PathPort: Pathogen Portal Project
62
Purpose of Infectious Disease
Ontology (IDO)
• Retrieval and integration of infectious disease
relevant data
– Sequence and protein data for pathogens
– Case report data for patients
– Clinical trial data for drugs, vaccines
– Epidemiological Data for surveillance, prevention
– ...
• Goal: to make data deriving from different
sources comparable and computable
63
IDO Strategy
• Reference ontology (IDO Core) with terms
relevant to any infectious disease
• Disease- and organism-specific application
ontologies
– for different types of host, types of vector, types
of pathogen, types of disease
64
Infectious Disease Ontology (IDO)
• Member of the OBO Foundry
• A suite of ontologies
– IDO Core:
• General terms in the ID domain.
• A hub for all IDO extensions.
– IDO Extensions:
• Disease specific.
• Developed by subject matter experts.
• Provides:
– Clear, precise, and consistent natural language definitions
– Computable logical representations (OWL, OBO)
How IDO evolves
[Figure: IDO Process Model]
CORE and SPOKES: Domain ontologies
SEMI-LATTICE: By subject matter experts in different communities of interest.
Ontologies shown: IDOCore, IDOMAL, IDOFLU, IDOHIV, IDOStrep, IDOSa, IDOMRSa, IDOHumanSa, IDOHumanStrep, IDOHumanBacterial, IDOAntibioticResistant, IDORatSa, IDORatStrep
Sample Application: A lattice of infectious disease
application ontologies from NARSA isolate data
• Expose value of Genotype-Phenotype Linked
Data by converting a free-text database from
NARSA (Network on Antimicrobial Resistance
in Staphylococcus aureus) into a
computational resource
Ways of differentiating
Staphylococcus aureus infectious diseases
• Infectious Disease
– By host type
– By (sub-)species of pathogen
– By antibiotic resistance
– By anatomical site of infection
• Bacterial Infectious Disease
– By PFGE (Strain)
– By MLST (Sequence Type)
– By BURST (Clonal Complex)
• Sa Infectious Disease
– By SCCmec type
• By ccr type
• By mec class
– spa type
http://www.sccmec.org/Pages/SCC_ClassificationEN.html
[Figure: NRS701’s resistance to clindamycin. Sources shown: ido.owl, narsa.owl, narsa-isolates.owl, ndf-rt]
Further extensions of IDO
• Vaccine (Vaccine Ontology)
• Plant IDO
from ICBO 2012:
71
Founding CROP
The ontologies in CROP
General ontologies taken over from OBO Foundry
• ChEBI: Chemistry ontology
• GO: Gene Ontology
• PRO: Protein Ontology
• ENVO: Environment Ontology
+ GAZ: Gazetteer built on ontological principles
• PATO: Phenotypic Quality Ontology
73
Plant specific ontologies to be
developed by CROP group
PO: Plant Ontology
TO: Trait Ontology
EO: Plant Environment Ontology
Plant IDO
Plant Disease
Action items:
fix relation between EnvO and EO
fix relation between PATO and TO
Taxonomy resource
(for diseases of host and causal
organisms + vectors/secondary
hosts)
NCBI Taxonomy has most of the
hosts, but not the viruses
Examples of CROP actions
1. ontology training
2. ontology hub-spokes formations
(e.g. for plant development)
3. treaty negotiation meetings
Next steps in CROP:
PRO-PO-GO Meeting
Buffalo, Spring 2013
PRO = protein ontology
PO = plant ontology
GO = gene ontology
The Environment Ontology
OBO Foundry
Genomic Standards Consortium
Natural Environment Research Council (UK)
USDA, Gramene, J. Craig Venter Institute ...
78
Applications of EnvO in biology
79
How EnvO currently works for
information retrieval
Retrieve all experiments on organisms obtained from:
– deep-sea thermal vents
– arctic ice cores
– rainforest canopy
– alpine melt zone
Retrieve all data on organisms sampled from:
– hot and dry environments
– cold and wet environments
– a height above 5,000 meters
Retrieve all the omic data from soil organisms subject to:
– moderate heavy metal contamination
83
extending EnvO to clinical and
translational research
• we have public health, community and
population data
• we need to make this data available for search
and algorithmic processing
• we create a consensus-based ontology which
can interoperate with ontologies for neighboring
domains of medicine and basic biology
84
Environment = totality of circumstances external
to a living organism or group of organisms
– pH
– evapotranspiration
– turbidity
– available light
– predominant vegetation
– predatory pressure
– nutrient limitation …
85
extend EnvO to the clinical domain
– dietary patterns (Food Ontology: FAO, USDA) ... allergies
– neighborhood patterns
• built environment, living conditions
• climate
• social networking
• crime, transport
• education, religion, work
• health, hygiene
– disease patterns
• bio-environment (bacteriological, ...)
• patterns of disease transmission (links to IDO)
86
Aligning EnvO to the Basic Formal Ontology
[Figure: EnvO terms placed under BFO continuant. Terms shown: system, ecosystem, biome, habitat; object, organism, pond; environmental feature; site, mountain slope; spatial region; …]
• Habitat =def. An ecosystem which can
support the life of a given organism,
population, or community
• Realized niche =def. An ecosystem which is
that part of a habitat which supports the life
of a given organism, population or community
Aligning EnvO to the Basic Formal Ontology
[Figure: EnvO terms placed under BFO continuant. Terms shown: system, ecosystem, biome, habitat; object, organism, pond; environmental feature; site, mountain slope; spatial region; …]
Hutchinsonian niche
(niche as volume in a functionally
defined hyperspace)
• =def. an n-dimensional hyper-volume
whose dimensions correspond to resource
gradients over which species are distributed
– degree of slope, exposure to sunlight, soil
fertility, foliage density, salinity...
G.E. Hutchinson (1957, 1965)
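The hyper-volume idea can be sketched by treating each resource gradient as one dimension with a tolerated range and asking whether a site's conditions fall inside every range. An axis-aligned box is a deliberate simplification of a general hyper-volume, and the dimension names, ranges and site values below are hypothetical.

```python
Niche = dict[str, tuple[float, float]]  # dimension name -> (low, high) tolerated range

def within_niche(conditions: dict[str, float], niche: Niche) -> bool:
    """True if the conditions lie inside the niche on every dimension."""
    return all(
        dim in conditions and low <= conditions[dim] <= high
        for dim, (low, high) in niche.items()
    )

# Hypothetical ranges and sites, for illustration only.
niche = {
    "slope_degrees": (0.0, 15.0),
    "salinity_ppt": (0.0, 5.0),
    "sunlight_hours": (4.0, 10.0),
}
site_a = {"slope_degrees": 8.0, "salinity_ppt": 1.2, "sunlight_hours": 6.5}
site_b = {"slope_degrees": 30.0, "salinity_ppt": 1.2, "sunlight_hours": 6.5}
print(within_niche(site_a, niche), within_niche(site_b, niche))
```

Adding a dimension, such as soil fertility or foliage density, only adds one more entry to the mapping, which mirrors the n-dimensional character of the definition.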
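Aligning EnvO to the Basic Formal Ontology
[Figure: EnvO terms placed under BFO continuant, with a part_of link between niche and habitat. Terms shown: system, ecosystem, biome, habitat, niche; object, organism, pond; environmental feature; site, mountain slope; spatial region; …]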