NIST-Ontolog-NCOR Mini-Series: Ontology Measurement and

The OBO Foundry
A Gold Standard Approach to
Ontology Evaluation
Barry Smith
http://ontology.buffalo.edu/smith
http://ontologist.com
1
Two types of ontology
natural-science ontologies capture
terminology-level knowledge underlying the
best current science
contrasted with administrative ontologies
(e.g. billing ontologies, bloodbank
ontologies, lab workflow ontologies)
prepared for specific, local purposes
2
scientific ontologies have special features
Every term in a scientific ontology must be
such that the developers of the ontology
believe it to refer to some entity on the basis
of the best current evidence
scientific ontologies are realism-based
3
For scientific ontologies
reusability is crucial
compatibility with neighboring scientific
ontologies
it is generalizations that are
important
= universals, types, kinds
4
An ontology is a representation
of universals
We learn about universals in reality from
looking at the results of scientific
experiments in the form of scientific theories
experiments relate to what is particular
science describes what is general
5
what is the difference between an
ontology and a scientific theory?
an ontology is also a
terminological standardization
WHAT DOES THIS MEAN?
6
1st aspect: additivity
cell = def. plant cell, consisting of protoplast
and cell wall; ... [Plant Ontology]
what happens when the users of the Plant
Ontology need to consider bacterial
pathogens in plants?
7
2nd aspect: calibration with reality
gold standard kilogram
the same universal is
defined by reference
either to some artifact
or to some universal
physical constant
(for realists there is no
problem here)
8
VIM: the International
Vocabulary of Metrology
(i) repeated measurements always give rise to some
variation in values,
(ii) one can never be sure (fallibilism) that one has got the
true value,
Hence:
(iii) there are no true values.
To keep happy those who dismiss the notion of the true
value, the international community is agreeing to a set of
terms which intentionally allow two possible interpretations
once again: bad philosophy leads to bad standards
Compare: http://ontology.buffalo.edu/medo/Wuesteria.pdf
9
from: The NIST Reference on
Constants, Units and Uncertainty
The creation of the decimal Metric System
at the time of the French Revolution and the
subsequent deposition of two platinum
standards representing the meter and the
kilogram, on 22 June 1799, in the Archives
de la République in Paris can be seen as
the first step in the development of the
present International System of Units.
10
from: The NIST Reference on
Constants, Units and Uncertainty
In the 1860s Maxwell and Thomson ‘formulated the
requirement for a coherent system of units with base units
and derived units.
In 1874 the British Association for the Advancement of
Science introduced the CGS system, a three-dimensional
coherent unit system based on the three mechanical units
centimeter, gram and second, using prefixes ranging from
micro to mega to express decimal submultiples and
multiples.
The following development of physics as an experimental
science was largely based on this system.’
11
12
Base and Derived Units
Units based on undefined SI dimensions:
meter, second, kilogram, ampere,
candela, kelvin, mole.
Units based on defined SI dimensions:
volume, area, velocity, acceleration,
newton, joule, pascal, coulomb, farad,
henry, hertz, lumen, lux, ohm, etc.
Dimensions can be multiplied and
divided (meters/second).
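The remark that dimensions can be multiplied and divided can be sketched in Python by treating each dimension as a vector of exponents over the seven base units listed on this slide. The names `Dimension`, `base`, `LENGTH`, `TIME`, `MASS` and the exponent ordering are assumptions made for illustration; they do not come from the SI documents or from any units library.

```python
from dataclasses import dataclass

# Exponent order is an assumption of this sketch; it follows the slide's list.
BASE = ("m", "s", "kg", "A", "cd", "K", "mol")


@dataclass(frozen=True)
class Dimension:
    exponents: tuple

    def __mul__(self, other):
        # Multiplying dimensions adds their exponents.
        return Dimension(tuple(a + b for a, b in zip(self.exponents, other.exponents)))

    def __truediv__(self, other):
        # Dividing dimensions subtracts their exponents.
        return Dimension(tuple(a - b for a, b in zip(self.exponents, other.exponents)))

    def __str__(self):
        parts = [f"{u}^{e}" for u, e in zip(BASE, self.exponents) if e != 0]
        return " ".join(parts) or "dimensionless"


def base(symbol):
    """Return the base dimension whose unit symbol is `symbol`."""
    return Dimension(tuple(1 if u == symbol else 0 for u in BASE))


LENGTH, TIME, MASS = base("m"), base("s"), base("kg")

# Derived dimensions are built from base dimensions.
AREA = LENGTH * LENGTH
VELOCITY = LENGTH / TIME           # meters/second
ACCELERATION = VELOCITY / TIME
FORCE = MASS * ACCELERATION        # the dimension measured in newtons

print(VELOCITY)
print(FORCE)
```

The choice of which dimensions count as base is conventional here too: swapping the contents of `BASE` changes the representation but not the relations between the derived dimensions.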
13
The SI System of Units
is a qualitative ontology: it captures
qualitative dimensions of reality to which
quantities can be applied (it captures
measurable dimensions of reality)
there is a degree of conventionality in the
choice of basic vs. derived units, and in the
standard [e.g. the Paris meter] that is used
to define the unit in each dimension
14
but the dimensions themselves exist
independently of our conventions
so that an ontology of these dimensions is a true
representation of an independently existing reality
15
Quantities are Universals
Ingvar Johansson:
Many different things can simultaneously
have a mass of 5kg (length of 4m, etc.).
Determinate quantities are universals, which
means that they have many instances
16
Units Ontology
developed in conjunction with PATO, the
Phenotypic qualities ontology
obo.sourceforge.net/cgi-bin/detail.cgi?quality
17
fiat subtypes of qualities
[Diagram: is_a hierarchy of quality, spatial quality, length, weight, temperature, …, with the determinate values 1mm, 1cm (length) and 1g, 1kg (weight) as fiat subtypes]
18
Representation of measurements
[Diagram: is_a hierarchy of quality (spatial quality, length, weight, temperature) alongside unit (mm, cm, kg, g), connected by measurement_of]
19
Ingvar Johansson:
(a) no object can possibly at one and the
same time take two values of the same
quantity dimension
(b) in case of additive quantities, only
quantities of the same dimension can be
added together to give rise to a sum: no
material object can have two masses, and
masses can only be added to other masses
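Johansson's two constraints can be sketched as runtime checks in Python. The classes `Quantity` and `MaterialObject` are hypothetical names introduced for this illustration, and the sketch assumes that all values within one dimension are expressed in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dimension: str  # e.g. "mass" or "length"; one unit per dimension is assumed

    def __add__(self, other):
        # Constraint (b): only quantities of the same dimension can be added.
        if self.dimension != other.dimension:
            raise TypeError(f"cannot add {self.dimension} to {other.dimension}")
        return Quantity(self.value + other.value, self.dimension)


class MaterialObject:
    def __init__(self, name):
        self.name = name
        self._quantities = {}

    def assign(self, quantity):
        # Constraint (a): no object takes two values of one dimension at once.
        current = self._quantities.get(quantity.dimension)
        if current is not None and current != quantity:
            raise ValueError(f"{self.name} already has a {quantity.dimension}")
        self._quantities[quantity.dimension] = quantity

    def get(self, dimension):
        return self._quantities[dimension]


brick = MaterialObject("brick")
brick.assign(Quantity(5, "mass"))
brick.assign(Quantity(4, "length"))
total_mass = brick.get("mass") + Quantity(2, "mass")
```

Assigning a second, different mass to `brick` raises `ValueError`, and adding its mass to its length raises `TypeError`.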
20
Controlled vocabulary
Each SI unit is represented by a symbol,
not an abbreviation. The use of unit
symbols is regulated by precise rules.
These symbols are the same in every
language of the world, even though the
names of the units themselves vary in
spelling according to national conventions.
21
The SI system of units gives you:
a gold standard controlled vocabulary for the
expression of scientific results which makes these
results comparable and integratable
– my hypotheses can be checked against your data
– my measuring equipment can be calibrated
against your measuring equipment (because each
can be calibrated against the same gold
standard)
the SI system of units can serve as a gold
standard because it is a true reflection of an
independent reality
22
a system of units is a legend for measurement data
[Figure: measurement data plot with series labeled heartrate, speed, cadence, torque, power]
23
compare: legends for maps
24
Creating a system of units
is not easy; it has to match the way the
measurable dimensions are interconnected
in reality
it may need to be revised in light of new
discoveries about how reality is structured
25
after Maxwell and Thomson
the subsequent development of physics as
an experimental science was largely based
on their system of standardized units.
26
analogous achievements also in
chemistry
IUPAC
InChI
and in molecular biology,
for proteins, enzymes, genes, etc.
IUBMB
HUGO Gene Nomenclature Committee,
etc.
27
Periodic Table
28
the goal of realist ontology
to generalize this achievement
– specifically in biology
– and in medicine (where forces are at work
which tend to thwart standardization of
vocabulary)
to move from standardizations of nouns to
standardizations of sentences
29
gene expression data
realist ontologies are legends for data
where in the body?
in what kind of cell?
what kind of disease process?
need for semantic annotation of data
31
32
the Gene Ontology is
already a de facto standard
33
natural language labels organized in a graph-theoretic structure, designed to make the data
cognitively accessible to human beings
algorithmically accessible to machines
linked up to other data resources
because the same labels have been used
34
compare: legends for cartoons
(for diagrams in scientific texts)
35
ontologies are legends for
mathematical equations
x_i = vector of measurements of gene i
k = the state of the gene (as “on” or “off”)
θ_i = set of parameters of the Gaussian model
...
36
or chemistry diagrams
Prasanna, et al.
Chemical Compound
Navigator: A Web-Based
Chem-BLAST, Chemical
Taxonomy-Based Search
Engine for Browsing
Compounds
PROTEINS: Structure, Function,
and Bioinformatics 63:907–917
(2006)
37
annotation using common ontologies
yields integration of databases
GlyProt
MouseEcotope
Holliday junction
helicase complex
DiabetInGene
GluChem
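How shared annotation yields integration can be sketched in Python: if records in separate databases carry the same ontology label, an index keyed on that label joins them with no pairwise mapping. The database names are taken from this slide, but the record identifiers, the second label, and the `integrate` function are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def integrate(*databases):
    """Index records from several databases by the ontology terms annotating them."""
    index = defaultdict(list)
    for db_name, records in databases:
        for record_id, terms in records.items():
            for term in terms:
                index[term].append((db_name, record_id))
    return dict(index)


# Hypothetical records, each annotated with labels from a common ontology.
glyprot = {"rec-1": ["Holliday junction helicase complex"]}
diabetingene = {
    "rec-9": ["Holliday junction helicase complex"],
    "rec-10": ["some other term"],
}

index = integrate(("GlyProt", glyprot), ("DiabetInGene", diabetingene))
linked = index["Holliday junction helicase complex"]
```

The join works only because both sources used the same label; with independently coined vocabularies each pair of databases would need its own mapping.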
38
What is mapping (1)
“Given two ontologies A and B, mapping one
ontology with another means that for each
concept (node) in ontology A, we try to find a
corresponding concept (node), which has
the same or similar semantics, in ontology
B and vice versa.”
M. Ehrig and Y. Sure, Ontology mapping - an integrated approach. In
Proceedings of the First European Semantic Web Symposium, ESWS 2004,
volume 3053 of Lecture Notes in Computer Science, pages 76–91, Heraklion,
Greece, May 2004. Springer Verlag.
39
What is mapping (2)
“the task of relating the vocabulary of two
ontologies in such a way that the
mathematical structure of ontological
signatures and their intended interpretations,
as specified by the ontological axioms, are
respected”.
[ontological signature = a hierarchy of concept symbols together with a
set of relation symbols whose arguments are defined over the concepts
of the concept hierarchy]
Y. Kalfoglou and M. Schorlemmer, Ontology mapping: the state of the art. Knowl. Eng. Rev., 18(1): 2003.
40
What is mapping (3)
“a formal expression that states the semantic
relation between two entities belonging to
different ontologies”,
“Simple examples are:
concept c1 in ontology O1 is equivalent to
concept c2 in ontology O2;
concept c1 in ontology O1 is similar to concept
c2 in ontology O2;
individual i1 in ontology O1 is the same as
individual i2 in ontology O2”
P. Bouquet et al. KnowledgeWeb deliverable D2.2.1. Specification of a common framework for characterizing alignment.
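The third definition treats a mapping as a formal expression, which can be sketched as a small Python record. The class `Mapping`, its field names, and the relation names `equivalent_to`, `similar_to` and `same_as` are assumptions chosen to mirror the three quoted examples; they are not taken from the KnowledgeWeb deliverable. Treating all three relations as symmetric is also an assumption of this sketch.

```python
from dataclasses import dataclass

# Relation names are hypothetical labels for the three quoted examples.
RELATIONS = {"equivalent_to", "similar_to", "same_as"}


@dataclass(frozen=True)
class Mapping:
    source_ontology: str
    source_entity: str
    relation: str
    target_ontology: str
    target_entity: str

    def __post_init__(self):
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")

    def inverse(self):
        # Assumes the relation is symmetric, so source and target can swap.
        return Mapping(self.target_ontology, self.target_entity,
                       self.relation,
                       self.source_ontology, self.source_entity)


m1 = Mapping("O1", "c1", "equivalent_to", "O2", "c2")
m2 = Mapping("O1", "c1", "similar_to", "O2", "c2")
m3 = Mapping("O1", "i1", "same_as", "O2", "i2")
```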
41
One way to support ontology
matching (and evaluation)
have experts manually prepare for each
given matching problem a gold standard to
which matching efforts could be compared.
– M. Ehrig and J. Euzenat, Relaxed Precision and Recall for
Ontology Matching, in: Proc. K-Cap 2005 workshop on
Integrating ontology, Banff (CA), p. 25-32, 2005.
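Comparing a matching effort against an expert-prepared gold standard can be sketched in Python with ordinary precision and recall. This is the standard strict form of the two measures, not the relaxed variants proposed in the cited paper, and the alignment pairs below are hypothetical.

```python
def precision_recall(found, gold):
    """Score a set of proposed correspondences against a gold standard set."""
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical correspondences between terms of ontology A and ontology B.
gold = {
    ("A:lung", "B:Lung"),
    ("A:heart", "B:Heart"),
    ("A:liver", "B:Liver"),
    ("A:kidney", "B:Kidney"),
}
found = {
    ("A:lung", "B:Lung"),
    ("A:heart", "B:Heart"),
    ("A:liver", "B:Spleen"),   # a wrong correspondence
}

precision, recall = precision_recall(found, gold)
```

Here two of the three proposed correspondences are in the gold standard, and two of the four gold correspondences were found.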
42
Gold standard methodology for
ontology evaluation
is very expensive
who are the experts?
sometimes cannot be done for political reasons
• UMLS metathesaurus
even a gold standard can contain errors
43
Solution: The OBO Foundry
1. some large pieces already exist (especially
Gene Ontology, Foundational Model of
Anatomy)
2. processes of unification and reform already in
place
3. all participants aiming for additivity
4. procedures for constant update in light of
scientific advance
http://obofoundry.org
44
The GO methodology of annotations
science basis of the GO: trained experts curating peer-reviewed literature
RESULT: a slowly growing computer-interpretable map of
biological reality within which major databases are
automatically integrated in semantically searchable form
Contrast: data-mining based approaches to ontology
construction
45
Systematic annotation of references to
gene products in literature
• leads to improvements and extensions of
the ontology
• leads to better annotations
• leads to a virtuous cycle of improvement in
the quality and reach of both future
annotations and the ontology itself
46
Five bangs for your GO buck
science base
cross-species database integration
cross-granularity database integration
through links to the entities in biological reality
 semantic searchability links people to
software
47
First step (2003)
a shared portal for (so far) 58 ontologies
(low regimentation)
http://obo.sourceforge.net  NCBO BioPortal
48
49
Second step (2004)
reform efforts initiated, e.g. linking GO to other
OBO ontologies to ensure orthogonality
GO + Cell type = New Definition
Cell type:
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone."
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
New Definition:
Osteoblast differentiation: Processes whereby an
osteoprogenitor cell or a cranial neural crest cell
acquires the specialized features of an osteoblast, a
bone-forming cell which secretes extracellular matrix.
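The Cell Ontology entry on this slide is a tag-value stanza, and a minimal Python sketch shows how such a stanza becomes machine-readable. The parser below is hand-rolled for this one example and assumes one `tag: value` pair per line; it is not a full OBO format parser and does not use any existing OBO library.

```python
from collections import defaultdict

STANZA = '''\
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone."
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
'''


def parse_stanza(text):
    """Collect tag-value lines into a dict mapping each tag to a list of values."""
    tags = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ": " only, since identifiers such as CL:0000062 contain colons.
        tag, _, value = line.partition(": ")
        tags[tag].append(value)
    return dict(tags)


stanza = parse_stanza(STANZA)
develops_from = [
    value.split()[1]
    for value in stanza["relationship"]
    if value.startswith("develops_from ")
]
```

With the stanza parsed, the two `develops_from` targets can be looked up in the Cell Ontology to supply the cell types named in the new GO definition.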
50
Third step (2006)
The OBO Foundry
http://obofoundry.org/
51
A prospective standard
designed to guarantee interoperability of ontologies from
the very start (contrast to: post hoc mapping)
established March 2006
12 initial candidate OBO ontologies – focused primarily on
basic science domains
several being constructed ab initio
by influential consortia who have the authority to impose
their use on large parts of the relevant communities.
52
The OBO Foundry
undergoing rigorous reform:
GO Gene Ontology
ChEBI Chemical Ontology
CL Cell Ontology
FMA Foundational Model of Anatomy
PaTO Phenotype Quality Ontology
SO Sequence Ontology
new:
CARO Common Anatomy Reference Ontology
CTO Clinical Trial Ontology
FuGO Functional Genomics Investigation Ontology
PrO Protein Ontology
RnaO RNA Ontology
RO Relation Ontology
53
Ontology | Scope | URL | Custodians
Cell Ontology (CL) | cell types from prokaryotes to mammals | obo.sourceforge.net/cgi-bin/detail.cgi?cell | Jonathan Bard, Michael Ashburner, Oliver Hofman
Chemical Entities of Biological Interest (ChEBI) | molecular entities | ebi.ac.uk/chebi | Paula Dematos, Rafael Alcantara
Common Anatomy Reference Ontology (CARO) | anatomical structures in human and model organisms | (under development) | Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
Foundational Model of Anatomy (FMA) | structure of the human body | fma.biostr.washington.edu | JLV Mejino Jr., Cornelius Rosse
Functional Genomics Investigation Ontology (FuGO) | design, protocol, data instrumentation, and analysis | fugo.sf.net | FuGO Working Group
Gene Ontology (GO) | cellular components, molecular functions, biological processes | www.geneontology.org | Gene Ontology Consortium
Phenotypic Quality Ontology (PaTO) | qualities of anatomical structures | obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value | Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
Protein Ontology (PrO) | protein types and modifications | (under development) | Protein Ontology Consortium
Relation Ontology (RO) | relations | obo.sf.net/relationship | Barry Smith, Chris Mungall
RNA Ontology (RnaO) | three-dimensional RNA structures | (under development) | RNA Ontology Consortium
Sequence Ontology (SO) | properties and features of nucleic sequences | song.sf.net | Karen Eilbeck
54
The OBO Foundry
GOALS
• to provide a FRAMEWORK OF RULES to
counteract the current policy of ad hoc creation
of new ontologies by each clinical research group
• REUSABILITY: if data-schemas are formulated
using a single well-integrated framework
ontology system in widespread use, then this
data will to this degree itself become more
widely accessible and usable
55
The OBO Foundry
GOALS
• to serve as BENCHMARK FOR
IMPROVEMENTS: once a system of
interoperable reference ontologies is there, it will
make sense to calibrate existing terminologies in
its terms in order to achieve more robust
alignment and greater domain coverage
56
Gold standard
Two aspects:
1. an expression of practice carried out perfectly
(for example, the optimal therapy for a given
medical problem)
2. based on complete acceptance or consensus:
everyone qualified to render a judgement would
agree to what the gold standard is.
Friedman CP, Wyatt J. Evaluation Methods in Medical Informatics
57
Gold standards
are worth approximating. That is, “tarnished” or
“fuzzy” standards are better than no standards
at all. ... studies comparing the performance of
information resources against imperfect
standards, so long as the degree of imperfection
has been estimated, represent a stronger
approach than those that bypass the issue of a
standard altogether.
Friedman CP, Wyatt J. Evaluation Methods in Medical Informatics
58
Gold standards
can also be partial: to serve ontology matching
and evaluation it is enough to have ontologies
comprehending even selected aspects of
biomedical reality, provided the assertions
contained in these ontologies are universally
true
in non-closed worlds, gold standards will always
be partial
in complex disciplines gold standards will always
be evolving
59
the constraint of universality
OBO Foundry ontologies accept only those relations
between their terms which obtain universally (= for
all instances)
lung is_a anatomical structure
lobe of lung part_of lung
Compare:
electrons have a negative electric charge
electrons have a negative electric charge of
1.6 × 10^-19 coulomb
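The constraint of universality for a relation such as part_of can be sketched in Python as an all-some check over instance data: the type-level assertion is admissible only if every instance of the first type stands in the relation to some instance of the second. The function name, the instance identifiers, and the tumor counterexample are hypothetical. A finite sample like this can refute a universal claim but cannot establish one.

```python
def holds_universally(part_type, whole_type, instance_of, part_of):
    """True if every known instance of part_type is part_of some instance of whole_type."""
    parts = [i for i, t in instance_of.items() if t == part_type]
    if not parts:
        return False  # no instances at all, so nothing supports the assertion
    return all(
        any(instance_of.get(whole) == whole_type for whole in part_of.get(part, ()))
        for part in parts
    )


# Hypothetical instance-level data.
instance_of = {
    "lobe#1": "lobe of lung", "lobe#2": "lobe of lung",
    "lung#1": "lung", "lung#2": "lung",
    "tumor#1": "tumor", "tumor#2": "tumor",
    "liver#1": "liver",
}
part_of = {
    "lobe#1": ["lung#1"],
    "lobe#2": ["lung#2"],
    "tumor#1": ["lung#1"],
    "tumor#2": ["liver#1"],
}

lobe_ok = holds_universally("lobe of lung", "lung", instance_of, part_of)
tumor_ok = holds_universally("tumor", "lung", instance_of, part_of)
```

On this data "lobe of lung part_of lung" passes, while "tumor part_of lung" fails because one tumor instance is part of a liver.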
60
Principle of Low Hanging Fruit
Ontologies should include even absolutely
trivial assertions (assertions you know to be
universally true)
herpes virus is_a virus
Computers need to be led by the hand
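Why trivial assertions matter to a machine can be sketched in Python: a program sees only the is_a edges it has been given, so a query for viruses finds a herpes virus subtype only if "herpes virus is_a virus" was asserted. The function `ancestors` and the extra term "herpes simplex virus" are assumptions added for illustration.

```python
def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# With the trivial assertion included.
with_assertion = {
    "herpes simplex virus": ["herpes virus"],
    "herpes virus": ["virus"],
}
# The same ontology with the trivial assertion left out.
without_assertion = {
    "herpes simplex virus": ["herpes virus"],
}

found_with = "virus" in ancestors("herpes simplex virus", with_assertion)
found_without = "virus" in ancestors("herpes simplex virus", without_assertion)
```

A human reader infers the missing link from the name alone; the program cannot, which is the sense in which computers need to be led by the hand.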
61
if the standard is to work
it has to emulate the achievements of the SI system
of units
• simple
• controlled vocabulary
• wide acceptance
• uncontroversial
• allows cross-disciplinary, cross-experimenter
calibration
• my data can confirm or disconfirm your
hypothesis
62