ppt - Phenotype RCN

How to build cross-species
interoperable ontologies
Chris Mungall, LBNL
Melissa Haendel, OHSU
The challenge..
• There are many fun and interesting issues
involved in building and using cross-species
ontologies
– homology
– evo-devo
– reasoning using ontologies
– connecting genomics databases to phenotypes
but…
• Unfortunately, there are many more prosaic
issues with unsatisfying solutions
– multiple ontologies already exist
– limited cooperation between the developers of these
ontologies
– they differ widely in every aspect imaginable
– they are heavily embedded in existing databases and
applications and slow to change
– tools and infrastructure support falls short of what we
need
• FORTUNATELY, solutions are emerging..
Outline
• Anatomy Ontologies: Background
• Case studies
– GO: A unified cross-species ontology
– CL: Cell Ontology: Unifying multiple existing
efforts
• Building interoperable gross anatomy
ontologies
– (Melissa)
Ontologies
• Computable qualitative representations of some
part of the world
• Relationships with computable properties
– e.g. transitivity
– languages and formats like OWL and OBO have a formal
semantics
• Entities are grouped into classes
• Relationships are statements about all the
members of a class
– the most common form is the all-some statement
Ontologies are not smart
• Deductive Logic is not flexible
• Example
– Human knowledge:
• chromosomes are found in the nucleus
– Naïve ontology encoding:
• every chromosome part_of some nucleus
– But this is wrong
• Ontologies don’t make exceptions!
– Solution:
• (1) create location-specific subclasses
– nuclear chromosome
– mitochondrial chromosome
• (2) – invert statement: every nucleus has chromosomes
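The all-some reading can be checked mechanically against instance data. The following is a minimal Python sketch, assuming a toy set of instances and part_of links; every instance name in it is made up for illustration and none comes from a real ontology.

```python
# Toy instance data (hypothetical): instance -> class, plus part_of links
# between instances.
instance_class = {
    "chr1": "chromosome",
    "mt_chr": "chromosome",
    "nuc1": "nucleus",
    "mito1": "mitochondrion",
}
part_of = {("chr1", "nuc1"), ("mt_chr", "mito1")}


def is_part_of_some(instance, object_cls):
    """True if `instance` is part_of at least one instance of `object_cls`."""
    return any(
        (instance, other) in part_of and cls == object_cls
        for other, cls in instance_class.items()
    )


def counterexamples(subject_cls, object_cls):
    """Instances of subject_cls that break 'every S part_of some O'."""
    return [
        inst
        for inst, cls in instance_class.items()
        if cls == subject_cls and not is_part_of_some(inst, object_cls)
    ]


def all_some_holds(subject_cls, object_cls):
    """The all-some statement holds only if there are no counterexamples."""
    return not counterexamples(subject_cls, object_cls)


print(all_some_holds("chromosome", "nucleus"))   # False
print(counterexamples("chromosome", "nucleus"))  # ['mt_chr']
```

A single mitochondrial chromosome is enough to falsify the statement, which is why the slide proposes location-specific subclasses or inverting the statement.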
Existing Anatomy Ontologies
• Human AOs
• Model Organism AOs
• Domain specific AOs
• Cross-species AOs
FMA : Foundational Model of Anatomy
• Domain: adult human
– no develops_from relationships, few embryonic structures
• Size: large (70k+ classes)
• Language: frames
• Approach
– formal, Strict single inheritance, Purely structural perspective
– No computable definitions
– Heavily pre-coordinated
• “Trunk of communicating branch of zygomatic branch of right facial nerve with
zygomaticofacial branch of right zygomatic nerve”
• “Distal epiphysis of distal phalanx of right little toe”
– Extensive spatial relationships in selected areas
• e.g. veins, arteries
• Uses
– not designed for one particular use
FMA Example
/ FMA:62955 ! Anatomical entity
 is_a FMA:61775 ! Physical anatomical entity
  is_a FMA:67165 ! Material anatomical entity
   is_a FMA:67135 ! Anatomical structure
    is_a FMA:67498 ! Organ
     is_a FMA:55670 ! Solid organ
      is_a FMA:55661 ! Parenchymatous organ
       is_a FMA:55662 ! Lobular organ
        is_a FMA:13889 ! Pituitary gland
        is_a FMA:20020 ! Vestibular gland
        is_a FMA:55533 ! Accessory thyroid gland
        is_a FMA:58090 ! Areolar gland
        is_a FMA:59101 ! Lacrimal gland
        is_a FMA:62088 ! Lactiferous gland
        is_a FMA:7195 ! Lung
        is_a FMA:7197 ! Liver
        is_a FMA:7198 ! Pancreas
        is_a FMA:7210 ! Testis
        is_a FMA:76835 ! Accessory pancreas
        is_a FMA:9597 ! Salivary gland
        is_a FMA:9599 ! Bulbo-urethral gland
        is_a FMA:9600 ! Prostate
        is_a FMA:9603 ! Thyroid gland
Model Organism Anatomy Ontologies
• Typically species-centric
– FBbt: Drosophila melanogaster
– WBbt: C. elegans
– ZFA: Danio rerio
– XAO: Xenopus
– MA: Adult Mouse (no develops_from)
– EMAP/EMAPA: developing mouse
• Uses
– primarily gene expression, also phenotype description
– others: Virtual Fly Brain, Phenoscape
• Approach:
– use-case driven
– practicality over formality
– No computable definitions
• (exception FBbt)
Other anatomy ontologies
• Developing human
– EHDAA2
• Vectors
– TGMA – mosquito
– TADS – tick
• Upper ontologies
– CARO
– AEO
• Domain-specific anatomy ontologies
– NIF_Anatomy, NIF_Cell – neuroscience
• Phylogenetic or multi-taxon AOs
– HAO – Hymenoptera
– PO – plant
– TAO – teleost
– AAO – amphibian
– SPD – spider
– …
– we will return to these later..
Problem
• These AOs are not developed in a coordinated
fashion
– use of a shared upper ontology does not buy us much
– even the 3 mammalian AOs are massively different
• Data annotated using these ontologies effectively
becomes siloed
• There is redundancy of effort in areas of shared
biology
• Are there lessons from existing ontologies?
Building ontologies that are
interoperable across species
• Case Studies
– GO
– Cell Ontology
Gene Ontology
• Covers all kingdoms of life
– viruses, bacteria, archaea
– fungi, metazoans, plants
• Covers biology at different scales
• Issues
– terminological confusion (e.g. “blood”)
– large, difficult to maintain
How does GO deal with taxonomic
variation?
• What GO says:
– every nucleus is part_of some cell
• What GO does not say:
– every cell has_part some nucleus
• wrong for bacteria (and mammalian erythrocytes)
• Take home:
– Logical quantifiers are essential to understanding the
ontology
– Saying what something is part of is safer than saying
what its parts are
Principle: avoidance of taxonomic
differentia
• Not in GO:
– vertebrate eye development
– insect eye development
– cephalopod eye development
• In GO:
– eye development
• camera-type eye development
• compound eye development
(no implication of homology)
• Exceptions for usability:
– cell wall
• fungal-type cell wall [differentia:cross-linked glycoproteins and carbohydrates, chitin /
beta-glucan …]
• plant-type cell wall [differentia: cellulose, pectin, …]
The problem of vagueness in GO
• “limb development”
• “wing development”
Adding taxonomic constraints to GO
• GO now includes two additional relations
– only_in_taxon
– never_in_taxon
– See:
• Kusnierczyk, W: Taxonomy-based partitioning of the
Gene Ontology, JBI 2008
• Deegan et al: Formalization of taxon-based constraints
to detect inconsistencies in annotation and ontology
development, BMC Bioinformatics 2010
Examples
• lactation only_in_taxon Mammalia (NCBITaxon:40674 )
– OWL: lactation in_taxon only Mammalia
• odontogenesis never_in_taxon Aves (NCBITaxon:8782)
– OWL: odontogenesis in_taxon only not Aves
• chloroplast only_in_taxon (Viridiplantae or Euglenozoa)
(NCBITaxon:33090 or NCBITaxon:33682)
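A taxon constraint of this kind can be tested against an annotation by walking up the taxonomy from the annotated species. This is a minimal Python sketch under stated assumptions: the taxonomy is a tiny hand-picked slice rather than the real NCBI tree, the function names are hypothetical, and propagation of constraints to subclasses and parts of a term is left out.

```python
# Toy taxonomy slice (child -> parent); illustrative only, not the real NCBI tree.
PARENT = {
    "Gallus gallus": "Aves",
    "Aves": "Vertebrata",
    "Mus musculus": "Mammalia",
    "Mammalia": "Vertebrata",
    "Vertebrata": "Metazoa",
}

# Constraints copied from the examples on this slide.
ONLY_IN_TAXON = {"lactation": {"Mammalia"}}
NEVER_IN_TAXON = {"odontogenesis": {"Aves"}}


def lineage(taxon):
    """Return the taxon together with all of its ancestors in the toy tree."""
    out = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        out.add(taxon)
    return out


def violates(term, taxon):
    """True if annotating a gene from `taxon` with `term` breaks a constraint."""
    ancestors = lineage(taxon)
    allowed = ONLY_IN_TAXON.get(term)
    if allowed is not None and not (ancestors & allowed):
        return True  # only_in_taxon: species lies outside every allowed taxon
    return bool(ancestors & NEVER_IN_TAXON.get(term, set()))  # never_in_taxon


# Hypothetical annotations as (gene, GO term, species).
annotations = [
    ("chicken gene A", "lactation", "Gallus gallus"),
    ("mouse gene B", "lactation", "Mus musculus"),
    ("chicken gene C", "odontogenesis", "Gallus gallus"),
]
flagged = [a for a in annotations if violates(a[1], a[2])]
print(flagged)
```

Only the two chicken annotations are flagged; the mouse lactation annotation passes because Mammalia is in the mouse lineage.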
Uses of taxon relationships
1. Clarifying meaning of GO terms
2. Detection of errors in electronic and manual
annotation
• Automated reasoners
• GO previously had chicken genes involved in
lactation, slime mold genes involved in fin
regeneration…
3. Providing views over GO
• e.g. subset of GO excluding terms that are never in
Drosophila
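Use 3, a taxon-specific view, amounts to filtering the term list by the same constraints. A minimal Python sketch, assuming a toy lineage for Drosophila and a hypothetical `taxon_view` helper; the constraints are the ones quoted in the earlier examples, and the term list is a small hand-made sample.

```python
# Toy lineage (child -> parent); illustrative only, not the real NCBI tree.
PARENT = {
    "Drosophila melanogaster": "Insecta",
    "Insecta": "Arthropoda",
    "Arthropoda": "Metazoa",
}

# Constraints quoted in the examples slide.
ONLY_IN = {
    "lactation": {"Mammalia"},
    "chloroplast": {"Viridiplantae", "Euglenozoa"},
}
NEVER_IN = {"odontogenesis": {"Aves"}}


def lineage(taxon):
    """Return the taxon together with all of its ancestors in the toy tree."""
    out = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        out.add(taxon)
    return out


def taxon_view(terms, taxon):
    """Keep only the terms that the constraints allow for `taxon`."""
    ancestors = lineage(taxon)
    view = []
    for term in terms:
        allowed = ONLY_IN.get(term)
        if allowed is not None and not (ancestors & allowed):
            continue  # excluded by only_in_taxon
        if ancestors & NEVER_IN.get(term, set()):
            continue  # excluded by never_in_taxon
        view.append(term)
    return view


terms = ["eye development", "compound eye development", "lactation", "chloroplast"]
fly_view = taxon_view(terms, "Drosophila melanogaster")
print(fly_view)  # ['eye development', 'compound eye development']
```

Terms without any constraint stay in the view, so the result is only as precise as the constraints that have been asserted.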
Scalability of single-ontology
approach: GO
• How does GO cope with wide taxonomic diversity?
– conservation at molecular level, wide diversity of
phenotypes at level of gross anatomical development,
physiology, and organismal behavior
• GO Development
– Focused on model systems
• “beak development” added only recently
• GO Behavior
– Very broad coverage
– Some specific terms, e.g. drosophila courtship
Proposal: outsource portions of the
ontology
Ontology Views
• Ontologies, traditional
– independent standalone resources
• Ontologies, new
– interconnected resources
– multiple views possible
• Subsetting
• Aggregation
• Subsetting + Aggregation
– views can be manually specified (e.g. go slims) or
automatically constructed
– Limited re-writing possible
• e.g. names
Views
[Figure: kinds of ontology view. Labels: “slim” subset; aggregate; aggregate+subset; scattered subset; domain/taxon-specific cut; subset of GO; vertebrate subset.]
Outline
• Case studies
– GO: A unified cross-species ontology
– Cell Ontology: Unifying multiple existing efforts
• Gross Anatomy
Cell types
• GO-Cell Component
– cell parts
• CL – cell ontology
• Anatomical Ontologies
– Includes cell types:
• FBbt (Drosophila)
• WBbt (C elegans)
• ZFA, TAO (Danio rerio, Teleost)
• FMA (Human)
• PO (Plant)
• FAO (Fungi)
– Excludes cell types:
• MA (adult mouse)
• EMAPA (developing mouse)
• EHDAA2 (developing human)
Overlap (simplified view)
[Figure: simplified overlap between CL, ZFA, NIF, PO, FMA and MA. Example terms: brain, cell, plant spore, alveolar macrophage, neuron, lung.]
The Problem
• Duplicated work
• No unified view
• Confusion for users
• Confusion for annotators
Alternative proposals
1. LUMP: Combine into one monolithic CL
ontology
2. SPLIT: Taxon-specific cell types in taxon-centric ontologies (tcAOs)
a) Obsolete generic cell types currently in tcAOs
-vs-
b) Taxon-specific subclasses of generic cell types
LUMP
[Figure: lumped hierarchy under “all cells”. Labels: fish, plants, human, mouse. Example cell types: plant spore, alveolar macrophage, neuron.]
CL Lumping proposal
• Advantages:
– one stop shopping for CL
• (but this can be done with aggregate views)
• Disadvantages
– tcAO IDs well-established
– Little advantage to lumping plant cells with animal
cells
– Harder to manage editorially
– Cross-granular relationships
(Partial) Splitting proposal
• Advantages:
– Easier to manage
– Sensible subdivision of labor:
• Common cell types in shared common cell ontology
– e.g. shared definition of “neuron”
• Taxon-specific subtypes in taxon-centric ontologies
• Disadvantages
– Aggregate view is problematic
• union of ontologies contains multiple classes labeled “neuron”
– Can be solved by obsoleting existing generic cell classes in
tcAOs and replacing by CL IDs
• problem: cross-granular relationships
Current solution for CL: split and retain
IDs
• Any cell type shared by two model taxa should
be in CL
• tcAOs retain both generic and specific cell
type classes
– Formally connected to CL via subclass
relationships
• or even stronger: taxon-specific equivalent
Example aggregate view
[Figure: aggregate view CL-metazoa over FMA, CL and FBbt. Nodes: cell, muscle cell, muscle organ, frontal pulsatile organ muscle. Edge labels: i, p.]
Example aggregate+subset view
[Figure: aggregate+subset view CL-metazoa over FMA, CL and FBbt. Nodes: cell, muscle cell, frontal pulsatile organ muscle. Edge label: i.]
Who maintains the connections and
how?
• How:
– maintained as xrefs for
convenience
• Who:
– either tcAO or CL
• Synchronization?
– hard
– reasoning over aggregate
view
Who maintains the connections?
! cl.obo (cl’s responsibility)
[Term]
id: CL:0000584
name: enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:mhl]
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell

! zfa.obo (zfa’s responsibility)
[Term]
id: ZFA:0009269
name: enterocyte
namespace: zebrafish_anatomy
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:curator]
synonym: "enterocytes" EXACT PLURAL []
xref: CL:0000584
xref: TAO:0009269
xref: ZFIN:ZDB-ANAT-070308-209
is_a: ZFA:0009143 ! brush border epithelial cell
relationship: end ZFS:0000044 ! Adult
relationship: part_of ZFA:0005124 ! intestinal epithelium
relationship: start ZFS:0000000 ! Unknown
Issues with aggregate view
[Figure: aggregate view over FMA, CL and FBbt with two issues called out: “lattices = hairballs” and “duplicate names”. Nodes: cell, muscle cell, frontal pulsatile organ muscle. Edge label: i.]
Duplicate names
• Searching for “muscle cell” returns
– CL:0000187 ! muscle cell
– FBbt:00005074 ! muscle cell
– FMA:67328 ! muscle cell
– ZFA:0009114 ! muscle cell
– NIF_Cell:sao519252327 ! Muscle Cell
• Proposed solutions
1. rename in source ontology
• yuck
2. make end-user applications smarter
• not practical for n applications
3. auto-rename in ontology view
• best solution
Aggregate view
! cl-metazoa.obo
[Term]
id: CL:0000584
name: enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:mhl]
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell

[Term]
id: ZFA:0009269
! rewritten name (or syn – TBD)
name: zebrafish enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:curator]
synonym: "enterocytes" EXACT PLURAL []
xref: CL:0000584
xref: TAO:0009269
xref: ZFIN:ZDB-ANAT-070308-209
! generated from xref
is_a: CL:0000584 ! enterocyte
! lattice: the two is_a lines
is_a: ZFA:0009143 ! brush border epithelial cell
relationship: end ZFS:0000044 ! Adult
relationship: part_of ZFA:0005124 ! intestinal epithelium
relationship: start ZFS:0000000 ! Unknown

! FMA class not shown, but it would also subclass
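The rewriting shown on this slide (prefix the name, turn the CL xref into an is_a link) can be sketched in a few lines. This is a minimal Python illustration and not the tooling actually used to build cl-metazoa.obo: the parser handles only simple `tag: value` lines inside [Term] stanzas, and the `PREFIX_LABEL` table and function names are assumptions made for the example.

```python
CL_OBO = """\
[Term]
id: CL:0000584
name: enterocyte
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell
"""

ZFA_OBO = """\
[Term]
id: ZFA:0009269
name: enterocyte
xref: CL:0000584
xref: TAO:0009269
is_a: ZFA:0009143 ! brush border epithelial cell
"""

# Assumption: how an ID prefix maps to the label used when renaming.
PREFIX_LABEL = {"ZFA": "zebrafish"}


def parse_terms(text):
    """Parse [Term] stanzas into lists of (tag, value) pairs; skip the rest."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = []
            terms.append(current)
        elif line.startswith("["):
            current = None  # other stanza types are ignored in this sketch
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.append((tag, value))
    return terms


def aggregate(cl_terms, tcao_terms):
    """Merge CL and tcAO terms, renaming tcAO terms and adding is_a from xrefs."""
    cl_names = {dict(t)["id"]: dict(t)["name"] for t in cl_terms}
    merged = [list(t) for t in cl_terms]
    for term in tcao_terms:
        prefix = dict(term)["id"].split(":")[0]
        rewritten = []
        for tag, value in term:
            if tag == "name" and prefix in PREFIX_LABEL:
                value = f"{PREFIX_LABEL[prefix]} {value}"  # auto-rename
            rewritten.append((tag, value))
        for tag, value in term:
            if tag == "xref" and value in cl_names:
                # is_a generated from the xref to CL
                rewritten.append(("is_a", f"{value} ! {cl_names[value]}"))
        merged.append(rewritten)
    return merged


def to_obo(terms):
    """Serialize terms back to OBO-style stanzas."""
    return "\n\n".join(
        "[Term]\n" + "\n".join(f"{tag}: {value}" for tag, value in term)
        for term in terms
    )


view = aggregate(parse_terms(CL_OBO), parse_terms(ZFA_OBO))
print(to_obo(view))
```

Because the source files are left untouched, renaming happens only in the generated view, which matches solution 3 (auto-rename in ontology view).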
Summary: taxon variation in CL
• Current solution is a compromise
– Constraints
• integrate with pre-existing tcAO ontologies
• these ontologies have links to gross anatomy
– tcAOs loosely integrated with CL
– plant cell types should be left to PO
– Synchronization remains a challenge
Lessons for gross anatomy
[Figure: anatomy ontologies layered by taxonomic level, joined by imports and (sample) cross-ontology links.
Levels: caro / all, metazoa, mollusca, cephalopod, arthropoda, drosophila, vertebrata, teleost, amphibia, mammalia, mouse, zebrafish, human.
Example terms: cell, tissue, nervous system, gut, circulatory system, gland, mantle, shell, foot, tentacle, brachial lobe, gonad, mushroom body, mesoderm, respiratory airway, neuron types XYZ, muscle tissue, larva, skeletal tissue, trachea, limb, vertebra, bone, fin, tibia, vertebral column, cuticle, antenna, skeleton, appendage, parietal bone, mesonephros, weberian ossicle, mammary gland, tibiafibula, NO pons.]
Conclusions
• Historically anatomy ontologies have been
developed by different groups largely in
isolation
• The Phenotype RCN should coordinate these
efforts
• Dynamic Views
• Explicit taxonomic relationships
• end
• Melissa Here
Idealized model (M0)
• A single ontology for ontology editors and
consumers
• Different editors have editing rights to different
ontology partitions
– by taxon
– by domain (e.g. neuroscience, skeletal anatomy)
• No taxon-specific subtypes
– use structure, function etc as differentia
• Users obtain dynamic views according to their
needs
Example M0
[Figure: a single ontology with overlapping user/editor views and a small sample of links.
Views: mollusc view, neuro view, mammalian view, skeletal view.
Example terms: ventral nerve cord, cell, gut, larva, mollusc foot, appendage, respiratory airway, nervous system, mantle, tissue, mesoderm, circulatory system, gonad, gland, muscle tissue, trachea, limb, vertebra, vertebral column, pons, mushroom body, metencephalon, mesonephros, mammary gland, antenna, tentacle, brachial lobe, fin, pupal DN3 period neuron, weberian ossicle, tibiafibula, skeletal tissue, bone, tibia, parietal bone.]
Slightly less idealized model (M1)
• Maintain series of ontologies at different
taxonomic levels
– euk, plant, metazoan, vertebrate, mollusc, arthropod,
insect, mammal, human, drosophila
• Each ontology imports/MIREOTs relevant subset
of ontology “above” it
– this is recursive
• Subtypes are only introduced as needed
• Work together on commonalities at appropriate
level above your ontology
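The recursive import in M1 can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical three-level stack whose terms are borrowed from the octopus example; the dictionary layout and the `visible_terms` name are invented for the sketch and do not describe MIREOT or any existing tool.

```python
# Hypothetical stack of ontologies. Each entry names the ontology "above" it,
# the subset of terms it imports from there, and the terms it defines itself.
ONTOLOGIES = {
    "metazoa": {
        "above": None,
        "imports": set(),
        "own": {"nervous system", "muscle tissue"},
    },
    "mollusca": {
        "above": "metazoa",
        "imports": {"nervous system", "muscle tissue"},
        "own": {"mantle"},
    },
    "cephalopod": {
        "above": "mollusca",
        "imports": {"nervous system", "mantle"},
        "own": {"tentacle", "brachial lobe"},
    },
}


def visible_terms(name):
    """Terms usable in ontology `name`: its own plus the imported subset.

    The ontology above is resolved first (recursively), so a term defined two
    levels up is available only if the level in between imported it too.
    """
    onto = ONTOLOGIES[name]
    terms = set(onto["own"])
    if onto["above"] is not None:
        available = visible_terms(onto["above"])
        missing = onto["imports"] - available
        if missing:
            raise ValueError(f"{name} imports terms not available above: {missing}")
        terms |= onto["imports"]
    return terms


print(sorted(visible_terms("cephalopod")))
```

In this toy setup "muscle tissue" is visible in mollusca but not in cephalopod, because the cephalopod ontology chose not to import it; that is the "relevant subset" behaviour of the model.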
Example M1
[Figure: anatomy ontologies layered by taxonomic level, joined by imports and (sample) cross-ontology links.
Levels: caro / all, metazoa, mollusca, cephalopod, arthropoda, drosophila, vertebrata, teleost, amphibia, mammalia, mouse, zebrafish, human.
Example terms: cell, tissue, nervous system, gut, circulatory system, gland, mantle, shell, foot, tentacle, brachial lobe, gonad, mushroom body, mesoderm, respiratory airway, neuron types XYZ, muscle tissue, larva, skeletal tissue, trachea, limb, vertebra, bone, fin, tibia, vertebral column, cuticle, antenna, skeleton, appendage, parietal bone, mesonephros, weberian ossicle, mammary gland, tibiafibula, NO pons.]
Objections to M1
• Biological
– homology vs analogy
– functional grouping classes
• e.g. respiratory airway, eye
• Practical
– tools
– what about existing AOs?
• new AOs should be designed for integration from the
ground up
Protocol for new AOs
1. Collect draft list of terms
2. subdivide roughly into applicability at taxonomic
levels
3. request new terms from existing AOs above you
4. is a new mid-level AO required?
• yes – collaborate and create, go to 1.
5. import subset from next AO above
6. Build your ontology
Example: the octopus ontology
• Collect and subdivide terms:
– cephalopod: tentacle, brachial lobe, subesophageal mass,
beak, visceropericardial coelom, swim bladder
– mollusc: mantle
– metazoan: nervous system, muscle tissue
• Mollusc anatomy ontology does not exist
– either: (i) find collaborators and create
– or: (ii) keep mollusc terms in your ontology for now, but
mark them as possibly migrating upwards
• Import terms from the mollusc AO in case (i), or from the metazoan AO in case (ii),
where there is no mollusc AO
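The triage in this example can be written down as a small routine. This is a minimal Python sketch under stated assumptions: the term-to-level assignments repeat the slide, a metazoan-level AO is assumed to exist while a mollusc AO does not (option ii), and the `plan` function is hypothetical.

```python
# Draft terms with the taxonomic level at which each one applies (from the slide).
DRAFT_TERMS = {
    "tentacle": "cephalopod",
    "brachial lobe": "cephalopod",
    "mantle": "mollusc",
    "nervous system": "metazoan",
    "muscle tissue": "metazoan",
}

# Assumption for this sketch: only a metazoan-level AO exists above us.
EXISTING_AOS = {"metazoan"}


def plan(draft_terms, existing_aos, own_level):
    """Split draft terms into three groups.

    keep        -- terms at our own level, defined in the new ontology
    request     -- terms to request or import from an existing AO above us
    provisional -- terms whose level has no AO yet; kept locally for now and
                   marked as possibly migrating upwards (option ii)
    """
    keep, request, provisional = [], {}, []
    for term, level in draft_terms.items():
        if level == own_level:
            keep.append(term)
        elif level in existing_aos:
            request.setdefault(level, []).append(term)
        else:
            provisional.append(term)
    return keep, request, provisional


keep, request, provisional = plan(DRAFT_TERMS, EXISTING_AOS, "cephalopod")
print(keep)         # ['tentacle', 'brachial lobe']
print(request)      # {'metazoan': ['nervous system', 'muscle tissue']}
print(provisional)  # ['mantle']
```

Adding "mollusc" to `EXISTING_AOS` models option (i): "mantle" then moves from the provisional list to a request against the mollusc AO.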
How are things organized now?
• 3 examples:
– PO
– TAO/ZFA
– Uberon
• In Melissa’s talk
Some AOs are cross-granular
[Figure: FMA fragment spanning three granularity levels (subcellular; cell; tissue and gross anatomy). Nodes: protoplasm, cell, muscle cell, muscle organ. Edge labels: i, p.]
Cross-granular relationships
[Figure: FMA fragment. Nodes: cell, muscle cell, muscle organ. Edge labels: i, p.]
Cross-granular relationships
[Figure: FMA fragment (cell, muscle cell, muscle organ) alongside CL (cell, muscle cell). Edge labels: i, p.]
Obsoleting generic classes in tcAOs
[Figure: FMA fragment (cell, muscle cell, muscle organ) alongside CL (cell, muscle cell). Edge labels: i, p.]
Migrating cross-granular relationships
[Figure: FMA and CL fragments. Nodes: cell, muscle cell, muscle organ. Edge labels: i, p.]
“true path” violations
[Figure: FMA, CL and FBbt fragments. Nodes: cell, muscle cell, muscle organ, frontal pulsatile organ muscle. Edge labels: i, p.]
fix
[Figure: FMA, CL and FBbt fragments after the fix. Nodes: cell, muscle cell, muscle organ, frontal pulsatile organ muscle. Annotation: “muscle cell AND part of some human”. Edge labels: i, p.]
PO: Plants
• Single unified ontology for all plants
– cell types and gross anatomy
• Generalized from ontology of flowering plants
TAO and ZFA
• Teleost and Zebrafish
Uberon
• Designed to unify existing tcAOs
• Uses modern ontology development techniques
– heavily axiomatized = less work for humans, leave it to
reasoners
• automated QC
• automated classification
• Current size: 5k+ classes
• Multiple relationship types
• Links to and from GO, CL
• Aggregate views possible using xrefs maintained in
Uberon
Uberon lessons
• Original Design Goals
– Unify metazoan tcAOs for cross-species phenotype queries
– Seed initial version from text matching
• Was this a good idea?
– metazoans are fairly diverse
• many original dubious grouping classes have been eliminated or split
• functional grouping classes remain
• tissues, germ layers, etc. less controversial
• Uberon is really a vertebrate AO in which we’ve added placeholder metazoan
terms
– labels are misleading
• high false +ve, false –ve from text matching
• starting from textbook comparative anatomy knowledge would have been
better (given time)