Week 8 * Using Ontologies in Biomedical Research

MED267
Modeling Clinical Data and Knowledge
for Computation
Week 9 – Using Ontologies in
Biomedical Research
Amarnath Gupta
A BRIEF RECAP
SOME PRELIMINARIES AND SOME NOT-SO-PRELIMINARIES
Ontology
• A formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts.
Often expressed in a language called OWL-DL
◦ Classes: sets, collections, concepts, classes in programming, types of objects,
or kinds of things
◦ Attributes: aspects, properties, features, characteristics, or parameters that
objects (and classes) can have
◦ Relations: ways in which classes and individuals can be related to one another
◦ Individuals: instances or objects (the basic or "ground level" objects)
◦ Restrictions: formally stated descriptions of what must be true in order for
some assertion to be accepted as input
◦ Rules: statements in the form of an antecedent-consequent sentence that
describe the logical inferences that can be drawn from an assertion in a
particular form
◦ Axioms: assertions (including rules) in a logical form that together comprise
the overall theory in its domain of application.
An ontology can be viewed as a graph
with an acyclic backbone and a logical interpretation
Querying ontologies


An ontology is a 2-graph system
◦ Class graph
◦ Instance graph

Reasoner Queries
◦ Inferencing
◦ Classification
◦ Consistency

Data Queries
◦ Binding retrieval
◦ Subgraph retrieval

SPARQL 1.0
◦ An edge query language

SPARQL 1.1
◦ An edge query language with regular expressions on edges

OWL-QL
◦ DL query language

Rule Language
◦ SWRL

Emerging trends
◦ Keyword query languages
◦ Subgraph query languages
We will revisit the query language issue as we go forward
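To make the difference between a single-edge query and an edge query with regular expressions on edges more concrete, here is a minimal sketch in plain Python. It mimics what a one-step edge pattern and a "zero or more steps" path pattern would retrieve over a class graph. The edge list and the class names are made up for illustration; a real system would run SPARQL against a triple store rather than these hypothetical helper functions.

```python
from collections import deque

# Hypothetical class graph as (subject, predicate, object) edges.
EDGES = [
    ("plasma_membrane", "subClassOf", "cell_component"),
    ("cell_component", "subClassOf", "independent_continuant"),
    ("independent_continuant", "subClassOf", "continuant"),
    ("plasma_membrane", "has_part", "phospholipid_bilayer"),
]


def one_edge(subject, predicate):
    """Single edge pattern: objects reachable in exactly one step."""
    return {o for s, p, o in EDGES if s == subject and p == predicate}


def zero_or_more(subject, predicate):
    """Path pattern 'predicate*': nodes reachable in zero or more steps."""
    seen = {subject}
    queue = deque([subject])
    while queue:
        node = queue.popleft()
        for nxt in one_edge(node, predicate):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    print(one_edge("plasma_membrane", "subClassOf"))
    print(zero_or_more("plasma_membrane", "subClassOf"))
```

The one-step query returns only the direct superclass, while the path query also returns every ancestor, which is the kind of result a plain edge pattern cannot express in a single query.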
Upper Ontologies
• An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. It employs a core glossary that contains the terms, associated object properties and relationships as they are used in various relevant domain sets.
◦ We have used the Basic Formal Ontology (BFO, http://www.ifomis.org/bfo/publications) and the Relation Ontology (RO) for our work
 Classification and Differentiation
Example definition: plasma membrane is a cell component that has as its parts a maximal phospholipid bilayer in which instances of two or more types of protein are embedded.
 Continuants and Occurrents
 Standardizing Relationships
 Temporal parameter
Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall CJ, Neuhaus F, Rector A, Rosse C. Relations in Biomedical Ontologies. Genome Biology, 2005.
BFO
◦ Continuant
 Independent Continuant (thing)
 Dependent Continuant (quality), e.g. temperature depends on its bearer
◦ Occurrent (process, event)
The Open Biomedical Ontologies (OBO) Foundry
Ontologies arranged by relation to time (continuant, independent or dependent; occurrent) and by granularity:
ORGAN AND ORGANISM
◦ Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
◦ Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
◦ Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
◦ Independent continuant: Cell (CL); Cellular Component (FMA, GO)
◦ Dependent continuant: Cellular Function (GO)
MOLECULE
◦ Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
◦ Dependent continuant: Molecular Function (GO)
◦ Occurrent: Molecular Process (GO)
Merging Ontologies
◦ Any real application needs to make use of
multiple ontologies
◦ Often the strategy is to construct a specific
ontology by assembling elements of multiple
ontologies
◦ What happens if
 One ontology uses an upper ontology (say BFO) and
another doesn’t
 One ontology uses a fixed set of relationships and
another doesn’t
OBI – An Ontology for Biomedical
Investigations

OBI models experiments
Some material entities in OBI
◦ material entity
 processed material (e.g. cell culture, chemical entities in solution, PCR product)
 device
 organization
◦ Material entities imported from other ontologies
 molecular entity (ChEBI)
 protein complex (Gene Ontology)
 cell (Cell Ontology)
 anatomical entity (FMA, CARO)
 organism (NCBI taxonomy)
MIREOT
Minimal Information to Reference External Ontology Terms

The idea
◦ The minimal set of information that allows a term to be identified unambiguously
 URI of the class
 URI of the source ontology
 Superclass of the term in the source ontology
 Position in the target ontology
◦ Additional useful information
• Label
• Definition
• Other annotations: adding “human-readable” information
• Superclasses: for example, NCBI taxonomy
Problem
◦ Lose complete inference
 But because the imported ontology might not be commensurate with the base ontology, the inferences are questionable
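As an illustration of how small a MIREOT-style reference is, the sketch below stores the four minimal fields plus optional annotations in a plain Python record. The class name `MireotReference`, its field names, and the example.org URIs are all hypothetical; this is not the format used by any actual MIREOT tooling.

```python
from dataclasses import dataclass, field


@dataclass
class MireotReference:
    """Hypothetical record for one externally referenced term."""
    term_uri: str               # URI of the class
    source_ontology_uri: str    # URI of the source ontology
    source_superclass_uri: str  # superclass of the term in the source ontology
    target_parent_uri: str      # position in the target ontology
    annotations: dict = field(default_factory=dict)  # label, definition, ...

    def is_minimal(self) -> bool:
        """True when all four identifying fields are filled in."""
        return all([
            self.term_uri,
            self.source_ontology_uri,
            self.source_superclass_uri,
            self.target_parent_uri,
        ])


if __name__ == "__main__":
    ref = MireotReference(
        term_uri="http://example.org/source#Term_1",
        source_ontology_uri="http://example.org/source",
        source_superclass_uri="http://example.org/source#Parent_1",
        target_parent_uri="http://example.org/target#organism",
        annotations={"label": "example term"},
    )
    print(ref.is_minimal())
```

Note that nothing in such a record carries the axioms of the source ontology, which is why complete inference is lost.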
Modularization of Ontologies

A set of principles for
◦ Decomposing a larger ontology into smaller meaningful components
◦ Assimilating a set of component ontologies into a larger ontology
◦ Modules must
 Have semantic locality
 Preserve loose coupling and autonomy
 Enable partial reuse of knowledge
 Preserve directionality of knowledge import
 Ensure scalability
Uberon – an integrated multi-species
anatomy ontology
Using taxonomic
constraints

Over 6,500 classes representing anatomical entities

Represents structures in a species-neutral way and
includes extensive associations to existing species-centric anatomical ontologies

Allows integration of model organism and human
data

Uses novel methods for representing taxonomic
variation

Used for translational phenotype analyses.
A BIOMEDICAL
RESEARCH PROBLEM
Finding Drugs for Rare/Orphan Diseases
Orphan Diseases



Diseases affecting fewer than 200,000 people in the U.S.
Approx. 7000 rare diseases affecting 25 million globally (1).
Orphan Drug Act – 1983
◦ Incentives for orphan drug
development.


Around 355 with approved orphan
drug therapies.
Recent interest of pharma giants in
Orphan drug R&D.
Child with Tay-Sachs disease
Image Source: http://www.ntsad.org/index.php/the-diseases
Source (1): Rados FDA Consumer 2003
Orphan disease information space
- a need for systematic analysis
Genetic causes
• 80 % of rare diseases have a genetic origin
Underlying mechanism of Disease Processes
• Intricate/Complex
• May involve single or multiple loci, QTLs, multiple genes
• Time-varying
• Unknown for many diseases
Drugs
• Exact mechanism of action complex
• Not completely documented due to proprietary reasons
Approaches to drug repositioning

Drug-centric approach
◦ Hypothesis: ‘similar drugs’ have the same therapeutic effects and are equally effective for a disease.

Disease-centric approach
◦ Hypothesis: ‘similar diseases’ need the same therapies and can be treated with the same drugs.

[Figure: the drug-centric and disease-centric approaches both lead to repositioned drugs]
Source (5): Adapted from Liu et al. Sept 2012
Finding Drugs as an Exploratory Problem
 If a Genetic Variant (GV) is associated with disease progression, then a drug/chemical which suppresses the GV or its gene product is a possible treatment option for the associated disease.
 If a GV is associated with disease remission, then a drug/chemical which increases activity of the GV is a possible treatment option for the associated disease.
 If a disease-associated GV causes increased expression of certain receptors, a drug which suppresses this receptor will be a possible treatment option for the disease.
[Figure: relations among Drug, Genetic Variant (SNP), Gene product, Receptor/Enzyme and Disease, with edges labeled ↑ or ↓ expression, ↑ or ↓ production and ↑ or ↓ progression; also labeled Biomarker discovery]
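The three rules for exploring treatment options can be read as a simple lookup over known associations. The sketch below encodes them over a tiny hand-made fact base; every variant, receptor, drug and disease name is a placeholder, and the data structures are assumptions chosen for readability, not a real resource. For brevity, rule 1 is applied to the variant only; gene products could be handled in the same way.

```python
# Hypothetical facts; none of these names come from a real database.
# (variant, disease) -> "progression" or "remission"
VARIANT_EFFECT = {
    ("GV1", "disease_A"): "progression",
    ("GV2", "disease_A"): "remission",
}
# variant -> receptors whose expression the variant increases
VARIANT_UPREGULATES = {"GV1": ["receptor_R"]}
# drug -> list of (action, target) pairs
DRUG_ACTION = {
    "drug_X": [("suppresses", "GV1")],
    "drug_Y": [("activates", "GV2")],
    "drug_Z": [("suppresses", "receptor_R")],
    "drug_W": [("activates", "GV1")],
}


def candidate_drugs(disease):
    """Return drugs whose action matches one of the three rules."""
    wanted = set()  # (action, target) pairs a candidate drug should have
    for (variant, dis), effect in VARIANT_EFFECT.items():
        if dis != disease:
            continue
        if effect == "progression":
            wanted.add(("suppresses", variant))      # rule 1
        elif effect == "remission":
            wanted.add(("activates", variant))       # rule 2
        for receptor in VARIANT_UPREGULATES.get(variant, []):
            wanted.add(("suppresses", receptor))     # rule 3
    return sorted(
        drug for drug, actions in DRUG_ACTION.items()
        if wanted & set(actions)
    )


if __name__ == "__main__":
    print(candidate_drugs("disease_A"))
```

In this toy data, `drug_W` is excluded because it activates a variant that is associated with progression, which is the opposite of what rule 1 asks for.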
Resources for drug repositioning
[Figure: resources grouped as gene expression based, drug-centric, network and computational modeling, disease-centric, and text-based]
◦ GEO: 834,730 samples
◦ DrugBank: 6711 drugs, 4227 drug targets
◦ cMAP: 1307 compounds
◦ OMIM: 20,000 genes & phenotypes
◦ CTD: 18,414,321 toxicogenomic relationships
◦ Orphanet: 6500 rare diseases
◦ EMRs
◦ PubMed: 22 million citations
Source (5): Adapted from Liu et al. Sept 2012
HIBM – a rare disease


Autosomal recessive disease
Clinical/diagnostic features
◦ Proximal and distal muscle weakness (starts with distal)
◦ Onset during late teens
◦ Mild elevation of serum CK
◦ Progression of muscular weakness continues for 10-20 years
◦ Spares the quadriceps
◦ Detection of “inclusion bodies” in muscle biopsies
 Rimmed vacuoles (clusters of autophagic vacuoles (AVs) and myeloid bodies) in muscle tissue
 Accumulation of beta-amyloid, accumulation of NCAM1 in muscle (hyposialylation)
 Intracellular deposition of Congo red-positive materials (such as β-amyloid and α-synuclein)
◦ No loss of cognitive function
HIBM – a rare disease

Genetic characteristics
◦ Caused by mutations in GNE at locus 9p13-p12
 homozygous or compound heterozygous
 bi-functional enzyme, UDP-N-Acetylglucosamine 2-epimerase/N-Acetylmannosamine Kinase
 catalyzes two adjacent steps in the sialic acid biosynthetic
pathway
 feedback regulated
 phosphorylated (PKC) and ubiquitinated
◦ Associated with
 abnormal phosphorylation of tau
 activation of the ubiquitin proteasome system
 activation of the lysosomal system
Using Ontology Recommenders
Using Ontology Annotators
HIBM
(is-a myopathy) (myopathy abnormality-of muscle-tissue)
Autosomal recessive (is-a genetic inheritance) disease
 Clinical/diagnostic features

◦ Proximal muscle weakness (OMIM) distal muscle weakness (OMIM)
(starts with distal)
◦ Onset during late teens
◦ Mild elevation of (PATO) serum CK – (elevated creatine phosphokinase
is-a elevated-enzyme-activity)
◦ Progression of (PATO) muscular weakness continues for 10-20 years
◦ Spares (not affects) the quadriceps (Uberon)
◦ Detection of “inclusion bodies” in muscle biopsies
 Rimmed vacuoles (clusters of autophagic vacuoles (AVs) and myeloid
bodies) in muscle tissue
 Accumulation of beta-amyloid, accumulation of NCAM1 in muscle
(hyposialylation) (decreased occurrence of sialic acid in)
 Intracellular deposition of Congo red-positive materials (such as β-amyloid and α-synuclein)
◦ No loss of cognitive function (cogpo)
Why is it hard to create ontologies for cognitive functions?
Organizing information with
ontologies
HIBM – a rare disease

Genetic characteristics
◦ Caused by mutations in GNE at locus 9p13-p12
 homozygous or compound heterozygous
 bi-functional enzyme, UDP-N-Acetylglucosamine 2-epimerase/N-Acetylmannosamine Kinase
 catalyzes two adjacent steps in the sialic acid biosynthetic
pathway (BioPAX)
 feedback regulated
 phosphorylated (PKC) and ubiquitinated (ubiquitination – GO)
◦ Associated with
 abnormal phosphorylation (GO) of tau protein (PRO)
 activation of the ubiquitin proteasome system
 activation of the lysosomal system
[Figure: maps of GNE mutations along the protein sequence (axis: residue position, 100 to 700), showing the UDP-GlcNAc 2-epimerase domain and the ManNAc 6-kinase domain. Legend: black, mutations in UniProt; grey, mutations in papers; -p, phosphorylation site; -u, ubiquitination site. Annotated sites: ATP binding site, allosteric site, substrate binding site, nuclear export signal, enzymatic active site, Zn binding site. Panels: (1) reported human mutations, including M712T (rs28937594, “middle eastern”) and V572L (“Japanese”); (2) human mutations with +/− marks for kinase, epimerase, oligomerization and feedback inhibition; (3) rat, mouse and CHO mutants, including knockouts (KO).]
Ontological Mapping of Findings to
Sequences
Sequence Types and
Features Ontology
Ontological Model of Pathways using
BioPAX

Pathway: a set or series of interactions, often forming a network
Exploring for related information

What genes are related to inclusion body
myopathies?
Enrichment analysis using ontologies
What are the relevant phenotypes?

The human phenotype ontology
◦ Arranged as a directed acyclic graph (DAG)
 A given phenotypic feature can be considered to be a more specific aspect of more than one parental term.
 Terms that are located close to the root of the graph are less specific than terms that are farther away from it.
 Specificity is measured by the information content (IC) of a term: IC = −log p_i, where p_i is the frequency of the phenotypic manifestation i among all diseases in the database.
 For example, mental retardation, which is a common phenotypic manifestation of many hereditary diseases, is less clinically specific (has less information content) than a feature such as calcific stippling.
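The information content defined above is a one-line computation once annotation counts are available. The following sketch uses a made-up set of four disease annotations (the disease names and their term sets are assumptions) and the natural logarithm; the choice of log base only rescales the values.

```python
import math

# Hypothetical annotations: disease -> phenotype terms (ancestors included).
ANNOTATIONS = {
    "disease_1": {"abnormality_of_the_eye", "hypertelorism"},
    "disease_2": {"abnormality_of_the_eye", "telecanthus"},
    "disease_3": {"abnormality_of_the_eye"},
    "disease_4": {"calcific_stippling"},
}


def information_content(term):
    """IC = -log(p), p = fraction of diseases annotated with the term.

    Raises ValueError if the term annotates no disease (p = 0).
    """
    n_with_term = sum(term in terms for terms in ANNOTATIONS.values())
    p = n_with_term / len(ANNOTATIONS)
    return -math.log(p)


if __name__ == "__main__":
    for term in ("abnormality_of_the_eye", "hypertelorism"):
        print(term, round(information_content(term), 3))
```

A term shared by three of the four diseases gets a lower IC than a term used by only one, matching the statement that common manifestations are less specific.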
Comparing phenotypes
Figure 3. Analysis of the phenotypic similarity of the Human Phenotype Ontology (HPO) terms downward slanting palpebral fissures and hypertelorism to annotations of (a) Greig cephalopolysyndactyly syndrome [GCPS (MIM 175700)] and (b) type II orofaciodigital syndrome [OFD2 (MIM 252100)]. The most specific common ancestor of hypertelorism and telecanthus is the term abnormality of the eye, and the similarity between hypertelorism and telecanthus is calculated as the information content of the term abnormality of the eye. Therefore, a search with the query terms downward slanting palpebral fissures and hypertelorism yields a higher score for GCPS than for OFD2.
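The similarity between two terms described in the caption, the information content of their most specific common ancestor, can be sketched as follows. The parent links form a deliberately simplified toy graph (not the actual HPO structure) and the term frequencies are invented numbers; the sketch also assumes that any two terms share at least the root.

```python
import math

# Simplified, hypothetical fragment of a phenotype DAG: term -> parents.
PARENTS = {
    "hypertelorism": ["abnormality_of_the_eye"],
    "telecanthus": ["abnormality_of_the_eye"],
    "abnormality_of_the_eye": ["phenotypic_abnormality"],
    "phenotypic_abnormality": [],
}
# Invented frequencies of each term among diseases in a database.
FREQUENCY = {
    "hypertelorism": 0.05,
    "telecanthus": 0.02,
    "abnormality_of_the_eye": 0.2,
    "phenotypic_abnormality": 1.0,
}


def ancestors(term):
    """The term itself plus every term reachable through parent links."""
    found = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def ic(term):
    return -math.log(FREQUENCY[term])


def similarity(a, b):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(ic(t) for t in common)


if __name__ == "__main__":
    print(round(similarity("hypertelorism", "telecanthus"), 3))
```

A disease score would then combine such pairwise similarities between the query terms and the terms annotated to the disease.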
Phenotypic similarity using EQ

Recall phenotype description using EQ description

Phenotypic similarity
◦ The IC of a node is the negative log of the probability of that description being used to annotate a gene, allele, or genotype (collectively called a feature).
◦ Phenotypic Profile: multiple EQ descriptions annotated to a genotype
◦ Phenotypes annotated to genotypes are propagated to their allele(s), and in turn to the gene, indicated with upward arrows.
◦ Similarity is analyzed between any two nodes of the same type, e.g. gene A-vs-B, allele A3-vs-B1, genotypes A1/A1-vs-A3/A3, or A3/A3-vs-B1/B1.
◦ The common subsuming phenotypes between A1/A1-vs-A3/A3 and gene A-vs-B are itemized in white boxes. Some individual phenotypic descriptions can have two common subsumers.
◦ For each phenotypic description (EQ), the calculated IC is shown.
◦ When comparing two items, four scores are determined:
 maxIC, the maximum IC score for the common subsuming EQ, which may be a direct (in the case of A1/A1-vs-A3/A3) or inferred (in the case of gene A-vs-gene B) phenotype
 avgICCS, the average of all common subsuming IC scores
 simIC, the similarity score which computes the ratio of the sum of IC values for EQ descriptions (including subsuming descriptions) held in common (intersection) to that of the total set (union)
 simJ, a non-IC-based similarity score calculated with the Jaccard algorithm, which is the ratio of the count of nodes in common (intersection) to the count of nodes in either set (union)
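A minimal sketch of the four scores, assuming each phenotypic profile has already been expanded into a set of EQ descriptions that includes its subsuming descriptions. The description identifiers and IC values are invented, and simJ is computed here as the standard Jaccard index (intersection over union). The function assumes the two profiles share at least one description.

```python
# Invented IC values for hypothetical EQ descriptions.
IC = {"E1": 0.5, "E2": 1.0, "E3": 2.0, "E4": 3.0, "E5": 1.5}

# Hypothetical profiles, already expanded with subsuming descriptions.
PROFILE_A = {"E1", "E2", "E3", "E4"}
PROFILE_B = {"E1", "E2", "E3", "E5"}


def compare(a, b):
    """Return maxIC, avgICCS, simIC and simJ for two profiles."""
    common = a & b
    union = a | b
    common_ic = [IC[t] for t in common]
    return {
        "maxIC": max(common_ic),
        "avgICCS": sum(common_ic) / len(common_ic),
        "simIC": sum(common_ic) / sum(IC[t] for t in union),
        "simJ": len(common) / len(union),
    }


if __name__ == "__main__":
    for name, value in compare(PROFILE_A, PROFILE_B).items():
        print(name, round(value, 4))
```

The IC-weighted score simIC (0.4375 here) is lower than the unweighted simJ (0.6) because the two descriptions that are not shared carry more information than the shared ones.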
Phenoclustering

Phenotype and genotype information can be
viewed as a network
◦ Graph clustering techniques with suitable
similarity metrics can be used to define node
proximity
Phenoclustering: online mining of
cross-species phenotypes
Groth et al, Bioinformatics 2010
26(15): 1924.
Investigating the hypothesis

Exploratory Search
◦ A specialization of information exploration which represents the activities carried out by searchers who are
 Unfamiliar with the domain of their goals
 Unsure about the ways to achieve their goals
 Possibly even unsure about their exact goals

◦ Hypothesis investigation can be viewed as an exploratory search over a semantically connected graph
 Find entities of type drug that relate to one or more of these genes, possibly through these pathways, and possibly through these phenotypes
 Distinct from finding statistically correlated information and thresholding on p-values
Role of ontologies in exploratory
graph search

Ontologies serve as indices to data
 Semantic labels as indices
 Relationships as join indices
 Ontological neighborhoods as multi-join indices
◦ Helps to construct “semantic neighborhoods” between
data nodes that are far apart

Ontologies as (implicit) query filters
◦ Find connections in the data graph only when the
corresponding ontology entities satisfy a connectivity
pattern

Node/Node Type distances can denote node
similarities
◦ Can be a function of graph distances in the ontology
◦ Can be extended to define relatedness measures between
data neighborhood
Example

Exploratory Query
◦ Find drug:* related-to gene:GNE, through
some pathways, and optionally through some
muscular dystrophy
◦ A potential exploration path
 GNE → missense mutations of GNE → reduced GNE-epimerase activities → GNE/MNK pathway → ManNAc kinase → clinical trials → drug → DEX-M4
 Exercise: how can ontologies contribute to finding
this path?
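One way to picture the exploratory query is as a breadth-first search over a typed graph that stops at the first node of the requested type. The sketch below hard-codes the nodes of the exploration path from this slide; the node types and the extra side edge are assumptions added for illustration, and in practice the types would come from ontology annotations on the data.

```python
from collections import deque

# Node types are assumptions; the node names follow the path on the slide.
NODE_TYPE = {
    "GNE": "gene",
    "missense mutations of GNE": "sequence variant",
    "reduced GNE-epimerase activities": "phenotype",
    "GNE/MNK pathway": "pathway",
    "ManNAc kinase": "enzyme",
    "clinical trials": "study",
    "DEX-M4": "drug",
    "sialic acid": "chemical",  # hypothetical side branch
}
EDGES = {
    "GNE": ["missense mutations of GNE"],
    "missense mutations of GNE": ["reduced GNE-epimerase activities"],
    "reduced GNE-epimerase activities": ["GNE/MNK pathway"],
    "GNE/MNK pathway": ["ManNAc kinase", "sialic acid"],
    "ManNAc kinase": ["clinical trials"],
    "clinical trials": ["DEX-M4"],
}


def find_path_to_type(start, target_type):
    """Shortest path from start to any node of target_type, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if NODE_TYPE[node] == target_type:
            return path
        for nxt in EDGES.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    path = find_path_to_type("GNE", "drug")
    print(" -> ".join(path))
    # The "through some pathways" condition can be checked on the result.
    print(any(NODE_TYPE[n] == "pathway" for n in path))
```

The search only needs the type labels to decide when to stop and which paths satisfy the query, which is one place where ontology annotations enter the picture.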
Conclusions



• Upper ontologies are needed to organize concepts and relationships for a domain and application
• Principled methods of modularizing component ontologies help avoid large monolithic ontologies and potential inconsistencies
• Ontologies are not only used for conceptualizing a domain but also for tasks like data integration, enrichment analysis and (exploratory) search