CSC 522 Performance Evaluation of Algorithms

Ontology-Based Knowledge
Discovery and Sharing in Biological
and Medical Research
Jingshan Huang
Assistant Professor
School of Computer and Information Sciences
University of South Alabama
http://cis.usouthal.edu/~huang/
Dept. of Chemical Pathology @ CUHK
Hong Kong
August 17, 2010
Presentation Outline
• Research Motivation
• Ontologies and Ontological Techniques
• Apply Ontological Techniques into Biological
and Medical Research
• Ongoing Research – OMIT Project
Research Motivation – Overview
• Information from heterogeneous sources has different semantics
  Long (English)
  Long (Chinese Pinyin) -> 龙 (龍) ->
• Knowledge discovery and sharing in biological/medical research is both important and challenging
• Integrating the information from heterogeneous sources must make use of all available clues, including syntax, semantics, context, and pragmatics
• Ontologies are a formal model to encode semantics
• Ontological techniques are critical in knowledge acquisition
Research Motivation – More Details
Why???
• In medical informatics, the abundance of digital data promises a profound impact on knowledge discovery and innovation
• Health scientists worldwide produce, access, analyze, integrate, and store massive amounts of digital medical data daily
• Such data is obtained through observation, experimentation, and simulation
• If we could effectively transfer and integrate data from all possible resources, we could obtain:
  ① a deeper understanding of all these data sets,
  ② better exposed knowledge, and
  ③ appropriate insights and actions
• Unfortunately, in many cases, the data users are not the data producers
• They thus face challenges in harnessing data in unforeseen and unplanned ways
Research Motivation – An Example Scenario
• Identifying and characterizing the important roles that microRNAs (miRNAs) play in human cancer is an increasingly active area
• In particular, it is very challenging to effectively identify miRNAs’ target genes
• Cancer patients’ prognosis depends largely on their chemosensitivity (sensitivity to chemotherapy)
• Research has discovered that some specific genes increase the permeability of the membrane of mitochondria (a cellular component), which in turn leads to apoptosis (cell death)
• As a result, the patient’s chemosensitivity will increase and the chemotherapy will be more effective
• Certain miRNAs can regulate the aforementioned genes and thus affect cancer patients’ prognosis
• If biologists were able to identify such miRNAs, a breakthrough in cancer treatment could be made
Unfortunately, such identification is very difficult…
Research Motivation – An Example Scenario (cont.)
• Biologists need to extract a large number of candidate target genes from existing miRNA databases
• They will also have to manually search resources other than miRNA databases for related information on every one of the hundreds of candidate target genes
  ① cellular component
  ② biological process
  ③ and so on…
• In a word, the whole process is time-consuming, error-prone, and subject to biologists’ limited prior knowledge
• In addition, the situation could be even worse
  ① It is further aggravated by the great complexity and imprecise terminology that characterize typical biological and biomedical research fields
  ② A great deal of variety has been identified in the adoption of different biological terms, along with different relationships among all these terms
  ③ Such variety has inhibited effective information acquisition by humans
Research Motivation – Summary
• The biological and medical research area is facing a challenging problem: knowledge discovery and sharing among distributed parties
• In order to integrate heterogeneous data, and thereby efficiently revolutionize traditional medical and biological research, new methodologies are greatly needed
• As a formal knowledge representation model, ontologies play a key role in defining formal semantics in traditional knowledge engineering
Conclusion:
It is necessary to apply ontological techniques to biological and medical research
Definition of Ontologies
• The simplest definition:
  An ontology is a computational model (a.k.a. knowledge representation model) of some domain of the world
• It describes the semantics of the terms (a.k.a. concepts) used in the domain
• It is often captured in the form of a DAG (directed acyclic graph)
What is a DAG then?
• Nodes represent ontology concepts while arcs represent their relationships
• May be augmented by rules, constraints, or functions
• In brief, ontologies aim to make explicit the knowledge contained within software applications for a particular domain:
  An ontology = a finite set of concepts + properties + relationships
• Such graphical structures are also known as ontology schemas
• Actual data sets contained in these schemas are referred to as instances
• Most real-world ontologies have very few or no instances at all
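The definition above (a finite set of concepts + properties + relationships, with optional instances) can be sketched as a small data structure. This is only an illustrative model, not how any particular ontology tool stores its data; the class names, the `has_membrane` property, and the `is_a` label are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One node of the ontology schema."""
    name: str
    properties: dict = field(default_factory=dict)
    instances: list = field(default_factory=list)  # often empty in practice


@dataclass
class Ontology:
    """A finite set of concepts plus labelled arcs between them."""
    concepts: dict = field(default_factory=dict)       # name -> Concept
    relationships: list = field(default_factory=list)  # (source, label, target)

    def add_concept(self, name, **properties):
        self.concepts[name] = Concept(name, dict(properties))

    def relate(self, source, label, target):
        # Both ends of an arc must already be concepts in the ontology.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("unknown concept in relationship")
        self.relationships.append((source, label, target))

    def targets(self, source, label):
        """Concepts reached from `source` over one arc carrying `label`."""
        return [t for s, l, t in self.relationships if s == source and l == label]


# Tiny schema: two concepts and one directed arc (hypothetical property name).
onto = Ontology()
onto.add_concept("organelle")
onto.add_concept("mitochondrion", has_membrane=True)
onto.relate("mitochondrion", "is_a", "organelle")
```

Here the schema is the pair of `concepts` and `relationships`; the `instances` lists stay empty, which mirrors the observation that most real-world ontologies carry few or no instances.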
Ontology Engineering
• The creation and maintenance of ontologies in the domain of interest
• In other words, it focuses on the methodologies by which to build ontologies
• To create an ontology, three different approaches can be applied
  ① Top-down approach (knowledge driven)
  ② Bottom-up approach (data/inference driven)
  ③ Combination of top-down and bottom-up
• Languages to represent ontologies in computer systems
  ① OWL (Web Ontology Language) – most popular one
  ② Open Biological and Biomedical Ontologies (OBO)
  ③ Knowledge Interchange Format (KIF)
  ④ Open Knowledge Base Connectivity (OKBC)
• GUI tools for ontology engineering
  ① Protégé (by Stanford) – most popular one
  ② CmapTools (by IHMC)
  ③ OntoEdit (by Ontoprise)
Ontology Engineering (Protégé GUI – Upper Bio Ontology)
Ontology Engineering (Example OWL File – Upper Bio Ontology)
Ontology Heterogeneity
• Heterogeneity is an important, inherent characteristic of ontologies developed by different parties for the same (or similar) domains
• This is due to the fact that ontologies reflect their designers’ different conceptual models for some domain
• The heterogeneous semantics may occur in different ways
  ① different terms could be used for the same concept;
  ② an identical term could be adopted for different concepts;
  ③ properties and relationships could be different
As a result, Ontology Matching has become an increasingly active topic
Ontology Matching
• “Ontology Matching” is short for “Ontology Schema Matching”
• Also known as “Ontology Alignment” or “Ontology Mapping”
• It refers to the process of determining correspondences between concepts from heterogeneous ontologies
• It aims to handle the aforementioned challenge in ontology heterogeneity
• Many different relationships will be involved
  ① equivalentWith
  ② subClassOf
  ③ superClassOf
  ④ siblings
  ⑤ and so on…
Current Ontology-Matching Algorithms

Rule-Based Matching
① Consider schema information alone
② Specify a set of rules
③ Apply them to schema information

Learning-Based Matching
① Consider both schema and instances
② Apply different machine learning techniques
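As a rough illustration of rule-based matching, the sketch below scores concept pairs from schema information alone (names and property names) using fixed weights. It is not one of the published matching algorithms; the weights, the threshold, the property names, and the function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two concept names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(xs, ys):
    """Overlap of two property-name collections, in [0, 1]."""
    xs, ys = set(xs), set(ys)
    if not xs and not ys:
        return 0.0
    return len(xs & ys) / len(xs | ys)


def match(source, target, w_name=0.6, w_props=0.4, threshold=0.7):
    """Rule: keep a pair when its weighted score reaches the threshold.

    source/target map a concept name to a list of property names.
    The weights are ad hoc, predefined values (an assumption of this sketch).
    """
    pairs = []
    for s_name, s_props in source.items():
        for t_name, t_props in target.items():
            score = (w_name * name_similarity(s_name, t_name)
                     + w_props * jaccard(s_props, t_props))
            if score >= threshold:
                pairs.append((s_name, t_name, score))
    # Best-scoring correspondences first.
    return sorted(pairs, key=lambda p: -p[2])


# Two toy schemas with hypothetical property names.
ontology_a = {"TargetGene": ["symbol", "organism"]}
ontology_b = {"target_gene": ["symbol", "organism"], "Protein": ["sequence"]}
correspondences = match(ontology_a, ontology_b)
```

No instance data is consulted at any point, which is what keeps this style of matcher fast and also what limits the evidence it can draw on.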
Brief Introduction of Machine Learning
① A scientific discipline that is concerned with the design and
development of some special algorithms
② These algorithms allow computers to change behavior based on
“training data”
③ The major focus is to recognize complex patterns and make
intelligent decisions
Pros and Cons for Current Approaches

Rule-Based Matching
① Is relatively fast (pro)
② Ignores instance information (con)
③ Uses ad hoc predefined weights (con)
   concept semantics: name + properties + relationships

Learning-Based Matching
① Obtains extra clues from instances (pro)
② Runs longer (con)
③ Has difficulty in getting sufficient instances (con)
   most real-world ontologies do not have instances
Ontological Techniques in Bio Research
• Ontological techniques have been widely applied to medical and biological research
• The most successful example is the Gene Ontology (GO) project
• The Unified Medical Language System (UMLS) and the National Center for Biomedical Ontology (NCBO) are two other successful examples
• In addition, efforts have been made toward ontology-based data integration in bioinformatics and medical informatics
Why Gene Ontology (GO) Project?
• Biologists have wasted a lot of time and effort in searching for all of the available information about each small area of research
• The search is further hampered by the wide variations in terminology that may be in common usage at any given time
• A simple example: if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis
• Suppose that one database describes these molecules as being involved in “translation”, whereas another uses the phrase “protein synthesis”
• It will then be difficult for a human, let alone computer software, to find functionally equivalent terms
• To address the need for consistent descriptions of gene products in different databases, the GO began in 1998 as a collaboration between three model organism databases (FlyBase, the Saccharomyces Genome Database, and the Mouse Genome Database)
• The GO Consortium has since grown to include many databases, including several of the world’s major repositories for plant, animal, and microbial genomes
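The “translation” versus “protein synthesis” problem above is what a shared controlled vocabulary is meant to remove. A minimal sketch of the idea follows, assuming a hand-written synonym table; the table and the function names are illustrative only and are not part of the GO itself.

```python
# Hypothetical synonym table: every known label maps to one agreed term.
SYNONYMS = {
    "translation": "translation",
    "protein synthesis": "translation",
}


def canonical_term(label):
    """Normalize case and spacing, then look up the agreed term."""
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key, key)


def same_function(label_a, label_b):
    """True when two database labels resolve to the same agreed term."""
    return canonical_term(label_a) == canonical_term(label_b)
```

With such a table in place, software can treat annotations from the two databases as functionally equivalent without a human reading both.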
Three Sub-Ontologies in the GO
• Cellular Component, Biological Process, and Molecular Function
• A gene product might:
  ① be associated with or located in one or more cellular components;
  ② be active in one or more biological processes;
  ③ during which it performs one or more molecular functions
Example
The gene product, cytochrome c, can be described by:
  ① the molecular function term “oxidoreductase activity”
  ② the biological process terms “oxidative phosphorylation” and “induction of cell death”
  ③ the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
GO Structure
• The GO ontology is essentially a Hierarchy-Like DAG
• In other words, each node is a GO term, and each arc represents a relationship between two GO terms
• Directed feature
  For example, a mitochondrion is an organelle, but not vice versa
• Acyclic feature (cycles are not allowed)
  For example, it is inappropriate to specify that “A1 is an A2”, “A2 is an A3”, … “Ai is an A1”
• Hierarchy-Like feature (generalized-specialized relationship plus possibly multiple parents)
  For example, the biological process term hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process (biosynthetic process is a type of metabolic process and a hexose is a type of monosaccharide)
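The directed, acyclic, and hierarchy-like features can be illustrated with a child-to-parents map and a cycle check. This is a sketch rather than the GO's own tooling; only the hexose arcs come from the example above, and the parents of the two parent terms are left out for brevity.

```python
# Directed arcs: child term -> list of parent terms.
# A term may have several parents, which is the hierarchy-like feature.
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": [],             # parents omitted in this sketch
    "monosaccharide biosynthetic process": [],  # parents omitted in this sketch
}


def has_cycle(parents):
    """Depth-first search; meeting a term that is still open means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, being explored, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for parent in parents.get(node, []):
            state = colour.get(parent, WHITE)
            if state == GREY:
                return True
            if state == WHITE and visit(parent):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(parents))
```

A chain such as A1 → A2 → A3 → A1 makes `has_cycle` return `True`, which is the situation the acyclic feature forbids.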
An Example GO Diagram
Three Relationships in the GO
• The GO ontology defines three different relationships among terms, each drawn with its own symbol in GO diagrams
  ① is a, a.k.a. is a subtype of;
  ② part of; and
  ③ regulates
Note that regulates includes two sub-relationships, i.e., negatively regulates and positively regulates, each with its own symbol
is a Relationship in the GO
• If A is a B, it means that A is a subtype of B
  ① For example, mitotic cell cycle is a cell cycle
  ② Another example: lyase activity is a catalytic activity
• The is a relationship differs from “is an instance of” (which denotes a specific example of something), for example:
  ① A cat is a mammal
  ② George is an instance of a cat; therefore, in the GO sense of is a, the claim that “George is a cat” is incorrect
  ③ However, it is safe to claim that every instance of a cat is also an instance of a mammal
Reasoning over is a Relationship
• The is a relationship is transitive: if A is a B and B is a C, then A is a C
• Example
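Transitivity of is a means that every ancestor reachable through a chain of is a arcs can be inferred. The following sketch shows one way to compute that, assuming a child-to-parents map; the “mammal is an animal” arc is added here purely for illustration, as are the function names.

```python
# Direct is a arcs: child -> parents.
IS_A = {
    "cat": ["mammal"],
    "mammal": ["animal"],  # illustrative extra arc, not from the slides
}


def ancestors(term, is_a):
    """All terms reachable from `term` by one or more is a arcs."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_a_inferred(a, b, is_a):
    """True when 'a is a b' follows directly or by transitivity."""
    return b in ancestors(a, is_a)
```

The relationship stays directed: `is_a_inferred("cat", "animal", IS_A)` holds, while the reverse query does not.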
part of Relationship in the GO
• B is part of A, meaning that the presence of B implies the presence of A
• But not vice versa, i.e., given the presence of A, we cannot conclude the presence of B
• In other words
  ① all B are part of A
  ② but only some A have part B
• Example
Reasoning over part of Relationship (1)
• The part of relationship is also transitive: if A is part of B and B is part of C, then A is part of C
• Example
Reasoning over part of Relationship (2)
• part of followed by is a: if A is part of B and B is a C, then A is part of C
• Example
Reasoning over part of Relationship (3)
• part of following is a: if A is a B and B is part of C, then A is part of C
• Example
Reasoning over part of Relationship (4)
• The aforementioned logical rules regarding the part of and is a relationships hold no matter how many intervening is a and part of relationships there are
• Example
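The rules for combining is a and part of can be summarized as: a chain made only of is a arcs yields is a, and a chain containing at least one part of arc yields part of. A small sketch under that reading follows; the label strings and function names are assumptions of the example, and regulates is deliberately left out.

```python
ALLOWED = {"is_a", "part_of"}


def compose(first, second):
    """Relationship inferred for A -> C from A -first-> B -second-> C."""
    if first not in ALLOWED or second not in ALLOWED:
        raise ValueError("only is_a and part_of are handled in this sketch")
    # is_a + is_a stays is_a; any combination involving part_of gives part_of.
    return "is_a" if first == second == "is_a" else "part_of"


def infer_path(labels):
    """Fold `compose` over a chain of arcs of any length."""
    if not labels:
        raise ValueError("need at least one arc")
    result = labels[0]
    if result not in ALLOWED:
        raise ValueError("only is_a and part_of are handled in this sketch")
    for label in labels[1:]:
        result = compose(result, label)
    return result
```

Because `infer_path` simply repeats the pairwise rule, the result does not depend on how many intervening arcs the chain contains.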
regulates Relationship in the GO
• B regulates A, meaning that whenever both A and B are present, B always regulates A
• But not vice versa, i.e., A may not always be regulated by B
• In other words
  ① all B regulate A
  ② but only some A are regulated by B
• Example
Reasoning over regulates Relationship (1)
• Both negatively regulates and positively regulates imply regulates
• Example
Reasoning over regulates Relationship (2)
Reasoning over regulates Relationship (3)
Reasoning over regulates Relationship (4)
Reasoning over regulates Relationship (5)
• Example
Reasoning over regulates Relationship (6)
• Example
Reasoning over regulates Relationship (7)
• Example
Ongoing Research: OMIT Project
http://omit.cis.usouthal.edu/
Besides Sun Lab at CUHK, there are five other collaborating labs from around the world
Project Overview
• An innovative computing framework based on the Ontology for MicroRNA Target Prediction (OMIT) to handle the aforementioned challenge in predicting miRNAs’ target genes
• The OMIT is a domain-specific ontology upon which it is possible to facilitate knowledge discovery and sharing from existing sources
• The long-term research objective of the OMIT framework is to assist biologists in unraveling important roles of miRNAs in human cancer, and thus to help clinicians in making sound decisions when treating cancer patients
• We aim to synthesize data from existing source miRNA databases into a comprehensive conceptual model that permits an emphasis on data semantics
• Consequently, a more accurate, complete view of miRNAs’ biological functions can be acquired
We thus provide users with a single query engine that takes their needs in a nonprocedural specification format
System Framework
Five Tasks in the OMIT Project
① To develop a miRNA-domain-specific ontology that contains a set of
OMIT concepts, along with the relationships among these concepts
② To align the OMIT with the GO so that gene-related information can
be automatically acquired and integrated
③ To annotate source miRNA databases with OMIT concepts for existing
databases to be enriched with formal semantics
④ To integrate OMIT-annotated miRNA databases into a centralized RDF
data warehouse
⑤ To perform complicated search/query in a unified style so that deep
knowledge can be obtained out of a wealth of miRNA data
An Example Research Scenario
Suppose a cancer biologist is interested in investigating the
chemosensitivity of breast cancer cells
• By comparing chemosensitive and chemoresistant cancer cells it is
demonstrated that miR-125b, a specific miRNA, may confer the increased
chemosensitivity of cancer cells
• After the OMIT system obtains candidate targets for miR-125b, the gene
information of these targets will be further acquired, including cellular
localization (e.g., in mitochondria) and biological process (e.g., apoptosis)
• The availability of such integrated knowledge will make it much easier for
the cancer biologist to deduce the actual targets for miR-125b
• As a result, a breakthrough in breast cancer treatment may be achieved
A Typical Knowledge Acquisition Cycle
• Steps 1-3: the user initiates a search/query; the recognized miRNA concept is
used to query the RDF data warehouse
• Steps 4-5: miRNA targets are retrieved and utilized to acquire more gene
information
• Steps 6-8: miRNA targets and their related gene information are returned
to the user
Corresponding RDF-based query:
SELECT DISTINCT OMIT:targetGene
FROM OMIT:miRNA, GO-CC:cellComponent, GO-BP:bioProcess
WHERE OMIT:miRNA ID = "miR-125b"
  AND OMIT:miRNA targetID = GO-CC:cellComponent geneID
  AND OMIT:miRNA targetID = GO-BP:bioProcess geneID
  AND GO-CC:cellComponent localization = "mitochondria"
  AND GO-CC:cellComponent permeabilityIncrease = "yes"
  AND GO-BP:bioProcess apoptosisIncrease = "yes"
USING NAMESPACE
  OMIT = <http://omit.cis.usouthal.edu/ontology/OMIT.owl>,
  GO-CC = <http://www.geneontology.org/formats/oboInOwl#>,
  GO-BP = <http://www.geneontology.org/formats/oboInOwl#>.
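The logic of the query above (join miRNA targets with their cellular component and biological process records, then filter) can be mimicked in plain Python. This is only a sketch of the join, not the OMIT system's query engine; the records and the gene identifiers `GENE_A`, `GENE_B`, `GENE_C` and `miR-XYZ` are made-up placeholders, not real data, while the field names follow the query.

```python
# Hypothetical stand-ins for the three sources named in the FROM clause.
MIRNA_TARGETS = [
    {"ID": "miR-125b", "targetID": "GENE_A"},
    {"ID": "miR-125b", "targetID": "GENE_B"},
    {"ID": "miR-XYZ", "targetID": "GENE_C"},
]
CELL_COMPONENT = {  # keyed by geneID
    "GENE_A": {"localization": "mitochondria", "permeabilityIncrease": "yes"},
    "GENE_B": {"localization": "nucleus", "permeabilityIncrease": "no"},
    "GENE_C": {"localization": "mitochondria", "permeabilityIncrease": "yes"},
}
BIO_PROCESS = {  # keyed by geneID
    "GENE_A": {"apoptosisIncrease": "yes"},
    "GENE_B": {"apoptosisIncrease": "yes"},
    "GENE_C": {"apoptosisIncrease": "yes"},
}


def candidate_targets(mirna_id):
    """Distinct target genes of `mirna_id` that satisfy every WHERE condition."""
    hits = set()
    for row in MIRNA_TARGETS:
        if row["ID"] != mirna_id:
            continue
        gene = row["targetID"]
        cc = CELL_COMPONENT.get(gene, {})  # join on geneID
        bp = BIO_PROCESS.get(gene, {})     # join on geneID
        if (cc.get("localization") == "mitochondria"
                and cc.get("permeabilityIncrease") == "yes"
                and bp.get("apoptosisIncrease") == "yes"):
            hits.add(gene)
    return sorted(hits)
```

Calling `candidate_targets("miR-125b")` keeps only the placeholder gene that is localized in mitochondria, increases permeability, and increases apoptosis.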
Top-Level OMIT Concepts
Expanded View of OMIT Concepts (Portion)
Linkage between the OMIT and the GO
• Some OMIT concepts are directly inherited and extended from
GO concepts
For example, OMIT concept GeneExpression is designed to describe
miRNAs’ regulation of gene expression. This concept is inherited from
concept gene expression in the BiologicalProcess ontology. This way,
subclasses of gene expression, such as negative regulation of gene
expression, are then accessible in the OMIT for describing the negative
gene regulation of miRNAs in question
• Some OMIT concepts are equivalent to (or similar to) GO
concepts
For example, OMIT concept PathologicalEvent and its subclasses are
designed to describe biological processes that are disturbed when a cell
becomes cancerous. Although not immediately inherited from any specific
GO concepts, these OMIT concepts do match up with certain concepts in
the BiologicalProcess ontology. OMIT concepts TargetGene and Protein
are two other examples, which correspond to individual genes and
individual gene products, respectively, in the GO
OMIT GUI Design
OMIT Summary
• It is an innovative computing framework based on the miRNA-domain-specific ontology
• It aims to handle the challenge of predicting miRNAs’ target genes
• The OMIT is the very first ontology in the miRNA domain
• It will assist biologists in unraveling important roles of miRNAs in human
cancer, and thus help clinicians in making sound decisions when treating
cancer patients
• This long-term research goal will be achieved by facilitating knowledge
discovery and sharing from existing sources
• The first version of the OMIT ontology has been added to the NCBO BioPortal
(http://bioportal.bioontology.org/ontologies/42873)
• Updates are available at the project website: http://omit.cis.usouthal.edu/
Summary
• Knowledge discovery and sharing is critical in biological and medical research
• As a formal knowledge representation model, ontologies are of great help in defining formal semantics
• Ontological techniques have been widely applied in bioinformatics and medical informatics
• The most successful example is the Gene Ontology (GO) project
• Our ongoing project, OMIT, aims to investigate the challenging issue of miRNA target prediction in human cancer
• Suggestions?
• Comments?
• Questions?
Thank you!!!