Ontology-Based Knowledge Discovery and Sharing in Biological and Medical Research

Jingshan Huang
Assistant Professor
School of Computer and Information Sciences
University of South Alabama
http://cis.usouthal.edu/~huang/

Dept. of Chemical Pathology @ CUHK, Hong Kong
August 17, 2010

Presentation Outline
• Research Motivation
• Ontologies and Ontological Techniques
• Applying Ontological Techniques to Biological and Medical Research
• Ongoing Research – OMIT Project

Research Motivation – Overview
• Information from heterogeneous sources has different semantics
  Example: Long (English) vs. Long (Chinese Pinyin) -> 龙 (龍), i.e., "dragon"
• Knowledge discovery and sharing in biological/medical research is both important and challenging
• Integrating the information from heterogeneous sources must make use of all available clues, including syntax, semantics, context, and pragmatics
• Ontologies are a formal model to encode semantics
• Ontological techniques are critical in knowledge acquisition

Research Motivation – More Details
Why???
• In medical informatics, the abundance of digital data promises a profound impact on knowledge discovery and innovation
• Health scientists worldwide produce, access, analyze, integrate, and store massive amounts of digital medical data daily
• Such data is obtained through observation, experimentation, and simulation
• If we could effectively transfer and integrate data from all possible resources, we could obtain:
  ① a deeper understanding of all these data sets,
  ② better exposed knowledge, and
  ③ appropriate insights and actions
• Unfortunately, in many cases, the data users are not the data producers
• They thus face challenges in harnessing data in unforeseen and unplanned ways

Research Motivation – An Example Scenario
• The identification and characterization of the important roles microRNAs (miRNAs) play in human cancer is an increasingly active area
• In particular, it is very challenging to effectively identify miRNAs’ target genes
• Cancer patients’ prognosis depends largely on their chemosensitivity (sensitivity to chemotherapy)
• Research has discovered that some specific genes increase the permeability of the mitochondrial membrane (mitochondria are a cellular component), which in turn leads to apoptosis (cell death)
• As a result, the patient’s chemosensitivity will increase and the chemotherapy will be more effective
• Certain miRNAs can regulate the aforementioned genes and thus affect cancer patients’ prognosis
• If biologists were able to identify such miRNAs, a breakthrough in cancer treatment could be made
  Unfortunately, such identification is very difficult…

Research Motivation – An Example Scenario (cont.)
• Biologists need to extract a large number of candidate target genes from existing miRNA databases
• They also have to manually search resources other than miRNA databases for related information on every one of hundreds of candidate target genes:
  ① cellular component
  ② biological process
  ③ and so on…
• In a word, the whole process is time-consuming, error-prone, and subject to biologists’ limited prior knowledge
• In addition, the situation can be even worse:
  ① It is further aggravated by the great complexity and imprecise terminologies that characterize typical biological and biomedical research fields
  ② A great deal of variety has been identified in the adoption of different biological terms, along with different relationships among all these terms
  ③ Such variety has inhibited effective information acquisition by humans

Research Motivation – Summary
• The biological and medical research area is facing a challenging problem: knowledge discovery and sharing among distributed parties
• New methodologies are greatly needed to integrate heterogeneous data, and thereby to revolutionize traditional medical and biological research
• As a formal knowledge representation model, ontologies play a key role in defining formal semantics in traditional knowledge engineering
Conclusion: It is necessary to apply ontological techniques to biological and medical research

Definition of Ontologies
• The simplest definition: An ontology is a computational model (a.k.a. knowledge representation model) of some domain of the world
• It describes the semantics of the terms (a.k.a. concepts) used in the domain
• It is often captured in the form of a DAG (directed acyclic graph)
  What is a DAG then?
• Nodes represent ontology concepts while arcs represent their relationships
• May be augmented by rules, constraints, or functions
• In brief, ontologies aim to make explicit the knowledge contained within software applications for a particular domain:
  An ontology = a finite set of concepts + properties + relationships
• Such graphical structures are also known as ontology schemas
• Actual data sets contained in these schemas are referred to as instances
• Most real-world ontologies have very few or no instances at all

Ontology Engineering
• The creation and maintenance of ontologies in the domain of interest
• In other words, it focuses on the methodologies by which to build ontologies
• To create an ontology, three different approaches can be applied
  ① Top-down approach (knowledge driven)
  ② Bottom-up approach (data/inference driven)
  ③ Combination of top-down and bottom-up
• Languages to represent ontologies in computer systems
  ① OWL (Web Ontology Language) – most popular one
  ② Open Biological and Biomedical Ontologies (OBO)
  ③ Knowledge Interchange Format (KIF)
  ④ Open Knowledge Base Connectivity (OKBC)
• GUI tools for ontology engineering
  ① Protégé (by Stanford) – most popular one
  ② CmapTools (by IHMC)
  ③ OntoEdit (by Ontoprise)

Ontology Engineering (Protégé GUI – Upper Bio Ontology)
[figure]

Ontology Engineering (Example OWL File – Upper Bio Ontology)
[figure]

Ontology Heterogeneity
• Heterogeneity is an important, inherent characteristic of ontologies developed by different parties for the same (or similar) domains
• This is because ontologies reflect their designers’ different conceptual models of a domain
• The heterogeneous semantics may occur in different ways
  ① different terms could be used for the same concept;
  ② an identical term could be adopted for different concepts;
  ③ properties and relationships could be different
As a result, Ontology Matching has become an increasingly active topic

Ontology Matching
• “Ontology Matching” is short for “Ontology Schema Matching”
• Also known as “Ontology Alignment” or “Ontology Mapping”
• It refers to the process of determining correspondences between concepts from heterogeneous ontologies
• It aims to handle the aforementioned challenge of ontology heterogeneity
• Many different relationships will be involved
  ① equivalentWith
  ② subClassOf
  ③ superClassOf
  ④ siblings
  ⑤ and so on…

Current Ontology-Matching Algorithms
Rule-Based Matching
  ① Consider schema information alone
  ② Specify a set of rules
  ③ Apply them to schema information
Learning-Based Matching
  ① Consider both schema and instances
  ② Apply different machine learning techniques
Brief Introduction of Machine Learning
  ① A scientific discipline concerned with the design and development of algorithms
  ② These algorithms allow computers to change behavior based on “training data”
  ③ The major focus is to recognize complex patterns and make intelligent decisions

Pros and Cons for Current Approaches
Rule-Based Matching
  ① Is relatively fast (pro)
  ② Ignores instance information (con)
  ③ Uses ad hoc predefined weights (con)
     concept semantics: name + properties + relationships
Learning-Based Matching
  ① Obtains extra clues from instances (pro)
  ② Runs longer (con)
  ③ Has difficulty in getting sufficient instances (con)
     most real-world ontologies do not have instances

Ontological Techniques in Bio Research
• Ontological techniques have been widely applied to medical and biological research
• The most successful example is the Gene Ontology (GO) project
• The Unified Medical Language System (UMLS) and the National Center for Biomedical Ontology (NCBO) are two other successful examples
• In addition, efforts have been made toward ontology-based data integration in bioinformatics and medical informatics

Why the Gene Ontology (GO) Project?
• Biologists spend a lot of time and effort searching for all of the available information about each small area of research
• The search is further hampered by the wide variations in terminology that may be in common usage at any given time
• A simple example: if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis
• Suppose that one database describes these molecules as being involved in “translation”, whereas another uses the phrase “protein synthesis”
• It is then difficult for a human, let alone computer software, to find functionally equivalent terms
• To address the need for consistent descriptions of gene products in different databases, the GO began in 1998 as a collaboration between three model organism databases (Flies, Saccharomyces, and Mouse)
• The GO Consortium has grown to include many databases, including several of the world’s major repositories for plant, animal, and microbial genomes

Three Sub-Ontologies in the GO
• Cellular Component, Biological Process, and Molecular Function
• A gene product might:
  ① be associated with or located in one or more cellular components;
  ② be active in one or more biological processes;
  ③ during which it performs one or more molecular functions
Example
The gene product cytochrome c can be described by:
  ① the molecular function term “oxidoreductase activity”
  ② the biological process terms “oxidative phosphorylation” and “induction of cell death”
  ③ the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”

GO Structure
• The GO ontology is essentially a Hierarchy-Like DAG
• In other words, each node is a GO term, and each arc represents a relationship between two GO terms
• Directed feature
  For example, a mitochondrion is an organelle, but not vice versa
• Acyclic feature (cycles are not allowed)
  For example, it is inappropriate to specify that “A1 is an A2”, “A2 is an A3”, …, “Ai is an A1”
• Hierarchy-Like feature (generalized-specialized relationship plus possibly multiple parents)
  For example, the biological process term hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process (biosynthetic process is a type of metabolic process and a hexose is a type of monosaccharide)

An Example GO Diagram
[figure]

Three Relationships in the GO
• The GO ontology defines three different relationships among terms, each represented by its own symbol in GO diagrams
  ① is a, a.k.a. is a subtype of;
  ② part of; and
  ③ regulates
Note that regulates includes two sub-relationships, i.e., negatively regulates and positively regulates, each also represented by its own symbol

is a Relationship in the GO
• If A is a B, it means that A is a subtype of B
  ① For example, mitotic cell cycle is a cell cycle
  ② Another example: lyase activity is a catalytic activity
• The is a relationship differs from “is an instance of” (which denotes a specific example of something). For example:
  ① A cat is a mammal
  ② George is an instance of a cat; therefore, in the GO sense, the claim that “George is a cat” is incorrect
  ③ However, it is safe to claim that every instance of a cat is also an instance of a mammal

Reasoning over is a Relationship
• The is a relationship is transitive: if A is a B and B is a C, then A is a C
• Example [figure]

part of Relationship in the GO
• B is part of A, meaning that the presence of B implies the presence of A
• But not vice versa, i.e., given the presence of A, we cannot conclude the presence of B
• In other words
  ① all B are part of A
  ② but only some A have part B
• Example [figure]

Reasoning over part of Relationship (1)
• The part of relationship is also transitive: if A is part of B and B is part of C, then A is part of C
• Example [figure]

Reasoning over part of Relationship (2)
• part of followed by is a: if A is part of B and B is a C, then A is part of C
• Example [figure]

Reasoning over part of Relationship (3)
• part of following is a: if A is a B and B is part of C, then A is part of C
• Example [figure]

Reasoning over part of Relationship (4)
• The aforementioned logical rules regarding the part of and is a relationships hold no matter how many intervening is a and part of relationships there are
• Example [figure]

regulates Relationship in the GO
• B regulates A, meaning that whenever both A and B are present, B always regulates A
• But not vice versa, i.e., A is not always regulated by B
• In other words
  ① all B regulate A
  ② but only some A are regulated by B
• Example [figure]

Reasoning over regulates Relationship (1)
• Both negatively regulates and positively regulates imply regulates
• Example [figure]

Reasoning over regulates Relationship (2)
[figure]

Reasoning over regulates Relationship (3)
[figure]

Reasoning over regulates Relationship (4)
[figure]

Reasoning over regulates Relationship (5)
• Example [figure]

Reasoning over regulates Relationship (6)
• Example [figure]

Reasoning over regulates Relationship (7)
• Example [figure]

Ongoing Research: OMIT Project
http://omit.cis.usouthal.edu/
Besides Sun Lab at CUHK, there are five other collaborating labs from around the world

Project Overview
• An innovative computing framework based on the Ontology for MicroRNA Target Prediction (OMIT) to handle the aforementioned challenge in predicting miRNAs’ target genes
• The OMIT is a domain-specific ontology that facilitates knowledge discovery and sharing from existing sources
• The long-term research objective of the OMIT framework is to assist biologists in unraveling important roles of miRNAs in human cancer, and thus to help clinicians in making sound decisions when treating cancer patients
• We aim to synthesize data from existing source miRNA databases into a comprehensive conceptual model that permits an emphasis on data semantics. Consequently, a more accurate, complete view of miRNAs’ biological functions can be acquired
• We thus provide users with a single query engine that takes their needs in a nonprocedural specification format

System Framework
[figure]
Five Tasks in the OMIT Project
  ① To develop a miRNA-domain-specific ontology that contains a set of OMIT concepts, along with the relationships among these concepts
  ② To align the OMIT with the GO so that gene-related information can be automatically acquired and integrated
  ③ To annotate source miRNA databases with OMIT concepts so that existing databases are enriched with formal semantics
  ④ To integrate OMIT-annotated miRNA databases into a centralized RDF data warehouse
  ⑤ To perform complicated search/query in a unified style so that deep knowledge can be obtained out of a wealth of miRNA data

An Example Research Scenario
Suppose a cancer biologist is interested in investigating the chemosensitivity of breast cancer cells
• By comparing chemosensitive and chemoresistant cancer cells, it is demonstrated that miR-125b, a specific miRNA, may confer increased chemosensitivity on cancer cells
• After the OMIT system obtains candidate targets for miR-125b, the gene information of these targets will be further acquired, including cellular localization (e.g., in mitochondria) and biological process (e.g., apoptosis)
• The availability of such integrated knowledge will make it much easier for the cancer biologist to deduce the actual targets for miR-125b
• As a result, a breakthrough in breast cancer treatment may become possible

A Typical Knowledge Acquisition Cycle
• Steps 1-3: the user initiates a search/query; the recognized miRNA concept is used to query the RDF data warehouse
• Steps 4-5: miRNA targets are retrieved and utilized to acquire more gene information
• Steps 6-8: miRNA targets and their related gene information are returned to the user
Corresponding RDF-based query:
  SELECT DISTINCT OMIT:targetGene
  FROM OMIT:miRNA, GO-CC:cellComponent, GO-BP:bioProcess
  WHERE OMIT:miRNA ID = “miR-125b”
    AND OMIT:miRNA targetID = GO-CC:cellComponent geneID
    AND OMIT:miRNA targetID = GO-BP:bioProcess geneID
    AND GO-CC:cellComponent localization = “mitochondria”
    AND GO-CC:cellComponent permeabilityIncrease = “yes”
    AND GO-BP:bioProcess apoptosisIncrease = “yes”
  USING NAMESPACE
    OMIT = <http://omit.cis.usouthal.edu/ontology/OMIT.owl>,
    GO-CC = <http://www.geneontology.org/formats/oboInOwl#>,
    GO-BP = <http://www.geneontology.org/formats/oboInOwl#>.

Top-Level OMIT Concepts
[figure]

Expanded View of OMIT Concepts (Portion)
[figure]

Linkage between the OMIT and the GO
• Some OMIT concepts are directly inherited and extended from GO concepts
  For example, OMIT concept GeneExpression is designed to describe miRNAs’ regulation of gene expression. This concept is inherited from concept gene expression in the BiologicalProcess ontology. This way, subclasses of gene expression, such as negative regulation of gene expression, are accessible in the OMIT for describing the negative gene regulation of the miRNAs in question
• Some OMIT concepts are equivalent to (or similar to) GO concepts
  For example, OMIT concept PathologicalEvent and its subclasses are designed to describe biological processes that are disturbed when a cell becomes cancerous. Although not immediately inherited from any specific GO concepts, these OMIT concepts do match up with certain concepts in the BiologicalProcess ontology.
  OMIT concepts TargetGene and Protein are two other examples, which correspond to individual genes and individual gene products, respectively, in the GO

OMIT GUI Design
[figure]

OMIT Summary
• It is an innovative computing framework based on the miRNA-domain-specific ontology
• It aims to handle the challenge of predicting miRNAs’ target genes
• The OMIT is the very first ontology in the miRNA domain
• It will assist biologists in unraveling important roles of miRNAs in human cancer, and thus help clinicians in making sound decisions when treating cancer patients
• This long-term research goal will be achieved by facilitating knowledge discovery and sharing from existing sources
• The first version of the OMIT ontology has been added to NCBO BioPortal (http://bioportal.bioontology.org/ontologies/42873)
• Updates are available at the project website: http://omit.cis.usouthal.edu/

Summary
• Knowledge discovery and sharing is critical in biological and medical research
• As a formal knowledge representation model, ontologies are of great help in defining formal semantics
• Ontological techniques have been widely applied in bioinformatics and medical informatics
• The most successful example is the Gene Ontology (GO) project
• Our ongoing project, OMIT, aims to investigate the challenging issue of miRNA target prediction in human cancer

• Suggestions?
• Comments?
• Questions?

Thank you!!!
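
Reasoning over is a and part of – A Code Sketch
The reasoning slides state that is a and part of are each transitive, and that these rules hold across any number of intervening is a and part of links. The Python sketch below is one possible way to illustrate that idea; it is not how the GO tools or the OMIT system implement reasoning. The composition table follows the GO convention that chaining is a with part of (in either order) yields part of. The names COMPOSE, EDGES, and inferred_relations are hypothetical, and the edges are a simplified toy graph built from terms mentioned in this talk, not a copy of the real GO.

```python
from collections import deque

# Result of chaining two relationships (first link, second link).
# Assumption: is_a combined with part_of, in either order, yields part_of.
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("part_of", "part_of"): "part_of",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
}

# Toy DAG: term -> list of (relationship, parent term).
# Illustrative edges only; they are simplified and not taken from the GO files.
EDGES = {
    "mitotic cell cycle": [("is_a", "cell cycle")],
    "hexose biosynthetic process": [
        ("is_a", "hexose metabolic process"),
        ("is_a", "monosaccharide biosynthetic process"),
    ],
    "monosaccharide biosynthetic process": [("is_a", "biosynthetic process")],
    "biosynthetic process": [("is_a", "metabolic process")],
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("part_of", "cell")],
}


def inferred_relations(term, edges=EDGES):
    """Return every (relationship, ancestor) pair derivable from `term`."""
    found = set()
    queue = deque(edges.get(term, []))
    while queue:
        rel, node = queue.popleft()
        if (rel, node) in found:
            continue  # already derived; avoids repeated work on shared parents
        found.add((rel, node))
        # Extend the chain by one more link and compose the two relationships.
        for next_rel, parent in edges.get(node, []):
            combined = COMPOSE.get((rel, next_rel))
            if combined is not None:
                queue.append((combined, parent))
    return found


if __name__ == "__main__":
    for rel, ancestor in sorted(inferred_relations("mitochondrial inner membrane")):
        print("mitochondrial inner membrane", rel, ancestor)
```

Because the graph is acyclic and each derived pair is recorded once, the search always terminates; a term with two parents, such as hexose biosynthetic process, simply contributes both branches.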