Last Edited: Paula de Matos (November 2009)

Understanding the ChEBI ontology

This tutorial introduces the purpose and structure of ontologies within the biosciences, and the ChEBI ontology in particular. You will learn how the ChEBI ontology is organised into its three sub-ontologies, and about the particular ontology relationships which are used within ChEBI.

This is the third of four training blocks in the ChEBI training course.

Block 1 – Introduction to ChEBI
Block 2 – Searching and browsing ChEBI
Block 3 – Understanding the ChEBI ontology
Block 4 – Download and programmatic access

Contents

Understanding the ChEBI ontology
    Contents
Introduction to Ontologies in Bioinformatics
    What is an ontology?
    Bioinformatics Data
    Ontologies in Bioinformatics
The ChEBI ontology
    Exploring the ChEBI sub-ontologies
    ChEBI ontology relationships
    Viewing the ChEBI ontology online
    Ontology Lookup Service (OLS)
    Worked example: tryptophan
The OBO File format
For more information
Exercises

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Introduction to Ontologies in Bioinformatics

What is an ontology?

The term ‘ontology’ derives from a branch of philosophy, in which it means the theory or study of being as such. The word ontology comes from the Greek ontos for being and logos for word. It is a relatively new term in the long history of philosophy, introduced to distinguish the study of being as such from the study of various kinds of being in the natural sciences. The term in common use before was Aristotle's word category, which he used for classifying anything that can be asserted about anything. Figure 1 shows an extract of Aristotle’s classification of animals, which he arrived at through careful observation.

Figure 1 Aristotle's classification of animals

Within the field of computer science, the term ontology is used to mean an explicit specification of a conceptualisation (Tom Gruber, 1993). Within the world of the computer, what can ‘be’ is really what can be ‘specified’.
The explicit specification of a conceptualisation within a particular domain is usually represented by a set of objects of that domain (instances) and their attributes, reflected in a representational vocabulary, organised by relationships into a classification which may take the form of a hierarchy or graph. As illustrated in the example below, we can define a vocabulary in the domain of ‘animals’ which represents the instances and some classes and attributes within the domain, and then define the relationships between the objects, classes and attributes. By so doing, we arrive at an explicit representation of knowledge, from which we can derive conceptual meaning.

Figure 2 Ontology as an explicit specification of a conceptualisation

The significance of the hierarchical classification within an ontology is that a class at a higher level subsumes a class at a lower level. That is, any attribute of the class at the higher level is also an attribute of every class or instance at a lower level. By such organisation, each attribute usually only has to be specified once, at the highest level of the ontology at which it is relevant, and it is then understood to apply to all child classes and instances. This results in a very powerful representational structure.

Moving towards meaning: Semantics

The current World Wide Web has allowed the distribution and dissemination of information at previously unheard-of levels. In the past, knowledge building happened largely in parallel in various different locations, and dissemination was slow. With the advent of the World Wide Web, global dissemination may be instantaneous, and knowledge building is increasingly happening as a global enterprise. However, while the information represented on the web is usually easily understandable by humans, it may not be so for computers.
For example, consider the following table:

Which may be represented on a standard Web page in standard HTML as:

When asked the question “What are the names of all the animals of type Mammal?”, a human can answer easily, but a computer may be unable to do so without brittle programming that depends on accidents of layout and would break if the layout changed even slightly.

Tim Berners-Lee proposed a vision of the semantic web as “An extension of the World Wide Web in which information is given a well-defined meaning, better enabling computers and people to work in cooperation”. When it comes to dealing with very large volumes of data, such as those found in the field of Bioinformatics, most tasks (such as finding all instances of a certain type of data which correspond to a certain category) require the cooperation of humans and computers to perform adequately.

To better enable this, in contrast to the above example, data needs to be associated with semantics. The association of semantics (in this case through the use of XML tags) allows computers, as well as humans, to resolve the meaning of the data.

Moving towards meaning: Standards

Even if we associate semantics with our data as discussed above, there is still a problem of interoperability if different datasets which actually represent the same or similar data are associated with semantics in different ways. For example:

In this situation, a computer will not immediately be able to determine any relationship between the two datasets, nor answer questions accurately across the two datasets (such as performing a search across both datasets). It is clear that we need not only to associate semantics with our data, but also to agree on standard ways in which to represent those semantics.
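The interoperability problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two XML fragments, their tag names, and the helper `names_of_type` are hypothetical and are not the examples from the original figures. The sketch shows that a query written against one set of tags answers the question “which animals are of type Mammal?” for the first dataset, finds nothing in the second, and only works across both once the differing tag names are reconciled by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two sources that record the same kind of facts
# but label them with different, unagreed tag names.
DATASET_A = """\
<animals>
  <animal><name>Dog</name><type>Mammal</type></animal>
  <animal><name>Eagle</name><type>Bird</type></animal>
  <animal><name>Whale</name><type>Mammal</type></animal>
</animals>
"""

DATASET_B = """\
<creatures>
  <creature><label>Cat</label><class>Mammal</class></creature>
  <creature><label>Trout</label><class>Fish</class></creature>
</creatures>
"""


def names_of_type(xml_text, record_tag, name_tag, type_tag, wanted):
    """Return the names of all records whose type element equals `wanted`."""
    root = ET.fromstring(xml_text)
    return [
        record.findtext(name_tag)
        for record in root.iter(record_tag)
        if record.findtext(type_tag) == wanted
    ]


# The semantic tags let a program answer the question for dataset A.
mammals_a = names_of_type(DATASET_A, "animal", "name", "type", "Mammal")

# The same query finds nothing in dataset B, because B uses other tag names.
missed_b = names_of_type(DATASET_B, "animal", "name", "type", "Mammal")

# Only a hand-written mapping of B's tags recovers the answer.
mammals_b = names_of_type(DATASET_B, "creature", "label", "class", "Mammal")

print(mammals_a, missed_b, mammals_b)
```

The per-dataset tag mapping in the last call is exactly the kind of ad hoc effort that an agreed, shared vocabulary is intended to remove.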
An ontology is an attempt to explicitly specify terms and their associated semantic meaning, which represent an agreement or standardisation within a particular field of application or domain of knowledge. This has particular relevance in the field of Bioinformatics, where the volumes of data are so vast, and are generated in such geographically dispersed efforts, that the interpretation and representation of knowledge relies heavily on the cooperation of humans and computers.

Bioinformatics Data

The core bioinformatics data is made up of protein and nucleotide sequences, along with measurements of three-dimensional structures. There are many levels above this core data, which consist of the representation of knowledge at various different levels, such as the knowledge about whole genomes, gene expression patterns, protein families and interactions, systems biology and pathway models.

Figure 3 Bioinformatics Data

At each different level and in each different database, knowledge is represented and enhanced by means of annotations, which are usually captured in the form of human-readable free text.

Annotation of bioinformatics data

Annotation of bioinformatics data is essential for capturing and transmitting the knowledge associated with data in bioinformatics databases. All data in the databases other than the core data (sequences etc.), such as names, descriptions, or literature references, is annotation. Annotations are often captured in the form of free text, which is easy for a human audience to read and understand but difficult for computers to parse. Free text can also vary in quality from database to database, and can use different terminology to mean the same thing (even within the same database if, for example, different human annotators used different terminology).
An example of annotation within the UniProt knowledgebase, taken from UniProt accession number P00325, alcohol dehydrogenase 1B (http://beta.uniprot.org/uniprot/P00325), is shown in Figure 4.

Figure 4 Annotation in free text

Additionally, in many databases, efforts are underway to extend the annotation process with computer assistance, by means of automatic annotation. This usually involves extending the human annotation of a core subset of data to a larger set of data, using computer algorithms to determine whether the human annotations apply to similar data in the larger dataset.

Several ontologies within the field of bioinformatics have been created in order to address the need for the standardisation and specification of the meaning of terminology which is used in annotation.

Ontologies in Bioinformatics

Some examples of common ontologies within the field of bioinformatics are discussed below.

NCBI Taxonomy

The NCBI Taxonomy database is a curated controlled vocabulary of the Linnaean names of organisms which have been genetically sequenced, organised into a hierarchy using Is A relationships. For example, the abbreviated NCBI taxonomy of Homo sapiens can be represented as:

Figure 5 Taxonomy of Homo sapiens

In this case, the taxonomy forms a strict hierarchy of parent-child relationships. However, in general, ontologies need not be confined to this format, and indeed, structures which allow relationships to multiple parents (such as the Directed Acyclic Graph structure) are common.

Enzyme Taxonomy

Enzyme classification also takes the form of a hierarchical taxonomy in which each enzyme is classified at four levels of depth. For example, the classification of the enzyme flavonol 3-sulfotransferase is given below.

Figure 6 Enzyme Taxonomy

Gene Ontology

The Gene Ontology Consortium (http://www.geneontology.org/) develops and maintains the Gene Ontology, which provides a controlled vocabulary to describe gene and gene product attributes in any organism.
It is organised by three organising principles, namely ‘molecular function’, ‘biological process’ and ‘cellular component’.

The ChEBI ontology

The ChEBI ontology is an ontology for biologically interesting chemistry. It consists of three sub-ontologies, namely:

• Molecular Structure, in which molecular entities or parts thereof are classified according to their structure;
• Role, in which entities are classified on the basis of their role within a biological context, e.g. as antibiotics, antiviral agents, coenzymes, enzyme inhibitors, or on the basis of their intended use by humans, e.g. as pesticides, detergents, healthcare products, fuel; and
• Subatomic Particle, in which particles smaller than atoms are classified.

Figure 7 ChEBI ontology for (R)-adrenaline

Exploring the ChEBI sub-ontologies

We will take a brief look at the kinds of data you will find classified under each of the three sub-ontologies. (By “classified under”, we mean “has an unbroken is a relationship path with”).

Molecular structure

Molecular entities with defined connectivity are classified under the molecular structure sub-ontology. These include the chemical compounds which themselves could exist in some form in the real world, such as drugs, vitamins, insecticides, and different forms of alcohol.

In addition, classes of molecular entities are classified under the molecular structure sub-ontology. A class may be structurally defined, but it does not represent a single structural definition; rather, it is a generalisation of the structural features which all members of that class share.

It is often useful to define the interesting parts of molecular entities as groups. Groups have a defined connectivity with one or more specified attachment points.

Figure 8 Molecular structure ontology

Role

The role sub-ontology is further divided into two distinct types of role, namely biological role and application.
Roles do not themselves have structures; rather, items in the role ontology are linked to the molecular entities which have those roles.

Figure 9 Role ontology

Subatomic particle

The subatomic particle sub-ontology is the smallest sub-ontology graph, consisting only of those particles which are smaller than an atom.

ChEBI ontology relationships

The ChEBI ontology uses two generic ontology relationships, namely

• Is a: Entity A is an instance of Entity B. For example, chloroform is a chloromethanes.
• Has part: Indicates the relationship between part and whole; for example, tetracyanonickelate(2−) is part of potassium tetracyanonickelate(2−).

Figure 10 ChEBI ontology generic relationships

In addition, the ChEBI ontology contains several chemistry-specific relationships which are used to convey additional semantic information about the entities in the ontology. These are:

• Is conjugate base of and is conjugate acid of: Cyclic relationships used to connect acids with their conjugate bases, for example, pyruvic acid is the conjugate acid of the pyruvate anion, while pyruvate is the conjugate base of the acid.
• Is tautomer of: Cyclic relationship used to show the relationship between two tautomers, for example, L-serine and its zwitterion are tautomers.
• Is enantiomer of: Cyclic relationship used when two entities are mirror images of, and non-superposable upon, each other. For example, D-alanine is enantiomer of L-alanine and vice versa.
• Has functional parent: Denotes the relationship between two molecular entities or classes, one of which possesses one or more characteristic groups from which the other can be derived by functional modification. For example, 16α-hydroxyprogesterone can be derived by functional modification (i.e. 16α-hydroxylation) of progesterone.
• Has parent hydride: Denotes the relationship between an entity and its parent hydride, for example, 1,4-naphthoquinone has parent hydride naphthalene.
• Is substituent group from: Indicates the relationship between a substituent group/atom and its parent molecular entity, for example, the L-valino group is derived by a proton loss from the N atom of L-valine.
• Has role: Denotes the relationship between a molecular entity and the particular behaviour which the entity may exhibit either by nature or by human application, for example, morphine has role opioid analgesic.

Figure 11 ChEBI chemistry-specific relationships

The structural meaning of the chemical ontology relationships is further illustrated below.

[Figure panels: Is Conjugate Base Of / Is Conjugate Acid Of; Is Tautomer Of; Is Enantiomer Of; Has Functional Parent; Has Parent Hydride; Is Substituent Group From; A set of family relationships]

Viewing the ChEBI ontology online

The ChEBI ontology is shown as part of the main entry view of a ChEBI entry. For example, the ChEBI entry for L-cysteine is shown below.

Figure 12 ChEBI entry for L-cysteine

Scrolling down reveals the section titled ‘ChEBI Ontology’, in which the parent and children relationships are listed. The default view is the parents and children view; clicking on the link marked ‘Tree View’ displays the full hierarchical tree of relationships to this term.

Figure 13 Ontology in parents and children view (callout: Tree view)

Figure 14 Ontology in tree view

The ChEBI ontology may also be browsed. To access the browse facility, select the ‘browse’ link from the main left-hand menu bar (callout: Browse the ontology). This link leads to the Ontology Lookup Service, an ontology browsing and searching utility which provides access to several different ontologies within the bioinformatics field.

Ontology Lookup Service (OLS)

The Ontology Lookup Service is a facility which provides a centralised query interface for ontology and controlled vocabulary lookup. It is available online at http://www.ebi.ac.uk/ontology-lookup/.
The link to browse the ChEBI ontology opens the following screen:

Figure 15 Ontology Lookup Service - ChEBI ontology (callout: Browse the three ChEBI sub-ontologies)

The Ontology Lookup Service can integrate any ontology which is available in OBO format, and at present (as at the last release) contains 61 ontologies, including

• GO
• ChEBI
• Molecular interaction (PSI MI)
• Pathway ontology (PW)
• Human disease (DOID)
• and many more…

OLS provides facilities for searching and browsing ontologies, and for displaying a graph of terms and the relationships between them, similar to what AmiGO displays for GO.

Figure 16 DOID term 'Mental Retardation' and graph

Worked example: tryptophan

In this example we will browse the ChEBI ontology surrounding the chemical tryptophan by using the Ontology Lookup Service.

Step 1: Search

Open the Ontology Lookup Service home page at http://www.ebi.ac.uk/ontology-lookup/. Select the Chemical Entities of Biological Interest ontology from the ontology selection drop-down box. Type the first few letters of the term ‘tryptophan’ in the search box. You will see that the search is performed in the background and a list of matching terms is displayed in a drop-down.

[Screenshot callouts: 1. Select ChEBI ontology; 2. Type first few letters of search term]

Step 2: Select term

Select the term Tryptophan [CHEBI:27897]. Additional information about the term is displayed below the search box.

[Screenshot callouts: 3. Additional information defined in ChEBI OBO file; 4. Click ‘Browse’]

Step 3: Browse ontology

After selecting the term, click ‘browse’. You are taken to the ontology browser with the relevant term displayed as the root.

[Screenshot callouts: 5. Browse full ontology; 6. Browse with tryptophan as root]

Step 4: Viewing graphical tree for this entry

Scrolling down and to the right reveals the graphical tree for this entry.

[Screenshot callout: 7. Tryptophan]

The OBO File format

The Open Biomedical Ontologies (OBO) is an umbrella organisation for ontologies and structured shared controlled vocabularies for use across all biological and biomedical domains (http://sourceforge.net/projects/obo). This organisation has defined the OBO file format, which is an ontology representation format designed specifically with the following goals in mind:

• Human readability
• Ease of parsing
• Extensibility
• Minimal redundancy

The OBO format models a subset of the concepts modelled in OWL (Web Ontology Language) with extensions for metadata (such as synonyms).

An extract from the ChEBI Ontology downloaded in OBO format illustrates the overall layout of data within the OBO file. The first block of the file contains general header information about that particular file, and is followed by one or more terms separated by blank lines. Each term contains an identifier, a name, a definition, and relationships to other terms within the ontology, and may contain additional metadata such as synonyms.

    format-version: 1.2
    date: 28:01:2009 05:57
    saved-by: pmatos
    default-namespace: chebi_ontology
    remark: ChEBI subsumes and replaces the Chemical Ontology first
    remark: developed by Michael Ashburner & Pankaj Jaiswal.
    remark: Author: ChEBI curation team
    remark: ChEBI Release version 53
    remark: For any queries contact chebi-help@ebi.ac.uk
    synonymtypedef: IUPAC_NAME "IUPAC NAME"
    synonymtypedef: FORMULA "FORMULA"
    synonymtypedef: SMILES "SMILES"
    synonymtypedef: InChI "InChI"
    synonymtypedef: InChIKey "InChIKey"
    synonymtypedef: BRAND_NAME "BRAND NAME"
    synonymtypedef: INN "INN"

    [Term]
    id: CHEBI:24431
    name: molecular structure
    def: "A description of the molecular entity or part thereof based on its composition and/or the connectivity between its constituent atoms." []

    [Term]
    id: CHEBI:23367
    name: molecular entities
    def: "A molecular entity is any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity." []
    synonym: "entidad molecular" RELATED [IUPAC:]
    synonym: "entidades moleculares" RELATED [IUPAC:]
    synonym: "entite moleculaire" RELATED [IUPAC:]
    synonym: "molecular entity" EXACT IUPAC_NAME [IUPAC:]
    synonym: "molekulare Entitaet" RELATED [ChEBI:]
    is_a: CHEBI:24431

    [Term]
    id: CHEBI:24870
    name: ions
    def: "An ion is a molecular entity having a net electric charge." []
    synonym: "ion" EXACT IUPAC_NAME [IUPAC:]
    is_a: CHEBI:23367

Notes on the extract:

• General header information: the lines before the first [Term].
• Synonym types used in terms: the synonymtypedef lines.
• Synonyms in OBO format may be ‘related’ or ‘exact’. In the ChEBI OBO file, IUPAC names are considered ‘exact’ synonyms and all others ‘related’.
• Relationships to other terms: the is_a lines.

The OBO Foundry

“The OBO Foundry is a collaborative experiment involving developers of science-based ontologies who are establishing a set of principles for ontology development with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.”

The OBO Foundry can be found at http://www.obofoundry.org/

For more information

For further information, email the ChEBI team at chebi-help@ebi.ac.uk, or log on to the SourceForge forum at https://sourceforge.net/projects/chebi/.

Additional information about using ChEBI can be found in the User Manual at http://www.ebi.ac.uk/chebi/userManualForward.do.

The latest news, updates and developments are announced via an RSS feed.

Exercises

You will need access to ChEBI online at http://www.ebi.ac.uk/chebi to complete these exercises.

1. Dichlorvos (CHEBI:34690) is a well-known insecticide. Open the ChEBI entry for dichlorvos (CHEBI:34690). Scroll down the entry to the “ChEBI ontology”. Can you determine whether it can also be used as a fungicide?
______________________________________________________________________

2. On the same entry as above (CHEBI:34690), click on the “Tree View” to display the entire ontology tree. Follow the tree path from dichlorvos to its parent organophosphate insecticide (CHEBI:25708). Click on this parent organophosphate insecticide (CHEBI:25708). This brings you to the ontology view of the parent. Looking at the children of this entry, can you write down any other insecticide?
______________________________________________________________________

3. The following term appears in the ChEBI OBO file.

    [Term]
    id: CHEBI:32762
    name: L-tyrosinium
    synonym: "(1S)-1-carboxy-2-(4-hydroxyphenyl)ethanaminium" RELATED [IUPAC:]
    synonym: "L-tyrosine cation" RELATED [JCBN:]
    synonym: "L-tyrosinium" EXACT IUPAC_NAME [IUPAC:]
    synonym: "C9H12NO3" RELATED FORMULA [ChEBI:]
    synonym: "[NH3+][C@@H](Cc1ccc(O)cc1)C(O)=O" RELATED SMILES [ChEBI:]
    synonym: "InChI=1/C9H11NO3/c10-8(9(12)13)5-6-1-3-7(11)4-2-6/h1-4,8,11H,5,10H2,(H,12,13)/p+1/t8-/m0/s1/fC9H12NO3/h10,12H/q+1" RELATED InChI [ChEBI:]
    xref: Gmelin:1150138 "Gmelin Registry Number"
    is_a: CHEBI:32786
    relationship: is_enantiomer_of CHEBI:32775
    relationship: is_conjugate_acid_of CHEBI:17895

Can you describe the relationships that this term has to other entities within the ontology?
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________

4. Using the Advanced Search, can you find all entries which have the application ‘pharmaceutical’ (CHEBI:52217)? How many entries are there? Can you name one?
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________

5. Using the Advanced Search, can you find all entries which have the role ‘epitope’ (CHEBI:53000)? How many entries are there? Can you name one?
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
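
Supplementary sketch: reading OBO terms programmatically

The term layout described in the OBO File format section, and the meaning of “classified under” (an unbroken is a relationship path), can be illustrated with a short Python sketch. It is a minimal illustration, assuming only the id, name, is_a and relationship tags shown in this tutorial; it is not a complete OBO parser, and the function names `parse_obo_terms` and `ancestors` are hypothetical. The embedded text is a trimmed copy of the three terms from the extract shown earlier.

```python
# Trimmed copy of the tutorial's OBO extract (definitions and synonyms omitted).
OBO_EXTRACT = """\
format-version: 1.2
default-namespace: chebi_ontology

[Term]
id: CHEBI:24431
name: molecular structure

[Term]
id: CHEBI:23367
name: molecular entities
is_a: CHEBI:24431

[Term]
id: CHEBI:24870
name: ions
is_a: CHEBI:23367
"""


def parse_obo_terms(text):
    """Collect [Term] stanzas into a dict keyed by term id (sketch only)."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "[Term]":
            current = {"name": None, "is_a": [], "relationship": []}
            continue
        if line.startswith("["):
            current = None  # some other stanza type; ignored in this sketch
            continue
        if current is None or ":" not in line:
            continue  # header lines and blank lines
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value.split()[0])
        elif tag == "relationship":
            rel_type, target = value.split()[:2]
            current["relationship"].append((rel_type, target))
    return terms


def ancestors(terms, term_id):
    """Return ids reachable through an unbroken chain of is_a links."""
    found = []
    queue = list(terms[term_id]["is_a"])
    while queue:
        parent = queue.pop(0)
        if parent in found:
            continue
        found.append(parent)
        queue.extend(terms.get(parent, {}).get("is_a", []))
    return found


terms = parse_obo_terms(OBO_EXTRACT)
for parent_id in ancestors(terms, "CHEBI:24870"):
    print(parent_id, terms[parent_id]["name"])
```

Run on this extract, the walk from ions (CHEBI:24870) reaches molecular entities (CHEBI:23367) and then molecular structure (CHEBI:24431), which is what it means for ions to be classified under the molecular structure sub-ontology.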