E2c ChEBI

Last Edited: Paula de Matos (November 2009)
Understanding the ChEBI ontology
This tutorial covers an introduction to the purpose and structure of ontologies within biosciences,
and the ChEBI ontology in particular. You will learn about the organisation of the ChEBI
ontology into its three sub-ontologies, and you will learn about the particular ontology relationships
which are used within ChEBI.
This is the third of four training blocks in the ChEBI training course.
Block 1 – Introduction to ChEBI
Block 2 – Searching and browsing ChEBI
Block 3 – Understanding the ChEBI ontology
Block 4 – Download and programmatic access
Contents
Understanding the ChEBI ontology
Contents
Introduction to Ontologies in Bioinformatics
What is an ontology?
Bioinformatics Data
Ontologies in Bioinformatics
The ChEBI ontology
Exploring the ChEBI sub-ontologies
ChEBI ontology relationships
Viewing the ChEBI ontology online
Ontology Lookup Service (OLS)
Worked example: tryptophan
The OBO File format
For more information
Exercises
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543
Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Introduction to Ontologies in Bioinformatics
What is an ontology?
The term ‘ontology’ derives from a branch of Philosophy, in which it means the theory or study of
being as such. The word ontology comes from the Greek ontos for being and logos for word. It is a
relatively new term in the long history of philosophy, introduced to distinguish the study of being
as such from the study of various kinds of being in the natural sciences. The term in common use
before was Aristotle's word category, which he used for classifying anything that can be asserted
about anything. Below is an extract of Aristotle’s classification of animals which he arrived at
through careful observation (Figure 1).
Figure 1 Aristotle's classification of animals
Within the field of computer science, the term ontology is used to mean an explicit specification of
a conceptualisation (Tom Gruber, 1993). Within the world of the computer, what can ‘be’ is really
what can be ‘specified’. The explicit specification of a conceptualisation within a particular
domain is usually represented by a set of objects of that domain (instances) and their attributes,
reflected in a representational vocabulary, organised by relationships into a classification which
may take the form of a hierarchy or graph.
As illustrated in the example below, we can define a vocabulary in the domain of ‘animals’ which
represents the instances and some classes and attributes within the domain, and then define the
relationships between the objects, classes and attributes. By so doing, we arrive at an explicit
representation of knowledge, from which we can derive conceptual meaning.
Figure 2 Ontology as an explicit specification of a conceptualisation
The significance of the hierarchical classification within an ontology is that a class at a higher
level subsumes a class at a lower level. That is, any attribute of the class at the higher level is an
attribute of a class or instance at a lower level. By such organisation, each attribute usually only
has to be specified once, at the highest level of the ontology at which it is relevant, and it is then
understood to apply to all child classes and instances. This results in a very powerful
representational structure.
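The subsumption idea described above can be sketched in a few lines of Python. This is only an illustration: the class names (animal, mammal, dog), their attributes and the helper attributes_of are hypothetical and are not taken from Figure 2.

```python
# Hypothetical mini-ontology: each class names its parent (None for the root)
# and lists only the attributes that are introduced at that level.
PARENT = {"animal": None, "mammal": "animal", "dog": "mammal"}
OWN_ATTRIBUTES = {
    "animal": {"is_alive"},
    "mammal": {"has_fur"},
    "dog": {"barks"},
}


def attributes_of(term):
    """Collect the attributes of a term and of every class that subsumes it."""
    collected = set()
    while term is not None:
        collected |= OWN_ATTRIBUTES.get(term, set())
        term = PARENT[term]
    return collected


print(sorted(attributes_of("dog")))  # ['barks', 'has_fur', 'is_alive']
```

Each attribute is stored once, at the highest class where it is relevant, and is found for lower classes by walking up the hierarchy.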
Moving towards meaning: Semantics
The current World Wide Web has allowed the distribution and dissemination of information at
previously unheard-of levels. Whereas in the past, knowledge building happened largely in
parallel in various different locations, and dissemination was slow, with the advent of the World
Wide Web, global dissemination may be instantaneous, and knowledge building is increasingly
happening as a global enterprise. However, while the information represented on the web is
usually easily understandable by humans, it may not be so for computers. For example, consider
the following table:
Which may be represented on a standard Web page in standard HTML as:
When asked to answer the following question,
“What are the names of all the animals of type Mammal?”
humans can answer it easily, but computers may not be able to without very contingent
programming based on accidents of layout, which would be broken if the layout changed even
slightly.
Tim Berners-Lee proposed a vision of the semantic web as “An extension of the World Wide Web
in which information is given a well-defined meaning, better enabling computers and people to
work in cooperation”. And when it comes to dealing with very large volumes of data, such as
those found in the field of Bioinformatics, most tasks (such as finding all instances of a certain
type of data which correspond to a certain category) require the cooperation of humans and
computers to perform adequately. To better enable this, in contrast to the above example, data
needs to be associated with semantics.
The association of semantics (in this case through the use of XML tags) allows computers, as well as
humans, to resolve the meaning of the data.
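As a rough sketch of why tagged data is easier for a computer to query, the Python below answers the question “What are the names of all the animals of type Mammal?” from a small XML document. The animals, the tag names (animals, animal, name, type) and the function names_of_type are invented for illustration; they are not the tutorial's original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: these animals and tag names are invented for illustration.
DOCUMENT = """
<animals>
  <animal><name>Dog</name><type>Mammal</type></animal>
  <animal><name>Eagle</name><type>Bird</type></animal>
  <animal><name>Whale</name><type>Mammal</type></animal>
</animals>
"""


def names_of_type(xml_text, wanted_type):
    """Return the names of all animals whose <type> matches wanted_type."""
    root = ET.fromstring(xml_text)
    return [
        animal.findtext("name")
        for animal in root.findall("animal")
        if animal.findtext("type") == wanted_type
    ]


print(names_of_type(DOCUMENT, "Mammal"))  # ['Dog', 'Whale']
```

The query relies only on the meaning carried by the tags, so it keeps working if the visual layout of the data changes.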
Moving towards meaning: Standards
Even if we associate semantics with our data as discussed above, there is still a problem of
interoperability if different datasets which actually represent the same or similar data are
associated with semantics in different ways. For example,
In this situation, a computer will not immediately be able to determine any relationship between
the two datasets, nor answer questions accurately across the two datasets (such as performing a
search across both datasets). It is clear that we need not only semantics associated with our data,
but also to agree on standard ways in which to represent those semantics.
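The interoperability problem can be illustrated with a minimal Python sketch, assuming two hypothetical datasets that label the same information with different keys. The records, the key names and the STANDARD_KEYS mapping are all invented for this example.

```python
# Hypothetical records from two sources that describe the same kind of data
# but label it differently.
DATASET_A = [{"name": "Dog", "type": "Mammal"}]
DATASET_B = [{"animal_name": "Whale", "class": "Mammal"}]

# An agreed, shared vocabulary: each source-specific key maps to one standard key.
STANDARD_KEYS = {
    "name": "name",
    "animal_name": "name",
    "type": "type",
    "class": "type",
}


def standardise(record):
    """Rewrite a record so that it uses only the agreed standard keys."""
    return {STANDARD_KEYS[key]: value for key, value in record.items()}


combined = [standardise(record) for record in DATASET_A + DATASET_B]
mammals = [record["name"] for record in combined if record["type"] == "Mammal"]
print(mammals)  # ['Dog', 'Whale']
```

Once both sources are expressed in the same agreed terms, a single search can run across them.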
An ontology is an attempt to explicitly specify terms and their associated semantic meaning, which
represent an agreement or standardisation within a particular field of application or domain of
knowledge. This has particular relevance in the field of Bioinformatics, where the volumes of data
are so vast and being generated in such geographically dispersed efforts, that the interpretation and
representation of knowledge relies heavily on the cooperation of humans and computers.
Bioinformatics Data
The core bioinformatics data is made up of protein and nucleotide sequences, along with
measurements of three-dimensional structures. There are many levels above this core data, which
consist of the representation of knowledge at various different levels, such as the knowledge about
whole genomes, gene expression patterns, protein families and interactions, systems biology and
pathway models.
Figure 3 Bioinformatics Data
At each different level and in each different database, knowledge is represented and enhanced by
means of annotations, which are usually captured in the form of human-readable free text.
Annotation of bioinformatics data
Annotation of bioinformatics data is essential for capturing and transmitting the knowledge
associated with data in bioinformatics databases. All data other than the core data (sequences etc)
which is present in the databases, such as names, descriptions, or literature references, is
annotation.
Annotations are often captured in the form of free text, which is easy for a human audience to read
and understand but difficult for computers to parse. Free text can also vary in quality from database
to database, and can use different terminology to mean the same thing (even within the same
database if, for example, different human annotators used different terminology).
An example of annotation present within the UniProt knowledgebase, taken from the UniProt
accession number P00325, alcohol dehydrogenase 1B (http://beta.uniprot.org/uniprot/P00325):
Figure 4 Annotation in free text
Additionally, in many databases, efforts are underway to use computers to extend the annotation
process through automatic annotation. This usually involves extending the human annotation of a
core subset of data to a larger set of data, using computer algorithms to determine whether the
human annotations apply to similar data in the larger dataset.
Several ontologies within the field of bioinformatics have been created in order to address the need
for the standardisation and specification of the meaning of terminology which is used in
annotation.
Ontologies in Bioinformatics
Some examples of common ontologies within the field of bioinformatics are discussed below.
NCBI Taxonomy
The NCBI Taxonomy database is a curated controlled vocabulary of the Linnaean names of
organisms which have been genetically sequenced, organised into a hierarchy using Is A
relationships. For example, the abbreviated NCBI taxonomy of Homo sapiens can be represented
as:
Figure 5 Taxonomy of Homo sapiens
In this case, the taxonomy forms a strict hierarchy of parent-child relationships. However, in
general, ontologies need not be confined to this format, and indeed, structures which allow
relationships to multiple parents (such as the Directed Acyclic Graph structure) are common.
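The difference between a strict hierarchy and a graph with multiple parents can be sketched in Python as follows. The terms (animal, mammal, flying animal, bat) and the function ancestors are hypothetical and serve only to show a node with two parents.

```python
# Hypothetical directed acyclic graph: every term lists ALL of its parents.
PARENTS = {
    "animal": [],
    "mammal": ["animal"],
    "flying animal": ["animal"],
    "bat": ["mammal", "flying animal"],  # two parents: not a strict tree
}


def ancestors(term):
    """Return every term reachable by following parent links upwards."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


print(sorted(ancestors("bat")))  # ['animal', 'flying animal', 'mammal']
```

Because a term may be reached by more than one path, the traversal records which terms it has already visited.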
Enzyme Taxonomy
Enzyme classification also takes the form of a hierarchical taxonomy in which each enzyme is
classified at four levels of depth. For example, the classification of the enzyme flavonol 3-sulfotransferase is given below.
Figure 6 Enzyme Taxonomy
Gene Ontology
The Gene Ontology Consortium (http://www.geneontology.org/) develops and maintains the Gene
Ontology, which provides a controlled vocabulary to describe gene and gene product attributes in
any organism. It is organised by three organising principles, namely ‘molecular function’,
‘biological process’ and ‘cellular component’.
The ChEBI ontology
The ChEBI ontology is an ontology for biologically interesting chemistry. It consists of three
sub-ontologies, namely
• Molecular Structure, in which molecular entities or parts thereof are classified according to
their structure;
• Role, in which entities are classified on the basis of their role within a biological context, e.g.
as antibiotics, antiviral agents, coenzymes, enzyme inhibitors, or on the basis of their intended
use by humans, e.g. as pesticides, detergents, healthcare products, fuel; and
• Subatomic Particle, in which particles which are smaller than atoms are classified.
Figure 7 ChEBI ontology for (R)-adrenaline
Exploring the ChEBI sub-ontologies
We will take a brief look at the kinds of data you will find classified under each of the three
sub-ontologies. (By “classified under”, we mean “has an unbroken is a relationship path with”.)
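The notion of “classified under” can be made concrete with a short Python sketch. The three edges below are borrowed from examples used elsewhere in this tutorial and form only a tiny, incomplete fragment of the ontology; the triple layout, the relationship spellings (is_a, has_role) and the function is_classified_under are assumptions made for illustration.

```python
# A tiny, incomplete fragment expressed as (child, relationship, parent) triples.
EDGES = [
    ("ions", "is_a", "molecular entities"),
    ("molecular entities", "is_a", "molecular structure"),
    ("morphine", "has_role", "opioid analgesic"),
]


def is_classified_under(term, ancestor, edges=EDGES):
    """True if an unbroken chain of is_a links leads from term to ancestor."""
    pending = [term]
    seen = set()
    while pending:
        current = pending.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Only is_a links are followed; any other relationship breaks the path.
        pending.extend(
            parent
            for child, relation, parent in edges
            if child == current and relation == "is_a"
        )
    return False


print(is_classified_under("ions", "molecular structure"))    # True
print(is_classified_under("morphine", "opioid analgesic"))   # False
```

In this sketch a term trivially counts as classified under itself. The second call returns False because the only link is a has role relationship rather than an is a relationship.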
Molecular structure
Molecular entities with defined connectivity are classified under the molecular structure
sub-ontology. These include the chemical compounds which themselves could exist in some form in
the real world, such as drugs, vitamins, insecticides, and different forms of alcohol.
In addition, classes of molecular entities are classified under the molecular structure sub-ontology.
Classes may be structurally defined; however, a class does not represent a single structural
definition but rather a generalisation of the structural features which all members of that class share.
It is often useful to define the interesting parts of molecular entities as groups. Groups have a
defined connectivity with one or more specified attachment points.
Figure 8 Molecular structure ontology
Role
The role sub-ontology is further divided into two distinct types of role, namely biological role and
application. Roles do not themselves have structures, but rather it is the case that items in the role
ontology are linked to the molecular entities which have those roles.
Figure 9 Role ontology
Subatomic particle
The subatomic particle sub-ontology is the smallest sub-ontology graph, consisting only of those
particles which are smaller than an atom.
ChEBI ontology relationships
The ChEBI ontology uses two generic ontology relationships, namely
• Is a: Entity A is an instance of Entity B. For example, chloroform is a chloromethanes.
• Has part: Indicates the relationship between part and whole, for example,
tetracyanonickelate(2−) is part of potassium tetracyanonickelate(2−).
Figure 10 ChEBI ontology generic relationships
In addition, the ChEBI ontology contains several chemistry-specific relationships which are used
to convey additional semantic information about the entities in the ontology. These are:
• Is conjugate base of and is conjugate acid of: Cyclic relationships used to connect acids
with their conjugate bases, for example, pyruvic acid is the conjugate acid of the pyruvate
anion, while pyruvate is the conjugate base of the acid.
• Is tautomer of: Cyclic relationship used to show the relationship between two tautomers, for
example, L-serine and its zwitterion are tautomers.
• Is enantiomer of: Cyclic relationship used in instances when two entities are mirror
images and non-superposable upon each other. For example, D-alanine is enantiomer of
L-alanine and vice versa.
• Has functional parent: Denotes the relationship between two molecular entities or
classes, one of which possesses one or more characteristic groups from which the other
can be derived by functional modification. For example, 16α-hydroxyprogesterone can be
derived by functional modification (i.e. 16α-hydroxylation) of progesterone.
• Has parent hydride: Denotes the relationship between an entity and its parent hydride,
for example, 1,4-naphthoquinone has parent hydride naphthalene.
• Is substituent group from: Indicates the relationship between a substituent group/atom
and its parent molecular entity, for example, the L-valino group is derived by a proton
loss from the N atom of L-valine.
• Has role: Denotes the relationship between a molecular entity and the particular
behaviour which the entity may exhibit either by nature or by human application, for
example, morphine has role opioid analgesic.
Figure 11 ChEBI chemistry-specific relationships
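The cyclic relationships listed above always come in pairs, one statement in each direction. A minimal Python sketch of this idea follows, using examples from the list. The underscore spellings follow the style seen in the ChEBI OBO file (for example is_enantiomer_of), but is_tautomer_of and has_role are assumed spellings, and the INVERSE table and with_inverses function are hypothetical helpers.

```python
# Cyclic relationship types and the type used for the reverse statement.
INVERSE = {
    "is_conjugate_acid_of": "is_conjugate_base_of",
    "is_conjugate_base_of": "is_conjugate_acid_of",
    "is_tautomer_of": "is_tautomer_of",
    "is_enantiomer_of": "is_enantiomer_of",
}

# Statements written in one direction only, as (subject, relationship, target).
ASSERTED = [
    ("pyruvic acid", "is_conjugate_acid_of", "pyruvate"),
    ("D-alanine", "is_enantiomer_of", "L-alanine"),
    ("morphine", "has_role", "opioid analgesic"),
]


def with_inverses(triples):
    """Add the reverse statement for every cyclic relationship."""
    completed = set(triples)
    for subject, relation, target in triples:
        if relation in INVERSE:
            completed.add((target, INVERSE[relation], subject))
    return completed


for triple in sorted(with_inverses(ASSERTED)):
    print(triple)
```

Non-cyclic relationships such as has role are left one-directional: a role does not in turn have the molecular entity as its role.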
The structural meaning of the chemical ontology relationships is further illustrated below.
[Figure panels: Is Conjugate Base Of; Is Conjugate Acid Of; Is Tautomer Of; Is Enantiomer Of; Has Functional Parent; Has Parent Hydride; Is Substituent Group From; A set of family relationships]
Viewing the ChEBI ontology online
The ChEBI ontology is part of the main entry view of a ChEBI entry. For example, the ChEBI
entry for L-cysteine is shown below.
Figure 12 ChEBI entry for L-cysteine
Scrolling down reveals the section titled ‘ChEBI Ontology’, in which the parent and children
relationships are listed. The default view is the parents and children view; clicking on the
link marked ‘Tree View’ displays the full hierarchical tree of relationships for this term.
Figure 13 Ontology in parents and children view (callout: Tree view)
Figure 14 Ontology in tree view
The ChEBI ontology may also be browsed. To access the browse facility, select the ‘browse’ link
from the main left-hand menu bar.
[Screenshot callout: Browse the ontology]
This link leads to the Ontology Lookup Service, an ontology browsing and searching utility which
provides access to several different ontologies within the bioinformatics field.
Ontology Lookup Service (OLS)
The Ontology Lookup Service is a facility which provides a centralised query interface for
ontology and controlled vocabulary lookup. It is available online at
http://www.ebi.ac.uk/ontology-lookup/. The link to browse the ChEBI ontology opens the
following screen:
Figure 15 Ontology Lookup Service - ChEBI ontology (callout: Browse the three ChEBI sub-ontologies)
The Ontology Lookup Service can integrate any ontology which is available in OBO format, and
at present (as at the last release) contains 61 ontologies, including
• GO
• ChEBI
• Molecular interaction (PSI MI)
• Pathway ontology (PW)
• Human disease (DOID)
• and many more…
OLS provides facilities for the searching and browsing of ontologies, as well as displaying a graph
of terms and relationships between terms similarly to what AmiGO displays for GO.
Figure 16 DOID term 'Mental Retardation' and graph
Worked example: tryptophan
In this example we will browse the ChEBI ontology surrounding the chemical tryptophan by using
the Ontology Lookup Service.
Step 1: Search
Open the Ontology Lookup Service home page at http://www.ebi.ac.uk/ontology-lookup/. Select
the Chemical Entities of Biological Interest ontology from the ontology selection drop-down box.
Type in the search box the first few letters of the term ‘tryptophan’. You will see that the search is
performed in the background and a list of matching terms is displayed in a drop-down.
[Screenshot callouts: 1. Select ChEBI ontology; 2. Type first few letters of search term]
Step 2: Select term
Select the term Tryptophan [CHEBI:27897]. Additional information about the term is displayed
below the search box.
[Screenshot callouts: 3. Additional information defined in ChEBI OBO file; 4. Click ‘Browse’]
Step 3: Browse ontology
After selecting the term, click ‘browse’. You are taken to the ontology browser with the relevant
term displayed as the root.
[Screenshot callouts: 5. Browse full ontology; 6. Browse with tryptophan as root]
Step 4: Viewing graphical tree for this entry
Scrolling down and to the right reveals the graphical tree for this entry.
[Screenshot callout: 7. Tryptophan]
The OBO File format
The Open Biomedical Ontologies (OBO) is an umbrella organisation for ontologies and structured
shared controlled vocabularies for use across all biological and biomedical domains
(http://sourceforge.net/projects/obo). This organisation has defined the OBO file format, which is
an ontology representation format designed specifically with the following goals in mind:
• Human readability
• Ease of parsing
• Extensibility
• Minimal redundancy
The OBO format models a subset of the concepts modelled in OWL (Web Ontology Language)
with extensions for metadata (such as synonyms).
An extract from the ChEBI Ontology downloaded in OBO format illustrates the overall layout of
data within the OBO file. The first paragraph in the file contains general header information about that
particular file, and is then followed by one or more terms separated by blank lines. Each term
contains an identifier, a name, a definition, relationships to other terms within the ontology, and
may contain additional metadata such as synonyms.
format-version: 1.2
date: 28:01:2009 05:57
saved-by: pmatos
default-namespace: chebi_ontology
remark: ChEBI subsumes and replaces the Chemical Ontology first
remark: developed by Michael Ashburner & Pankaj Jaiswal.
remark: Author: ChEBI curation team
remark: ChEBI Release version 53
remark: For any queries contact chebi-help@ebi.ac.uk
synonymtypedef: IUPAC_NAME "IUPAC NAME"
synonymtypedef: FORMULA "FORMULA"
synonymtypedef: SMILES "SMILES"
synonymtypedef: InChI "InChI"
synonymtypedef: InChIKey "InChIKey"
synonymtypedef: BRAND_NAME "BRAND NAME"
synonymtypedef: INN "INN"

[Term]
id: CHEBI:24431
name: molecular structure
def: "A description of the molecular entity or part thereof based on its composition and/or the connectivity between its constituent atoms." []

[Term]
id: CHEBI:23367
name: molecular entities
def: "A molecular entity is any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity." []
synonym: "entidad molecular" RELATED [IUPAC:]
synonym: "entidades moleculares" RELATED [IUPAC:]
synonym: "entite moleculaire" RELATED [IUPAC:]
synonym: "molecular entity" EXACT IUPAC_NAME [IUPAC:]
synonym: "molekulare Entitaet" RELATED [ChEBI:]
is_a: CHEBI:24431

[Term]
id: CHEBI:24870
name: ions
def: "An ion is a molecular entity having a net electric charge." []
synonym: "ion" EXACT IUPAC_NAME [IUPAC:]
is_a: CHEBI:23367

In this extract, the lines from format-version to the last remark line are the general header
information, the synonymtypedef lines define the synonym types used in terms, and the is_a lines
give the relationships to other terms. Synonyms in OBO format may be ‘related’ or ‘exact’. In the
ChEBI OBO file, IUPAC names are considered ‘exact’ synonyms and all others ‘related’.
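To show how the layout described above lends itself to parsing, here is a minimal Python sketch that splits OBO text into a header and a list of terms. It is not a complete OBO parser: it assumes each tag-value pair sits on one line and ignores comments, escape characters and trailing modifiers. The function parse_obo and the shortened SAMPLE text are our own; for real work an established OBO library would be a safer choice.

```python
# Shortened sample based on the extract from the ChEBI OBO file.
SAMPLE = """\
format-version: 1.2
default-namespace: chebi_ontology

[Term]
id: CHEBI:24431
name: molecular structure

[Term]
id: CHEBI:23367
name: molecular entities
synonym: "molecular entity" EXACT IUPAC_NAME [IUPAC:]
is_a: CHEBI:24431

[Term]
id: CHEBI:24870
name: ions
synonym: "ion" EXACT IUPAC_NAME [IUPAC:]
is_a: CHEBI:23367
"""


def parse_obo(text):
    """Split OBO text into a header dict and a list of term dicts.

    Every tag maps to a list of values, because tags such as
    synonym and is_a may repeat within one stanza.
    """
    header, terms = {}, []
    current = header
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "[Term]":
            current = {}
            terms.append(current)
            continue
        if line.startswith("["):
            # Another stanza type (e.g. [Typedef]): read it but do not keep it as a term.
            current = {}
            continue
        tag, _, value = line.partition(": ")
        current.setdefault(tag, []).append(value)
    return header, terms


header, terms = parse_obo(SAMPLE)
for term in terms:
    print(term["id"][0], term["name"][0], term.get("is_a", []))
```

Splitting on the first ": " only keeps values that themselves contain colons, such as identifiers like CHEBI:24431, intact.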
The OBO Foundry
“The OBO Foundry is a collaborative experiment involving developers of science-based
ontologies who are establishing a set of principles for ontology development with the goal of
creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.” The
OBO Foundry can be found at http://www.obofoundry.org/.
For more information
For further information, email the ChEBI team at: chebi-help@ebi.ac.uk, or log on to the
SourceForge forum at https://sourceforge.net/projects/chebi/.
Additional information about using ChEBI can be found by examining the User Manual at
http://www.ebi.ac.uk/chebi/userManualForward.do.
The latest news, updates and developments are announced via an RSS feed.
Exercises
You will need access to ChEBI online at http://www.ebi.ac.uk/chebi to complete these exercises.
1. Dichlorvos (CHEBI:34690) is a well known insecticide. Open the ChEBI entry for dichlorvos
(CHEBI:34690). Scroll down the entry to the “ChEBI ontology”. Can you determine whether
it can also be used as a fungicide?
______________________________________________________________________
2. On the same entry as above (CHEBI:34690) click on the “Tree View” to display the entire
ontology tree. Follow the tree path from dichlorvos to its parent organophosphate insecticide
(CHEBI:25708). Click on this parent organophosphate insecticide (CHEBI:25708). This
brings you to the ontology view of the parent. Looking at the children of this entry, can
you write down any other insecticide?
______________________________________________________________________
3. The following term appears in the ChEBI OBO file.
[Term]
id: CHEBI:32762
name: L-tyrosinium
synonym: "(1S)-1-carboxy-2-(4-hydroxyphenyl)ethanaminium" RELATED [IUPAC:]
synonym: "L-tyrosine cation" RELATED [JCBN:]
synonym: "L-tyrosinium" EXACT IUPAC_NAME [IUPAC:]
synonym: "C9H12NO3" RELATED FORMULA [ChEBI:]
synonym: "[NH3+][C@@H](Cc1ccc(O)cc1)C(O)=O" RELATED SMILES [ChEBI:]
synonym: "InChI=1/C9H11NO3/c10-8(9(12)13)5-6-1-3-7(11)4-2-6/h14,8,11H,5,10H2,(H,12,13)/p+1/t8-/m0/s1/fC9H12NO3/h10,12H/q+1" RELATED InChI [ChEBI:]
xref: Gmelin:1150138 "Gmelin Registry Number"
is_a: CHEBI:32786
relationship: is_enantiomer_of CHEBI:32775
relationship: is_conjugate_acid_of CHEBI:17895
Can you describe the relationships that this term has to other entities within the ontology?
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
4. Using the Advanced Search, can you find all entries which have the application
‘pharmaceutical’ (CHEBI:52217)? How many entries are there? Can you name one?
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
5. Using the Advanced Search, can you find all entries which have the role ‘epitope’
(CHEBI:53000)? How many entries are there? Can you name one?
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________