What is ChEBI?

Last modified by: Paula de Matos (May 2011)
Introduction to ChEBI
This tutorial covers an introduction to the ChEBI database. You will learn about the background
and motivation for the creation of a database of chemical entities of biological interest at the EBI,
as well as the details of the data contained in ChEBI and an overview of how this data is
maintained.
This is the first of four training blocks in the ChEBI training course.
Block 1 – Introduction to ChEBI
Block 2 – Searching and browsing ChEBI
Block 3 – Understanding the ChEBI ontology
Block 4 – Download and programmatic access
Contents
Introduction to ChEBI
    What is ChEBI?
    Why a chemistry database (at a bioinformatics institution)?
    How is ChEBI maintained?
    What information does ChEBI contain?
    More about chemical structures
For more information
Exercises
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543
Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Introduction to ChEBI
What is ChEBI?
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular
entities focused on ‘small’ chemical entities. Molecules directly encoded by the genome (such as
nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included
in ChEBI, as these are amply represented in other databases.
ChEBI provides standardised descriptions of molecular entities that enable other databases at the
EMBL-EBI and worldwide to annotate their entries in a consistent fashion. ChEBI focuses on high
quality manual annotation, non-redundancy, and provision of a chemical ontology rather than full
coverage of the vast chemical space. In addition to molecular entities, ChEBI contains groups
(parts of molecular entities) and classes of entities. A major feature of ChEBI is that it includes a
chemical ontology, which allows the relationships between molecular entities or classes of entities
and their parents and/or children to be specified in a structured way.
ChEBI uses nomenclature, symbolism and terminology endorsed by the International Union of
Pure and Applied Chemistry (IUPAC) and the Nomenclature Committee of the International
Union of Biochemistry and Molecular Biology (NC-IUBMB). All the data in ChEBI is
non-proprietary or derived from a non-proprietary source and is therefore freely available to
anyone. In addition, each data item is fully traceable and explicitly referenced to the original source.
It is available online at http://www.ebi.ac.uk/chebi/ (illustrated below).
Why a chemistry database (at a bioinformatics institution)?
Within the various bioinformatics databases, chemical entities are referred to frequently. Some
examples are discussed below.
Protein interactions (IntAct)
Chemical entities interact with proteins. Below is the detail from an entry in the IntAct protein
interactions database showing a protein interaction with a chemical entity.
Figure 1 IntAct entry EBI-1607326
Reactions catalysed by enzymes (IntEnz)
Chemical entities participate in the reactions which are catalysed by enzymes. Below is the detail
from an entry in the IntEnz enzyme database showing a biochemical reaction catalysed by an
enzyme.
Figure 2 IntEnz entry EC 1.10.3.2
Gene expression patterns (ArrayExpress)
Chemical entities affect gene expression patterns and are often used in microarray experiments.
Below is the detail from an entry in the ArrayExpress Repository database showing a chemical
entity used as part of a microarray experiment.
Figure 3 ArrayExpress Repository entry E-TOXM-30
3-Dimensional structures (MSD)
Chemical entities are often bound to proteins when the three-dimensional structure of the protein is
measured. Below is an entry from the Macromolecular Structure Database (MSD) showing PDB
entry 2jfz bound to D-glutamic acid residue (CHEBI:48096).
Figure 4 MSD atlas PDB entry 2jfz
Many more examples could be found. Clearly, chemical entities play a large supporting role in the
biological databases. However, since they are for the most part not the core data in these databases,
they are often present as free text in annotations.
Free text annotations are easy for a human audience to read and understand, but are difficult for
computers to parse, can vary in quality from database to database, and can use different
terminology to mean the same thing (even within the same database, if for example different
human annotators used different terminology).
Ambiguity of names
In addition to other problems associated with free text annotations, chemical entities pose a
particularly difficult problem for annotation. Chemical names, particularly common names, may
contain ambiguity as to the exact chemical which is intended by the use of the name.
For example, the term ‘adrenaline’ may refer to either one of these two compounds:
Figure 5 (S)-adrenaline and (R)-adrenaline
Multiple valid names
Another problem, which is especially pronounced in chemical nomenclature, is that there may be
several different valid names for a particular chemical. This derives from the use of different
naming systems.
For example, the following molecule:
Figure 6 Acetaminophen
may be referred to by any of the following names: paracetamol, acetaminophen,
4-acetamidophenol, N-(4-hydroxyphenyl)acetamide, …
ChEBI aims to address these problems of chemical annotation within biological databases by
providing a definitive reference controlled vocabulary and ontology of chemical entities which are
of relevance to the biological community (and thus likely to appear within biological databases).
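The idea of a controlled vocabulary can be illustrated with a small lookup table. The sketch below is not ChEBI code: the identifier CHEBI:EXAMPLE is a made-up placeholder and the function name resolve is hypothetical. It only shows how the different names listed above can be mapped to a single identifier, so that differently worded annotations point to the same entity.

```python
# Minimal sketch of a controlled vocabulary: many names, one identifier.
# "CHEBI:EXAMPLE" is a placeholder, not a real ChEBI identifier.
SYNONYMS = {
    "paracetamol": "CHEBI:EXAMPLE",
    "acetaminophen": "CHEBI:EXAMPLE",
    "4-acetamidophenol": "CHEBI:EXAMPLE",
    "n-(4-hydroxyphenyl)acetamide": "CHEBI:EXAMPLE",
}


def resolve(name):
    """Return the identifier for a name, ignoring case and outer spaces."""
    return SYNONYMS.get(name.strip().lower())


# Three free-text annotations written by different (imaginary) annotators.
annotations = ["Paracetamol", "acetaminophen ", "N-(4-hydroxyphenyl)acetamide"]
ids = {resolve(a) for a in annotations}
print(ids)  # a set with a single identifier if all names were recognised
```

A name that is not in the table resolves to None, which is a simple way of flagging annotations that still need attention from a human annotator.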
How is ChEBI maintained?
1 star entries
ChEBI systematically combines information on small molecular entities which are automatically
loaded as preliminary data from three main sources:
1. IntEnz database of enzymes (EMBL-EBI),
2. KEGG COMPOUND database, and
3. MSDchem database of ligands (also EMBL-EBI).
Data from these three databases are classified as 1 star entries (preliminary entries). They are not
publicly searchable until they have been manually annotated and checked; however, they may be
accessed directly if the identifier is known, or browsed if they are linked to the ontology. They are
clearly indicated as preliminary entries in the interface:
Figure 7 ChEBI 1 star entry
2 star entries
In August 2009, ChEMBL (www.ebi.ac.uk/chembl), a large-scale, manually annotated database of
small-molecule potential drug interactions, was indexed into the ChEBI database.
These entries are manually annotated from the literature and normally have a chemical structure
associated with them, as well as a name and the publication information. They are
indicated by 2 stars in the ChEBI interface. Entries submitted by users can also be 2 star entries.
Figure 8: ChEBI 2 star entries
3 star entries
Each preliminary entity is then manually checked and annotated. A unique and unambiguous name
is selected as the recommended ChEBI name, the structure is created or checked, an IUPAC name
is assigned, and relevant synonyms and database links are annotated.
A number of subsidiary freely accessible sources are manually annotated and integrated, such as
ChemIDplus (NIH, http://chem.sis.nlm.nih.gov/chemidplus/), the NIST Chemistry WebBook
(http://webbook.nist.gov/), COMe and RESID (EMBL-EBI).
User requests
Users of ChEBI send requests for additions to the dataset via the ChEBI Submission Tool, a web
interface (www.ebi.ac.uk/chebi/submissions) which allows users to provide names, chemical
structures, database links and a classification within the ontology. All annotation is then checked
by ChEBI annotators and released into the public domain. If an entry cannot be annotated in time
for a ChEBI release, it remains a 2 star entry until a ChEBI curator has been able to annotate it.
What information does ChEBI contain?
ChEBI database entries contain:
• A unique, unambiguous, recommended ChEBI name and an associated stable unique
  identifier
• An illustration of the chemical structure where appropriate (compounds and groups, but
  generally not classes), as well as secondary structures such as InChI and related chemical
  data such as formula
• A definition where appropriate (mostly classes)
• A collection of synonyms, including the IUPAC recommended name for the entity where
  appropriate, and brand names and INNs for drugs
• A collection of cross-references to other databases (where these are sourced from
  non-proprietary origins)
• Links to the ChEBI ontology
• Citation information where the chemical has been cited in publication
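To make the list above more concrete, the following sketch models an entry as a simple Python data structure. It is an illustration only and does not reflect how ChEBI stores its data: the class name Entry, its field names and the identifier CHEBI:EXAMPLE are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Illustrative container for the kinds of data listed above."""
    identifier: str                 # stable unique identifier
    name: str                       # recommended ChEBI name
    definition: str = ""            # mostly used for classes
    formula: str = ""               # related chemical data
    inchi: str = ""
    synonyms: list = field(default_factory=list)
    xrefs: dict = field(default_factory=dict)        # database name -> accession
    ontology_parents: list = field(default_factory=list)
    citations: list = field(default_factory=list)

    def all_names(self):
        """Recommended name followed by synonyms, without duplicates."""
        names = [self.name]
        for synonym in self.synonyms:
            if synonym not in names:
                names.append(synonym)
        return names


entry = Entry(
    identifier="CHEBI:EXAMPLE",     # placeholder, not a real identifier
    name="paracetamol",
    formula="C8H9NO2",
    synonyms=["acetaminophen", "4-acetamidophenol", "paracetamol"],
)
print(entry.all_names())
```

Keeping the recommended name separate from the synonyms mirrors the distinction made in the list: one name is chosen as the reference, while the others are kept so that the entity can still be found under them.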
[Figure: annotated ChEBI entry page showing the recommended ChEBI name, illustration, chemical
structure searches, additional chemical data, links to the ChEBI ontology, synonyms, and
cross-references to other databases]
Automatic Cross-references
Other databases are automatically cross-referenced to ChEBI entities for each release, and these
automatic cross-references are found in the Automatic Xrefs page which is accessed via the tab at
the top of the main entry view screen.
The links on the automatic cross-references page take you to the entry in the relevant database.
More about chemical structures
ChEBI entities are illustrated by means of a diagram of the chemical structure. Following best
practices, the default illustration is an unambiguous, 2-dimensional representation of the structure
of the entity. Additional structures, such as a 3-dimensional structure, may also be present.
Where available, these can be accessed by clicking on ‘More structures >>’ on the
main entity page.
Molfile format
Structures are stored as Molfiles within ChEBI. The Molfile format is owned by the Elsevier MDL
company (http://www.mdl.com/company/about/history.jsp), and is the most commonly used
format for chemical data exchange. It can be accessed by clicking the link entitled ‘Molfile’ under
a structure image, and opening the resulting file in a text editor.
This is the Molfile for the 2-dimensional representation of paracetamol:

  Marvin  03190821382D

 11 11  0  0  0  0            999 V2000
    0.7145    0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145    0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4288    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    2.0624    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    2.8874    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.6500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -1.6499    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  1  3  2  0  0  0  0
 10  3  1  0  0  0  0
  4  2  2  0  0  0  0
  4 11  1  0  0  0  0
  4  5  1  0  0  0  0
  6  5  2  0  0  0  0
  3  6  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  8 10  1  0  0  0  0
M  END

In this Molfile, the line ending in ‘V2000’ starts with the atom count (11), followed by the bond
count (11). The block that follows lists the atoms, one per line: each line gives the coordinates of
the atom and then its element. The final block lists the bonds, one per line: the first two numbers
are the line numbers of the bonded atoms in the atom table above, and the third number is the
bond order.
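The atom and bond tables of a Molfile can be read with a few lines of Python. The sketch below is a simplified reader written for this tutorial, not part of ChEBI or of any chemistry toolkit; the function name parse_molfile is an assumption. It splits each line on whitespace, which works for small molecules such as paracetamol, whereas the actual format uses fixed-width columns.

```python
from collections import Counter

# Molfile for paracetamol, built line by line so that the blank header
# lines (molecule name and comment) are not lost.
MOLFILE = "\n".join([
    "",
    "  Marvin  03190821382D",
    "",
    " 11 11  0  0  0  0            999 V2000",
    "    0.7145    0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.7145    0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4288    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7145    2.0624    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7145    2.8874    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    1.6500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -1.6499    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  2  1  1  0  0  0  0",
    "  1  3  2  0  0  0  0",
    " 10  3  1  0  0  0  0",
    "  4  2  2  0  0  0  0",
    "  4 11  1  0  0  0  0",
    "  4  5  1  0  0  0  0",
    "  6  5  2  0  0  0  0",
    "  3  6  1  0  0  0  0",
    "  7  8  1  0  0  0  0",
    "  8  9  2  0  0  0  0",
    "  8 10  1  0  0  0  0",
    "M  END",
])


def parse_molfile(text):
    """Return (atoms, bonds) from a small V2000 Molfile.

    Simplification: fields are split on whitespace, so counts of 100 or
    more (which run together in the fixed-width format) are not handled.
    """
    lines = text.split("\n")
    counts = lines[3].split()              # fourth line: atom and bond counts
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = parse_molfile(MOLFILE)
elements = Counter(atom[0] for atom in atoms)

# Atom numbers are 1-based line numbers in the atom table.
nitrogen = [i for i, atom in enumerate(atoms, start=1) if atom[0] == "N"][0]
neighbours = sorted(b if a == nitrogen else a
                    for a, b, _ in bonds if nitrogen in (a, b))
print(elements, nitrogen, neighbours)
```

Hydrogen atoms do not appear in the element counts because they are implicit in this Molfile: only the heavy atoms are listed in the atom table.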
InChI
“The IUPAC International Chemical Identifier (InChI™) is a non-proprietary identifier for
chemical substances that can be used in printed and electronic data sources thus enabling easier
linking of diverse data compilations.” (http://www.iupac.org/inchi/).
It is intended to provide a unique identifier for a chemical structure independently of how it is
drawn (in contrast to the Molfile for a particular structure, which does differ depending on how the
structure is drawn).
Limitations of the current InChI format are that in some cases, the generated InChI is not unique,
that is, the same InChI is generated for different entities, and that it can be difficult to convert back
from the InChI to the chemical structure.
Non-unique InChIs occur because the InChI format does not differentiate between stereoisomers
other than tetrahedral and trigonal planar. An example of square planar stereoisomers which have
the same InChI is the pair cisplatin (CHEBI:27899) and transplatin (CHEBI:35852).
[Figure: structures of cisplatin and transplatin]
Another example of a set of entities which have the same InChI but different stereochemistry is the
following octahedral stereoisomers:
Non-stereospecific: CHEBI:36409
Delta-isomer: CHEBI:36410
Lambda-isomer: CHEBI:36411
The InChI for paracetamol is:
InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)
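An InChI is built from layers separated by ‘/’ characters, and a quick way to get a feel for the format is to split one apart. The sketch below uses the standard InChI of paracetamol (formula C8H9NO2). The function names split_inchi and formula_counts are assumptions made for this example, and the code handles only simple single-component InChIs; it is not a replacement for the official InChI software.

```python
import re

INCHI = "InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)"


def split_inchi(inchi):
    """Split an InChI into its version, formula and prefixed layers.

    Simplification: if a prefix occurs more than once, only the last
    layer with that prefix is kept.
    """
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        layers[layer[0]] = layer[1:]   # e.g. 'c' connectivity, 'h' hydrogens
    return layers


def formula_counts(formula):
    """Count atoms in a simple formula such as C8H9NO2."""
    return {element: int(number or 1)
            for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula)}


layers = split_inchi(INCHI)
print(layers["version"], formula_counts(layers["formula"]))
print(layers["c"])
```

Checking the formula layer against the known formula of a compound is an easy sanity check that an InChI has been attached to the right entity.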
SMILES
SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical
method using printable characters) for entering and representing molecules and reactions. The
original SMILES specification was developed by Arthur Weininger and David Weininger in the
late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical
Information Systems Inc. (http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html).
The SMILES format caters for unique representations of distinct chemical entities, but a drawback
is that the SMILES output can depend on the program that generates it.
The SMILES for paracetamol is:
CC(=O)Nc1ccc(O)cc1
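Because the SMILES output can depend on the program that generates it, two different strings may describe the same molecule. The sketch below counts the non-hydrogen atoms in two SMILES strings for paracetamol: the one given above and an alternative that starts from the hydroxy group. The function name heavy_atoms is an assumption, and the pattern covers only simple atoms written without square brackets. Equal atom counts are a necessary check, not proof that two strings describe the same structure.

```python
import re
from collections import Counter

# Atoms written without brackets only; bracket atoms, charges and
# isotopes are not handled in this sketch.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atoms(smiles):
    """Count non-hydrogen atoms in a simple SMILES string."""
    # Lower-case symbols denote aromatic atoms; fold them into the element.
    return Counter(token.capitalize() for token in ATOM.findall(smiles))


first = heavy_atoms("CC(=O)Nc1ccc(O)cc1")     # SMILES given above
second = heavy_atoms("Oc1ccc(NC(C)=O)cc1")    # same molecule, other atom order
print(first, first == second)
```

A reliable comparison of two SMILES strings requires a chemistry toolkit that converts both to a canonical form, which is beyond the scope of this sketch.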
Interactive Viewer
The MarvinView applet can be accessed within ChEBI by selecting the “applet” checkbox next to
the structure image. This allows interactive exploration of the given chemical structure.
MarvinView allows the user to quickly and easily control many aspects of the display, such as
molecule display format, colour scheme, dimension, and point of view.
It is possible to move the displayed molecule by translation, dragging, changing the zoom, or
rotating in three dimensions. It is also possible to animate the display. The format of the display
can be switched between common formats such as wireframe, ball and stick, and spacefill. The
colour scheme can be changed. Also, the display of implicit and explicit hydrogens in the image
can be altered.
The MarvinView applet is provided by ChemAxon Marvin (http://www.chemaxon.com/marvin/).
For more information
For further information, email the ChEBI team at: chebi-help@ebi.ac.uk, or log on to the
SourceForge forum at https://sourceforge.net/projects/chebi/.
Additional information about using ChEBI can be found by examining the User Manual at
http://www.ebi.ac.uk/chebi/userManualForward.do.
The latest news, updates and developments are announced via an RSS feed.
Exercises
You will need access to ChEBI online at http://www.ebi.ac.uk/chebi to complete these exercises.
1. Go to ChEBI entry CHEBI:3647. What is it?
____________________________________________________________________
2. What are some of the alternative names you might find referring to this entity within text?
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
3. Is this chemical used as a drug?
____________________________________________________________________
4. If so, what is it used for, and what brand names might it be found under?
____________________________________________________________________
5. Go to ChEBI entry CHEBI:45783. What is it? What can it be used for?
____________________________________________________________________
____________________________________________________________________
6. View the additional structure for this entity (CHEBI:45783). What does it show that you
   cannot see in the default structure?
____________________________________________________________________
7. Open the entry for quinolinate(1-) (CHEBI:46828). Click on the “Automatic Xrefs” tab just
   below the header. Scroll down to the subsection “Reactions and Pathways”. Click on the
   Reactome identifier, REACT_11092. This will take you to the Reactome entry for this
   reaction. Can you name the enzyme which catalyses this reaction?
____________________________________________________________________
8. Given the following 2D Molfile, can you sketch the molecule it represents?

  Marvin  02080816422D

 14 15  0  0  0  0            999 V2000
   -0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145    0.4125    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -0.8250    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4992    0.6674    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.4992   -0.6675    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.9841    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.4289   -0.8250    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0001    1.6500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0001   -1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7541    1.4520    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.4289    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 10  1  2  0  0  0  0
  1  2  1  0  0  0  0
 14  2  1  0  0  0  0
  8  3  1  0  0  0  0
  4  3  2  0  0  0  0
  7  4  1  0  0  0  0
  1  5  1  0  0  0  0
  5  3  1  0  0  0  0
 12  5  1  0  0  0  0
  6  2  1  0  0  0  0
  6  4  1  0  0  0  0
 11  6  2  0  0  0  0
  9  7  1  0  0  0  0
 13  7  1  0  0  0  0
  9  8  2  0  0  0  0
M  END