E2b ChEBI

Last modified by: Paula de Matos (April 2011)
Searching and browsing ChEBI
This tutorial covers methods of searching and browsing the data contained in the ChEBI database
via the online public user interface. You will learn how to perform simple and advanced text
searches in ChEBI as well as chemical structure searches which may be exact, substructure, or
similarity searches. You will also learn how to navigate within and browse the ChEBI data.
This is the second of four training blocks in the ChEBI training course.
Block 1 – Introduction to ChEBI
Block 2 – Searching and browsing ChEBI
Block 3 – Understanding the ChEBI ontology
Block 4 – Download and programmatic access
Contents
Searching and browsing ChEBI
  Contents
Searching ChEBI
  Simple Search
  Advanced Search
  Structure-based Search
    Fingerprints and chemical structure searching
    Search options
Browsing and navigating within ChEBI
  Periodic Table
For more information
Exercises
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543
Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Searching ChEBI
ChEBI can be searched via a simple or advanced text search, or via a structure-based search.
Simple Search
To use the simple search, enter your search query into the search box located at the top
left of the ChEBI front page. The search query may be any data associated with an entity, such as
names, synonyms, formulae, CAS or Beilstein Registry numbers, or InChIs.
When there are multiple results returned from a search, the search takes you to a search results
page. Search results may be downloaded for import into other applications.
Figure 1 Search results for 'water'
When there is only one result found the search takes you directly to the entity result page,
bypassing the search results table.
Using wildcards in searches
Wildcards are available for both the simple and the advanced search. The wildcard character is ‘*’.
A wildcard character allows you to find compounds by typing in a partial name. The search engine
will then try to find names matching the pattern you have specified.
To match words starting with your search term add the wildcard character to the end of your
search term. For example, searching for aceto* will find compounds such as acetochlor,
acetophenazine, and acetophenazine maleate.
To match words ending with your search term add the wildcard character to the start of your
search term. For example, searching for *azine will find compounds such as
2-(pentaprenyloxy)dihydrophenazine, acetophenazine, and
4-(ethylamino)-2-hydroxy-6-(isopropylamino)-1,3,5-triazine.
To match words containing a search term, add the wildcard character to the start and the end of
your search term. For example, searching for *propyl* will find compounds such as
(R)-2-hydroxypropyl-CoM, 2-isopropylmaleic acid, and 2-methyl-1-hydroxypropyl-TPP.
Any number of wildcard characters may be used within a search term, thus making the search
facility very powerful.
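The wildcard behaviour described above can be sketched in a few lines of Python. This is only an illustration of the matching rules, not the ChEBI search engine itself: the function names and the small name list are hypothetical, and the sketch matches against the whole name rather than individual words.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a '*' wildcard pattern into a case-insensitive regex."""
    # Escape each literal piece so characters such as '(' or '-' match themselves,
    # then join the pieces with '.*', which stands for "any run of characters".
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces), re.IGNORECASE)

def wildcard_search(pattern: str, names: list[str]) -> list[str]:
    """Return the names that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]

# Hypothetical mini name list built from examples in the text.
NAMES = [
    "acetochlor",
    "acetophenazine",
    "acetophenazine maleate",
    "2-isopropylmaleic acid",
    "water",
]

print(wildcard_search("aceto*", NAMES))    # names starting with "aceto"
print(wildcard_search("*azine", NAMES))    # names ending with "azine"
print(wildcard_search("*propyl*", NAMES))  # names containing "propyl"
```

Because any number of wildcards may appear in a term, the sketch splits on every '*' rather than handling only a leading or trailing one.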
Note on Special Characters
Data containing Unicode characters may be searched for by using Unicode UTF-8 characters or by
using their ASCII representation. For example, when searching for a formula containing subscripts
such as H₂O, the ASCII representation is H2O.
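If you prepare search terms in a script, one possible way to obtain the plain form of a subscripted formula is Unicode compatibility normalisation. The sketch below uses Python's standard unicodedata module. It is an assumption about how you might pre-process a query, not a description of what ChEBI does internally, and the function name is hypothetical.

```python
import unicodedata

def fold_compatibility_chars(text: str) -> str:
    """Replace compatibility characters (e.g. subscript digits) with plain ones."""
    # NFKC normalisation maps characters such as SUBSCRIPT TWO (U+2082)
    # to their ordinary equivalents, here the digit '2'.
    return unicodedata.normalize("NFKC", text)

print(fold_compatibility_chars("H\u2082O"))  # water written with a subscript two
```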
Advanced Search
The advanced text search provides for additional granularity by allowing you to specify which
category to search in, as well as providing the option of using Boolean operations when searching.
The advanced search page also contains the structure-based search which may be combined with
the text-based search.
To access the advanced search screen, select the “Advanced Search” link from the left-hand menu.
Alternatively, if a simple search fails to return any results, you will be taken there automatically.
Figure 2 ChEBI Advanced Search screen, showing the structure search facility (described in the
next section) and the advanced text search
This screen allows searching by structure and/or by text. The structure searching facility is
described in the next section of this tutorial. The remainder of this section refers to the advanced
text search facility.
Search terms are entered into separate text boxes for plain text, formula, molecular weight range,
and charge range searches. You can then filter the result set by database and by ontology term.
Further filtering is available on chemical structures and on the ChEBI starring system.
Operators
ChEBI provides the standard Boolean operators when searching for compounds.
AND
This operator allows you to find a compound which contains all of your search terms. For
example, if you are searching for a pyruvic acid with formula C5H6O4, specifying *pyruvic acid
C5H6O4 as the search term and selecting AND as the search option will retrieve acetylpyruvic
acid.
OR
This operator allows you to enter two or more words and finds compounds which contain at least
one of them. For example, if you wanted to find all compounds
containing iron in the database, you could type in the search string iron fe Fe2 Fe3.
BUT NOT
Sometimes common words can be a problem when searching, as they can provide too many
results. The 'BUT NOT' operator can be used to limit the result set. For example, if you were
looking for a compound related to chlorine but excluding acidic compounds, you could specify
*chlor* as your search string but qualify the search by specifying acid in the BUT NOT operator.
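The logic of the three operators can be illustrated with a small Python sketch. The record set, the function name, and the substring matching are all assumptions made for illustration. The real ChEBI search engine works on its own index and may match terms differently.

```python
# Hypothetical toy records: name -> formula.
RECORDS = {
    "acetylpyruvic acid": "C5H6O4",
    "pyruvic acid": "C3H4O3",
    "chloroacetic acid": "C2H3ClO2",
    "chloroform": "CHCl3",
}

def boolean_search(records, terms, operator="AND", but_not=()):
    """Return names whose name or formula satisfies the Boolean query."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    # AND requires every term to be present, OR requires at least one.
    combine = all if operator == "AND" else any
    hits = []
    for name, formula in records.items():
        text = f"{name} {formula}".lower()
        if not combine(term.lower() in text for term in terms):
            continue
        # BUT NOT removes any hit that contains an excluded term.
        if any(term.lower() in text for term in but_not):
            continue
        hits.append(name)
    return hits

print(boolean_search(RECORDS, ["pyruvic acid", "C5H6O4"], "AND"))
print(boolean_search(RECORDS, ["pyruvic", "chloroform"], "OR"))
print(boolean_search(RECORDS, ["chlor"], "AND", but_not=["acid"]))
```

In this toy data the AND query returns only acetylpyruvic acid, and the BUT NOT query keeps chloroform while dropping chloroacetic acid.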
Searching in categories
This option allows you to narrow down your search by using the categories provided. Below is a
summary of these categories.
All - allows you to search all the categories.
ChEBI ID - allows searching for specific ChEBI identifiers.
ChEBI name - will search only for ChEBI names matching your search term.
Definition - will search the ChEBI definitions.
All Names - will search in all the synonyms, ChEBI Names and IUPAC names available for this compound.
IUPAC name - will search for IUPAC names.
Database Links - allows searching for accession numbers from other sources.
Formula - will search for formulae.
Mass - allows searching by molecular weight.
Charge - allows searching for the charge.
CAS Registry Number - will search for CAS Registry Numbers matching your search criteria.
InChI/InChIKey - will search for InChIs or InChIKeys matching your search criteria.
SMILES - will search for SMILES matching your search criteria.
Categories can be used with any combination of operators described above.
Using filters
Filters can be used to refine search terms. The following filters are available:

Filter by Ontology Term – allows you to restrict your search to entities which are related to an
ontology term by a given relationship type. For example, type the term ‘cofactor’; terms that
match your criteria will appear in a drop-down list, and clicking on one of them selects it.
The next step is to select the relationship to filter on. In this example the term is a
biological role, so the ‘has role’ relationship should be selected to find all structures.
If the ChEBI id is predetermined.

Filter by Database – allows you to filter the results by any of the databases which ChEBI
cross-references.

Filter by Stars – to search only entries manually annotated by ChEBI, select 3-star
entries only.

Filter by Chemical Structure – to search exclusively for entries which contain chemical
structures, i.e. entries which are not exclusively ontological classes.
Structure-based Search
The structure-based search facility within ChEBI allows you to search for structures in the
database based on a provided structure which may be drawn or uploaded.
[Figure: structure search screen, showing the structure search options and the applet for
entering the search structure]
The structure searching facility is powered by OrChem. More information can be found at
http://orchem.sf.net/.
Using the structure applet
The applet used to enter structures is provided by JChemPaint. More information can be found at
http://jchempaint.sf.net.
[Figure: JChemPaint applet. Annotations: File menu allows loading from file; Structure
templates, right click to select; Bond selection; Open full periodic table to select
additional atoms]
The top menu bar allows access to most of the available functionality of the sketching applet,
grouped into menus for file manipulation, general editing functionality such as copy/paste, view
manipulations such as display and colour options, and various structure-drawing utilities.
In addition to the top menu bar, various structure-drawing tools and utilities are available from the
left hand graphical button menu bar, various atom options from the bottom button menu bar, and
structure templates from the right-hand-side button menu bar.
Results display
The search results are displayed in a grid, shown below. The results are paginated if more than 15
results are retrieved, and if only one result is returned then the entry page for that entity is loaded
directly.
Figure 3 Grid search results
Clicking on the relevant ChEBI accession hyperlinked under the search result image takes you to
the entry page for that entity. In addition, you can zoom in on a structure by hovering over
the relevant compound.
Fingerprints and chemical structure searching
Chemical substructure and similarity searches are based on fingerprints. A fingerprint of a
chemical structure is a way of representing special characteristics of that structure in an easily
searchable form.
Why do we need fingerprints?
Determining whether a given chemical structure is a substructure of, or is similar to,
another structure is computationally very expensive. In the worst case, the time
taken increases exponentially with the number of atoms. This makes running these searches
across whole databases impractical.
Fortunately, a general substructure or similarity search algorithm does not have to be used across
the full database. Various different heuristics can be used to drastically narrow the number of
candidates for the algorithm to be applied to. For example, the chemical formula could be used as
one such heuristic. Let’s say we want to search for all structures which have paracetamol as a
substructure.
The chemical formula of paracetamol is C8H9NO2. This means that we could
immediately eliminate from the search candidates any structures which don’t
contain at least those quantities of carbon, hydrogen, nitrogen and oxygen. This
simple heuristic acts as a screen which cuts down the number of structures
required to perform the full substructure search against, by a large percentage.
Fingerprints are designed to operate similarly, as a screening device. However,
they are more generalized and abstract, allowing more information about the structure to be
encoded, and thus eliminate a far larger percentage of search candidates.
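The formula screen described above can be sketched in Python as follows. This is a minimal illustration of the heuristic exactly as stated in the text, assuming simple formulae without brackets or charges. The function names and the candidate list are hypothetical.

```python
import re
from collections import Counter

def element_counts(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as 'C8H9NO2'."""
    counts = Counter()
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def passes_formula_screen(query_formula: str, candidate_formula: str) -> bool:
    """True if the candidate has at least as many atoms of every query element."""
    query = element_counts(query_formula)
    candidate = element_counts(candidate_formula)
    return all(candidate[element] >= n for element, n in query.items())

PARACETAMOL = "C8H9NO2"
# Hypothetical candidates: phenacetin, benzene, caffeine.
for formula in ["C10H13NO2", "C6H6", "C8H10N4O2"]:
    print(formula, passes_formula_screen(PARACETAMOL, formula))
```

Caffeine (C8H10N4O2) passes the screen even though it does not contain paracetamol as a substructure. The screen only removes impossible candidates, so the full substructure test is still needed for everything that passes.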
What are fingerprints?
A fingerprint is a boolean array, or bitmap, in which the characteristic features of the structural
pattern are encoded. The fingerprint is created by an algorithm which generates patterns for

each atom in the structure,

then each atom and its nearest neighbours, including the bonds between them,

then each group of atoms connected by paths up to two bonds long,

… continuing with paths 3, 4, 5, 6, 7, and 8 bonds long.
For example, the water molecule generates the following patterns:
water (HOH)
0-bond paths: H, O
1-bond paths: HO, OH
2-bond paths: HOH
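The path patterns listed for water can be reproduced with a short Python sketch. It is a simplified illustration only: the molecule is given as a hypothetical list of atoms and bonded index pairs, bond types are ignored, and the function is not taken from any particular fingerprinting library.

```python
def enumerate_paths(atoms, bonds, max_bonds=8):
    """Return the set of atom-path patterns up to max_bonds bonds long.

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    patterns = set()

    def extend(path):
        # Record the pattern for the current path, then try to grow it.
        patterns.add("".join(atoms[i] for i in path))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # never revisit an atom within one path
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return patterns

water_atoms = ["H", "O", "H"]
water_bonds = [(0, 1), (1, 2)]
paths = enumerate_paths(water_atoms, water_bonds)
print(sorted(paths, key=lambda p: (len(p), p)))  # ['H', 'O', 'HO', 'OH', 'HOH']
```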
Every pattern in the molecule up to 8 bonds in length is generated. Each generated pattern is
hashed to produce a bit string. The individual bit strings are then combined using logical OR
to create the final 1024-bit fingerprint.
For example, considering the above set of patterns for the structure of water, but assuming for
purposes of illustration that we have only a 10-bit result to create the fingerprint, the algorithm
works something like this:
Pattern    Hashed bitmap
H          0000010000
O          0010000000
HO         1010000000
OH         0000100010
HOH        0000000101
Result:    1010110111
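The combination step can be checked with a few lines of Python. The sketch below takes the illustrative 10-bit bitmaps from the example as given (the hashing step itself is not reproduced) and combines them with logical OR. The names used are hypothetical.

```python
from functools import reduce

# Illustrative 10-bit hashed bitmaps for the water patterns, copied from the example.
WATER_PATTERN_BITS = {
    "H": "0000010000",
    "O": "0010000000",
    "HO": "1010000000",
    "OH": "0000100010",
    "HOH": "0000000101",
}

def combine_bitmaps(bitmaps, width=10):
    """OR all bitmaps together and return the result as a bit string."""
    value = reduce(lambda acc, bits: acc | int(bits, 2), bitmaps, 0)
    return format(value, f"0{width}b")

print(combine_bitmaps(WATER_PATTERN_BITS.values()))  # 1010110111
```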
Since the final result represents “infinitely” many structural possibilities (as there are infinitely
many chemical structures, at least in theory) in a fixed length bitmap, it is inevitable that collisions
will occur – bits may be set already when they appear in a subsequent pattern. Thus, fingerprints
do not uniquely represent chemical structures. However, fingerprints do have the very useful
property that every bit set in the fingerprint of a substructure of a given structure, will also be set
in the fingerprint of the full structure. So, for example, the pattern for the water substructure
hydroxide (OH) might look like this:
Pattern    Hashed bitmap
H          0000010000
O          0010000000
OH         0000100010
Result:    0010110010
It is easy to see with a quick scan that every bit set in this substructure fingerprint is also set in the
full HOH structure fingerprint.
How are fingerprints used?
For substructure searching, fingerprints are used as effective screening devices to narrow the set of
candidates for a full substructure search. If all bits in a query fingerprint are also present in the
target fingerprint of a stored database structure, this structure is subjected to the computationally
expensive subgraph matching algorithm. Those bit operations are very fast and independent of the
number of atoms in a structure due to the fixed length of the fingerprint.
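The screening test amounts to a single bitwise comparison per database entry. The following Python sketch uses the illustrative 10-bit fingerprints for hydroxide and water from the earlier example, plus one made-up entry. The function names and the toy database are assumptions for illustration.

```python
def may_contain(query_fp: int, target_fp: int) -> bool:
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp

def screen(query_fp: int, database: dict) -> list:
    """Return the names of entries that survive the fingerprint screen."""
    return [name for name, fp in database.items() if may_contain(query_fp, fp)]

HYDROXIDE = int("0010110010", 2)
DATABASE = {
    "water": int("1010110111", 2),
    "hypothetical entry X": int("0000010101", 2),  # made-up fingerprint
}

# Only the survivors would go on to the expensive subgraph matching step.
print(screen(HYDROXIDE, DATABASE))  # ['water']
```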
For similarity searching, fingerprints are used as input to the calculation of the similarity of two
molecules in the form of the Tanimoto coefficient, described below in the section on similarity
searching.
It is important to remember that fingerprints have limitations: they are good at indicating that a
particular structure feature is not present but they can only indicate a structure feature's presence
with some probability.
For more information on fingerprints and chemical structure searching, see the Daylight Theory
Manual (http://www.daylight.com/dayhtml/doc/theory/theory.finger.html).
Search options
There are three main structure search options.

Structures that contain a structure (substructure search) allows you to find all structures
which contain the structure you have drawn.

Structures that look like a structure (similarity search) allows you to find all structures
that look similar to yours but do not necessarily have the same subgraph.

Structures identical to a structure (identity search) allows you to find identical compounds.
Other options allow you to specify whether your searches should incorporate stereochemistry. All
these options are discussed below.
Identity
Identity searches are based on InChI, which means that an InChI is generated from the drawn or
uploaded structure, and the database is then searched for exact matches to that InChI. This means
that the identity search is subject to the same limitations as the uniqueness of InChIs (discussed in
Block 1 of the ChEBI training course). For example, searching for structures identical to cisplatin
returns both cisplatin and transplatin (illustrated below).
However, in most cases an identity search will take you directly to the entry page for the single
structure you have drawn, if it exists in the database.
Structures that contain (Substructure)
The substructure search retrieves entities in the database of which the given search structure forms
a substructure (that is, the search structure is contained within the structure of the entity). For
example, naphthalene is a substructure of several entries in the database.
Figure 4 Search results for substructure naphthalene
Fingerprints are used to eliminate candidates from further examination in substructure searching.
For molecule A to be a substructure of molecule B, all bits set in the fingerprint of molecule A
must also be set in the fingerprint of molecule B. Once this initial screening is performed, the
potential substructure candidates are subjected to a more rigorous inspection to determine whether
molecule A is a substructure of molecule B.
Similarity
The similarity search retrieves entries in the database to which the given search structure is
similar. The measure of similarity is the Tanimoto coefficient, which is calculated as the ratio
T(a,b) = c/(a + b - c)
where c is the count of bits “on” (i.e. 1 not 0) in the same position in both fingerprints, a is
the count of bits on in object A, and b is the count of bits on in object B.
For example, suppose we are looking for the similarity between the two bitmap fingerprints shown
below.
Object                  Fingerprint
Object A                0010110010
Object B                1010110001
c (bits on in both)     3
a (bits on in Obj A)    4
b (bits on in Obj B)    5
Tanimoto (c/(a+b-c))    3/(4+5-3) = 0.5
The Tanimoto coefficient varies in the range 0.0 – 1.0, with a score of 1.0 indicating that the two
structures are very similar (i.e. their fingerprints are the same).
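The worked example can be verified with a short Python function. This is a sketch of the formula given above applied to bit strings. The function name is hypothetical, and returning 0.0 when both fingerprints are empty is an assumed convention, not something the text specifies.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient c / (a + b - c) for two bit-string fingerprints."""
    bits_a = int(fp_a, 2)
    bits_b = int(fp_b, 2)
    a = bin(bits_a).count("1")           # bits on in A
    b = bin(bits_b).count("1")           # bits on in B
    c = bin(bits_a & bits_b).count("1")  # bits on in both
    union = a + b - c
    # Assumed convention: two empty fingerprints score 0.0.
    return c / union if union else 0.0

print(tanimoto("0010110010", "1010110001"))  # 0.5
print(tanimoto("0010110010", "0010110010"))  # 1.0
```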
Note that the fingerprints are calculated on a chemical structure path depth of at most eight
and are of limited size, so collisions do occur. As a result, some structures will have similar
fingerprints, and thus very high similarity scores, even though they are not as structurally
similar as the score suggests.
Browsing and navigating within ChEBI
Periodic Table
The ChEBI Periodic Table allows browsing of the dataset by the familiar periodic table. The
periodic table may be browsed for molecular entities containing those elements, or for the
elements themselves.
[Figure: ChEBI periodic table browser, with views for Elements and Molecular Entities]
Clicking one of the links on the molecular entities periodic table browser takes you to the entry
page for the particular molecular entities class within ChEBI. For example, clicking the link on
sodium (Na) results in:
[Figure: entry page for sodium molecular entities. Annotation: navigate data via links in
ontology]
‘Sodium molecular entities’ is a class within the ChEBI ontology. You can navigate further within
the data by clicking one of the links in the ChEBI ontology.
More information about browsing and navigating the ChEBI dataset via the ontology will be
provided in the next block of the training course – Block 3 ‘Understanding the ChEBI ontology’.
For more information
For further information, email the ChEBI team at: chebi-help@ebi.ac.uk, or log on to the
SourceForge forum at https://sourceforge.net/projects/chebi/.
Additional information about using ChEBI can be found by examining the User Manual at
http://www.ebi.ac.uk/chebi/userManualForward.do.
The latest news, updates and developments are announced via an RSS feed.
Exercises
You will need access to ChEBI online at http://www.ebi.ac.uk/chebi to complete these exercises.
1. Find all the entities which contain a reference to the term “phenol”. How many entities are
retrieved?
________________________________________________________________
2. Find all the entities with molecular formula C10H18O.
________________________________________________________________
3. Find all the entities with formula C6H6O4 which are not carboxylic acids.
________________________________________________________________
4. Find all the entities where the IUPAC name contains the string “pyrimidin”. (Hint: use
wildcards.)
________________________________________________________________
5. Find all the entities which contain phenol as a substructure. Which entity is the first
result?
____________________________________________________________________
6. Find all the entities which have a similar structure to paracetamol. Excluding
paracetamol itself from the result set, can you find a highly similar result which is also used as
a drug? What is the INN for this drug?
____________________________________________________________________
7. Find all the entities which have molecular formula C6H6O4 and contain the substructure
[structure image]. How many entities are retrieved? Note: you need to increase the result size.
____________________________________________________________________
8. Find all the entities which have a mass in the range 300 to 500 amu and contain the
substructure [structure image]. How many entities are retrieved? Note: you need to increase the
result size.
____________________________________________________________________
9. Can you find all the chemical structures which have 60 carbons in them? Tip: use the formula
field in the Advanced Search.
____________________________________________________________________
10. Can you find all the entries which have a biological role ‘cofactor’? Name one of them. Tip:
use the Filter by Ontology Term section in the Advanced Search. Can you find only the chemical
structures which have a role ‘cofactor’? Name one of them.
____________________________________________________________________
11. Can you find chemical entities which have a charge in the range of -8 to -5? How many of
them are there?
____________________________________________________________________