Last modified by: Paula de Matos (April 2011)

Searching and browsing ChEBI

This tutorial covers methods of searching and browsing the data contained in the ChEBI database via the online public user interface. You will learn how to perform simple and advanced text searches in ChEBI, as well as chemical structure searches, which may be exact, substructure, or similarity searches. You will also learn how to navigate within and browse the ChEBI data.

This is the second of four training blocks in the ChEBI training course.

Block 1 – Introduction to ChEBI
Block 2 – Searching and browsing ChEBI
Block 3 – Understanding the ChEBI ontology
Block 4 – Download and programmatic access

Contents

Searching ChEBI
  Simple Search
  Advanced Search
  Structure-based Search
    Fingerprints and chemical structure searching
    Search options
Browsing and navigating within ChEBI
  Periodic Table
For more information
Exercises

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Searching ChEBI

ChEBI can be searched via a simple or advanced text search, or via a structure-based search.

Simple Search

To access the simple search, enter your search query into the search box located at the top left of the ChEBI front page. The search query may be any data associated with an entity, such as names, synonyms, formulae, CAS or Beilstein Registry numbers, or InChIs.

When a search returns multiple results, you are taken to a search results page. Search results may be downloaded for import into other applications.

Figure 1 Search results for 'water'

When only one result is found, the search takes you directly to the entry page for that entity, bypassing the search results table.

Using wildcards in searches

Wildcards are available for both the simple and the advanced search. The wildcard character is ‘*’. A wildcard character allows you to find compounds by typing in a partial name. The search engine will then try to find names matching the pattern you have specified.

To match words starting with your search term, add the wildcard character to the end of your search term. For example, searching for aceto* will find compounds such as acetochlor, acetophenazine, and acetophenazine maleate.
To match words ending with your search term, add the wildcard character to the start of your search term. For example, searching for *azine will find compounds such as 2-(pentaprenyloxy)dihydrophenazine, acetophenazine, and 4-(ethylamino)-2-hydroxy-6-(isopropylamino)-1,3,5-triazine.

To match words containing a search term, add the wildcard character to the start and the end of your search term. For example, searching for *propyl* will find compounds such as (R)-2-hydroxypropyl-CoM, 2-isopropylmaleic acid, and 2-methyl-1-hydroxypropyl-TPP.

Any number of wildcard characters may be used within a search term, which makes the search facility very powerful.

Note on Special Characters

Data containing Unicode characters may be searched for by using Unicode UTF-8 characters or by using their ASCII representation. For example, when searching for a formula written with subscripts such as H₂O, the ASCII representation is H2O.

Advanced Search

The advanced text search provides additional granularity by allowing you to specify which category to search in, and by offering Boolean operations when searching. The advanced search page also contains the structure-based search, which may be combined with the text-based search.

To access the advanced search screen, select the “Advanced Search” link from the left-hand menu. Alternatively, if a simple search fails to return any results, you will be taken there automatically.

Figure 2 ChEBI Advanced Search screen (callouts: structure search facility, described in the next section; advanced text search)

This screen allows searching by structure and/or by text. The structure searching facility is described in the next section of this tutorial. The remainder of this section refers to the advanced text search facility.

Search terms are entered into separate text boxes for the plain text search, the formula search, and the molecular weight and charge range searches. You can then filter the result set by database and by ontology term.
Further filtering can be done on chemical structures and the ChEBI starring system.

Operators

ChEBI provides the standard Boolean operators when searching for compounds.

AND
This operator allows you to find a compound which contains all of your search terms. For example, if you are searching for a pyruvic acid with formula C5H6O4, specifying *pyruvic acid C5H6O4 as the search term and selecting AND as the search option will retrieve acetylpyruvic acid.

OR
This operator allows you to type two or more words and finds compounds which contain at least one of them. For example, to find all compounds containing iron in the database, you could type in the search string iron fe Fe2 Fe3.

BUT NOT
Common words can be a problem when searching, as they can return too many results. The 'BUT NOT' operator can be used to limit the result set. For example, if you were looking for a compound related to chlorine but excluding acidic compounds, you could specify *chlor as your search string and qualify the search by specifying acid in the BUT NOT operator.

Searching in categories

This option allows you to narrow down your search by using the categories provided. Below is a summary of these categories.

All - allows you to search all the categories.
ChEBI ID - allows searching for specific ChEBI identifiers.
ChEBI name - will search only for ChEBI names matching your search term.
Definition - will search in the ChEBI definitions.
All Names - will search in all the synonyms, ChEBI names and IUPAC names available for a compound.
IUPAC name - will search for IUPAC names.
Database Links - allows searching for accession numbers from other sources.
Formula - will search for formulae.
Mass - will search by molecular weight.
Charge - allows searching for the charge.
CAS Registry Number - will search for CAS Registry Numbers matching your search criteria.
InChI/InChIKey - will search for InChIs or InChIKeys matching your search criteria.
SMILES - will search for SMILES matching your search criteria.

Categories can be used with any combination of the operators described above.

Using filters

Filters can be used to refine search terms. The following filters are available:

Filter by Ontology Term - allows you to restrict your search to entities which are related to an ontology term by a given relationship type. For example, type the term ‘cofactor’; terms that match will appear in a drop-down list, and clicking on one of them selects it. The next step is to select the relationship to filter on. In this example we are using a biological role term, so to find all structures with that role the ‘has role’ relationship should be selected. If the ChEBI id is predetermined.
Filter by Database - allows you to filter the search by any of the databases which ChEBI cross-references.
Filter by Stars - to search only manually annotated ChEBI entries, select 3 star entries only.
Filter by Chemical Structure - restricts the search to entries which contain chemical structures, i.e. entries which are not exclusively ontological classes.

Structure-based Search

The structure-based search facility within ChEBI allows you to search for structures in the database based on a provided structure, which may be drawn or uploaded.

(Screenshot callouts: structure search options; applet for entering search structure)

The structure searching facility is powered by OrChem. More information can be found at http://orchem.sf.net/.

Using the structure applet

The applet used to enter structures is provided by JChemPaint. More information can be found at http://jchempaint.sf.net.
(Screenshot callouts: File menu allows loading from file; Structure templates, right click to select; Bond selection; Open full periodic table to select additional atoms)

The top menu bar gives access to most of the functionality of the sketching applet, grouped into menus for file manipulation, general editing functionality such as copy/paste, view manipulations such as display and colour options, and various structure-drawing utilities. In addition, structure-drawing tools and utilities are available from the left-hand button bar, atom options from the bottom button bar, and structure templates from the right-hand button bar.

Results display

The search results are displayed in a grid, shown below. The results are paginated if more than 15 results are retrieved. If only one result is returned, the entry page for that entity is loaded directly.

Figure 3 Grid search results

Clicking on the ChEBI accession hyperlinked under a search result image takes you to the entry page for that entity. In addition, you can zoom into a structure by hovering over the relevant compound.

(Screenshot callout: hover-over menu for zooming)

Fingerprints and chemical structure searching

Chemical substructure and similarity searches are based on fingerprints. A fingerprint of a chemical structure is a way of representing special characteristics of that structure in an easily searchable form.

Why do we need fingerprints?

Determining whether a given chemical structure is a substructure of, or is similar to, another structure is computationally very expensive. In the worst case, the time taken increases exponentially with the number of atoms. This makes running these searches across whole databases practically intractable.

Fortunately, a general substructure or similarity search algorithm does not have to be used across the full database.
Various heuristics can be used to drastically narrow the number of candidates to which the algorithm must be applied. For example, the chemical formula could be used as one such heuristic. Let’s say we want to search for all structures which have paracetamol as a substructure. The chemical formula of paracetamol is C8H9NO2. This means that we could immediately eliminate from the search candidates any structures which don’t contain at least those quantities of carbon, hydrogen, nitrogen and oxygen. This simple heuristic acts as a screen which cuts down, by a large percentage, the number of structures against which the full substructure search has to be performed.

Fingerprints are designed to operate in the same way, as a screening device. However, they are more generalized and abstract, allowing more information about the structure to be encoded and thus eliminating a far larger percentage of search candidates.

What are fingerprints?

A fingerprint is a boolean array, or bitmap, in which the characteristic features of the structural pattern are encoded. The fingerprint is created by an algorithm which generates patterns for each atom in the structure, then for each atom and its nearest neighbours, including the bonds between them, then for each group of atoms connected by paths up to two bonds long, and so on for paths of 3, 4, 5, 6, 7, and 8 bonds. For example, the water molecule generates the following patterns:

water (HOH)
0-bond paths   H   O
1-bond paths   HO  OH
2-bond paths   HOH

Every pattern in the molecule up to a length of 8 bonds is generated. Each generated pattern is hashed to produce a bit string. The individual bit strings are then combined using logical OR to create the final 1024-bit fingerprint.
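The pattern-generation and hashing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm that OrChem actually uses: the function names enumerate_patterns and fingerprint are hypothetical, bond types are ignored, and each pattern sets a single bit chosen with CRC32, whereas a real implementation may set several bits per pattern.

```python
import zlib


def enumerate_patterns(atoms, bonds, max_bonds=8):
    """Return the set of atom-path strings containing 0..max_bonds bonds.

    atoms is a list of element symbols; bonds is a list of (i, j) index pairs.
    Assumption of this sketch: bond orders are ignored.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    patterns = set()

    def walk(path):
        # Record the pattern spelled out by the atoms along this path.
        patterns.add("".join(atoms[i] for i in path))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # never revisit an atom within one path
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return patterns


def fingerprint(patterns, n_bits=1024):
    """Hash every pattern to one bit position and OR the bits together."""
    bits = 0
    for pattern in patterns:
        position = zlib.crc32(pattern.encode("utf-8")) % n_bits
        bits |= 1 << position
    return bits


# Water: H-O-H
water_patterns = enumerate_patterns(["H", "O", "H"], [(0, 1), (1, 2)])
water_fp = fingerprint(water_patterns)
print(sorted(water_patterns))  # ['H', 'HO', 'HOH', 'O', 'OH']
```

For water this yields the five distinct patterns H, O, HO, OH and HOH. Because the fingerprint is a fixed-length bitmap, two different patterns may hash to the same bit position, so the number of bits set can be smaller than the number of patterns.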
For example, considering the above set of patterns for the structure of water, but assuming for purposes of illustration that we have only a 10-bit result to create the fingerprint, the algorithm works something like this:

Pattern   Hashed bitmap
H         0000010000
O         0010000000
HO        1010000000
OH        0000100010
HOH       0000000101
Result:   1010110111

Since the final result represents “infinitely” many structural possibilities (as there are infinitely many chemical structures, at least in theory) in a fixed-length bitmap, it is inevitable that collisions will occur: a bit may already be set when it is required by a subsequent pattern. Thus, fingerprints do not uniquely represent chemical structures.

However, fingerprints do have the very useful property that every bit set in the fingerprint of a substructure of a given structure will also be set in the fingerprint of the full structure. So, for example, the fingerprint for the water substructure hydroxide (OH) might look like this:

Pattern   Hashed bitmap
H         0000010000
O         0010000000
OH        0000100010
Result:   0010110010

A quick scan shows that every bit set in this substructure fingerprint is also set in the fingerprint of the full HOH structure.

How are fingerprints used?

For substructure searching, fingerprints are used as effective screening devices to narrow the set of candidates for a full substructure search. If all bits in a query fingerprint are also present in the fingerprint of a stored database structure, that structure is subjected to the computationally expensive subgraph matching algorithm. These bit operations are very fast and, because the fingerprint has a fixed length, independent of the number of atoms in a structure.

For similarity searching, fingerprints are used as input to the calculation of the similarity of two molecules in the form of the Tanimoto coefficient, described below in the section on similarity searching.
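Both uses can be tried out on the 10-bit illustration values for water and hydroxide given above. The following is a minimal sketch rather than ChEBI's or OrChem's implementation: the helper names combine, passes_screen and tanimoto are hypothetical, and the similarity uses the formula c/(a + b - c) that this tutorial gives in its section on similarity searching.

```python
def combine(hashed_bitmaps):
    """OR together the hashed bit strings of all patterns of a structure."""
    fp = 0
    for bitmap in hashed_bitmaps:
        fp |= int(bitmap, 2)
    return fp


def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target.

    Passing the screen only makes the target a candidate; a full
    subgraph match is still needed to confirm the substructure.
    """
    return (query_fp & target_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) computed on bit counts."""
    c = bin(fp_a & fp_b).count("1")
    a = bin(fp_a).count("1")
    b = bin(fp_b).count("1")
    if a + b - c == 0:
        return 0.0  # both fingerprints empty; a convention chosen for this sketch
    return c / (a + b - c)


# Hashed bitmaps taken from the 10-bit illustration in this tutorial.
water_fp = combine(["0000010000", "0010000000", "1010000000",
                    "0000100010", "0000000101"])
hydroxide_fp = combine(["0000010000", "0010000000", "0000100010"])

print(format(water_fp, "010b"))               # 1010110111
print(format(hydroxide_fp, "010b"))           # 0010110010
print(passes_screen(hydroxide_fp, water_fp))  # True
print(passes_screen(water_fp, hydroxide_fp))  # False
print(round(tanimoto(hydroxide_fp, water_fp), 3))  # 0.571
```

The screen is a single bitwise AND and comparison, which is why it costs the same regardless of how many atoms the structures contain.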
It is important to remember that fingerprints have limitations: they are good at indicating that a particular structural feature is not present, but they can only indicate a structural feature's presence with some probability.

For more information on fingerprints and chemical structure searching, see the Daylight Theory Manual (http://www.daylight.com/dayhtml/doc/theory/theory.finger.html).

Search options

There are three main structure search options.

Structures that contain a structure (substructure search) allows you to find all structures which contain the structure you have drawn.
Structures that look like a structure (similarity search) allows you to find all structures that look similar to yours but do not necessarily have the same subgraph.
Structures identical to (identity) allows you to find identical compounds.

Other options allow you to specify whether your searches should incorporate stereochemistry. All these options are discussed below.

Identity

Identity searches are based on InChI: an InChI is generated from the drawn or uploaded structure, and the database is then searched for exact matches to that InChI. The identity search is therefore subject to the same limitations as the uniqueness of InChIs (discussed in Block 1 of the ChEBI training course). For example, searching for structures identical to cisplatin returns both cisplatin and transplatin (illustrated below).

However, in most cases an identity search will take you directly to the entry page for the single structure you have drawn, if it exists in the database.

Structures that contain (Substructure)

The substructure search retrieves entities in the database of which the given search structure forms a substructure (that is, the search structure is contained within the structure of the entity). For example, naphthalene is a substructure of several entries in the database.
Figure 4 Search results for substructure naphthalene

Fingerprints are used to eliminate candidates from further examination in substructure searching. For molecule A to be a substructure of molecule B, all bits set in the fingerprint of molecule A must also be set in the fingerprint of molecule B. Once this initial screening is performed, the remaining candidates are subjected to a more rigorous inspection to determine whether molecule A really is a substructure of molecule B.

Similarity

The similarity search retrieves entries in the database to which the given search structure is similar. The measure of similarity is the Tanimoto coefficient, which is calculated as the ratio

T(a,b) = c/(a + b - c)

where c is the count of bits “on” (i.e. 1 not 0) in the same position in both fingerprints, a is the count of bits on in object A, and b is the count of bits on in object B.

For example, suppose we are looking for the similarity between the two bitmap fingerprints shown below.

Object                  Fingerprint
Object A                0010110010
Object B                1010110001
c (bits on in both)     3
a (bits on in Obj A)    4
b (bits on in Obj B)    5
Tanimoto (c/(a+b-c))    3/(4+5-3) = 0.5

The Tanimoto coefficient varies in the range 0.0 – 1.0, with a score of 1.0 indicating that the two structures are very similar (i.e. their fingerprints are the same).

Note that the fingerprints are calculated to a maximum path depth of eight bonds and are of limited size, so collisions do occur. As a result, some structures will have similar fingerprints, and thus very high similarity scores, even though the structures themselves are less similar than the score suggests.

Browsing and navigating within ChEBI

Periodic Table

The ChEBI Periodic Table allows browsing of the dataset by the familiar periodic table. The periodic table may be browsed for molecular entities containing those elements, or for the elements themselves.
(Screenshot callouts: Elements; Molecular Entities)

Clicking one of the links on the molecular entities periodic table browser takes you to the entry page for the corresponding molecular entities class within ChEBI. For example, clicking the link on sodium (Na) results in:

(Screenshot callout: navigate data via links in ontology)

‘Sodium molecular entities’ is a class within the ChEBI ontology. You can navigate further within the data by clicking one of the links in the ChEBI ontology. More information about browsing and navigating the ChEBI dataset via the ontology is provided in the next block of the training course, Block 3 ‘Understanding the ChEBI ontology’.

For more information

For further information, email the ChEBI team at chebi-help@ebi.ac.uk, or log on to the SourceForge forum at https://sourceforge.net/projects/chebi/.

Additional information about using ChEBI can be found in the User Manual at http://www.ebi.ac.uk/chebi/userManualForward.do.

The latest news, updates and developments are announced via an RSS feed.

Exercises

You will need access to ChEBI online at http://www.ebi.ac.uk/chebi to complete these exercises.

1. Find all the entities which contain a reference to the term “phenol”. How many entities are retrieved?
________________________________________________________________

2. Find all the entities with molecular formula C10H18O.
________________________________________________________________

3. Find all the entities which have formula C6H6O4 and which are not carboxylic acids.
________________________________________________________________

4. Find all the entities where the IUPAC name contains the string “pyrimidin”. (Hint: use wildcards.)
________________________________________________________________

5. Find all the entities which contain phenol as a substructure. Which entity is the first result?
________________________________________________________________

6. Find all the entities which have a similar structure to paracetamol. Excluding paracetamol itself from the result set, can you find a highly similar result which is also used as a drug? What is the INN for this drug?
________________________________________________________________

7. Find all the entities which have molecular formula C6H6O4 and contain the substructure [structure image]. How many entities are retrieved? Note: you need to increase the result size.
________________________________________________________________

8. Find all the entities which have a mass in the range 300 to 500 amu and contain the substructure [structure image]. How many entities are retrieved? Note: you need to increase the result size.
________________________________________________________________

9. Can you find all the chemical structures which have 60 carbons in them? Tip: use the formula search in the Advanced Search.
________________________________________________________________

10. Can you find all the entries which have the biological role ‘cofactor’? Name one of them. Tip: use the Filter by Ontology Term section in the Advanced Search. Can you find only the chemical structures which have the role ‘cofactor’? Name one of them.
________________________________________________________________

11. Can you find chemical entities which have a charge in the range -8 to -5? How many of them are there?
________________________________________________________________