StarOmics training agenda: Chemicals, Reactions, Enzymes, Pathways
StarOmics course, Lausanne, Monday November 19th

Outline
Part I:
• Introduction
• Major chemical classes and functional groups
• Isomerism
• Protonation states
• Compound naming
Part II:
• Coding chemical structure: formula, mol file, SMILES, InChI
• Chemical resources: ChEBI (+ classification), KEGG, MetaCyc, ChEMBL, PubChem, etc.
Part III:
• Chemoinformatics tools: ChemAxon exercises

Chemicals
A chemical compound is a pure chemical substance consisting of two or more different chemical elements.
• Elements (atoms) in a compound are present in a fixed ratio. Ex: 2 atoms of hydrogen + 1 atom of oxygen form 1 molecule of the compound water.
• Atoms are held together in a defined spatial arrangement by chemical bonds.
Chemical compounds have a unique and defined chemical structure.
Ex: acetic acid, formula C2H4O2

Identifying compounds
CAS Registry Numbers are unique numerical identifiers assigned by the Chemical Abstracts Service to every chemical described in the open scientific literature. The CAS number of acetic acid is 64-19-7.
The IUPAC nomenclature of organic chemistry is a systematic method of naming organic chemical compounds, as recommended by the International Union of Pure and Applied Chemistry (IUPAC). Ideally, every possible organic compound should have a name from which an unambiguous 2D structure can be created.
Ex: acetic acid is the IUPAC name.

Acetic acid: the IUPAC name is acetic acid.
L-lysine: the IUPAC name is (2S)-2,6-diaminohexanoic acid.

The IUPAC name is 18-bromo-12-butyl-11-chloro-4,8-diethyl-5-hydroxy-15-methoxytricos-6,13-diene-19-yne-3,9-dione.
http://en.wikipedia.org/wiki/IUPAC_nomenclature_of_organic_chemistry
IUPAC names are rarely used by the biologist community!
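The fixed atom ratio expressed by a formula such as C2H4O2 can be made concrete with a few lines of Python. This is only a minimal sketch: the function names and the small table of rounded atomic masses are illustrative assumptions, and the parser handles only simple formulas without parentheses, charges or isotopes.

```python
import re

# Rounded standard atomic masses in g/mol (illustrative subset, assumption).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007,
               "O": 15.999, "P": 30.974, "S": 32.06}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C2H4O2'."""
    counts = {}
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def molar_mass(formula):
    """Sum the atomic masses weighted by the atom counts."""
    return sum(ATOMIC_MASS[element] * count
               for element, count in parse_formula(formula).items())

print(parse_formula("C2H4O2"))         # {'C': 2, 'H': 4, 'O': 2}
print(round(molar_mass("C2H4O2"), 2))  # 60.05
```

A formula alone does not identify a compound: every isomer with the composition C2H4O2 gives the same counts, which is why structure-based encodings are needed.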
Chemical classes
C: carbon, O: oxygen, S: sulfur, N: nitrogen, P: phosphorus

Functional groups
Functional groups are specific groups of atoms or bonds within molecules that are responsible for the characteristic chemical reactions of those molecules. The same functional group undergoes the same or similar chemical reaction(s) regardless of the size of the molecule it is part of, but its relative reactivity can be modified by nearby functional groups.
Combining the names of functional groups with the names of the parent alkanes generates a powerful systematic nomenclature for naming organic compounds.

Hydrocarbons: common functional groups
For more functional groups: http://en.wikipedia.org/wiki/Functional_group

Main functional groups

Isomerism
Isomers: same molecular formula, different structural formula
• Constitutional isomers: different atom connectivities
• Stereoisomers: same atom connectivity, different arrangement in space
  • Enantiomers: mirror images
  • Diastereomers: not mirror images
    • Z/E or cis/trans isomers
    • Epimers
    • Anomers
    • Others

Enantiomers
• A carbon atom with four different groups is a tetrahedral stereogenic center, also called a chiral center or asymmetric carbon atom.
• Asymmetric carbons give rise to stereoisomerism.
• A and B are stereoisomers and, more precisely, enantiomers.
• In medicinal chemistry and biochemistry, enantiomers are a special concern because they may possess quite different biological activities.
Enantiomers / Diastereomers
• Compounds with 2 asymmetric carbons:

Naming enantiomers
• Since enantiomers are two different compounds, they need to be distinguished by name.
• Naming conventions:
  By configuration: R- and S-. For chemists, the R/S system is the most important nomenclature system for denoting enantiomers.
  By optical activity: (+)- and (−)-
  By configuration: D- and L-

Enantiomers: naming by absolute configuration R- and S-
• This system labels each chiral center R or S according to the priority assigned to each of its substituents by the Cahn-Ingold-Prelog (CIP) rules, based on atomic number.
  Examine the atoms directly attached to the stereogenic carbon. Groups attached through atoms of higher atomic number receive higher priority (e.g. O > N > C > H).
  When the attached atoms are identical, move along the branches to the next atoms, and repeat until a difference is found (e.g. -C(CH3)3 > -CH(CH3)2 > -CH2CH3 > -CH3).
• If the center is oriented so that the lowest-priority substituent points away from the viewer, there are two possibilities:
  If the priority of the remaining three substituents decreases in the clockwise direction, the center is labeled R (for rectus, Latin for right).
  If it decreases in the counterclockwise direction, it is labeled S (for sinister, Latin for left).
Ex: (2S)-2,6-diaminohexanoic acid; (2S,3R)-3-hydroxy-2-methylpentanoic acid

Enantiomers: naming by optical activity (+)- and (−)-
• An enantiomer can be named by the direction in which it rotates the plane of polarized light. If it rotates the light clockwise (as seen by a viewer towards whom the light is traveling), that enantiomer is labeled (+).
  It is dextrorotatory. Its mirror image is labeled (−) and is levorotatory.
• Enantiomers are also known as optical isomers because two enantiomers rotate plane-polarized light by equal amounts in opposite directions.

Enantiomers: naming by relative configuration D- and L-
• An optical isomer can be named by the spatial configuration of its atoms.
• The D/L system does this by relating the molecule to glyceraldehyde. Glyceraldehyde has 2 possible configurations, labeled L and D (from the Latin laevus and dexter, meaning left and right, respectively).
• All other molecules are assigned the D- or L- configuration if the chiral center can be formally obtained from glyceraldehyde by substitution. For this reason the D/L naming scheme is called relative configuration.
• Problem: this naming system can be ambiguous, in contrast to the R/S system.
• Similarity to glyceraldehyde is used to designate configuration. The amino acids in proteins are exclusively L stereoisomers.
• The D/L system is still very much used for naming sugars.
• When there is more than 1 asymmetric carbon, D and L refer to the configuration of the chiral center most distant from the reducing group (C=O).
  The convention is to arrange the Fischer projection with the carbonyl group at the top for aldoses and closest to the top for ketoses. The carbons are numbered from top to bottom.
  If the OH is on the right in the Fischer projection, the compound is D.
  If the OH is on the left, the compound is L.

Enantiomers: relations between the different naming systems
• The R/S system has no fixed relation to the (+)/(−) system.
  An R isomer can be either dextrorotatory or levorotatory, depending on its exact substituents.
• The R/S system also has no fixed relation to the D/L system.
• The D/L labeling is unrelated to (+)/(−); it does not indicate which enantiomer is dextrorotatory and which is levorotatory. Rather, it says that the compound's stereochemistry is related to that of the dextrorotatory or levorotatory enantiomer of glyceraldehyde; the dextrorotatory isomer of glyceraldehyde is, in fact, the D- isomer.
• The D/L system remains in common use in certain areas, such as amino acid and carbohydrate chemistry. It is convenient to have all of the common amino acids of higher organisms labeled the same way. In D/L, they are all L. In R/S, they are not all S: most are, but cysteine, for example, is R, because of sulfur's higher atomic number.

Alkenes: E/Z (or cis/trans) nomenclature
• The C=C bond cannot rotate and is a very common cause of diastereomerism.
• When an alkene has more than one substituent, the double bond geometry is described using the labels E and Z. These labels come from the German words "entgegen", meaning "opposite", and "zusammen", meaning "together".
  Alkenes with the higher priority groups (as determined by the CIP rules) on the same side of the double bond have these groups together and are designated Z.
  Alkenes with the higher priority groups on opposite sides are designated E.
• A mnemonic to remember this: Z notation has the higher priority groups on "ze zame zide."
Ex: (cis)-but-2-ene and (trans)-but-2-ene

Chemical representation: Fischer projection
• Fischer projections (named after the German chemist Hermann Emil Fischer) are an ingenious means of representing the configurations of carbon atoms.
• Considering a carbon center, place the bonds extending towards the observer horizontally. The bonds pointing backward are then vertical. This arrangement is abbreviated as two crossed lines: the horizontal (forward) line and the vertical (backward) line, as shown in the figure.
(Figure: Fischer projections of a carbon center bearing H, OH, CHO and CH2OH.)

Epimers
Epimers: stereoisomers that differ only in configuration about one chiral center.
(Figure: Fischer projections of D-glucose and D-mannose, which are epimers differing only at C2.)

Anomers and Haworth projection
• Pentoses and hexoses can cyclize in solution.
• The hemiacetal or hemiketal carbon of the cyclic form of carbohydrates is the anomeric carbon (C1 below).
• Carbohydrate isomers that differ only in the stereochemistry of the anomeric carbon are called anomers. Anomers are thus epimers at C1.
• The α- and β-anomers are in equilibrium and interconvert through the open form. This process is named mutarotation.
(Figure: D-glucose.)

Isomers of D-arabinose:

Tautomerism
L-ascorbic acid (CHEBI:29073) and L-xylo-hex-3-ulonolactone (CHEBI:28745) interconvert spontaneously.
Tautomers are isomers of organic compounds that readily interconvert by a chemical reaction called tautomerization. This reaction commonly results in the formal migration of a hydrogen atom or proton, accompanied by a switch of a single bond and an adjacent double bond. Because of the rapid interconversion, tautomers are generally considered to be the same chemical compound.
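The D/L rule for sugars and the definitions of enantiomers, epimers and diastereomers given earlier can be sketched in code. In this hedged illustration, an open-chain aldohexose is reduced to the side (left or right) of the OH group at C2 to C5 in its Fischer projection; the tuple encoding and the function names are assumptions made for the example, and the sketch ignores special cases such as meso compounds.

```python
# Side of the OH group at C2, C3, C4, C5 in the Fischer projection (top to bottom).
D_GLUCOSE = ("right", "left", "right", "right")
D_MANNOSE = ("left", "left", "right", "right")
D_GALACTOSE = ("right", "left", "left", "right")
L_GLUCOSE = ("left", "right", "left", "left")

def mirror(config):
    """Mirror image: every chiral center is inverted."""
    return tuple("left" if side == "right" else "right" for side in config)

def d_or_l(config):
    """D if the OH on the chiral center farthest from the carbonyl is on the right."""
    return "D" if config[-1] == "right" else "L"

def relationship(a, b):
    """Classify two stereoisomers with the same number of chiral centers."""
    if len(a) != len(b):
        raise ValueError("configurations must have the same length")
    if a == b:
        return "identical"
    if b == mirror(a):
        return "enantiomers"
    differing = sum(x != y for x, y in zip(a, b))
    # Epimers are the diastereomers that differ at exactly one center.
    return "epimers" if differing == 1 else "diastereomers"

print(d_or_l(D_GLUCOSE), d_or_l(L_GLUCOSE))   # D L
print(relationship(D_GLUCOSE, D_MANNOSE))     # epimers
print(relationship(D_GLUCOSE, L_GLUCOSE))     # enantiomers
print(relationship(D_MANNOSE, D_GALACTOSE))   # diastereomers
```

The encoding is deliberately minimal: it captures relative configuration only and says nothing about R/S descriptors or optical rotation.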
Acid / Base
Arrhenius: an acid is a substance that dissociates in aqueous solution, releasing a proton (H+).
  Dissociation reaction: HA ⇌ A− + H+ (Ka: dissociation constant)
Brønsted and Lowry: generalization to a proton exchange reaction.
  Acid-base reaction: HA (acid) + B (base) ⇌ A− (conjugate base) + HB+ (conjugate acid)

Dissociation constant
Most molecules contain functional groups likely to lose or gain a proton under specific conditions. Each ionization equilibrium between the protonated and deprotonated forms of the molecule can be described by a constant called the pKa.
The dissociation constant Ka is usually written as a quotient of the equilibrium concentrations (in mol/L), denoted by [HA], [A−] and [H+]:
  Ka = [A−][H+] / [HA]   (HA: acid, A−: conjugate base)
The logarithmic measure of the acid dissociation constant is more commonly used in practice:
  pKa = −log10(Ka)

Acetic acid
[HA] = [A−] when pH = pKa
At neutral pH, the major species is the acetate ion. Acetic acid is the conjugate acid of acetate.

L-lysine

Naming compounds
As IUPAC names are not commonly used, referencing compounds raises some issues:
• Same name for different structures
• Different names for the same structure

Compound naming ambiguity
ferulic acid + CoASH + ATP = trans-feruloyl-CoA + products of ATP breakdown (EC 6.2.1.34)
phenylglyoxylate + NAD+ + CoA-SH = benzoyl-S-CoA + CO2 + NADH (EC 1.2.1.58)
acyl-CoA + H2O = CoA + a carboxylate (EC 4.1.1.9)

Compound naming ambiguity
Name: cholate
EC 1.1.1.159, EC 3.1.2.27, CHEBI:29747: need to use structural data
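Returning to the dissociation-constant slides: the statement that [HA] = [A−] when pH = pKa follows from rearranging Ka = [A−][H+] / [HA] into [A−]/[HA] = 10^(pH − pKa). The sketch below evaluates that ratio for acetic acid. The function name is hypothetical, and the pKa of 4.76 is a commonly quoted value that is not given in the slides, so treat it as an assumption.

```python
def fraction_deprotonated(pH, pKa):
    """Fraction of molecules present as A-, from [A-]/[HA] = 10**(pH - pKa)."""
    ratio = 10 ** (pH - pKa)
    return ratio / (1 + ratio)

ACETIC_ACID_PKA = 4.76  # assumed, commonly quoted value near 25 °C

for pH in (2.0, ACETIC_ACID_PKA, 7.0):
    share = fraction_deprotonated(pH, ACETIC_ACID_PKA)
    print(f"pH {pH:5.2f}: {share:.1%} acetate, {1 - share:.1%} acetic acid")
```

At pH 7 the deprotonated fraction is above 99%, which is consistent with acetate being the major species at neutral pH.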
Compounds: coding 2D structure
Formula: C6H11O9P
Net charge: -2
• mol file, SDF file
• SMILES
• InChI

Molfile / SDF file
• Molfile format
  An MDL Molfile is a file format created by MDL (later Symyx, which has since merged with Accelrys) for holding information about the atoms, bonds, connectivity and coordinates of a molecule.
  Layout of a Molfile: header; atom block; bond block; charge, repeated unit, …
• SDF (Structure-Data File) format
  SDF files wrap the Molfile format. Multiple compounds are delimited by lines consisting of four dollar signs ($$$$). A feature of the SDF format is its ability to include associated data.
  (Figure: SDF layout with molecule n−1, molecule n followed by its associated data, and molecule n+1.)

SMILES: Simplified Molecular-Input Line-Entry System
Line notation for describing the structure of chemical molecules using short ASCII strings.
Generation of SMILES: break cycles, then write as branches off a main backbone.
http://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
A number of equally valid SMILES can be written for a molecule. Ex: CCO, OCC and C(O)C all specify the structure of ethanol.
Algorithms have been developed to ensure that the same SMILES is generated for a molecule regardless of the order of atoms in the structure.
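That CCO, OCC and C(O)C describe the same molecule can be checked by parsing each string into a graph and comparing the graphs independently of atom order. The following is only a toy sketch under strong assumptions: it supports single-letter atoms (C, N, O, S, P), implicit single bonds and parentheses, with no rings, charges, bond symbols or stereochemistry, and it compares graphs by brute force over all atom orderings. It is not the canonicalization algorithm used by real chemistry toolkits, and both function names are hypothetical.

```python
from itertools import permutations

def parse_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds)."""
    atoms, bonds, stack, previous = [], set(), [], None
    for char in smiles:
        if char == "(":
            stack.append(previous)      # remember the branch point
        elif char == ")":
            previous = stack.pop()      # return to the branch point
        elif char in "CNOSP":
            atoms.append(char)
            current = len(atoms) - 1
            if previous is not None:
                bonds.add(frozenset((previous, current)))
            previous = current
        else:
            raise ValueError(f"unsupported SMILES character: {char!r}")
    return atoms, bonds

def canonical_key(smiles):
    """Smallest (atoms, bonds) description over all atom orderings.

    Brute force, so only practical for very small molecules.
    """
    atoms, bonds = parse_smiles(smiles)
    best = None
    for order in permutations(range(len(atoms))):
        new_index = {old: new for new, old in enumerate(order)}
        key = (
            tuple(atoms[old] for old in order),
            tuple(sorted(tuple(sorted(new_index[i] for i in bond))
                         for bond in bonds)),
        )
        if best is None or key < best:
            best = key
    return best

ethanol_forms = ["CCO", "OCC", "C(O)C"]
print(len({canonical_key(s) for s in ethanol_forms}))   # 1: one structure
print(canonical_key("CCO") == canonical_key("COC"))     # False: dimethyl ether
```

The key produced here plays the role of a canonical representation: it depends on the (arbitrary) ordering rule chosen, but it is the same for every valid way of writing the molecule.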
A SMILES generated by such a canonicalization algorithm is unique for each structure (although it depends on the algorithm used to generate it): this is the canonical SMILES.
Isomeric SMILES: string with information about double bond configuration and chirality.

InChI: The IUPAC International Chemical Identifier
http://www.iupac.org/inchi/

Compounds: chemical resources

Chemical resources: generic resources
ChEBI: http://www.ebi.ac.uk/chebi/
MetaCyc: http://metacyc.org/
KEGG: http://www.genome.jp/kegg/compound/
ChEMBL: https://www.ebi.ac.uk/chembl/
PubChem: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pccompound
ChemProt: http://www.cbs.dtu.dk/services/ChemProt/

ChEBI – Chemical Entities of Biological Interest
http://www.ebi.ac.uk/chebi/
• Non-redundant database of manually annotated chemical compounds, chemical groups (parts of molecular entities) and classes of entities
• ChEBI provides a chemical ontology that describes the relationships between molecular entities or classes of entities
• ChEBI focuses on small molecules, i.e. molecules such as nucleic acids, proteins and peptides derived from proteins by cleavage are not (and will not be) included in ChEBI

ChEBI (Chemical Entities of Biological Interest)
Each ChEBI 3-star entity contains:
• A unique, unambiguous, recommended ChEBI name and an associated stable unique identifier
• A 2D chemical structure when appropriate (small compounds and groups, but not classes of compounds)
• A definition where appropriate
• A collection of synonyms, including the IUPAC recommended name for the entity where appropriate, and brand names and INNs for drugs
• A collection of manually and automatically generated cross-references to other databases
• Links to the ChEBI ontology
• Citation information where the chemical has been cited in a publication

(Screenshot of a ChEBI entry: link to the MarvinView applet, Molfile, additional chemical information, additional chemical structure identifiers, links to the ChEBI ontology (structured controlled vocabulary), manually curated cross-references.)

ChEBI classification
L-lysine (CHEBI:18019)
L-lysinate (CHEBI:32550)
L-lysinium(1+) (CHEBI:32551)
D-lysine (CHEBI:16855)
ChEBI ontology: Janna Hastings, Thursday Nov. 22nd

Searching ChEBI
1. Simple search
Wildcards (* character) are available for both Simple and Advanced search:
• Starting with (Example: aceto*)
• Ending with (Example: *amine)
• Containing (Example: *propyl*)
2. Advanced search
• Structure-based search
• Text-based search
Narrow the search by:
• Specific annotation fields
• Compounds sharing a common ontology
• Compounds with specific chemical properties (formula, mass/charge range)
Example: searching 3-star compounds with GTP in the compound name returns the same compound in different protonation states.

ChEBI download

Submission to ChEBI

ChEBI (Chemical Entities of Biological Interest)
ChEBI statistics (Release 96; October 2012)

MetaCyc compounds
Curation in MetaCyc also covers chemical compounds.
• Scope: compounds involved in reactions, or cofactors
• Compound structures in MetaCyc have been protonated at pH 7.3 to represent a consistent and biologically relevant protonation state.
• Extensive cross-references to other compound resources (ChEBI, KEGG, PubChem)
• Compounds in MetaCyc are organized in an internal class hierarchy. This makes it possible to define groups of compounds, or to navigate and retrieve sets of compounds sharing functional groups or metabolic purposes.
• The full list of MetaCyc compounds is available at http://metacyc.org/META/classinstances?object=Compounds

MetaCyc compounds
Stats of MetaCyc V16.1
MetaCyc compound ontology (http://metacyc.org/META/class-tree?object=Compounds)

MetaCyc compounds
Example MetaCyc compound: N-acetyl-β-D-glucosamine
(Screenshot of a MetaCyc compound page: compound ID, compound ontology, additional chemical data, additional chemical structure identifiers, cross-references, reactions/pathways including the compound.)
The Molfile is not directly available from the public web site.

MetaCyc compounds
Searching MetaCyc compounds
1. Quick search: quick search box (upper right-hand corner of every MetaCyc page)
2. Simple search (compound name/ID)
3. Advanced search

KEGG compounds
• KEGG LIGAND (http://www.genome.jp/kegg/ligand.html) is a composite database that summarizes the knowledge on chemical compounds and reactions stored in KEGG.
• KEGG COMPOUND is the KEGG database of small molecules, biopolymers and other entities relevant to biological systems.
• Compounds in KEGG COMPOUND are the basic building blocks of the chemical reactions stored in KEGG REACTION.
• KEGG COMPOUND stores extensive cross-references to other KEGG databases and external resources, including ChEBI.
• Compounds in KEGG COMPOUND are defined in their fully protonated form.

KEGG compounds
Statistics

KEGG compounds
Example: L-Lysine
(Screenshot of a KEGG COMPOUND entry: compound ID, cross-references to KEGG and external resources, SIMCOMP search, KEGG REACTIONS including the compound, KEGG pathways including the compound, EC numbers (official IUBMB) associated with KEGG reactions including the compound, ChEBI cross-references, cross-references to compound databases.)

KEGG compounds
Example: L-Lysine
Uncharged compound: this chemical entity does not exist (no pH consideration).

KEGG compounds
Searching KEGG COMPOUND
1. Text-based search (compound ID or compound name): http://www.kegg.jp/dbget-bin/www_bfind?compound
2.
SIMCOMP/SUBCOMP search: graph-based methods to compare chemical structures
http://www.genome.jp/tools/simcomp/
http://www.genome.jp/tools/subcomp/
• SIMCOMP (SIMilar COMPounds): atom-atom alignments between compound graphs
• SUBCOMP (SUBstructure matching of COMPounds): common substructure detection in compound graphs
• Input: compound ID or Molfile

KEGG compounds
Example: search compounds similar to L-lysine (C00043) with SIMCOMP
Map KEGG compounds to KEGG pathways of KEGG hierarchies

ChEBI / KEGG / MetaCyc compounds
http://www.unipathway.org/compound

Chemical resources: carbohydrates
http://www.science.co.il/biomedical/Carbohydrate-Databases.asp

Chemical resources: specialized resources
http://www.science.co.il/biomedical/Lipid-Databases.asp
SystemsX.ch: LipidX (http://lipidx.org)

LipidX
• Insert examples… template => SDF

ChEBI-izing LipidX
• Insert examples… template => SDF

Exercises
http://education.expasy.org/cours/StarOmics2012/

Compound structure analysis: Marvin tools
http://www.chemaxon.com
The Marvin suite is a collection of tools developed by ChemAxon (http://www.chemaxon.com/) for drawing, displaying and analyzing 2D/3D structures of chemical compounds, macromolecules and reactions.
Molfiles can be visualized:
• Through the MarvinApplet integrated in ChEBI (limited editing options)
• Through a local installation of the Marvin tools (full editing options)
Calculator plugins ("Calculations" menu) calculate physico-chemical properties of chemical structures:
• Elemental analysis
• Name generator
• Protonation plugins: pKa plugin, Major Microspecies plugin, Isoelectric Point plugin
2D structures generated locally with Marvin (Molfiles) can be used as input to the ChEBI submission tool.

Compound structure analysis: Marvin tools
• Elemental analysis plugin: provides basic molecular values related to the elemental composition of the molecule.
• Naming plugin: computes the IUPAC name or the traditional name of any compound.

Compound structure analysis: Marvin tools
Protonation plugins: pKa (Calculations > Protonation > pKa)
• Calculates the pKa values of all proton-gaining or proton-losing atoms on the basis of the partial charge distribution.
• The chart shows the microspecies distribution curves vs. pH.
• The table displays the relative abundance of the different microspecies across the pH range.

Compound structure analysis: Marvin tools
Protonation plugins: Major Microspecies (Calculations > Protonation > Major Microspecies)
• Determines the major protonation form at a specified pH.
Protonation plugins: Isoelectric Point (Calculations > Protonation > Isoelectric Point)
• Calculates the gross charge distribution of a molecule as a function of pH.
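To give a feeling for what a charge-versus-pH calculation involves, the sketch below estimates the average net charge of L-lysine and its isoelectric point from three pKa values. It is not ChemAxon's algorithm: it treats every ionizable group as independent, the function names are hypothetical, and the pKa values (2.18, 8.95, 10.53) are approximate textbook figures assumed for the example rather than taken from the slides.

```python
def net_charge(pH, acidic_pKas, basic_pKas):
    """Average net charge, treating each ionizable group independently."""
    # Acidic groups carry -1 when deprotonated.
    negative = sum(1 / (1 + 10 ** (pKa - pH)) for pKa in acidic_pKas)
    # Basic groups carry +1 when protonated.
    positive = sum(1 / (1 + 10 ** (pH - pKa)) for pKa in basic_pKas)
    return positive - negative

def isoelectric_point(acidic_pKas, basic_pKas, low=0.0, high=14.0, tolerance=1e-6):
    """Find the pH of zero net charge by bisection (charge falls as pH rises)."""
    while high - low > tolerance:
        middle = (low + high) / 2
        if net_charge(middle, acidic_pKas, basic_pKas) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2

# Assumed approximate textbook pKa values for L-lysine.
LYSINE_ACIDIC = [2.18]          # alpha-carboxyl group
LYSINE_BASIC = [8.95, 10.53]    # alpha-amino group, side-chain amino group

print(f"net charge at pH 7: {net_charge(7.0, LYSINE_ACIDIC, LYSINE_BASIC):+.2f}")
print(f"estimated pI: {isoelectric_point(LYSINE_ACIDIC, LYSINE_BASIC):.2f}")
```

With these inputs the net charge near pH 7 is close to +1, in line with L-lysinium(1+) being the form listed for lysine in the ChEBI classification slide. A dedicated tool predicts the pKa values from the structure itself instead of taking them as input.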