ChEBI
Kirill Degtyarenko, EMBL-EBI / EPO

The team
• Rafael Alcántara
• Michael Ashburner *
• Volker Ast *
• Michael Darsow *
• Paula de Matos
• Marcus Ennis
• Janna Hastings
• Alan McNaught *
• Inma Spiteri
• Christoph Steinbeck
• Martin Zbinden *

ChEBI: What is it?
Chemical Entities of Biological Interest: an EBI database/dictionary of ‘biochemical compounds’

What are the ‘biochemical compounds’?
Can be defined as consisting of “molecules not directly encoded by the genome ... that are either the products of nature or are synthetic products used ... to intervene in the processes of living organisms” [Michael Ashburner]

Molecular entity
“Any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity” [IUPAC “Gold Book”]

In fact, ChEBI contains
• Molecular entities: trans-vaccenic acid
• Groups: trans-vaccenoyl group
• Classes: fatty acids

‘Small molecules’? Yes, but big molecules as well!
• alumina
• amylose
• metaborate
• poly(vinyl alcohol)

Current status (17.12.08)
• ChEBI entries: 16,618
• Synonyms: 43,880
• IUPAC names: 14,847
• Registry Numbers: 15,773
• Formulae: 13,163
• Database Links: 9,196
• Structures: 14,274

1-D ChEBI
• Numeric ID
• Carefully checked terminology
• Unambiguous ChEBI name
• IUPAC names
• Cross-references to free resources

Unambiguous ChEBI name
CHEBI:28918 L-adrenaline, not just ‘adrenaline’

Systematic Name (IUPAC)
2-{[3-(trifluoromethyl)phenyl]amino}benzoic acid
[Structure diagram with ring locants]

Common Name
• flufenamic acid (INN English)
• acide flufénamique (INN French)
• ácido flufenámico (INN Spanish)
• acidum flufenamicum (INN Latin)
• Flufenaminsäure (German)
[Structure diagram]

The Unpronounceables
CHEBI:48935 (E)-roxithromycin
[Structure diagram]
IUPAC name: (3R,4S,5S,6R,7R,9R,10E,11S,12R,13S,14R)-4-(2,6-dideoxy-3-C-methyl-3-O-methyl-α-L-ribo-hexopyranosyloxy)-14-ethyl-7,12,13-trihydroxy-10-{[(2-methoxyethoxy)methoxy]imino}-6-[3,4,6-trideoxy-3-(dimethylamino)-β-D-xylo-hexopyranosyloxy]-3,5,7,9,11,13-hexamethyloxacyclotetradecan-2-one

What is the common name of roxithromycin?
• CHEBI:32109 (Z)-roxithromycin
• CHEBI:48935 (E)-roxithromycin, INN: roxithromycin
• CHEBI:48844 roxithromycin
[Structure diagrams: (Z)-roxithromycin and (E)-roxithromycin]

What is thiamine?
• CHEBI:18385 thiamine(1+), aka thiamine
• CHEBI:33283 thiamine(1+) chloride, INN: thiamine
• CHEBI:49105 thiamine(2+) dichloride, aka thiamine chloride hydrochloride, aka thiamine hydrochloride
[Structure diagrams of the three entities]

Need for 2-D
• “Better to see the face than to hear the name” (Zen proverb)
• Structures and identifiers based on structures offer new ways of crosslinking to other databases
• Structure search

Connection table
  ChEBI
  9 10  0  0  0  0  0  0  0  0999 V2000
   11.8219   -7.2713    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.8219   -8.0922    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.6074   -7.0165    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.1072   -6.8574    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.6039   -8.3505    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.1072   -8.5027    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.0886   -7.6818    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.3923   -7.2713    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   10.3888   -8.0922    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  2  5  1  0  0  0  0
  2  6  1  0  0  0  0
  3  7  1  0  0  0  0
  4  8  2  0  0  0  0
  6  9  2  0  0  0  0
  5  7  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
[Structure diagram of the molecule encoded by the connection table]

2-D ChEBI
• One or more 2-D (or 3-D) connection tables
• One is default
• Autogenerated images (PNG)
• Default diagrams should be unambiguous

The Fine Art of chemical drawing

Linear forms of monosaccharides
[Structure diagrams]

Pyranose forms of monosaccharides
[Structure diagrams]

Fused systems
(R)-camphor
[Structure diagrams: one ambiguous, one unambiguous]

Square planar geometry
[Structure diagrams: cisplatin and transplatin]

From 2-D back to 1-D
• SMILES
• InChI

SMILES (1)
• Simplified Molecular Input Line Entry Specification
• Developed by David Weininger in 1988
• Extended by others (e.g.
Daylight)
• String of standard ASCII characters
• A number of valid SMILES can be produced for the same molecule

SMILES (2)
[Structure diagram]
• N1C=NC2=C1C=NC=N2
• c1ncc2ncnc2n1
• C=1N\C=N/C\2=N/C=N\C=1/2
• c1ncnc2/N=C\Nc12
• n1cc2c(nc1)ncn2
• [H]c1nc([H])c2n([H])c([H])nc2n1

InChI (1)
• IUPAC International Chemical Identifier or InChI
• Open source
• Developed by Stein, Heller, Tchekhovskoi and McNaught
• Used by NIST, PubChem, CML… and ChEBI

InChI (2)
[Structure diagram]
InChI=1/C5H4N4/c1-4-5(8-2-6-1)9-3-7-4/h1-3H,(H,6,7,8,9)/f/h7H
InChIKey=KDCGOANMDULRCW-QDQILVOLCG

Limitations (1)
• Stereochemistry other than sp3 tetrahedral and sp2 trigonal planar
• Polymers
• Conformers
• Radicals/different spin state
• Topological isomers
• Mixtures
• Markush structures

Limitations (2)
[Structure diagrams: cisplatin and transplatin]
cisplatin and transplatin share one InChI:
InChI=1/2ClH.2H3N.Pt/h2*1H;2*1H3;/q;;;;+2/p-2

3-D ChEBI
cisplatin
[3-D structure image]

Uncertainty and ambiguity in chemistry
• Compositional uncertainty
• Positional uncertainty
• Configurational uncertainty
• Conformational uncertainty

Compositional uncertainty
Examples
• an alkali metal cation
• vanadate(V) anion
• [2H]ethanol

Positional uncertainty
Examples
• L-bromohistidine residue
• pteroic acid (several tautomers)

Configurational uncertainty
Examples
• androstane
• rel-(2R,3R)-2-amino-3-methylpentanoic acid
• tetradec-11-enoic acid

Conformational uncertainty
Examples
• cyclohexane: chair, boat, twist
• protein secondary structure: [symbols lost], …

ChEBI ontology
• Molecular structure ontology
• Subatomic particle ontology
• Role ontology
  • Biological role
  • Application

L-adrenaline
• Molecular structure ontology: catecholamines
• Biological role: hormone
• Application: antiglaucoma, bronchodilator, cardiostimulant

The family relations
[Diagram of entities related to L-cysteine: L-cystein-S-yl, L-cysteine(•), L-cysteine, cysteine, D-cysteine, L-cysteine zwitterion, L-cysteino, L-cysteinium, L-cysteinyl, L-cysteine residue, L-cysteinate residue, L-cysteinate(1–), L-cysteinate(2–)]

Relationships in ChEBI
• ∆ Is A (generic)
• ⋄ Has Part (generic)
• ♯ Is Conjugate Acid Of (specific)
• ♭ Is Conjugate Base Of (specific)
• Is Enantiomer Of (specific)
• Is Tautomer Of (specific)
• ℛ Is Substituent Group From (specific)
• ℋ Has Parent Hydride (specific)
• ℱ Has Functional Parent (specific)
• Has Role (generic?)

Is A relationship
L-cysteine is a cysteine
[Structure diagrams]

Is Enantiomer Of
L-cysteine is enantiomer of D-cysteine
[Structure diagrams; both entities also point to cysteine with ∆]

Has Part
L-cysteine hydrochloride has part L-cysteinium
L-cysteinium is part of L-cysteine hydrochloride
[Structure diagrams]

Is Conjugate Acid Of
L-cysteine is conjugate acid of L-cysteinate(1–)
[Diagram: ♯ arrows linking L-cysteinium, L-cysteine, L-cysteinate(1–) and L-cysteinate(2–)]

Is Conjugate Base Of
[Diagram: ♭ arrows linking L-cysteinate(2–), L-cysteinate(1–), L-cysteine and L-cysteinium]

Acid/base relationships
[Diagram: the same four entities, with ♯ and ♭ arrows paired between each neighbouring acid and base]

Is Tautomer Of
L-cysteine is tautomer of L-cysteine zwitterion
[Structure diagrams]

Is Tautomer Of
1H-pyrrole, 2H-pyrrole, 3H-pyrrole
[Structure diagrams]

Has Parent Hydride
salutaridinol has parent hydride morphinan
morphinan is parent hydride of salutaridinol
[Structure diagrams]

Has Functional Parent
7-O-acetylsalutaridinol has functional parent salutaridinol
salutaridinol is functional parent of 7-O-acetylsalutaridinol
[Structure diagrams]

Is Substituent Group From
L-cysteino, L-cysteine residue and L-cysteinyl are each linked to L-cysteine by ℛ
[Structure diagrams]

The family relations
[The earlier diagram of entities related to L-cysteine, now with edges labelled by the relationship symbols ∆, ℱ, ♯, ♭ and ℛ]

Ontology of L-cysteine
Ontology of L-cysteine (1)
Ontology of L-cysteine (2)
[Ontology diagrams]

Thank you
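
Backup: relationships as data
The relationship examples from the slides can be written down as subject/relationship/object triples. What follows is a minimal sketch in Python, not ChEBI's actual data model: the names INVERSE, TRIPLES, build_graph and related are hypothetical, and the only triples used are ones stated on the slides. It assumes that "is conjugate acid of" and "is conjugate base of" are inverses of each other and that "is enantiomer of" and "is tautomer of" hold in both directions.

```python
from collections import defaultdict

# Hypothetical helper table: relationships whose reverse direction is implied.
# Conjugate acid/base are inverses; enantiomer and tautomer are symmetric.
INVERSE = {
    "is conjugate acid of": "is conjugate base of",
    "is conjugate base of": "is conjugate acid of",
    "is enantiomer of": "is enantiomer of",
    "is tautomer of": "is tautomer of",
}

# Triples copied from the example slides (subject, relationship, object).
TRIPLES = [
    ("L-cysteine", "is a", "cysteine"),
    ("L-cysteine", "is enantiomer of", "D-cysteine"),
    ("L-cysteine hydrochloride", "has part", "L-cysteinium"),
    ("L-cysteine", "is conjugate acid of", "L-cysteinate(1–)"),
    ("L-cysteine", "is tautomer of", "L-cysteine zwitterion"),
    ("salutaridinol", "has parent hydride", "morphinan"),
    ("7-O-acetylsalutaridinol", "has functional parent", "salutaridinol"),
]


def build_graph(triples):
    """Index triples by subject and add the implied reverse edges."""
    graph = defaultdict(set)
    for subject, relation, obj in triples:
        graph[subject].add((relation, obj))
        inverse = INVERSE.get(relation)
        if inverse is not None:
            graph[obj].add((inverse, subject))
    return graph


def related(graph, subject, relation):
    """Return the sorted objects linked to subject by relation."""
    return sorted(o for r, o in graph.get(subject, ()) if r == relation)


GRAPH = build_graph(TRIPLES)
print(related(GRAPH, "L-cysteinate(1–)", "is conjugate base of"))
print(related(GRAPH, "D-cysteine", "is enantiomer of"))
```

Only the reverse edges listed in INVERSE are derived. "Is a" and "has part" are left one-directional here, so asking what cysteine "is a" returns nothing.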