Current trends & hot topics in Chemoinformatics

Traditional areas of application
• Pharmaceutical & life science industry
  – particularly in early-stage drug design
• Databases of available chemicals
• Electronic publishing
  – including searchable chemical structure information in journals, etc.
• Government and patent databases

The theory so far (1960s to present) …
• How do you represent 2D and 3D chemical structures?
  – Not just a pretty picture
• How do you search databases of chemical structures?
  – Google doesn’t help (much, but it might do soon…)
• How do you organize large amounts of chemical information?
• How do you visualize chemical structures & proteins?
• Can computers predict how chemicals are going to behave
  – … in the test tube?
  – … in the body?

Current trends & hot topics
• The move of chemical informatics into the public domain (PubChem, MLI, eScience, open source)
• Service-oriented architectures
• Packaging & processing large volumes of complex information for human consumption
• Integration with other “-ics” (bioinformatics, genomics, proteomics, systems biology)

What does it mean for the bench chemist?
• An increasing number of web tools and databases that can aid in compound acquisition, synthesis, and biological profiling
• A trend towards more (and more effective) use of computers in the lab - not just for email
• A need for most synthetic chemists (and all medicinal chemists) to be aware of computational techniques and how they can assist in compound synthesis and drug discovery
• An opportunity to combine an interest in chemistry with an interest in computers

Chemoinformatics software vendors
• Accelrys - large chemoinformatics company
• ACD/Labs - analytical informatics & predictions
• Digital Chemistry - 2D fingerprinting, clustering toolkits & software
• CambridgeSoft - 2D drawing tools & e-notebooks
• CAS - produces SciFinder Scholar searching software
• ChemAxon - Java-based toolkits and software
• Daylight - 2D representation & searching software
• Leadscope - 2D structure and property tools
• Lion Bioscience - produces LeadNavigator
• MDL - large chemoinformatics company
• Mesa Analytics and Computing - educational & statistical tools
• OpenEye - fast 3D docking, structure generation, toolkits
• Quantum Pharmaceuticals - prediction, docking, screening
• Sage Informatics - ChemTK 2D analysis software
• Tripos - large chemoinformatics company

Main academic sites
• “Pure” Chemoinformatics
  – University of Sheffield, UK (Willett / Gillet)
    • http://www.shef.ac.uk/uni/academic/IM/is/research/cirg.html
  – Erlangen, Germany (Gasteiger)
    • http://www2.chemie.uni-erlangen.de/
  – Cambridge Unilever Center
    • http://www-ucc.ch.cam.ac.uk/
  – Indiana University School of Informatics
    • http://www.informatics.indiana.edu/

Main academic sites
• Related (computational chemistry, etc.)
  – UCSF (Kuntz)
    • http://mdi.ucsf.edu/
  – University of Texas (Pearlman)
    • http://www.utexas.edu/pharmacy/divisions/pharmaceutics/faculty/pearlman.html
  – Yale (Jorgensen)
    • http://zarbi.chem.yale.edu/
  – University of Michigan (Crippen)
    • http://www.umich.edu/~pharmacy/MedChem/faculty/crippen/

“Traditional” Journals
• Journal of Chemical Information & Modeling (formerly JCICS)
  – http://pubs.acs.org/journals/jcisd8/index.html
• Journal of Computer-Aided Molecular Design
  – http://www.kluweronline.com/issn/0920-654X
• Journal of Molecular Graphics and Modelling
  – http://www.elsevier.com/inca/publications/store/5/2/5/0/1/2/
• Journal of Computational Chemistry
  – http://www3.interscience.wiley.com/cgi-bin/jhome/33822
• Journal of Chemical Theory and Computation
  – http://pubs.acs.org/journals/jctcce/
• Journal of Medicinal Chemistry
  – http://pubs.acs.org/journals/jmcmar/

“Informal” publications
• Network Science (online)
  – http://www.netsci.org/Science/index.html
• Chemical & Engineering News
  – http://pubs.acs.org/cen/
• Drug Discovery Today
  – http://www.drugdiscoverytoday.com/
• Scientific Computing World
  – http://www.scientific-computing.com/
• Bio-IT World
  – http://www.bio-itworld.com/

Yahoo! Chemoinformatics Discussion List
• For
  – Job postings
  – Ideas exchange
  – Questions
  – Industry-student connections
To join, go to http://groups.yahoo.com/group/chemoinf
Or send an email to chemoinf-subscribe@yahoogroups.com

Impacting Industry

Example 1: High-Throughput Screening
Testing perhaps millions of compounds in a corporate collection to see if any show activity against a certain disease protein

High-Throughput Screening
• Traditionally, small numbers of compounds were tested for a particular project or therapeutic area
• About 10 years ago, technology developed that enabled large numbers of compounds to be assayed quickly
• High-throughput screening can now test 100,000 compounds a day for activity against a protein target
• Maybe tens of thousands of these compounds will show some activity against the protein
• The chemist needs to intelligently select the 2-3 classes of compounds that show the most promise as drugs for follow-up

Informatics Implications
• Need to be able to store chemical structure and biological data for millions of data points
  – Computational representation of 2D structure
• Need to be able to organize thousands of active compounds into meaningful groups
  – Group similar structures together and relate to activity
• Need to learn as much information as possible (data mining)
  – Apply statistical methods to the structures and related information

Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR Navigator), www.tripos.com

Example 2: 3D Visualization & Docking
• 3D visualization of interactions between compounds and proteins
• “Docking” compounds into proteins computationally

3D Visualization
• X-ray crystallography and NMR spectroscopy can reveal the 3D structure of a protein and its bound compounds
• Visualization of these “complexes” of proteins and potential drugs can help scientists understand the drug's mechanism of action and improve its design
• Visualization uses a computational “ball and stick” model of atoms and bonds, as well as surfaces
• Stereoscopic visualization available
Accelrys Discovery Studio

Docking algorithms
• Require a 3D atomic structure for the protein, and a 3D structure for the compound (“ligand”)
• May require initial rough positioning of the ligand
• Use an optimization method to try to find the rotation and translation of the ligand in the protein that gives optimal binding affinity

Genetic Algorithms
• Create a “population” of possible solutions, encoded as “chromosomes”
• Use a “fitness function” to score solutions
• Good solutions are combined together (“crossover”) and altered (“mutation”) to provide new solutions
• The process repeats until the population “converges” on a solution

Sample GOLD output: GMP docked into RNase T1

Something fun…
Screensaver that docks molecules while your computer is idle: http://www.grid.org/projects/cancer/

Representing 2D structures with SMILES

Historical ways of representing chemicals
• Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc. Identifies the compound, but gives little or no information about what it consists of
• Chemical formula, e.g. C6H12O6. Specifies the type and quantity of the atoms in the compound, but not its structure (i.e. how the atoms are connected by bonds)
• Systematic name, e.g. 1,2-dibromo-3-chloropropane. Identifies the atoms present and how they are connected by bonds

Trivial and Systematic Names
[Structure diagram: tyrosine]
Trivial name:
  – tyrosine
Systematic names:
  – β-(p-hydroxyphenyl)alanine
  – α-amino-p-hydroxyhydrocinnamic acid

Historical ways of representing chemicals
• 2D structure diagram: shows the atoms present and how they are connected by bonds
• 3D structure diagram: shows how atoms are related to each other in 3D space. Can take a variety of forms. Accurate models have only really been possible since X-ray crystallography and computers… but ball and stick models have been around a long time!

David Wild – Research Overview July 2006.
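The genetic algorithm loop described in the docking slides above (population, fitness function, crossover, mutation, repeat) can be sketched in a few lines of Python. This is only an illustrative toy, not how GOLD or any other docking program is implemented: the `TARGET_POSE` values, the squared-distance `fitness`, and all function names are assumptions made up for the example, and the loop runs for a fixed number of generations rather than testing for convergence.

```python
import random

# Hypothetical "ideal pose": x, y, z translation plus one rotation angle.
# A real docking program does not know the answer in advance; it scores
# each pose with an estimate of binding affinity instead.
TARGET_POSE = [1.5, -2.0, 0.5, 90.0]


def fitness(chromosome):
    """Toy fitness: higher is better, 0.0 means a perfect match to TARGET_POSE."""
    return -sum((gene - goal) ** 2 for gene, goal in zip(chromosome, TARGET_POSE))


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(chromosome, rng, rate=0.2, step=1.0):
    """Perturb each gene with probability `rate` by Gaussian noise."""
    return [g + rng.gauss(0.0, step) if rng.random() < rate else g for g in chromosome]


def run_ga(pop_size=40, generations=60, seed=0):
    """Return the best chromosome and the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(0, 360)]
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]  # the fitter half is kept unchanged
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


if __name__ == "__main__":
    best, history = run_ga()
    print("best pose found:", [round(g, 2) for g in best])
    print("best fitness, first and last generation:", history[0], history[-1])
```

Because the fitter half of each generation survives unchanged, the best score can never get worse from one generation to the next; real programs typically use more elaborate selection and stopping rules.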
Early computer representations
• How do we communicate structural information between humans and the computer?
  – Line notations, e.g. Wiswesser Line Notation (and later SMILES)
• How do we represent the atoms and bonds in a molecule internally in a computer?
  – Atom lookup and connection tables

Linear notations
• Represent the atoms, bonds and connectivity of a molecule in a linear text string
• Concise representation
• Originally designed for manual command-line entry into text-only systems
• Now an excellent format for file and database storage (e.g. can be held in a spreadsheet cell, on one line of a text file, or in an Oracle database text field)

Wiswesser Line Notation (obsolete)
[Structure diagram: tyrosine]
• WLN for this structure is QVYZ1R DQ
• Uses symbolic text representations of functional groups, e.g.:
  – Q = OH, V = -CO-, Z = -NH2, R = benzene
• Other symbols represent branching, e.g. Y

SMILES
[Structure diagram: tyrosine]
• (one possible) SMILES for this structure is OC(=O)C(N)CC1=CC=C(O)C=C1
• Dave Weininger, Daylight (www.daylight.com)
• Can identify any chemical structure
• There can be several ways of writing the same structure in SMILES (although a system for generating canonical SMILES exists)

SMILES – Atoms & Bonds
• Atoms represented by their chemical symbol (C, N, S, O, Br, etc.). Uppercase for aliphatic, lowercase for aromatic
• Adjacent atoms implicitly single bonded, or = for double bond, or # for triple bond
• Hydrogens usually implicit
• Examples:
  – Propane: CCC
  – 1-Propanol: CCCO (or OCCC!)
  – Propene: C=CC (or CC=C!)

SMILES – Branching & Rings
• Parentheses represent branching
• Ring closures represented by using matching numbers to signify attachment points
• Examples:
  – 2-Propanol: CC(O)C
  – Cyclohexane: C1CCCCC1
  – Benzene: c1ccccc1
  – Bromobenzene: c1cc(Br)ccc1

SMILES – Acetaminophen (Tylenol)
Acetaminophen: c1c(O)ccc(NC(=O)C)c1

SMILES – multiple ring structure
Indole: c1ccc2[nH]ccc2c1

Other SMILES notes
• All hydrogen atoms are implicit unless declared otherwise
• Atoms outside the organic subset (B, C, N, O, P, S, F, Cl, Br, I), explicit hydrogens and modified atoms need to be placed in square brackets, e.g. [Pb], [Xe]
• Charged species indicated by a + or – (and square brackets), e.g. [Na+], [N+], [O-], [Ca++]
• Unknown atoms can be represented by a * (but watch out for confusion with SMARTS!)
• Tetrahedral stereochemistry can be indicated using @ and @@
• “Canonical SMILES” can be created

SMILES Homepage
http://www.daylight.com/smiles/
• Official Syntax Guide
• Tutorial
• Examples
• Resources

Other Line Notations
[Structure diagram: tyrosine with numbered atoms]
• ROSDAL - Beilstein: Representation Of Structure Diagram Arranged Linearly
  1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
• Sybyl Line Notation (SLN) - Tripos
  OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1

Example free online web resources
For more links, see http://www.chemoinf.com/
• PubChem: http://pubchem.ncbi.nlm.nih.gov/
• MolInspiration Property Calculations: http://www.molinspiration.com/cgi-bin/properties
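
To make the SMILES rules from the slides above concrete (adjacent atoms bonded, = and # for multiple bonds, parentheses for branches, matching digits for ring closures), the sketch below turns a SMILES string into the kind of atom lookup list and connection table mentioned under early computer representations. It is a minimal illustration under stated assumptions: `parse_smiles`, `TOKEN` and the other names are invented for this example, only a subset of SMILES is handled, stereo marks are treated as plain bonds, and valence and aromaticity are not checked. A real project would use an established cheminformatics toolkit instead.

```python
import re

# Tokens for a small SMILES subset: bracket atoms, organic-subset atoms,
# aromatic atoms, bond symbols, branches, ring-closure labels, and "."
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#:/\\]|[()]|%\d\d|\d|\.")
BOND_SYMBOLS = set("-=#:/\\")


def is_aromatic(symbol):
    """Aromatic atoms are written in lowercase, e.g. c or [nH]."""
    return symbol.strip("[]").lstrip("0123456789")[:1].islower()


def parse_smiles(smiles):
    """Return (atoms, bonds): an atom lookup list and a connection table."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    atoms, bonds = [], []
    prev = None          # index of the atom the next atom attaches to
    branch_stack = []    # atoms to return to when a ")" closes a branch
    open_rings = {}      # ring label -> (atom index, explicit bond or None)
    pending = None       # explicit bond symbol waiting for its second atom

    def add_bond(i, j, explicit):
        if explicit in (None, "/", "\\"):   # stereo marks are treated as plain bonds
            explicit = ":" if is_aromatic(atoms[i]) and is_aromatic(atoms[j]) else "-"
        bonds.append((i, j, explicit))

    for tok in tokens:
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            if not branch_stack:
                raise ValueError("unmatched ')'")
            prev = branch_stack.pop()
        elif tok in BOND_SYMBOLS:
            pending = tok
        elif tok == ".":
            prev = None                     # start a disconnected fragment
        elif tok[0] in "%0123456789":
            if prev is None:
                raise ValueError("ring label before any atom")
            label = tok.lstrip("%")
            if label in open_rings:         # second occurrence closes the ring
                start, explicit = open_rings.pop(label)
                add_bond(start, prev, explicit or pending)
            else:                           # first occurrence opens it
                open_rings[label] = (prev, pending)
            pending = None
        else:                               # an atom token
            atoms.append(tok)
            if prev is not None:
                add_bond(prev, len(atoms) - 1, pending)
            prev, pending = len(atoms) - 1, None

    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring")
    return atoms, bonds


if __name__ == "__main__":
    examples = [("2-propanol", "CC(O)C"), ("cyclohexane", "C1CCCCC1"),
                ("benzene", "c1ccccc1"), ("indole", "c1ccc2[nH]ccc2c1")]
    for name, smi in examples:
        atoms, bonds = parse_smiles(smi)
        print(name, smi, len(atoms), "atoms,", len(bonds), "bonds")
```

Each bond is stored as a tuple of two atom indices and a bond symbol ("-", "=", "#", or ":" for aromatic). Note that CCCO and OCCC produce the same atoms and the same number of bonds but with different atom numbering, which is why canonical SMILES is needed before strings can be compared directly.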