BCCE Chemoinformatics Workshop, July 2006
Introducing Chemoinformatics
Gary Wiggins, David Wild
Indiana University School of Informatics

Indiana University School of David Wild – Research Overview July 2006. Page 1

Chemoinformatics is …
• Also known as cheminformatics or chemical informatics
• Defined very differently by different practitioners, reflecting its cross-disciplinary nature
  – Librarian
  – Chemist (synthetic, medicinal, theoretical)
  – Biologist / Bioinformatician
  – Molecular modeler
  – Pharmaceutical or Chemical Engineer
  – Computer Scientist / Informatician

A working definition of chemoinformatics
Chemoinformatics (a.k.a. chemical informatics) is the branch of informatics dealing with all aspects of the representation and use of chemical structures, proteins, and related information, on computer. … It is an interdisciplinary field that regularly pushes the boundaries of computer science, statistics, visualization methods, computing power and scientific technique. The subject covers a wide variety of applications and specialties, particularly in the pharmaceutical industry, where the rapid increase in new technologies in drug discovery puts chemoinformatics at the forefront of drug design. It is foundational to such diverse applications as 3D molecular modeling, artificial-intelligence methods for predicting biological activity, patent and chemical database searching, and high-throughput screening data analysis.
More definitions
• Computational Chemistry
  – The application of mathematical and computational methods, particularly to theoretical chemistry
• Molecular Modeling
  – Using 3D graphics and optimization techniques to help understand the nature and action of compounds and proteins
• Computer-Aided Drug Design
  – The discipline of using computational techniques (including chemical informatics) to assist in the discovery and design of drugs

Chemoinformatics hits on Google
[Chart: number of word occurrences on Google for "cheminformatics", "chemoinformatics" and their total, plotted from July 2000 onward. Labeled values: July 2000, 723; April 2005, 125,600; Dec 2005, 348,100.]
Taken from http://www.molinspiration.com/chemoinformatics.html

Hits on Chemoinf.com, August 15 – 29, 2005 (sitemeter.com)

Traditional areas of application
• Pharmaceutical & life science industry
  – particularly in early stage drug design
• Databases of available chemicals
• Electronic publishing
  – including searchable chemical structure information in journals, etc.
• Government and patent databases

The theory so far (1960s to present) …
• How do you represent 2D and 3D chemical structures?
  – Not just a pretty picture
• How do you search databases of chemical structures?
  – Google doesn’t help (much, but it might do soon…)
• How do you organize large amounts of chemical information?
• How do you visualize chemical structures & proteins?
• Can computers predict how chemicals are going to behave
  – … in the test tube?
  – … in the body?

Current trends & hot topics
• The move of chemical informatics into the public domain (PubChem, MLI, eScience, open source)
• Service-oriented architectures
• Packaging & processing large volumes of complex information for human consumption
• Integration with other –ics (bioinformatics, genomics, proteomics, systems biology)

What does it mean for the bench chemist?
• An increasing number of web tools and databases available which can aid in compound acquisition, synthesis, and biological profiling
• A trend towards more (and more effective) use of computers in the lab - not just for email
• A need for most synthetic chemists (and all medicinal chemists) to be aware of computational techniques and how they can assist in the compound synthesis and drug discovery processes
• An opportunity to combine an interest in chemistry with an interest in computers

Chemical Informatics Programs at IU
• Graduate Certificate in Chemical Informatics
  – I571 Chemical Information Technology
  – I572 Computational Chemistry & Molecular Modeling
  – I573 Programming for Chemical and Life Science Informatics
  – Independent Study in Chemical Informatics
• M.Sc. in Chemical Informatics
• Ph.D. in Informatics (Chemical Informatics Track)

Chemoinformatics software vendors
• Accelrys - Large chemoinformatics company
• ACD/Labs - analytical informatics & predictions
• Digital Chemistry - 2D fingerprinting, clustering toolkits & software
• CambridgeSoft - 2D drawing tools & E-notebooks
• CAS - produces SciFinder Scholar searching software
• ChemAxon - Java based toolkits and software
• Daylight - 2D representation & searching software
• Leadscope - 2D structure and property tools
• Lion Bioscience - produces LeadNavigator
• MDL - Large chemoinformatics company
• Mesa Analytics and Computing - Educational & Statistical tools
• OpenEye - Fast 3D docking, structure generation, toolkits
• Quantum Pharmaceuticals - prediction, docking, screening
• Sage Informatics - ChemTK 2D analysis software
• Tripos - Large chemoinformatics company

Main academic sites
• “Pure” Chemoinformatics
  – University of Sheffield, UK (Willett / Gillet)
    • http://www.shef.ac.uk/uni/academic/I-M/is/research/cirg.html
  – Erlangen, Germany (Gasteiger)
    • http://www2.chemie.uni-erlangen.de/
  – Cambridge Unilever Center
    • http://www-ucc.ch.cam.ac.uk/
  – Indiana University School of Informatics
    • http://www.informatics.indiana.edu/
• Related (computational chemistry, etc.)
  – UCSF (Kuntz)
    • http://mdi.ucsf.edu/
  – University of Texas (Pearlman)
    • http://www.utexas.edu/pharmacy/divisions/pharmaceutics/faculty/pearlman.html
  – Yale (Jorgensen)
    • http://zarbi.chem.yale.edu/
  – University of Michigan (Crippen)
    • http://www.umich.edu/~pharmacy/MedChem/faculty/crippen/

“Traditional” Journals
• Journal of Chemical Information & Modeling (formerly JCICS)
  – http://pubs.acs.org/journals/jcisd8/index.html
• Journal of Computer-Aided Molecular Design
  – http://www.kluweronline.com/issn/0920-654X
• Journal of Molecular Graphics and Modeling
  – http://www.elsevier.com/inca/publications/store/5/2/5/0/1/2/
• Journal of Computational Chemistry
  – http://www3.interscience.wiley.com/cgi-bin/jhome/33822
• Journal of Chemical Theory and Computation
  – http://pubs.acs.org/journals/jctcce/
• Journal of Medicinal Chemistry
  – http://pubs.acs.org/journals/jmcmar/

“Informal” publications
• Network Science (online)
  – http://www.netsci.org/Science/index.html
• Chemical & Engineering News
  – http://pubs.acs.org/cen/
• Drug Discovery Today
  – http://www.drugdiscoverytoday.com/
• Scientific Computing World
  – http://www.scientific-computing.com/
• Bio-IT World
  – http://www.bio-itworld.com/

CHMINF-L Distribution List
• Chemical Information Sources Discussion List
• Created by Gary Wiggins at IUB
• http://listserv.indiana.edu/archives/chminf-l.html

Yahoo! Chemoinformatics Discussion List
• For
  – Job postings
  – Ideas exchange
  – Questions
  – Industry – Student connections
To join, go to http://groups.yahoo.com/group/chemoinf
Or send an email to chemoinf-subscribe@yahoogroups.com

Impacting Industry

Example 1: High-Throughput Screening
Testing perhaps millions of compounds in a corporate collection to see if any show activity against a certain disease protein

High-Throughput Screening
• Traditionally, small numbers of compounds were tested for a particular project or therapeutic area
• About 10 years ago, technology developed that enabled large numbers of compounds to be assayed quickly
• High-throughput screening can now test 100,000 compounds a day for activity against a protein target
• Maybe tens of thousands of these compounds will show some activity for the protein
• The chemist needs to intelligently select the 2 - 3 classes of compounds that show the most promise as drugs for follow-up

Informatics Implications
• Need to be able to store chemical structure and biological data for millions of data points
  – Computational representation of 2D structure
• Need to be able to organize thousands of active compounds into meaningful groups
  – Group similar structures together and relate to activity
• Need to learn as much information as possible (data mining)
  – Apply statistical methods to the structures and related information

Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR Navigator), www.tripos.com

Example 2: 3D Visualization & Docking
• 3D Visualization of interactions between compounds and proteins
• “Docking” compounds into proteins computationally

3D Visualization
• X-ray crystallography and NMR spectroscopy can reveal the 3D structure of a protein and its bound compounds
• Visualization of these “complexes” of proteins and potential drugs can help scientists understand the drug's mechanism of action and improve its design
• Visualization uses a computational “ball and stick” model of atoms and bonds, as well as surfaces
• Stereoscopic visualization available

Accelrys Discovery Studio

Docking algorithms
• Require a 3D atomic structure for the protein, and a 3D structure for the compound (“ligand”)
• May require initial rough positioning for the ligand
• Will use an optimization method to try to find the best rotation and translation of the ligand in the protein, for optimal binding affinity

Genetic Algorithms
• Create a “population” of possible solutions, encoded as “chromosomes”
• Use a “fitness function” to score solutions
• Good solutions are combined together (“crossover”) and altered (“mutation”) to provide new solutions
• The process repeats until the population “converges” on a solution

Sample GOLD output
GMP into RNaseT1

Something fun…
Screensaver that docks molecules while your computer is idle at http://www.grid.org/projects/cancer/

Representing 2D structures with SMILES

Historical ways of representing chemicals
• Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc.
  Identifies the compound, but gives little or no information about what it consists of
• Chemical formula, e.g. C6H12O6. Specifies the type and quantity of the atoms in the compound, but not its structure (i.e. how the atoms are connected by bonds)
• Systematic name, e.g. 1,2-dibromo-3-chloropropane. Identifies the atoms present and how they are connected by bonds.

Trivial and Systematic Names
[Structure diagram: tyrosine]
Trivial name:
  – tyrosine
Systematic names:
  – β-(p-hydroxyphenyl)alanine
  – α-amino-p-hydroxyhydrocinnamic acid

Historical ways of representing chemicals
• 2D structure diagram: shows atoms present and how they are connected by bonds
• 3D structure diagram: shows how atoms are related to each other in 3D space. Can take a variety of forms. Accurate models only really possible since X-ray crystallography and computers… but ball and stick models have been around a long time!

Early computer representations
• How do we communicate structural information between humans and the computer?
  – Line notations, e.g. Wiswesser Line Notation (and later SMILES)
• How do we represent the atoms and bonds in a molecule internally in a computer?
  – Atom lookup and connection tables

Linear notations
• Represent the atoms, bonds and connectivity of a molecule in a linear text string
• Concise representation
• Originally designed for manual command line entry into text-only systems
• Now an excellent format for file and database storage (e.g. can be held in a spreadsheet cell, on one line of a text file, or in an Oracle database text field)
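
The "atom lookup and connection table" idea from the Early computer representations slide can be sketched in a few lines of Python. This is a minimal illustration, not the format used by any particular product: the names `atoms`, `bonds`, `VALENCE`, `implicit_hydrogens` and `molecular_formula` are made up for this example, the molecule (acetic acid) is chosen only for brevity, and the valence table assumes neutral C, N and O with their usual valences.

```python
from collections import Counter

# Atom lookup table: list position is the atom index, value is the element.
# Only heavy (non-hydrogen) atoms are stored. Example: acetic acid, CH3-C(=O)-OH.
atoms = ["C", "C", "O", "O"]

# Connection table: one (atom index, atom index, bond order) entry per bond.
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

# Assumption for this sketch: neutral atoms with their usual valences.
VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(atoms, bonds):
    """Hydrogens needed on each heavy atom to fill its usual valence."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [VALENCE[symbol] - u for symbol, u in zip(atoms, used)]


def molecular_formula(atoms, bonds):
    """Formula string: carbon, then hydrogen, then other elements A to Z."""
    counts = Counter(atoms)
    hydrogens = sum(implicit_hydrogens(atoms, bonds))
    if hydrogens:
        counts["H"] = hydrogens
    first = [e for e in ("C", "H") if e in counts]
    rest = sorted(e for e in counts if e not in ("C", "H"))
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "")
                   for e in first + rest)


print(implicit_hydrogens(atoms, bonds))
print(molecular_formula(atoms, bonds))
```

The point of the sketch is that the connection table carries strictly more information than the chemical formula: the formula can be derived from the table, but not the other way round.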

Wiswesser Line Notation (obsolete)
[Structure diagram: tyrosine]
• WLN for this structure is QVYZ1R DQ
• Uses symbolic text representations of functional groups, e.g.:
  – Q = OH, V = -CO-, Z = -NH2, R = benzene
• Other symbols represent branching, e.g. Y

SMILES
[Structure diagram: tyrosine]
Dave Weininger, Daylight, www.daylight.com
• (one possible) SMILES for this structure is OC(=O)C(N)CC1=CC=C(O)C=C1
• Can identify any chemical structure
• There can be several ways of writing the same structure in SMILES (although a system for generating canonical SMILES exists)

SMILES – Atoms & Bonds
• Atoms represented by their chemical symbol (C, N, S, O, Br, etc.). Uppercase for aliphatic, lowercase for aromatic
• Adjacent atoms implicitly single bonded, or = for double bond, or # for triple bond
• Hydrogens usually implicit
Examples:
  – Propane: CCC
  – 1-Propanol: CCCO or OCCC
  – Propene: C=CC or CC=C

SMILES – Branching & Rings
• Parentheses represent branching
• Ring closures are represented by matching numbers that mark the attachment points
Example:
  – 2-Propanol: CC(O)C
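
The atom, bond, branching and ring-closure rules above are enough to read simple SMILES strings by machine. The following Python sketch is an illustration only, not Daylight's implementation: `tokenize` and `parse` are hypothetical names, the parser covers just the subset described on these slides (no square-bracket atoms, charges or stereochemistry), it records bonds between aromatic (lowercase) atoms with order 1 rather than as aromatic bonds, and it does not validate unbalanced parentheses or unclosed rings.

```python
import re

# Tokens for the SMILES subset on the slides: organic-subset atoms
# (two-letter symbols first), aromatic atoms, = and # bonds,
# parentheses for branches, and single-digit ring closures.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[=#]|[()]|\d")


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unsupported characters."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")
    return tokens


def parse(smiles):
    """Return (atoms, bonds); each bond is an (index, index, order) tuple."""
    atoms, bonds = [], []
    prev = None        # index of the atom the next atom will bond to
    branch_stack = []  # atoms to return to when a branch closes
    ring_open = {}     # ring digit -> (atom index, bond order at opening)
    order = 1          # order of the next bond; reset after each use
    for tok in tokenize(smiles):
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok == "=":
            order = 2
        elif tok == "#":
            order = 3
        elif tok.isdigit():
            if tok in ring_open:
                # Second occurrence of the digit: close the ring with a bond.
                start, open_order = ring_open.pop(tok)
                bonds.append((start, prev, max(order, open_order)))
            else:
                # First occurrence: remember where the ring starts.
                ring_open[tok] = (prev, order)
            order = 1
        else:
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev = len(atoms) - 1
            order = 1
    return atoms, bonds


print(parse("CC(O)C"))                     # 2-propanol
print(parse("OC(=O)C(N)CC1=CC=C(O)C=C1"))  # tyrosine, as written on the SMILES slide
```

For CC(O)C the parser returns four atoms and three bonds, each of which involves atom 1, the central carbon. CCCO and OCCC give different atom orderings for the same molecule, which is why canonical SMILES is needed when strings are compared.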

SMILES – Branching & Rings
Further examples:
  – Cyclohexane: C1CCCCC1
  – Benzene: c1ccccc1
  – Bromobenzene: c1cc(Br)ccc1

SMILES – Acetaminophen (Tylenol)
Acetaminophen: c1c(O)ccc(NC(=O)C)c1

SMILES – multiple ring structure
Indole: c1ccc2[nH]ccc2c1

Other SMILES notes
• All hydrogen atoms are implicit unless declared otherwise
• Atoms outside the organic subset (e.g. anything other than C, N, S, O, Cl, Br), explicit hydrogens and modified atoms need to be placed in square brackets, e.g. [Pb], [Xe]
• Charged species are indicated by a + or – inside square brackets, e.g. [Na+], [N+], [O-], [Ca++]
• Unknown atoms can be represented by a * (but watch out for confusion with SMARTS!)
• Stereochemistry can be indicated using @ and @@
• “Canonical SMILES” can be created

SMILES Homepage
http://www.daylight.com/smiles/
• Official Syntax Guide
• Tutorial
• Examples
• Resources
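
The square-bracket rules in Other SMILES notes can be made concrete with a small pattern match. The Python sketch below is a simplified illustration: `BRACKET` and `parse_bracket_atom` are hypothetical names, and the pattern deliberately ignores isotopes, the @ / @@ stereo marks, the * wildcard and two-letter aromatic symbols. It splits a bracket atom into its element symbol, attached hydrogen count and formal charge.

```python
import re

# Simplified bracket-atom pattern: element symbol, optional H count,
# optional charge written as +, ++, -, -- or as a sign plus a number.
BRACKET = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<h>H\d*)?"
    r"(?P<charge>\++|-+|[+-]\d+)?\]"
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen count, formal charge) for one bracket atom."""
    match = BRACKET.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")

    h = match.group("h")
    # "H" alone means one hydrogen; "H4" means four.
    hydrogens = 0 if h is None else int(h[1:] or 1)

    c = match.group("charge")
    if c is None:
        charge = 0
    elif c[-1].isdigit():
        charge = int(c)                               # e.g. "+2"
    else:
        charge = len(c) * (1 if c[0] == "+" else -1)  # e.g. "++"

    return match.group("symbol"), hydrogens, charge


for example in ["[Pb]", "[Na+]", "[O-]", "[Ca++]", "[nH]"]:
    print(example, parse_bracket_atom(example))
```

Inside brackets nothing is implicit, which is why the aromatic nitrogen of indole has to be written [nH] to keep its hydrogen.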

Other Line Notations
[Structure diagram: tyrosine with numbered atoms]
• ROSDAL - Beilstein
  Representation Of Structure Diagram Arranged Linearly
  1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
• Sybyl Line Notation (SLN) - Tripos
  OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1

Example free online web resources
For more links, see http://www.chemoinf.com/

PubChem
http://pubchem.ncbi.nlm.nih.gov/

MolInspiration Property Calculations
http://www.molinspiration.com/cgi-bin/properties