Introducing Chemoinformatics

BCCE Chemoinformatics Workshop July 2006
Gary Wiggins, David Wild
Indiana University School of Informatics
Indiana University School of
David Wild – Research Overview July 2006. Page 1
Chemoinformatics is …
• Also known as cheminformatics or chemical informatics
• Defined very differently by different people, reflecting its
cross-disciplinary nature
– Librarian
– Chemist (synthetic, medicinal, theoretical)
– Biologist / Bioinformatician
– Molecular modeler
– Pharmaceutical or Chemical Engineer
– Computer Scientist / Informatician
A working definition of chemoinformatics
Chemoinformatics (a.k.a. chemical informatics) is the branch of
informatics dealing with all aspects of the representation and
use of chemical structures, proteins, and related information,
on computer.
… It is an interdisciplinary field that regularly pushes the
boundaries of computer science, statistics, visualization
methods, computing power and scientific technique. The
subject covers a wide variety of applications and specialties,
particularly in the pharmaceutical industry, where the rapid
increase in new technologies in drug discovery puts
chemoinformatics at the forefront of drug design. It is
foundational to such diverse applications as 3D molecular
modeling, artificial-intelligence methods for predicting
biological activity, patent and chemical database searching,
and high-throughput screening data analysis.
More definitions
• Computational Chemistry – The application of mathematical
and computational methods, particularly to theoretical
chemistry
• Molecular Modeling – Using 3D graphics and optimization
techniques to help understand the nature and action of
compounds and proteins
• Computer-Aided Drug Design – The discipline of using
computational techniques (including chemical informatics) to
assist in the discovery and design of drugs.
Chemoinformatics hits on Google
[Figure: line chart of Google hits, Jul-00 to Nov-05 (x-axis), 0 to 400,000 (y-axis). Series: Cheminformatics, Chemoinformatics, Total. Labeled points: July 2000, 723; April 2005, 125,600; Dec 2005, 348,100.]
Number of word occurrences on Google. Taken from http://www.molinspiration.com/chemoinformatics.html
Hits on Chemoinf.com, August 15 – 29, 2005 (sitemeter.com)
Traditional areas of application
• Pharmaceutical & life science industry
– particularly in early stage drug design
• Databases of available chemicals
• Electronic publishing
– including searchable chemical structure information in
journals, etc.
• Government and patent databases
The theory so far (1960s to present) …
• How do you represent 2D and 3D chemical structures?
– Not just a pretty picture
• How do you search databases of chemical structures?
– Google doesn’t help (much, but it might do soon…)
• How do you organize large amounts of chemical
information?
• How do you visualize chemical structures & proteins?
• Can computers predict how chemicals are going to behave
– … in the test tube?
– … in the body?
Current trends & hot topics
• The move of chemical informatics into the public domain
(PubChem, MLI, eScience, open source)
• Service-oriented architectures
• Packaging & processing large volumes of complex
information for human consumption
• Integration with other –ics (bioinformatics, genomics,
proteomics, systems biology)
What does it mean for the bench chemist?
• An increasing number of web tools and databases available
which can aid in compound acquisition, synthesis, and
biological profiling
• A trend towards more (and more effective) use of computers
in the lab - not just for email
• A need for most synthetic chemists (and all medicinal
chemists) to be aware of computational techniques and how
they can assist in the compound synthesis and drug
discovery processes
• An opportunity to combine an interest in chemistry with an
interest in computers
Chemical Informatics Programs at IU
• Graduate Certificate in Chemical Informatics
– I571 Chemical Information Technology
– I572 Computational Chemistry & Molecular Modeling
– I573 Programming for Chemical and Life Science Informatics
– Independent Study in Chemical Informatics
• M.Sc. in Chemical Informatics
• Ph.D. in Informatics (Chemical Informatics Track)
Chemoinformatics software vendors
• Accelrys - Large chemoinformatics company
• ACD/Labs - analytical informatics & predictions
• Digital Chemistry - 2D fingerprinting, clustering toolkits &
software
• CambridgeSoft - 2D drawing tools & E-notebooks
• CAS - produces SciFinder Scholar searching software
• ChemAxon - Java based toolkits and software
• Daylight - 2D representation & searching software
• Leadscope - 2D structure and property tools
• Lion Bioscience - produce LeadNavigator
• MDL - Large chemoinformatics company
• Mesa Analytics and Computing - Educational & Statistical
tools
• OpenEye - Fast 3D docking, structure generation, toolkits
• Quantum Pharmaceuticals - prediction, docking, screening
• Sage Informatics - ChemTK 2D analysis software
• Tripos - Large chemoinformatics company
Main academic sites
• “Pure” Chemoinformatics
– University of Sheffield, UK (Willett / Gillet)
  http://www.shef.ac.uk/uni/academic/I-M/is/research/cirg.html
– Erlangen, Germany (Gasteiger)
  http://www2.chemie.uni-erlangen.de/
– Cambridge Unilever Center
  http://www-ucc.ch.cam.ac.uk/
– Indiana University School of Informatics
  http://www.informatics.indiana.edu/
• Related (computational chemistry, etc.)
– UCSF (Kuntz)
  http://mdi.ucsf.edu/
– University of Texas (Pearlman)
  http://www.utexas.edu/pharmacy/divisions/pharmaceutics/faculty/pearlman.html
– Yale (Jorgensen)
  http://zarbi.chem.yale.edu/
– University of Michigan (Crippen)
  http://www.umich.edu/~pharmacy/MedChem/faculty/crippen/
“Traditional” Journals
• Journal of Chemical Information & Modeling (formerly JCICS)
– http://pubs.acs.org/journals/jcisd8/index.html
• Journal of Computer-Aided Molecular Design
– http://www.kluweronline.com/issn/0920-654X
• Journal of Molecular Graphics and Modeling
– http://www.elsevier.com/inca/publications/store/5/2/5/0/1/2/
• Journal of Computational Chemistry
– http://www3.interscience.wiley.com/cgi-bin/jhome/33822
• Journal of Chemical Theory and Computation
– http://pubs.acs.org/journals/jctcce/
• Journal of Medicinal Chemistry
– http://pubs.acs.org/journals/jmcmar/
“Informal” publications
• Network Science (online)
– http://www.netsci.org/Science/index.html
• Chemical & Engineering News
– http://pubs.acs.org/cen/
• Drug Discovery Today
– http://www.drugdiscoverytoday.com/
• Scientific Computing World
– http://www.scientific-computing.com/
• Bio-IT World
– http://www.bio-itworld.com/
CHMINF-L Distribution List
• Chemical Information Sources Discussion List
• Created by Gary Wiggins at IUB
• http://listserv.indiana.edu/archives/chminf-l.html
Yahoo! Chemoinformatics Discussion List
• For
– Job postings
– Ideas exchange
– Questions
– Industry – Student connections
To join, go to http://groups.yahoo.com/group/chemoinf
Or send an email to chemoinf-subscribe@yahoogroups.com
Impacting Industry
Example 1
High-Throughput Screening
Testing perhaps millions of compounds
in a corporate collection to see if any
show activity against a certain disease
protein
High-Throughput Screening
• Traditionally, small numbers of compounds were tested for a
particular project or therapeutic area
• About 10 years ago, technology developed that enabled
large numbers of compounds to be assayed quickly
• High-throughput screening can now test 100,000
compounds a day for activity against a protein target
• Maybe tens of thousands of these compounds will show
some activity for the protein
• The chemist needs to intelligently select the 2-3 classes of
compounds that show the most promise as drugs to follow up
Informatics Implications
• Need to be able to store chemical structure and biological
data for millions of data points
– Computational representation of 2D structure
• Need to be able to organize thousands of active compounds
into meaningful groups
– Group similar structures together and relate to activity
• Need to learn as much information as possible
(data mining)
– Apply statistical methods to the structures and related information
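One way to picture "group similar structures together and relate to activity" is the rough sketch below. It is not taken from the slides: it assumes each compound has already been reduced to a set of structural features, uses the Tanimoto coefficient as the similarity measure, and uses a simple leader-style grouping. The compound names, feature sets, and activity values are all made up for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: shared features / all distinct features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(compounds, threshold=0.5):
    """Put each compound in the first group whose leader is at least `threshold` similar."""
    clusters = []  # list of (leader feature set, [member names])
    for name, features in compounds.items():
        for leader, members in clusters:
            if tanimoto(features, leader) >= threshold:
                members.append(name)
                break
        else:
            # No existing leader is similar enough, so this compound starts a new group.
            clusters.append((features, [name]))
    return [members for _, members in clusters]


# Hypothetical screening hits: invented feature sets and percent-inhibition values.
hits = {
    "cmpd-1": {"phenol", "amide", "benzene"},
    "cmpd-2": {"phenol", "amide", "benzene", "methyl"},
    "cmpd-3": {"indole", "carboxylic acid"},
    "cmpd-4": {"indole", "carboxylic acid", "bromo"},
}
activity = {"cmpd-1": 82.0, "cmpd-2": 76.0, "cmpd-3": 35.0, "cmpd-4": 41.0}

groups = leader_cluster(hits)
for group in groups:
    mean_activity = sum(activity[name] for name in group) / len(group)
    print(group, mean_activity)
```

Real systems would typically work from computed 2D fingerprints and more careful clustering; the point here is only the shape of the workflow: similarity, grouping, then a per-group activity summary.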
Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR Navigator), www.tripos.com
Example 2: 3D Visualization & Docking
3D Visualization of interactions between compounds and proteins
“Docking” compounds into proteins computationally
3D Visualization
• X-ray crystallography and NMR Spectroscopy can reveal 3D
structure of protein and bound compounds
• Visualization of these “complexes” of proteins and potential
drugs can help scientists understand the mechanism of
action of a drug and improve its design
• Visualization uses computational “ball and stick” models of
atoms and bonds, as well as surfaces
• Stereoscopic visualization available
Accelrys Discovery Studio
Docking algorithms
• Require 3D atomic structure for protein, and 3D structure for
compound (“ligand”)
• May require initial rough positioning for the ligand
• Will use an optimization method to try and find the best
rotation and translation of the ligand in the protein, for
optimal binding affinity
Genetic Algorithms
• Create a “population” of possible solutions, encoded as
“chromosomes”
• Use “fitness function” to score solutions
• Good solutions are combined together (“crossover”) and
altered (“mutation”) to provide new solutions
• The process repeats until the population “converges” on a
solution
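The steps above can be sketched in a few lines of Python. This is a toy illustration, not the algorithm used by any docking program: the "ligand" is three made-up 2D points, the chromosome is a pose (x shift, y shift, rotation angle), and the fitness function is simply the negative squared distance to a known target placement rather than a binding-affinity estimate. For simplicity the loop runs for a fixed number of generations instead of testing for convergence.

```python
import math
import random

# Toy 2D "ligand" (made-up coordinates) and the pose we pretend is the correct one.
LIGAND = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
TRUE_POSE = (3.0, -1.0, math.pi / 3)  # (shift x, shift y, rotation angle)


def place(pose, points=LIGAND):
    """Rotate the points by the pose angle, then translate them."""
    tx, ty, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


SITE = place(TRUE_POSE)  # where the ligand atoms should end up


def fitness(pose):
    """Higher is better: negative summed squared distance to the target positions."""
    return -sum((px - sx) ** 2 + (py - sy) ** 2
                for (px, py), (sx, sy) in zip(place(pose), SITE))


def evolve(generations=200, pop_size=40, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # Initial population: random poses ("chromosomes").
    pop = [(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(-math.pi, math.pi))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # selection: keep the fitter half as parents
        children = [pop[0]]             # elitism: the best pose survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each gene is taken from one of the two parents.
            child = tuple(rng.choice(genes) for genes in zip(a, b))
            # Mutation: a small random change to every gene.
            child = tuple(g + rng.gauss(0, 0.1) for g in child)
            children.append(child)
        pop = children
    return max(pop, key=fitness)


best = evolve()
print("best pose:", best, "fitness:", fitness(best))
```

Because the best pose is always carried over, the best fitness can never get worse from one generation to the next; how close it gets to zero depends on the population size, mutation size, and number of generations chosen here.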
Sample GOLD output
GMP into RNaseT1
Something fun…
Screensaver that docks molecules while your computer is idle at
http://www.grid.org/projects/cancer/
Representing 2D structures with SMILES
Historical ways of representing chemicals
• Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc.
Identifies the compound, but gives no (or little) information
about what it consists of
• Chemical formula, e.g. C6H12O6. Specifies the type and
quantity of the atoms in the compound, but not its structure
(i.e. how the atoms are connected by bonds)
• Systematic name, e.g. 1,2-dibromo-3-chloropropane.
Identifies the atoms present and how they are connected by
bonds.
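A small sketch of what a chemical formula does and does not tell the computer. The parser below is an illustrative helper (not from the slides) and assumes a plain formula with no parentheses, charges, or hydrates. It recovers atom types and counts, and the comparison at the end shows why that is not enough: glucose and fructose share the formula C6H12O6 but are different structures.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses or charges)."""
    counts = Counter()
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


glucose = parse_formula("C6H12O6")
fructose = parse_formula("C6H12O6")  # a different sugar with the same formula
dbcp = parse_formula("C3H5Br2Cl")    # 1,2-dibromo-3-chloropropane

print(dict(glucose))
print(dict(dbcp))
print("same formula, different compounds:", glucose == fructose)
```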
Trivial and Systematic Names
[Structure diagram: tyrosine]
Trivial name:
– tyrosine
Systematic names:
– β-(p-hydroxyphenyl)alanine
– α-amino-p-hydroxyhydrocinnamic acid
Historical ways of representing chemicals
• 2D structure diagram: shows the atoms present and how they
are connected by bonds
• 3D structure diagram: shows how atoms are related to each
other in 3D space. Can take a variety of forms. Accurate
models have only really been possible since X-ray
crystallography and computers… but ball and stick models
have been around a long time!
Early computer representations
• How do we communicate structural information between
humans and the computer?
– Line notations, e.g. Wiswesser Line Notation (and later SMILES)
• How do we represent the atoms and bonds in a molecule
internally in a computer?
– Atom lookup and connection tables
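As a rough illustration of the internal representation, the sketch below stores a molecule as an atom lookup table (index to element symbol) plus a connection table (pairs of atom indices with a bond order). The class and method names are hypothetical, chosen for this example only; real file formats and toolkits carry much more per-atom and per-bond information.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Heavy-atom graph: an atom lookup table plus a connection table."""
    atoms: list = field(default_factory=list)  # index -> element symbol
    bonds: list = field(default_factory=list)  # (atom index, atom index, bond order)

    def add_atom(self, symbol):
        self.atoms.append(symbol)
        return len(self.atoms) - 1  # the new atom's index

    def add_bond(self, i, j, order=1):
        self.bonds.append((i, j, order))

    def neighbors(self, i):
        """Indices of the atoms bonded to atom i."""
        return [b if a == i else a for a, b, _ in self.bonds if i in (a, b)]


# 2-Propanol, CH3-CH(OH)-CH3, with hydrogens left implicit.
mol = Molecule()
c1, c2, o, c3 = (mol.add_atom(s) for s in ("C", "C", "O", "C"))
mol.add_bond(c1, c2)
mol.add_bond(c2, o)
mol.add_bond(c2, c3)

for index, symbol in enumerate(mol.atoms):
    print(index, symbol, mol.neighbors(index))
```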
Linear notations
• Represent the atoms, bonds and connectivity of a molecule
in a linear text string
• Concise representation
• Originally designed for manual command line entry into
text-only systems
• Now an excellent format for file and database storage (e.g.
can be held in a spreadsheet cell, on one line of a text file, or
in an Oracle database text field)
Wiswesser Line Notation (obsolete)
[Structure diagram: tyrosine]
• WLN for this structure is QVYZ1R DQ
• Uses text symbolic representation of functional groups, e.g.:
– Q = OH, V = -CO-, Z = -NH2, R = benzene
• Other symbols represent branching, e.g. Y
SMILES
[Structure diagram: tyrosine]
Dave Weininger, Daylight
www.daylight.com
• (One possible) SMILES for this structure is
OC(=O)C(N)CC1=CC=C(O)C=C1
• Can identify any chemical structure
• There can be several ways of writing the same structure in
SMILES (although a system for generating canonical SMILES
exists)
SMILES – Atoms & Bonds
• Atoms represented by their chemical symbol (C, N, S,
O, Br, etc). Uppercase for aliphatic, lowercase for
aromatic
• Adjacent atoms implicitly single bonded, or = for double
bond, or # for triple bond
• Hydrogens usually implicit
Examples:
– Propane: CCC
– 1-Propanol: CCCO (or OCCC!)
– Propene: C=CC (or CC=C!)
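To make the "implicit hydrogens" and "several ways to write the same molecule" points concrete, here is a minimal sketch that reads an unbranched, ring-free SMILES string and works out its molecular formula. It is an illustration only, not a real SMILES parser: it assumes uppercase (aliphatic) atoms from a small hand-written valence table and rejects anything else.

```python
import re
from collections import Counter

# Usual valences for the few atoms this sketch understands (an assumption of the example).
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "Br": 1, "Cl": 1}
BOND_ORDER = {"=": 2, "#": 3}


def linear_smiles_formula(smiles):
    """Atom counts, including implicit hydrogens, for an unbranched acyclic SMILES."""
    tokens = re.findall(r"Br|Cl|[CNOS]|[=#]", smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    atoms = []   # element symbol of each atom, in order
    used = []    # bond orders already used up by each atom
    pending = 1  # order of the bond to the next atom (single unless = or # was seen)
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        atoms.append(token)
        used.append(0)
        if len(atoms) > 1:  # bond the new atom to the previous one
            used[-1] += pending
            used[-2] += pending
        pending = 1
    counts = Counter(atoms)
    # Whatever valence is left over on each atom is filled with hydrogens.
    counts["H"] = sum(VALENCE[a] - u for a, u in zip(atoms, used))
    return counts


for name, smiles in [("propane", "CCC"), ("1-propanol", "CCCO"),
                     ("1-propanol", "OCCC"), ("propene", "C=CC"), ("propene", "CC=C")]:
    print(name, smiles, dict(linear_smiles_formula(smiles)))
```

Both spellings of 1-propanol (and of propene) give the same atom counts, which is the reason a canonical form is useful when SMILES strings are compared or used as database keys.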
SMILES – Branching & Rings
• Parentheses represent branching
• Ring closures are represented by using numbers to
signify attachment points
Examples:
– 2-Propanol: CC(O)C
– Cyclohexane: C1CCCCC1
– Benzene: c1ccccc1
– Bromobenzene: c1cc(Br)ccc1
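A rough sketch of how branching and ring-closure numbers can be turned into bonds. This is a simplified illustration, not a full SMILES parser: it assumes a handful of unbracketed atoms, single-digit ring closures, and treats every ring-closure bond as a single bond (aromaticity is carried only by the lowercase atom symbols). A stack remembers where each branch started, and a dictionary remembers which atom opened each ring number.

```python
import re


def smiles_to_bonds(smiles):
    """Return (atoms, bonds) for a SMILES using atoms, = and #, parentheses, ring digits."""
    tokens = re.findall(r"Br|Cl|[CNOScnos]|[=#()]|\d", smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    atoms, bonds = [], []
    prev = None        # index of the atom the next atom will be bonded to
    branch_stack = []  # atoms to return to when a branch closes
    open_rings = {}    # ring digit -> index of the atom that opened it
    order = 1          # order of the next bond (single unless = or # was seen)
    for token in tokens:
        if token == "(":
            branch_stack.append(prev)
        elif token == ")":
            prev = branch_stack.pop()
        elif token == "=":
            order = 2
        elif token == "#":
            order = 3
        elif token.isdigit():
            if token in open_rings:  # second use of the digit: close the ring
                bonds.append((open_rings.pop(token), prev, 1))
            else:                    # first use of the digit: remember this atom
                open_rings[token] = prev
        else:                        # an atom symbol
            atoms.append(token)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, order))
            prev = index
            order = 1
    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring")
    return atoms, bonds


for name, smiles in [("2-propanol", "CC(O)C"), ("cyclohexane", "C1CCCCC1"),
                     ("benzene", "c1ccccc1")]:
    atoms, bonds = smiles_to_bonds(smiles)
    print(name, atoms, bonds)
```

For 2-propanol the oxygen and the last carbon both end up bonded to the middle carbon, and for the two rings the final digit adds the bond that joins the last atom back to the first.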
SMILES – Acetaminophen (Tylenol)
Acetaminophen
c1c(O)ccc(NC(=O)C)c1
SMILES – multiple ring structure
Indole
c1ccc2[nH]ccc2c1
Other SMILES notes
• All hydrogen atoms are implicit unless declared otherwise
• Atoms outside the organic subset (B, C, N, O, P, S, F, Cl,
Br, I), explicit hydrogens, and modified atoms need to be
placed in square brackets, e.g. [Pb], [Xe]
• Charged species are indicated by a + or - (and square
brackets), e.g. [Na+], [N+], [O-], [Ca++]
• Unknown atoms can be represented by a * (but watch out for
confusion with SMARTS!)
• Stereochemistry can be indicated using @ and @@
• “Canonical SMILES” can be created
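The bracket notation can be picked apart with a short regular expression. The sketch below is an illustration under simplifying assumptions: it handles an element symbol, an optional hydrogen count, and an optional charge, and it ignores isotopes, stereo marks, and two-letter aromatic symbols. The function name is made up for this example.

```python
import re

# [symbol, optional H count, optional charge], e.g. [Na+], [nH], [Ca++], [NH4+]
BRACKET_ATOM = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?|[a-z])(?P<h>H\d*)?(?P<charge>[+-]\d+|\++|-+)?\]"
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen count, charge) for a simple bracket atom."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported bracket atom for this sketch: {text!r}")
    symbol = match.group("symbol")
    h = match.group("h")
    # "H" alone means one hydrogen; "H4" means four.
    h_count = 0 if h is None else (int(h[1:]) if len(h) > 1 else 1)
    c = match.group("charge")
    if c is None:
        charge = 0
    elif c[-1].isdigit():  # written as a sign plus a number, e.g. "+2"
        charge = int(c)
    else:                  # written as repeated signs, e.g. "++"
        charge = len(c) if c[0] == "+" else -len(c)
    return symbol, h_count, charge


for atom in ["[Pb]", "[Na+]", "[O-]", "[Ca++]", "[nH]", "[NH4+]"]:
    print(atom, parse_bracket_atom(atom))
```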
SMILES Homepage
http://www.daylight.com/smiles/
Official Syntax Guide
• Tutorial
• Examples
• Resources
Other Line Notations
[Structure diagram: tyrosine with atoms numbered 1 to 13]
• ROSDAL - Beilstein
Representation Of Structure Diagram Arranged Linearly
1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
• Sybyl Line Notation (SLN) - Tripos
OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
Example free online web resources
For more links, see http://www.chemoinf.com/
PubChem
http://pubchem.ncbi.nlm.nih.gov/
MolInspiration Property Calculations
http://www.molinspiration.com/cgi-bin/properties