Current trends & hot topics in Chemoinformatics

Traditional areas of application
• Pharmaceutical & life science industry
– particularly in early stage drug design
• Databases of available chemicals
• Electronic publishing
– including searchable chemical structure information in
journals, etc.
• Government and patent databases
The theory so far (1960s to present) …
• How do you represent 2D and 3D chemical structures?
– Not just a pretty picture
• How do you search databases of chemical structures?
– Google doesn’t help (much, but it might do soon…)
• How do you organize large amounts of chemical
information?
• How do you visualize chemical structures & proteins?
• Can computers predict how chemicals are going to
behave
– … in the test tube?
– … in the body?
Current trends & hot topics
• The move of chemical informatics into the
public domain (PubChem, MLI, eScience, open
source)
• Service-oriented architectures
• Packaging & processing large volumes of
complex information for human consumption
• Integration with other –ics (bioinformatics,
genomics, proteomics, systems biology)
What does it mean for the bench
chemist?
• An increasing number of web tools and databases
available which can aid in compound acquisition,
synthesis, and biological profiling
• A trend towards more (and more effective) use of
computers in the lab - not just for email
• A need for most synthetic chemists (and all medicinal
chemists) to be aware of computational techniques
and how they can assist in the compound synthesis and
drug discovery processes
• An opportunity to combine an interest in chemistry
with an interest in computers
Chemoinformatics software vendors
• Accelrys - Large chemoinformatics company
• ACD/Labs - analytical informatics & predictions
• Digital Chemistry - 2D fingerprinting, clustering toolkits &
software
• Cambridgesoft - 2D drawing tools & E-notebooks
• CAS - produces SciFinder Scholar searching software
• ChemAxon - Java based toolkits and software
• Daylight - 2D representation & searching software
• Leadscope - 2D structure and property tools
• Lion Bioscience - produces LeadNavigator
• MDL - Large chemoinformatics company
• Mesa Analytics and Computing - Educational & Statistical
tools
• Openeye - Fast 3D docking, structure generation, toolkits
• Quantum Pharmaceuticals - prediction, docking, screening
• Sage Informatics - ChemTK 2D analysis software
• Tripos - Large chemoinformatics company
Main academic sites
• “Pure” Chemoinformatics
– University of Sheffield, UK (Willett / Gillet)
• http://www.shef.ac.uk/uni/academic/IM/is/research/cirg.html
– Erlangen, Germany (Gasteiger)
• http://www2.chemie.uni-erlangen.de/
– Cambridge Unilever Center
• http://www-ucc.ch.cam.ac.uk/
– Indiana University School of Informatics
• http://www.informatics.indiana.edu/
Main academic sites
• Related (computational chemistry, etc.)
– UCSF (Kuntz)
• http://mdi.ucsf.edu/
– University of Texas (Pearlman)
• http://www.utexas.edu/pharmacy/divisions/pharmaceutics/faculty/pearlman.html
– Yale (Jorgensen)
• http://zarbi.chem.yale.edu/
– University of Michigan (Crippen)
• http://www.umich.edu/~pharmacy/MedChem/faculty/crippen/
“Traditional” Journals
• Journal of Chemical Information & Modeling (formerly JCICS)
– http://pubs.acs.org/journals/jcisd8/index.html
• Journal of Computer-Aided Molecular Design
– http://www.kluweronline.com/issn/0920-654X
• Journal of Molecular Graphics and Modelling
– http://www.elsevier.com/inca/publications/store/5/2/5/0/1/2/
• Journal of Computational Chemistry
– http://www3.interscience.wiley.com/cgi-bin/jhome/33822
• Journal of Chemical Theory and Computation
– http://pubs.acs.org/journals/jctcce/
• Journal of Medicinal Chemistry
– http://pubs.acs.org/journals/jmcmar/
“Informal” publications
• Network Science (online)
– http://www.netsci.org/Science/index.html
• Chemical & Engineering News
– http://pubs.acs.org/cen/
• Drug Discovery Today
– http://www.drugdiscoverytoday.com/
• Scientific Computing World
– http://www.scientific-computing.com/
• Bio-IT World
– http://www.bio-itworld.com/
Yahoo! Chemoinformatics Discussion List
• For
– Job postings
– Ideas exchange
– Questions
– Industry – Student connections
To join, go to http://groups.yahoo.com/group/chemoinf
Or send an email to chemoinf-subscribe@yahoogroups.com
Impacting Industry
Example 1
High-Throughput Screening
Testing perhaps millions of compounds in a
corporate collection to see if any show activity
against a certain disease protein
High-Throughput Screening
• Traditionally, small numbers of compounds were tested
for a particular project or therapeutic area
• About 10 years ago, technology developed that
enabled large numbers of compounds to be assayed
quickly
• High-throughput screening can now test 100,000
compounds a day for activity against a protein target
• Maybe tens of thousands of these compounds will
show some activity for the protein
• The chemist needs to intelligently select the 2-3
classes of compounds that show the most promise as
drugs to follow up
Informatics Implications
• Need to be able to store chemical structure and
biological data for millions of data points
– Computational representation of 2D structure
• Need to be able to organize thousands of active
compounds into meaningful groups
– Group similar structures together and relate to activity
• Need to learn as much information as possible
(data mining)
– Apply statistical methods to the structures and related
information
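The "group similar structures together" step can be sketched as follows. This is only an illustration under stated assumptions: the slides do not name a similarity measure or grouping method, so the Tanimoto coefficient on fingerprint bit sets and a simple greedy grouping are swapped in here, and the compound names and fingerprints are made up.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared features divided by all features present."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_similarity(fingerprints: dict, threshold: float) -> list:
    """Greedy grouping: a compound joins the first group whose first member
    (the 'leader') is at least `threshold` similar, else it starts a new group."""
    groups = []
    for name, fp in fingerprints.items():
        for group in groups:
            if tanimoto(fp, fingerprints[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical fingerprints: each integer stands for one structural feature.
fps = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 5},
    "cmpd_C": {7, 8, 9},
    "cmpd_D": {7, 8, 9, 10},
}
groups = group_by_similarity(fps, threshold=0.5)
print(groups)  # [['cmpd_A', 'cmpd_B'], ['cmpd_C', 'cmpd_D']]
```

Once compounds are grouped, the activity data for each group can be compared to see which structural classes look most promising.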
Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR Navigator), www.tripos.com
Example 2: 3D Visualization & Docking
3D Visualization of interactions between compounds and proteins
“Docking” compounds into proteins computationally
3D Visualization
• X-ray crystallography and NMR Spectroscopy can
reveal 3D structure of protein and bound
compounds
• Visualization of these “complexes” of proteins
and potential drugs can help scientists
understand the mechanism of action of the drug
and to improve the design of a drug
• Visualization uses computational “ball and stick”
model of atoms and bonds, as well as surfaces
• Stereoscopic visualization available
Accelrys Discovery Studio
Docking algorithms
• Require 3D atomic structure for protein, and
3D structure for compound (“ligand”)
• May require initial rough positioning for the
ligand
• Will use an optimization method to try to
find the best rotation and translation of the
ligand in the protein, for optimal binding
affinity
Genetic Algorithms
• Create a “population” of possible solutions,
encoded as “chromosomes”
• Use “fitness function” to score solutions
• Good solutions are combined together
(“crossover”) and altered (“mutation”) to
provide new solutions
• The process repeats until the population
“converges” on a solution
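The loop above can be sketched on a toy docking-style problem. Everything here is an assumption for illustration: the "ligand" is three made-up 2D points, a chromosome is a pose (dx, dy, rotation angle), and the fitness function is a geometric stand-in, not a real binding-affinity score. The sketch also runs for a fixed number of generations instead of testing for convergence, and it is not how GOLD or any other docking program is implemented.

```python
import math
import random

# Toy 2D "ligand" coordinates and the pose we pretend binds best (both made up).
LIGAND = [(0.0, 0.0), (1.5, 0.0), (1.5, 1.2)]
TRUE_POSE = (4.0, -2.0, math.radians(60))  # (dx, dy, rotation angle)


def place(pose, points):
    """Rotate points by pose[2] about the origin, then translate by (pose[0], pose[1])."""
    dx, dy, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


TARGET = place(TRUE_POSE, LIGAND)


def fitness(pose):
    """Higher is better: negative summed squared distance to the target positions."""
    placed = place(pose, LIGAND)
    return -sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(placed, TARGET))


def evolve(pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    # Initial population: random poses ("chromosomes").
    population = [(rng.uniform(-10, 10), rng.uniform(-10, 10),
                   rng.uniform(-math.pi, math.pi)) for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]   # selection: keep the fitter half
        children = [parents[0]]                 # elitism: best pose survives unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]  # crossover
            child[rng.randrange(3)] += rng.gauss(0.0, 0.3)        # mutation
            children.append(tuple(child))
        population = children
    return max(population, key=fitness), history


best_pose, history = evolve()
print("best fitness, first generation:", history[0])
print("best fitness, final population:", fitness(best_pose))
```

Because the best pose is always carried over unchanged, the best score can never get worse from one generation to the next.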
Sample GOLD output
GMP into RNaseT1
Something fun…
Screensaver that docks molecules while your computer is idle at
http://www.grid.org/projects/cancer/
Representing 2D structures with
SMILES
Historical ways of representing
chemicals
• Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc.
Identifies the compound, but gives no (or little)
information about what it consists of
• Chemical formula, e.g. C6H12O6. Specifies the type and
quantity of the atoms in the compound, but not its
structure (i.e. how the atoms are connected by bonds)
• Systematic name, e.g. 1,2-dibromo-3-chloropropane.
Identifies the atoms present and how they are
connected by bonds.
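A short sketch of what a chemical formula does and does not tell you. The helper name is hypothetical and the parser only handles simple formulas (no parentheses, charges or hydrates).

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C6H12O6'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


print(dict(parse_formula("C6H12O6")))  # {'C': 6, 'H': 12, 'O': 6}

# 1-propanol and 2-propanol are different structures with the same formula,
# so the formula alone cannot tell them apart.
print(dict(parse_formula("C3H8O")))    # {'C': 3, 'H': 8, 'O': 1}
```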
Trivial and Systematic Names
[Structure diagram: tyrosine]
Trivial name:
– tyrosine
Systematic names:
– β-(p-hydroxyphenyl)alanine
– α-amino-p-hydroxyhydrocinnamic acid
Historical ways of representing chemicals
2D structure diagram shows atoms
present and how they are connected by
bonds
3D structure diagram shows how atoms are related
to each other in 3D space. Can take a variety of
forms. Accurate models have only really been possible
since X-ray crystallography and computers… but ball and
stick models have been around a long time!
David Wild – Research Overview July 2006. Page 27
Early computer representations
• How do we communicate structural
information between humans and the
computer?
– Line notations, e.g. Wiswesser Line Notation (and
later SMILES)
• How do we represent the atoms and bonds in
a molecule internally in a computer?
– Atom lookup and connection tables
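A minimal sketch of an atom lookup table and a connection table, using 1-propanol as the example. The layout, the function names and the small valence table are assumptions for illustration; real file formats and toolkits store more fields.

```python
# Atom lookup table: atom index -> element symbol (heavy atoms only).
atoms = {1: "C", 2: "C", 3: "C", 4: "O"}

# Connection table: one row per bond as (atom_a, atom_b, bond_order).
bonds = [(1, 2, 1), (2, 3, 1), (3, 4, 1)]

# Usual valences for a few neutral atoms, used to work out implicit hydrogens.
VALENCE = {"C": 4, "N": 3, "O": 2}


def neighbours(atom_index, bonds):
    """Return (neighbour index, bond order) pairs for one atom."""
    result = []
    for a, b, order in bonds:
        if a == atom_index:
            result.append((b, order))
        elif b == atom_index:
            result.append((a, order))
    return result


def implicit_hydrogens(atom_index, atoms, bonds):
    """Hydrogens needed to fill the atom's usual valence."""
    used = sum(order for _, order in neighbours(atom_index, bonds))
    return VALENCE[atoms[atom_index]] - used


for index, symbol in atoms.items():
    print(index, symbol, neighbours(index, bonds),
          "H count:", implicit_hydrogens(index, atoms, bonds))
```

The hydrogen counts come out as 3, 2, 2 and 1, which adds up to the C3H8O formula of 1-propanol.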
Linear notations
• Represent the atoms, bonds and connectivity
of a molecule in a linear text string
• Concise representation
• Originally designed for manual command line
entry into text-only systems
• Now an excellent format for file and database
storage (e.g. can be held in a spreadsheet cell,
on one line of a text file, or in an Oracle
database text field)
Wiswesser Line Notation (obsolete)
[Structure diagram: tyrosine]
• WLN for this structure is QVYZ1R DQ
• Uses text symbolic representation of functional groups, e.g.:
– Q = OH, V = -CO-, Z = -NH2, R = benzene
• Other symbols represent branching, e.g. Y
SMILES
[Structure diagram: tyrosine]
• (one possible) SMILES for this structure is
OC(=O)C(N)CC1=CC=C(O)C=C1
Dave Weininger, Daylight
www.daylight.com
• Can identify any chemical structure
• There can be several ways of writing the same structure in SMILES
(although a system for generating canonical SMILES exists)
SMILES – Atoms & Bonds
• Atoms represented by their chemical symbol
(C, N, S, O, Br, etc). Uppercase for aliphatic,
lowercase for aromatic
• Adjacent atoms implicitly single bonded, or =
for double bond, or # for triple bond
• Hydrogens usually implicit
Propane
CCC
1-Propanol
CCCO
Or OCCC !
Propene
C=CC
Or CC=C !
SMILES – Branching & Rings
• Parentheses represent branching
• Ring enclosures represented by using
numbers to signify attachment points
2-Propanol
CC(O)C
Cyclohexane
C1CCCCC1
Benzene
c1ccccc1
Bromobenzene
c1cc(Br)ccc1
SMILES – Acetaminophen (Tylenol)
Acetaminophen
c1c(O)ccc(NC(=O)C)c1
SMILES – multiple ring structure
Indole
c1ccc2[nH]ccc2c1
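The rules for atoms, bonds, branching and rings can be sketched as a small parser that turns a SMILES string into an atom list and a bond list. This is a teaching sketch under stated assumptions, not the Daylight implementation: it supports only single-digit ring closures, treats anything in square brackets as an opaque atom label, records bonds between aromatic atoms with order 1, and does not check that every ring is closed. The names are hypothetical.

```python
import re

# One token per atom, bond symbol, parenthesis or ring-closure digit.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_smiles(smiles):
    """Return (atoms, bonds): atom labels and (i, j, order) bonds, 0-indexed."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")
    atoms, bonds = [], []
    previous = None       # index of the atom the next atom will bond to
    pending_order = 1     # set by an explicit bond symbol, else single
    branch_stack = []     # atoms to return to when ')' closes a branch
    open_rings = {}       # ring-closure digit -> (atom index, bond order)
    for token in tokens:
        if token in BOND_ORDER:
            pending_order = BOND_ORDER[token]
        elif token == "(":
            branch_stack.append(previous)
        elif token == ")":
            previous = branch_stack.pop()
        elif token.isdigit():
            if token in open_rings:   # second use of the digit closes the ring
                start, order = open_rings.pop(token)
                bonds.append((start, previous, max(order, pending_order)))
            else:                     # first use marks the attachment point
                open_rings[token] = (previous, pending_order)
            pending_order = 1
        else:                         # an atom
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending_order))
            previous = index
            pending_order = 1
    return atoms, bonds


print(parse_smiles("CC(O)C"))    # (['C', 'C', 'O', 'C'], [(0, 1, 1), (1, 2, 1), (1, 3, 1)])
print(parse_smiles("C=CC"))      # (['C', 'C', 'C'], [(0, 1, 2), (1, 2, 1)])
atoms, bonds = parse_smiles("c1ccc2[nH]ccc2c1")
print(len(atoms), len(bonds))    # 9 10
```

For indole, the two ring-closure digits add two bonds to the eight chain bonds, giving the two fused rings.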
Other SMILES notes
• All hydrogen atoms are implicit unless declared
otherwise
• Atoms outside the organic subset (B, C, N, O, P, S, F, Cl, Br, I),
explicit hydrogens and modified atoms need to be placed in
square brackets, e.g. [Pb], [Xe]
• Charged species are indicated by a + or - inside square
brackets, e.g. [Na+], [N+], [O-], [Ca++]
• Unknown atoms can be represented by a * (but watch
out for confusion with SMARTS!)
• Stereochemistry at tetrahedral centers can be indicated
using @ or @@
• “Canonical SMILES” can be created
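The square-bracket rules can be sketched with a regular expression that pulls the symbol, hydrogen count, charge and chirality mark out of a bracket atom. The pattern and function name are assumptions for illustration and cover only the common cases shown on this slide (no atom classes, for example).

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|se|as|[bcnops]|\*)"
    r"(?P<chiral>@@?)?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>\+\d|-\d|\++|-+)?\]"
)


def parse_bracket_atom(text):
    """Split a bracket atom such as '[Ca++]' into its parts."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    h = match["hcount"]
    hydrogens = 0 if h is None else int(h[1:] or 1)   # 'H' means one hydrogen
    c = match["charge"]
    if c is None:
        charge = 0
    elif c[-1].isdigit():
        charge = int(c)                               # '+2' -> 2, '-1' -> -1
    else:
        charge = len(c) if c[0] == "+" else -len(c)   # '++' -> 2, '-' -> -1
    return {"symbol": match["symbol"], "hydrogens": hydrogens,
            "charge": charge, "chirality": match["chiral"]}


for text in ["[Pb]", "[Na+]", "[O-]", "[Ca++]", "[nH]", "[C@@H]"]:
    print(text, parse_bracket_atom(text))
```

Note that SMILES uses the plain ASCII hyphen for a negative charge, so an en dash pasted from a slide will not parse.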
SMILES Homepage
http://www.daylight.com/smiles/
Official Syntax Guide
• Tutorial
• Examples
• Resources
Other Line Notations
[Structure diagram: tyrosine, with numbered atoms]
• ROSDAL - Beilstein
Representation Of Structure Diagram
Arranged Linearly
1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
• Sybyl Line Notation (SLN) - Tripos
OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
Example free online web resources
For more links, see http://www.chemoinf.com/
PubChem
http://pubchem.ncbi.nlm.nih.gov/
MolInspiration Property Calculations
http://www.molinspiration.com/cgi-bin/properties