Computers in Chemistry - University of St Andrews

Computers in Chemistry
Dr John Mitchell
University of St Andrews
1. Why?
• Working with experiment to test our theories.
• Computer uses theory to calculate an answer
that can be compared with experiment.
• If prediction and experiment don’t agree,
something has to give.
Atoms in molecules are not spherical
To Test Our Theories
• The theory that lies beneath chemistry is
ultimately quantum physics.
• To turn this into a prediction of the rate of a
chemical reaction or the frequency of a
transition in an IR spectrum requires a lot of
computation.
• Computation’s ability to make accurate
predictions of experimental measurements is
a good test of the validity of a theory.
• We only understand if we can predict.
Crystal Structure Prediction
• Given the structural diagram of an organic
molecule, predict the 3D crystal structure.
[Figure: structural diagram of an organic molecule; atom labels Br, N, S, O, O]
Slide after SL Price, Int. Sch. Crystallography, Erice, 2004
To Access Data that Experiment can’t
• Computational chemistry also provides a way
of obtaining information that would be very
difficult, expensive or time-consuming to get
experimentally.
• Behaviour at very high temperature or
pressure.
• Details of structure of liquids at atomic scale.
• Dynamics of proteins.
Phase Changes of Iron in the Earth’s Core
Structure of Liquid Water and Water Clusters
Computer simulations are an
important source of evidence, since
atomic scale details of an irregular
structure are hard to obtain by
experiment.
Dynamic Motions of Proteins
X-ray crystallography gives a single static structure
Simulation can show how the protein flexes
2. The Power to Compute
Development of Computer Power
University of Manchester SSEM, 1948
IBM Roadrunner, 2008
Computer Power: Moore’s Law
Computer power doubles roughly every two years: exponential growth
Logarithmic scale
This growth will, eventually, slow down as components
reach atomic scale … we think!
The Size of the Problem
Scaling
• Nonetheless, theoretical chemistry is expensive
• Often cost scales as the fourth power of molecule size
[Figure: Scaling of the Expense of a Typical Quantum Chemical Calculation. Axes: Time (seconds), 0 to 700000, against Atoms in Molecule, 0 to 60.]
Typical scaling is ~N^4. For the foreseeable future, there will
be chemical problems at the limit of our computing power.
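To make the ~N^4 scaling concrete, here is a minimal Python sketch. It assumes the cost follows a pure power law with exponent 4 and that computer power doubles every two years, as on the Moore's Law slides; both are idealisations, and the function names are illustrative only.

```python
import math

def cost_ratio(n_small, n_large, exponent=4):
    """Relative cost of the larger calculation, assuming cost ~ N**exponent."""
    return (n_large / n_small) ** exponent

def years_to_catch_up(ratio, doubling_time_years=2.0):
    """Years of hardware growth needed to absorb a cost increase of `ratio`,
    assuming power doubles every `doubling_time_years` (an idealisation)."""
    return doubling_time_years * math.log2(ratio)

if __name__ == "__main__":
    ratio = cost_ratio(30, 60)  # doubling the number of atoms
    print(ratio, years_to_catch_up(ratio))
```

Under these assumptions, doubling the number of atoms multiplies the cost by 16, which corresponds to four doublings of computer power, or about eight years of hardware growth.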
3. Philosophies of Computational Chemistry
The Two Faces of Computational Chemistry
Informatics
Theoretical Chemistry
Philosophy of Theoretical Chemistry
“The problem is difficult, but by making suitable approximations
we can solve it at reasonable cost based on our understanding of
physics and chemistry.”
Theoretical Chemistry
• Calculations and simulations based on real
physics.
• Calculations are either quantum mechanical
or use numbers derived from quantum
mechanics.
• Attempt to model or simulate reality.
• Usually Low Throughput.
What Kinds of Theoretical Chemistry can
be Done?
(1) Quantum Chemistry
Prof. Eitan Geva
Using quantum mechanics to solve the structures
and energetics of molecules; everything depends on
the distribution of electrons.
Although quantum chemistry involves solving
Schrödinger’s equation, it is not fully exact. There
are some approximations involved.
Wavefunction → Distribution of electrons within the molecule
Distribution of electrons → Physical and chemical behaviour of the molecule
There are two main kinds of quantum chemistry:
• Ab initio
• Density Functional Theory
Ab initio “from first principles”.
• Solve Schrödinger equation to get wavefunction.
• In principle rigorous – we know what we calculate.
• But the standard “Hartree-Fock” method contains
significant approximations.
• Expensive to adjust for these and get more accuracy.
Density Functional Theory
• Makes use of the theorem that all properties of interest
can be determined directly from the electron density.
• True in principle, but the correct “functional” is unknown.
• Less rigorous than ab initio, but usually more accurate for
an equivalent cost (or cheaper for similar accuracy).
What Kinds of Theoretical Chemistry can
be Done?
(2) Molecular Simulation
There are various techniques for simulating
molecules; the most significant is probably
Molecular Dynamics.
Molecular Dynamics makes a “balls-and-springs” model of the molecule in the
computer, and follows its behaviour over time.
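As a hedged illustration of the balls-and-springs idea, the sketch below follows a single ball on one harmonic spring using the velocity Verlet integrator, a common choice in molecular dynamics. The reduced units, parameter values and function names are assumptions made for illustration; real force fields contain many more terms and many more atoms.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Follow one ball over time; returns the positions visited and the final velocity."""
    a = force(x) / mass
    positions = [x]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # move the ball
        a_new = force(x) / mass              # spring force at the new position
        v = v + 0.5 * (a + a_new) * dt       # update velocity with the averaged acceleration
        a = a_new
        positions.append(x)
    return positions, v

def spring_force(x, k=1.0, x_eq=0.0):
    """Hooke's law: the 'spring' pulls the ball back towards x_eq."""
    return -k * (x - x_eq)

if __name__ == "__main__":
    positions, _ = velocity_verlet(1.0, 0.0, spring_force, mass=1.0, dt=0.01, n_steps=1000)
    # For this ideal spring the exact position is cos(t); compare at t = 10.
    print(positions[-1], math.cos(10.0))
```

The same loop structure applies to a whole molecule: only the force function and the number of coordinates change.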
Light-harvesting protein subunit.
Time steps need to be very, very short (~10^-15
seconds), so it takes a million steps to simulate
one nanosecond of real time and a billion steps to
simulate a microsecond.
So it is hard to directly simulate relatively slow or
rare events, such as protein folding.
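The step counts quoted above follow from simple division. A minimal sketch, assuming a time step of exactly one femtosecond (10^-15 s) and using integer femtoseconds to avoid floating-point rounding; the names are illustrative.

```python
STEP_FS = 1                 # assumed time step: 1 fs = 1e-15 s
NANOSECOND_FS = 10**6       # 1 ns expressed in femtoseconds
MICROSECOND_FS = 10**9      # 1 microsecond expressed in femtoseconds
MILLISECOND_FS = 10**12     # 1 ms expressed in femtoseconds

def steps_needed(duration_fs, step_fs=STEP_FS):
    """Number of time steps required to cover `duration_fs` of simulated time."""
    return duration_fs // step_fs

if __name__ == "__main__":
    for label, duration in [("1 ns", NANOSECOND_FS),
                            ("1 us", MICROSECOND_FS),
                            ("1 ms", MILLISECOND_FS)]:
        print(label, steps_needed(duration))
```

By the same arithmetic, an event lasting a millisecond would need a trillion steps.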
Also, a balls-and-springs model lacks the quantum
mechanics needed to simulate a chemical
reaction.
Nonetheless, molecular dynamics is very
important for understanding shape changes,
interactions and energetics of large molecules.
Philosophy of Informatics
“The problem is too difficult to solve at reasonable cost based
on real physics and chemistry, so instead we will build a purely
empirical model to predict the required molecular properties
from chemical structure, using the available data.”
Informatics
• In general, informatics methods represent
phenomena mathematically, but not in a
physics-based way.
• The model relating inputs to outputs is an
empirically parameterised equation or a more
elaborate mathematical model.
• Do not attempt to simulate reality.
• Usually High Throughput.
What is Cheminformatics?
Calculating or predicting molecular properties without
using a physics-based approach. Rather than modelling
how the molecular world really works, cheminformatics
is an empirical discipline, using available data to find
correlations between chemical structure and properties.
Cheminformatics techniques are often used in drug
discovery and pharmaceutical research, and the
requirements of the pharmaceutical industry have
dominated the development of the subject.
Modelling in Chemistry
[Figure: map of modelling methods. Labels on the diagram: PHYSICS-BASED, EMPIRICAL, ATOMISTIC, NON-ATOMISTIC, LOW THROUGHPUT, HIGH THROUGHPUT. Methods shown: ab initio, Density Functional Theory, Car-Parrinello, AM1, PM3 etc., Molecular Dynamics, Monte Carlo, Fluid Dynamics, DPD, Docking, CoMFA, 2-D QSAR/QSPR, Machine Learning.]
4. How Best to Compute Solubility?
Which would you Prefer ... ? [Figure: two pictured alternatives]
Solubility in water (and other biological fluids) is highly desirable for
pharmaceuticals!
Solubility is an important issue in drug
discovery and a major cause of failure of
drug development projects
This is expensive for the industry
A good computational model for predicting the
solubility of druglike molecules would be very
valuable.
Drug Disc. Today, 10 (4), 289 (2005)
Our Methods …
(A) Thermodynamic Cycle (Theoretical
chemistry)
Our Thermodynamic Cycle method …
We want to construct a theoretical model that will
predict solubility for druglike molecules …
We expect our model to use real physics and chemistry
and to give some insight …
We don’t expect it to be fast by informatics
standards, but it should be reasonably accurate …
Can we use theoretical chemistry to calculate solubility
via a thermodynamic cycle?
ΔGsub comes from lattice energy minimisation based on
the experimental crystal structure.
Calculate Energy of Infinite Periodic Lattice
Unit cell
• Take one molecule
• Solve its Schrödinger equation
• Calculate its interactions
• Allow unit cell to change
• Find best size, shape, packing
• Find energy of infinite lattice
This is the same methodology as used in crystal structure prediction.
ΔGsolv comes from a computational solvation
model, RISM
Model of Solvent-Solute Interaction
Model is called RISM (Reference Interaction Site Model)
Calculate energy of interaction between solute and solvent
ΔGsolv comes from model of solvent-solute interaction
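The thermodynamic cycle can be written as a short calculation. This sketch assumes that the free energy of solution is the sum of the sublimation and solvation terms (crystal to gas, then gas to solution), that both terms refer to a 1 mol/L standard state, and that the solution behaves ideally, so that ΔGsol = -RT ln S with S in mol/L. These conventions, the function name and the example energies are assumptions for illustration, not values from the slides.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)

def log10_solubility(dg_sub, dg_solv, temperature=298.15):
    """log10 of solubility (mol/L) from free energies in kJ/mol.

    Assumes dG_sol = dG_sub + dG_solv and dG_sol = -RT ln S
    (1 mol/L standard state, ideal solution).
    """
    dg_sol = dg_sub + dg_solv
    return -dg_sol / (R_KJ * temperature * math.log(10))

if __name__ == "__main__":
    # Hypothetical energies: costly sublimation, favourable solvation.
    print(log10_solubility(dg_sub=60.0, dg_solv=-40.0))
```

Because solubility depends exponentially on the free energy, an error of about 5.7 kJ/mol in the summed free energy shifts the predicted solubility by a factor of ten at room temperature under these assumptions.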
Theoretical Chemistry: Solubility Results
These results are OK, but we would hope to do better
Our Methods …
(B) Random Forest (informatics)
Our Random Forest Model …
We want to construct a model that will predict
solubility for druglike molecules …
We don’t expect our model either to use real physics
and chemistry or to be easily interpretable …
We do expect it to be fast and reasonably accurate …
Random Forest
A Machine Learning Method
This is a decision tree.
We use lots of them to make a forest!
Generate more trees randomly.
(1) By randomly sampling with replacement to make different “bootstrap
samples” of the data for each tree.
(2) By randomly choosing the pool of questions to ask of the data for each
node (junction) of each tree.
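The two sources of randomness, (1) and (2), can be sketched with the Python standard library. This is a minimal illustration, not the implementation used for the results shown later; the function names, the toy data and the pool size are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """(1) Draw len(rows) items with replacement; some rows repeat, some are left out."""
    return rng.choices(rows, k=len(rows))

def random_question_pool(n_descriptors, pool_size, rng):
    """(2) Pick which descriptors a node is allowed to ask about (no repeats)."""
    return rng.sample(range(n_descriptors), pool_size)

if __name__ == "__main__":
    rng = random.Random(42)            # fixed seed so the run is repeatable
    molecules = list(range(10))        # stand-ins for ten training molecules
    print(bootstrap_sample(molecules, rng))
    print(random_question_pool(n_descriptors=20, pool_size=5, rng=rng))
```

Each tree gets its own bootstrap sample, and each node within a tree gets its own question pool, which is what makes the trees differ from one another.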
● Machine Learning method introduced by Breiman and Cutler (2001)
● Development of Decision Trees (Recursive Partitioning):
   ● Dataset is partitioned into successively smaller subsets
   ● Each partition is based upon the value of one descriptor
   ● The descriptor used at each split is selected so as to optimise splitting
● Bootstrap sample of N objects chosen from the N available objects with replacement
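One step of recursive partitioning can be sketched as follows. The sketch assumes a regression setting in which the "best" split is the one that minimises the summed squared error of the two resulting subsets; that criterion, the function names and the toy data are assumptions for illustration.

```python
def sum_squared_error(values):
    """Spread of a group of values around its own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y, descriptor_indices):
    """Return (score, descriptor index, threshold) for the best single-descriptor split."""
    best = None
    for j in descriptor_indices:
        # Skip the largest value so that both sides of the split are non-empty.
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            score = sum_squared_error(left) + sum_squared_error(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best

if __name__ == "__main__":
    X = [[1, 5], [2, 3], [3, 8], [4, 1]]   # two toy descriptors per molecule
    y = [0, 0, 10, 10]                     # toy property values
    print(best_split(X, y, descriptor_indices=[0, 1]))
```

A full tree applies the same search again to each of the two subsets until the subsets are small enough to become leaves.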
Random Forest for Solubility Prediction
A Forest of Regression Trees
Each leaf contains a group of molecules with similar solubility.
• The molecules whose solubility is to be
predicted are run through every tree (~ flow
chart) in the forest.
• Each tree predicts a solubility for each
molecule.
• We average the predictions over hundreds of
different trees.
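The averaging step above can be sketched in a few lines. To keep the example short, each "tree" here is a hypothetical one-question stump stored as a tuple; the stump values are made up and stand in for the solubilities held in real leaves.

```python
def stump_predict(stump, descriptors):
    """A one-question tree: (descriptor index, threshold, value if <=, value if >)."""
    index, threshold, low_value, high_value = stump
    return low_value if descriptors[index] <= threshold else high_value

def forest_predict(forest, descriptors):
    """Run the molecule through every tree and average the predictions."""
    predictions = [stump_predict(stump, descriptors) for stump in forest]
    return sum(predictions) / len(predictions)

if __name__ == "__main__":
    # Three hypothetical stumps; a real forest would hold hundreds of deeper trees.
    toy_forest = [(0, 2.0, -1.0, -4.0), (1, 0.5, -2.0, -3.0), (0, 3.0, -1.5, -5.0)]
    molecule = [2.5, 0.4]   # two made-up descriptor values
    print(forest_predict(toy_forest, molecule))
```

Averaging over many trees smooths out the errors of individual trees, each of which was built from a different bootstrap sample.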
Random Forest: Solubility Results
Training (tr): RMSE = 0.27, r^2 = 0.98, Bias = 0.005
Out-of-bag (oob): RMSE = 0.68, r^2 = 0.90, Bias = 0.01
Test (te): RMSE = 0.69, r^2 = 0.89, Bias = -0.04
DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
These results are competitive with the best solubility prediction methods
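For readers unfamiliar with the three statistics quoted above, here is a minimal sketch of how such quantities are commonly computed. It assumes that r2 is the squared Pearson correlation between observed and predicted values and that bias is the mean signed error (predicted minus observed); the cited paper's exact definitions may differ, and the toy numbers are made up.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

def bias(observed, predicted):
    """Mean signed error, taken here as predicted minus observed (an assumed convention)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

if __name__ == "__main__":
    observed = [-1.0, -2.0, -3.0, -4.0]    # made-up values
    predicted = [-1.2, -1.9, -3.4, -3.8]   # made-up predictions
    print(rmse(observed, predicted), r_squared(observed, predicted), bias(observed, predicted))
```

Comparing the training figures with the out-of-bag and test figures shows how much a model's performance drops on molecules it was not fitted to.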
What Have we Learned?
• For this particular problem, informatics does a
bit better than pure theoretical chemistry.
How to Utilise Informatics
• Fast informatics models can be integrated into
drug discovery to compute solubilities for
molecules before deciding whether to
synthesise them.
• This saves much time and money otherwise
spent making useless compounds.
Fits into drug discovery
pipeline here
Why Pursue Theory?
• Theory promises to give a greater
understanding of why some molecules are
more soluble than others.
• Advances in theory can be transferable to
other contexts.
• Theoretical models can be systematically
improved.