The Genetic Algorithm in GOLD

Molecular Docking Using GOLD
Tommi Suvitaival
Seppo Virtanen
S-114.2500
Basics for Biosystems of the Cell
Fall 2006
Table of Contents
Introduction
Software and Virtual Screening
Fitness Functions
GoldScore
ChemScore
Combined Fitness Functions
Genetic Algorithm
Overview
The Genetic Algorithm in GOLD
Algorithm Efficiency
References
Introduction
Computational molecular docking is a research technique for predicting whether one
molecule will bind to another, usually a protein.
A ligand is a molecule that binds to a macromolecule and is small compared to it. In
biochemistry, the macromolecule is usually a protein. When binding to a protein, the
ligand changes the conformation of the larger molecule, thereby affecting how the protein
functions. In cell biology, the ligand is usually a signal molecule. For example, a cell can
obtain information from its surroundings via receptors floating in its lipid bilayer and
then adjust its physiology accordingly. In protein-ligand docking the goal is to predict the
position and orientation of a ligand when it is bound to a protein receptor or enzyme.
The starting point is that the structures of the protein and the ligand are known. Such
information can be obtained by structure-determination methods such as X-ray
crystallography.
Because macromolecules have a huge number of degrees of freedom, the number of possible
conformations is far too large for all of them to be compared, so the problem must be
restricted. The need for computational power can be reduced by simplifying the model.
The active site of the protein-ligand interaction must be modeled as precisely as possible,
but more distant regions of the macromolecule can be modeled less precisely because
their interaction with the active region is much weaker.
Software and Virtual Screening
A few programs are dedicated to predicting protein-ligand docking. The most
important property of such software is the ability to reproduce the experimental
binding modes of ligands found by crystallographic methods. To test a program, a ligand
whose bound structure has been determined is taken out of the protein-ligand complex.
The algorithm then searches for the best conformations for docking the two molecules,
and the computed result is compared with the experimental conformation by
calculating the root-mean-square deviation.
Root-mean-square error (RMSE) is a measure of the effective difference
between expected and measured values. In this case, the experimentally determined atom
positions are taken as the expected values. The value of RMSE can be calculated from the equation
$$\mathrm{RMSE}(X) = \sqrt{E\left((X - \mu)^2\right)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert x_i - \mu_i \rVert^2},$$
where $x_i$ is the predicted position of atom $i$, $\mu_i$ its experimental position, and $n$
the number of points (atoms). A docking is usually considered successful when RMSE is
below 2.0 Å.
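The comparison between a docked pose and the experimental pose can be sketched as follows. This is a minimal illustration that assumes both poses list the same atoms in the same order and share one coordinate frame; the function name and the three-atom coordinates are hypothetical.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally ordered lists of
    (x, y, z) atom positions, in the same length unit as the input."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(predicted))

# Hypothetical three-atom ligand (coordinates in Å): every atom is off by 1 Å in z.
docked = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
crystal = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]
is_success = rmsd(docked, crystal) < 2.0  # the 2.0 Å criterion from the text
```

Real docking evaluations usually also account for symmetry-equivalent atoms, which this sketch ignores.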
Docking attracts the greatest interest in the life sciences, so computational
methods for drug-like ligands are at the center of the field. Docking software is
usually also tested with molecules that have the features of a typical drug.
Lipinski’s Rule of Five gives a rule of thumb for such features.
Generally, an orally active drug has a molecular weight under 500 u, at most
5 hydrogen bond donors and 10 hydrogen bond acceptors, and an octanol-water
partition coefficient of log P < 5.
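As a small sketch, the rule of thumb can be written as a filter over precomputed molecular descriptors. The function name and the descriptor values below are hypothetical, and the strict all-criteria reading used here is an assumption; the rule is often applied more leniently in practice.

```python
def passes_rule_of_five(molecular_weight, h_bond_donors, h_bond_acceptors, log_p):
    """Return True if all four drug-likeness criteria from the text hold."""
    return (
        molecular_weight < 500.0      # molecular weight in u
        and h_bond_donors <= 5
        and h_bond_acceptors <= 10
        and log_p < 5.0               # octanol-water partition coefficient
    )

# Hypothetical descriptor values for two candidate ligands.
small_ligand_ok = passes_rule_of_five(350.0, 2, 6, 3.1)
large_ligand_ok = passes_rule_of_five(720.0, 2, 6, 3.1)
```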
Virtual screening is used for this inspection. It is a term for using large libraries of
compounds with well-known dockings. Such a library is obtained by determining the
structures of protein-ligand complexes. Accuracy of the predicted conformation is not the
only property sought after: the efficiency of the algorithm also plays a critical role when
large numbers of molecules are evaluated one after another. An easy way of speeding up
an algorithm is to go through fewer steps or to take fewer samples in the genetic
algorithm, but this risks losing accuracy.
A large and carefully constructed set of protein-ligand complexes is required for
estimating the success rate of a docking program. In such a library, drawn from a protein
data bank, the complexes should represent the usual features of protein-ligand docking.
A validation set should not contain complexes with protein-ligand clashes,
crystallographic contacts, or unlikely ligand geometries. Diversity should still be
preserved to get as broad a view as possible of the quality of an algorithm.
Fitness Functions
A good docking program also takes binding affinity into account. In a
scoring function, the atom-scale electromagnetic forces have to be taken into
consideration. With fast search settings, the binding affinities predicted by the
model tend to disagree with experiment.
GOLD offers a choice of fitness functions: GoldScore, ChemScore, and a user-defined
score. GoldScore and ChemScore are considered equally reliable, but they may give
different predictions depending on the problem.
GoldScore
The GoldScore fitness function is the original GOLD scoring function and is selected by
default. It is made up of four components: protein-ligand hydrogen bond energy,
protein-ligand van der Waals energy, ligand internal van der Waals energy, and ligand
torsional strain energy. Optionally, a fifth component, ligand intramolecular hydrogen
bond energy, may be added. Empirical parameters used in the fitness function, such as
hydrogen bond energies, atom radii and polarizabilities, and hydrogen bond
directionalities, are taken from a parameter file.
The Goldscore function uses bond strengths in the fitness function, which is of the form
$$f = S_{hb\_ext} + S_{vdw\_ext} + S_{hb\_int} + S_{vdw\_int},$$
where Shb_ext is the protein-ligand hydrogen bonding score and Shb_int the internal
hydrogen bonding score of the ligand. Usually, the best result is obtained by letting the
internal hydrogen bonding tend to zero. Svdw_ext and Svdw_int are the scores arising from
weak van der Waals forces.
Goldscore has a mechanism for placing the ligand in the binding site, which is based on
fitting points. The program adds hydrogen-bonding fitting points to the protein and
ligand. Then it maps acceptor points on the ligand onto donor points in the protein and
vice versa. Additionally, it generates hydrophobic fitting points in the protein cavity onto
which ligand CH groups are mapped. The fitness function in GoldScore is optimized for
the prediction of binding positions rather than binding affinities.
The actual search algorithm is a genetic algorithm optimizing several parameters, of
which one is the fitting point score described above. The other parameters are the dihedrals
of ligand rotatable bonds, ligand ring geometries, and the dihedrals of protein OH and NH3+
groups. All of these variables arise from the multiplicity of possible conformations that
the molecules can adopt.
ChemScore
ChemScore was derived empirically from a set of 82 protein-ligand complexes for which
measured binding affinities were available. Unlike GoldScore, ChemScore was trained by
regression against measured affinity data, although there is no clear indication that it is
superior to GoldScore in predicting affinities.
ChemScore estimates the total free energy change that occurs on ligand binding as
described below:
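$$
\begin{aligned}
\Delta G_{binding} &= \Delta G_0 + \Delta G_{hbond} + \Delta G_{metal} + \Delta G_{rot} + \Delta G_{lipo} \\
\Delta G_0 &= v_0 \\
\Delta G_{hbond} &= v_1 P_{hbond} \\
\Delta G_{metal} &= v_2 P_{metal} \\
\Delta G_{lipo} &= v_3 P_{lipo} \\
\Delta G_{rot} &= v_4 P_{rot}
\end{aligned}
$$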
Here the v terms are regression coefficients and the P terms represent the various types of
physical contributions to binding. The final ChemScore value is obtained by adding in a
clash penalty and internal torsion terms, which militate against close contacts in docking
and poor internal conformations. Covalent and constraint scores may also be included.
ChemScore = ΔGbinding + Pclash + Cinternal Pinternal (+ CcovalentPcovalent + Pconstraint)
The hydrogen-bond term is computed as a sum over all possible acceptor-donor pairs
such that one atom belongs to the protein and the other to the ligand. Each term in the
summation is the product of three Gaussian-smoothed block functions. The purpose of
the block functions is to reduce the contribution of a hydrogen bond according to how
much its geometry deviates from (a) ideal H…A distance (where ‘H’ is the hydrogen
atom linked to the donor atom (‘D’), ‘…’ the hydrogen bond, and ‘A’ the acceptor atom),
(b) ideal D-H…A angle (where D-H is a covalent bond between donor and hydrogen
atom), and (c) ideal directionality with respect to the acceptor atom. The maximum
contribution of a given acceptor-donor pair to the summation is 1; this will occur if the
pair forms a hydrogen bond of “ideal” geometry.
The block function is of the form
[equation not recovered]
and the Gaussian-smoothed block function looks like:
[equation not recovered]
The summation function for hydrogen bond strengths is
[equation not recovered]
where r is the distance and α the angle as described above. In ChemScore the block
function is convolved with a Gaussian function; σ represents the smearing sigma for each
term.
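To make the idea of a block function concrete, here is a minimal sketch of the unsmoothed, piecewise-linear form that is usually given for ChemScore: full contribution up to an ideal deviation, a linear fall-off to zero at a maximum deviation, and nothing beyond. This exact form, the function name, and the numeric arguments are assumptions for illustration, and the Gaussian smoothing step is left out.

```python
def block(x, x_ideal, x_max):
    """Piecewise-linear block function (assumed form, without Gaussian smoothing).

    x       : deviation of the measured value from its ideal value
    x_ideal : deviation up to which the contribution is still 1
    x_max   : deviation beyond which the contribution is 0
    """
    if x <= x_ideal:
        return 1.0
    if x >= x_max:
        return 0.0
    return 1.0 - (x - x_ideal) / (x_max - x_ideal)

# Hypothetical tolerances (Å) for an H...A distance deviation.
ideal_geometry = block(0.10, 0.25, 0.65)   # inside the ideal window
halfway = block(0.45, 0.25, 0.65)          # midway through the fall-off
too_far = block(1.00, 0.25, 0.65)          # beyond the maximum
```

A hydrogen bond term built this way multiplies one such factor for the distance, one for the angle, and one for the directionality, so a pair with ideal geometry contributes 1.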
The third block function in the H-bond equation, B´*, is the sum of all possible values for
a given hydrogen bond. For example, a tertiary amine acceptor has three covalently
bound atoms that could be deemed as the ‘X’ atom: in this case, the term added for an H-bond to the amine is the product of the block function values for all three possible
H…A-X angles.
The metal-binding term in ChemScore is computed as a sum over all possible metal-ion
acceptor pairs, where the acceptor is an atom in the ligand that is capable of binding to a
metal. Again a Gaussian-smoothed block function is used, whose purpose is to reduce the
contribution of the metal-acceptor interaction if the geometry is not ideal.
The parameter raM is the actual acceptor-metal (A-M) distance, Rideal is the ideal A-M
distance, Rmax the maximum A-M distance to be considered as a binding interaction,
and σmetal the Gaussian smearing sigma associated with this term.
The lipophilic term is defined in a similar way.
The parameter rll is the actual distance between the pair of lipophilic atoms, Rideal is the
ideal atom-atom distance, Rmax the maximum separation beyond which no interaction is
deemed to occur, and σlipo is the Gaussian smearing sigma associated with this term.
Lipophilic atoms are defined as non-accepting sulphurs, non-polar carbon atoms and
non-ionic chlorine, bromine and iodine atoms.
The following formula is used to estimate the entropic loss that occurs when single,
acyclic bonds in the ligand become non-rotatable upon binding:
Nrot is the number of frozen rotatable bonds in the ligand (a bond is considered frozen if
one or more atoms on both sides of the rotatable bond are in contact with the protein).
The expression is deemed to have a value of zero if there are no rotatable bonds in the
ligand. Pnl(r) and P’nl(r) are the percentages of non-hydrogen atoms on either side of the
rotatable bond that are not lipophilic. For example, if there are 10 non-hydrogen atoms on
one side of the bond, of which 3 are not lipophilic, and there are 20 non-hydrogen atoms
on the other side, of which 2 are not lipophilic, then Pnl(r) and P’nl(r) are 30% and 10%,
respectively.
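The following sketch shows how such a term can be evaluated. It assumes the form usually quoted for ChemScore, Prot = 1 + (1 − 1/Nrot) Σ (Pnl(r) + P’nl(r))/2, summed over the frozen rotatable bonds, with Pnl expressed as fractions rather than percentages. Both the formula and the function name are assumptions here and should be checked against the GOLD documentation.

```python
def rotational_term(frozen_bonds):
    """Entropic penalty term for frozen rotatable bonds (assumed form).

    frozen_bonds: list of (p_nl, p_nl_prime) pairs, each the fraction (0..1)
    of non-hydrogen atoms on one side of the bond that are not lipophilic.
    """
    n_rot = len(frozen_bonds)
    if n_rot == 0:
        return 0.0  # defined as zero when there are no rotatable bonds
    total = sum((p_nl + p_nl_prime) / 2.0 for p_nl, p_nl_prime in frozen_bonds)
    return 1.0 + (1.0 - 1.0 / n_rot) * total

# The bond from the text: 3 of 10 atoms (0.3) and 2 of 20 atoms (0.1).
one_bond = rotational_term([(0.3, 0.1)])
# Hypothetical ligand with two such frozen bonds.
two_bonds = rotational_term([(0.3, 0.1), (0.3, 0.1)])
```

With a single frozen bond the factor (1 − 1/Nrot) vanishes and the term equals 1; with two identical bonds it becomes 1 + 0.5 × 0.4 = 1.2.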
In addition, the final ChemScore fitness function contains a clash penalty term and an
internal torsion term. Clashes between protein and ligand atoms and ligand internal
torsional strain are penalized in order to prevent poor geometries in docking. The clash
penalty terms depend on the nature of the contact: whether it is a hydrogen-bonding
contact, a metal-binding contact, or neither of these.
Combined Fitness Functions
In the Goldscore-CS protocol, dockings are produced by the Goldscore function and then
ranked by Chemscore. In Chemscore-GS, conversely, dockings are produced by Chemscore
and ranked by Goldscore. Docking with Chemscore is up to three times faster, but with
larger ligands Goldscore gives more accurate results. With small ligands, no such difference
appears. The difference therefore seems to arise from the number of degrees of freedom
in the molecules.
Combining both functions, as in Goldscore-CS and Chemscore-GS, gives improved results.
Goldscore-CS gives success rates of up to 81 %, where success means that the top-ranked
GOLD solution is within 2.0 Å (the usual root-mean-square distance considered a
successful prediction of docking) of the experimental binding mode. A longer search time
is the cost of this combination of methods. In terms of producing binding-energy
estimates, the Goldscore function appears to perform better than the Chemscore function
and the two consensus protocols, particularly for faster search settings.
Verdonk et al. compared results from the Goldscore and Chemscore functions and
concluded that Goldscore outperforms Chemscore on larger molecules. Goldscore was
usually better, but there were also cases where Chemscore was the winner. An
interesting finding was the lack of cases where both functions correctly predicted the
experimentally determined conformation.
In all cases, a combination of the two functions gave better results than either function
alone. The reason is that errors are more probable when only one function is used.
These “hard failures”, which rank highly under one function, do not score well
under the other function, which uses different parameters for the scoring. Although an
incorrect conformation might receive a good score from one function, the other function
can be used as a filter for these failures.
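The filtering idea can be sketched as a two-stage ranking. This is only an illustration of the consensus principle, not GOLD's implementation: the pose names and score values are hypothetical, and both scores are assumed to be fitness values where higher is better.

```python
def consensus_rank(poses, primary, secondary, keep=3):
    """Shortlist the `keep` best poses by the primary score, then order
    the shortlist by the secondary score (higher is better for both)."""
    shortlisted = sorted(poses, key=primary, reverse=True)[:keep]
    return sorted(shortlisted, key=secondary, reverse=True)

# Hypothetical poses with made-up scores from two fitness functions.
poses = [
    {"name": "A", "gold": 60.0, "chem": 20.0},
    {"name": "B", "gold": 58.0, "chem": 31.0},
    {"name": "C", "gold": 40.0, "chem": 35.0},
    {"name": "D", "gold": 55.0, "chem": 10.0},
]
ranked = consensus_rank(
    poses,
    primary=lambda p: p["gold"],    # stage 1: generate/shortlist
    secondary=lambda p: p["chem"],  # stage 2: re-rank as a filter
)
ranked_names = [p["name"] for p in ranked]
```

Pose A wins under the primary score alone, but the secondary score promotes pose B, which both functions rate well.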
Goldscore-CS, where the conformations are first ranked by Goldscore and then verified
by Chemscore, gives better results than Chemscore-GS, where the functions are used in
the reverse order. Accordingly, Goldscore, which gives better results than Chemscore on
its own, also gives better results when used as the primary scoring function (as in
Goldscore-CS).
As mentioned above, the advantage of the Chemscore function is its efficiency. There are
two main reasons for this difference. First, Chemscore does not take hydrogen atoms
into account in the lipophilic and clash terms. Therefore, the external van der Waals term,
Svdw_ext, can be precalculated. Second, the functional form of the ligand intramolecular
energy is simpler in Chemscore.
Consensus docking, where several functions are used, can also be extended to using
several algorithms. DOCK, FlexX, and GOLD have been used together to get better
results.
Verdonk et al. also noted that the Goldscore function correlates with binding affinity
as well as ΔGbinding does, which is surprising because the Goldscore function contains
no such term. To investigate this, though, the intramolecular terms of the
Goldscore function must be subtracted, because they cannot be compared between
different complexes. The correlation between experiment and model deteriorates rapidly
with faster search settings. This means that a correctly predicted binding mode is
essential for obtaining reasonable estimates of the binding energy.
Genetic Algorithm
Overview
A genetic algorithm can be used to evolve the pose of the molecule in search of the
optimum state. A genetic algorithm requires the definition of a fitness function, which
must emphasize the properties of the evolving system that are being optimized.
In the case of protein structure and docking, the natural measure of the quality of a
conformation is the overall energy of the molecule. The energy of the molecule varies as
a function of the positions of its components. There are therefore one or more
conformational states towards which the molecular geometry tends to converge. The task
of the algorithm is to find these few states among the excess of all states by changing the
values of the variables.
The algorithm is initialized by producing multiple conformations at random. Of these
candidates, the several conformations with the most favorable values of the fitness
function are chosen for the next step. Properties of these conformations are then
recombined with each other; this recombination is analogous to the recombination of
DNA in chromosomes. The conformations are also mutated to introduce new properties
into the system. From these recombined and mutated conformations a new generation is
then chosen in a similar way. After several steps the system approaches its optimum, so
that the best conformation is among the results.
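The loop described above can be sketched in a few lines. This is a generic toy example, not GOLD's algorithm: the real-valued genes, truncation selection, one-point crossover, Gaussian mutation, the toy fitness function, and all parameter values are assumptions chosen for illustration.

```python
import random

def evolve(fitness, n_genes, pop_size=30, generations=40, n_parents=10,
           mutation_scale=0.3, seed=0):
    """Minimal genetic algorithm that maximizes `fitness` over lists of floats."""
    rng = random.Random(seed)
    # Random initial population of candidate solutions.
    population = [[rng.uniform(-5, 5) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the candidates with the most favorable fitness.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        children = []
        while len(children) < pop_size - n_parents:
            mother, father = rng.sample(parents, 2)
            # Recombination: one-point crossover of the two parents.
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = mother[:cut] + father[cut:]
            # Mutation: perturb one randomly chosen gene.
            child[rng.randrange(n_genes)] += rng.gauss(0.0, mutation_scale)
            children.append(child)
        # The parents survive unchanged, so the best fitness never decreases.
        population = parents + children
    return max(population, key=fitness)

def toy_fitness(genes):
    """Toy 'energy' landscape with a single optimum at all genes equal to 1."""
    return -sum((g - 1.0) ** 2 for g in genes)

best = evolve(toy_fitness, n_genes=3)
```

In a docking setting the genes would instead encode quantities such as torsion angles and fitting-point mappings, and the fitness would be the scoring function.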
The Genetic Algorithm in GOLD
GOLD optimizes the fitness score by using a genetic algorithm. A population of potential
solutions (in this case, possible docked orientations of the ligand) is set up at random.
Each member of the population is encoded as a chromosome which contains information
about the mapping of ligand H-bond atoms onto complementary protein H-bond atoms,
mapping of hydrophobic points on the ligand onto protein hydrophobic points and the
conformation around flexible ligand bonds and protein OH-groups. Each chromosome is
assigned a fitness score based on its predicted binding affinity and the chromosomes
within the population are ranked according to fitness.
The population of chromosomes is iteratively optimized. At each step, a point mutation
may occur in a chromosome or two chromosomes may mate to give a child. The selection
of parent chromosomes is biased towards fitter members of the population. A number of
parameters control the precise operation of the genetic algorithm: population size,
selection pressure, number of operations, number of islands, niche size, operator weights
and van der Waals and hydrogen bonding annealing parameters. Changing these
parameters from their default values is not recommended.
Population size refers to the number of chromosomes on one island.
It is possible to have two or more islands, each with the specified population size.
Each of the genetic operations (crossing-over, migration and mutation) takes information
from parent chromosomes and assembles this information in child chromosomes. The
child chromosomes then replace the worst members of the population. Again the
selection of parent chromosomes is biased towards those of high fitness. The selection
pressure is defined as the ratio between the probability that the fittest member is selected
as a parent and the probability that an average member is selected as a parent. Too high a
selection pressure will result in the population converging too early.
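One common way to realize such a ratio is linear rank-based selection, sketched below. Whether GOLD uses exactly this scheme is an assumption; the function name and the numeric values are illustrative only. Here an "average member" is taken to be one selected with the mean probability 1/n.

```python
import random

def selection_probabilities(n, pressure):
    """Linear ranking probabilities for n members, index 0 being the fittest.

    The fittest member gets pressure/n, an average member 1/n, and the
    least fit (2 - pressure)/n, so pressure must lie between 1 and 2.
    """
    if not 1.0 <= pressure <= 2.0:
        raise ValueError("pressure must be in [1, 2] for linear ranking")
    if n == 1:
        return [1.0]
    return [(pressure - 2.0 * (pressure - 1.0) * rank / (n - 1)) / n
            for rank in range(n)]

probs = selection_probabilities(100, 1.1)      # illustrative values
rng = random.Random(0)
# Draw two parent ranks, biased towards fitter (lower-ranked) members.
parent_ranks = rng.choices(range(100), weights=probs, k=2)
```

With a pressure of 1 every member is equally likely to be chosen; values near 2 concentrate selection on the best members and encourage early convergence.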
The genetic algorithm starts off with a random population. Genetic operations are then
applied iteratively to the population. The parameter Number of operations is the number
of operations that are applied over the course of a run. It is the key parameter in
determining how long a run will take.
Rather than maintaining a single population, the genetic algorithm can maintain a number
of populations that are arranged as a ring of islands. Individuals can migrate between
adjacent islands using the migration operation. The effect of the number of the islands on
the efficiency of the algorithm is uncertain.
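A ring of islands with migration between neighbours can be sketched as follows. This is a simplified illustration: the function name and the population contents are hypothetical, and the migrant is simply appended to the target island instead of replacing that island's worst member.

```python
import random

def migrate(islands, rng):
    """Copy a random individual from one island to an adjacent island on the ring.

    Returns the (source_index, target_index) pair that was used.
    """
    n = len(islands)
    source = rng.randrange(n)
    step = rng.choice((-1, 1))        # neighbours are one position away
    target = (source + step) % n      # the modulo closes the ring
    islands[target].append(rng.choice(islands[source]))
    return source, target

rng = random.Random(0)
# Five hypothetical islands with three members each.
islands = [[f"island{i}-member{j}" for j in range(3)] for i in range(5)]
source, target = migrate(islands, rng)
```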
Niching is a common technique used in genetic algorithms to preserve diversity within
the population. In GOLD two individuals share the same niche if the RMSD
(root-mean-square deviation) of the coordinates of their donor and acceptor atoms is less
than 1 Å. When a new individual is added to the population, a count is made of the
number of individuals in the population that inhabit the same niche as the new
chromosome. If there are more than the set number of individuals (the niche size) in the
niche, the new member replaces the worst member of the niche rather than the worst
member of the total population.
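The replacement rule can be sketched as below. The individuals are reduced to hypothetical (position, score) pairs and niche membership to a one-dimensional distance test; in a docking program the test would be the RMSD criterion described above.

```python
def insert_with_niching(population, newcomer, same_niche, fitness, niche_size):
    """Insert `newcomer` in place of the worst member of its niche if the niche
    holds more than `niche_size` members, otherwise in place of the worst
    member of the whole population. Returns the replaced index."""
    niche = [i for i, member in enumerate(population)
             if same_niche(member, newcomer)]
    candidates = niche if len(niche) > niche_size else range(len(population))
    worst = min(candidates, key=lambda i: fitness(population[i]))
    population[worst] = newcomer
    return worst

def close(a, b):
    """Toy niche test: positions closer than 1.0 share a niche."""
    return abs(a[0] - b[0]) < 1.0

def score(member):
    return member[1]

# Three crowded members near position 0 and one isolated, poorly scoring member.
population = [(0.0, 5.0), (0.2, 3.0), (0.4, 4.0), (9.0, 1.0)]
replaced = insert_with_niching(population, (0.1, 6.0), close, score, niche_size=2)
```

Because the niche around position 0 is overfull, the newcomer displaces the weakest member of that niche, and the isolated member survives even though its score is the lowest overall.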
Operator weights are the parameters mutate, crossover and migrate. They govern the
relative frequencies of the three types of operations that can occur during a genetic
optimization: point mutation of the chromosome, migration of population member from
one island to another and crossover of two chromosomes.
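Choosing an operation according to relative weights can be sketched as follows. The weight values are illustrative assumptions, not settings taken from this text, and the function name is hypothetical.

```python
import random

# Illustrative operator weights: only their ratios matter.
OPERATOR_WEIGHTS = {"mutate": 95, "crossover": 95, "migrate": 10}

def pick_operation(weights, rng):
    """Pick one operation name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]

rng = random.Random(1)
draws = [pick_operation(OPERATOR_WEIGHTS, rng) for _ in range(10000)]
migrate_share = draws.count("migrate") / len(draws)  # expected near 10/200
```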
Algorithm Efficiency
The efficiency of the algorithm depends on how many steps the algorithm takes. If the
optimal state is found more quickly than expected, the algorithm can be stopped to save
time. If the algorithm converges so that the optimum state is found repeatedly within
a certain margin of error, no more steps are needed. By default, GOLD terminates when
the top three dockings are within 1.5 Å of each other.
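Such a stopping test can be sketched as a pairwise RMSD check over the best-ranked poses. This is an illustration under stated assumptions: poses are lists of atom coordinates in the same order, the function names and coordinates are hypothetical, and "within 1.5 Å" is read as less than or equal to the tolerance.

```python
import math
from itertools import combinations

def pose_rmsd(a, b):
    """RMSD between two poses given as equally ordered lists of (x, y, z)."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

def converged(ranked_poses, n_top=3, tolerance=1.5):
    """True if the n_top best-ranked poses all lie within `tolerance` of each other."""
    top = ranked_poses[:n_top]
    if len(top) < n_top:
        return False
    return all(pose_rmsd(a, b) <= tolerance for a, b in combinations(top, 2))

# Hypothetical two-atom poses (Å), shifted along z by different amounts.
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]
pose_c = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
pose_far = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)]

stop_early = converged([pose_a, pose_b, pose_c])
keep_going = converged([pose_a, pose_b, pose_far])
```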
More efficiency can be gained by fixing the large protein almost completely at the energy
optimum it has on its own. The binding ligand affects only the receptor site of the protein,
leaving the rest of the protein unchanged. The ligand conformation can also be considered
constant except for its docking groups (OH and NH3+). In this way, a lot of computational
power can be saved without considerable loss of accuracy. Of course, there are cases
where the ligand affects the protein conformation so much that the geometry of
the complex no longer resembles that of the unbound protein.
The total time spent docking a ligand obviously depends on the number of docking runs,
which is set to 10 for each ligand by default. Reducing the number of docking runs
makes GOLD faster. However, it is useful to perform at least a few docking
runs on each ligand, because this increases the chances of getting the right result. If the
same answer is found in several different runs, it is usually a strong indicator that the
answer is correct. The early termination option can be used to save time. This option
instructs GOLD to terminate docking runs on a given ligand as soon as a specified number
of runs have given essentially the same answer.
The time taken by GOLD to dock ligands can be controlled by altering the values of
genetic algorithm parameters. The easiest way to make GOLD go faster is to reduce the
number of genetic algorithm operations performed in the course of a run. GOLD
manipulates a pool of chromosomes of size (population size)*(number of islands). The
size of this pool should be such that the optimization converges within the specified
number of operations. If the pool size is too small for a given value of operations the
algorithm will converge prematurely.
References
Paul, N., Rognan, D. ConsDock: A new program for the consensus analysis of protein-ligand interactions. Proteins, 47, 521-533, 2002
Jones, G., Willett, P., Glen, R. C., Leach, A. R., Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol., 267, 727-748, 1997
Verdonk, M. L., Cole, J. C., Hartshorn, M. J., Murray, C. W., Taylor, R. D. Improved protein-ligand docking using GOLD. Proteins, 52, 609-623, 2003