Contents
ABSTRACT
CHAPTER 1 GENERAL INTRODUCTION
1 INTRODUCTION
2 LITERATURE SURVEY
3 RESEARCH METHODOLOGY
4 OUTLINE OF THESIS
CHAPTER 2 GENETIC ALGORITHM
1 INTRODUCTION
2 DARWIN'S THEORY OF EVOLUTION - NATURAL SELECTION
2-1 Evolution
2-2 Natural selection
3 PHENOTYPE AND GENOTYPE IN THE NATURE
4 GENETIC ALGORITHMS
5 GENETIC ALGORITHM ANALOGY
6 THE STRUCTURES OF GENETIC ALGORITHM
7 GENETIC ALGORITHM STEPS
8 ELEMENTS OF GENETIC ALGORITHM
9 GENETIC ALGORITHM OPERATIONS
9-1 Crossover
9-2 Crossover rate
9-3 Types of crossover
9-4 Mutation
9-5 Mutation rate
9-6 Types of Mutations
10 CONCLUSION
CHAPTER 3 MOLECULES AND DRUGS
1 INTRODUCTION
2 MOLECULE
3 SMALL MOLECULES
4 MOLECULAR PROPERTY
4-1 Chemical structure
4-2 Structure determination
4-3 Medicinal chemistry
5 DRUGS
5-1 Drug design
5-2 Drug action
5-3 Drug discovery
5-4 Process of drug discovery
6 STRUCTURE-ACTIVITY RELATIONSHIP (SAR)
7 QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIP
8 QUALITY OF QSAR MODELS
9 COMFA AND COMSIA
10 CONCLUSION
CHAPTER 4 RESEARCH WORK
1 INTRODUCTION
2 MOLECULAR SIMILARITY
3 QUANTUM MOLECULAR SIMILARITY MEASURES
4 THE GRID POINTS
5 ALIGNMENT ALGORITHM
7 THE MECHANISM OF THE PROGRAM
8 PROGRESS OF FITNESS VALUE
9 RESULTS AND DISCUSSION
10 CONCLUSION
CHAPTER 5 CONCLUSION
CHAPTER 6 FUTURE RESEARCH (PHARMACOPHORE)
REFERENCES
List of Figures
Figure 1- Genotype and Phenotype (Michalewicz 2010)
Figure 2- Mechanism of Genetic Algorithm (Michalewicz 2010)
Figure 3- Single Point Crossover (Michalewicz 2010)
Figure 4- N Points Crossover (Michalewicz 2010)
Figure 5- Uniform Crossover (Michalewicz 2010)
Figure 6- Mutation (Michalewicz 2010)
Figure 7- Mutation Factor 2m (Michalewicz 2010)
Figure 8- One Molecule in a Grid (Lock 2007)
Figure 9- Distance between the Atoms of the Molecule and One Point on the Grid (Lock 2007)
Figure 10- Represent a Molecule in a List (Lock 2007)
Figure 11- Fitness Function
Figure 12- Taking Points Values from a Grid into a List
Figure 13- The Steps of our Software Algorithm
Figure 14- Progress of Fitness Function
Figure 15- Two Molecules Aligned by Hand
Figure 16- Two Molecules Aligned by the Software
Figure 17- Progress of Align Two Molecules (Step 1)
Figure 18- Progress of Align Two Molecules (Step 2)
Figure 19- Progress of Align Two Molecules (Step 3)
Figure 20- Progress of Align Two Molecules (Step 4)
Figure 21- Progress of Align Two Molecules (Step 5)
Figure 22- Progress of Align Two Molecules (Step 6)
Figure 23- Progress of Align Two Molecules (Step 7)
List of Tables
Table 1
Abstract
The genetic algorithm is one of the most common modern heuristic methods for solving
computational problems. It mimics the characteristics of Darwinian evolution and has
achieved many successes in various fields of application. In this thesis we use it to
align molecules so that they become as similar as possible to a target molecule. In
particular, we use the genetic algorithm to improve the alignment of molecules in space,
comparing each candidate position with the position of a known structure in order to
find the optimal alignment. The optimal alignment serves as prepared input data for
subsequent applications, for example Comparative Molecular Similarity Indices Analysis
(CoMSIA), a 3D method for predicting and correlating the biological activity of
molecules. Our research discusses how to perform a transformation (translation and
rotation) on each molecule of the database using transformation matrices, where each
chromosome represents the coordinates of the atoms of the molecule and its rotation
angles. To find the best transformation, we apply genetic operators to the chromosomes
to introduce diversity in a random way. In addition, the thesis summarises related
projects that are close to our work, such as genetic algorithms in molecular recognition
and design, protein structure alignment using a genetic algorithm, and a genetic
algorithm for protein threading.
Genetic Algorithm and Molecules Alignment
Chapter 1
General Introduction
1 Introduction
Good results have been obtained with a genetic algorithm developed for calculating the
similarity between the X-ray powers of molecules, where one of the molecules is rigid.
The genetic algorithm mimics Darwin's theory of evolution and natural selection.
Evolution presumes that the development of life is a slow, gradual process that began
from non-life or simple life (a simple solution in genetic algorithms) and progresses
towards more complex life (the optimal solution). In other words, complex creatures
evolve naturally from simpler ancestors over time. Problems whose structure is not
compatible with genetic algorithms are very difficult to solve in this way. However, the
structure of molecules is well defined, which makes it feasible to optimize with a
genetic algorithm.
Similarity measurements based on molecular X-ray powers have been used to quantify the
degree of resemblance between pairs of rigid three-dimensional molecules. This thesis
discusses the effect of including molecular flexibility on the similarities calculated
with such measurements when searching large three-dimensional databases. The biological
activities of molecules can be predicted from how similar they are in shape. The
research focuses on taking a molecule and aligning it to a target molecule by rotation
and translation, using the steps of a genetic algorithm. We use grid points to represent
the molecule for the computer by constraining the X-ray power on the molecule, and we
measure the distance between each atom of the molecule and each point on the grid using
the Pythagorean theorem. The difference between two molecules is measured with the
Euclidean distance.
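The grid representation and the Euclidean comparison described above can be sketched in
Python as follows. This is a minimal illustration under stated assumptions, not the
thesis software: the function names, the cubic grid, and the choice to store the sum of
the atom distances at each grid point are all hypothetical choices made for the example.

```python
import math


def grid_points(lo, hi, step):
    """Return every (x, y, z) point of a cubic grid from lo to hi inclusive."""
    n = int(round((hi - lo) / step)) + 1
    axis = [lo + i * step for i in range(n)]
    return [(x, y, z) for x in axis for y in axis for z in axis]


def molecule_to_list(atoms, grid):
    """Represent a molecule as a flat list with one value per grid point.

    Assumption: the value at a grid point is the sum of the straight-line
    (Pythagorean) distances from that point to every atom of the molecule.
    """
    return [sum(math.dist(point, atom) for atom in atoms) for point in grid]


def euclidean_difference(values_a, values_b):
    """Euclidean distance between two grid representations (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(values_a, values_b)))


if __name__ == "__main__":
    grid = grid_points(-2.0, 2.0, 1.0)
    target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]    # two atoms, made-up coordinates
    shifted = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]   # same molecule, translated
    t = molecule_to_list(target, grid)
    s = molecule_to_list(shifted, grid)
    print(euclidean_difference(t, t), euclidean_difference(t, s))
```

In a genetic algorithm the value returned by euclidean_difference could serve as the
quantity to minimize: a smaller difference indicates a better alignment.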
2 Literature Survey
Several studies are similar to our work, and it is useful to review how their authors
approached the problem of applying genetic algorithms to molecules.
According to Thorner et al. (1996), the molecular electrostatic potential (MEP) has been
used to measure the similarity between pairs of rigid three-dimensional (3D) molecules.
They report that better results were obtained with a genetic algorithm (GA) developed
for calculating the resemblance between the MEPs of two molecules. The authors state
that the development of a
The School of Computer & Information Science
range of sophisticated systems for 3-D substructure searching has been driven by the
development of effective and efficient programs for generating three-dimensional (3-D)
structures from two-dimensional (2-D) ones.
The electrostatic potential around a molecule is represented by a 3-D grid in which the
ijk-th element is the real-valued MEP at location (i, j, k). The similarity between the
target structure and a database structure is obtained in two stages: align the
corresponding grids to maximize the degree of overlap, and then use a measure such as
the cosine coefficient to calculate the similarity corresponding to this alignment. The
authors did not rely on the genetic algorithm alone; they also used a graph-theoretic
algorithm to match a target structure against each of the structures in a database,
applying a graph-generation procedure to all of the constituent structures. The
similarity search is therefore effected by comparing the field-graph representing the
target structure with the field-graph of each molecule in the database. The comparison
uses the maximal common subgraph (MCS), which identifies the largest subgraph common to
the pair of field-graphs. The resulting MCS specifies an alignment of the corresponding
MEPs, and this alignment enables the calculation of the intermolecular similarity, for
which a Gaussian approximation procedure is used. In the genetic algorithm, the
chromosome encodes a set of translations and rotations that are applied to the 3D
coordinates of one molecule to align its MEP with the MEP of another molecule that is
fixed in space. The similarity value resulting from the Gaussian similarity calculation
serves as the fitness function of the GA, which identifies the alignment by maximizing
this fitness. They note that most organic molecules contain one or more rotatable bonds
and can therefore exist in many different conformations, and that this flexibility is
useful for MEP-based similarity searching. The genetic algorithm is designed to identify
a set of geometric transformations (rotations, translations and torsional rotations)
that gives the maximal overlap of a database structure's MEP with that of the target
structure. The chromosome that represents the transformation contains one-byte
components, plus an extra one-byte component for each rotatable bond in the database
structure; a single byte encodes 256 possible rotations. To avoid wasting time bringing
the two molecules into the same general area of 3-D space, they initiate the algorithm
by placing the database structure
and the target structure at the origin (0, 0, 0). They used crossover and mutation as
genetic operators and tested one-point, two-point and uniform crossover; the best
results were obtained with two-point crossover. The mutation operator checks each
individual bit of the chromosome in turn and flips it (changing zero to one and vice
versa). To choose between crossover and mutation, a number in the range 0-100 is
generated; if the number is less than the crossover rate then crossover is performed,
otherwise mutation.
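The operator scheme summarized above can be illustrated with a short Python sketch. It
is not the implementation of Thorner et al.: the function names are hypothetical, the
chromosome is simplified to a list of bits, and the per-bit mutation probability
(mutation_rate) is an assumption, since the text does not state one.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two random cut points (needs length >= 3)."""
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def bit_flip_mutation(bits, mutation_rate, rng):
    """Visit each bit in turn; flip it with probability mutation_rate (assumed)."""
    return [1 - bit if rng.random() < mutation_rate else bit for bit in bits]


def next_child(parent_a, parent_b, crossover_rate, mutation_rate, rng):
    """Draw a number in 0-100: below crossover_rate -> crossover, else mutation."""
    if rng.uniform(0, 100) < crossover_rate:
        return two_point_crossover(parent_a, parent_b, rng)[0]
    return bit_flip_mutation(parent_a, mutation_rate, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    a = [0] * 16   # two one-byte components, all bits clear
    b = [1] * 16   # two one-byte components, all bits set
    print(next_child(a, b, crossover_rate=80, mutation_rate=0.05, rng=rng))
```

Each one-byte component would then be decoded to one of its 256 values before the
transformation is applied to the coordinates.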
The field-graph approach has some problems. First, experiments have shown that the
algorithm is not very robust. Secondly, the generation of each graph needs a single,
fixed MEP as input, so the generation would have to be repeated many times to create a
database for flexible searching (with consequent storage and processing costs). The
authors therefore prefer genetic algorithms over the field-graph approach, especially as
genetic algorithms have previously been shown to be well suited to the processing of
flexible molecules.
According to Willett (2006), similarity searching using 2D fingerprints is one of the
simplest virtual screening tools and is widely used in the early stages of
lead-discovery programmes. In this paper the author summarizes the results of studies
that sought to increase the effectiveness of current systems for similarity-based
virtual screening. He found that, if there is no specific information about the sizes of
the molecules required for testing, the Tanimoto coefficient is the coefficient of
choice for computing molecular similarities.
Willett states that there are two main types of virtual screening systems. The first is
the popular structure-based approach, for example docking and de novo design, which can
be used when the 3D structure of the biological target is available. The second is the
ligand-based approach, which is applicable in the absence of such structural
information. Examples are pharmacophore methods, which involve identifying the
pharmacophoric pattern common to a set of known actives and using that pattern in a
subsequent 3D substructure search; similarity methods, on which the author focuses; and
machine learning methods, in which a classification rule is developed from a training
set containing known active and known inactive molecules.
The basic idea underlying similarity-based virtual screening is that molecules that are
structurally similar are likely to have similar properties. The strategy of virtual
screening therefore involves computing the similarity between each of the molecules in a
database and the known reference structure, ranking the database molecules in decreasing
order of the computed similarities, and then carrying out real screening on just the
top-ranked database molecules.
He notes that the measure used to quantify the degree of resemblance between the
reference structure and each structure in the database lies at the heart of any system
for similarity-based virtual screening. A similarity measure involves three components:
a representation of the molecules that allows them to be compared (the author focuses on
the 2D fingerprint), a weighting scheme that assigns differing degrees of importance to
the various components of these representations, and a function that determines the
degree of resemblance between two structural representations.
The similarity coefficient used for comparing fingerprints is the Tanimoto coefficient.
Suppose that two molecules have a and b bits set in their fragment bit-strings, with c
of these bits set in both fingerprints; the Tanimoto coefficient is then defined as
c / (a + b - c).
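The definition above, together with the rank-by-similarity strategy of virtual
screening, can be sketched in Python. This is an illustrative example only: fingerprints
are modelled here as sets of set-bit positions, the function names and the toy database
are hypothetical, and returning 0.0 for two empty fingerprints is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for two sets of set-bit positions."""
    a = len(fp_a)          # bits set in the first fingerprint
    b = len(fp_b)          # bits set in the second fingerprint
    c = len(fp_a & fp_b)   # bits set in both fingerprints
    if a + b - c == 0:
        return 0.0         # assumed convention when both fingerprints are empty
    return c / (a + b - c)


def rank_by_similarity(reference, database):
    """Return (name, similarity) pairs in decreasing order of similarity."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    reference = {1, 2, 3, 4}
    database = {"m1": {1, 2, 3, 4}, "m2": {9}, "m3": {1, 2, 3}}
    for name, score in rank_by_similarity(reference, database):
        print(name, round(score, 3))
```

For example, with a = 4, b = 3 and c = 2 the coefficient is 2 / (4 + 3 - 2) = 0.4. Only
the top-ranked molecules of such a list would go forward to real screening.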
According to Yadgary, Amir and Unger (1998), computing the three-dimensional structure
of a protein from its amino acid sequence is a way to obtain the physical and chemical
properties of the protein molecule, because these properties depend on its
three-dimensional structure; the structure of a protein is the key to gaining insight
into its function. Today, protein structures are commonly determined by X-ray
crystallography and NMR spectroscopy. Calculating the structure of a protein directly
from its sequence is not possible, since it requires minimizing a function of thousands
of variables with constants that have not been accurately determined. The authors
instead describe another approach, threading. Threading predicts the three-dimensional
fold of a protein sequence by recognizing a known structure with which the sequence
might be compatible. In this approach, a given sequence is threaded through a given
target structure by searching for a sequence-structure alignment that places sequence
residues in preferred structural positions. The authors suggest using genetic algorithms
to obtain the optimal sequence-structure alignment. Threading is a method for predicting
protein structure by threading the sequence of one protein through the known structure
of another. In the absence of detectable sequence similarity, this method has proved
itself in recognizing the similarity of a sequence to a protein of known structure. A
threading procedure needs an algorithm to align the residues of the sequence with a
structure and a fitness function to evaluate the quality of the alignment.
Knowledge-based potentials and energy functions are obtained from a database of known
protein structures; they depend on the analysis of known three-dimensional structures of
proteins using statistical physics. According to the authors, the first step in using a
genetic algorithm is to represent the solutions as strings, which are maintained as a
population and allowed to interact. The interaction takes place via genetic operators
such as mutation, crossover and replication. Each individual in the population is a
string of numbers. A "1" represents a residue of the sequence that is aligned to the
structure position, and a "0" represents no residue. A number N greater than 1
represents residues that are not aligned to the structure position, with N-1 residues
skipped. After operators such as crossover, mutation and replication have been applied,
the sum of the numbers in each string must equal the length of the threaded sequence,
and the length of the string must equal the length of the structure. A string with a
lower normalized energy value has a higher fitness value, and therefore a better chance
of participating in the genetic operators and in the next generation. Mutations are
performed by randomly increasing the value of one number and compensating by decreasing
other positions by the same amount. Crossover is performed by randomly choosing a
position and building two new offspring by concatenating the prefix of one parent, up to
the chosen position, with the suffix of the other. One problem the authors encountered
is early convergence of the population to a single high-fitness individual. This is
common when using genetic algorithms and makes the genetic process meaningless if it
continues, because each new population is the same as the previous one. The common
solution is to maintain high diversity in the population by temporarily using a high
mutation rate for a number of generations and then decreasing it again, or
by preventing the creation of solutions that appear frequently. To avoid early
convergence, the authors used a tree technique: a tree data structure is used to make
string comparisons, which helps to prevent redundant solutions. In conclusion, they
found that a higher mutation rate gives good results, but that too high a rate does not
provide enough stability in the population to promote good solutions. Moreover, because
rigid limitations are a reason for failure in finding good alignments, the
representation used by the genetic algorithm threader was designed to allow full freedom
in choosing positions for insertions and deletions.
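The sum-preserving mutation described above can be sketched in Python. This is a
simplified illustration rather than the authors' operator: the function name is
hypothetical, the amount moved is fixed at 1, and the compensation is applied to a
single other position, whereas the text allows the decrease to be spread over several
positions.

```python
import random


def sum_preserving_mutation(chromosome, rng):
    """Increase one position by 1 and decrease another by 1.

    The length of the string (structure length) and the sum of its numbers
    (threaded sequence length) are both unchanged, and no value goes below 0.
    """
    donors = [i for i, value in enumerate(chromosome) if value > 0]
    if len(chromosome) < 2 or not donors:
        return list(chromosome)          # nothing can be moved
    source = rng.choice(donors)          # position that gives up one residue
    target = rng.choice([i for i in range(len(chromosome)) if i != source])
    child = list(chromosome)
    child[source] -= 1
    child[target] += 1
    return child


if __name__ == "__main__":
    rng = random.Random(7)
    parent = [1, 0, 2, 1, 1]             # made-up string: 5 positions, 5 residues
    child = sum_preserving_mutation(parent, rng)
    print(parent, child, sum(parent) == sum(child))
```

Keeping both invariants inside the operator means that every mutated string is still a
valid threading of the same sequence through the same structure.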
In another study, Willett (ND) discusses the docking of flexible ligands into protein
active sites. The conformation of the molecule is encoded by a real- or integer-valued
chromosome in which the i-th element represents the torsion angle of the i-th rotatable
bond. The fitness function is the energy of the specified conformation, calculated by
one of several standard molecular-modelling packages. The algorithm identifies the set
of torsion angles that minimizes the calculated energy. The author describes a study of
72 molecules with different structures chosen from the Cambridge Structural Database,
each containing between one and twelve rotatable bonds. The number of individuals in the
population was ten times the number of torsion angles in the molecule, and six bits were
used to represent each torsion angle. Low-energy conformations play a key role in
determining the physical and biological properties of a molecule, and there is much
interest in ascertaining the stable conformations that flexible molecules can adopt. For
docking, each individual consists of four strings, two for mapping and two for the
torsion angles of rotatable bonds (one in the ligand and one in the protein active
site). A routine is used to determine the hydrogen-bonding energy. The inputs to the
genetic algorithm are the size and location of the ligand that is docked into the
receptor site, and the size and location of the receptor site. The outputs are the
protein and ligand conformations associated with the fittest individual in the last
population. The author notes that systematic search, in which each torsion angle is
rotated systematically by some fixed increment, is the most common approach to
conformational analysis, but the problem with
this approach is the sheer number of conformations that need to be examined. For
instance, a systematic search with a 30° torsion increment for a molecule containing
twelve rotatable bonds would require about 9 × 10^12 energy calculations. This approach
is therefore feasible only if the molecule has very few rotatable bonds. Willett ran the
genetic algorithm with a population of randomly generated chromosomes as input, for a
maximum of 10000 energy evaluations. Improvement was noticed after about 5000
evaluations. He also used a SYBYL routine and found that it was faster than the genetic
algorithm for molecules containing small numbers of rotatable bonds, but that the
genetic algorithm was faster for molecules containing more than 7 or 8 bonds, and that
the difference increased with the number of rotatable bonds. Genetic algorithms
therefore provide an effective way of exploring the conformational space of flexible
molecules, and they work at sufficient speed to allow the conformational analysis of
highly flexible molecules that are too time-consuming to investigate using alternative
conformational-searching algorithms. He mentions that rational approaches to drug
design, which seek a molecule that is complementary to the receptor site, make use of
NMR and X-ray information about the binding-site geometry of a protein. These approaches
assumed that the ligand molecules are completely rigid and that a molecule's suitability
as a ligand depends on its steric complementarity with the site. They did not take into
account the ability of the ligand to displace water and form hydrogen bonds with the
active site. The genetic algorithm seeks to overcome these two limitations.
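The figures quoted above can be checked with a few lines of Python. The sketch is
illustrative only: the function names are hypothetical, and mapping a six-bit value
linearly onto 0-360 degrees (steps of 360 / 64 = 5.625 degrees) is an assumption, since
the text gives only the number of bits.

```python
def systematic_search_size(n_bonds, increment_deg):
    """Number of conformations visited by a full systematic torsion search."""
    return (360 // increment_deg) ** n_bonds


def decode_torsions(bits, bits_per_angle=6):
    """Decode a bit list into torsion angles in degrees (linear mapping assumed)."""
    step = 360.0 / (2 ** bits_per_angle)
    angles = []
    for start in range(0, len(bits), bits_per_angle):
        chunk = bits[start:start + bits_per_angle]
        value = int("".join(str(bit) for bit in chunk), 2)
        angles.append(value * step)
    return angles


def population_size(n_bonds):
    """Ten individuals per torsion angle, as stated in the text."""
    return 10 * n_bonds


if __name__ == "__main__":
    # 360 / 30 = 12 settings per bond, so 12 ** 12 conformations for 12 bonds.
    print(systematic_search_size(12, 30))
    print(decode_torsions([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]))
    print(population_size(12))
```

A 30° increment gives 12 settings per bond, so twelve bonds need 12^12 =
8,916,100,448,256 evaluations, which is the "about 9 × 10^12" quoted above; by contrast
the genetic algorithm was limited to 10000 evaluations.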
According to Wild and Willett (1995), molecular electrostatic potentials are a good
basis for calculating intermolecular similarity in databases of three-dimensional
chemical structures; they used electron densities to measure the similarity between two
molecules. They used the so-called Carbo index, which is based on the cosine coefficient
and is a good measure to rely on when using a genetic algorithm approach. Similarity
searching involves matching a target molecule of interest, for example an initial lead
in a drug- or pesticide-discovery programme, against all of the molecules in a database
to find those molecules that are most similar to the target. The authors believe that
genetic algorithms provide both an effective and an efficient mechanism for the
investigation of a range of complex chemical
matching problems, such as the generation of near-maximal common subgraphs from pairs of
large two-dimensional structures, the docking of flexible ligands into protein active
sites, and flexible three-dimensional substructures. They used the genetic algorithm as
follows:
The genetic algorithm searches for a combination of translations and rotations that will
align one molecule with another. Every chromosome has five components, three for
translation and two for rotation. For rotation they use two planes, each encoded as an
eight-bit binary number, which allows 256 possible rotations per plane. For translation
they also use binary numbers, covering the maximum permitted range. They initialize the
chromosomes randomly and then decode each one by applying the indicated translation and
rotation to the three-dimensional coordinates of the molecule being aligned. The
resulting coordinates are passed to a fitness function based on a Gaussian similarity
calculation. They achieved diversity in the initial population by ensuring that all of
the individuals had a large Hamming distance between them, where the Hamming distance
between two binary individuals is the number of corresponding bits that differ between
the two strings. This technique was found to prevent early convergence. In each
iteration the least fit individuals are replaced by fitter ones. They used crossover and
mutation to produce each new generation; they tried single-point, two-point and uniform
crossover, and found that uniform crossover, which maintains diversity in the
population, gave the best results, with a crossover rate of 20 % giving the best result
for this problem. They also used a simple bit-flip mutation with some probability (1/i),
and roulette-wheel selection to select the fittest individuals.
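The diversity check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 40-bit chromosome length and the distance threshold are assumptions made for the example, not values reported by Wild and Willett (1995).

```python
import random


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def diverse_population(size, n_bits, min_distance, rng, max_attempts=10000):
    """Draw random bit strings, keeping a candidate only if it is at least
    min_distance away (in Hamming distance) from every string kept so far."""
    population = []
    attempts = 0
    while len(population) < size:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not build a sufficiently diverse population")
        candidate = "".join(rng.choice("01") for _ in range(n_bits))
        if all(hamming_distance(candidate, kept) >= min_distance for kept in population):
            population.append(candidate)
    return population


# Hypothetical settings: 10 chromosomes of 40 bits, pairwise distance >= 10.
population = diverse_population(10, 40, 10, random.Random(0))
```

The binary patterns for 3 and 4 (011 and 100) differ in all three positions, so `hamming_distance("011", "100")` is 3.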
They did not use Gray coding, which is a way of representing binary strings in which
incrementing or decrementing a number always requires a change of only one bit. For
example, in the standard binary representation the number 3 is 011 and the number 4 is
100, so for a random mutation to go from 3 to 4 all 3 bits must be flipped. In Gray
code, 3 is represented by 010 and 4 by 110, so changing from 3 to 4 requires only the
first bit to be flipped.
We can convert between binary and Gray code. Given a binary number
$b = (b_1, b_2, \ldots, b_m)$
we can change it to its Gray-coded representation
$g = (g_1, g_2, \ldots, g_m)$
Changing from binary to Gray code affects the distance between different solutions and
therefore the fitness landscape explored by operators such as mutation.
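The conversion itself is not spelled out in the text. A minimal Python sketch of the standard reflected Gray code conversion is given below; the function names are illustrative assumptions.

```python
def binary_to_gray(n: int) -> int:
    """Reflected Gray code of a non-negative integer: each Gray bit is the
    XOR of the corresponding binary bit and the next more significant bit."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Inverse conversion: XOR together all right-shifts of the Gray code."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# The example from the text: 3 and 4 as three-bit Gray codes.
gray_three = format(binary_to_gray(3), "03b")  # "010"
gray_four = format(binary_to_gray(4), "03b")   # "110"
```

With this encoding any two consecutive integers differ in exactly one bit, which is the property that makes small mutations correspond to small changes in the decoded value.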
During the experiments in this paper, they demonstrated that the genetic algorithm leads
to similarities that are comparable in effectiveness for database searching to those
resulting from the use of an approach based on field graphs, and superior to those
resulting from the use of a bit-climber. Moreover, the genetic algorithm leads to more
robust alignments than does a simplex optimization procedure. The authors found some
weaknesses in the field-graph approach, which is far more complex and time-consuming
owing to the need to generate the graphs from the electrostatic potential grid before the
search can be carried out. Also, for some molecules the field graph does not contain
sufficient nodes to enable those molecules to be aligned with a target molecule.
They used four strings to represent the chromosome: two strings use a binary
representation and two use an integer representation. The first binary string
represents the ligand and the second the protein, where the angle of rotation about each
rotatable bond occupies one byte of the string. The possible hydrogen bonds between the
protein active site and the ligand are encoded as mappings in the integer strings. For
example, the first integer string encodes a mapping from hydrogen atoms of the protein to
lone pairs of the ligand, and the second string encodes the inverse mapping.
Payne and Glen (1993) used a genetic algorithm as a way to optimize the fit of
flexible molecules to a set of restrictions. The restrictions may be shape similarity,
charge distribution or intermolecular distance constraints. The problem, when using
X-ray crystallographic analysis to determine the structure of an active site, is how to
dock the ligand to this active site. Molecular modelling techniques are used to generate
conformations and to compare dissimilar molecules. In addition, there are some numerical
methods for comparison based on properties such as atom charge distribution, electron
densities, dipole moments, volume overlaps, and electrostatic and lipophilicity
potentials. The algorithm here receives the current molecule and converts it from
phenotype into genotype. Specifically, it takes the coordinates of all atoms in the
molecule and converts them to a string of bits, and this string represents one
individual in the first population. It then applies the genetic algorithm operators
(crossover, selection and mutation) to obtain a new generation.
Several steps have to be followed to carry out the algorithm.
Firstly, they have to find a good way to represent the problem. Secondly, they have to
use a distance method to do the comparison. Thirdly, when they use X-ray
crystallographic analysis to determine the structure of an active site, it is important
to know how to dock the ligand into the active site of the protein. Finally, they have to
define a set of restrictions against which the molecule is compared and fitted.
The authors used binary strings to represent each individual. They broke the string
down into four segments. The first segment represents the translation of the molecule
along the three axes x, y and z. The second segment represents the rotation of the whole
molecule around each of the axes. The third segment represents rotations around each
rotatable bond. The fourth represents the conformation of rings.
The methodology of Richmond’s research is to find an alignment algorithm that
superimposes atoms in one molecule onto similar atoms in a different molecule. To do so,
Richmond et al (2004) followed these steps:
Step 1- identify a set of candidate atom equivalences, that is, pairs of atoms which
belong to different molecules and are similar in terms of local geometry.
Step 2- filter the set of candidate atom equivalences by discarding the pairs which
cannot be overlaid by any alignment transformation.
Step 3- calculate the alignment transformation which superimposes the molecules so as to
overlay the pairs in the filtered set of atom equivalences.
Step 4- repeat the alignment and calculate a new set of atom equivalences: compare the
atoms to identify the pairs whose distance is less than a user-defined threshold, and use
these pairs for the next alignment.
The procedure to match two 2D shapes is as follows.
Firstly, identify the corresponding points on shape A and shape B. Secondly, calculate
the morphing transformation which maps the points on the first shape to their
corresponding points on the second shape. Finally, determine the similarity between the
two shapes by calculating the sum of the matching errors of the corresponding points.
Each shape is represented by a discrete set of points sampled from the external or
internal contours of the shape using an edge detector. The more points are used, the more
accurate the description of the shape.
Over recent years the folding problem has become one of the most challenging problems in
computational chemistry, especially the mechanism of folding. Genetic algorithms have
become a common way to search the solution space in this field. Each possible solution
is represented by an encoded individual or string, changing it from phenotype into
genotype. For instance, to represent the conformation of a molecule, they construct an
individual which consists of a number of real numbers, where each real number
represents the angle of rotation around a flexible bond in the molecule. The genetic
algorithm begins with a population, which is a number of individuals that have been
created randomly. During the running of the algorithm, the authors use a fitness
function that evaluates each individual to decide whether it will participate in the
next generation. The individuals in the population which have high fitness participate
in the next generation and the individuals with low fitness do not (Jones ND).
Daeyaert et al (2005) describe how to use genetic algorithms to find the similarities
between two molecules in space. They mention that finding a proper alignment of the
molecules is a requirement of structure-based drug design methodologies. The
methodologies which they mention are Comparative Molecular Field Analysis (CoMFA) and
Comparative Molecular Similarity Indices Analysis (CoMSIA). The authors used
multi-objective optimization to superimpose flexible source molecules onto rigid target
molecules. They rely on two objectives: the similarity score between the source and
target molecules, which has to be maximized, and the conformational strain of the source
molecule, which has to be minimized. The aim of this function is to minimize the squared
distance between the target and source molecules. To rank the final individuals, they
used a fast non-dominated sorting algorithm. They used elitism to ensure the survival of
solutions which have high fitness, and several operators to provide diversity in the
population. Each individual, or vector, in this search represents the alleles by real
numbers: the first three positions represent a translation of the source molecule along
the x, y and z axes, positions 4 to 6 represent the Euler angles determining the
orientation of the source molecule, and the rest of the individual represents the values
of the torsion angle of each rotatable bond in the source molecule. They mentioned that
before beginning the genetic algorithm, the coordinates of the target molecules have to
be centred.
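The layout of such a real-valued chromosome, and the centring step, can be illustrated with a short Python sketch. The function names and the dictionary keys are assumptions made for this example; the sketch only splits the vector and centres coordinates, and does not reproduce the authors' scoring or optimization.

```python
def decode_chromosome(chromosome):
    """Split a real-valued chromosome into its three parts:
    positions 1-3 translation, positions 4-6 Euler angles, the rest torsions."""
    if len(chromosome) < 6:
        raise ValueError("chromosome needs at least 6 values")
    return {
        "translation": tuple(chromosome[0:3]),
        "euler_angles": tuple(chromosome[3:6]),
        "torsions": tuple(chromosome[6:]),
    }


def centre_coordinates(coords):
    """Shift a list of (x, y, z) atom coordinates so that their centroid
    lies at the origin."""
    n = len(coords)
    centroid = tuple(sum(point[axis] for point in coords) / n for axis in range(3))
    return [tuple(point[axis] - centroid[axis] for axis in range(3)) for point in coords]


# Hypothetical molecule with two rotatable bonds, so the vector has 8 values.
parts = decode_chromosome([1.0, 2.0, 3.0, 0.1, 0.2, 0.3, 1.5, -0.5])
centred = centre_coordinates([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
```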
Xu et al (2003) mentioned that the physical properties and biological behaviour of a
molecule usually depend on its accessible, low-energy conformations; therefore, fast and
reliable computational methods for producing conformations are extremely valuable. They
used an algorithm which produces molecular conformations that are compatible with a set
of geometric restrictions. These restrictions include interatomic distance bounds which
are derived from the molecular connectivity table. In this work they used the Merck
Molecular Force Field to calculate the potential energies. They mentioned that the main
advantage of this work is that it obtains a greater diversity of conformations. The
authors focused on several enhancements to generate better initial geometries and to
detect and eliminate conformations which are likely to lead to the same local minima, as
well as on the use of this technique for protein structure prediction, pharmacophore
modelling and ligand docking.
Many problems have been solved successfully by using distance geometry, such as
NMR structure determination, conformational analysis, ligand docking and protein
structure prediction. In this work the volume and distance constraints reduce the
number of conformations accessible to the molecule and hence the search space. The
general distance geometry method is a self-organizing algorithm which tries to minimize
an error function that measures the violation of the geometric restrictions; this error
function plays the role of a fitness function.
According to Nicholas E. Jewell (2001), 3D QSAR methods such as COMPASS, CoMSIA, CoMFA
and HASL are essential to the design of bioactive molecules. It is very important for 3D
QSAR methods to obtain an alignment of the molecules in the dataset as an input for the
calculation of the structural variables. They also described a method to find the
optimal way of bringing two molecules into alignment. They described the main features
of FBSS (field-based similarity searching) and also reported a simple validation
experiment that supports the use of FBSS-based alignments in 3D QSAR analyses. They used
FBSS as the prerequisite to the 3D QSAR procedure, and compared the results with those
obtained from conventional manual alignments. Their aim was to provide an approach which
is complementary to, and not a replacement for, manual alignment. This program is
essential for implementing 3D QSAR, especially the CoMSIA and CoMFA methods.
For calculating inter-molecular structural similarity, many different measures have
been described. Carbo et al (1980) describe one approach which involves the use of
molecular field descriptors, and this approach was developed further by Good et al
(1992). The approach places the molecule at the centre of a 3D grid and calculates the
value of a molecular field, for instance the electrostatic potential of the molecule, at
each point of the 3D grid. To find the degree of similarity between two molecules, they
align the corresponding grids to find the best possible fit, and they use one of the
distance methods to measure it.
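As a hedged illustration of a grid-based similarity measure, the Python sketch below computes a discretised Carbo index, which is the cosine coefficient between two fields sampled at the same grid points. Representing each field as a flat list of values, and the function name itself, are assumptions made for this example.

```python
import math


def carbo_index(field_a, field_b):
    """Cosine coefficient between two molecular fields sampled on the same
    grid: sum(a*b) / sqrt(sum(a*a) * sum(b*b))."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must be sampled on the same grid")
    overlap = sum(a * b for a, b in zip(field_a, field_b))
    norm = math.sqrt(sum(a * a for a in field_a) * sum(b * b for b in field_b))
    if norm == 0.0:
        raise ValueError("a field with all-zero values has no defined similarity")
    return overlap / norm


# Hypothetical field values at three grid points.
same_shape = carbo_index([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # proportional fields
unrelated = carbo_index([1.0, 0.0], [0.0, 1.0])              # no overlap
```

Because the index depends only on the shape of the fields and not on their magnitude, two proportional fields score 1, while fields with no overlap score 0.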
FBSS is software which uses a genetic algorithm to align molecular fields, based on
field-based similarity measures, for similarity searching in chemical structure
databases. For each individual or chromosome, the FBSS genetic algorithm encodes the
translations and rotations which are applied to a structure to align it with a target
structure, and the value of the similarity coefficient obtained from the encoded
alignment is used as the fitness.
3 Research Methodology
In this research we developed a new algorithm to find the optimal alignment for a group
of molecules, and we used the Java language to write the program that performs this
algorithm. This alignment will serve as an input for a method called Comparative
Molecular Similarity Indices Analysis (CoMSIA), which is a 3D method to predict and
correlate molecules’ biological activities. We obtained the optimal alignment by using
the genetic algorithm to apply a transformation (translation + rotation) to each
molecule. For each transformation we compared the resulting figure with other databases
to determine whether it is a good solution or not. For this comparison we need a good
method to find the distance between the sample and the target; therefore, we are going
to find the best distance function for this comparison. This research will be useful in
the future for dealing with any shape, not only molecules.
4 Outline of Thesis
Chapter Two: “Principle of Genetic Algorithm and how John Holland mimicked Darwin’s
theory (Evolution and Natural Selection) to invent Genetic Algorithm” and “Genotype and
Phenotype”.
Chapter Three: “Molecules and Drugs and how to use QSAR and its methods (CoMFA
and CoMSIA)”.
Chapter Four: “Description of Implementing Genetic Algorithm to align some molecules
and how to use the fitness function to obtain the optimal alignment”.
Chapter Five: “The Deduction and Future Work (Pharmacophore)”.
Chapter 2
Genetic Algorithm
1 Introduction
In this chapter we discuss Darwin’s theory, which involves the mechanisms of evolution
and natural selection, and then describe how John Holland used this theory to invent the
idea of the genetic algorithm. Phenotype and genotype are the hardware and software of
organisms. We state the principles, steps, elements, operations and process of the
genetic algorithm and how to use them to provide the diversity that offers more
solutions for the problem. The operations of genetic algorithms are crossover
(single-point, double-point and uniform crossover) and mutation (mutation factor 1m and
mutation factor 2m).
2 Darwin's Theory of Evolution - Natural Selection
There are two important elements of Darwin’s theory which have been mimicked in
genetic algorithms: the mechanism of evolution and natural selection. Both are
discussed below.
2-1 Evolution
Darwin's Theory of Evolution presumes that the development of life is a slow, gradual
process that began from non-life or simple life (a simple solution in genetic
algorithms) and progresses towards more complex life (an optimal solution in genetic
algorithms). In other words, complex creatures evolve naturally over time from simpler
ancestors.
The process began in the sea more than three billion years ago, where complex chemical
molecules started to clump together to form microscopic blobs (cells). These cells were
the seeds of the tree of life. They had the ability to split and replicate themselves as
bacteria do, and over time they diversified into different groups. Some of these groups
remained connected together and formed chains, which are called algae. Others collapsed
upon themselves and formed hollow balls, creating a body with an internal cavity; these
we call multi-celled organisms, and sponges are their direct descendants. The tree of
life became more complicated and diverse over time as more variation appeared. Some of
these organisms had the ability to move and developed a mouth that opened into a gut.
Meanwhile, other organisms had a rod inside their bodies which made them stronger, and
sense organs then developed around their front end. Some groups had bodies which were
divided into segments, with little projections on either side which helped them move on
the sea floor, and they then acquired hard, protective skins that gave their bodies some
rigidity. These creatures filled the sea with life. Roughly 450 million years ago, some
of these armoured creatures moved out of the water onto the land, and here the tree of
life branched into a multitude of different species that exploited this new environment
in all kinds of ways. Some of these groups developed elongated flaps on their backs, and
over many generations these eventually developed into wings; we now call these creatures
insects. Life took to the air and diversified into many forms. At the same time, some
organisms in the sea changed as the stiffening rod in their bodies became bone, and a
skull developed in front of it with a hinged jaw that could grab and hold onto prey.
These creatures grew bigger and gained the ability to swim with power and speed, because
they developed fins equipped with muscles. We now call these creatures fish, and they
dominate the waters of the world. Some of these creatures gained the ability to gulp air
from the water surface, and their fleshy fins became weight-supporting legs. 375 million
years ago, a few of these backboned creatures followed the insects onto the land; they
had wet skin and had to return to water to lay their eggs. We call these creatures
amphibians. Some of them evolved dry, scaly skins and broke their link with water by
laying eggs with watertight shells.
These creatures, the reptiles, were the ancestors of today's tortoises, lizards,
crocodiles and snakes. They grew bigger and gave rise to the dinosaurs, which dominated
the land, but 65 million years ago a great disaster killed all of them except one
branch, whose scales had developed into feathers; we now call these birds. At the same
time, an insignificant group of survivors began to increase in numbers on the ground
beneath; they differed from their competitors in that their bodies were warm and
insulated with coats of fur. Now, we have the first mammals. With few other creatures to
compete with, they had a good chance of surviving and spreading, and their warm,
insulated bodies enabled them to be active in all places, from the tropics to the
Arctic, on land as well as in water, on grassy plains and up in the trees, and at all
times, at night as well as during the day (information from the DVD Charles Darwin and
the Tree of Life, produced by Sacha Mirzoeff, released 2009).
2-2 Natural selection
Natural selection acts to preserve and accumulate minor advantageous genetic mutations.
Suppose a member of a species developed a useful trait; for example, it grew wings and
learned to fly. Its offspring would inherit that feature and pass it on to their
offspring. The members of the same species with inferior traits would gradually die out,
leaving only the members with superior traits. Natural selection is the preservation of
features that enable a species to compete better in the wild. It is similar to
domestic breeding. Over the centuries, human breeders have produced dramatic
changes in domestic animal populations by selecting individuals to breed. Breeders
eliminate undesirable features gradually over time. Similarly, natural selection
eliminates inferior species gradually over time. As an example, suppose that in the wild
we have a population of rabbits, some of them smart and some dumb, some fast and some
slow. The slower and dumber rabbits are more likely to be eaten by foxes, whereas the
smart and fast ones have more chance to survive and breed to produce a new generation of
rabbits. Of course, some of the slower and dumber rabbits will survive, maybe because
they are lucky, but their population will be smaller than that of the smart and fast
ones. Generation by generation, the smart and fast rabbits will come to outnumber the
other types in the wild, because there are more parents of their type. This is what we
call natural selection, of which the foxes are a part (Michalewicz 1999, p.14).
3 Phenotype and genotype in nature
The phenotype refers to the physical parts of a living organism: the sum of its atoms,
molecules, macromolecules, cells, structures, metabolism, energy utilization, tissues,
organs, reflexes and behaviours. It includes anything that is part of the observable
structure, function or behaviour of a living organism. The phenotype of an organism
is the physical expression of that organism’s genotype.
Genotype is the "internally coded, inheritable information" carried by almost all cells of
all living organisms. It is used as a “blueprint” or set of instructions for building and
maintaining a living organism. This information is written in a coded language (the
genetic code) and is encoded in the genes of an organism. These genes are connected
together into long strings called chromosomes. The genes and their settings are referred
to as the organism’s genotype. Each gene and its settings represent a specific trait of
an organism, like eye colour or hair colour. For example, a hair colour gene and its
settings determine whether hair is blonde, black or auburn. Occasionally a mutation can
occur in a gene which can result in a completely new trait expressed in an organism.
This is rare, as a mutated gene doesn’t normally affect the development of the phenotype
of an organism.
Genetic information is copied at the time of cell division or reproduction. This copied
information is passed from generation to generation and for this reason is said to be
“inheritable”. When two organisms mate to reproduce the resulting offspring will get a
share of each organism’s genes. The process is called Recombination and involves the
offspring getting half its genes from one parent and half from the other. These
instructions are very important in all aspects of the life of a cell or organism. They
contain the information for many vital functions such as the formation of protein
macromolecules, and the regulation of metabolism and synthesis.
Genotype and phenotype in genetic algorithms are explained in this diagram below:
Figure 1- Genotype and Phenotype (Michalewicz 2010)
4 Genetic Algorithms
In 1975 John Holland and his students developed genetic algorithms at the
University of Michigan. The goals of their research were to explain the adaptive
processes of natural systems and to design artificial software that retains the
important mechanisms of natural systems. This approach has led to important discoveries
in both artificial and natural system science.
A genetic algorithm is a computational model based on accepted theories of biological
evolution and natural selection. It is useful as a research method for solving problems
and for modelling evolutionary systems. It relies on stochasticity and diversity to
find the optimal solution to the problem, and it usually operates on binary or real
numbers. The genetic algorithm creates an initial population, which is a number of
individuals (chromosomes), each representing one possible solution to the problem. It
then performs a loop in which pairs of parents are selected and crossover or mutation is
applied to them to produce new offspring. A fitness function evaluates the new
offspring, and the selection method decides whether they will be in the next
generation. Problems whose structure is not compatible with genetic algorithms are
difficult to solve in this way. However, the structure of the molecular alignment
problem is clear, so it is feasible to optimize it with a genetic algorithm.
When you look at genetic algorithms you will find that some vocabulary has been
borrowed from natural genetics. For example, the individuals in a population are
sometimes called genotypes, structures, strings or chromosomes. If you compare genetic
algorithms with nature, you will find that each organism in nature carries a certain
number of chromosomes; for instance, a human has 46 chromosomes. However, in genetic
algorithms each candidate solution is one chromosome (individual, string or structure).
Each chromosome in nature has a number of units which are called genes; in genetic
algorithms these units are also called features.
5 Genetic algorithm analogy
The idea for constructing GAs based on the analogy to evolutionary biology requires
making a considerable mental transition, because the encoding mechanism is so
different in the two cases. The way in which genes are manipulated, combined and
expressed is very different in the biological and the genetic algorithms cases.
With GAs, there is much greater distance between mathematically encoded optimization
and the field of evolutionary biology from which the inspiration for the method is
derived. Consequently, the language and concepts transferred are much more subject to
reinterpretation. For example, a gene and a numerical encoding called a gene are not the
same. Reaping the benefit of the genetic analogy first requires reinterpretation before
the surprising possibilities of the analogy can be exploited.
6 The structures of genetic algorithm
To perform genetic algorithms we require these components:
• A way of encoding solutions to the problem as a chromosome (phenotype to genotype).
• An evaluation function, which returns a rating for each chromosome given to it.
• A way to initialize the population of chromosomes.
• Operators that may be applied to parents when they reproduce to alter their genetic
composition; the standard operators are mutation and crossover.
7 Genetic algorithm steps:
• Initialize a population by a certain procedure and evaluate each individual in the
initial population.
• Choose one of the genetic algorithm operators and apply it to the parents as a way to
get more diversity.
• Obtain offspring by choosing one or two parents to reproduce. Although the individuals
with high fitness are favoured, the selection is stochastic.
• Keep producing new generations until the stopping criterion is reached.
Figure 2- Mechanism of Genetic Algorithm (Michalewicz 2010)
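The steps above can be sketched as a small Python program. This is a minimal illustration under stated assumptions, not the program written for this thesis: it uses a binary encoding, roulette-wheel selection, single-point crossover and bit-flip mutation, it assumes the fitness function never returns a negative value, and all names and parameter values (population size, number of generations, a mutation rate of 0.01) are chosen only for the example.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness
    (assumes all fitness values are non-negative)."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def run_ga(fitness, n_bits=20, pop_size=30, generations=50,
           crossover_rate=0.7, mutation_rate=0.01, seed=0):
    """Evolve bit-list chromosomes and return the best individual seen."""
    rng = random.Random(seed)
    # Step 1: initialize a random population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # stopping criterion: fixed number of generations
        fitnesses = [fitness(ind) for ind in population]
        offspring = []
        while len(offspring) < pop_size:
            # Stochastic selection that favours fitter parents.
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            c1, c2 = p1[:], p2[:]
            if rng.random() < crossover_rate:
                point = rng.randint(1, n_bits - 1)
                c1 = p1[:point] + p2[point:]
                c2 = p2[:point] + p1[point:]
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]  # bit-flip mutation
                offspring.append(child)
        population = offspring[:pop_size]
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best


# Toy problem: maximise the number of ones in the chromosome.
best = run_ga(sum)
```

The toy fitness function simply counts the ones in the string; for molecular alignment it would be replaced by a similarity score computed from the decoded transformation.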
8 Elements of Genetic Algorithm
1- Encoding
• Binary Encoding
• Integer Encoding
• Real Encoding
• Complex Encoding
2- Initial population
3- Evaluation
4- Genetic Algorithms Operations
9 Genetic Algorithm Operations
9-1 Crossover
Crossover is a way to get more diversity by exchanging information among individuals,
creating the possibility of the right combination for better solutions (individuals).
It takes two parents (two individuals) chosen by the selection method, which itself
depends on the fitness function. It is performed by selecting a random position along
the length of the individual and swapping all the genes after this position. As a result
we get two new individuals which can participate in the next generation.
9-2 Crossover rate
The crossover rate is the probability with which the method swaps information between
two chromosomes (individuals). A good value for the crossover rate is roughly 0.7.
9-3 Types of crossover
Single Point Crossover
It is the simplest type of crossover. It is very fast but it produces less diversity
than other types, especially when the population has similar individuals. It works by
choosing a random position on the chromosome and swapping all the genes after this
position between two chromosomes (Figure 3).
Figure 3- Single Point Crossover (Michalewicz 2010)
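A minimal Python sketch of single-point crossover on bit strings is shown below. The function name is an illustrative assumption, and the crossover point is passed in explicitly so that the result is easy to follow; in a genetic algorithm it would be drawn at random.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap everything after position `point` between two equal-length parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


# Crossing "11111" with "00000" after the second gene.
children = single_point_crossover("11111", "00000", 2)  # ("11000", "00111")
```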
N-Point Crossover
In this type of crossover more than one point is chosen at random, and the elements
between these points are swapped to produce two new chromosomes. It is fast and it
leads to more diversity in the next generation.
Figure 4- N Points Crossover (Michalewicz 2010)
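The same idea with several cut points can be sketched as follows. The function name and the convention that the first segment is left unswapped are assumptions of this example; the cut points are given explicitly rather than drawn at random.

```python
def n_point_crossover(parent_a, parent_b, points):
    """Swap alternate segments between two equal-length parents.
    The segment before the first cut point is kept, the next is swapped,
    the next is kept, and so on."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a, child_b = list(parent_a), list(parent_b)
    swap = False
    previous = 0
    for point in sorted(points) + [len(parent_a)]:
        if swap:
            child_a[previous:point], child_b[previous:point] = (
                child_b[previous:point],
                child_a[previous:point],
            )
        swap = not swap
        previous = point
    return child_a, child_b


# Two cut points: only the genes between positions 2 and 4 are exchanged.
child_a, child_b = n_point_crossover([1] * 6, [0] * 6, [2, 4])
```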
Uniform Crossover
Another type of crossover is uniform crossover, where a coin toss is performed at each
position, and the result of the coin toss determines whether or not an exchange of
genes takes place at that position. It works by assigning 'heads' to one parent and
'tails' to the other, flipping a coin for each gene of the first child and giving the
second child the gene from the other parent; therefore, inheritance is independent of
position.
Figure 5- Uniform Crossover (Michalewicz 2010)
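A short Python sketch of uniform crossover follows. The function name is illustrative, and the fair coin (probability 0.5) is an assumption of the example.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """For each position toss a coin: on heads the first child takes the gene
    from parent_a, on tails from parent_b; the second child always receives
    the gene from the other parent."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a, child_b = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:  # heads
            child_a.append(gene_a)
            child_b.append(gene_b)
        else:  # tails
            child_a.append(gene_b)
            child_b.append(gene_a)
    return child_a, child_b


first_child, second_child = uniform_crossover([1] * 8, [0] * 8, random.Random(0))
```

Whatever the coin tosses turn out to be, the two children are complementary: at every position one child carries the gene of the first parent and the other child carries the gene of the second parent.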
9-4 Mutation
Mutation randomly changes one or more components of a chromosome. With a binary
representation, this usually means flipping bits, that is, changing bits from zero to
one or vice versa. With other representations the principle of mutation remains
unchanged.
9-5 Mutation rate
Genes in a chromosome are randomly selected with a certain probability (Pm), which is
the chance that a bit in a chromosome will be flipped (zero becomes one, one
becomes zero). The value of Pm is usually close to 0.001.
Figure 6- Mutation (Michalewicz 2010)
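Bit-flip mutation with probability Pm can be sketched as follows. The class and method names are hypothetical, and the value 0.001 used in `main` is only the typical figure quoted in the text.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative bit-flip mutation with per-gene probability pm (hypothetical names). */
final class BitFlipMutation {

    /** Returns a copy of the chromosome in which each bit is flipped with probability pm. */
    static int[] mutate(int[] chromosome, double pm, Random rng) {
        int[] result = chromosome.clone();
        for (int i = 0; i < result.length; i++) {
            if (rng.nextDouble() < pm) {
                result[i] = 1 - result[i]; // zero becomes one, one becomes zero
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] chromosome = {0, 1, 1, 0, 1, 0, 0, 1};
        int[] mutated = mutate(chromosome, 0.001, new Random());
        System.out.println(Arrays.toString(mutated));
    }
}
```

With Pm this small most calls return an unchanged copy, which is the intended behaviour: mutation is a rare disturbance, not the main search operator.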
9-6
Types of Mutations
Mutation factor (1m)
In this mechanism the mutation happens on one gene only, and the value of the gene
is changed to an entirely new value; this factor therefore allows a new
value to enter the chromosome.
Mutation factor (2m)
In this method an existing value is swapped with another existing value. The
characteristic of this factor is that it does not allow a new value to enter the
chromosome; therefore, it preserves the gene values in the chromosome.
Figure 7- Mutation Factor 2m (Michalewicz 2010)
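The difference between the two factors can be made concrete with a small Java sketch. The names `replaceGene` and `swapGenes` are hypothetical labels for factors 1m and 2m, and the `double[]` encoding is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative versions of mutation factors 1m and 2m (hypothetical names). */
final class MutationFactors {

    /** Factor 1m: one gene is overwritten, so a value not yet present can enter. */
    static double[] replaceGene(double[] chromosome, int index, double newValue) {
        double[] result = chromosome.clone();
        result[index] = newValue;
        return result;
    }

    /** Factor 2m: two existing genes trade places, so the set of values is preserved. */
    static double[] swapGenes(double[] chromosome, int i, int j) {
        double[] result = chromosome.clone();
        double tmp = result[i];
        result[i] = result[j];
        result[j] = tmp;
        return result;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        double[] chromosome = {0.5, 1.5, 2.5, 3.5};
        int i = rng.nextInt(chromosome.length);
        int j = rng.nextInt(chromosome.length);
        System.out.println(Arrays.toString(replaceGene(chromosome, i, rng.nextDouble())));
        System.out.println(Arrays.toString(swapGenes(chromosome, i, j)));
    }
}
```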
10
Conclusion
The genetic algorithm is a method invented by John Holland. It mimics
Darwin’s theory, which describes the mechanism of evolution and natural selection. In this
chapter we summarised the meaning of genotype and phenotype and how the
features of an organism are inherited from generation to generation. The process and
operators of the genetic algorithm, namely crossover and mutation, have been discussed.
Chapter 3
Molecules and Drugs
1
Introduction
Most drugs used nowadays in human therapy interact with certain
macromolecular targets. They block or activate the activity of these targets by binding to them.
A molecule is an electrically neutral group of at least two atoms held together by covalent
chemical bonds, where an atom is the basic unit of a molecule, consisting of a central nucleus
surrounded by a cloud of negatively charged electrons.
2
Molecule
A molecule may consist of atoms of different elements, as with water (H2O) or of a single
chemical element, as with oxygen (O2). Generally, atoms which are connected by
non-covalent bonds such as hydrogen bonds or ionic bonds are not considered single
molecules. The science of molecules is called molecular chemistry or molecular physics,
depending on the focus. Molecular physics deals with the laws governing
their structure and properties, while molecular chemistry deals with the laws governing
the interactions between molecules that result in the formation and breakage of
chemical bonds. Very reactive species of molecules are called unstable molecules
(Brown 2003).
3
Small molecules
In pharmacology, the term small molecule is usually restricted to a molecule that
binds with high affinity to a biopolymer such as a protein, nucleic acid, or polysaccharide
and in addition alters the activity or function of the biopolymer. More generally, in
pharmacology and biochemistry a small molecule is a low molecular weight
organic compound which is, by definition, not a polymer. Small molecules can have a
variety of biological functions, serving as cell signalling molecules, as drugs in medicine,
as tools in molecular biology, as pesticides in farming, and in many other roles. These
compounds can be artificial (such as antiviral drugs) or natural (such as secondary
metabolites); they may have a beneficial effect against a disease (such as drugs) or may
be detrimental (such as teratogens and carcinogens) (Barnum 1991).
4
Molecular property
4-1
Chemical structure
When we talk about chemical structure, we mean the molecular geometry, crystal structure,
and electronic structure of molecules. Molecular geometry is the spatial arrangement
of atoms in a molecule and the chemical bonds which hold the atoms together. Molecular
geometry can be very simple, as in nitrogen or diatomic oxygen molecules, or very
complex, as in DNA or protein molecules. The structural formula is used
to represent the molecular geometry. The electronic structure describes the occupation
of a molecule's molecular orbitals.
4-2
Structure determination
In chemistry, structure determination is the process of determining the chemical
structure of molecules. The final result of this process is the set of
coordinates of the atoms in a molecule. The structure of a molecule can be
determined by methods such as infrared spectroscopy, Raman
spectroscopy, nuclear magnetic resonance (NMR), electron microscopy, and X-ray
crystallography (X-ray diffraction). The last technique can produce three-dimensional
models at atomic-scale resolution, as long as crystals are available, because X-ray
diffraction needs many copies of the molecule being studied, arranged in an
organised way. X-ray diffraction, proton NMR, carbon-13 NMR, infrared
spectroscopy, and mass spectrometry are common methods for determining chemical
structure. Common methods for determining electronic structure include
electron-spin resonance, cyclic voltammetry, electron absorption spectroscopy, and
X-ray photoelectron spectroscopy.
4-3
Medicinal chemistry
Medicinal chemistry, also referred to as pharmaceutical chemistry, is a discipline at the
intersection of pharmacology and chemistry concerned with designing, synthesizing and
developing pharmaceutical drugs. Medicinal chemistry includes the identification,
synthesis and development of new chemical entities suitable for therapeutic use. It also
involves the study of existing drugs, their quantitative structure-activity relationships
(QSAR), and their biological properties. Pharmaceutical chemistry aims to assure fitness
for the purpose of medicinal products; therefore, it is focused on quality aspects of
medicines.
Most compounds used as medicines are organic compounds, including small organic
molecules and biopolymers, but inorganic and metal-containing compounds have also been
found to be useful as drugs. For instance, the cisplatin series of platinum-containing
complexes is used as anti-cancer agents.
Medicinal chemistry is a highly interdisciplinary science combining organic chemistry
with biochemistry, computational chemistry, molecular biology, statistics,
pharmacology, pharmacognosy, and physical chemistry.
5
Drugs
Drugs are usually small molecules with roughly 50 atoms. When a drug binds to a
protein in the proper way, it changes the activity of the protein. In the most basic
sense, a drug is a small organic molecule which inhibits or activates the function of a
biomolecule such as a protein, and as a result provides a useful therapy for the patient (Lock
2007, p.1).
5-1
Drug design
Drug design, sometimes called rational drug design, is the process of finding new
medications based on knowledge of the biological target molecule. Usually a drug
target is a key molecule which is specific to a disease condition. When the drug binds
the active site of the key molecule, it inhibits it; some approaches therefore
stop the key molecule from functioning in an attempt to reduce the activity of the
pathway in the diseased state. However, to avoid side effects, drugs should not
be designed in a way that affects other molecules that may be similar in
appearance to the target molecule. Drug design most commonly involves the design of small
molecules that are complementary in shape and charge to the biomolecular target with which
they interact and to which they therefore bind. Drug design frequently relies on computer
modelling techniques, in which case it is called computer-aided drug design. Some
consider the phrase "drug design" a misnomer; what is really meant by drug design
is ligand design. Modelling techniques for predicting binding affinity are reasonably
successful. However, there are many other properties, such as bioavailability, lack of
side effects and metabolic half-life, that must also be optimised before a ligand can
become a safe and effective drug (Chohen 1996).
Drug design is classified into two types:
Ligand based
Ligand-based drug design (or indirect drug design) relies on knowledge of other
molecules that bind to the biological target of interest. These other molecules may be
used to derive a pharmacophore model which defines the minimum necessary
structural characteristics a molecule must possess in order to bind to the target. In
other words, a model of the biological target may be built based on the knowledge of
what binds to it and this model in turn may be used to design new molecular entities
that interact with the target. Alternatively, a quantitative structure-activity relationship
(QSAR), in which calculated properties of molecules are correlated with their
experimentally determined biological activity, may be derived. These QSAR
relationships in turn may be used to predict the activity of new analogues (Guner and
Osman 2000).
Structure based
Structure-based drug design (or direct drug design) relies on knowledge of the three
dimensional structure of the biological target obtained through methods such as x-ray
crystallography or NMR spectroscopy. If an experimental structure of a target is not
available, it may be possible to create a homology model of the target based on the
experimental structure of a related protein. Using the structure of the biological target,
candidate drugs that are predicted to bind with high affinity and selectivity to the target
may be designed using interactive graphics and the intuition of a medicinal chemist.
Alternatively various automated computational procedures may be used to suggest new
drug candidates.
As experimental methods such as X-ray crystallography and NMR develop, the amount
of information concerning the 3D structures of biomolecular targets has increased
dramatically. In parallel, information about the structural dynamics and electronic
properties of ligands has also increased. This has encouraged the rapid development
of structure-based drug design. Current methods for structure-based drug design
can be divided roughly into two categories. The first category is about “finding” ligands
for a given receptor, which is usually referred to as database searching. In this case, a
large number of potential ligand molecules are screened to find those fitting the binding
pocket of the receptor. This method is usually referred to as ligand-based drug design. The
key advantage of database searching is that it saves synthetic effort to obtain new lead
compounds. Another category of structure-based drug design methods is about
“building” ligands, which is usually referred to as receptor-based drug design. In this
case, ligand molecules are built up within the constraints of the binding pocket by
assembling small pieces in a stepwise manner. These pieces can be either individual
atoms or molecular fragments. The key advantage of such a method is that novel
structures, not contained in any database, can be suggested. These techniques are
generating much excitement in the drug design community (Leach, Andrew and Jhoti 2007).
5-2
Drug action
When drugs enter the human body, they cause the body to react in a specific way. For
example, they tend to stimulate certain receptors or ion channels, or act on enzymes or
transporter proteins. A drug which stimulates and activates receptors is called
an agonist, and a drug which stops agonists from stimulating the receptors is
called an antagonist. Pharmacodynamics is the action of drugs on the human body,
and pharmacokinetics is what the body does with the drugs. The receptors either
trigger a particular response directly in the body when they are activated, or they
trigger the release of hormones and/or other endogenous drugs in the body to
stimulate a particular response. Drugs interact with receptors by bonding at
specific binding sites, and because most receptors are made up of proteins, the drugs
can interact with the amino acids to change the conformation of the receptor
proteins.
5-3
Drug discovery
Drug discovery in the fields of medicine, biotechnology and pharmacology is the
process by which drugs are discovered and/or designed. Nowadays, the approach to
discovering drugs is to understand how disease and infection are controlled at the
molecular and physiological level and to target specific entities based on this
knowledge. In the past, by contrast, most drugs were discovered either by
identifying the active ingredient from traditional remedies or by serendipitous
discovery (Anson et al 2009).
5-4
Process of drug discovery
Discovery
Discovery is the identification of novel active compounds, which are usually obtained
by screening many compounds for the desired biological properties. The most
successful technique for identifying novel active compounds (usually called hits)
depends on chemical and biological intuition developed through years of rigorous
chemical-biological training. Novel active compounds can also come from
natural sources, such as plants, fungi or animals, and from synthetic chemical
libraries.
Optimization
Optimization is a further step which involves chemical modifications to improve
the biological and physicochemical properties of a given candidate compound library.
Chemical modifications can improve the recognition and binding geometries
(pharmacophores) of the candidate compounds, their affinities and pharmacokinetics, or
their reactivity and stability during metabolic degradation. Several
methods, such as SPORCalc, have contributed to quantitative metabolic prediction.
One of the most important methods, which has played a big part in finding lead
compounds, is the quantitative structure-activity relationship (QSAR) of the
pharmacophore; lead compounds are those that show the highest potency, highest
selectivity, best pharmacokinetics and least toxicity. QSAR mainly involves physical
chemistry and molecular modelling tools, including CoMFA and CoMSIA.
Development
Development is the final step in the process of drug discovery: rendering the lead
compounds suitable for use in clinical trials. It involves optimization of the synthetic
route for bulk production and the preparation of a suitable drug formulation.
6
Structure-Activity Relationship (SAR)
Similar molecules have similar activities: this is the basic assumption of all
molecule-based hypotheses, and it is the principle that we call the Structure-Activity
Relationship (SAR). The underlying problem is how to define a small difference at the
molecular level, because each kind of activity, e.g. biotransformation ability, reaction
ability, solubility, target activity, and so on, might depend on a different kind of
difference. The SAR paradox refers to the fact that not all similar
molecules have similar activities (Patani and Lavoie
1996).
7
Quantitative structure-activity relationship
According to Patani and Lavoie (1996), a quantitative structure-activity relationship
(QSAR) is the process by which chemical structure is quantitatively correlated
with a process such as biological activity or chemical reactivity. It is sometimes called a
quantitative structure-property relationship (QSPR). For example, biological activity
can be expressed quantitatively as the concentration of a substance required to give a
certain biological response. In addition, when physicochemical
properties or structures are expressed by numbers, a mathematical relationship, or
quantitative structure-activity relationship, can be formed between the two. The
mathematical expression can then be used to predict the biological response of other
chemical structures.
3D-QSAR applies force field calculations that require three-dimensional structures,
obtained for example from protein crystallography or molecule superimposition.
It relies on computed potentials instead of
experimental constants. It examines the shape of the molecule and the electrostatic fields
based on the applied energy function (Leach and Andrew 2001).
The most general mathematical form for QSAR is:
Activity = F (physicochemical properties and / or structural properties)
8
Quality of QSAR models
A QSAR model is a predictive model derived by applying statistical tools that correlate
the biological activity, including desirable therapeutic effects and undesirable side effects, of
chemicals with descriptors of their molecular structure or properties. QSAR is applied in
many disciplines, for instance toxicity prediction, regulatory
decisions, and risk assessment (Tong et al 2005), as well as lead optimization and drug
discovery (Dearden 2003).
The quality of a QSAR model depends on the choice of descriptors, the statistical methods
and the quality of the biological data. The aim is a model which is capable of making accurate
and reliable predictions of the biological activities of new compounds (Wold and Eriksson
1995). Proper validation and evaluation of the predictive power is an important
component of all QSAR models (Radhika,
Kanth and Vijjulatha 2010, p. S76).
Obtaining a successful QSAR model depends on the accuracy of the input data, the selection
of appropriate descriptors and statistical tools, and the validation of the developed model
(Roy 2007). According to Lionard (2006), validation is the procedure by which the
reliability and relevance of a process are established for a specific purpose.
According to Doytchinova and Flower (2002, p.536) 3D QSAR methods are attractive
because of their combination of an understandable molecular description, rigorous
statistical analysis, and an unambiguous graphical display of the results.
9
CoMFA and CoMSIA
The CoMFA and CoMSIA methodologies provide the information necessary for
understanding the biological properties of aligned molecules by obtaining a suitable
sampling of the steric, electrostatic and hydrogen-bond donor fields around them (Radhika,
Kanth and Vijjulatha 2010, p. S76).
According to Fabian and Timofei (1996, p. 155), CoMFA has become a
powerful tool for obtaining QSAR models. The CoMFA methodology assumes that differences
in biological activity are often related to differences in the magnitudes of the
molecular fields surrounding the receptor ligands investigated (Shagufta et al 2006,
p.106).
According to Doytchinova and Flower (2002, p.536), CoMSIA methods use fields based
on similarity indices that describe similarities and differences between ligands and
correlate them with changes in binding affinity. They also state that the CoMSIA
properties are the most important contributions responsible for binding affinity, and
that these properties are the steric, electrostatic, hydrophobic, and hydrogen-bond donor
and acceptor fields.
CoMSIA is an alternative approach to CoMFA for performing 3D QSAR. Molecular
similarity is compared in terms of similarity indices. In addition to the steric and
electrostatic fields used in CoMFA, the CoMSIA method defines explicit hydrophobic
and hydrogen-bond donor and acceptor descriptors. The main purpose of CoMSIA is
to partition the different properties into the locations where they play an important
role in determining biological activity. The most important decision in optimizing
CoMSIA performance is how to combine the five properties in a CoMSIA model
(Shagufta et al 2006, p.106).
10
Conclusion
This chapter has summarised molecular structure and
some types of molecules. We have described how drugs
work to block or activate the function of other molecules and how they are used as therapy
for humans. We have explained the mechanism of the quantitative structure-activity
relationship (QSAR) and its methods, Comparative Molecular Field Analysis (CoMFA) and
Comparative Molecular Similarity Indices Analysis (CoMSIA).
Chapter 4
Research Work
1
Introduction
In this chapter we summarise our work, which uses a genetic algorithm to
find the optimal alignment for a group of molecules. Each molecule is represented for the
computer by its effect on a set of grid points, and we use the Euclidean distance as a
measure of the difference or similarity between two molecules.
2
Molecular similarity
Proposing a new method to improve drugs is an extremely challenging but highly
rewarding task, which explains the current plethora of approaches. Molecular similarity
measures are important in the development of new medicines and agrochemicals. We use a
new similarity measure to calculate the similarity between molecules from
the same family, the steroid family. We used this measure to optimize
the alignment of these molecules, taking one molecule as the target molecule and
the others as sample molecules. The method depends on the Euclidean
distance to quantify the difference between the sample and target molecules. We use a
grid with many points; these points are affected by the power which comes
from each atom in a molecule. This approach will enable predictions in medically
related QSAR. In a chemical environment one can predict the chemical behaviour of
a molecule (for example its reactivity, ligand docking, and acidity) based on its
structure, without needing to understand the often extremely complex details of the
molecule’s action in that environment. This means that we can use data on
one molecule’s action to predict the action of another, closely related molecule
merely by comparing how similar they are. This is the basis of molecular similarity in
a chemical environment.
3
Quantum Molecular Similarity Measures
The development of effective and efficient techniques for three-dimensional
substructure searching has encouraged the development of analogous techniques for
three-dimensional similarity searching, where the aim is to identify those molecules in a
database that are most similar to a user-defined target structure, using some
quantitative measure of intermolecular structural similarity. There are many ways to
measure the quantum similarity, including algorithms for clustering 2D
structures, molecular surface matching, similarity searching through 3D databases,
shape-group methods to describe the topology of molecular shape, CoMFA
(Comparative Molecular Field Analysis), and shape-graph descriptions (Thorner et al 1996,
p. 900). In this work we use Comparative Molecular Similarity Indices
Analysis (CoMSIA) as the method to measure the similarity between two molecules.
In this research the similarity measures are based on the molecular X-ray powers, which
are used to quantify the degree of resemblance between pairs of rigid
three-dimensional molecules. This research also discusses the effect of including molecular
flexibility on the similarities that are calculated using such measures in searches of large
three-dimensional databases. Good results have been obtained with a genetic algorithm
developed for calculating the similarity between the X-ray power effects of two
molecules, one of which is rigid. Although some molecules are naturally rigid,
many organic molecules contain one or more rotatable bonds, which allow the
molecule to exist in many different conformations. For this reason it is necessary to
consider how torsional flexibility affects the molecular X-ray power effect.
4
The Grid Points
In this project we used grid points to measure the similarity between two molecules.
A molecule has many atoms; each atom has an X-ray power effect, and this effect can be
measured from the distance between the atom and a point of the grid. We
computed this distance with the Euclidean distance formula, which gives the distance
between two coordinates in three dimensions. For example, in three-dimensional
Euclidean space, the distance between the coordinate (x, y, z) and the coordinate (a, b,
c) is:
$$ d = \sqrt{(x - a)^2 + (y - b)^2 + (z - c)^2} $$
The figure below shows one molecule in a grid:
Figure 8- One Molecule in a Grid (Lock 2007)
The figure below explains how to calculate the distance between the atom and one point
of the grid:
Figure 9- Distance between the Atoms of the Molecule and One Point on the Grid (Lock 2007)
Therefore, to measure the effect of the power of one molecule on one point in the grid,
we calculated the sum of the powers of all atoms in the molecule at this point
and called this the summation point value. We also made a list with size equal
to the number of points in the grid, and this list stored the point values. To do so, we
wrote a method which we called “find grid powers”. When we pass the molecule to
this method, it finds the power value of each point and keeps the values in a list. To calculate
the power of one atom, we used the equation below:
Power = distance / 10
To obtain the point value, we summed the powers of all atoms, as explained
in the figure below:
Figure 10- Represent a Molecule in a List (Lock 2007)
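The "find grid powers" step can be sketched in Java as follows. Only the rule Power = distance / 10 and the summation over atoms come from the text; the class name `GridPowers`, the cubic 6 x 6 x 6 layout (chosen because it gives the 216 point values used by the fitness function), the grid origin and the spacing are assumptions made for illustration.

```java
/** Illustrative "find grid powers" step (hypothetical names and grid layout). */
final class GridPowers {

    /** Builds an n x n x n cubic grid; origin and spacing are assumed, not taken from the thesis. */
    static double[][] buildGrid(int n, double origin, double spacing) {
        double[][] grid = new double[n * n * n][];
        int k = 0;
        for (int ix = 0; ix < n; ix++) {
            for (int iy = 0; iy < n; iy++) {
                for (int iz = 0; iz < n; iz++) {
                    grid[k++] = new double[] {
                        origin + ix * spacing, origin + iy * spacing, origin + iz * spacing};
                }
            }
        }
        return grid;
    }

    /** Euclidean distance between two points in three dimensions. */
    static double distance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        double dz = p[2] - q[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Power of one atom at one grid point: Power = distance / 10. */
    static double atomPower(double[] atom, double[] point) {
        return distance(atom, point) / 10.0;
    }

    /** Summation point value for every grid point: the sum of the powers of all atoms. */
    static double[] findGridPowers(double[][] atoms, double[][] grid) {
        double[] powers = new double[grid.length];
        for (int g = 0; g < grid.length; g++) {
            double sum = 0.0;
            for (double[] atom : atoms) {
                sum += atomPower(atom, grid[g]);
            }
            powers[g] = sum;
        }
        return powers;
    }

    public static void main(String[] args) {
        double[][] atoms = {{1.0, 2.0, 3.0}, {2.5, 2.0, 1.0}}; // made-up coordinates
        double[] powers = findGridPowers(atoms, buildGrid(6, 0.0, 1.0));
        System.out.println(powers.length); // prints 216
    }
}
```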
5
Alignment algorithm
In this approach we used genetic algorithms to align a set of molecules in order to
predict their chemical behaviour. We used the principles of genetic algorithms and we
added some new steps to get more diversity in the solution space. We combined two
strategies to reach the optimal solution. The first strategy was to do exactly what a golf
player does when hitting the ball towards the hole. The second strategy was to
use the standard genetic algorithm to perform each hit of the player. At the beginning of the
game the player hits the ball as hard as possible, then hits it with less power, and
so on until the goal is reached. In the same way, the first chromosomes we used to
run the genetic algorithm had large parameter values. For
the next step we used chromosomes with smaller parameter values, and so on until we
reached our goal. To perform the alignment, each of the
database structures is treated as a flexible molecule, while the target structure is rigid. The
genetic algorithm is designed to identify a set of geometric transformations: rotations and
translations. These rotations and translations are encoded as a chromosome, which we
used to rotate and translate the database molecules to align them with the target
molecule. To avoid wasting time just bringing the two molecules into the same general
region of 3D space, we found the centroid of each molecule in the database and
moved it to the centroid of the target molecule. The initial population for the genetic
algorithm was created by generating random values for each parameter (gene) inside
the chromosomes. We applied each chromosome to rotate the molecule about its
centroid, which is the centre of the molecule, and also to translate it within a small space.
The first step was to use the initial population a few times to get the best position for
the molecule according to the fitness value; we considered this the first hit of the
golf player. The next step was to initiate new chromosomes with smaller parameter
values and repeat this a few times. We considered this the second hit of the golf player,
and so on until the molecule reached the optimal alignment with the target
molecule.
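The golf-player idea of shrinking the parameter range from one hit to the next can be sketched as below. This is a deliberately simplified stand-in: each hit is a plain random search around the best chromosome found so far, not a full genetic algorithm with crossover and mutation, and the names `GolfSearch`, `search`, `ranges` and `triesPerHit` are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Illustrative coarse-to-fine search in the spirit of the golf strategy (hypothetical names). */
final class GolfSearch {

    /**
     * For each range (largest first) tries random offsets of that size around the best
     * parameters so far and keeps a candidate only when its fitness is smaller.
     */
    static double[] search(ToDoubleFunction<double[]> fitness, double[] start,
                           double[] ranges, int triesPerHit, Random rng) {
        double[] best = start.clone();
        double bestFitness = fitness.applyAsDouble(best);
        for (double range : ranges) {                 // one "hit" per range
            for (int t = 0; t < triesPerHit; t++) {
                double[] candidate = best.clone();
                for (int g = 0; g < candidate.length; g++) {
                    candidate[g] += (2.0 * rng.nextDouble() - 1.0) * range;
                }
                double f = fitness.applyAsDouble(candidate);
                if (f < bestFitness) {                // smaller fitness means more similar
                    best = candidate;
                    bestFitness = f;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy fitness: squared distance of six parameters from an arbitrary made-up optimum.
        double[] optimum = {1.0, -2.0, 0.5, 0.1, 0.2, -0.3};
        ToDoubleFunction<double[]> fitness = p -> {
            double sum = 0.0;
            for (int i = 0; i < p.length; i++) {
                sum += (p[i] - optimum[i]) * (p[i] - optimum[i]);
            }
            return sum;
        };
        double[] found = search(fitness, new double[6], new double[] {4.0, 1.0, 0.25, 0.05},
                                200, new Random());
        System.out.println(Arrays.toString(found));
    }
}
```

Because a candidate is accepted only when it improves the fitness, the result is never worse than the starting point; how many hits and how quickly the range shrinks are tuning choices the sketch leaves open.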
6
Fitness function
The fitness function is the most important aspect of alignment by genetic algorithms
because it determines whether there is progress during the process. The fitness value is the
difference between two molecules; therefore, the smaller the value, the greater the
similarity. To measure the fitness value, we first computed the list for the target molecule
and treated it as a fixed list to be compared with the lists belonging to the
other molecules. The mechanism to find the fitness value between two lists (the target
molecule and the sample molecule) was to apply the Euclidean distance again, this time
not in three dimensions but in 216 dimensions:
$$ F = \sqrt{\sum_{i=1}^{216} (T_i - S_i)^2} $$
where T_i and S_i are the i-th values of the target and sample lists.
The target molecule list contained 216 values; each value represented how much the target
molecule's atoms affect one point in the grid with their powers. The sample molecule list
contained 216 values as well, and each value represented how much the sample
molecule's atoms affect one point in the grid with their powers. By applying the Euclidean
distance to these two lists, we obtained the fitness value, which represented the progress of
the alignment. The diagram below explains how to find the fitness value between the
target molecule list and the sample molecule list:
Figure 11- Fitness Function
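A minimal Java sketch of this fitness calculation is given below. It works for lists of any equal length, including the 216 values described above; the class and method names are hypothetical and are not those of the thesis program.

```java
/** Illustrative fitness function: Euclidean distance between two point-value lists. */
final class AlignmentFitness {

    /** Returns the distance between the target and sample lists; smaller means more similar. */
    static double fitness(double[] target, double[] sample) {
        if (target.length != sample.length) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < target.length; i++) {
            double diff = target[i] - sample[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] target = {1.0, 2.0, 3.0};
        double[] sample = {1.0, 5.0, 7.0};
        System.out.println(fitness(target, sample)); // prints 5.0
    }
}
```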
The following figure shows how the point values are taken from the grid into a list:
Figure 12- Taking Points Values from a Grid into a List
While working on this alignment we found that it was much better
to shift the molecules into the positive region of the grid, run the genetic
algorithm, and then shift the molecules back by the same distance,
because atoms at the same distance but with coordinates of opposite
sign sometimes give the same power effects, which affected the result negatively. In addition,
a difference in the number of atoms between two molecules affects
the power that each molecule contributes to the points of the grid.
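One way to read the push and pull step is as a uniform shift of all coordinates that is undone afterwards. The sketch below assumes that reading; the class `PositiveShift`, the choice of a single offset for all three axes and the rule for computing that offset are illustrative assumptions, not details given in the text.

```java
import java.util.Arrays;

/** Illustrative push/pull shift into the positive region (hypothetical names). */
final class PositiveShift {

    /** Smallest offset that makes every coordinate of every atom non-negative. */
    static double offsetToPositive(double[][] atoms) {
        double min = 0.0;
        for (double[] atom : atoms) {
            for (double c : atom) {
                min = Math.min(min, c);
            }
        }
        return -min;
    }

    /** Returns a copy of the molecule with the offset added to every coordinate. */
    static double[][] shift(double[][] atoms, double offset) {
        double[][] moved = new double[atoms.length][];
        for (int i = 0; i < atoms.length; i++) {
            moved[i] = new double[atoms[i].length];
            for (int k = 0; k < atoms[i].length; k++) {
                moved[i][k] = atoms[i][k] + offset;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        double[][] molecule = {{-1.5, 2.0, 0.0}, {0.5, -0.5, 1.0}}; // made-up coordinates
        double offset = offsetToPositive(molecule);
        double[][] pushed = shift(molecule, offset);  // push before the genetic algorithm
        double[][] pulled = shift(pushed, -offset);   // pull back afterwards
        System.out.println(Arrays.deepToString(pushed));
        System.out.println(Arrays.deepToString(pulled));
    }
}
```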
7
The mechanism of the program:
We wrote the program in Java. The first thing the program does is read the first molecule
from the specified file and treat it as the target (fixed) molecule. It then runs a loop,
which we call the molecules loop, that reads from the
second molecule to the last molecule in the file.
When the program reads the first molecule, it saves it in a temporary array, which
we call temporaryMolecule1, then it finds the power effect for each point in the grid and
saves these power values in a vector which we call TVector, referring to the target
molecule. When the program reads the second molecule, it saves it in an array which we call
temporaryMolecule2. In the same way, it finds the power values for each point in
the grid and saves them in a vector called SVector, referring to the sample molecule. It then
uses the procedure which we call getEuclFittnes to find the distance between the two vectors
(TVector and SVector) by computing the Euclidean distance. This is the first fitness value,
obtained before performing the genetic algorithm, and it is the difference
between the target and sample molecules. Before performing
genetic algorithms, it push the two molecules positively in to the positive area of the
grid to avoid the problems of the negative signals then after genetic algorithms, it pulls
them back by the same distance. In genetic algorithms, the program initiate random
binary chromosome and then translate it into real number chromosome which consist
of six numbers, three to represent the translation and three to represent the rotation. It
finds the centroid of the molecule by obtaining the summation of the distances of its
atoms then divides this summation by the number f atoms in the molecule to transform
the molecule about this centroid. By checking the fitness every time, the program reuses
the same chromosome in case it found progress by using it. Otherwise, it initiates
another chromosome randomly to get new position for the molecule. It uses some
temporary arrays to keep these positions of the molecule when there is a progress in
the alignment steps and uses it in the next core of the loop.
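The fitness described above is the Euclidean distance between the two vectors of grid values. The following is a minimal sketch of such a procedure, not the thesis code itself: the class name EuclideanFitness, the spelling getEuclFitness and the plain double[] representation of TVector and SVector are assumptions made for illustration.

```java
final class EuclideanFitness {

    /**
     * Euclidean distance between two vectors of grid values.
     * Both vectors must hold one value per grid point, in the same order.
     * A smaller result means the two molecules look more alike on the grid.
     */
    static double getEuclFitness(double[] tVector, double[] sVector) {
        if (tVector.length != sVector.length) {
            throw new IllegalArgumentException("vectors must cover the same grid points");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < tVector.length; i++) {
            double diff = tVector[i] - sVector[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Toy grid values; real vectors would hold one entry per grid point.
        double[] tVector = {1.0, 2.0, 2.0};
        double[] sVector = {1.0, 0.0, 2.0};
        System.out.println("Fitness " + getEuclFitness(tVector, sVector));
    }
}
```

A fitness of zero would mean that the two molecules produce identical values at every grid point.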
The process of the genetic algorithm is shown in the flow chart below:
Figure 13- The Steps of our Software Algorithm
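One step of the algorithm is the translation of a random binary chromosome into six real numbers (three translations and three rotations). A possible decoding is sketched below under stated assumptions: the number of bits per gene, the symmetric ranges maxShift and maxAngle, and the class name ChromosomeDecoder are all hypothetical, since the text does not give these details.

```java
import java.util.Random;

final class ChromosomeDecoder {
    static final int GENES = 6; // tx, ty, tz, rx, ry, rz

    /** Creates a random bit string with bitsPerGene bits for each of the six genes. */
    static boolean[] randomChromosome(int bitsPerGene, Random rng) {
        boolean[] bits = new boolean[GENES * bitsPerGene];
        for (int i = 0; i < bits.length; i++) {
            bits[i] = rng.nextBoolean();
        }
        return bits;
    }

    /**
     * Maps each group of bits to a real number. The first three genes fall in
     * [-maxShift, maxShift] and the last three in [-maxAngle, maxAngle].
     * These ranges are assumptions, not values taken from the thesis.
     */
    static double[] decode(boolean[] bits, int bitsPerGene, double maxShift, double maxAngle) {
        if (bitsPerGene < 1 || bitsPerGene > 62 || bits.length != GENES * bitsPerGene) {
            throw new IllegalArgumentException("unexpected chromosome length");
        }
        long maxInt = (1L << bitsPerGene) - 1;
        double[] genes = new double[GENES];
        for (int g = 0; g < GENES; g++) {
            long value = 0;
            for (int b = 0; b < bitsPerGene; b++) {
                value = (value << 1) | (bits[g * bitsPerGene + b] ? 1 : 0);
            }
            double unit = (double) value / maxInt;      // scaled to 0..1
            double range = (g < 3) ? maxShift : maxAngle;
            genes[g] = -range + 2.0 * range * unit;     // scaled to -range..range
        }
        return genes;
    }

    public static void main(String[] args) {
        boolean[] chromosome = randomChromosome(10, new Random());
        double[] genes = decode(chromosome, 10, 5.0, Math.PI);
        System.out.println(java.util.Arrays.toString(genes));
    }
}
```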
8
Progress of Fitness Value
The progress of the fitness value while the program aligns one sample molecule to the
target molecule is shown below:
Fitness before genetic algorithms 3475.421179465748
The progress of fitness during the process of genetic algorithms
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 0
Internal counter 2
Fitness 3472.567298407476
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 0
Internal counter 2
Fitness 3302.4535452169794
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 2
Internal counter 0
Fitness 3248.5183022971155
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 2
Internal counter 0
Fitness 3054.8163721724113
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 2
Internal counter 1
Fitness 2982.8447901424406
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 2
Internal counter 6
Fitness 2841.37017850248
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 2
Internal counter 6
Fitness 2813.9623233579355
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 5
Internal counter 9
Fitness 2745.5946955887607
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 5
Internal counter 9
Fitness 2567.9548395992474
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 5
Internal counter 18
Fitness 2481.661177505608
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 5
Internal counter 18
Fitness 2254.6069136291703
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 5
Internal counter 18
Fitness 2160.867300287659
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 1
Fitness 2034.0507525272965
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 6
Fitness 2020.3937523150983
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 6
Fitness 1829.5571891079426
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 6
Fitness 1804.4578681132932
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 20
Fitness 1659.5671266650627
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 22
Fitness 1434.221464646745
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 6
Internal counter 22
Fitness 1250.3419280607452
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 2
Fitness 1184.3723007336894
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 2
Fitness 1002.6036714036574
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 2
Fitness 940.6145222485881
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 2
Fitness 787.1680503248232
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 3
Fitness 741.2731699290922
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 17
Fitness 734.892515573167
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 17
Fitness 619.602990979145
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 9
Internal counter 17
Fitness 512.9090031803194
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 10
Internal counter 5
Fitness 310.26777439088215
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 10
Internal counter 13
Fitness 216.04409712612068
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 10
Internal counter 23
Fitness 167.20598169213577
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 12
Internal counter 0
Fitness 24.450877979887906
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 14
Internal counter 4
Fitness 6.120089334482836
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 22
Internal counter 22
Fitness 5.4866454356213294
Target molecule has 56 atoms & Sample molecule number 5 has 51 atoms
Again 0
External counter 28
Internal counter 5
Fitness 2.434489938614776
The diagram below shows the progress of the fitness during the process:
Figure 14- Progress of Fitness Function
The difference between the two molecules decreases steadily as the program runs.
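The steadily decreasing fitness follows from the rule that a move is kept only when it improves the fitness and is otherwise replaced by a new random one. The sketch below illustrates that rule on a plain vector of numbers rather than on a molecule; the class name KeepIfImprovedSearch, the random move that stands in for a chromosome, and the parameters stepSize and maxTrials are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

final class KeepIfImprovedSearch {

    /** Draws a random move with each component in [-stepSize, stepSize]. */
    static double[] randomMove(int size, double stepSize, Random rng) {
        double[] move = new double[size];
        for (int i = 0; i < size; i++) {
            move[i] = (rng.nextDouble() * 2.0 - 1.0) * stepSize;
        }
        return move;
    }

    /** Returns the fitness recorded each time a move was accepted. */
    static List<Double> run(double[] start, ToDoubleFunction<double[]> fitness,
                            double stepSize, int maxTrials, Random rng) {
        double[] best = start.clone();
        double bestFitness = fitness.applyAsDouble(best);
        List<Double> progress = new ArrayList<>();
        double[] move = randomMove(best.length, stepSize, rng);
        for (int trial = 0; trial < maxTrials; trial++) {
            double[] candidate = new double[best.length];
            for (int i = 0; i < best.length; i++) {
                candidate[i] = best[i] + move[i];
            }
            double f = fitness.applyAsDouble(candidate);
            if (f < bestFitness) {
                // Progress: keep the new position and reuse the same move.
                best = candidate;
                bestFitness = f;
                progress.add(f);
            } else {
                // No progress: discard the candidate and try a new random move.
                move = randomMove(best.length, stepSize, rng);
            }
        }
        return progress;
    }

    public static void main(String[] args) {
        // Toy fitness: distance of a 3D point from the origin.
        ToDoubleFunction<double[]> toOrigin =
                v -> Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        for (double f : run(new double[]{10, -4, 7}, toOrigin, 1.0, 2000, new Random())) {
            System.out.println("Fitness " + f);
        }
    }
}
```

Because only improving moves are recorded, the printed sequence can never increase, which matches the shape of the log above.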
The steps to find the standard deviation over ten runs of the program:
M = (4 + 3 + 3 + 2 + 6 + 4 + 3 + 3 + 4 + 3) / 10 = 3.5
Table 1
X      M      (X-M)    (X-M)^2
4      3.5     0.5     0.25
3      3.5    -0.5     0.25
3      3.5    -0.5     0.25
2      3.5    -1.5     2.25
6      3.5     2.5     6.25
4      3.5     0.5     0.25
3      3.5    -0.5     0.25
3      3.5    -0.5     0.25
4      3.5     0.5     0.25
3      3.5    -0.5     0.25
The sum of (X-M)^2
= 0.25 + 0.25 + 0.25 + 2.25 + 6.25 + 0.25 + 0.25 + 0.25 + 0.25 + 0.25 = 10.5
N - 1 = 9
√(10.5 / 9) = √1.1667 = 1.080
The standard deviation is 1.080
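The calculation can be checked with a few lines of Java. This is a small sketch using the usual sample formula, in which the sum of squared deviations is divided by N - 1 before the square root is taken; for the ten values above it gives a standard deviation of about 1.08. The class name RunStatistics is hypothetical.

```java
final class RunStatistics {

    /** Arithmetic mean of the values. */
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Sample standard deviation: sqrt( sum of (X-M)^2 / (N-1) ). */
    static double sampleStdDev(double[] x) {
        double m = mean(x);
        double sumOfSquares = 0.0;
        for (double v : x) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (x.length - 1));
    }

    public static void main(String[] args) {
        double[] runs = {4, 3, 3, 2, 6, 4, 3, 3, 4, 3};
        System.out.println("M = " + mean(runs));
        System.out.println("Standard deviation = " + sampleStdDev(runs));
    }
}
```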
9
Results and discussion
We first tested our algorithm on a set of molecules from the steroid family. We
considered the first molecule as the target molecule and the rest of the list as sample
molecules. The chromosome used by the genetic algorithm has six degrees of freedom:
three values represent the translation along the X, Y and Z dimensions, and the
remaining three values represent the rotation about the X, Y and Z dimensions. The
software takes roughly five seconds to align each sample molecule in the database to
the target molecule. When we aligned the target molecule with one of the sample
molecules by hand and displayed the result, we found that it was not very different
from the alignment of the same two molecules produced by our software. The figures
below compare the alignment made by hand with the alignment made by the software; the
red atoms represent the target molecule and the black atoms represent the sample
molecule.
Figure 15- Two Molecules Aligned by Hand
Figure 16- Two Molecules Aligned by the Software
The screen shows the results in only two dimensions, which is not enough to be sure
that our software works correctly, because each molecular structure is three-dimensional.
We therefore looked for another way to test the software. We took a copy of the target
molecule and used it as the sample molecule, so that the target and the sample were one
hundred percent similar. A successful program should then align them one hundred
percent. Before starting the genetic algorithm, we pushed the sample molecule far away
from the target and rotated it randomly in three dimensions (random values for each of
the X, Y and Z dimensions). The two molecules therefore started in different positions
and different orientations.
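The displacement used in this self-test is a rigid transformation: a rotation about the centroid followed by a translation. A minimal sketch is given below. The class name RigidTransform, the order of the rotations (X, then Y, then Z) and the use of radians are assumptions, since the text does not state them.

```java
final class RigidTransform {

    /** Centroid: the mean of the atom coordinates. Each atom is {x, y, z}. */
    static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] atom : atoms) {
            for (int k = 0; k < 3; k++) {
                c[k] += atom[k];
            }
        }
        for (int k = 0; k < 3; k++) {
            c[k] /= atoms.length;
        }
        return c;
    }

    /**
     * Rotates the atoms about their centroid by angles {rx, ry, rz} (radians),
     * then shifts them by {tx, ty, tz}. The input array is left unchanged.
     */
    static double[][] apply(double[][] atoms, double[] shift, double[] angles) {
        double[] c = centroid(atoms);
        double cx = Math.cos(angles[0]), sx = Math.sin(angles[0]);
        double cy = Math.cos(angles[1]), sy = Math.sin(angles[1]);
        double cz = Math.cos(angles[2]), sz = Math.sin(angles[2]);
        double[][] out = new double[atoms.length][3];
        for (int i = 0; i < atoms.length; i++) {
            double x = atoms[i][0] - c[0];
            double y = atoms[i][1] - c[1];
            double z = atoms[i][2] - c[2];
            // Rotation about X.
            double y1 = y * cx - z * sx;
            double z1 = y * sx + z * cx;
            // Rotation about Y.
            double x2 = x * cy + z1 * sy;
            double z2 = -x * sy + z1 * cy;
            // Rotation about Z.
            double x3 = x2 * cz - y1 * sz;
            double y3 = x2 * sz + y1 * cz;
            out[i][0] = x3 + c[0] + shift[0];
            out[i][1] = y3 + c[1] + shift[1];
            out[i][2] = z2 + c[2] + shift[2];
        }
        return out;
    }
}
```

For the self-test, the copy of the target would be passed through apply with random shift and angle values before the genetic algorithm starts. Because the transformation is rigid, the distances between atoms are unchanged, so a perfect re-alignment exists.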
The figures below show the progress of the program before and after the genetic
algorithm is performed.
Figure 17- Progress of Align Two Molecules (Step 1)
Figure 18- Progress of Align Two Molecules (Step 2)
Figure 19- Progress of Align Two Molecules (Step 3)
Figure 20- Progress of Align Two Molecules (Step 4)
Figure 21- Progress of Align Two Molecules (Step 5)
Figure 22- Progress of Align Two Molecules (Step 6)
Figure 23- Progress of Align Two Molecules (Step 7)
The final picture shows that the software is very successful. Although it did not reach
the optimal solution, in which the two molecules are aligned one hundred percent, it
reached a solution very near the optimum, which is more than enough to align two
different molecules.
10
Conclusion
We have presented a method for aligning a collection of molecules from the steroid
family. The method accepts molecules with 3D coordinates as input and computes a
collection of alignments. Each alignment is given a score, which quantifies the quality
of the alignment between the target and the sample molecule. The score is based on the
atoms' energy and on a similarity measure defined by the Euclidean distance. We used
grid points to represent each molecule numerically, and the Euclidean distance proved to
be a successful tool for measuring the differences and similarities between two
molecules. We used a genetic algorithm to align the group of molecules, and we mimicked
the mechanism of golf players to reach the goal.
Chapter 5
Conclusion
In this research we reviewed the mechanisms of evolution and natural selection from
Darwin's theory, and we discussed how John Holland mimicked this theory to invent the
genetic algorithm. A genetic algorithm is a method for finding the optimal solution to
real-life problems whose structure is compatible with the algorithm. It relies on
stochastic search and diversity. It uses operations such as crossover and mutation to
maintain diversity among the solutions, and it uses the fitness function as the brain of
the algorithm to control the whole process. We used the golf player idea to obtain the
optimal alignment for a group of molecules. The papers we reviewed on related work were
useful, as they gave us experience in dealing with molecular alignment and genetic
algorithms. We discussed phenotype and genotype and how the features of an organism are
inherited from generation to generation. We also covered molecules, small molecules,
molecular formula, molecular geometry, medicinal chemistry, drugs, drug design, drug
action, drug discovery, and the process of drug discovery. Aligning molecules is a good
method for improving drugs. We discussed Comparative Molecular Similarity Indices
Analysis (CoMSIA), a 3D QSAR method for predicting and correlating the biological
activity of molecules. The research focused on the translation and rotation
(transformation) of each molecule in the database. The algorithm was developed to align
a set of molecules to one molecule considered as the target. We used the Euclidean
distance as a tool to measure the difference and similarity between two molecules. Grid
points were used to represent each molecule in space, and they were a good tool for
controlling the progress of the fitness function and the steps of the genetic algorithm.
We tested our algorithm by taking a copy of the target molecule and using it as the
sample molecule, so that the two input molecules were one hundred percent similar. This
was a good way to test our software, and the software succeeded in obtaining a
near-optimal alignment.
Chapter 6 Future Research (Pharmacophore)
According to Jones G (2000), pharmacophore modelling is a useful tool because it
provides a good alternative when the three-dimensional structure of a protein target is
not available. A pharmacophore describes the molecular features that are necessary for a
biological macromolecule to recognize a ligand molecule. It is an ensemble of steric
and electronic features that is used to ensure the optimal supramolecular interactions
with a specific biological target and to trigger or block its biological activity.
Pharmacophores are used in modern computational chemistry to define the important
features of one or more molecules with the same biological activity. A database of
chemical compounds can then be searched for more molecules that share the same
features located a similar distance apart from each other. A genetic algorithm is used
to maximize the distance similarity between pharmacophore features. Its individuals (bit
strings) encode conformational information and the mappings between molecules in the
overlay. The fitness function guides the progress of overlapping the pharmacophore
features (Strizhev et al. 2006).
In future work we will use a more sophisticated genetic algorithm that defines each
molecule as a core structure plus a set of torsions, in order to overcome the
limitations of pharmacophore tools. Using a pharmacophore will focus on the most
important parts of the molecule rather than on the entire molecule. Our future work has
several advantages. For example, Pareto multi-objective optimization will be useful for
simultaneously balancing steric and energy information when building the most valuable
hypermolecule models. Moreover, unlike other methods, the run time will scale linearly
with the number of ligands.
References
Alex Barnum (13 May 1991). "Biotech companies shift focus". The Toronto Star.
Anson, Blake, D, Junyi, M & Jia-Qiang, H (2009). ‘Identifying Cardiotoxic Compounds’.
Brown, T.L. 2003. Chemistry – the Central Science, 9th Ed. New Jersey: Prentice Hall.
Carbó, R., Leyda, L. and Arnau, M. ‘How similar is a molecule to another? An electron
density measure of similarity between two molecular structures’. Int. J. Quant. Chem.
1980, 17, 1185-1189.
Cohen, N. Claude (1996). Guidebook on Molecular Modeling in Drug Design. Boston:
Academic Press.
Daeyaert, Jonge, M, Heeres, J, Koymans, L, Lewi, P, Broeck, W, & Vinkers, M (2005).
"Pareto optimal flexible alignment of molecules using a non-dominated sorting genetic
algorithm " Chemometric and Intelligent Laboratory Systems vol. 77, 232-237.
Dearden JC (2003). "In silico prediction of drug toxicity". Journal of Computer-aided
Molecular Design 17 (2–4): 119–27.
Doytchinova, I & Flower, D 2002, ‘A Comparative Molecular Similarity Index Analysis
(CoMSIA) study identifies an HLA-A2 binding supermotif’, Journal of Computer-Aided
Molecular Design, vol. 16, pp. 535-544.
Fabian, W & Tiofei, S 1996, ‘Comparative Molecular Field analysis (CoMFA) of dye-Fibre
affinities’, Elsevier, pp.155-162.
Genetic Engineering & Biotechnology News (Mary Ann Liebert) 29 (9): pp. 34–35.
Good, A.C., Hodgkin, E.E. and Richards, W.G. The utilisation of Gaussian functions for the
rapid evaluation of molecular similarity. J. Chem. Inf. Comput. Sci. 1992, 32, 188-191.
Guner, Osman F. (2000). Pharmacophore Perception, Development, and use in Drug
Design. La Jolla, Calif: International University Line.
Jewell, N, Turner, Willett, P & Sexton, G (2001). ‘Automatic generation of alignments for
3D QSAR analyses’, Journal of Molecular Graphics and Modelling, vol. 2, pp. 111-121.
Jones, Gareth. Genetic and evolutionary algorithms
Kubinyi, H ND, ‘Comparative Molecular Field Analysis (CoMFA)’, Ludwigshafen.
Leach, Andrew R. 2001. Molecular modelling: principles and applications. Englewood
Cliffs, N.J: Prentice Hall.
Leach, Andrew R.; Harren Jhoti (2007). Structure-based Drug Discovery. Berlin: Springer.
Leonard JT, Roy K (2006). "On selection of training and test sets for the development of
predictive QSAR models". QSAR & Combinatorial Science 25 (3): 235–251.
‘Ligand-based design: Pharmacophore Perception and Molecular Alignment’, Tripos.
Lock, P 2007, ‘Machine Learning in Drug Discovery’, pp.4-5 .
Michalewicz, Z 1999, Genetic Algorithms + Data Structures = Evolution Programs,
Springer, Berlin.
Michalewicz, Z 2010, Evolutionary Computation.
Michalewicz, Z & Fogel, D 2004, How to Solve It: Modern Heuristics, Springer, Berlin.
Patani GA & LaVoie EJ 1996. ‘Bioisosterism: A Rational Approach in Drug Design’.
Chemical Reviews, vol. 96 , pp. 3147–3176.
Payne, A.W.R. & Glen, R.C. 1993. ‘Molecular recognition using a binary genetic search
algorithm’, Journal of Molecular Graphics and Modelling, vol. 11, pp. 72-91.
Radhilka, V, Kanth, S & Vijjulatha, M 2010, ‘CoMFA and CoMSIA Studies on Inhibitors of
HIV-1 Integrase - Bicyclic Pyrimidinones’, E-Journal of Chemistry, vol. 7(S1), pp. S75-S84.
Richmond, N, Willett, P & Clark, R 2004, ‘Alignment of three-dimensional molecules
using an image recognition algorithm’, Journal of Molecular Graphics and Modelling, vol.
23, pp. 199-209.
Roy, K 2007, ‘On some aspects of validation of predictive quantitative structure-activity
relationship models’, Expert Opin. Drug Discov. 2, pp. 1567–1577.
Shagufta, Kumar, A, Panda, G & Siddiqi, M 2006, ‘CoMFA and CoMSIA 3D-QSAR analysis
of diaryloxy-methano-phenanthrene derivatives as anti-tubercular agents’, J Mol Model,
vol. 13, pp. 99-109.
Strizhev, A, Abrahamian, E, Choi, S, Leonard, J, Wolohan, P & Clark, P 2006, ‘The Effects
of Biasing Torsional Mutation in a Conformational GA’, J. Chem. Inf. Model., vol. 46, pp.
1862-1870.
Thorner, D, Wild, D, Willett, P & Wright, P 1996, ‘Similarity Searching in Files of
Three-Dimensional Chemical Structure: Flexible Field-Based Searching of Molecular
Electrostatic Potentials’, J. Chem. Inf. Comput. Sci., vol. 36, pp. 900-908.
Tong W, Hong H, Xie Q, Shi L, Fang H, Perkins R (April 2005). "Assessing QSAR
Limitations – A Regulatory Perspective". Current Computer-Aided Drug Design, vol. 1, pp.
195–205.
Wild, D & Willett, P 1996, ‘Similarity Searching in Files of Three Dimensional Chemical
Structures. Alignment of Molecular Electrostatic Potential Fields with a Genetic
Algorithm’, Journal of Chemical Information and Computer Science, vol. 36, pp. 159-167.
Willett, P. 1995. "Genetic algorithms in molecular recognition and design". TIBTECH.
Wold S & Eriksson, L 1995, ‘Statistical validation of QSAR results. In Waterbeemd, Han
van de’. Chemometric methods in molecular design. Weinheim: VCH. pp. 309–318.
Xu, H, Sergei, Z & Dimitris, A 2003, ‘Conformational sampling by self-organization’,
Journal of Chemical Information and Computer Science, vol. 43, pp. 1186-1191.
Yadgary, J, Amihod, A & Ron U 1998, ‘Genetic algorithms for protein threading’.