Contents

Abstract
Chapter 1 General Introduction
  1 Introduction
  2 Literature Survey
  3 Research Methodology
  4 Outline of Thesis
Chapter 2 Genetic Algorithm
  1 Introduction
  2 Darwin's Theory of Evolution - Natural Selection
    2-1 Evolution
    2-2 Natural selection
  3 Phenotype and Genotype in the Nature
  4 Genetic Algorithms
  5 Genetic Algorithm Analogy
  6 The Structures of Genetic Algorithm
  7 Genetic Algorithm Steps
  8 Elements of Genetic Algorithm
  9 Genetic Algorithm Operations
    9-1 Crossover
    9-2 Crossover rate
    9-3 Types of crossover
    9-4 Mutation
    9-5 Mutation rate
    9-6 Types of Mutations
  10 Conclusion
Chapter 3 Molecules and Drugs
  1 Introduction
  2 Molecule
  3 Small Molecules
  4 Drugs
  5 Quantitative Structure-Activity Relationship
  6 Quality of QSAR Models
  7 CoMFA and CoMSIA
  8 Conclusion
Chapter 4 Research Work
  1 Introduction
  2 Molecular Similarity
  3 Quantum Molecular Similarity Measures
  4 The Grid Points
  5 Alignment Algorithm
  7 The Mechanism of the Program
  8 Progress of Fitness Value
  9 Results and Discussion
  10 Conclusion
Chapter 5 Conclusion
Chapter 6 Future Research (Pharmacophore)
References

List of Figures

Figure 1- Genotype and Phenotype (Michalewicz 2010)
Figure 2- Mechanism of Genetic Algorithm (Michalewicz 2010)
Figure 3- Single Point Crossover (Michalewicz 2010)
Figure 4- N Points Crossover (Michalewicz 2010)
Figure 5- Uniform Crossover (Michalewicz 2010)
Figure 6- Mutation (Michalewicz 2010)
Figure 7- Mutation Factor 2m (Michalewicz 2010)
Figure 8- One Molecule in a Grid (Lock 2007)
Figure 9- Distance between the Atoms of the Molecule and One Point on the Grid (Lock 2007)
Figure 10- Represent a Molecule in a List (Lock 2007)
Figure 11- Fitness Function
Figure 12- Taking Points Values from a Grid into a List
Figure 13- The Steps of our Software Algorithm
Figure 14- Progress of Fitness Function
Figure 15- Two Molecules Aligned by Hand
Figure 16- Two Molecules Aligned by the Software
Figure 17- Progress of Align Two Molecules (Step 1)
Figure 18- Progress of Align Two Molecules (Step 2)
Figure 19- Progress of Align Two Molecules (Step 3)
Figure 20- Progress of Align Two Molecules (Step 4)
Figure 21- Progress of Align Two Molecules (Step 5)
Figure 22- Progress of Align Two Molecules (Step 6)
Figure 23- Progress of Align Two Molecules (Step 7)

List of Tables

Table 1

Abstract

The genetic algorithm is one of the most common modern heuristic methods for solving computational problems. It mimics the characteristics of Darwinian evolution and has been applied successfully in many fields. This research used a genetic algorithm to align molecules with a target molecule. In particular, the genetic algorithm was used as a mechanism to improve the alignment of molecules in space by comparing them with the best position of a known structure, so that the optimal solution is the optimal alignment. The optimal alignment is prepared data that serves as input to a subsequent application, for example Comparative Molecular Similarity Indices Analysis (CoMSIA), a 3D method for predicting and correlating the biological activity of molecules. The research discusses how the transformation (translation and rotation) is performed on each molecule of the database. I used transformation matrices, which are well suited to translation and rotation, and each chromosome is built from the coordinates of the atoms of the molecule and its rotation angles. To find the best transformation, I apply genetic operators to the chromosomes to introduce diversity at random.
In addition, the thesis summarises related projects that are close to my work, such as genetic algorithms in molecular recognition and design, protein structure alignment using a genetic algorithm, and a genetic algorithm for protein threading. The research question is: "How well does genetic algorithm optimisation perform the alignment of similar molecules?"

Genetic Algorithm and Molecules Alignment

Chapter 1 General Introduction

1 Introduction

Good results have been obtained with a genetic algorithm developed for calculating the similarity between the X-ray powers of molecules, where one of the molecules is rigid. The genetic algorithm mimics Darwin's theory of evolution and natural selection, which presumes that the development of life is a slow, gradual process that began from non-life or simple life (a simple solution in genetic algorithms) and proceeds by purely natural means towards more complex forms (the optimal solution). In other words, complex creatures evolve naturally over time from simpler ancestors. Problems whose structure does not map well onto a genetic algorithm are very difficult to solve with one. The structure of molecules, however, is well defined and lends itself to optimisation by a genetic algorithm. Similarity measurements based on molecular X-ray powers have been used to quantify the degree of resemblance between pairs of rigid three-dimensional molecules. This thesis discusses the effect of including molecular flexibility on the similarities calculated with such measurements when searching large three-dimensional databases. The biological activities of molecules can be predicted from how similar the molecules are in shape. The research focused on taking a molecule and aligning it with a target molecule by rotation and translation, using the steps of a genetic algorithm.
I used grid points to represent the molecule to the computer, sampling the X-ray power of the molecule on the grid, and I measured the distance between each atom of the molecule and each point on the grid using the Pythagorean theorem. The difference between two molecules is measured by the Euclidean distance.

2 Literature Survey

Several studies are similar to my work, and it is useful to describe how their authors dealt with molecules and genetic algorithms.

According to Thorner et al. (1996), the molecular electrostatic potential (MEP) has been used to measure the similarity between pairs of rigid three-dimensional (3-D) molecules. They mentioned that better results have been obtained with a genetic algorithm (GA) developed for calculating the resemblance between the MEPs of two molecules.

The School of Computer & Information Science

The authors stated that the development of effective and efficient programs for generating 3-D structures from two-dimensional (2-D) ones has led to the development of a range of sophisticated systems for 3-D substructure searching. The electrostatic potential around a molecule is represented by a 3-D grid in which the ijk-th element is the real-number value of the MEP at location (i, j, k). The similarity between the target structure and a database structure is obtained in two stages: the corresponding grids are aligned to maximise the degree of overlap, and a measure such as the cosine coefficient is then used to calculate the similarity corresponding to this alignment. The authors did not use only genetic algorithms to obtain the similarity. They also used a graph-theoretic algorithm to match a target structure against each of the structures in a database, applying the graph-generation procedure to all of the constituent structures.
Therefore, the similarity search is carried out by comparing the field-graph representing the target structure with the field-graph of each of the molecules in the database. The comparison uses the maximal common subgraph (MCS), which identifies the largest subgraph common to the pair of field-graphs. The resulting MCS specifies an alignment of the corresponding MEPs, and this alignment enables the calculation of the intermolecular similarity, for which a Gaussian approximation procedure is used. For the genetic algorithm, the chromosome encodes a set of translations and rotations that are applied to the 3-D coordinates of one molecule to align its MEP with the MEP of another molecule that is fixed in space. The similarity value resulting from the Gaussian similarity calculation serves as the fitness function of the GA, which identifies the alignment by maximising this fitness. The authors mentioned that most organic molecules contain one or more rotatable bonds, which allows a molecule to exist in many different conformations, and this is relevant to MEP-based similarity searching. The genetic algorithm is designed to identify a set of geometric transformations (rotations, translations and torsional rotations) that gives the maximal overlap of the MEP of a database structure with that of the target structure. The chromosome that represents the transformation consists of one-byte components, plus an extra one-byte component for each rotatable bond in the database structure; a single byte encodes 256 possible rotations. To avoid wasting time bringing the two molecules into the same general area of 3-D space, they initialise the algorithm by placing both the database structure and the target structure at the origin (0, 0, 0).
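As a rough illustration of the byte-valued encoding described above, the following sketch decodes a chromosome into a rigid transformation and applies it to a molecule that has first been centred at the origin. It is a minimal sketch under stated assumptions, not the implementation of Thorner et al.: the function names, the layout of three translation bytes followed by three rotation bytes, and the translation range of plus or minus 5 units are all hypothetical. Only the one-byte resolution of 256 steps and the initial placement at the origin come from the text, and torsional rotations are left out.

```python
import math

def centre_at_origin(coords):
    """Translate a list of (x, y, z) atoms so that their centroid is (0, 0, 0)."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def byte_to_angle(b):
    """Map one byte (0..255) to one of 256 equally spaced angles, in radians."""
    return 2.0 * math.pi * b / 256.0

def byte_to_shift(b, max_shift=5.0):
    """Map one byte to a translation in [-max_shift, +max_shift] (assumed range)."""
    return (b / 255.0) * 2.0 * max_shift - max_shift

def rotate(p, ax, ay, az):
    """Rotate point p about the x axis, then the y axis, then the z axis."""
    x, y, z = p
    y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
    x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
    x, y = x * math.cos(az) - y * math.sin(az), x * math.sin(az) + y * math.cos(az)
    return (x, y, z)

def apply_chromosome(chromosome, coords):
    """Decode the assumed layout [tx, ty, tz, rx, ry, rz] and move the molecule."""
    tx, ty, tz = (byte_to_shift(b) for b in chromosome[:3])
    ax, ay, az = (byte_to_angle(b) for b in chromosome[3:6])
    moved = []
    for p in centre_at_origin(coords):
        x, y, z = rotate(p, ax, ay, az)
        moved.append((x + tx, y + ty, z + tz))
    return moved
```

With this layout, a fitness function would score the coordinates returned by `apply_chromosome` against the fixed target molecule.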
They used crossover and mutation as genetic operators, and they tested one-point crossover, two-point crossover and uniform crossover. The best results were obtained with two-point crossover. The mutation operator checks each individual bit of the chromosome in turn and then flips it (changing it from zero to one and vice versa). To choose between crossover and mutation, a number in the range 0-100 is generated; if the number is less than the crossover rate, crossover is performed, otherwise mutation. There are some problems with the field-graph approach. Firstly, the experiments reported that this algorithm is not very robust. Secondly, the generation of each graph needs a single, fixed MEP as input, so the generation mechanism would have to be repeated many times to create a database for flexible searching (with consequent storage and processing costs). Therefore, they prefer genetic algorithms to the field-graph approach, especially since genetic algorithms have previously been shown to be well suited to the processing of flexible molecules.

According to Willett (2006), similarity searching using 2D fingerprints is one of the simplest virtual screening tools, and it is widely used in the early stages of lead-discovery programmes. In this paper the author summarised the results of studies that sought to increase the effectiveness of current systems for similarity-based virtual screening. He found that, if there is no specific information about the sizes of the molecules required for testing, the Tanimoto coefficient is the coefficient of choice for computing molecular similarities. Willett states that there are two main types of virtual screening systems. The first is the popular structure-based approach, for example docking and de novo design, which can be used when the 3D structure of the biological target is available.
The second is the ligand-based approach, which is applicable in the absence of such structural information. Examples include pharmacophore methods, which involve identifying the pharmacophoric pattern common to a set of known actives and using this pattern in a subsequent 3D substructure search; similarity methods, on which the author focuses; and machine learning methods, in which a classification rule is developed from a training set containing known active and known inactive molecules.

The basic idea underlying similarity-based virtual screening is that molecules that are structurally similar are likely to have similar properties. Therefore, the strategy of virtual screening involves computing the similarity between each of the molecules in a database and the known reference structure, ranking the database molecules in decreasing order of the computed similarities, and then carrying out real screening on just the top-ranked database molecules. He mentioned that the measure used to quantify the degree of resemblance between the reference structure and each of the structures in the database is at the heart of any system for similarity-based virtual screening. A similarity measure involves three components: a method of representing the molecule so that it can be compared with others (the 2D fingerprint is the structural representation on which the author focuses), a weighting scheme that assigns differing degrees of importance to the various components of these representations, and a function that computes the degree of resemblance between two structural representations. The similarity coefficient used for comparing fingerprints is the Tanimoto coefficient.
It supposes that two molecules have a and b bits set in their fragment bit-strings, with c of these bits set in both fingerprints; the Tanimoto coefficient is then defined as c / (a + b - c).

According to Yadgary, Amir and Unger (1998), the chemical and physical properties of a protein molecule depend on its three-dimensional structure, and the structure of a protein is the key to gaining insight into its function; computing the three-dimensional structure from the amino acid sequence is therefore a way to obtain those properties. Today, protein structures are commonly determined by X-ray crystallography and NMR spectroscopy. Calculating the structure of a protein directly from its sequence is not possible, since it requires the minimisation of a function of thousands of variables, with constants that have not been accurately determined. Instead, the authors describe another approach, threading. Threading predicts the three-dimensional fold of a protein sequence by recognising a known structure with which the sequence might be compatible. In this approach, a given sequence is threaded through a given target structure by searching for a sequence-structure alignment that places sequence residues in preferred structural positions. The authors suggest using genetic algorithms to obtain the optimal sequence-structure alignment. Threading thus predicts protein structure by threading the sequence of one protein through the known structure of another. In the absence of detectable sequence similarity, this method has proved itself in recognising the similarity of a sequence to a protein of known structure.
To design a threading procedure, one needs an algorithm to align the residues of the sequence with a structure and a fitness function to evaluate the quality of the alignment. Knowledge-based potentials and energy functions are obtained from a database of known protein structures and depend on the analysis of known three-dimensional structures of proteins using statistical physics. According to the authors, the first step in using a genetic algorithm is to represent the solutions as strings; these strings are maintained as a population and allowed to interact. The interaction takes place through genetic operators such as mutation, crossover and replication. They used the alphabet {0, 1} to represent an individual in the population. A "1" in the string means that a residue from the sequence is aligned to that position of the structure, and a "0" means that no residue is aligned there. A number N greater than 1 means that N - 1 residues are skipped, that is, not aligned to any structure position. After operators such as crossover, mutation and replication have been applied, the sum of the numbers in each string has to equal the length of the threaded sequence, and the length of the string has to equal the length of the structure. A string with a lower normalised energy value has a higher fitness value and therefore a higher chance of taking part in the genetic operators and contributing to the next generation. Mutations were performed by randomly increasing the value of one number and offsetting the change by decreasing other positions by the same amount. Crossovers were performed by choosing a position at random and building two new offspring by concatenating the prefix of one parent, up to the chosen position, with the suffix of the other.
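The string encoding and its two invariants can be made concrete with a short sketch. This is only an illustration under assumptions, not the code of Yadgary, Amir and Unger: the function names are hypothetical, the mutation moves a single unit from one position to another (the text only says that an increase is offset by an equal decrease elsewhere), and the crossover is a plain one-point crossover with no repair step, since the text does not say how the authors restore the sum after crossover.

```python
import random

def is_valid(threading, structure_len, sequence_len):
    """Check the two invariants: one entry per structure position,
    and the entries sum to the length of the threaded sequence."""
    return (len(threading) == structure_len
            and all(v >= 0 for v in threading)
            and sum(threading) == sequence_len)

def mutate(threading, rng=random):
    """Increase one entry by 1 and decrease another by 1, so the sum is unchanged."""
    child = list(threading)
    donors = [i for i, v in enumerate(child) if v > 0]
    if not donors or len(child) < 2:
        return child
    give = rng.choice(donors)                       # position that loses one residue
    take = rng.choice([i for i in range(len(child)) if i != give])
    child[give] -= 1
    child[take] += 1
    return child

def one_point_crossover(a, b, rng=random):
    """Swap the tails of two parents at a random cut point.
    The children keep the right length, but their sums may differ from the
    sequence length, so callers should check is_valid (repair is not shown)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```

For example, the string `[1, 0, 2, 1, 1]` threads a five-residue sequence through a five-position structure: the second position receives no residue and the third position carries one skipped residue.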
One of the problems the authors met is early convergence of the population to a single high-fitness individual. This is common with genetic algorithms, and it makes the genetic process meaningless if it continues: generating new populations is no longer useful because each new population is the same as the last. The common solution is to maintain high diversity in the population, either by temporarily using a high mutation rate for a number of generations and then decreasing it again, or by preventing the creation of solutions that appear frequently. To avoid early convergence in this paper, the authors used tree techniques. One good way to prevent redundant solutions is to use a tree data structure for string comparisons. In conclusion, they found that a higher mutation rate gives better results, but that the rate should not be too high, because it then does not provide enough stability in the population to promote good solutions. Moreover, since rigid limitations are a reason for failing to find good alignments, the genetic algorithm threader uses a representation designed to allow full freedom in choosing positions for insertions and deletions.

Willett (n.d.) discussed the docking of flexible ligands into protein active sites. In this research the conformation of the molecule is encoded by a real-valued or integer-valued chromosome, in which the i-th element represents the torsion angle of the i-th rotatable bond. The fitness function is the energy of the specified conformation, calculated by one of several standard molecular-modelling packages. The algorithm searches for the set of torsion angle values that minimises the calculated energy.
The author describes a study of 72 structurally diverse molecules chosen from the Cambridge Structural Database, each containing between one and twelve rotatable bonds. The number of individuals in the population was ten times the number of torsion angles in the molecule, and six bits were used to represent each torsion angle. Low-energy conformations play a key role in determining the physical and biological properties of a molecule, and there is much interest in ascertaining the stable conformations that flexible molecules can adopt. Each individual consists of four strings, two for mapping and two for the torsion angles of rotatable bonds (one in the ligand and one in the protein active site). He used a routine that determines the hydrogen-bonding energy. The inputs to the genetic algorithm are the size and location of the ligand that is docked into the receptor site and the size and location of the receptor site itself. The outputs are the protein and ligand conformations associated with the fittest individual in the last population. The author found that systematic search is the most common approach to conformational analysis: each torsion angle is rotated systematically by some fixed increment. The problem with this approach is the sheer number of conformations that need to be examined. For instance, a systematic search with a 30° torsion increment for a molecule containing twelve rotatable bonds would require about 9 × 10^12 energy calculations. Thus, this approach is feasible only if there are very few rotatable bonds in a molecule. Willett ran the genetic algorithm with a population of randomly generated chromosomes as input, for a maximum of 10000 energy evaluations. Improvement was noticed after about 5000 evaluations.
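The size of the systematic search quoted above can be checked with a few lines of arithmetic. The sketch below is only a worked calculation; the function names are hypothetical, and it assumes that every torsion angle is stepped through the full 360° range.

```python
def systematic_search_size(increment_deg, n_rotatable_bonds):
    """Number of conformations when every torsion is stepped by a fixed increment."""
    steps_per_bond = 360 // increment_deg
    return steps_per_bond ** n_rotatable_bonds

def torsion_resolution(bits_per_angle):
    """Angular step, in degrees, of a torsion angle stored in the given number of bits."""
    return 360.0 / (2 ** bits_per_angle)

conformations = systematic_search_size(30, 12)
print(conformations)           # 8916100448256, about 9 x 10^12
print(conformations // 10000)  # 891610044: grid size relative to a 10000-evaluation budget
print(torsion_resolution(6))   # 5.625
```

A 30° increment gives 12 values per bond, so twelve bonds give 12^12 conformations, which is several hundred million times the 10000 energy evaluations allowed to the genetic algorithm. Six bits per torsion angle correspond to 64 values, or a step of 5.625°.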
He also used another approach, a SYBYL routine, and found that it was faster than the genetic algorithm for molecules containing small numbers of rotatable bonds, but that the genetic algorithm was faster for molecules containing more than 7 or 8 bonds, with the difference increasing as the number of rotatable bonds increased. Therefore, the genetic algorithm provides an effective way of exploring the conformational space of flexible molecules. It also works at sufficient speed to allow the conformational analysis of highly flexible molecules that are too time-consuming to investigate with alternative conformational-searching algorithms. He mentioned that rational approaches to drug design seek molecules that are complementary to the receptor site, making use of NMR and X-ray information about the binding-site geometry of a protein. These approaches assumed that the ligand molecules are completely rigid and that the suitability of a molecule as a ligand depends on its steric complementarity with the site. They did not take into account the ability of the ligand to displace water and form hydrogen bonds with the active site. The genetic algorithm seeks to overcome these two limitations.

According to Wild and Willett (1995), molecular electrostatic potentials are well suited to calculating intermolecular similarity in databases of three-dimensional chemical structures, and they used electron densities to measure the similarity between two molecules. They used an equation known as the Carbo index. It is based on the cosine coefficient, which is a suitable measure when using a genetic algorithm approach. For example, given an initial lead in a drug or pesticide discovery programme, similarity searching involves matching some target molecule of interest against all of the molecules in a database to find those molecules that are most similar to the target.
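The cosine form of the Carbo index and its use for ranking a database can be sketched as follows. This is a minimal sketch under assumptions, not the procedure of Wild and Willett: the function names are hypothetical, each molecule is assumed to be already aligned and sampled on the same grid, and the grid is flattened into a plain list of values.

```python
import math

def carbo_index(grid_a, grid_b):
    """Cosine coefficient between two equally sized lists of grid values."""
    dot = sum(a * b for a, b in zip(grid_a, grid_b))
    norm_a = math.sqrt(sum(a * a for a in grid_a))
    norm_b = math.sqrt(sum(b * b for b in grid_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero grid has no defined direction
    return dot / (norm_a * norm_b)

def rank_database(target, database):
    """Return (similarity, name) pairs, most similar to the target first."""
    scored = [(carbo_index(target, grid), name) for name, grid in database.items()]
    return sorted(scored, reverse=True)
```

The index is 1 for grids that differ only by a positive scale factor and -1 for grids of opposite sign, so it compares the shape of the two potentials and not their magnitude. In a genetic algorithm this value could serve directly as the fitness of a candidate alignment.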
The authors believe that genetic algorithms provide both an effective and an efficient mechanism for investigating a range of complex chemical matching problems, such as generating near-maximal common subgraphs from pairs of large two-dimensional structures, docking flexible ligands into protein active sites, and searching for flexible three-dimensional substructures. They used the genetic algorithm as follows. The genetic algorithm acts as a search mechanism that identifies a combination of translations and rotations that aligns one molecule with another. Every chromosome has five components, three for translation and two for rotation. For rotation they use two planes, each encoded as an eight-bit binary number, which allows 256 possible rotations. For translation they also use binary numbers, scaled to the maximum permitted range. The chromosomes are initialised randomly and then decoded by applying the indicated translations and rotations to the three-dimensional coordinates of the molecule being aligned. The fitness function is based on the Gaussian similarity calculation, and the resulting coordinates are passed to this function to be evaluated. They found that uniform crossover gave the best results for maintaining diversity in the population, and that a crossover rate of 20% gave the best result for this problem. They achieved diversity in the initial population by ensuring that all of the individuals had a large Hamming distance between them, where the Hamming distance between two binary individuals is the number of corresponding bits that differ between the two strings. This technique was found to prevent early convergence. In each iteration, the least fit individuals are discarded and replaced by fitter individuals.
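The Hamming-distance check on the initial population could look like the sketch below. It is an illustration under assumptions, not the code of Wild and Willett: the function names, the rejection-sampling strategy, the 40-bit chromosome (five components of eight bits each) and the minimum distance of 12 are all hypothetical, since the text gives neither the translation bit width nor the distance threshold.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def diverse_population(size, n_bits, min_distance, rng=random, max_tries=10000):
    """Draw random bit strings, keeping only those that are at least
    min_distance away from every individual accepted so far."""
    population = []
    tries = 0
    while len(population) < size and tries < max_tries:
        tries += 1
        candidate = [rng.randint(0, 1) for _ in range(n_bits)]
        if all(hamming(candidate, other) >= min_distance for other in population):
            population.append(candidate)
    return population

# Assumed sizes: 10 individuals, 40 bits each, pairwise distance of at least 12.
population = diverse_population(10, 40, 12, random.Random(1))
```

The `max_tries` limit stops the loop if the requested distance is too large for the chromosome length, in which case fewer individuals than requested are returned.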
They used crossover and mutation to produce each new generation; they tried single-point crossover, two-point crossover and uniform crossover, of which uniform crossover was the best they found. They also used a simple bit-flip mutation with some probability (1/i). In this research they also used roulette-wheel selection to select the fittest individuals. They did not use Gray coding, which is a way of representing binary strings in which incrementing or decrementing a number always requires a change of only one bit. For example, in the standard binary representation the number 3 is 011 and the number 4 is 100, so for a random mutation to go from 3 to 4 all 3 bits must be flipped. In Gray code, 3 is represented by 010 and 4 by 110, so changing from 3 to 4 requires only the first bit to be flipped.

We can convert between binary and Gray code. Given a binary number b = (b1, b2, …, bm), we can change it to its Gray-coded representation g = (g1, g2, …, gm) by setting g1 = b1 and gk = b(k-1) XOR bk for k > 1. Changing from binary to Gray code affects the distance between different solutions and therefore the fitness landscape explored by operators such as mutation.

In the experiments in this paper, they demonstrate that the genetic algorithm leads to similarities that are comparable in effectiveness for database searching to those resulting from the approach based on field graphs, and superior to those resulting from the use of a bit-climber. Moreover, the genetic algorithm leads to more robust alignments than a simplex optimization procedure does. The authors found some weaknesses in the field-graph approach, which is far more complex and time consuming owing to the need to generate the graphs from the electrostatic potential grid before the search can be carried out. Also, for some molecules the field graph does not contain sufficient nodes to enable those molecules to be aligned with a target molecule.
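The Gray coding discussed above (which Wild and Willett did not use) can be sketched with the standard reflected-binary conversion. The function names are hypothetical; the example reproduces the 3 and 4 case from the text and shows that neighbouring integers always differ in exactly one bit.

```python
def binary_to_gray(n):
    """Reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_to_binary(g):
    """Inverse conversion: XOR together all right shifts of the Gray code."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def bits(n, width=3):
    """Fixed-width bit string, used for display only."""
    return format(n, "0{}b".format(width))

for value in (3, 4):
    print(value, bits(value), bits(binary_to_gray(value)))
```

With this encoding a single bit-flip mutation can move a gene from 3 to 4, whereas the plain binary encoding needs three simultaneous flips.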
They used four strings to represent the chromosome: two strings use a binary representation and two use an integer representation. The first binary string represents the ligand and the second the protein, where the angle of rotation about each rotatable bond occupies one byte of the string. Possible hydrogen bonds between the protein active site and the ligand are encoded as mappings in the integer strings. For example, the first integer string encodes a mapping from hydrogen atoms of the protein to lone pairs of the ligand, and the second string encodes the inverse mapping.

Payne and Glen (1993) used a genetic algorithm as a way to optimize the fit of flexible molecules to a set of restrictions. The restrictions may be shape similarity, charge distribution or intermolecular distance constraints. When X-ray crystallographic analysis is used to determine the structure of an active site, the problem is how to dock the ligand into this active site. Molecular modelling techniques are used to compare dissimilar molecules and to generate conformations. In addition, there are numerical methods based on, for example, atom charge distribution, comparison of electron densities, dipole moments, volume overlaps, and electrostatic and lipophilicity potentials. The algorithm here receives the current molecule and converts it from phenotype into genotype. In particular, it takes the coordinates of all atoms belonging to the molecule and converts them into a string of bits, and this string represents one individual in the first population. It then applies the genetic algorithm operators, which are crossover, selection and mutation, to obtain a new generation. Several steps have to be followed to carry out the algorithm. Firstly, they have to find a good way to represent the problem.
Secondly, they have to use a distance method to do the comparison. Thirdly, when they use X-ray crystallographic analysis to determine the structure of an active site, it is important to know how to dock the ligand into the active site of the protein. Finally, they have to define a set of restrictions against which to compare and fit the molecule. The authors used binary strings to represent each individual. They broke the string down into four segments. The first segment represents the translation of the molecule along the three axes x, y and z. The second segment represents the rotation of the whole molecule around all the axes. The third segment represents rotations around each rotatable stem (or bond). The fourth represents the conformation of rings.

The methodology of Richmond's research is to find an alignment algorithm that superimposes atoms in one molecule onto similar atoms belonging to a different molecule. To do so, Richmond et al (2004) followed these steps:
Step 1- Identify a set of candidate equivalent atoms which belong to different molecules and are similar in terms of local geometry.
Step 2- Filter the set of candidate equivalent atoms by discarding the pairs which cannot be overlaid by any alignment transformation.
Step 3- Calculate the alignment transformation which overlays the molecules so that the pairs in the filtered set of atom equivalences are superimposed.
Step 4- Repeat the alignment and calculate a new set of atom equivalences, comparing the atoms to identify pairs whose distance is less than a user-defined threshold and using these for the next alignment.

The procedure to match two 2D shapes is as follows. Firstly, identify the corresponding points which belong to shape A and shape B. Secondly, calculate the morphing transformation, which maps the points on the first shape to their corresponding points on the second shape.
Finally, determine the similarity between the two shapes by calculating the sum of the matching errors of the corresponding points on both of them. Each shape is represented by a discrete set of points sampled from the external or internal contours of the shape using an edge detector. The more points are used, the more accurate the description of the shape.

Over recent years the folding problem has become one of the most challenging problems in computational chemistry, especially the mechanism of folding. Genetic algorithms have become a common way to search the space in this field. Each possible solution is represented by an encoded individual, or string, to change it from phenotype into genotype. For instance, to represent the conformation of a molecule, they construct an individual which consists of a number of real numbers, where each real number represents the angle of rotation around a flexible bond in the molecule. The genetic algorithm begins with a population, which is a number of individuals that have been created randomly. While the algorithm is running, the authors use a fitness function to evaluate each individual and to decide whether it will participate in the next generation: individuals with high fitness participate in the next generation, and individuals with low fitness do not (Jones ND).

Daeyaert et al (2005) describe how to use genetic algorithms to find the similarities between two molecules in space. They mention that structure-based drug design methodologies require a proper alignment of the molecules. The methodologies which they mention are Comparative Molecular Field Analysis (COMFA) and Comparative Molecular Similarity Indices Analysis (COMSIA).
The authors used multi-objective function optimization to superimpose flexible source molecules onto rigid target molecules. They depend on two things: the similarity score between the source and target molecules and the conformational strain of the source molecule; the first has to be maximized and the latter has to be minimized. The aim of this function is to minimize the squared distance between the target and source molecules. To rank the final individuals, they used a fast non-dominated sorting algorithm. They used elitism to ensure the survival of solutions with high fitness, and several operators to provide diversity in the population. Each individual, or vector, in this search represents the alleles by real numbers: the first three positions represent a translation of the source molecule along the x, y and z axes, positions 4 to 6 represent the Euler angles that determine the orientation of the source molecule, and the rest of the individual represents the values of the torsion angle of each rotatable stem (bond) in the source molecule. They mentioned that before beginning the genetic algorithm, the coordinates of the target molecules have to be centred.

Xu et al (2003) mentioned that the physical properties and biological behaviour of a molecule usually depend on its accessible, low-energy conformations; therefore, fast and reliable computational methods for producing conformations are extremely valuable. They used an algorithm which produces molecular conformations that are compatible with a set of geometric restrictions. These restrictions include interatomic distance bounds derived from the molecular connectivity table. In this work they used the Merck Molecular Force Field to calculate the potential energies. They mentioned that the main advantage of this work is that it obtains a greater diversity of conformations.
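The two-objective ranking attributed to Daeyaert et al earlier in this section (similarity to be maximized, conformational strain to be minimized) rests on the idea of Pareto dominance. The sketch below is a plain non-dominated filter, not the fast non-dominated sorting algorithm itself, and the (similarity, strain) values are hypothetical numbers chosen only for illustration.

```python
def dominates(a, b):
    """True if solution a is at least as good as b in both objectives and better in one.

    Each solution is a (similarity, strain) pair: similarity is maximized,
    strain is minimized.
    """
    sim_a, strain_a = a
    sim_b, strain_b = b
    no_worse = sim_a >= sim_b and strain_a <= strain_b
    strictly_better = sim_a > sim_b or strain_a < strain_b
    return no_worse and strictly_better

def non_dominated(solutions):
    """Keep the solutions that no other solution dominates (the first Pareto front)."""
    return [s for s in solutions if not any(dominates(other, s) for other in solutions)]

# Hypothetical (similarity, strain) scores for four candidate alignments.
candidates = [(0.9, 5.0), (0.8, 2.0), (0.7, 3.0), (0.9, 6.0)]
print(non_dominated(candidates))
```

Neither of the two surviving candidates is better than the other in both objectives, which is why a multi-objective search returns a set of trade-off solutions rather than a single best individual.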
The authors focused on several enhancements to generate better initial geometries and to detect and eliminate conformations which are likely to lead to the same local minima, as well as on the use of this technique for protein structure prediction, pharmacophore modelling and ligand docking. Many problems have been solved successfully using distance geometry, such as NMR structure determination, conformational analysis, ligand docking and protein structure prediction. In this work the volume and distance constraints reduce the number of conformations accessible to the molecule and hence the search space. The general distance geometry method is a self-organizing algorithm that works like a fitness function: it tries to minimize an error function that measures the violation of the geometric restrictions.

According to Nicholas E. Jewell (2001), 3D QSAR methods such as COMPASS, COMSIA, COMFA and HASL are essential to the design of bioactive molecules. It is very important for 3D QSAR methods to obtain an alignment of the molecules in the dataset as an input for the calculation of the structural variables. They also stated a method for finding the optimal way to bring two molecules into alignment. They described the main features of FBSS (field-based similarity searching) and reported a simple validation experiment that supports the use of FBSS-based alignments in 3D QSAR analyses. They used FBSS as the prerequisite to the 3D QSAR procedure and compared the results with those obtained from conventional manual alignments. Their aim was to provide an approach which is complementary to, and not a replacement for, manual alignment. Such a program is essential for implementing 3D QSAR methods, especially COMSIA and COMFA. The authors describe many different measures for calculating intermolecular structural similarity.
Carbo et al (1980) describe one approach which involves the use of molecular field descriptors, and this approach was further developed by Good et al (1992). The approach places the molecule at the centre of a 3D grid and calculates the value of a molecular field, for instance the electrostatic potential of the molecule, at each point of the 3D grid. To find the degree of similarity and difference between two molecules, they aligned the corresponding grids to find the best possible fit, using one of the distance methods to do so. FBSS is software which uses a genetic algorithm to align molecular fields, based on field-based similarity measures, for similarity searching in chemical structure databases. For each individual, or chromosome, the FBSS genetic algorithm encodes the translations and rotations which are applied to a structure to align it with a target structure, and the value of the similarity coefficient obtained from the encoded alignment is used as the fitness function.

3 Research Methodology

In this research I developed a new algorithm to find the optimal alignment for a group of molecules. I used the Java language to write the program that performs this algorithm. This alignment serves as an input for a method called Comparative Molecular Similarity Indices Analysis (COMSIA), which is a 3D method to predict and correlate molecules' biological activities. I obtained the optimal alignment by using the genetic algorithm to apply a transformation (translation + rotation) to each molecule. For each transformation I compared the new figure with other databases to determine whether it is a good solution or not. For this comparison I needed a good method to find the distance between the sample and the target; therefore, I tried to find the best distance function for this comparison.
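The idea of applying a transformation (translation + rotation) to a molecule and then measuring its distance from the target can be sketched as follows. This is a simplified illustration written in Python for brevity, not the Java program of this thesis: it rotates about the z axis only, and it assumes the root-mean-square deviation (RMSD) between corresponding atoms as the distance function. The coordinates are hypothetical.

```python
import math

def rotate_z(point, angle_deg):
    """Rotate a 3D point about the z axis by the given angle in degrees."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def transform(points, angle_deg, shift):
    """Rotate every point about the z axis, then translate it by shift."""
    dx, dy, dz = shift
    moved = []
    for point in points:
        x, y, z = rotate_z(point, angle_deg)
        moved.append((x + dx, y + dy, z + dz))
    return moved

def rmsd(points_a, points_b):
    """Root-mean-square deviation between corresponding points (assumed distance)."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("both point sets must be non-empty and of equal size")
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(points_a, points_b))
    return math.sqrt(total / len(points_a))

# Hypothetical atom coordinates; the target is the sample after a known transformation.
sample = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0)]
target = transform(sample, 90.0, (1.0, 0.0, 0.5))
print(rmsd(sample, target))                                   # before alignment
print(rmsd(transform(sample, 90.0, (1.0, 0.0, 0.5)), target)) # after the right transformation
```

In a genetic algorithm the rotation angle and the shift would be the genes of a chromosome, and a distance of this kind (or a similarity derived from it) would serve as the fitness value.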
This research will be useful in the future for dealing with any shape, not only molecules.

4 Outline of Thesis

Chapter Two: “Principles of the Genetic Algorithm and how John Holland mimicked Darwin's theory (Evolution and Natural Selection) to invent the Genetic Algorithm; Genotype and Phenotype”.
Chapter Three: “Molecules and Drugs and how to use QSAR and its methods (Comparative Molecular Field Analysis, CoMFA, and Comparative Molecular Similarity Indices Analysis, CoMSIA)”.
Chapter Four: “Description of Implementing the Genetic Algorithm to align some molecules and how to use the fitness function to obtain the optimal alignment”.
Chapter Five: “The Deduction and Future Work (Pharmacophore)”.

Chapter 2 Genetic Algorithm

1 Introduction

In this chapter I discuss Darwin's theory, which involves the mechanism of evolution and natural selection (Darwin 1859). Then I describe how John Holland used this theory to invent the idea of the genetic algorithm. Phenotype and genotype are the hardware and software of organisms. I state the principles, steps, elements, operations and process of the genetic algorithm and how to use them to provide diversity and so offer more solutions to the problem. The operations of genetic algorithms are crossover (single point, double point and uniform crossover) and mutation (mutation factor 1m and mutation factor 2m).

2 Darwin's Theory of Evolution - Natural Selection

Two important things in Darwin's theory have been mimicked in genetic algorithms: the mechanism of evolution and natural selection. Both are discussed below.

2-1 Evolution

Darwin's Theory of Evolution presumes that the development of life is a slow, gradual process that began from non-life or simple life (a simple solution in genetic algorithms) and progresses towards more complex, better-adapted forms (the optimal solution in genetic algorithms). In other words, complex creatures evolve naturally over time from simpler ancestors.
The process began in the sea about three billion years ago, when complex chemical molecules started to clump together to form microscopic blobs (cells). These cells were the seeds of the tree of life. They had the ability to split and replicate themselves as bacteria do, and over time they diversified into different groups. Some of these groups remained connected together and formed chains, which are called algae. Others collapsed in on themselves and formed hollow balls, creating a body with an internal cavity; these we call multi-celled organisms, and sponges are their direct descendants. The tree of life became more complicated and diverse over time as more variation appeared. Some of these organisms had the ability to move and developed a mouth that opened into a gut. Meanwhile, other organisms had a rod inside their bodies which made them stronger, and sense organs then developed around their front end. Some groups had bodies which were divided into segments, with little projections on either side which helped them move on the sea floor, and they then acquired hard, protective skins that gave their bodies some rigidity. These creatures filled the sea with life. Roughly 450 million years ago, some of these armoured creatures came out of the water onto the land, and here the tree of life branched into a multitude of different species that exploited this new environment in all kinds of ways. Some of these groups developed elongated flaps on their backs, and over many generations these eventually developed into wings; we now call these creatures insects. Life had taken to the air and diversified into many forms. At the same time, some organisms in the sea had also changed: the stiffening rod in their bodies became bone, and a skull developed in front of it with a hinged jaw that could grab and hold onto prey.
These creatures grew bigger and gained the ability to swim with power and speed, because they developed fins equipped with muscles. We now call these creatures fish, and they dominate the waters of the world. Some of these creatures gained the ability to gulp air from the water surface, and their fleshy fins became weight-supporting legs. About 375 million years ago, a few of these backboned creatures followed the insects onto the land; they had wet skin and had to return to water to lay their eggs. We call these amphibians. Some of them evolved dry, scaly skins and broke their link with water by laying eggs with watertight shells. These creatures, the reptiles, were the ancestors of today's tortoises, lizards, crocodiles and snakes. Some of them grew bigger and became the dinosaurs, which dominated the land, but about 65 million years ago a great disaster killed all of them except one branch whose scales had developed into feathers; we now call these birds. At the same time, an insignificant group of survivors began to increase in numbers on the ground beneath; they differed from their competitors in that their bodies were warm and insulated with coats of fur. These were the first mammals. With few other creatures left to compete with, they had a good chance of surviving and spreading, and their warm, insulated bodies enabled them to be active in all places, from the tropics to the Arctic, on land as well as in water, on grassy plains and up in the trees, and at all times, at night as well as during the day (information from the DVD Charles Darwin and the Tree of Life, produced by Sacha Mirzoeff, released 2009).

2-2 Natural selection

Natural selection acts to keep and accumulate minor advantageous genetic mutations. Suppose a member of a species developed a trait; for example, it grew wings and learned to fly.
Its offspring would inherit that feature and pass it on to their own offspring. The members of the same species with inferior traits would gradually die out, leaving only the members with superior traits. Natural selection is the preservation of features that enable a species to compete better in the wild. It is similar to domestic breeding. Over the centuries, human breeders have produced dramatic changes in domestic animal populations by selecting individuals to breed. Breeders eliminate undesirable features gradually over time. Similarly, natural selection eliminates inferior species gradually over time.

To explain further, consider one example. In the wild there is a population of rabbits; some of them are smart and some dumb, some fast and some slow. The slower and dumber rabbits are more likely to be eaten by foxes, whereas the smart and fast ones have a better chance of surviving and breeding to produce a new generation of rabbits. Of course, some of the slower and dumber rabbits will survive, perhaps because they are lucky, but their population will be smaller than that of the smart and fast ones. Generation by generation, the smart and fast rabbits come to outnumber the other types in the wild, because there are more parents of their type. This is what we call natural selection, of which the foxes are a part (Michalewicz 1999, p.14).

3 Phenotype and genotype in the nature

Phenotype refers to the physical parts of a living organism, such as the sum of its atoms, molecules, macromolecules, cells, structures, metabolism, energy utilization, tissues, organs, reflexes and behaviours. It includes anything that is part of the observable structure, function or behaviour of a living organism. The phenotype of an organism is the physical expression of that organism's genotype. Genotype is the "internally coded, inheritable information" carried by almost all cells of all living organisms.
It is used as a “blueprint” or set of instructions for building and maintaining a living organism. This information is written in a coded language (the genetic code) and is encoded in the genes of an organism. These genes are connected together into long strings called chromosomes. The genes and their settings are referred to as the organism's genotype. Each gene and its settings represent a specific trait of an organism, like eye colour or hair colour. For example, a hair colour gene and its settings determine whether hair is blonde, black or auburn. Occasionally a mutation can occur in a gene, which can result in a completely new trait being expressed in an organism. This is rare, as a mutated gene does not normally affect the development of the phenotype of an organism.

Genetic information is copied at the time of cell division or reproduction. This copied information is passed from generation to generation and for this reason is said to be “inheritable”. When two organisms mate to reproduce, the resulting offspring get a share of each organism's genes. The process is called recombination and involves the offspring getting half its genes from one parent and half from the other. These instructions are very important in all aspects of the life of a cell or organism. They contain the information for many vital functions, such as the formation of protein macromolecules and the regulation of metabolism and synthesis. Genotype and phenotype in genetic algorithms are illustrated in the diagram below:

Figure 1- Genotype and Phenotype (Michalewicz 2010)

4 Genetic Algorithms

In 1975 John Holland and his students developed genetic algorithms at the University of Michigan.
The goals of their research were to explain the adaptive processes of natural systems and to design artificial software that retains the important mechanisms of natural systems. This approach has led to new and important discoveries in both artificial and natural system science. A genetic algorithm is a computational model based on accepted theories of biological evolution and natural selection. It is useful as a research method for solving problems and for modelling evolutionary systems. It relies on stochasticity and diversity to find the optimal solution to a problem, and it usually encodes solutions as binary numbers or real numbers. The mechanism of a genetic algorithm is to create an initial population, which is a number of individuals (chromosomes), each representing one possible solution to the problem, and then to perform a loop in which pairs of parents are selected and crossover or mutation is applied to produce new offspring. The fitness function evaluates each new offspring, and the selection method decides whether it will be in the next generation or not. Problems whose structure is not compatible with genetic algorithms will be very difficult to solve in this way. However, the structure of the molecule alignment problem is clear, so it is feasible to optimize it with genetic algorithms (Michalewicz 1999).

Genetic algorithms borrow some vocabulary from natural genetics. For example, the individuals (genotypes or structures) in a population are sometimes called strings or chromosomes. Comparing genetic algorithms with nature, each organism in nature carries a certain number of chromosomes; for instance, humans have 46 chromosomes. In genetic algorithms, however, each candidate solution is one chromosome (individual, string or structure).
Each chromosome in nature has a number of units which are called genes; in genetic algorithms these units are called features.

5 Genetic algorithm analogy

Constructing GAs on the basis of an analogy to evolutionary biology requires a considerable mental transition, because the encoding mechanism is so different in the two cases. The way in which genes are manipulated, combined and expressed is very different in the biological and the genetic algorithm cases. With GAs, there is a much greater distance between mathematically encoded optimization and the field of evolutionary biology from which the inspiration for the method is derived. Consequently, the language and concepts transferred are much more subject to reinterpretation. For example, a gene and a numerical encoding called a gene are not the same. Reaping the benefit of the genetic analogy first requires reinterpretation before the surprising possibilities of the analogy can be exploited (Michalewicz 1999).

6 The structures of genetic algorithm

To perform genetic algorithms we require these components:
A way of encoding solutions to the problem as a chromosome (phenotype to genotype).
An evaluation function, which returns a rating for each chromosome given to it.
A way to initialize the population of chromosomes.
Operators that may be applied to parents when they reproduce to alter their genetic composition; the standard operators are mutation and crossover.

7 Genetic algorithm steps:

Initialize a population by a certain procedure and evaluate each individual in the initial population.
Choose one of the genetic algorithm operators and apply it to the parents as a way to get more diversity.
Obtain new offspring by choosing one or two parents to reproduce.
Although the individuals with high fitness are favoured, the selection is stochastic.
Reproduce new generations until the stopping criteria are reached.

Figure 2- Mechanism of Genetic Algorithm (Michalewicz 2010)

8 Elements of Genetic Algorithm

1- Encoding
   Binary Encoding
   Integer Encoding
   Real Encoding
   Complex Encoding
2- Initial population
3- Evaluation
4- Genetic Algorithms Operations

9 Genetic Algorithm Operations

9-1 Crossover

Crossover is a way to get more diversity by exchanging information among individuals, creating the possibility of the right combination for better solutions (individuals). It takes two parents (two individuals), chosen by the selection method, which itself depends on the fitness function. It is performed by selecting a random position along the length of the individual and swapping all the genes after this position. As a result we get two new individuals which can participate in the next generation.

9-2 Crossover rate

The crossover rate is the probability with which information is swapped between two chromosomes (individuals). A good value for the crossover rate is roughly 0.7.

9-3 Types of crossover

Single Point Crossover

It is the simplest type of crossover. It is very fast, but it produces less diversity than other types, especially when the population has similar individuals. It works by choosing a random position on the chromosome and swapping all the genes after this position between two chromosomes.

Figure 3- Single Point Crossover (Michalewicz 2010)

N Point Crossover

In this type of crossover more than one point is chosen at random, and all elements between these points are swapped to get two new chromosomes. It is fast and it leads to more diversity in the next generation.
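The steps of section 7 and the single point and N point crossover described above can be put together in a small, self-contained sketch. It is an illustration under several assumptions that the text does not fix: chromosomes are lists of bits, parents are chosen by fitness-proportionate (roulette-wheel) selection, the best individual is copied unchanged into the next generation (elitism), mutation is a bit flip with rate 0.001, and the toy fitness function simply counts the ones in the chromosome. All function names are hypothetical.

```python
import random

def single_point_crossover(parent1, parent2, rng):
    """Swap all genes after one random cut position."""
    cut = rng.randint(1, len(parent1) - 1)
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def n_point_crossover(parent1, parent2, n, rng):
    """Swap the genes lying between alternate pairs of n random cut positions."""
    cuts = sorted(rng.sample(range(1, len(parent1)), n)) + [len(parent1)]
    child1, child2 = list(parent1), list(parent2)
    start, swapping = 0, False
    for cut in cuts:
        if swapping:
            child1[start:cut], child2[start:cut] = child2[start:cut], child1[start:cut]
        start, swapping = cut, not swapping
    return child1, child2

def run_ga(fitness, length, pop_size, generations, rng,
           crossover_rate=0.7, mutation_rate=0.001):
    """Toy genetic algorithm; fitness values must be non-negative."""
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        scores = [fitness(individual) for individual in population]
        best = population[scores.index(max(scores))]
        history.append(max(scores))
        next_generation = [best[:]]  # elitism (an assumption of this sketch)
        weights = [score + 1e-9 for score in scores]  # avoid an all-zero wheel
        while len(next_generation) < pop_size:
            # Selection is stochastic, but fitter individuals are favoured.
            mother, father = rng.choices(population, weights=weights, k=2)
            if rng.random() < crossover_rate:
                children = single_point_crossover(mother, father, rng)
            else:
                children = (mother[:], father[:])
            for child in children:
                mutated = [1 - gene if rng.random() < mutation_rate else gene
                           for gene in child]
                next_generation.append(mutated)
        population = next_generation[:pop_size]
    return max(population, key=fitness), history

best, history = run_ga(sum, length=20, pop_size=30, generations=25, rng=random.Random(7))
print(sum(best), history[0], history[-1])
```

Because the best individual is always carried over, the best fitness in this sketch can never decrease from one generation to the next; without elitism that guarantee would not hold.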
Figure 4- N Points Crossover (Michalewicz 2010)

Uniform Crossover

Another type of crossover is uniform crossover, where a coin toss is performed at each position and the result of the coin toss determines whether or not an exchange of genes takes place at that position. It works by assigning 'heads' to one parent and 'tails' to the other, flipping a coin for each gene of the first child and giving the second child the gene from the other parent; therefore, inheritance is independent of position.

Figure 5- Uniform Crossover (Michalewicz 2010)

9-4 Mutation

Mutation randomly changes one or more components of a chromosome. With a binary representation, this usually means flipping bits, that is, changing bits from zero to one or vice versa. Whatever the representation, the principle of mutation remains unchanged.

9-5 Mutation rate

Genes in a chromosome are randomly selected with a certain probability (Pm), which is the chance that a bit in a chromosome will be flipped (zero becomes one, one becomes zero). The value of Pm is usually close to 0.001.

Figure 6- Mutation (Michalewicz 2010)

9-6 Types of Mutations

Mutation factor (1m)

In this mechanism the mutation happens on one gene only and the value of the gene is changed to an entirely new value; therefore, this factor allows a new value to enter the chromosome.

Mutation factor (2m)

In this method an existing value is swapped with another existing value. The characteristic of this factor is that it does not allow a new value to enter the chromosome; therefore, it preserves the gene values in the chromosome.

Figure 7- Mutation Factor 2m (Michalewicz 2010)

10 Conclusion

The genetic algorithm is a method which was invented by John Holland.
He mimicked Darwin's theory, which is the mechanism of evolution and natural selection. This chapter summarised the meaning of genotype and phenotype and how the features of an organism are inherited from generation to generation. The process and operations of the genetic algorithm, which are crossover and mutation, have also been discussed.

Chapter 3 Molecules and Drugs

1 Introduction

Most drugs which are used nowadays in human therapy interact with certain macromolecular targets. They block or activate the activity of these targets by binding to them. A molecule is an electrically neutral group of at least two atoms held together by covalent chemical bonds, where an atom is the basic unit of a molecule, consisting of a central nucleus surrounded by a cloud of negatively charged electrons.

2 Molecule

A molecule may consist of atoms of different elements, as with water (H2O), or of a single chemical element, as with oxygen (O2). Generally, atoms which are connected by non-covalent bonds such as hydrogen bonds or ionic bonds are not considered single molecules. The science of molecules is called molecular chemistry or molecular physics, depending on the focus. Molecular physics deals with the laws governing their structure and properties, while molecular chemistry deals with the laws governing the interactions between molecules that result in the formation and breakage of chemical bonds. Very reactive species of molecules are called unstable molecules (Brown 2003).

3 Small molecules

In the field of pharmacology, the term small molecule is usually restricted to a molecule that binds with high affinity to a biopolymer such as a protein, nucleic acid or polysaccharide and, in addition, alters the activity or function of the biopolymer. In the fields of pharmacology and biochemistry, a small molecule is a low molecular weight organic compound which is, by definition, not a polymer.
Small molecules can have a variety of biological functions, serving as cell signalling molecules, as drugs in medicine, as tools in molecular biology, as pesticides in farming, and in many other roles. These compounds can be artificial (such as antiviral drugs) or natural (such as secondary metabolites); they may have a beneficial effect against a disease (such as drugs) or may be detrimental (such as teratogens and carcinogens) (Barnum 1991).

4 Drugs
Drugs are usually small molecules with roughly 50 atoms. When a drug binds to a protein in the proper way, it increases the activity of the protein. When the drug binds to the active site of the molecule, it inhibits the key molecule; some approaches therefore cause a key molecule to stop functioning in an attempt to reduce the functioning of the pathway in the diseased state. However, to avoid side effects, drugs should not be designed in a way that affects other molecules that are similar in appearance to the target molecule. In the most basic sense, a drug is an organic small molecule that prevents or activates the function of a biomolecule such as a protein, and as a result provides a useful therapy for the patient (Lock 2007, p.1).

5 Quantitative structure-activity relationship
According to Patani and Lavoie (1996), a quantitative structure-activity relationship (QSAR) is a process by which chemical structure is quantitatively correlated with a well-defined process, such as biological activity or chemical reactivity. It is sometimes called a quantitative structure-property relationship (QSPR). Biological activity can be expressed quantitatively, for example as the concentration of a substance required to give a certain biological response.
In addition, when physicochemical properties or structures can be expressed by numbers, we can form a mathematical relationship, the quantitative structure-activity relationship, between the two. The mathematical expression can then be used to predict the biological response of other chemical structures. 3D-QSAR refers to the application of force field calculations that require three-dimensional structures, obtained for example from protein crystallography or molecule superimposition. It depends on computed potentials instead of experimental constants, and it uses the shape of the molecule and the electrostatic fields based on the applied energy function (Leach and Andrew 2001).

6 Quality of QSAR models
QSAR is a predictive model derived by applying statistical tools that correlate the biological activity of chemicals, such as desirable therapeutic effects and undesirable side effects, with descriptors of their molecular structure. It is applied in many disciplines, for instance toxicity prediction, regulatory decisions, and risk assessment (Tong et al 2005), as well as lead optimization and drug discovery (Dearden 2003). The quality of a QSAR model depends on the choice of descriptors, the statistical methods, and the quality of the biological data. The aim is to obtain a model capable of making accurate and reliable predictions of the biological activities of new compounds (Wold and Eriksson 1995). Proper validation and evaluation of the predictive power is an important component of all QSAR models (Radhika, Kanth and Vijjulatha 2010, p. S76). Obtaining a successful QSAR model depends on the accuracy of the input data, the selection of appropriate descriptors and statistical tools, and validation of the developed model (Roy 2007). According to Leonard (2006), validation is the procedure by which the reliability and relevance of a process are established for a specific purpose.
According to Doytchinova and Flower (2002, p.536), 3D QSAR methods are attractive because of their combination of an understandable molecular description, rigorous statistical analysis, and an unambiguous graphical display of the results.

7 CoMFA and CoMSIA
The methodologies of Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity Indices Analysis (CoMSIA) provide all the information necessary for understanding the biological properties of aligned molecules by obtaining a suitable sampling of the steric, electrostatic and hydrogen-bond donor fields around them (Radhika, Kanth and Vijjulatha 2010, p. S76). According to Fabian and Timofei (1996, p. 155), CoMFA has become a powerful tool for obtaining QSAR. The CoMFA methodology assumes that differences in molecular biological activity are often related to differences in the magnitudes of the molecular fields surrounding the receptor ligands investigated (Shagufta et al 2006, p.106).

According to Doytchinova and Flower (2002, p.536), CoMSIA methods use fields based on similarity indices that describe similarities and differences between ligands and correlate them with changes in the binding affinity. They also state that the CoMSIA properties are the most important contributions responsible for binding affinity; these properties are the steric, electrostatic, hydrophobic, and hydrogen-bond donor and acceptor fields. CoMSIA is an alternative approach to CoMFA for performing 3D QSAR, in which molecular similarity is compared in terms of similarity indices. In addition to the steric and electrostatic fields used in CoMFA, the CoMSIA method defines explicit hydrophobic and hydrogen-bond donor and acceptor descriptors. The main purpose of CoMSIA is to partition the different properties into the various locations where they play an important role in determining the biological activity.
The most important parameter in optimizing CoMSIA performance is how to combine the five properties in a CoMSIA model (Shagufta et al 2006, p.106).

8 Conclusion
This chapter summarised molecular structure and some types of molecules. We described how drugs work, that is, how they block or activate the function of other molecules and how they are used as a therapy for human beings. We also explained the mechanism of the quantitative structure-activity relationship (QSAR) and its methods, Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity Indices Analysis (CoMSIA).

Chapter 4 Research Work

1 Introduction
In this chapter I summarise my work, which uses a genetic algorithm to find the optimal alignment for a group of molecules. Each molecule is represented for the computer by a set of grid points, and I used the Euclidean distance as a tool to measure the difference or similarity between two molecules.

2 Molecular similarity
Proposing a new method to improve drugs is an extremely challenging but highly rewarding task, which explains the current plethora of approaches. Molecular similarity measures are very important in the development of new medicines and agrochemicals. I used a new similarity measure to calculate the similarity between molecules from the same family, the steroid family. I used this measure to optimize the alignment of these molecules, taking one molecule as the target molecule and the others as sample molecules. The method depends on the Euclidean distance to quantify the difference between the sample and target molecules. I used a grid with many points; each point is affected by the power that comes from each atom in a molecule. This approach will enable predictions in medically related QSAR.
In the chemical environment, the chemical behaviour of a molecule (for example reactivity, ligand docking, and acidity) can be predicted from its structure. It is not necessary to understand the often extremely complex details of the molecule’s action in the chemical environment. This means that the data on one molecule’s action can be used to predict the action of another, closely related molecule merely by comparing how similar they are. This is the basis of molecular similarity in the chemical environment.

3 Quantum Molecular Similarity Measures
The development of analogous techniques for three-dimensional similarity searching has been supported by the development of effective and efficient techniques for three-dimensional substructure searching, where the aim is to identify those molecules in a database that are most similar to a user-defined target structure, using some quantitative measure of intermolecular structural similarity. There are many ways to measure quantum similarity, including algorithms for clustering 2D structures, molecular surface matching, similarity searching through 3D databases, shape-group methods to describe the topology of molecular shape, CoMFA (Comparative Molecular Field Analysis), and shape-graph descriptions (Thorner et al 1996, p. 900). In this approach I used Comparative Molecular Similarity Indices Analysis (CoMSIA) as the method to measure the similarity between two molecules.

In this research the similarity measures are based on the molecular X-ray powers, which have been used to quantify the degree of resemblance between pairs of rigid three-dimensional molecules. This research discusses the effect of including molecular flexibility on the similarities that are calculated using such measures in searches of large three-dimensional databases.
Good results have been obtained with a genetic algorithm developed for calculating the similarity between the X-ray power effects of two molecules, one of which is rigid. Although some molecules are naturally rigid, many organic molecules contain one or more rotatable bonds, which allows the molecule to exist in many different conformations. For this reason it is necessary to consider how torsional flexibility affects the molecular X-ray power effect.

4 The Grid Points
In this project I used grid points to measure the similarity between two molecules. A molecule has many atoms; each atom has an X-ray power effect, and this effect can be measured from the distance between the atom and a point of the grid. I did this using the Euclidean equation, which calculates the distance between two coordinates in three dimensions. For example, in three-dimensional Euclidean space, the distance between the coordinate (x, y, z) and the coordinate (a, b, c) is:

\[ \text{Euclidean distance} = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2} \qquad (1) \]

The figure below shows one molecule in a grid:

Figure 8- One Molecule in a Grid (Lock 2007)

The figure below explains how to calculate the distance between the atom and one point of the grid:

Figure 9- Distance between the Atoms of the Molecule and One Point on the Grid (Lock 2007)

Therefore, to measure the effect of the power of one molecule on one point in the grid, I calculated the sum of the powers of all atoms in the molecule at this point, and I called this the summation point value. In parallel, I made a list with a size equal to the number of points in the grid, and this list stores the point values. To do so, I wrote a method which I called “find grid powers”.
When I pass the molecule to this method, it finds the power values of the points and keeps them in a list. To calculate the power for one atom, I used the equation below:

Power = distance 10 (2)

To measure the point value, I summed the powers of all atoms, as explained in the figure below:

Figure 10- Represent a Molecule in a List (Lock 2007)

5 Alignment algorithm
In this approach I used genetic algorithms to align a set of molecules in order to predict their chemical behaviour. I used the principles of genetic algorithms and added some new steps to get more diversity in the solution space. I mimicked two strategies to reach the optimal solution. The first strategy was to do exactly what a golf player does when he or she hits the ball to put it in the hole. The second strategy was to use the formal genetic algorithm to perform each hit of the player. At the beginning of the game the player hits the ball as hard as possible, then hits it with less power, and so on until he or she reaches the goal. In the same way, the first chromosome I used to perform the genetic algorithm is a chromosome with large parameter values. For the next step, I used a chromosome with smaller parameter values, and so on until I reached my goal.

The mechanism of the alignment is to consider each of the database structures as a flexible molecule, while the target structure is rigid. The genetic algorithm is designed to identify a set of geometric transformations: rotations and translations. These rotations and translations are encoded as a chromosome, which is used to rotate and translate the database molecules in order to align them with the target molecule. To avoid wasting a lot of time just bringing the two molecules into the same general area of 3D space, I found the centroid of each molecule in the database and pulled it to the centroid of the target molecule.
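The centroid step described above can be sketched as follows. This is a minimal illustration, not the thesis code: the class name `CentroidSketch`, the method names, and the representation of a molecule as a `double[][]` of (x, y, z) rows are all assumptions made for the example.

```java
/** Minimal sketch: move a sample molecule so its centroid coincides with the target's. */
final class CentroidSketch {

    /** Arithmetic mean of the atom coordinates; atoms[i] = {x, y, z}. */
    static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] atom : atoms) {
            for (int k = 0; k < 3; k++) {
                c[k] += atom[k];
            }
        }
        for (int k = 0; k < 3; k++) {
            c[k] /= atoms.length;
        }
        return c;
    }

    /** Returns a translated copy of sample whose centroid equals the target's centroid. */
    static double[][] moveToTargetCentroid(double[][] sample, double[][] target) {
        double[] from = centroid(sample);
        double[] to = centroid(target);
        double[][] moved = new double[sample.length][3];
        for (int i = 0; i < sample.length; i++) {
            for (int k = 0; k < 3; k++) {
                moved[i][k] = sample[i][k] + (to[k] - from[k]);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Toy coordinates, invented for the example.
        double[][] target = {{0, 0, 0}, {2, 0, 0}};
        double[][] sample = {{10, 10, 10}, {12, 12, 12}};
        double[] c = centroid(moveToTargetCentroid(sample, target));
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```

Superimposing the centroids first means the genetic algorithm only has to search small translations and rotations around a common centre, rather than spending generations on a coarse approach.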
The initial population for the genetic algorithm was created by generating random values for each parameter (gene) inside the chromosomes. I applied such a chromosome to rotate the molecule about its centroid, which is the centre of the molecule, and also to translate it within a small space. The first step was to use the initial population a few times to get the best position for the molecule according to the fitness value; I considered this the first hit of the golf player. The next step was to initiate new chromosomes with smaller parameter values and to use them a few times. I considered this the second hit of the golf player, and so on until the molecule reached the optimal alignment with the target molecule.

6 Fitness function
The fitness function is the most important aspect of alignment by genetic algorithms because it decides whether or not there is progress during the process. The fitness value is the difference between two molecules; therefore, the smaller the value, the greater the similarity. To measure the fitness value, I first found the list for the target molecule and treated it as a fixed list against which the lists of the other molecules are compared. The fitness value between two lists (the target molecule and the sample molecule) was found by applying the Euclidean distance again, this time not in three dimensions but in 216 dimensions:

\[ \text{Fitness value} = \sqrt{\sum_{i=0}^{215} \bigl(\mathrm{Slist}(i) - \mathrm{Tlist}(i)\bigr)^2} \qquad (3) \]

The target molecule list contains 216 values; each value represents how much the atoms of the target molecule affect one point in the grid with their powers. The sample molecule list also contains 216 values, and each value represents how much the atoms of the sample molecule affect one point in the grid with their powers.
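A minimal sketch of how the two lists and equation (3) could be computed is given here. Several details are assumptions made for the example rather than facts from the thesis: the 216 points are laid out as a 6 x 6 x 6 cube, the grid origin and spacing are arbitrary, and the per-atom power is passed in as a function (`power`) because this sketch does not attempt to reproduce equation (2). The names `GridFitnessSketch`, `cubicGrid`, `gridPowers` and `fitness` are hypothetical.

```java
import java.util.function.DoubleUnaryOperator;

/** Minimal sketch of the grid-point lists and the fitness of equation (3). */
final class GridFitnessSketch {

    /** Builds n*n*n grid points; the cubic layout, origin and spacing are assumptions. */
    static double[][] cubicGrid(int n, double origin, double spacing) {
        double[][] points = new double[n * n * n][];
        int index = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    points[index++] = new double[] {
                        origin + i * spacing, origin + j * spacing, origin + k * spacing};
                }
            }
        }
        return points;
    }

    /** Equation (1): Euclidean distance between two 3D coordinates. */
    static double distance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        double dz = p[2] - q[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** For each grid point, sums the power contributed by every atom of the molecule. */
    static double[] gridPowers(double[][] atoms, double[][] grid, DoubleUnaryOperator power) {
        double[] values = new double[grid.length];
        for (int g = 0; g < grid.length; g++) {
            for (double[] atom : atoms) {
                values[g] += power.applyAsDouble(distance(atom, grid[g]));
            }
        }
        return values;
    }

    /** Equation (3): Euclidean distance between the sample list and the target list. */
    static double fitness(double[] sampleList, double[] targetList) {
        if (sampleList.length != targetList.length) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < sampleList.length; i++) {
            double d = sampleList[i] - targetList[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] grid = cubicGrid(6, 0.0, 1.0);           // 6 * 6 * 6 = 216 points
        DoubleUnaryOperator power = d -> 1.0 / (1.0 + d);   // stand-in only, not equation (2)
        double[][] target = {{1.0, 1.0, 1.0}, {2.5, 1.0, 1.0}};  // toy coordinates
        double[][] sample = {{1.2, 1.0, 1.0}, {2.7, 1.0, 1.0}};
        double[] targetList = gridPowers(target, grid, power);
        double[] sampleList = gridPowers(sample, grid, power);
        System.out.println("list size = " + targetList.length);
        System.out.println("fitness   = " + fitness(sampleList, targetList));
    }
}
```

Because the target list never changes, it can be computed once and reused for every candidate position of every sample molecule; only the sample list has to be recomputed after each transformation.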
By applying the Euclidean distance to these two lists, I obtained the fitness value, which represents the progress of the alignment.

The diagram below explains how to find the fitness value between the target molecule list and the sample molecule list:

Figure 11- Fitness Function

This is another figure, explaining how the point values are taken into a list:

Figure 12- Taking Points Values from a Grid into a List

While working on this alignment I found that it was much better to push the molecules into the positive area of the grid, perform the genetic algorithm, and then pull the molecules back by the same distance, because atoms at the same distance but with coordinates of different signs sometimes give the same power effects, which affected the result in a negative way. Also, comparing molecules with more atoms against molecules with fewer atoms affects the calculation of the power from the molecule at the points of the grid.

7 The mechanism of the program:
I wrote the program in the Java language, which is one of the best languages at the present time. The first thing the program does is read the first molecule from the specified file and call it the target or fixed molecule. It then performs a loop that reads from the second molecule to the last molecule in the file; I call this loop the molecules loop. When the program reads the first molecule, it saves it into a temporary array, which I call temporaryMolecule1; it then finds the power effect for each point in the grid and saves these power values in a vector which I call TVector, referring to the target molecule. When the program reads the second molecule, it saves it in an array which I call temporaryMolecule2.
Then it does the same thing: it finds the power values for each point in the grid and saves them in a vector called SVector, referring to the sample molecule. It then uses the procedure which I call getEuclFittnes to find the distance between the two vectors (TVector and SVector) by applying the Euclidean distance. This is the first fitness, measured before performing the genetic algorithm, and it is the value of the difference between the target and sample molecules. Before performing the genetic algorithm, the program pushes the two molecules into the positive area of the grid to avoid the problems of negative signs; after the genetic algorithm, it pulls them back by the same distance. In the genetic algorithm, the program initiates a random binary chromosome and then translates it into a real-number chromosome consisting of six numbers, three to represent the translation and three to represent the rotation. It finds the centroid of the molecule by summing the coordinates of its atoms and dividing this sum by the number of atoms in the molecule, and it transforms the molecule about this centroid. By checking the fitness every time, the program reuses the same chromosome if it made progress with it. Otherwise, it initiates another chromosome randomly to get a new position for the molecule. It uses some temporary arrays to keep the positions of the molecule when there is progress in the alignment steps and uses them in the next iteration of the loop.
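The chromosome handling and the keep-or-redraw rule described in this section can be sketched as follows. This is an illustration under stated assumptions, not the thesis program: the 8 bits per gene, the starting ranges, the halving of the ranges after each "hit", the X-then-Y-then-Z rotation order, and all names (`ChromosomeSketch`, `decode`, `transform`, `align`) are hypothetical. The fitness is passed in as a function so that any measure, such as the grid-based one, can be plugged in; the `main` method uses a simple atom-to-atom distance purely as a stand-in.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Minimal sketch of chromosome decoding and the keep-or-redraw search. */
final class ChromosomeSketch {
    static final int GENES = 6;          // tx, ty, tz, rx, ry, rz
    static final int BITS_PER_GENE = 8;  // assumed resolution

    static boolean[] randomChromosome(Random rng) {
        boolean[] bits = new boolean[GENES * BITS_PER_GENE];
        for (int i = 0; i < bits.length; i++) bits[i] = rng.nextBoolean();
        return bits;
    }

    /** Maps each bit group to a real value in [-range, +range] (shift or angle range). */
    static double[] decode(boolean[] bits, double maxShift, double maxAngle) {
        double[] genes = new double[GENES];
        int maxValue = (1 << BITS_PER_GENE) - 1;
        for (int g = 0; g < GENES; g++) {
            int value = 0;
            for (int b = 0; b < BITS_PER_GENE; b++) {
                value = (value << 1) | (bits[g * BITS_PER_GENE + b] ? 1 : 0);
            }
            double range = (g < 3) ? maxShift : maxAngle;
            genes[g] = -range + 2.0 * range * value / maxValue;
        }
        return genes;
    }

    static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] a : atoms) for (int k = 0; k < 3; k++) c[k] += a[k] / atoms.length;
        return c;
    }

    /** Rotates about centre (X, then Y, then Z axis; radians), then translates. */
    static double[][] transform(double[][] atoms, double[] centre, double[] g) {
        double cx = Math.cos(g[3]), sx = Math.sin(g[3]);
        double cy = Math.cos(g[4]), sy = Math.sin(g[4]);
        double cz = Math.cos(g[5]), sz = Math.sin(g[5]);
        double[][] out = new double[atoms.length][3];
        for (int i = 0; i < atoms.length; i++) {
            double x = atoms[i][0] - centre[0];
            double y = atoms[i][1] - centre[1];
            double z = atoms[i][2] - centre[2];
            double y1 = y * cx - z * sx, z1 = y * sx + z * cx;      // about X
            double x2 = x * cy + z1 * sy, z2 = -x * sy + z1 * cy;   // about Y
            double x3 = x2 * cz - y1 * sz, y3 = x2 * sz + y1 * cz;  // about Z
            out[i][0] = x3 + centre[0] + g[0];
            out[i][1] = y3 + centre[1] + g[1];
            out[i][2] = z2 + centre[2] + g[2];
        }
        return out;
    }

    /** Each "hit" uses smaller ranges; a chromosome is reused while it improves the fitness. */
    static double[][] align(double[][] sample, ToDoubleFunction<double[][]> fitness,
                            int hits, int triesPerHit, Random rng) {
        double[][] best = sample;
        double bestFitness = fitness.applyAsDouble(best);
        double maxShift = 1.0, maxAngle = Math.PI / 4;  // assumed starting ranges
        for (int hit = 0; hit < hits; hit++) {
            boolean[] bits = randomChromosome(rng);
            for (int t = 0; t < triesPerHit; t++) {
                double[] genes = decode(bits, maxShift, maxAngle);
                double[][] candidate = transform(best, centroid(best), genes);
                double f = fitness.applyAsDouble(candidate);
                if (f < bestFitness) {            // progress: keep position and chromosome
                    best = candidate;
                    bestFitness = f;
                } else {                          // no progress: draw a new chromosome
                    bits = randomChromosome(rng);
                }
            }
            maxShift *= 0.5;                      // softer hit next time (assumed schedule)
            maxAngle *= 0.5;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] target = {{0, 0, 0}, {1.5, 0, 0}, {1.5, 1.5, 0}};  // toy coordinates
        double[][] sample = transform(target, centroid(target),
                new double[] {0.4, -0.3, 0.2, 0.3, 0.1, -0.2});
        ToDoubleFunction<double[][]> fitness = m -> {   // stand-in fitness for the demo
            double sum = 0.0;
            for (int i = 0; i < m.length; i++) {
                for (int k = 0; k < 3; k++) {
                    double d = m[i][k] - target[i][k];
                    sum += d * d;
                }
            }
            return Math.sqrt(sum);
        };
        double[][] aligned = align(sample, fitness, 6, 200, new Random(1));
        System.out.println("before: " + fitness.applyAsDouble(sample));
        System.out.println("after:  " + fitness.applyAsDouble(aligned));
    }
}
```

Since a candidate position is accepted only when its fitness is strictly smaller, the fitness returned by `align` can never be worse than the starting value; how close it gets to zero depends on the random draws and on the assumed ranges.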
The process of the genetic algorithm is shown in the flow chart below:

Figure 13- The Steps of our Software Algorithm

8 Progress of Fitness Value
The progress of the fitness value while the program aligns one sample molecule to the target molecule is shown below. In every entry the target molecule has 56 atoms, sample molecule number 5 has 51 atoms, and the value printed after "Again" is 0.

Fitness before genetic algorithms: 3475.421179465748

The progress of fitness during the process of genetic algorithms:

External counter    Internal counter    Fitness
0                   2                   3472.567298407476
0                   2                   3302.4535452169794
2                   0                   3248.5183022971155
2                   0                   3054.8163721724113
2                   1                   2982.8447901424406
2                   6                   2841.37017850248
2                   6                   2813.9623233579355
5                   9                   2745.5946955887607
5                   9                   2567.9548395992474
5                   18                  2481.661177505608
5                   18                  2254.6069136291703
5                   18                  2160.867300287659
6                   1                   2034.0507525272965
6                   6                   2020.3937523150983
6                   6                   1829.5571891079426
6                   6                   1804.4578681132932
6                   20                  1659.5671266650627
6                   22                  1434.221464646745
6                   22                  1250.3419280607452
9                   2                   1184.3723007336894
9                   2                   1002.6036714036574
9                   2                   940.6145222485881
9                   2                   787.1680503248232
9                   3                   741.2731699290922
9                   17                  734.892515573167
9                   17                  619.602990979145
9                   17                  512.9090031803194
10                  5                   310.26777439088215
10                  13                  216.04409712612068
10                  23                  167.20598169213577
12                  0                   24.450877979887906
14                  4                   6.120089334482836
22                  22                  5.4866454356213294
28                  5                   2.434489938614776

The diagram below shows the progress of the fitness during the process:

Figure 14- Progress of Fitness Function

The amount of difference decreases as the program runs.

The steps to find the standard deviation for running the program ten times:

M = (4 + 3 + 3 + 2 + 6 + 4 + 3 + 3 + 4 + 3) / 10 = 3.5

Table 1
X    M      (X-M)    (X-M)2
4    3.5     0.5     0.25
3    3.5    -0.5     0.25
3    3.5    -0.5     0.25
2    3.5    -1.5     2.25
6    3.5     2.5     6.25
4    3.5     0.5     0.25
3    3.5    -0.5     0.25
3    3.5    -0.5     0.25
4    3.5     0.5     0.25
3    3.5    -0.5     0.25

The sum of (X-M)2 = 0.25 + 0.25 + 0.25 + 2.25 + 6.25 + 0.25 + 0.25 + 0.25 + 0.25 + 0.25 = 10.5
N - 1 = 9
√(10.5 / 9) = 1.080
The standard deviation is 1.080

9 Results and discussion
I first tested my algorithm with a set of molecules from the steroid family. I considered the first molecule as the target molecule and the rest of the list as sample molecules. The chromosome used in the genetic algorithm has six degrees of freedom: three values represent the translation along the X, Y and Z dimensions, and the other three values represent the rotation about X, Y and Z. The software takes roughly five seconds to align each sample molecule in the database to the target molecule. When I printed the result of aligning the target molecule with one of the sample molecules by hand, I found that it is not very different from the alignment of the same two molecules produced by my software.
The figures below show the difference between using the software and aligning by hand; the red atoms represent the target molecule and the black atoms represent the sample molecule.

Figure 15- Two Molecules Aligned by Hand

Figure 16- Two Molecules Aligned by the Software

The screen shows the results in only two dimensions, which is not enough to be sure that my software works successfully, because each molecular structure is three-dimensional. Therefore, I looked for another way to test my software. I took a copy of the target molecule and used it as the sample molecule, so that the target molecule and the sample molecule are one hundred percent similar. To be successful, the software should therefore align them one hundred percent. Before starting the genetic algorithm, I pushed the sample molecule far away from the target and rotated it randomly in three dimensions (random values in each of the X, Y and Z dimensions). The two molecules were therefore in different positions and different orientations.

The figures below show the progress of the program before and after performing the genetic algorithm in the software.
Figure 17- Progress of Align Two Molecules (Step 1)
Figure 18- Progress of Align Two Molecules (Step 2)
Figure 19- Progress of Align Two Molecules (Step 3)
Figure 20- Progress of Align Two Molecules (Step 4)
Figure 21- Progress of Align Two Molecules (Step 5)
Figure 22- Progress of Align Two Molecules (Step 6)
Figure 23- Progress of Align Two Molecules (Step 7)

From the final result picture, I can see that the software is very successful. Although it did not reach the optimal solution, which would align the two molecules one hundred percent, it reached a near-optimal solution, which is more than enough to align two different molecules.

10 Conclusion
I have presented a method for aligning a collection of molecules from the steroid family. The method accepts molecules with 3D coordinates as input and computes a collection of alignments. Each alignment is given a score, based on the atoms’ energy and a similarity score defined by the Euclidean distance, which quantifies the quality of the alignment between the target and sample molecules. I used grid points as a way to represent a molecule numerically. I used the Euclidean distance as a tool to measure the differences and similarities between two molecules, and it was a successful tool. I used the genetic algorithm to align the group of molecules, and I mimicked the approach of golf players to reach the goal.
Chapter 5 Conclusion

This research went through the mechanism of evolution and natural selection, which belongs to Darwin’s theory. It discussed how John Holland mimicked this theory to invent the idea of the genetic algorithm. The genetic algorithm is a method for finding the optimal solution to real-life problems whose structure is compatible with the genetic algorithm. It depends on stochasticity and diversity to carry out its processes. It uses operations such as crossover and mutation to obtain diversity among the solutions, and it uses the fitness function as the brain of the algorithm to control the whole process. The idea of the golf player was used to get the optimal alignment for a group of molecules. The research reviewed some papers related to this work, which were useful because they provided experience in dealing with molecule alignment and genetic algorithms. Phenotype and genotype were discussed, as well as how the features of an organism are inherited from generation to generation.

In this research I discussed molecules, small molecules, molecular formula, molecular geometry, medicinal chemistry, and drugs. Aligning molecules is a good method for improving drugs. Comparative Molecular Similarity Indices Analysis (CoMSIA) was discussed; it is a 3D method to predict and correlate the biological activity of molecules, and it is one method of the quantitative structure-activity relationship (QSAR). The focus of the research was the translation and rotation (transformation) of each molecule in the database. The algorithm was developed to align a set of molecules against one molecule considered as the target. The Euclidean distance was used as a tool to measure the difference and similarity between two molecules.
Grid points were used to represent each molecule in space, and they were a good tool for controlling the progress of the fitness function and the steps of the genetic algorithm. The algorithm was tested by taking a copy of the target molecule and using it as the sample molecule. Both were used as input for the algorithm because the two molecules are one hundred percent similar. This was a good way to test the software, and the software succeeded in finding the optimal alignment. As an answer to the research question, the results showed that the genetic algorithm is a successful method for aligning similar molecules and finding the optimal alignment for them.

Chapter 6 Future Research (Pharmacophore)

According to Jones (2000), pharmacophore modelling is a useful tool because it provides a good alternative when the three-dimensional structure of a protein target is absent. A pharmacophore describes the molecular features that are necessary for a biological macromolecule to recognize the ligand molecule. It is an ensemble of steric and electronic features that is used to ensure the optimal supramolecular interactions with a specific biological target molecule and to trigger or block its biological activity. Pharmacophores are used in modern computational chemistry to define the important features of one or more molecules with the same biological activity. A database of chemical compounds can then be searched for more molecules that share the same features located a similar distance apart from each other. A genetic algorithm is used to maximize the distance similarity between pharmacophore features. It encodes conformational information in bit strings, together with mappings between the molecules in the overlay. The fitness function is useful for guiding the progress of overlapping the pharmacophore features (Strozjev et al. 2005).
In future work I will use a more sophisticated genetic algorithm that defines each molecule as a core structure plus a set of torsions, in order to overcome the limitations of existing pharmacophore tools. Using a pharmacophore focuses on the most important parts of the molecule rather than considering the entire molecule. This future work has several advantages. For example, Pareto multi-objective optimization will be useful for simultaneously balancing steric and energy information when building the most valuable hyper molecule models required. Moreover, unlike other methods, the run time will scale linearly with the number of ligands.

References

Alex Barnum (13 May 1991). "Biotech companies shift focus". The Toronto Star.
Brown, T.L. 2003. Chemistry – the Central Science, 9th Ed. New Jersey: Prentice Hall.
Carbó, R., Leyda, L. and Arnau, M. ‘How similar is a molecule to another? An electron density measure of similarity between two molecular structures’. Int. J. Quant. Chem. 1980, 17, 1185-1189.
Daeyaert, Jonge, M, Heeres, J, Koymans, L, Lewi, P, Broeck, W, & Vinkers, M (2005). "Pareto optimal flexible alignment of molecules using a non-dominated sorting genetic algorithm". Chemometrics and Intelligent Laboratory Systems, vol. 77, 232-237.
Dearden JC (2003). "In silico prediction of drug toxicity". Journal of Computer-aided Molecular Design 17 (2–4): 119–27.
Doytchinova, I & Flower, D 2002, ‘A Comparative Molecular Similarity Index Analysis (CoMSIA) study identifies an HLA-A2 binding supermotif’, Journal of Computer-Aided Molecular Design, vol. 16, pp. 535-544.
Fabian, W & Tiofei, S 1996, ‘Comparative Molecular Field analysis (CoMFA) of dye-Fibre affinities’, Elsevier, pp. 155-162.
Genetic Engineering & Biotechnology News (Mary Ann Liebert) 29 (9): pp. 34–35.
Good, A.C., Hodgkin, E.E. and Richards, W.G.
The utilisation of Gaussian functions for the rapid evaluation of molecular similarity. J. Chem. Inf. Comput. Sci. 1992, 32, 188-191.
Jewell, N, Turner, Willett, P & Sexton, G (2001). ‘Automatic generation of alignments for 3D QSAR analyses’, Journal of Molecular Graphics and Modelling, vol. 2, pp. 111-121.
Jones, Gareth. Genetic and evolutionary algorithms.
Kubinyi, H ND, ‘Comparative Molecular Field Analysis (CoMFA)’, Ludwigshafen.
Leach, Andrew R.; Harren Jhoti (2007). Structure-based Drug Discovery. Berlin: Springer.
Leonard JT, Roy K (2006). "On selection of training and test sets for the development of predictive QSAR models". QSAR & Combinatorial Science 25 (3): 235–251.
‘Ligand-based design: Pharmacophore Perception and Molecular Alignment’, Tripos.
Lock, P 2007, ‘Machine Learning in Drug Discovery’, pp. 4-5.
Michalewicz, Z 1999, Genetic Algorithms + Data Structures = Evolution Programs, Springer, Berlin.
Michalewicz, Z 2010, Evolutionary Computation.
Michalewicz, Z & Fogel, D 2004, How to Solve It: Modern Heuristics, Springer, Berlin.
Payne, A.W.R. & Glen, R.C. 1993, ‘Molecular recognition using a binary genetic search algorithm’, Journal of Molecular Graphics and Modelling, vol. 11, pp. 72-91.
Radhilka, V, Kanth, S & Vijjulatha, M 2010, ‘CoMFA and CoMSIA Studies on Inhibitors of HIV-1 Integrase - Bicyclic Pyrimidinones’, E-Journal of Chemistry, vol. 7(S1), pp. S75-S84.
Richmond, N, Willett, P & Clark, R 2004, ‘Alignment of three-dimensional molecules using an image recognition algorithm’, Journal of Molecular Graphics and Modelling, vol. 23, pp. 199-209.
Roy, K 2007, ‘On some aspects of validation of predictive quantitative structure-activity relationship models’, Expert Opin. Drug Discov. 2, pp. 1567–1577.
Shagufta, Kumar, A, Panda, G & Siddiqi, M 2006, ‘CoMFA and CoMSIA 3D-QSAR analysis of diaryloxy-methano-phenanthrene derivatives as anti-tubercular agents’, J Mol Model, vol.
13, pp. 99-109.
Strizhev, A, Abrahamian, E, Choi, S, Leonard, J, Wolohan, P & Clark, P 2006, ‘The Effects of Biasing Torsional Mutation in a Conformational GA’, J. Chem. Inf. Model., vol. 46, pp. 1862-1870.
Thorner, D, Wild, D, Willett, P & Wright, P 1996, ‘Similarity Searching in Files of Three-Dimensional Chemical Structure: Flexible Field-Based Searching of Molecular Electrostatic Potentials’, J. Chem. Inf. Comput. Sci., vol. 36, pp. 900-908.
Tong W, Hong H, Xie Q, Shi L, Fang H, Perkins R (April 2005). "Assessing QSAR Limitations – A Regulatory Perspective". Current Computer-Aided Drug Design, vol. 1, pp. 195–205.
Wild, D & Willett, P 1996, ‘Similarity Searching in Files of Three Dimensional Chemical Structures. Alignment of Molecular Electrostatic Potential Fields with a Genetic Algorithm’, Journal of Chemical Information and Computer Sciences, vol. 36, pp. 159-167.
Willett, P. 1995. "Genetic algorithms in molecular recognition and design". TIBTECH.
Wold S & Eriksson, L 1995, ‘Statistical validation of QSAR results’. In Waterbeemd, Han van de. Chemometric methods in molecular design. Weinheim: VCH. pp. 309–318.
Xu, H, Sergei, Z & Dimitris, A 2003, ‘Conformational sampling by self-organization’, Journal of Chemical Information and Computer Sciences, vol. 43, pp. 1186-1191.
Yadgary, J, Amihod, A & Ron U 1998, ‘Genetic algorithms for protein threading’.