International Journal of Electronics, Electrical and Computational System (IJEECS)

Genetic Algorithms Approach to Protein Alignments

Gurpreet Singh
Faculty of Computer Applications, Chandigarh Group of Colleges, Gharuan Campus, Mohali, Punjab, India.
gp_cgc@yahoo.com

Er. Varun Nayyar
Department of Electronics and Communication Engg, RBCENT for Women, Hoshiarpur, Punjab.
varunnayyarnayyar@gmail.com

Abstract—Comparison of proteins may highlight regions in which the proteins are most similar. These conserved areas might represent the regions or domains of the proteins that are responsible for a common function. Similarities between protein sequences are usually located using dynamic programming algorithms, which are guaranteed to find the optimal alignment under a given set of costs for the sequence editing operations. The computational problem becomes more complicated when multiple rather than pairwise sequence alignments are needed. Genetic algorithms (GAs) are computing algorithms constructed in analogy with the process of evolution. They seem to be useful for searching very general and poorly defined spaces. A GA offers flexible encoding and decoding schemes, which allows complex knowledge to be represented. An advantage of the GA is that it works well in global optimization, especially with poorly behaved objective functions such as those that are discontinuous or have many local minima. GAs have gained steady recognition as useful computational tools for addressing optimization tasks related to protein structures, and in particular to protein structure prediction.

Keywords— GA, NP, Alignment, Threading.

I. Introduction

The GA approach is based on the observation that living systems adapt to their environment in an efficient manner. Thus, the genetic processes involved in evolution actually perform a computational process of finding an optimal adaptation to a set of environmental conditions. Evolution works by using a large genetic pool of traits that are reproduced faithfully, but with some random variations that are subject to the process of natural selection. While there is no guarantee that the process will always find the optimal solution, it is evident that over time it is powerful enough to select a combination of traits that enables the organism to function in its environment. The GA approach attempts to apply these fundamental ideas to other optimization problems.

The basic idea behind the GA search method is to maintain a population of solutions. This population is allowed to advance through successive generations in which the solutions are evolved via genetic operations. The size of the population is maintained by pruning in a manner that gives better survival and reproduction probabilities to fitter solutions, while maintaining large diversity within the population. This implies that the algorithm must use a fitness function that expresses the quality of each solution as a numerical value. In many applications, possible solutions are represented as strings and are subject to three genetic operators: replication, crossover, and mutation.

Genetic algorithms, a cooperative computational method, have been successful in many difficult computational tasks. Thus, it is not surprising that in recent years several studies have explored the possibility of using genetic algorithms to address the protein alignment problem. In this review, a general framework of how genetic algorithms can be used for alignment is described.

II. Genetic Algorithms

Genetic Algorithms have been used as stochastic methods for solving optimization and search problems, operating on a population of possible solutions. According to Darwin’s Theory of Evolution, the repeated application of the operations described above (replication, crossover, and mutation) alters an initial species into various other species; however, only the stronger prevail.
Genetic Algorithms perform the same operations on a population of candidate solutions, with only those that fit the problem better surviving. Even though there is no formal definition of GAs, all of them consist of four elements [1]. The first is the population of chromosomes, which represent the possible solutions of the problem. Selection is the second element; it determines which part of the population will evolve into the next generation. Selection is performed based on a fitness function that determines how “good” a solution is. The selection process is applied to each generation produced.

Crossover, the third element, refers to the combination or exchange of characteristics between two members of the elite group defined by selection, by which offspring are produced. There are various types of crossover, but the most frequently used are one-point crossover, in which the parents are cut at a specific point and the head of the first is pasted to the tail of the second or vice versa, and two-point crossover, in which a segment of one parent is exchanged with the segment that lies in the same location in the other parent.

Table 1.1 - One-Point and Two-Point Crossover
              One-point crossover    Two-point crossover
Parent 1      110 / 0100110          110 / 0100 / 110
Parent 2      101 / 1010101          101 / 1010 / 101
Offspring 1   110 1010101            110 1010 110
Offspring 2   101 0100110            101 0100 101

After crossover is applied to the population, a new generation is produced. Whether the parents are part of the new generation is an option that depends on the problem. In any case, before selection is re-applied to the new population, mutation takes place. Mutation, the fourth element, is a random event that occurs with a user-defined probability and affects only some of the new offspring. It maintains genetic diversity by altering only a small piece of an offspring.
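The one-point and two-point crossover operations of Table 1.1, together with a bit-flip form of mutation, can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from the cited works: the function names are hypothetical, chromosomes are assumed to be equal-length bit strings, and mutation is assumed to flip each bit independently with a user-defined probability.

```python
import random

def one_point_crossover(p1: str, p2: str, cut: int) -> tuple[str, str]:
    """Swap the tails of two equal-length bit strings after position `cut`."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def two_point_crossover(p1: str, p2: str, start: int, end: int) -> tuple[str, str]:
    """Exchange the segment [start, end) between two equal-length bit strings."""
    return (p1[:start] + p2[start:end] + p1[end:],
            p2[:start] + p1[start:end] + p2[end:])

def mutate(chrom: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate` (assumed scheme)."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chrom
    )

# Parents from Table 1.1, written without the cut markers.
parent1, parent2 = "1100100110", "1011010101"
print(one_point_crossover(parent1, parent2, 3))     # ('1101010101', '1010100110')
print(two_point_crossover(parent1, parent2, 3, 7))  # ('1101010110', '1010100101')
print(mutate(parent1, 0.1, random.Random(0)))       # result depends on the seed
```

The two crossover calls reproduce the offspring listed in Table 1.1. Passing an explicit `random.Random` instance to `mutate` keeps the random event reproducible for a given seed.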
Table 1.2 - Mutation
Parent 1                                110 / 0100110
Parent 2                                101 / 1010101
Offspring 1                             110 1010101
Offspring 2 (mutated on the 1st bit)    001 0100110

All the methods described above rely heavily on the nature of the problem to be solved, the domain in which the solutions are to be found, and the encoding of the solutions. More complex encoding structures, such as digital trees, allow more difficult problems to be solved, but also require more complex methods to be defined for the manipulation of the generations. However, the basic structure of GAs remains the same and is outlined below [2].

Table 1.3 - Outline of the Basic Genetic Algorithm
1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem).
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
3. [New population] Create a new population by repeating the following steps until the new population is complete.
3.1. [Selection] Select two parent chromosomes from the population according to their fitness (the better the fitness, the bigger the chance to be selected).
3.2. [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
3.3. [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
3.4. [Accepting] Place the new offspring in the new population.
4. [Replace] Use the newly generated population for a further run of the algorithm.
5. [Test] If the end conditions are satisfied, stop and return the best solution in the current population.
6. [Loop] Go to step 2.

III. Genetic Algorithms for Protein Alignments

Multiple sequence alignment was shown to be difficult [4]. Similarly, structure alignment is difficult even between a pair of proteins, and all the more so between multiple protein structures.
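Before turning to specific alignment methods, the outline in Table 1.3 can be made concrete with a minimal sketch. Everything problem-specific here is an assumption made for illustration: the function name `run_ga` and its parameters are hypothetical, chromosomes are lists of bits, selection is fitness-proportional, the fitness is assumed to be non-negative, and the end condition is simply a fixed number of generations.

```python
import random

def run_ga(fitness, length=20, pop_size=30, generations=50,
           crossover_rate=0.7, mutation_rate=0.01, seed=0):
    """Basic GA loop following the step labels of Table 1.3."""
    rng = random.Random(seed)
    # 1. [Start] random population of bit-list chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # 2. [Fitness] evaluate each chromosome; +1 keeps every weight positive
        #    (assumes the fitness itself is never negative).
        weights = [fitness(c) + 1 for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # 3.1 [Selection] fitness-proportional choice of two parents.
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]  # without crossover, children copy parents
            # 3.2 [Crossover] one-point, applied with probability crossover_rate.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, length - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                # 3.3 [Mutation] flip each locus with probability mutation_rate.
                for i in range(length):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                # 3.4 [Accepting] place the child in the new population.
                new_pop.append(child)
        # 4. [Replace] the new population becomes the current one.
        pop = new_pop[:pop_size]
        # 5. [Test] the end condition is the generation budget; remember the
        #    best chromosome seen so far.
        best = max(pop + [best], key=fitness)
        # 6. [Loop] continue with the next generation.
    return best

# Toy objective (an assumption, unrelated to proteins): maximize the number of 1 bits.
best = run_ga(sum)
print(best, sum(best))
```

For an alignment problem, the bit list and the toy objective would be replaced by an alignment encoding and an alignment score, while the loop itself stays the same.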
Another related difficult problem is threading: the alignment of the sequence of one protein on the structure of another, which was also shown to be nondeterministic polynomial hard (NP-hard) [5]. Threading is useful for fold recognition, a less ambitious task than ab initio folding, in which the goal is not to predict the detailed structure of the protein but rather to recognize its general fold, for example, by assigning the protein to a known structural class. Because these are complex problems, it is not surprising that GAs have been used to address them. In these problems the representation issue is even more critical than in protein structure prediction, where the set of dihedral angles provides a “natural” solution.

SAGA [6] is a GA-based method for multiple sequence alignment. Multiple sequence alignments are represented as matrices in which each sequence occupies one row. The genetic operators (22 types of operators are used) manipulate the insertion of gaps into the alignments. Since a multiple sequence alignment induces a pairwise alignment on each pair of sequences that participates in the alignment, the fitness function simply sums the scores of the pairwise alignments. It was claimed that SAGA performs better than some of the common packages for multiple sequence alignment.

The issue of structure alignment was addressed in several studies. When two proteins with the same length and a very similar structure are compared, they can be aligned by a mathematical procedure [7] that finds the optimal rigid superposition between them. However, if the proteins differ in size or their structures are only somewhat similar, gaps must be introduced in the alignment so that the regions where the proteins are most similar can be aligned on each other (Fig. 1).
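The fitness described above for multiple sequence alignment, in which the scores of all induced pairwise alignments are summed, can be illustrated with a short sketch. The match, mismatch, and gap values are placeholder assumptions chosen for readability and are not the scoring scheme of SAGA; the function names are likewise hypothetical. Each row of the alignment matrix is a string in which "-" marks a gap.

```python
from itertools import combinations

def pair_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Score the pairwise alignment induced by two rows of the matrix."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column that is gap-only for this pair is ignored
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(alignment: list[str]) -> int:
    """Fitness of a multiple alignment: sum of all induced pairwise scores."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))

# Made-up sequences, one row per sequence.
alignment = [
    "MKT-AY",
    "MKTQAY",
    "M-TQAY",
]
print(sum_of_pairs(alignment))  # 6
```

A GA over alignments would call such a function once per candidate, so operators that move gaps change the fitness only through the columns they touch.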
Fig. 1. Structural alignment of hemoglobin (β-chain) (the ribbon representation) with allophycocyanin (the ball-and-stick representation). The gaps in the structural alignment of one protein relative to the other are shown in a thick line representation.

A GA was used to produce a large number of initial rigid superpositions, with the six parameters of the superposition (three for rotation and three for translation) as the manipulated objects. Then, a dynamic programming algorithm was used to find the best way to introduce gaps into the structural alignment. This method was extended to identify local structural similarities among a large number of structures. It was shown that the results are consistent with other methods of structural alignment.

Structure alignment was also addressed in a different way. Secondary structure elements were identified for each protein, and the structural alignment was done by using a GA to match these elements across the two structures. The representation was the paired list of secondary structure elements. The genetic operators changed the pairing of these elements to each other. A refinement stage was performed later to determine the exact boundaries of each secondary structure fragment. The results show very good agreement with high-quality alignments made by human experts based on careful structural examination.

GAs were also applied to the threading problem, the alignment of the sequence of one protein to the structure of another. Again, the crux of the problem is where to introduce gaps in the alignment of one protein relative to the other. Threading was encoded as strings of numbers, where 0 represents a deletion of a structural element relative to the sequence, 1 represents a match between the corresponding positions in the sequence and in the structure, and a number bigger than 1 represents an insertion of one or more sequence residues relative to the structure. The genetic operators manipulated these strings by changing these numbers. The changes were done in a coordinated manner such that the string would always encode a valid alignment. In several test cases, it was shown that this method is capable of finding good alignments.

IV. Conclusion

GAs are efficient general search algorithms and as such can be applied to many optimization problems, including problems related to protein folding. This is quite intriguing, since in reality protein folding occurs on the single-molecule level. Protein molecules fold individually (at least in vitro) as single molecules, and clearly not by a “mix-and-match” strategy on the population level. The strength of the GA approach and its ability to describe many biological processes come from its ability to model cooperative pathways.

V. References

[1] Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor, MI
[2] Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
[3] Huberman BA (1990) Phys D 42:38
[4] Clearwater SH, Huberman BA, Hogg T (1991) Science 254:1181
[5] Ramakrishnan C, Ramachandran GN (1965) Biophys J 5:909
[6] Anfinsen CB, Haber E, Sela M, White FH (1961) Proc Natl Acad Sci USA 47:1309
[7] Anfinsen CB (1973) Science 181:223
[8] Burley SK, Bonanno JB (2003) Methods Biochem Anal 44:591
[9] Karplus M (1987) The prediction and analysis of mutant structures. In: Oxender DL, Fox CF (eds) Protein engineering. Liss, New York
[10] Roterman IK, Lambert MH, Gibson KD, Scheraga HA (1989) J Biomol Struct Dyn 7:421
[11] Even S (1979) Graph algorithms. Computer Science Press, Rockville, MD
[12] Unger R, Moult J (1993) Bull Math Biol 55:1183
[13] Berger B, Leighton TJ (1998) J Comput Biol 5:27
[14] Levitt M (1982) Annu Rev Biophys Bioeng 11:251
[15] Dandekar T, Argos P (1997) Protein Eng 10:877