International Journal of Electronics, Electrical and Computational System
IJEECS
Genetic Algorithms Approach to Protein
Alignments
Gurpreet Singh
Faculty of Computer Applications
Chandigarh Group of Colleges
Gharuan Campus, Mohali, Punjab, India.
gp_cgc@yahoo.com
Er. Varun Nayyar
Department of Electronics and Communication Engg,
RBCENT for Women Hoshiarpur, Punjab.
varunnayyarnayyar@gmail.com
Abstract—Comparison of proteins may highlight regions in
which the proteins are most similar. These conserved areas
might represent the regions or domains of the proteins that
are responsible for a common function. Locating similarities
between protein sequences is usually done using dynamic
programming algorithms, which are guaranteed to find the
optimal alignment under a given set of costs for the
sequence editing operations. The computational problem
becomes more complicated when multiple rather than
pairwise sequence alignments are needed. Genetic algorithms
(GAs) are computing algorithms constructed in analogy with the
process of evolution. They seem to be useful
for searching very general and poorly defined
spaces. A genetic algorithm offers a rich choice of
encoding and decoding schemes and can convey complex knowledge
flexibly. An advantage of the genetic algorithm is that it
works well in global optimization, especially with
poorly behaved objective functions such as those that are
discontinuous or have many local minima. GAs have gained
steady recognition as useful computational tools for
addressing optimization tasks related to protein structures
and in particular to protein structure prediction.
Keywords— GA, NP, Alignment, Threading.
I. Introduction
The GA approach is based on the observation that living
systems adapt to their environment in an efficient manner.
Thus, genetic processes involved in evolution actually perform
a computational process of finding an optimal adaptation for a
set of environmental conditions. Evolution works by using a
large genetic pool of traits that are reproduced faithfully, but
with some random variations that are subject to the process of
natural selection. While there is no guarantee that the process
will always find the optimal solution, it is evident that over
the course of time it is powerful enough to select a
combination of traits that enables the organism to function in
its environment. The GA approach attempts to apply
these fundamental ideas to other optimization problems. The
basic idea behind the GA search method is to maintain a
population of solutions. This population is allowed to advance
through successive generations in which the solutions are
evolved via genetic operations. The size of the population is
maintained by pruning in a manner that gives better survival
and reproduction probabilities to more fit solutions, while
maintaining large diversity within the population. This implies
that the algorithm must utilize a fitness function that can
express the quality of each solution as a numerical value. In
many applications, possible solutions are represented as
strings and are subject to three genetic operators: replication,
crossover, and mutation.
Genetic algorithms, a cooperative computational method, have
been successful in many difficult computational tasks. Thus, it
is not surprising that in recent years several studies were
performed to explore the possibility of using genetic
algorithms to address the protein alignment problem. In this
review, a general framework of how genetic algorithms can be
used for alignment is described.
II. Genetic Algorithms
Genetic Algorithms have been used as stochastic methods for
solving optimization and search problems, operating on a
population of possible solutions. According to Darwin’s
Theory of Evolution, the repeated application of variation and
natural selection turns an initial species into
various other species; however, only the fitter ones prevail.
Genetic Algorithms perform the same operations on a
population of candidate solutions, with only those that fit the
problem better surviving. Even though there is no formal
definition of GAs, all of them consist of four elements [1]:
a population, selection, crossover, and mutation. The
first is the population of chromosomes, which represent the
possible solutions of the problem. Selection is the second
element and it refers to the part of the population that will
evolve to the next generation. Selection is performed based on
a fitness function that determines how “good” a solution is.
The selection process is applied to each generation produced.
Crossover refers to the combination or exchange of
characteristics between two members of the elite group
defined by selection, by which offspring are produced. There
are various types of crossover, but the most frequently used
are the one-point crossover, in which the parents are cut at a
specific point and the head of the first is pasted to the tail of
the second or vice versa, and the two-point crossover, in which
a segment of one parent is exchanged with
the segment that lies in the same location of the other parent.
Table 1.1 - One-point and two-point crossover
              One-point        Two-point
Parent 1      110 / 0100110    110 / 0100 / 110
Parent 2      101 / 1010101    101 / 1010 / 101
Offspring 1   110 1010101      110 1010 110
Offspring 2   101 0100110      101 0100 101
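The two operators in Table 1.1 can be sketched in a few lines of Python. This is only an illustration: the function names and the choice to represent chromosomes as strings of '0' and '1' are assumptions, and the cut positions are the ones marked by the slashes in the table.

```python
def one_point_crossover(p1, p2, cut):
    """Swap the tails of two equal-length strings after position `cut`."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def two_point_crossover(p1, p2, start, end):
    """Exchange the segment [start, end) between two equal-length strings."""
    child1 = p1[:start] + p2[start:end] + p1[end:]
    child2 = p2[:start] + p1[start:end] + p2[end:]
    return child1, child2


# Parents from Table 1.1, written without the cut markers.
parent1 = "1100100110"
parent2 = "1011010101"

print(one_point_crossover(parent1, parent2, 3))     # cut after the 3rd bit
print(two_point_crossover(parent1, parent2, 3, 7))  # swap bits 4 to 7
```

Both calls return the offspring pairs listed in the table.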
After the application of crossover on the population, a new
generation is produced. Whether parents are part of the new
generation or not is an option that depends on the problem. In
any case, before re-applying selection to the new population,
mutation takes place. Mutation is a random event that affects
only some of the new offspring, with a user-defined
probability. It is used to maintain genetic diversity by altering
only a small part of the new offspring.
Table 1.2 - Mutation
Parent 1                               110 / 0100110
Parent 2                               101 / 1010101
Offspring 1                            110 1010101
Offspring 2 (mutated on the 1st bit)   101 0100110 -> 001 0100110
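As a small illustration of the mutation operator, the sketch below flips bits in a bit-string chromosome. The helper names `flip_bit` and `mutate` are hypothetical, and the per-bit mutation rate used in the example call is an arbitrary assumption. Flipping the first bit of the second offspring, 101 0100110, gives 001 0100110.

```python
import random


def flip_bit(chromosome, position):
    """Return a copy of the bit string with the bit at `position` inverted."""
    bit = "0" if chromosome[position] == "1" else "1"
    return chromosome[:position] + bit + chromosome[position + 1:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("0" if b == "1" else "1") if rng.random() < rate else b
        for b in chromosome
    )


offspring2 = "1010100110"
print(flip_bit(offspring2, 0))                    # forced mutation of the 1st bit
print(mutate(offspring2, 0.1, random.Random(0)))  # random mutation, rate 0.1 assumed
```

A low rate keeps most offspring unchanged, which matches the role of mutation as a source of occasional diversity rather than the main search operator.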
All the methods described above rely heavily on the nature of
the problem to be solved, the domain in which the solutions
are to be found, and the encoding of the solutions. More
complex encoding structures, such as digital trees, allow more
difficult problems to be solved, but also require more complex
methods to be defined for the manipulation of the generations.
However, the basic structure of the GAs remains the same and
is outlined below [2].
Table 1.3 - Outline of the Basic Genetic Algorithm
1. [Start] Generate a random population of n chromosomes
(suitable solutions for the problem).
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in
the population.
3. [New population] Create a new population by repeating the
following steps until the new population is complete.
3.1. [Selection] Select two parent chromosomes from the
population according to their fitness (the better the fitness, the
bigger the chance of being selected).
3.2. [Crossover] With a crossover probability, cross over the
parents to form new offspring (children). If no crossover is
performed, the offspring are exact copies of the parents.
3.3. [Mutation] With a mutation probability, mutate the new
offspring at each locus (position in the chromosome).
3.4. [Accepting] Place the new offspring in the new population.
4. [Replace] Use the newly generated population for a further run
of the algorithm.
5. [Test] If the end conditions are satisfied, stop, and return
the best solution in the current population.
6. [Loop] Go to step 2.
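The outline in Table 1.3 can be turned into a short runnable sketch. Everything problem-specific here is an assumption made for illustration: the toy fitness function (the number of 1 bits), fitness-proportional ("roulette-wheel") selection, one-point crossover, and the default parameter values. The comments map each part of the code to the numbered steps.

```python
import random


def fitness(chrom):
    """Toy objective (an assumption): the number of 1 bits."""
    return sum(chrom)


def select(population, scores, rng):
    """Pick one parent with probability proportional to its fitness."""
    # A tiny constant keeps the weights valid if every score is zero.
    weights = [s + 1e-9 for s in scores]
    return rng.choices(population, weights=weights, k=1)[0]


def run_ga(n=30, length=20, p_cross=0.7, p_mut=0.01, generations=60, seed=0):
    rng = random.Random(seed)
    # 1. [Start] random population of n chromosomes
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # 2. [Fitness]
        scores = [fitness(c) for c in population]
        # 3. [New population]
        new_population = []
        while len(new_population) < n:
            # 3.1 [Selection]
            a = select(population, scores, rng)
            b = select(population, scores, rng)
            # 3.2 [Crossover] one-point; otherwise copy the parents
            if rng.random() < p_cross:
                cut = rng.randint(1, length - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            else:
                a, b = a[:], b[:]
            for child in (a, b):
                # 3.3 [Mutation] at each locus
                for i in range(length):
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                # 3.4 [Accepting]
                new_population.append(child)
        # 4. [Replace]
        population = new_population[:n]
        # 5. [Test] stop once the known optimum of the toy objective is reached
        best = max(population, key=fitness)
        if fitness(best) == length:
            break
        # 6. [Loop] continue with the next generation
    return best


best = run_ga()
print(fitness(best), "of", len(best), "bits set")
```

For an alignment problem, only `fitness`, the chromosome encoding, and the operators would change; the loop itself stays the same.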
III. Genetic Algorithms for Protein Alignments
Multiple sequence alignment was shown to be difficult [4].
Similarly, seeking structure alignment even between a pair of
proteins, and clearly between multiple protein structures, is
difficult. Another related difficult problem is threading:
alignment of the sequence of one protein on the structure of
another, which was also shown to be nondeterministic
polynomial hard (NP-hard) [5]. Threading is useful for fold
recognition, a less ambitious task than ab initio folding, in
which the goal is not to predict the detailed structure of the
protein but rather to recognize its general fold, for example, by
assignment of the protein to a known structural class. Because
these are complex problems, it is not surprising that GAs have
been used to address them. In these questions the
representation issue is even more critical than in the protein
structure prediction, where the dihedral angles set provides a
“natural” solution.
SAGA [6] is a GA-based method for multiple sequence
alignments. Multiple sequence alignments are represented as
matrices in which each sequence occupies one row. The
genetic operators (22 types of operators are used) manipulate
the insertion of gaps into the alignments. Since a multiple
sequence alignment induces a pairwise alignment on each
pair of sequences that participates in the alignment,
the fitness function simply sums the scores of the pairwise
alignments. It was claimed that SAGA performs better
than some of the common packages for multiple sequence
alignment.
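The idea of a fitness function that sums the scores of the induced pairwise alignments can be sketched as follows. This is not SAGA's actual scoring scheme: the match, mismatch, and gap values, the function names, and the example alignment are all assumptions chosen to keep the example small.

```python
from itertools import combinations

# Illustrative scores only (assumed); a real method would use a
# substitution matrix and a more careful gap model.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(row_a, row_b):
    """Score the pairwise alignment induced by two rows of the matrix."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a gap in both rows is not part of this pair's alignment
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score


def sum_of_pairs(alignment):
    """Fitness of a multiple alignment: sum over all pairs of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


# One sequence per row, '-' marks a gap (hypothetical toy sequences).
alignment = [
    "MKV-LA",
    "MKVQLA",
    "M-VQLS",
]
print(sum_of_pairs(alignment))
```

A GA in this setting would keep a population of such matrices and use operators that move or insert gaps, keeping the alignments with the highest sum-of-pairs score.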
The issue of structure alignment was addressed in several
studies. When two proteins with the same length and a very
similar structure are compared, they can be aligned by a
mathematical procedure [7] that finds the optimal rigid
superposition between them. However, if the proteins differ
in size or when their structures are only somewhat similar,
then there is a need to consider introducing gaps in the
alignment between them such that the regions where they
are most similar can be aligned on each other (Fig. 1).
Fig. 1. Structural alignment of hemoglobin (β-chain) (the ribbon representation)
with allophycocyanin (the ball-and-stick representation). The gaps in the
structural alignment of one protein relative to the other are shown in a thick
line representation.
A GA was used to produce a large number of initial rigid
superpositions (using the six parameters of the
superposition, three for rotation and three for translation)
as the manipulated objects. Then, a dynamic programming
algorithm was used to find the best way to introduce gaps
into the structural alignment. This method was extended to
identify local structure similarities amongst a large number
of structures. It was shown that the results are consistent
with other methods of structural alignment.
Structure alignment was also addressed in a different way.
Secondary structure elements were identified for each protein,
and the structural alignment was done by using a GA to match
these elements across the two structures. The
representation was the paired list of secondary structure
elements. The genetic operators changed the pairing of these
elements to each other. A refinement stage was performed
later to determine the exact boundaries of each secondary
structure fragment. The results show very good agreement
with high-quality alignments made by human experts based on
careful structural examination.
The threading problem, the
alignment of the sequence of one protein to the structure of
another, was addressed with a GA as well. Again the crux of the
problem is where to introduce
gaps in the alignment of one protein relative to the other.
Threading was encoded as strings of numbers where 0
represents a deletion of a structural element relative to the
sequence, 1 represents a match between the corresponding
positions in the sequence and in the structure, and a number
bigger than 1 represents insertion of one or more sequence
residues relative to the structure. The genetic operators
manipulated these strings by changing these numbers. The
changes were done in a coordinated manner such that the
string would always encode a valid alignment. In several test
cases, it was shown that this method is capable of finding good
alignments.
IV. Conclusion
GAs are efficient general search algorithms and as such are
appropriate for any optimization problem, including problems
related to protein folding. This is quite intriguing since in
reality protein folding occurs on the single-molecule level.
Protein molecules fold individually (at least in vitro) as single
molecules, and clearly not by a “mix-and-match” strategy on
the population level. The strength of the GA approach and its
ability to describe many biological processes comes from its
unique ability to model cooperative pathways.
V. References
[1]. Holland JH (1975) Adaptation in natural and artificial systems.
The University of Michigan Press, Ann Arbor, MI
[2]. Goldberg DE (1989) Genetic algorithms in search, optimization
and machine learning. Addison-Wesley, Reading, MA
[3]. Huberman BA (1990) Physica D 42:38
[4]. Clearwater SH, Huberman BA, Hogg T (1991) Science 254:1181
[5]. Ramakrishnan C, Ramachandran GN (1965) Biophys J 5:909
[6]. Anfinsen CB, Haber E, Sela M, White FH (1961) Proc Natl Acad
Sci USA 47:1309
[7]. Anfinsen CB (1973) Science 181:223
[8]. Burley SK, Bonanno JB (2003) Methods Biochem Anal 44:591
[9]. Karplus M (1987) The prediction and analysis of mutant
structures. In: Oxender DL, Fox CF (eds) Protein engineering. Liss,
New York
[10]. Roterman IK, Lambert MH, Gibson KD, Scheraga HA (1989) J
Biomol Struct Dyn 7:421
[11]. Even S (1979) Graph algorithms. Computer Science Press,
Rockville, MD
[12]. Unger R, Moult J (1993) Bull Math Biol 55:1183
[13]. Berger B, Leighton TJ (1998) J Comput Biol 5:27
[14]. Levitt M (1982) Annu Rev Biophys Bioeng 11:251
[15]. Dandekar T, Argos P (1997) Protein Eng 10:877