Final Paper - Villanova Department of Computing Sciences

BIOINFORMATICS: SEQUENCE ALIGNMENT
Carmen Nigro
Department of Computing Sciences
Villanova University
Villanova, Pennsylvania 19085
carmen.nigro@villanova.edu
ABSTRACT
As more data from DNA and protein sequences are
collected, sequence alignment programs are becoming
increasingly important for analyzing these data. These
alignments can help us learn more about the functions of
certain genes and proteins, and these observations could
ultimately lead to the discovery of cures for certain
genetic diseases or to better insight into the
evolutionary process. There are many different types of
alignment strategies, and choosing the appropriate strategy
depends on the ultimate goal of the alignment. This paper
outlines and compares the basic approaches and
algorithms for sequence alignment.
KEY WORDS
sequence alignment, global alignment, local alignment,
dynamic programming, progressive alignment, iterative
alignment
1. INTRODUCTION
Sequence alignment is an important division of
bioinformatics, which attempts to analyze and compare
the sequences that make up DNA or proteins. Sequence
alignment is a way of comparing two or more sequences
by searching for a series of individual characters that are
in the same order in both sequences [1]. As improved
methods were found for collecting biological data, such as
nucleotide and amino acid sequences, a need arose for
databases that allow easy storage, retrieval, and revision
of the data. Today, bioinformatics
scientists are interested in the analysis and interpretation
of that data. Because these sequences are too long to be
analyzed manually, efficient and accurate
alignment programs are essential for comparing DNA or
protein sequences. The sequences that are being compared
are usually represented by nitrogenous bases for DNA
sequences or amino acids for protein sequences. There are
four different nitrogenous bases which make up DNA,
while there are 20 different amino acids which make up
proteins.
Through sequence alignments, attempts can be made
to identify homologous sequences, that is, sequences with
a common evolutionary origin [1]. The discovery of
homologous sequences may help to predict the
evolutionary process based on segments with mutations
and segments which have remained the same over time.
Sequence alignment also has functional importance, as
sequences that are alike may have the same role or code
for the same entity. The drug industry has benefited from
applying this notion when designing new drugs to treat
certain diseases. Some diseases are caused by the lack of
certain parts of a protein sequence. Sequence alignment
can help to identify those regions, while the lack of parts
of a sequence may be compensated for by injecting the
missing sequence into the protein. Sequence alignment
has also been useful for analyzing protein structure.
Protein molecules that are alike in sequence are also more
likely to have similar structures, as many of the same
bonds will form. In addition, similar protein sequences
have been used to determine protein structure-function
relationships [2].
2. GLOBAL AND LOCAL ALIGNMENTS
Global and local alignments are two different methods
of aligning sequences. Deciding which method to choose
depends on the purpose of the alignment. Global
alignments attempt to compare every residue of every
sequence and are best employed when the sequences are
similar and of roughly the same length, because
sequences of different lengths will produce mismatches
at the ends of an alignment. However, when attempting to align every
element of dissimilar sequences many gaps will be
produced because of the many mismatches between the
two sequences, as seen in figure 1. When comparing two
long sequences, these gaps can become difficult to
analyze. Local alignments are best employed for
dissimilar sequences that may have similar regions [3].
Local alignments are very useful for finding a particular
pattern that exists on both sequences, as that pattern may
also have a similar function. If both sequences are very
similar, it should not make a difference which method is
used, because the alignments should produce similar
results. There is also no difference in time efficiency
between the two methods. The most fundamental global
and local alignment algorithms are based on dynamic
programming: the Needleman-Wunsch algorithm solves
the global alignment problem, while the Smith-Waterman
algorithm solves the local alignment problem.
Figure 1. Examples of local and global alignments
3. PAIRWISE AND MULTIPLE ALIGNMENTS
Figure 2. A simple scoring function
Pairwise alignments attempt to align two sequences at
a time while multiple alignments attempt to align three or
more sequences at a time. Analyzing three or more
sequences at a time can be useful for studying molecular
evolution and analyzing sequence-structure relationships
[1]. Also, the detection of a pattern common to a set of
sequences may only be apparent through multiple
sequence alignment [1]. While the dynamic programming
techniques described above are reliable methods of
alignment, they are not practical to implement for
multiple alignments. Extending the dynamic
programming algorithm to multiple alignment produces an
optimal alignment in O(n^k) time for k sequences of
length n [13]. The running time therefore grows
exponentially with the number of sequences and becomes
unreasonable for comparing more than three sequences at
a time [1]. Due
to the impracticality of using dynamic programming
algorithms to solve the multiple alignment problem, many
heuristic algorithms have been sought after, which
sacrifice accuracy for time efficiency. Heuristic
approaches attempt to optimize pairwise alignments rather
than searching for an overall optimal alignment [15].
Over 75 methods of solving the multiple alignment
problem have been identified and the problem continues
to be central to computational molecular biology [15].
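To make the exponential growth concrete, the following minimal sketch counts the cells that a full dynamic programming lattice would need. It assumes k sequences of equal length n and one extra index per axis for the empty prefix; the function name dp_cells is hypothetical and is not taken from any of the cited works.

```python
def dp_cells(seq_len: int, num_seqs: int) -> int:
    """Cells in a full dynamic programming lattice for num_seqs sequences
    of equal length seq_len (each axis has seq_len + 1 positions, because
    the empty prefix is included)."""
    return (seq_len + 1) ** num_seqs


if __name__ == "__main__":
    # Each added sequence multiplies the lattice size by (seq_len + 1).
    for k in range(2, 6):
        print(f"k = {k}: {dp_cells(100, k):,} cells")
```

Even for short sequences of length 100, every additional sequence multiplies the amount of work by roughly a hundred, which is why the exact method is rarely applied beyond a handful of sequences.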
4. SCORING FUNCTIONS
Scoring schemes are important for sequence
alignment programs, because they are a means of
comparing different alignments. In alignment algorithms
a scoring function must exist so that scores may be
assigned to different alignments based on the number of
gaps and the number of matches. Scores are assigned to
each possible pair of elements based on their chemical
similarity and the evolutionary probability of the
mutation. Gap costs are also an important part of any
sequence alignment program and have been studied
extensively. Gap costs may take into account that a
mutational event may insert or delete multiple elements
[2]. Gap costs must also take into account aligning
elements with nulls, when sequences are of different
lengths. Algorithms that have a fixed penalty for each gap
are popular and are easily extendable to multiple
alignments [2]. An example of a scoring scheme with
fixed penalties can be seen in figure 2.
One type of scoring for multiple alignments is the
Sum-of-pairs score, which increases with the number of
sequences aligned correctly [3]. For multiple alignments,
the sum of the pairs is the total of all alignment costs for
each pair of the sequences in the alignment. A column
score may also be implemented in a multiple sequence
alignment program which tests the capability of the
program to align all of the sequences correctly. Scoring
functions are crucial to any alignment program, because
they directly affect the choice of the optimal alignment.
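The sum-of-pairs idea can be sketched in a few lines. This is an illustrative sketch only: the MATCH, MISMATCH, and GAP values are assumed for the example and are not the values of any particular program, and columns in which both rows hold a gap are scored as zero, which is a common convention rather than something stated in the text.

```python
from itertools import combinations

# Assumed illustrative scores with a fixed penalty per gap position.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one aligned pair of symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # gap aligned with gap: ignored in this sketch
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Total of the pairwise scores over every pair of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(pair_score(a, b) for a, b in zip(row_a, row_b))
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

Because every pair of rows contributes to the total, the work per column grows quadratically with the number of sequences in the alignment.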
5. SEQUENCE ALIGNMENT ALGORITHMS
Significant research into algorithmic approaches to
sequence alignment has been performed over the past 20
years [10]. The most popular and current sequence
alignment algorithms in use today fall into the following
major classifications.
5.1 Dynamic Programming
Dynamic programming involves breaking a larger
problem down into smaller, more manageable pieces. The
basic dynamic programming approach for sequence
alignment finds an optimal path through a rectangular
path graph. It accomplishes this by turning one sequence
into another through a series of edits. Each edit to the
sequence is associated with a particular cost and the
purpose is to find the edits that produce the lowest cost
[1]. This method drastically reduces the number of
alignments to be considered while always producing an
optimal alignment.
Both the Needleman-Wunsch and Smith-Waterman
algorithms are based on the dynamic programming
method and have a time efficiency of O(nm), n and m
being the lengths of the two sequences. However, the
basic dynamic programming algorithm runs in O(n^k) for
multiple alignment, where k is the number of sequences.
The Needleman-Wunsch algorithm works by maximizing the
number of matches and minimizing the number of gaps
needed to align the two sequences. A scoring function
must exist so that scores may be assigned to the
alignments based on the number of matches and the
number of gaps of the alignment. The alignment with the
largest score will be the optimal alignment. It is
implemented through the use of a scoring matrix in which
the horizontal and vertical axes correspond to the two
sequences. The algorithm compares every element of one
sequence to every element of the other sequence and
then traces back to find the optimal alignment. Execution
of the Needleman-Wunsch algorithm can be seen in figure
3.
Figure 3. Sample Execution of the Needleman-Wunsch
algorithm
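The matrix fill and traceback described above can be sketched as follows. This is a minimal sketch under stated assumptions: the match, mismatch, and gap values are illustrative defaults with a fixed penalty per gap position, and when several optimal alignments exist the traceback simply prefers the diagonal move, then a gap in the second sequence.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The two nested loops visit each of the (n + 1)(m + 1) cells once, which is where the O(nm) running time comes from.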
The Smith-Waterman algorithm acts in a similar
manner, but produces a local alignment by finding the
region with the highest similarity. The Smith-Waterman
algorithm may be obtained from the Needleman-Wunsch
algorithm by adjusting the scoring function and changing
the method of tracing back so that it recovers the
highest-scoring matching subsequences.
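A minimal sketch of the local variant is given below. It assumes illustrative match, mismatch, and gap values with a fixed penalty per gap position. Compared with a global dynamic programming alignment, cell scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at zero lets a new local alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

In the usage example the flanking regions share nothing, so only the common core is reported, whereas a global alignment would be forced to pair up the dissimilar ends as well.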
5.2 Progressive Algorithms
Progressive alignment is the most widely used
heuristic method to align a large number of sequences and
operates in O(n^2 k^2) time [4,15]. Progressive methods, also
known as hierarchical or tree methods, produce a multiple
sequence alignment by first aligning the most similar
sequences and then successively adding less related
sequences to the alignment until the entire set of
sequences has been aligned. A guide tree is produced that
determines the order in which the sequences are added to
the alignment. The most related sequences are aligned
first [4]. The tree describing sequence relatedness is
usually produced through pairwise comparisons that may
include heuristic pairwise alignment methods.
This technique is used in many multiple alignment
programs such as MULTALIGN, ClustalW, and T-Coffee
[4]. However, results of progressive alignments depend
heavily on the choice of the most related sequences,
which can sometimes be difficult to determine. Also,
because the alignment is built up progressively, errors
made at any stage in the alignment will be reflected in the
final result. These methods generally perform poorly on
distantly related sequences. Most progressive methods
modify their scoring function by incorporating a
weighting function which assigns scaling factors to
individual sequences based on their distance from their
neighbors in the guide tree. This is used to correct the
order in which the sequences are added to the alignment.
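The ordering step of a progressive method can be illustrated with a simplified sketch. The assumptions here are significant: the similarity measure is Python's difflib.SequenceMatcher ratio, used only as a stand-in for a real pairwise alignment score, and the clustering is a plain greedy average-linkage merge. The names similarity, average_similarity, and guide_order are hypothetical, and this is not the procedure used by ClustalW or any other program named above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real tool would use alignment scores."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def average_similarity(c1: tuple, c2: tuple, seqs: dict) -> float:
    """Mean similarity over all cross-cluster pairs of sequences."""
    total = sum(similarity(seqs[x], seqs[y]) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def guide_order(seqs: dict) -> list:
    """Return the merge order: the most similar clusters are joined first."""
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(pair[0], pair[1], seqs))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    example = {"s1": "GATTACA", "s2": "GATTACC", "s3": "CCCCGGGG"}
    for left, right in guide_order(example):
        print(left, "+", right)
```

The returned list plays the role of the guide tree: each entry names the two groups that would be aligned to each other at that step, with the closest sequences handled first.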
5.3 Iterative Algorithms
Iterative methods have been developed to help solve the
problems surrounding progressive algorithms [4].
Progressive alignments are largely dependent upon the
initial alignment, because it is incorporated into the final
result. In progressive methods, once a sequence has been
aligned, its alignment is not revisited [4]. Iterative
methods optimize an objective function based on an
alignment scoring function by creating an initial global
alignment and then realigning sequence subsets. The
realigned subsets are then aligned to produce the next
iteration’s alignment. This approach has been
implemented in programs such as MUSCLE and
DIALIGN [4].
5.4 Summary
While dynamic programming algorithms produce the
most accurate sequence alignments, they are not always
practical to implement for multiple alignment as their
running time grows exponentially as more sequences
are added to the alignment. Heuristic algorithms, such as
progressive and iterative methods, generally sacrifice
accuracy for the sake of time. These types of algorithms
have been implemented in the most widely used
alignment programs today. It is generally believed that
iterative methods are more accurate than progressive
methods, because they revisit earlier alignments instead
of fixing them once a sequence has been added.
6. PROPOSAL
My proposal aims to identify the specific effects that
iteration may have on a progressive alignment algorithm.
By default, the ClustalW program performs a progressive
alignment only; however, an option has been added which
allows for iteration at each step of the progressive
alignment. My proposed work aims to compare the scores
of multiple alignments when they are aligned with and
without iteration. This work also aims to trace the
magnitude of the effect that the number of sequences has
on the scores of the iterative alignments versus the
non-iterative alignments. This can be easily done by
successively adding more sequences to each alignment.
This proposal also aims to compare the progressive
alignments produced by the program MULTALIGN
with both the iterative and non-iterative alignments
from the ClustalW tests. These two programs are easily
comparable as they both produce global alignments.
In order to compare alignments from different
programs, objective criteria are needed to determine the
quality of an alignment. The BAliBASE benchmark
alignment database would serve as a valuable tool for
comparing alignments from the two programs. The
database contains 142 reference alignments, which could
be used for this project [3].
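One objective criterion of this kind is a column score: the fraction of columns in a reference alignment that a test alignment reproduces exactly. The sketch below is a simplified illustration of that idea and should not be read as the scoring program distributed with BAliBASE. It assumes both alignments list the same sequences in the same order, and the function names columns and column_score are hypothetical.

```python
def columns(alignment: list[str]) -> list[tuple]:
    """Describe each column by the residue index it holds in every row
    (None where the row has a gap), so columns can be compared across
    alignments of the same sequences."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for row, symbol in enumerate(col):
            if symbol == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        cols.append(tuple(entry))
    return cols


def column_score(test: list[str], reference: list[str]) -> float:
    """Fraction of reference columns that appear unchanged in the test."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


if __name__ == "__main__":
    reference = ["AC-T", "ACGT"]
    print(column_score(["AC-T", "ACGT"], reference))
    print(column_score(["ACT-", "ACGT"], reference))
```

A score of 1.0 means the test alignment reproduces every reference column, so the same function could be applied to the output of each program and each iteration setting to obtain comparable numbers.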
This work will help to more clearly identify the
effectiveness of iterative methods compared to
progressive methods. It will also help to identify the most
accurate kinds of sequence alignment programs available
for biologists today. This will help to ensure that
biologists are using the most accurate tools available for
sequence alignment.
Studying and comparing these algorithms could also
help us gain better insight into other optimization
problems. These algorithms may also be applicable to
other fields within computer science, which makes the
refinement of such algorithms even more significant.
While the studies conducted by Julie D. Thompson et
al. and Iain M. Wallace et al. both conclude that iterative
methods are more accurate than progressive methods, this
study intends to take a closer look at the effect of the
number of sequences on the overall alignment [4,5]. It is
hypothesized that iteration will have an even greater
effect on multiple alignments as more sequences are
added to the alignment. This proposed work has been
influenced by and hopes to extend the works of
Thompson et al. and Wallace et al. in the field of iterative
multiple sequence alignment.
Both ClustalW and MULTALIGN programs are
available to download for free at
http://www.ebi.ac.uk/Tools/clustalw2/index.html and
http://www.faculty.ucr.edu/~tgirke/Links.htm
respectively. MULTALIGN runs on a Unix OS, while
ClustalW runs on Windows OS and has been provided
with a friendly user interface called ClustalX, where it
may be specified whether or not to perform iteration for a
specific alignment. One of the main reasons for the
success of the ClustalW program is its ease of use [6].
The BAliBASE database is also available for download at
http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/.
All of the components needed for this project are readily
available and easily accessible from any internet
connection.
The work surrounding the project will include
becoming acquainted with both the ClustalW and
MULTALIGN programs and their underlying algorithms,
performing alignment tests for both the ClustalW and
MULTALIGN programs, and comparing the results using
the BAliBASE database. Tests will also be run on the
ClustalW program with and without iteration on a series
of different multiple alignments containing different
numbers of sequences.
My experiences as both a computer science student and
a biology student will be useful for this project. The
Analysis of Algorithms class will be particularly
useful in analyzing the efficiency of the algorithms, while
my biology class will help me to understand the
needs of the biologist when analyzing alignment
programs. My experiences in both fields have helped
me to become familiar with the terminology of a field of
study that merges the two.
This project is expected to last about two months and a
tentative timetable for the project can be seen in Table 1.
Week    Task
1       Become acquainted with ClustalW
2-4     Run ClustalW tests
5-6     Become acquainted with MULTALIGN
7-8     Run MULTALIGN tests and compare results
Table 1. A tentative schedule for the project
Less time has been set aside to become acquainted with
the ClustalW program, because of its ease of use. It is
believed that the tests described will help us gain a better
understanding of the effects of iteration on progressive
multiple alignment algorithms.
7. CONCLUSION
Multiple sequence alignment is the backbone of
comparative and evolutionary genomics, as it allows for a
number of sequences to be matched against one another at
the same time [13]. Although dynamic programming
algorithms are the most accurate known algorithms for
sequence alignment, they are inefficient for multiple
sequence alignment. Currently, heuristic algorithms are
implemented in the most popular sequence alignment
programs because of their efficiency. However, these
algorithms sacrifice accuracy for time. Additional
research must be carried out to refine these algorithms in
order to increase their accuracy while also maintaining
their efficiency.
REFERENCES
[1]
D.G. Brown, “A survey of seeding for sequence
alignment”, University of Waterloo, Waterloo,
Ontario, Canada, 2007.
[2]
D.J. Lipman, S.F. Altschul, and J.D. Kececioglu,
“A Tool for Multiple Sequence Alignment”,
Proc. Natl. Acad. Sci. USA, Vol. 86, pp. 4412-4415, June 1989.
[3]
H. Rangwala and G. Karypis, “Incremental
window-based protein sequence
alignment algorithms”, Oxford Journals:
Bioinformatics, Vol. 23, pp. e17-e23, 2007.
[4]
I. M. Wallace, O. O'Sullivan, and D. G. Higgins,
“Evaluation of Iterative Alignment Algorithms
for Multiple Alignment”, Oxford Journals:
Bioinformatics, Vol. 21, pp. 1408-1414, 2005.
[5]
J.D. Thompson, F. Plewniak, and O. Poch, “A
comprehensive comparison of multiple sequence
alignment programs”, Oxford Journals: Nucleic
Acids Research, Vol. 27, pp. 2682-2690, 1999.
[6]
J.D. Thompson, T.J. Gibson, F. Plewniak, F.
Jeanmougin, and D. G. Higgins, “The
CLUSTAL_X windows interface: flexible
strategies for multiple sequence alignment aided
by quality analysis tools”, Oxford Journals:
Nucleic Acids Research, Vol. 25, pp. 4876-4882,
1997.
[7]
J. Hérisson, G. Payen, and R. Gherbi, “A 3D
pattern matching algorithm for DNA sequences”,
Oxford Journals: Bioinformatics, Vol. 23, pp.
680-686, 2007.
[8]
J. M. Sauder, J. W. Arthur, and R. L. Dunbrack,
Jr., “Large-Scale Comparison of
Protein Sequence Alignment Algorithms With
Structure Alignments,” Proteins: Structure,
Function, and Genetics, Vol. 40, pp. 6-22, 2000.
[9]
L. A. Newberg, “Memory efficient dynamic
programming backtrace and pairwise local
sequence alignment”, Oxford Journals:
Bioinformatics, Vol. 24, pp. 1772-1778, 2008.
[10]
L. Delcher, A. Phillippy, J. Carlton and S. L.
Salzberg, “Fast algorithms for large-scale
genome alignment and comparison”, Oxford
Journals: Nucleic Acids Research, Vol. 30, pp.
2478-2483, 2002.
[11]
M.S. Waterman, “Efficient Sequence Alignment
Algorithms”, J. theor. Biol., Vol. 108, pp. 333-337, 1984.
[12]
R. Chenna, H. Sugawara, T. Koike, R. Lopez,
T.J. Gibson, D.G. Higgins, and J.D. Thompson,
“Multiple sequence alignment with the Clustal
series of programs”, Oxford Journals: Nucleic
Acids Research, Vol. 31, pp. 3497-3500, 2003.
[13]
S. Kumar, A. Filipski, “Multiple Sequence
Alignment: In pursuit of homologous DNA
positions”, Cold Spring Harbor Laboratory
Press: Genome Research, Vol. 17, pp. 127-135,
2007.
[14]
T. W. Lam, W. K. Sung, S. L. Tam, C. K.
Wong, and S. M. Yiu, “Compressed indexing
and local alignment of DNA”, Oxford Journals:
Bioinformatics, Vol. 24, pp. 791-797, 2008.
[15]
Y. Bilu, P. K. Agarwal, R. Kolodny, “Faster
Algorithms for Optimal Multiple Sequence
Alignment Based on Pairwise Comparisons”,
IEEE/ACM Transactions on Computational
Biology and Bioinformatics, Vol. 3, pp. 408-422,
2006.