Annotated Bibliography

Carmen Nigro
October 9, 2009
Topic: Bioinformatics: Sequence Alignment
Description: This research examines different algorithms for determining relationships between
sequences of amino acids or nucleotides from DNA, RNA, or proteins.
Motivation: Sequence alignment can help scientists hypothesize the function of a particular
sequence of DNA or protein. Similarities in different sequences can imply similarities in function
and structure.
References:

D.J. Lipman, S.F. Altschul, and J.D. Kececioglu, “A Tool for Multiple Sequence
Alignment”, Proc. Natl. Acad. Sci. USA, Vol. 86, pp. 4412-4415, June 1989.
This article offers an alternative to dynamic programming for multiple sequence
alignment. Until recently, dynamic programming has been impractical for multiple
sequence alignment, as too much computing time would be required to compare more
than three sequences. The article presents a program, called Multiple Sequence
Alignment (MSA), that implements the Carrillo-Lipman algorithm. This program allows for
the comparison of up to six sequences. Dynamic programming involves breaking a larger
problem down into smaller, more manageable pieces. The basic dynamic programming
approach for sequence alignment finds an optimal path through a rectangular path graph.
It accomplishes this by turning one sequence into another through a series of edits. Each
edit to the sequence is associated with a particular cost, and the goal is to find the series of
edits with the lowest total cost. The algorithm employed by MSA reduces the number of
cells that must be examined by calculating, for each pair of sequences, an upper bound on
the cost of the projection of an optimal multiple alignment onto that pair.
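The edit-cost formulation described in this annotation can be sketched for a single pair of sequences as follows. This is only an illustration of the basic dynamic programming idea, not the MSA program or the Carrillo-Lipman bound; the function name and the unit mismatch and gap costs are assumptions chosen for the example.

```python
def alignment_cost(a, b, mismatch=1, gap=1):
    """Lowest total edit cost of turning sequence a into sequence b.

    The costs are illustrative assumptions: a match costs 0, a
    substitution costs `mismatch`, an insertion or deletion costs `gap`.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] = cheapest way to turn a[:i] into b[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap          # delete every residue of a[:i]
    for j in range(1, cols):
        cost[0][j] = j * gap          # insert every residue of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            substitute = cost[i - 1][j - 1] + (0 if same else mismatch)
            delete = cost[i - 1][j] + gap
            insert = cost[i][j - 1] + gap
            cost[i][j] = min(substitute, delete, insert)
    return cost[-1][-1]


if __name__ == "__main__":
    print(alignment_cost("ACGT", "AGT"))
```

Every cell of the rectangular table is filled here, which is what becomes too expensive when the table gains one dimension per additional sequence.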

R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, and J.D.
Thompson, “Multiple sequence alignment with the Clustal series of programs”, Oxford
Journals: Nucleic Acids Research, Vol. 31, pp. 3497-3500, 2003.
The Clustal series of programs are the most widely used programs for sequence
alignment. This article describes the Clustal series’ main features. The programs would
be very suitable for conducting a student research project, because they were designed to
be a portable set of tools that provide accurate alignments in a reasonable response time.
Many web servers have been set up to run the program as a service; however, these
servers limit the size of the input. This limit can be overcome by downloading the free
software and running it locally. The article runs through a brief history of the Clustal
series and its different implementations, but does not go very deep into detail. The paper
touches upon the generation of trees from the multiple alignments through the
Neighbor-Joining method. The paper also touches upon sequence weighting, position-specific gap
penalties and the automatic choice of a suitable residue comparison matrix at each step in
the multiple alignment. It may be useful to look further into these features.

J.D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple
sequence alignment programs”, Oxford Journals: Nucleic Acids Research, Vol. 27, pp.
2682-2690, 1999.
This article compares the most widely used programs for multiple sequence alignment.
The paper compares ten different alignment programs using BAliBASE, a database of
verified alignments, as a reference. The results show that iterative implementations often
offer improved accuracy, but at the cost of more computing time. The paper also
compares global and local methods. Global alignments attempt to compare every residue
of every sequence and are best employed when the sequences are similar and are of the
same size. Local alignments are best employed for dissimilar sequences that may have
similar regions. The success of an alignment strategy depends greatly on the
sequences to be aligned. The re-partition of sequences, the sequence length, and the
presence of N/C-terminal extensions may affect the results of any program and none of
the programs studied in this experiment performed well for all three variables. The paper
does not analyze computer time and space requirements when comparing the programs.
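The contrast between global and local alignment can be illustrated with a small scoring sketch. The function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for this example and are not taken from the paper or from any of the programs it compares.

```python
def _pair_score(x, y, match, mismatch):
    return match if x == y else mismatch


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score when every residue of both sequences must be aligned."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + _pair_score(a[i - 1], b[j - 1], match, mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over any pair of subsequences; scores never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + _pair_score(a[i - 1], b[j - 1], match, mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    short, long_seq = "ACGT", "TTTTACGTTTTT"
    print(global_score(short, long_seq), local_score(short, long_seq))
```

For a short sequence embedded in a longer, otherwise unrelated one, the global score is dragged down by the gaps needed to span the full length, while the local score reflects only the shared region.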

J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, “The
CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment
aided by quality analysis tools”, Oxford Journals: Nucleic Acids Research, Vol. 25, pp.
4876-4882, 1997.
This article describes the Clustal_X user interface and useful new features. The article
also describes the algorithms used to check alignment quality. The Clustal series builds
up a multiple alignment progressively, following a guide tree. It aligns the most closely
related groups first and then gradually aligns more distant groups to them, without
changing the earlier alignments. The initial tree is based on pairwise comparison of the
sequences. Progressive alignment strategies are greatly
dependent on how the most related sequences are determined. A poor choice of strategy
for this initial tree may lead to inaccurate alignments. This approach works best when the
sequences are closely related. A mechanism has been added to the program that
highlights problem regions of an alignment and allows the user to manually realign these
residue ranges. The program also provides the user with the option of automatically
realigning low scoring regions.

I. M. Wallace, O. O’Sullivan, and D. G. Higgins, “Evaluation of Iterative Alignment
Algorithms for Multiple Alignment”, Oxford Journals: Bioinformatics, Vol. 21, pp.
1408-1414, 2005.
This article compares different iterative algorithms for multiple alignment. The paper
analyzes the results of several tests that were run on iterative algorithms. The paper
concludes that iteration may be incorporated into many other methods of sequence
alignment to produce even better results. The algorithms implemented and tested were
remove first, best first, random, tree-based iterative, and tree-based splitting. The
remove first and best first algorithms performed best overall. The remove first method
removes one sequence from the alignment at each step of the iteration and realigns it
to the remaining alignment. If this new alignment is better, it is used as the
input for the next iteration. The best first approach compensates for the greedy nature of
the remove first approach. At each iteration, every sequence is removed and realigned to
the rest. The alignment with the best score is used as input for the next iteration.
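The remove first loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation evaluated in the paper: the function names are hypothetical, the score is a simple count of identical residue pairs per column, and the realignment step only slides the ungapped sequence to its best offset instead of performing a real sequence-to-profile alignment.

```python
def sum_of_pairs(rows):
    """Count identical non-gap residue pairs over all columns (assumed score)."""
    total = 0
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        for x in range(len(residues)):
            for y in range(x + 1, len(residues)):
                total += residues[x] == residues[y]
    return total


def slide_realign(rest, removed, index):
    """Toy realignment: reinsert the ungapped sequence at its best offset."""
    width = len(rest[0])
    seq = removed.replace("-", "")
    candidates = []
    for offset in range(width - len(seq) + 1):
        row = "-" * offset + seq + "-" * (width - offset - len(seq))
        candidates.append(rest[:index] + [row] + rest[index:])
    return max(candidates, key=sum_of_pairs)


def remove_first_refine(alignment, max_rounds=10):
    """Remove one sequence at a time, realign it, keep the result if better."""
    best = list(alignment)
    best_score = sum_of_pairs(best)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            rest = best[:i] + best[i + 1:]
            candidate = slide_realign(rest, best[i], i)
            candidate_score = sum_of_pairs(candidate)
            if candidate_score > best_score:
                # The improved alignment becomes the input for the next step.
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break
    return best, best_score


if __name__ == "__main__":
    refined, score = remove_first_refine(["ACGT--", "-ACGT-", "--ACGT"])
    print(refined, score)
```

A best first variant would instead try removing and realigning every sequence in each iteration and carry forward only the single best-scoring result.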
Other Useful Resources:

M.S. Waterman, “Efficient Sequence Alignment Algorithms”, J. theor. Biol., Vol. 108,
pp. 333-337, 1984.
[This article evaluates sequence alignment algorithms and compares them using big O
notation. The article proposes the use of concave weighting functions in order to increase
efficiency.]

H. Rangwala and G. Karypis, “Incremental window-based protein sequence
alignment algorithms”, Oxford Journals: Bioinformatics, Vol. 23, pp. e17-e23, 2007.
[This article proposes a new algorithm for sequence alignment, which is based on short
fixed- or variable-length high-scoring subsequences. The results show that this algorithm
gives comparable results to algorithms already in use.]

L. A. Newberg, “Memory efficient dynamic programming backtrace and pairwise local
sequence alignment”, Oxford Journals: Bioinformatics, Vol. 24, pp. 1772-1778, 2008.
[Because storing all intermediate values of a dynamic programming calculation can
require too much memory, this article proposes a memory-efficient algorithm that
recalculates these intermediate values as they are needed. The article describes the results
obtained from experiments with this checkpointing system on pairwise local sequence
alignment.]

J. Hérisson, G. Payen, and R. Gherbi, “A 3D pattern matching algorithm for DNA
sequences”, Oxford Journals: Bioinformatics, Vol. 23, pp. 680-686, 2007.
[The article proposes a 3D model for DNA rather than the traditional textual models. A
3D model would allow scientists to study syntax and other properties of DNA.]

T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu, “Compressed indexing
and local alignment of DNA”, Oxford Journals: Bioinformatics, Vol. 24, pp. 791-797,
2008.
[The article focuses on finding local alignments of DNA sequences through indexing
certain sequences of DNA. This is a faster alternative to dynamic programming;
however, it is a heuristic-based approach and may not be as accurate.]

J. M. Sauder, J. W. Arthur, and R. L. Dunbrack, Jr., “Large-Scale Comparison of
Protein Sequence Alignment Algorithms With Structure Alignments,” Proteins:
Structure, Function, and Genetics, Vol. 40, pp. 6-22, 2000.
[This paper compares a number of sequence alignment algorithms and their accuracy for
protein sequence alignments.]

A. L. Delcher, A. Phillippy, J. Carlton and S. L. Salzberg, “Fast algorithms for large-scale
genome alignment and comparison”, Oxford Journals: Nucleic Acids Research, Vol. 30,
pp. 2478-2483, 2002.
[The article proposes a suffix tree algorithm that the authors claim can align entire
genome sequences using minimal computer time and memory.]