Carmen Nigro
October 9, 2009

Topic: Bioinformatics: Sequence Alignment

Description: This research examines different algorithms for determining relationships between sequences of amino acids or nucleotides from DNA, RNA, or proteins.

Motivation: Sequence alignment can help scientists hypothesize the function of a particular sequence of DNA or protein. Similarities between sequences can imply similarities in function and structure.

References:

D.J. Lipman, S.F. Altschul, and J.D. Kececioglu, “A Tool for Multiple Sequence Alignment”, Proc. Natl. Acad. Sci. USA, Vol. 86, pp. 4412-4415, June 1989.
This article offers an alternative to standard dynamic programming for multiple sequence alignment. Until recently, dynamic programming had been impractical for multiple sequence alignment, as too much computing time would be required to compare more than three sequences. The article presents a program called Multiple Sequence Alignment (MSA), which implements the Carrillo-Lipman algorithm. This program allows for the comparison of up to six sequences. Dynamic programming involves breaking a larger problem down into smaller, more manageable pieces. The basic dynamic programming approach for sequence alignment finds an optimal path through a rectangular path graph. It accomplishes this by turning one sequence into another through a series of edits. Each edit to the sequence is associated with a particular cost, and the goal is to find the series of edits with the lowest total cost. The algorithm employed by MSA reduces the number of cell comparisons by calculating, for each pair of sequences, an upper bound on the cost of the projection of an optimal multiple alignment onto that pair.

R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, and J.D. Thompson, “Multiple sequence alignment with the Clustal series of programs”, Oxford Journals: Nucleic Acids Research, Vol. 31, pp. 3497-3500, 2003.
The Clustal series of programs are the most widely used programs for sequence alignment.
This article describes the Clustal series’ main features. The programs would be well suited to a student research project, because they were designed as a portable set of tools that provide accurate alignments in a reasonable response time. Many web servers have been set up to run the programs as a service; however, these servers limit the size of the input. This limit can be overcome by downloading the free software and running it locally. The article gives a brief history of the Clustal series and its different implementations, but does not go into much detail. The paper touches upon the generation of trees from the multiple alignments through the Neighbor-Joining method, as well as sequence weighting, position-specific gap penalties, and the automatic choice of a suitable residue comparison matrix at each step in the multiple alignment. It may be useful to look further into these features.

J.D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs”, Oxford Journals: Nucleic Acids Research, Vol. 27, pp. 2682-2690, 1999.
This article compares the most widely used programs for multiple sequence alignment. The paper compares ten different alignment programs using BAliBASE, a database of verified alignments, as a reference. The results show that iterative implementations often offer improved accuracy, but at the cost of more computing time. The paper also compares global and local methods. Global alignments attempt to compare every residue of every sequence and are best employed when the sequences are similar and of the same size. Local alignments are best employed for dissimilar sequences that may have similar regions. The success of an alignment strategy depends greatly on the sequences to be aligned.
The distribution of the sequences, the sequence length, and the presence of N/C-terminal extensions may affect the results of any program, and none of the programs studied in this experiment performed well for all three variables. The paper does not analyze computing time and space requirements when comparing the programs.

J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, and D.G. Higgins, “The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools”, Oxford Journals: Nucleic Acids Research, Vol. 25, pp. 4876-4882, 1997.
This article describes the Clustal_X user interface and its useful new features. The article also describes the algorithms used to check alignment quality. The Clustal series builds up a multiple alignment progressively, following a tree. It achieves this by aligning the most closely related groups first and then gradually aligning other similar groups, without changing the earlier alignments. The initial tree is based on pairwise comparisons of the sequences. Progressive alignment strategies depend greatly on how the most closely related sequences are determined. A poor choice of strategy for this initial tree may lead to inaccurate alignments. This approach works best when the sequences are closely related. A mechanism has been added to the program that highlights problem regions of an alignment and allows the user to manually realign these residue ranges. The program also gives the user the option of automatically realigning low-scoring regions.

I.M. Wallace, O. O'Sullivan, and D.G. Higgins, “Evaluation of Iterative Alignment Algorithms for Multiple Alignment”, Oxford Journals: Bioinformatics, Vol. 21, pp. 1408-1414, 2005.
This article compares different iterative algorithms for multiple alignment. The paper analyzes the results of several tests that were run on iterative algorithms.
The paper concludes that iteration may be incorporated into many other methods of sequence alignment to produce even better results. The algorithms implemented and tested were remove first, best first, random, tree-based iterative, and tree-based splitting. The remove first and best first algorithms performed best overall. The remove first method removes one sequence from the alignment at each step of the iteration and realigns it to the remaining alignment. If this new alignment is better, it is used as the input for the next iteration. The best first approach compensates for the greedy nature of the remove first approach. At each iteration, every sequence is removed in turn and realigned to the rest. The alignment with the best score is used as input for the next iteration.

Other Useful Resources:

M.S. Waterman, “Efficient Sequence Alignment Algorithms”, J. theor. Biol., Vol. 108, pp. 333-337, 1984.
[This article evaluates sequence alignment algorithms and compares them using big O notation. The article proposes the use of concave weighting functions in order to increase efficiency.]

H. Rangwala and G. Karypis, “Incremental window-based protein sequence alignment algorithms”, Oxford Journals: Bioinformatics, Vol. 23, pp. e17-e23, 2007.
[This article proposes a new algorithm for sequence alignment, which is based on short fixed- or variable-length high-scoring subsequences. The results show that this algorithm gives results comparable to those of algorithms already in use.]

L.A. Newberg, “Memory efficient dynamic programming backtrace and pairwise local sequence alignment”, Oxford Journals: Bioinformatics, Vol. 24, pp. 1772-1778, 2008.
[Because memory is often insufficient to store every intermediate value of the dynamic programming computation, this article proposes a memory-efficient algorithm that recalculates these intermediate values as they are needed. The article describes the results obtained from experiments with this checkpointing system on pairwise local sequence alignments.]

J. Hérisson, G.
Payen, and R. Gherbi, “A 3D pattern matching algorithm for DNA sequences”, Oxford Journals: Bioinformatics, Vol. 23, pp. 680-686, 2007.
[The article proposes a 3D model for DNA rather than the traditional textual models. A 3D model would allow scientists to study syntax and other properties of DNA.]

T.W. Lam, W.K. Sung, S.L. Tam, C.K. Wong, and S.M. Yiu, “Compressed indexing and local alignment of DNA”, Oxford Journals: Bioinformatics, Vol. 24, pp. 791-797, 2008.
[The article focuses on finding local alignments of DNA sequences by indexing certain sequences of DNA. This is a faster alternative to dynamic programming; however, it is a heuristic-based approach and may not be as accurate.]

J.M. Sauder, J.W. Arthur, and R.L. Dunbrack, Jr., “Large-Scale Comparison of Protein Sequence Alignment Algorithms With Structure Alignments”, Proteins: Structure, Function, and Genetics, Vol. 40, pp. 6-22, 2000.
[This paper compares a number of sequence alignment algorithms and their accuracy for protein sequence alignments.]

A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg, “Fast algorithms for large-scale genome alignment and comparison”, Oxford Journals: Nucleic Acids Research, Vol. 30, pp. 2478-2483, 2002.
[The article proposes a suffix tree algorithm that the authors claim can align entire genome sequences using minimal computing time and memory.]
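
Illustrative sketch: The basic dynamic programming approach summarized under the Lipman et al. reference (turning one sequence into another through a series of edits, each with a cost, and finding the cheapest series) can be sketched for the pairwise case as follows. This is only a minimal Python illustration, not the MSA program or any of the tools cited above. The function name `align` and the unit costs for mismatches and gaps are assumptions chosen for the example.

```python
def align(a, b, mismatch=1, gap=1):
    """Return (cost, aligned_a, aligned_b) for a lowest-cost global alignment.

    The costs are illustrative assumptions: a match costs 0, a substitution
    costs `mismatch`, and an insertion or deletion costs `gap`.
    """
    n, m = len(a), len(b)

    # cost[i][j] holds the lowest cost of turning a[:i] into b[:j].
    # The table is the rectangular grid through which a path is sought.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap          # delete every residue of a[:i]
    for j in range(1, m + 1):
        cost[0][j] = j * gap          # insert every residue of b[:j]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,   # match or substitution
                cost[i - 1][j] + gap,       # gap in b
                cost[i][j - 1] + gap,       # gap in a
            )

    # Walk back from the bottom-right cell to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `align("ACGT", "AGT")` returns a cost of 1 with the alignment `ACGT` over `A-GT`. The table has one cell per pair of prefixes, so time and memory grow with the product of the sequence lengths. Extending the same table to more than two sequences adds one dimension per sequence, which is why the number of cells becomes the limiting factor for multiple alignment.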