How Does BLAST Work?

BLAST, the Basic Local Alignment Search Tool, was developed by Altschul et al. in 1990 (see reference [1]). The original version reported only ungapped alignments and is described in your textbook. In 1997, Altschul et al. released a new version, BLAST 2.0, that runs faster than the original BLAST without sacrificing sensitivity and reports some gapped alignments (see reference [2]). In this note I will briefly describe how this later version works.

Disclaimer: The following description is based on the information given in the article [2], not on knowledge of the actual source code of BLAST 2.0. Where the description in [2] is ambiguous, I have tried to resolve the ambiguity as best I could, but the actual code may resolve it differently.

BLAST 2.0 takes as input a query sequence s and searches a protein or DNA database for high-scoring local alignments with s. For the purpose of this note, we may think of the database as a very long sequence t of amino acids or nucleotides. We will assume here that the search involves amino acid sequences; searches involving DNA sequences use the same ideas.

At the heart of the BLAST 2.0 search strategy lies the notion of high-scoring pairs of words. The program uses a parameter len that is typically about 3 for protein database searches (about 11 for DNA database searches). Every string w of length len is called a word. Every pair of words (w,u) receives a score sim(w,u) that is determined by the character substitution matrix p. In particular, if len = 3, w = w[1]w[2]w[3] and u = u[1]u[2]u[3], then

    sim(w,u) = p(w[1],u[1]) + p(w[2],u[2]) + p(w[3],u[3]).

BLAST looks for pairs of words (w,u) such that w is a word in s, u is a word in t, and sim(w,u) ≥ T, where the threshold T is another parameter of the model. Each such high-scoring pair of words is called a hit.

There is ample empirical evidence that almost all high-scoring local alignments contain at least two nearby hits.
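The word score and the definition of a hit can be made concrete with a small Python sketch. This is only an illustration under my own assumptions, not the way BLAST 2.0 is implemented: the names match_mismatch, sim and find_hits are hypothetical, the substitution matrix is a toy +1/-1 scheme, and the scan compares every word of s with every word of t, which a real database search could not afford. Positions are 1-based, as in the rest of this note.

```python
def match_mismatch(a, b):
    """Toy substitution matrix p (an assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def sim(w, u, p=match_mismatch):
    """Score of a word pair: sum of p over corresponding letters."""
    return sum(p(a, b) for a, b in zip(w, u))


def find_hits(s, t, length=3, T=3, p=match_mismatch):
    """Return all hits as (x, y) pairs of 1-based start positions in s and t.

    Brute-force illustration: every word of s is compared with every
    word of t, and the pair is kept when sim(w, u) >= T.
    """
    hits = []
    for x in range(len(s) - length + 1):
        w = s[x:x + length]
        for y in range(len(t) - length + 1):
            u = t[y:y + length]
            if sim(w, u, p) >= T:
                hits.append((x + 1, y + 1))
    return hits


if __name__ == "__main__":
    # Made-up toy sequences, chosen only to show the output format.
    print(find_hits("ACDEF", "GGCDEFA"))
```

With len = 3 and T = 3 under this toy matrix, only identical words qualify as hits; lowering T would also admit words that differ in one letter.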
Because high-scoring alignments usually contain at least two nearby hits, BLAST 2.0, once it has found a hit, searches for a second, non-overlapping hit on the same diagonal within a window of size A. To understand the meaning of these terms, consider two hits (w,u) and (w',u'). Let x and x' denote the positions of the first letters of w and w' respectively (in the sequence s), and let y and y' denote the positions of the first letters of u and u' respectively (in the sequence t). Then (w,u) and (w',u') are

    on the same diagonal             if x' - x = y' - y,
    within a window of size A        if |x' - x| < A,
    non-overlapping                  if |x' - x| ≥ len.

Now suppose BLAST 2.0 has found a seed, that is, two non-overlapping hits (w,u) and (w',u') on the same diagonal within a window of size A. Assume (w,u) sits to the left of (w',u'). Let x and y denote the positions of the first letters of w and u respectively, and from here on let x' and y' denote the positions of the last (rightmost) letters of w' and u' respectively. Now the program does something like the following (the actual implementation may be different; see the above disclaimer):

    highscore ← 0
    for j ← 0 to x' - x do
        highscore ← highscore + p(s[x+j], t[y+j])
    // highscore is now equal to the score for aligning the segments
    // that run from the start of the first hit to the end of the second hit

    // Next the seed is extended to the left
    m ← 0
    left ← 0
    actualscore ← highscore
    repeat
        m ← m + 1
        actualscore ← actualscore + p(s[x-m], t[y-m])
        if highscore < actualscore then
            highscore ← actualscore
            left ← m
    until actualscore < highscore - C

    // Now the seed is extended to the right
    m ← 0
    right ← 0
    actualscore ← highscore
    repeat
        m ← m + 1
        actualscore ← actualscore + p(s[x'+m], t[y'+m])
        if highscore < actualscore then
            highscore ← actualscore
            right ← m
    until actualscore < highscore - C

The number C, which determines the cutoff point for extending seeds, is another parameter of the algorithm. An extension must also stop when it reaches either end of s or t; this check is omitted above for brevity.
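The seed test and the ungapped extension can also be written out as a short runnable Python sketch. As with the pseudocode, this follows my reading of the description and not the BLAST 2.0 source. The function names is_seed and extend_seed, the toy +1/-1 matrix, and the explicit checks that stop an extension at the ends of the sequences are my own assumptions. Positions are 1-based.

```python
def match_mismatch(a, b):
    """Toy substitution matrix p (an assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def is_seed(hit1, hit2, length=3, A=10):
    """Check whether two hits, given as (x, y) start positions, form a seed."""
    (x, y), (x2, y2) = hit1, hit2
    same_diagonal = (x2 - x) == (y2 - y)
    within_window = abs(x2 - x) < A
    non_overlapping = abs(x2 - x) >= length
    return same_diagonal and within_window and non_overlapping


def extend_seed(s, t, x, y, x_end, C=1, p=match_mismatch):
    """Ungapped extension of a seed.

    x, y  : 1-based positions of the first letters of the left hit in s and t
    x_end : 1-based position of the last letter of the right hit in s
    Returns (s_start, s_end, t_start, t_end, score) of the resulting HSP.
    """
    y_end = y + (x_end - x)  # same diagonal, so the offsets agree
    # Score of the segment from the start of the left hit to the end of the right hit.
    highscore = sum(p(s[x - 1 + j], t[y - 1 + j]) for j in range(x_end - x + 1))

    # Extend to the left; the boundary test is an added assumption.
    left, actual, m = 0, highscore, 0
    while x - m > 1 and y - m > 1:
        m += 1
        actual += p(s[x - 1 - m], t[y - 1 - m])
        if actual > highscore:
            highscore, left = actual, m
        if actual < highscore - C:
            break

    # Extend to the right, starting again from the best score so far.
    right, actual, m = 0, highscore, 0
    while x_end + m < len(s) and y_end + m < len(t):
        m += 1
        actual += p(s[x_end - 1 + m], t[y_end - 1 + m])
        if actual > highscore:
            highscore, right = actual, m
        if actual < highscore - C:
            break

    return (x - left, x_end + right, y - left, y_end + right, highscore)


if __name__ == "__main__":
    # Made-up toy sequences with hits at (2, 3) and (7, 8) on one diagonal.
    s, t = "GACDEWFGHKP", "TTACDEYFGHRR"
    if is_seed((2, 3), (7, 8)):
        print(extend_seed(s, t, x=2, y=3, x_end=9))
```

Note that the right extension starts from the best score reached after the left extension, exactly as in the pseudocode, so the final value of highscore is the score of the whole HSP.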
After execution of the above instructions we have found an HSP (high-scoring segment pair) at positions x - left through x' + right of the query sequence s and positions y - left through y' + right of the sequence t, with its score stored in the variable highscore. If highscore is above a certain threshold Sg (another parameter of the algorithm), then BLAST 2.0 triggers a search for the optimal gapped extension of the HSP. This part of BLAST 2.0 is similar to the Smith-Waterman algorithm for local alignment, but gap penalties are affine and various time-saving tricks are used.

References

[1] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology 215, 403-410.

[2] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17), 3389-3402.

Homework

Consider the following amino acid "query sequence":

    GGACTWQQQEHIKFPIAACNR

Assume BLAST 2.0 searches it against the "database":

    ARRRGHFYVQQQEKFPCTWQWQEHKLFPIDASCMKLTEVY

with the following parameters:

    len = 3
    A = 10
    T = 3
    C = 1

Assume that the character substitution matrix p awards a score of +1 for matches and -1 for mismatches.

(a) Which hits will BLAST 2.0 find?

(b) Which HSPs will BLAST 2.0 find, and what will be their scores?

Report the first and last positions of hits and HSPs. Ignore the question of whether a gapped extension would be triggered.

Worth: 3 points