BLAST2

How Does BLAST Work?
BLAST, the Basic Local Alignment Search Tool, was developed by Altschul et al. in 1990 (see
reference [1]). The original version reported only ungapped alignments and is described in your
textbook. In 1997, Altschul et al. released a new version, BLAST 2.0, that runs faster than the
original BLAST without sacrificing sensitivity and reports some gapped alignments (see
reference [2]). In this note I will briefly describe how this later version works. Disclaimer: The
following description is based on the information given in article [2], not on knowledge of the
actual source code of BLAST 2.0. Where the description in [2] is ambiguous, I have tried to
resolve the ambiguity as best I could, but the actual code may resolve it differently.
BLAST 2.0 takes as input a query sequence s and searches a protein or DNA database
for high-scoring local alignments between s and the database. For the purpose of this note, we
may think of the database as a very long sequence t of amino acids or nucleotides. We will
assume here that the search involves amino acid sequences; searches involving DNA sequences
use the same ideas. At the heart of the BLAST 2.0 search strategy lies the notion of high-scoring
pairs of words. The program uses a parameter len that is typically about 3 for protein database
searches (about 11 for DNA database searches). Every string w of length len is called a word.
Every pair of words (w,u) receives a score sim(w,u) that is determined by the character
substitution matrix p: it is the sum of the scores of the aligned letters. In
particular, if len = 3, w = w[1]w[2]w[3] and u = u[1]u[2]u[3], then
sim(w,u) = p(w[1],u[1]) + p(w[2],u[2]) + p(w[3],u[3]).
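The word score can be illustrated with a minimal Python sketch. The function names and the toy scoring function (+1 for a match, -1 for a mismatch) are assumptions made for illustration only; a real search would look the scores up in a substitution matrix.

```python
def match_mismatch(a, b):
    """Toy stand-in for the substitution matrix p (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def sim(w, u, p=match_mismatch):
    """Score of a word pair: the sum of p over the aligned letters of w and u."""
    if len(w) != len(u):
        raise ValueError("words must have the same length")
    return sum(p(a, b) for a, b in zip(w, u))


print(sim("QQE", "QQE"))  # three matches
print(sim("QQE", "QQD"))  # two matches, one mismatch
```

With this toy scoring and len = 3, a word pair reaches the maximum score of 3 only when the two words are identical.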
BLAST looks for pairs of words (w,u) such that w is a word in s, u is a word in t, and
sim(w,u) ≥ T, where the threshold T is another parameter of the model. Each such high-scoring
pair of words is called a hit. There is ample empirical evidence that almost all high-scoring local
alignments contain at least two nearby hits. Thus once it has found a hit, BLAST 2.0 searches for
a second, non-overlapping hit on the same diagonal within a window of size A. To understand the
meaning of these terms, consider two hits (w,u) and (w',u'). Let x and x' denote the positions of
the first letters of w and w' respectively (in the sequence s), and let y and y' denote the positions of
the first letters of u and u' respectively (in the sequence t). Then (w,u) and (w',u') are on the same
diagonal if x' - x = y' - y. (w,u) and (w',u') are within a window of size A if |x' - x| < A, and (w,u)
and (w',u') are non-overlapping if |x' - x| ≥ len.
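The definitions of hit, diagonal, window, and overlap can be sketched in Python as follows. This is only an illustration under stated assumptions: positions are 0-based, the scoring function is the toy +1/-1 scheme, the parameter is called `length` to avoid shadowing Python's built-in `len`, and hits are found by a naive scan over all word pairs, which is certainly not how a production search would do it.

```python
def match_mismatch(a, b):
    """Toy stand-in for the substitution matrix p (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def find_hits(s, t, p=match_mismatch, length=3, T=3):
    """Return all (x, y) such that the words s[x:x+length] and t[y:y+length] score >= T.

    x and y are 0-based start positions; the scan is quadratic and for illustration only.
    """
    hits = []
    for x in range(len(s) - length + 1):
        for y in range(len(t) - length + 1):
            score = sum(p(a, b) for a, b in zip(s[x:x + length], t[y:y + length]))
            if score >= T:
                hits.append((x, y))
    return hits


def same_diagonal(hit1, hit2):
    """Two hits lie on the same diagonal if x' - x == y' - y."""
    (x, y), (x2, y2) = hit1, hit2
    return x2 - x == y2 - y


def is_seed(hit1, hit2, length=3, A=10):
    """True if the hits are on one diagonal, non-overlapping, and within a window of size A."""
    distance = abs(hit2[0] - hit1[0])
    return same_diagonal(hit1, hit2) and length <= distance < A


hits = find_hits("ACDEFGHIK", "ACDXXGHIK")
print(hits)
print(is_seed(hits[0], hits[1]))
```

The example sequences are made up for this sketch and are unrelated to the homework below.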
Now suppose BLAST 2.0 has found a seed, that is, two non-overlapping hits (w,u) and (w',u') on
the same diagonal within a window of size A. Assume (w,u) sits to the left of (w',u'), let x and y
denote the positions of the first letters of w and u respectively, and let x' and y' denote the
positions of the last (rightmost) letters of w' and u' respectively. Now the program does
something like the following (the actual implementation may differ; see the disclaimer above):
highscore ← 0
for j ← 0 to x' - x do
    highscore ← highscore + p(s[x+j],t[y+j])
//highscore is now equal to the score for aligning the segments flanked by the two hits
//Next the seed is extended to the left
m ← 0
left ← 0
actualscore ← highscore
repeat
    m ← m + 1
    actualscore ← actualscore + p(s[x-m],t[y-m])
    if highscore < actualscore then
        highscore ← actualscore
        left ← m
until actualscore < highscore - C
//Now the seed is extended to the right
m ← 0
right ← 0
actualscore ← highscore
repeat
    m ← m + 1
    actualscore ← actualscore + p(s[x'+m],t[y'+m])
    if highscore < actualscore then
        highscore ← actualscore
        right ← m
until actualscore < highscore - C
The number C for determining the cutoff point for extending seeds is another parameter of the
algorithm. After execution of the above instructions we have found an HSP (high-scoring
segment pair) at positions x - left through x' + right of the query sequence s and positions y - left
through y' + right of the sequence t, whose score is stored in the variable highscore. Now if highscore
is above a certain threshold Sg (another parameter of the algorithm), then BLAST 2.0 triggers a
search for the optimal gapped extension of the HSP. This part of BLAST 2.0 is similar to the
Smith-Waterman algorithm for local alignment, but gap penalties are affine and various time-saving tricks are used.
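The ungapped extension of a seed described above can be sketched in Python as follows. As with the description itself, this is not taken from the BLAST 2.0 source code. The function names, the 0-based positions, and the explicit checks that stop the extension at the ends of s and t are assumptions added for this sketch.

```python
def match_mismatch(a, b):
    """Toy stand-in for the substitution matrix p (assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def _extend(s, t, i, j, step, start_score, p, C):
    """Extend from (i, j) in direction step (+1 right, -1 left).

    Returns (best score seen, number of letters added to reach it). Stops when the
    running score drops more than C below the best score, or at a sequence end.
    """
    best = score = start_score
    best_m = m = 0
    i, j = i + step, j + step
    while 0 <= i < len(s) and 0 <= j < len(t):
        m += 1
        score += p(s[i], t[j])
        if score > best:
            best, best_m = score, m
        if score < best - C:
            break
        i, j = i + step, j + step
    return best, best_m


def extend_seed(s, t, x, y, x_end, p=match_mismatch, C=1):
    """Ungapped extension of a seed.

    x, y: 0-based first positions of the left hit in s and t.
    x_end: 0-based position in s of the last letter of the right hit.
    Returns (highscore, left, right).
    """
    y_end = y + (x_end - x)  # same diagonal, so the offset is identical in t
    # Score of the segment flanked by the two hits (both hits included).
    highscore = sum(p(s[x + k], t[y + k]) for k in range(x_end - x + 1))
    highscore, left = _extend(s, t, x, y, -1, highscore, p, C)
    highscore, right = _extend(s, t, x_end, y_end, +1, highscore, p, C)
    return highscore, left, right


print(extend_seed("ACDEFGHIK", "ACDXXGHIK", 0, 0, 7))
```

The resulting HSP covers positions x - left through x_end + right of s and y - left through y_end + right of t. The example sequences are made up for this sketch and are unrelated to the homework below.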
References
[1] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) A basic local
alignment search tool. Journal of Molecular Biology 215, 403-410.
[2] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman,
D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Research 25(17), 3389-3402.
Homework
Consider the following amino acid "query sequence":
GGACTWQQQEHIKFPIAACNR
Assume BLAST 2.0 searches it against the "database":
ARRRGHFYVQQQEKFPCTWQWQEHKLFPIDASCMKLTEVY
with the following parameters:
len = 3
A = 10
T = 3
C = 1
Assume that the character substitution matrix p awards a score of +1 for matches and -1 for
mismatches.
(a) Which hits will BLAST 2.0 find?
(b) Which HSPs will BLAST 2.0 find, and what will be their scores?
Report the first and last positions of hits and HSPs. Ignore the question of whether a gapped
extension would be triggered.
Worth: 3 points