Description and Pseudocode for the Dynamic Programming Algorithm
to Solve the String Alignment Problem
The string alignment problem asks us to compute the “best” alignment of two strings,
given penalties for character mismatches and/or character skips.
For example, if S1 = “WRITERS” and S2 = “VINTNER”, the cost of a mismatch is 1,
and the cost of a skip is 1, then one possible optimal alignment is given by:
W R I T - E R S
V I N T N E R -
The penalty for this alignment is 5, since there are 3 mismatched character pairs and
2 skipped characters.
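The penalty count above can be checked mechanically. The following is a minimal sketch, assuming the alignment is written as two equal-length rows with "-" marking a skip (including the skip opposite the final S); the function name alignment_penalty and its parameter names are illustrative, not part of the handout.

```python
def alignment_penalty(top, bottom, mismatch_penalty=1, skip_penalty=1, gap="-"):
    """Return the total penalty of two already-aligned, equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            total += skip_penalty      # a character is paired with a skip
        elif a != b:
            total += mismatch_penalty  # two different characters are paired
        # equal characters cost nothing
    return total


# 3 mismatches (W/V, R/I, I/N) plus 2 skips (N and S) gives 5
print(alignment_penalty("WRIT-ERS", "VINTNER-"))
```

This only scores a given alignment; it does not search for the best one.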
We present a dynamic programming algorithm that solves this problem using a table T
containing the solutions of subproblems: T(i, j) holds the minimum penalty for aligning
the first i characters of S1 with the first j characters of S2.
The algorithm:
Input Section
1) Input S1 and S2, two strings. (e.g. S1 = “WRITERS”, S2 = “VINTNER”)
2) Let n be the original length of S1 and m the original length of S2. (e.g. n = 7, m = 7)
3) Pad the beginning of each string with a blank so that its characters are indexed from 1.
(e.g. S1 = “_WRITERS” and S2 = “_VINTNER”)
4) Get skipPenalty and mismatchPenalty. (from the file or user depending on the
program)
Initialization
5) Create an initially empty n+1 by m+1 array, T.
6) Fill the first column with multiples of skipPenalty,
i.e., T(i, 0) = i*skipPenalty for i from 0 to n inclusive
7) Fill the first row with multiples of skipPenalty,
i.e., T(0, j) = j*skipPenalty for j from 0 to m inclusive
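Steps 5 to 7 can be sketched as follows. This is only an illustration, assuming the table is stored as a list of lists indexed T[i][j]; the helper name init_table is hypothetical.

```python
def init_table(n, m, skip_penalty):
    """Build an (n+1) x (m+1) table with row 0 and column 0 filled in."""
    # Interior cells start at 0 as placeholders; the main loop overwrites them.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = i * skip_penalty  # i characters of S1 against nothing
    for j in range(m + 1):
        T[0][j] = j * skip_penalty  # j characters of S2 against nothing
    return T


for row in init_table(3, 4, 2):
    print(row)
```

The border cells are the base cases: aligning a prefix with the empty string can only be done by skipping every character of that prefix.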
Main Loop(s)
8) for i going from 1 to n inclusive
9)    for j going from 1 to m inclusive
10)      T(i, j) = the minimum of the following four things
         (only one of the last two applies for any given i and j):
            T(i-1, j) + skipPenalty          // S1(i) is paired with a skip in S2
            T(i, j-1) + skipPenalty          // S2(j) is paired with a skip in S1
            T(i-1, j-1)                      // only if S1(i) = S2(j), i.e. match
            T(i-1, j-1) + mismatchPenalty    // only if S1(i) ≠ S2(j), i.e. mismatch
Output
11) The answer is sitting in T(n,m)
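Steps 1 to 11 can be put together as in the sketch below. It is one possible rendering of the pseudocode, not a definitive implementation; the function name min_alignment_penalty and the default penalties of 1 are assumptions made for the example.

```python
def min_alignment_penalty(s1, s2, mismatch_penalty=1, skip_penalty=1):
    """Return T(n, m), the minimum penalty of aligning s1 with s2."""
    n, m = len(s1), len(s2)
    a, b = "_" + s1, "_" + s2  # step 3: pad so characters are indexed from 1

    # Steps 5-7: table with the first row and column initialised.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = i * skip_penalty
    for j in range(m + 1):
        T[0][j] = j * skip_penalty

    # Steps 8-10: fill the rest of the table row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag_cost = 0 if a[i] == b[j] else mismatch_penalty
            T[i][j] = min(
                T[i - 1][j] + skip_penalty,   # a[i] paired with a skip
                T[i][j - 1] + skip_penalty,   # b[j] paired with a skip
                T[i - 1][j - 1] + diag_cost,  # a[i] paired with b[j]
            )
    return T[n][m]  # step 11


print(min_alignment_penalty("WRITERS", "VINTNER"))  # 5
```

The two nested loops visit each cell once, so the running time and the memory used are both proportional to n*m.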
--------------------------------------------------------------------------------------------------------
Bonus!! With another table we can figure out the alignment that produced the best result.
We use a second table to keep track of which of the four choices we made in step 10.
5b) Create a second n+1 by m+1 array P of strings.
6b) Initialize the 0th row of P to all “left”; this indicates that T(0, j) came from T(0, j-1) + skipPenalty.
7b) Initialize the 0th column of P to all “up”; this indicates that T(i, 0) came from T(i-1, 0) + skipPenalty.
10b) While computing T(i, j) in step 10, also record the choice that was made:
If the first choice is the minimum, put “up” in P(i, j).
If the second choice is the minimum, put “left” in P(i, j).
If one of the last two is the minimum, put “diag” in P(i, j).
Recovering the path
12) Set ans1 and ans2 to “” (empty strings) and set i and j to n and m.
13) while i > 0 or j > 0 do
14)    if P(i, j) is “up”:
          ans1 = S1(i) + ans1; ans2 = “-” + ans2; i = i-1
       else if P(i, j) is “left”:
          ans1 = “-” + ans1; ans2 = S2(j) + ans2; j = j-1
       else:
          ans1 = S1(i) + ans1; ans2 = S2(j) + ans2; i = i-1; j = j-1
ans1 and ans2, printed one above the other, show the resulting alignment.
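The table P and the path recovery in steps 12 to 14 can be sketched as follows. The function name align is hypothetical, and the sketch assumes the inputs do not themselves contain "-". When several choices tie for the minimum, this version prefers “up”, then “left”, then “diag”, so the alignment it prints may differ from the example at the top of the handout while having the same penalty.

```python
def align(s1, s2, mismatch_penalty=1, skip_penalty=1):
    """Return (penalty, ans1, ans2) for one optimal alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    a, b = "_" + s1, "_" + s2  # pad so characters are indexed from 1

    T = [[0] * (m + 1) for _ in range(n + 1)]
    P = [[""] * (m + 1) for _ in range(n + 1)]  # P[0][0] is never consulted
    for i in range(1, n + 1):
        T[i][0], P[i][0] = i * skip_penalty, "up"
    for j in range(1, m + 1):
        T[0][j], P[0][j] = j * skip_penalty, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag_cost = 0 if a[i] == b[j] else mismatch_penalty
            choices = [
                (T[i - 1][j] + skip_penalty, "up"),
                (T[i][j - 1] + skip_penalty, "left"),
                (T[i - 1][j - 1] + diag_cost, "diag"),
            ]
            # min() keeps the first of several equal-cost choices.
            T[i][j], P[i][j] = min(choices, key=lambda c: c[0])

    # Steps 12-14: walk back from (n, m) to (0, 0), building both rows.
    ans1, ans2 = "", ""
    i, j = n, m
    while i > 0 or j > 0:
        if P[i][j] == "up":
            ans1, ans2 = a[i] + ans1, "-" + ans2
            i -= 1
        elif P[i][j] == "left":
            ans1, ans2 = "-" + ans1, b[j] + ans2
            j -= 1
        else:
            ans1, ans2 = a[i] + ans1, b[j] + ans2
            i -= 1
            j -= 1
    return T[n][m], ans1, ans2


penalty, row1, row2 = align("WRITERS", "VINTNER")
print(penalty)
print(row1)
print(row2)
```

Every pass through the while loop decreases i, j, or both, so the walk back takes at most n + m steps.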
While this is a well-known algorithm, the immediate source of this handout and example
is: Gusfield, Dan. Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology. Cambridge University Press, 1997.