Description and Pseudocode for the Dynamic Programming Algorithm to Solve the String Alignment Problem

The string alignment problem asks us to compute the "best" alignment of two strings, given penalties for character mismatches and/or character skips. For example, if S1 = "WRITERS" and S2 = "VINTNER", the cost of a mismatch is 1, and the cost of a skip is 1, then one possible optimal alignment is:

    W R I T - E R S
    V I N T N E R -

The penalty for this alignment is 5, since there are 3 mismatched characters (W/V, R/I, I/N) and 2 skipped characters.

We present a dynamic programming algorithm that solves this problem using a table containing the solutions of subproblems. T(i, j) holds the minimum penalty for aligning the first i characters of S1 with the first j characters of S2.

The algorithm:

Input Section
1) Input S1 and S2, two strings. (e.g. S1 = "WRITERS", S2 = "VINTNER")
2) Compute n to be the original length of S1 and m the original length of S2. (e.g. n = 7, m = 7)
3) Pad the beginning of each with a blank so that character positions start at 1. (e.g. S1 = "_WRITERS" and S2 = "_VINTNER")
4) Get skipPenalty and mismatchPenalty. (from the file or user, depending on the program)

Initialization
5) Create an initially empty n+1 by m+1 array, T.
6) Fill the first column with multiples of the skip penalty, i.e. T(i, 0) = i*skipPenalty for i from 0 to n inclusive.
7) Fill the first row with multiples of the skip penalty, i.e. T(0, j) = j*skipPenalty for j from 0 to m inclusive.

Main Loop(s)
8) for i going from 1 to n inclusive
9)     for j going from 1 to m inclusive
10)        T(i, j) = the minimum of the following choices
                     (only one of the last two applies for a given i and j):
               T(i-1, j) + skipPenalty,           // S1(i) is lined up with a skip in S2
               T(i, j-1) + skipPenalty,           // S2(j) is lined up with a skip in S1
               T(i-1, j-1),                       // only if S1(i) = S2(j), i.e. match
               T(i-1, j-1) + mismatchPenalty      // only if S1(i) ≠ S2(j), i.e. mismatch

Output
11) The answer is sitting in T(n, m).

--------------------------------------------------------------------------------

Bonus!! With another table we can figure out the alignment that produced the best result.
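Leaving the bonus aside for a moment, steps 1 through 11 can be sketched in Python roughly as follows. This is a minimal sketch rather than the handout's own code: the function name alignment_cost, its parameter names, and the default penalties of 1 are assumptions chosen to match the example above.

```python
def alignment_cost(s1, s2, skip_penalty=1, mismatch_penalty=1):
    """Return the minimum alignment penalty, i.e. T(n, m).

    The name and the default penalties are illustrative assumptions.
    """
    n, m = len(s1), len(s2)                      # step 2
    s1, s2 = "_" + s1, "_" + s2                  # step 3: positions now start at 1

    # Step 5: an (n+1) by (m+1) table.
    T = [[0] * (m + 1) for _ in range(n + 1)]

    # Steps 6 and 7: aligning a prefix with nothing costs one skip per character.
    for i in range(n + 1):
        T[i][0] = i * skip_penalty
    for j in range(m + 1):
        T[0][j] = j * skip_penalty

    # Steps 8 to 10: fill the rest of the table row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The match and mismatch cases share the diagonal neighbour.
            step = 0 if s1[i] == s2[j] else mismatch_penalty
            T[i][j] = min(
                T[i - 1][j] + skip_penalty,      # s1[i] lined up with a skip
                T[i][j - 1] + skip_penalty,      # s2[j] lined up with a skip
                T[i - 1][j - 1] + step,          # s1[i] lined up with s2[j]
            )

    return T[n][m]                               # step 11


print(alignment_cost("WRITERS", "VINTNER"))      # prints 5
```

The two nested loops visit each of the n*m interior cells once and do constant work per cell, so both the running time and the table size grow with n*m.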
We use a second table to keep track of which choice we made in step 10.

5b) Create a second n+1 by m+1 array, P, of strings.
6b) Initialize the 0th row of P to all "left". This indicates that T(0, j) came from T(0, j-1) + skipPenalty.
7b) Initialize the 0th column of P to all "up". This indicates that T(i, 0) came from T(i-1, 0) + skipPenalty.
    (P(0, 0) is never consulted, so its value does not matter.)
10) When computing T(i, j), also record which choice was the minimum:
        If the first choice is the minimum, put "up" in P(i, j).
        If the second choice is the minimum, put "left" in P(i, j).
        If one of the last two is the minimum, put "diag" in P(i, j).

Recovering the path
12) Set ans1 and ans2 to "" (empty strings) and set i and j to n and m.
13) while i > 0 or j > 0 do
14)     if P(i, j) is "up":
            ans1 = S1(i) + ans1;  ans2 = "-" + ans2;  i = i-1
        else if P(i, j) is "left":
            ans1 = "-" + ans1;  ans2 = S2(j) + ans2;  j = j-1
        else:
            ans1 = S1(i) + ans1;  ans2 = S2(j) + ans2;  i = i-1;  j = j-1

ans1 and ans2 printed one above the other show the proper alignment.

While this is a well-known algorithm, the immediate source of this handout and example is: Gusfield, Dan. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
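The bonus steps (5b through 14) can be sketched in Python along the following lines. As before, this is only an illustrative sketch: the function name align, its parameters, and the default penalties of 1 are assumptions, not part of the handout. Ties are broken in the order the handout lists them ("up", then "left", then "diag"); a different order may return a different alignment with the same penalty.

```python
def align(s1, s2, skip_penalty=1, mismatch_penalty=1):
    """Return (penalty, ans1, ans2) for one optimal alignment.

    The name and the default penalties are illustrative assumptions.
    """
    n, m = len(s1), len(s2)
    s1, s2 = "_" + s1, "_" + s2                  # pad so positions start at 1

    T = [[0] * (m + 1) for _ in range(n + 1)]    # penalties
    P = [[""] * (m + 1) for _ in range(n + 1)]   # step 5b: choice taken per cell

    for j in range(1, m + 1):                    # step 6b: 0th row came from the left
        T[0][j] = j * skip_penalty
        P[0][j] = "left"
    for i in range(1, n + 1):                    # step 7b: 0th column came from above
        T[i][0] = i * skip_penalty
        P[i][0] = "up"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            up = T[i - 1][j] + skip_penalty
            left = T[i][j - 1] + skip_penalty
            step = 0 if s1[i] == s2[j] else mismatch_penalty
            diag = T[i - 1][j - 1] + step
            best = min(up, left, diag)
            T[i][j] = best
            # Step 10: record the winner, checking in the handout's order.
            if best == up:
                P[i][j] = "up"
            elif best == left:
                P[i][j] = "left"
            else:
                P[i][j] = "diag"

    # Steps 12 to 14: walk back from (n, m) to (0, 0), building both rows
    # of the alignment from right to left.
    ans1, ans2 = "", ""
    i, j = n, m
    while i > 0 or j > 0:
        if P[i][j] == "up":
            ans1, ans2 = s1[i] + ans1, "-" + ans2
            i -= 1
        elif P[i][j] == "left":
            ans1, ans2 = "-" + ans1, s2[j] + ans2
            j -= 1
        else:
            ans1, ans2 = s1[i] + ans1, s2[j] + ans2
            i -= 1
            j -= 1

    return T[n][m], ans1, ans2


penalty, top, bottom = align("WRITERS", "VINTNER")
print(penalty)
print(top)
print(bottom)
```

Storing the choice in P avoids recomputing the three candidates during the walk back. The alignment printed here need not be the one shown in the example, since several alignments can share the optimal penalty.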