String Alignment II
Computational Biology, Department Informatik, ETH Zentrum
Computational Biology – p.1/26

Review of Last Week
- Mutation matrices
- Dynamic programming
  - Tabular computation: matches, mismatches, or spaces (gaps, indels, deletions, insertions)
  - Traceback
- Global Align
- Global Align with cost-free end gaps
- Local Align

Organization
- Gaps
- Dynamic programming: formal definition, following Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 11
- Gap placement: the unsolved problem
- Gap penalties and dynamic programming
  - constant
  - arbitrary
  - linear (affine)
  - convex
- Time analysis
- Linear space dynamic programming

Gaps
Random sequence mutated 200 PAM units with deletions.

Default: gap open penalty 10.00, gap extension penalty 0.10

one  TLTKEATQMIVLNNIGLGAETEENNEVLAQPGHDDCERTTETVMVCIAKLYDCSEY
two  TGAGHNLFMIFLDHHNGTVKEGEKYMNAVVTGSDHLVENSVVLMI----LYRYGAY

one  YAMYWVSTLKFTNGLQDQITRKLIVKQPSTEVPSVLSYLS
two  ----NISPLWFSDTRGNIPKLSVWLDDPQGSEPDMFNHFA

Gap open penalty 1.0, gap extension penalty 0.05

one  TLTKEATQMIVLNN-IGLGAETEE-NNEVLAQPGHDDCERTTETVMVCIAKLYDCS
two  TGAGHNLFMIFLDHHNGTVKEGEKYMNAVVT--GSDHLVENSVVLMI----LYRYG

one  SRYAMYWVSTLKFTNGLQDQITRKLIVKQPSTEVPSVLSYLS
two  SNISPLWFSD---TRGNIPKLSVWLDDPQ-GSE-PDMFNHFA

Gap open penalty 0, gap extension penalty 0

one  TLTKEATQMIVLNN-IGLGAETEE-NNEVLAQPGHDDCERTTETVMVCIAKLYDCS
two  TGAGHNLFMIFLDHHNGTVKEGEKYMNAVVT--GSDHLVENSVVLMI----LYRYG

one  SRYAMYWVSTLKFTNGLQDQITRKLIVKQ--PS-TEVPSVLSYLS
two  SNISPLWFSD---TRG---NIP-KLSVWLDDPQGSE-PDMFNHFA

Gap Weights
A constant gap penalty implies that the costs of the following two alignments are the same.
(Figure: two alignments of H and Y that differ in how their spaces are placed.)
A better model of gap placement says that it is easier to add a second space to an existing gap than to open a new gap.
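A small numeric sketch of that last point, assuming an affine cost of the form open + k * extend for a gap of k spaces. The function names and the weights 1.0, 10.0 and 0.1 are illustrative choices, not taken from the slides. With a per-space penalty, one gap of six spaces costs the same as six gaps of one space each; under the affine cost the single long gap is much cheaper.

```python
def per_space_cost(gap_lengths, w_space):
    """Total penalty when every space costs the same, however spaces are grouped."""
    return sum(w_space * k for k in gap_lengths)


def affine_cost(gap_lengths, w_open, w_extend):
    """Total penalty when each gap pays w_open once plus w_extend per space."""
    return sum(w_open + w_extend * k for k in gap_lengths)


one_long_gap = [6]          # a single gap of six spaces
six_short_gaps = [1] * 6    # six separate gaps of one space each

# Per-space model: both layouts contain six spaces, so the totals are equal.
flat_long = per_space_cost(one_long_gap, 1.0)
flat_short = per_space_cost(six_short_gaps, 1.0)

# Affine model (hypothetical weights): the long gap pays the opening cost once.
affine_long = affine_cost(one_long_gap, 10.0, 0.1)
affine_short = affine_cost(six_short_gaps, 10.0, 0.1)

print(flat_long, flat_short)
print(affine_long, affine_short)
```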
Mechanisms of insertion-deletion events:
- more likely to happen in loops
- slippage of DNA machinery

It is more likely that one run of 6 spaces is deleted than 6 runs of 1 space. A gap of more than one space can be created by one mutational event. A more plausible model does not treat the spaces in a gap as separate events.

Review of Dynamic Programming
1) Recurrence relation: establishes a recursive relationship between D(i,j) and values of D with index pairs smaller than (i,j). When there are no smaller indices, the value of D(i,j) must be stated explicitly in the base conditions for D(i,j).
2) Base conditions: the cost to transform the first i characters of one string into zero characters of the other string, that is, the cost of deleting the first i characters.
3) Tabular computation: use the recurrence relations to compute all values of D(i,j). Find the optimal alignment score (similarity score) in D(n,m).
4) Traceback: find the optimal alignment by tracing back the path that gave the optimal score.

(Figure: the three traceback moves: match X with Y, delete X, delete Y.)

Time Analysis: Dynamic Programming, Constant Gap Penalty
We have two strings, s of length n and t of length m.
When computing the value for a specific cell (i,j), only cells (i-1,j-1), (i,j-1) and (i-1,j) are examined, along with the two characters s[i] and t[j].
Filling in one cell takes a constant number of cell examinations, arithmetic operations, and comparisons.
There are O(nm) cells in the table, so the score D(n,m) can be computed in O(nm) time.

Alignment types
To align strings S1 and S2, consider the prefixes S1[1..i] of S1 and S2[1..j] of S2.
Any alignment has to be of one of the following types:

(Figure: the three alignment types of the prefixes S1[1..i] and S2[1..j], labeled E, F and G.)

1) Alignments of S1[1..i] and S2[1..j] where character S1(i) is aligned to a character strictly to the left of character S2(j). Therefore, the alignment ends with a gap in S1.
2) Alignments of the two prefixes where S1(i) is aligned strictly to the right of S2(j). The alignment ends with a gap in S2.
3) Alignments of the two prefixes where characters S1(i) and S2(j) are aligned opposite each other. This includes both the case that S1(i) = S2(j) and the case that S1(i) ≠ S2(j).

These 3 are the only possible cases.

Definition: Define E(i,j) as the maximum value of any alignment of type 1; define F(i,j) as the maximum value of any alignment of type 2; define G(i,j) as the maximum value of any alignment of type 3; and define V(i,j) as the maximum value of the 3 terms E(i,j), F(i,j) and G(i,j).

Recurrences for an arbitrary gap length

$$V(i,j) = \max[\,E(i,j),\ F(i,j),\ G(i,j)\,]$$
$$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$$
$$E(i,j) = \max_{0 \le k \le j-1} [\,V(i,k) - w(j-k)\,]$$
$$F(i,j) = \max_{0 \le l \le i-1} [\,V(l,j) - w(i-l)\,]$$

where s is the scoring matrix and w(k) is the weight of a gap of length k.

For cost-free end gaps the base cases are
$$V(i,0) = 0 \quad\text{and}\quad V(0,j) = 0$$
with the optimal value for the alignment found in any cell in row n or column m.

The base cases for a Global Align are
$$V(i,0) = -w(i) \quad\text{and}\quad V(0,j) = -w(j)$$
with the optimal value for the alignment found in cell (n,m).

Time Analysis: DP with Arbitrary Gap Penalty
We have two strings, S1 of length n and S2 of length m.
If the gap weight is a completely arbitrary function of gap length, the optimal alignment can be found in O(nm^2 + n^2 m) time.
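A minimal sketch of this general case, assuming the standard recurrences from Gusfield, Chapter 11 (the text this lecture follows), with global-alignment base cases. The names `arbitrary_gap_value` and `simple_score` and the example weights are hypothetical. The two inner `max(...)` scans over a whole row and a whole column are what produce the cubic running time.

```python
def simple_score(a, b):
    """Illustrative scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def arbitrary_gap_value(s1, s2, score, w):
    """Global alignment value V(n, m) for an arbitrary gap weight w(k), k >= 1."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Base cases for a global alignment: one gap covering the whole prefix.
    for i in range(1, n + 1):
        V[i][0] = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # G: S1(i) and S2(j) are aligned opposite each other.
            g = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            # E: alignment ends with a gap in S1; try every gap length j - k.
            e = max(V[i][k] - w(j - k) for k in range(j))
            # F: alignment ends with a gap in S2; try every gap length i - l.
            f = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(e, f, g)
    return V[n][m]


# Example gap weight (an assumption): 1.0 to open plus 0.5 per space.
value = arbitrary_gap_value("ACGT", "AT", simple_score, lambda k: 1.0 + 0.5 * k)
print(value)
```

Any function of the gap length can be passed as `w`, which is the point of the general formulation; the price is the O(n + m) work per cell.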
To compute the value for a specific cell (i,j), all earlier cells in the same column and the same row, (i-1,j), (i-2,j), ..., (0,j) and (i,j-1), (i,j-2), ..., (i,0), must be examined.
For mn cells, this gives mn(m + n) cells that must be examined.

(Figure: partially filled dynamic programming table for the strings ENOUGH and GENUG.)

Affine Gap Functions
Linear gap functions of the form k1 + k * k2, where k is the length of the gap, with k1 >> k2. (The cost of opening a new gap is more than the cost of adding to an existing gap.)
Affine gap penalties are the most used by biologists today.
k1 and k2 can be made to depend on the PAM distance.

Affine Gap Functions
The recurrence relations are:

$$V(i,j) = \max[\,E(i,j),\ F(i,j),\ G(i,j)\,]$$
$$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$$
$$E(i,j) = \max[\,E(i,j-1),\ V(i,j-1) - W_g\,] - W_s$$
$$F(i,j) = \max[\,F(i-1,j),\ V(i-1,j) - W_g\,] - W_s$$

where s is the scoring matrix.

The increase in the total weight of a gap contributed by each additional space is a constant W_s, independent of the size of the gap up to that point. Because the gap weight increases by the same W_s for each space after the first one, when evaluating E(i,j) or F(i,j) we only need to know whether it is a new gap or a continuation of an old gap.

The base relations are:
$$V(i,0) = E(i,0) = -W_g - i\,W_s$$
$$V(0,j) = F(0,j) = -W_g - j\,W_s$$
where W_s is the penalty for adding another space onto a gap and W_g is the penalty for opening a new gap.

Consider the calculation of
$$E(i,j) = \max[\,E(i,j-1),\ V(i,j-1) - W_g\,] - W_s$$
By definition, S1(i) will be aligned to the left of S2(j). Two possibilities exist:
1) S1(i) is exactly one place to the left of S2(j), in which case a gap begins in S1 opposite character S2(j), and E(i,j) = V(i,j-1) - W_g - W_s.
2) S1(i) is to the left of S2(j-1), in which case the same gap in S1 is opposite both S2(j-1) and S2(j), and E(i,j) = E(i,j-1) - W_s.
The calculation of F(i,j) is similar.

Time Analysis: Dynamic Programming, Affine Gap Penalty
We have two strings, S1 of length n and S2 of length m.
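To make the quantities being counted concrete, here is a minimal sketch of the affine recurrences for a global alignment. The function names and example weights are hypothetical. One deliberate deviation is flagged as an assumption of this sketch: the boundary cells E(i,0) and F(0,j) are set to minus infinity rather than to the boundary value of V, so that a gap in one string directly followed by a gap in the other pays two opening penalties.

```python
def simple_score(a, b):
    """Illustrative scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def affine_alignment_value(s1, s2, score, w_g, w_s):
    """Global alignment value V(n, m) when a gap of k spaces costs w_g + k * w_s."""
    n, m = len(s1), len(s2)
    neg_inf = float("-inf")
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumption of this sketch: E and F start at -inf on the boundary.
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w_g - i * w_s
    for j in range(1, m + 1):
        V[0][j] = -w_g - j * w_s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the gap ending at (i, j-1), or open a new one after V(i, j-1).
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - w_g) - w_s
            # Same choice in the other direction.
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - w_g) - w_s
            g = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]


# Example weights (assumptions): 1.0 to open a gap, 0.5 per space.
print(affine_alignment_value("ACGT", "AT", simple_score, 1.0, 0.5))
```

Each cell reads only a fixed handful of neighbours in V, E and F, which is the property the analysis that follows relies on.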
Examination of the recurrences shows that for any pair (i,j), each of the terms V(i,j), E(i,j), F(i,j) and G(i,j) is evaluated by a constant number of references to previously computed values, arithmetic operations and comparisons. Hence O(nm) time suffices to fill in all of the (n+1)(m+1) cells in the dynamic programming table.

The affine gap term makes the alignment model much richer but does not increase the running time needed to find an optimal alignment (in an asymptotic worst-case sense).

Argument for a convex (concave) gap weight
The log-cost gap model is the most popular non-linear one: the cost of a gap of length k is k1 + k2 log(k). Within one row or column of the table, with U_k the score at the position k where the gap starts,

$$U_i = \max_{0 \le k \le i-1} \left( U_k - k_1 - k_2 \log(i-k) \right)$$

Benner, Cohen and Gonnet: "a non-linear gap penalty is the only one that is grounded in empirical data". J. Mol. Biol., 229:1065-82, 1993.

It makes sense that increasing a gap from length 99 to 100 is less costly than increasing a gap from length 2 to length 3.

Benner et al. proposed that a gap of length q should be given the weight

$$35.03 - 6.88 \log_{10} d + 17.02 \log_{10} q$$

at d PAM units divergence: 35.03 to initiate a gap, which declines with increasing PAM distance, and 17.02 log10 q for the gap of length q.

Crossover: A Crucial Observation
*Two cost curves intersect in at most one place.*
For two candidate gap starts j1 and j2, the curves cross at the position j where

$$U_{j_1} - k_1 - k_2 \ln(j - j_1) = U_{j_2} - k_1 - k_2 \ln(j - j_2)$$

IF j < max( , ) - no crossover => discard one
IF j < length(t) - no crossover => discard one

Keep a stack of crossovers.
Check against the top of the stack.
If there is no crossover, discard one and repeat if necessary.
If there is a crossover, place the new one on the stack.
Update the stack.
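The single-crossover property can be checked numerically. The sketch below solves the crossover equation in closed form for two candidates j1 < j2 under the log-cost model. The function names and the values k1 = 0.5, k2 = 0.1 are illustrative assumptions, and k2 is assumed positive. It covers only the crossover computation, not the full stack-based algorithm.

```python
import math


def candidate_score(u_k, k, j, k1, k2):
    """Score at position j of a gap that starts at position k (requires j > k)."""
    return u_k - k1 - k2 * math.log(j - k)


def crossover(u1, j1, u2, j2, k2):
    """Position j* > j2 where candidates j1 < j2 score equally, or None.

    Setting u1 - k2*ln(j - j1) = u2 - k2*ln(j - j2) gives
    (j - j1) / (j - j2) = exp((u1 - u2) / k2), which has a solution
    with j > j2 only when u1 > u2.
    """
    if not j1 < j2:
        raise ValueError("expected j1 < j2")
    d = (u1 - u2) / k2
    if d <= 0:
        return None          # the later candidate j2 is better for every j > j2
    if d > 700:
        return float(j2)     # exp(d) would overflow; the crossing is numerically at j2
    return j2 + (j2 - j1) / math.expm1(d)


# Hypothetical candidates: score 0.0 at position 1 and score -0.5 at position 2.
j_star = crossover(0.0, 1, -0.5, 2, 0.1)
print(j_star)
```

To the left of `j_star` the later candidate scores higher, to the right the earlier one does, and they never swap again, which is why one of the two can be discarded once the crossover falls outside the range still to be processed.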
An example

U[1,7] := -0.6792

k := 1   score := -0.6792,   plot(  0      - 0.5 - 0.1*ln(j-1), j=.1..7)
k := 2   score := -1.1609,   plot( -0.5000 - 0.5 - 0.1*ln(j-2), j=.1..7)
k := 3   score := -1.2079,   plot( -0.5693 - 0.5 - 0.1*ln(j-3), j=.1..7)
k := 4   score := -1.2197,   plot( -0.6099 - 0.5 - 0.1*ln(j-4), j=.1..7)
k := 5   score := -1.2079,   plot( -0.6386 - 0.5 - 0.1*ln(j-5), j=.1..7)
k := 6   score := -1.1609,   plot( -0.6609 - 0.5 - 0.1*ln(j-6), j=.1..7)

(Figure: plot of the candidate cost curves; label k[1,2].)

Time Analysis: DP with Convex Gap Weight
O(nm log m) time

Linear Space Dynamic Programming
Problem: for a 2000 x 4000 matrix, 8,000,000 computations may be OK, but storing 8,000,000 cells in each of 3 matrices may not be.
- Our tolerance for time is not the same as our tolerance for storage.
- Can we do the same thing with less memory?
- Hirschberg developed this space reduction algorithm.
- It reduces the required space from O(nm) to O(n) for n < m while doubling the worst-case time bound.

Consider the following:
What if we only wanted the best score of the alignment and not the alignment itself? What if we want only V(n,m)?
Then the space needed can be reduced to 2m.
When computing the scores V(i,j) of row i, the only values needed from previous rows are those of row i-1; any earlier row can be discarded.
After a row has been computed, copy it to the space occupied by the previous row and start over.
After n iterations, the last cell in the row holds the optimal score V(n,m).
In this way, V(n,m) can be computed in O(m) space and O(nm) time.
Any single row or column can be calculated in these same time and space bounds.

Finding the optimal alignment in linear space
In normal DP, we find the optimal alignment by traversing the pointers set when computing the full DP table.
But when using linear space we can't store the whole set of pointers or scores.
Define S1^r as the reverse of string S1, and S2^r as the reverse of S2.
Define V^r(i,j) as the score of the optimal alignment of the string consisting of the first i characters of S1^r and the string consisting of the first j characters of S2^r.

The optimal path connects the start cell (0,0) with the end cell (n,m) and passes through an unknown cell (n/2, k) in the middle row.

Let k* be a position that maximizes
$$V(n/2,\,k) + V^r(n/2,\,m-k)$$

This position k* can be found in O(nm) time and O(m) space.
The optimal subpath that goes through row n/2 can be found and stored in those time and space bounds.
The optimal alignment (traceback) goes through the cell (n/2, k*).

Linear Space DP
- Computing the optimal value is trivial.
- To do backtracking, select the optimal cost in the middle row.
- Split into two subproblems at that cell.

To obtain the optimal alignment
Do the same procedure recursively on each substring.
Each time, the computational time reduces by a factor of two:
$$\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) nm < 2nm$$
This reduces the space from O(nm) to O(m) while only doubling the worst-case time needed for the computation.
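The whole procedure can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the lecture's own code: it uses a constant per-space gap penalty instead of affine gaps, and the names `last_row`, `hirschberg`, `alignment_score` and the default scores are hypothetical. `last_row` is the two-row score computation; `hirschberg` finds k* in the middle row and recurses on the two halves.

```python
def last_row(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return [V(len(s1), j) for j in 0..len(s2)], keeping only two rows."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for a in s1:
        cur = [prev[0] + gap]
        for j, b in enumerate(s2, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def hirschberg(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment as a pair of gapped strings."""
    n, m = len(s1), len(s2)
    if n == 0:
        return "-" * m, s2
    if m == 0:
        return s1, "-" * n
    if n == 1:
        # Pair the single character with its best partner in s2, or with none.
        k = max(range(m), key=lambda c: match if s2[c] == s1 else mismatch)
        sub = match if s2[k] == s1 else mismatch
        if sub + (m - 1) * gap >= (m + 1) * gap:
            return "-" * k + s1 + "-" * (m - k - 1), s2
        return s1 + "-" * m, "-" + s2
    mid = n // 2
    # upper[k] = V(mid, k); lower[m - k] = V^r(n - mid, m - k) on the reversed strings.
    upper = last_row(s1[:mid], s2, match, mismatch, gap)
    lower = last_row(s1[mid:][::-1], s2[::-1], match, mismatch, gap)
    k_star = max(range(m + 1), key=lambda k: upper[k] + lower[m - k])
    top = hirschberg(s1[:mid], s2[:k_star], match, mismatch, gap)
    bottom = hirschberg(s1[mid:], s2[k_star:], match, mismatch, gap)
    return top[0] + bottom[0], top[1] + bottom[1]


def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score a finished alignment column by column."""
    return sum(
        gap if "-" in (x, y) else (match if x == y else mismatch)
        for x, y in zip(a, b)
    )


a, b = hirschberg("AGTACGCA", "TATGC")
print(a)
print(b)
```

A quick sanity check is that `alignment_score(a, b)` equals `last_row(s1, s2)[-1]`, the optimal value computed without any traceback. The string slices here copy data for brevity; an index-based version would avoid that extra allocation.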