slides-alignment

String Alignment II
Computational Biology, Department Informatik
ETH Zentrum
Computational Biology – p.1/26
Review of Last Week
Mutation Matrices
Dynamic programming
 Tabular computation - Matches, Mismatches, or Spaces (gaps, indels,
deletions, insertions)

Traceback
Global Align
Global Align Cost-free end gaps
Local Align
Organization
Gaps
Dynamic programming - formal definition - follows Gusfield, Algorithms on Strings,
Trees and Sequences, Chapter 11
Gap Placement- the unsolved problem
Gap Penalties and dynamic programming
• constant
• arbitrary
• linear (affine)
• convex
Time analysis
Linear space dynamic programming
Gaps
Random sequence mutated 200 PAM units with deletions

Default: gap open penalty 10.00, gap extension penalty 0.10
one  TLTKEATQMIVLNNIGLGAETEENNEVLAQPGHDDCERTTETVMVCIAKLYDCSEY
two  TGAGHNLFMIFLDHHNGTVKEGEKYMNAVVTGSDHLVENSVVLMI----LYRYGAY

one  YAMYWVSTLKFTNGLQDQITRKLIVKQPSTEVPSVLSYLS
two  ----NISPLWFSDTRGNIPKLSVWLDDPQGSEPDMFNHFA

Gap open penalty 1.0, gap extension penalty 0.05
one  TLTKEATQMIVLNN-IGLGAETEE-NNEVLAQPGHDDCERTTETVMVCIAKLYDCS
two  TGAGHNLFMIFLDHHNGTVKEGEKYMNAVVT--GSDHLVENSVVLMI----LYRYG

one  SRYAMYWVSTLKFTNGLQDQITRKLIVKQPSTEVPSVLSYLS
two  SNISPLWFSD---TRGNIPKLSVWLDDPQ-GSE-PDMFNHFA

Gap open penalty 0, gap extension penalty 0
one  TLTKEATQMIVLNN-IGLGAETEE-NNEVLAQPGHDDCERTTETVMVCIAKLYDCS
two  TGAGHNLFMIFLDHHNGTVKEGEKYMNAVVT--GSDHLVENSVVLMI----LYRYG

one  SRYAMYWVSTLKFTNGLQDQITRKLIVKQ--PS-TEVPSVLSYLS
two  SNISPLWFSD---TRG---NIP-KLSVWLDDPQGSE-PDMFNHFA
Gap Weights
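A constant gap penalty (every space costs the same) implies that the cost of aligning
two strings with one long gap and the cost of aligning them with the same number of
spaces scattered over several short gaps are the same.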
A better model of gap placement says that it is easier to add a second space to an
existing gap than to open a new gap.
Mechanisms of insertion-deletion events:
• more likely to happen in loops
• slippage of the DNA machinery
It is more likely that 1 run of 6 spaces is deleted than 6 runs of 1 space.
A gap of more than one space can be created by one mutational event. A more plausible
model treats the spaces in a gap not as separate events.
Review of Dynamic Programming
1) recurrence relation -establishes a recursive relationship between D(i,j) and
values of D with index pairs smaller than i and j. When there are no smaller indices
then the value of D(i,j) must be stated explicitly in the base conditions for D(i,j).
2) base conditions - Cost to transform the first i characters of one string into zero
characters of the other string. Cost of deleting the first i characters.
3) tabular computation - use the recurrence relations to compute all values for
D(i,j). Find the optimal alignment score (similarity score) in D(n,m).
4) traceback - find the optimal alignment by tracing back the path that gave the
optimal score.
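The four steps above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `global_align` and the scores (match +1, mismatch -1, -2 per space) are assumptions chosen for the example, and a similarity score is maximized.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a constant per-space penalty (assumed scores)."""
    n, m = len(s), len(t)
    # 2) base conditions: a prefix aligned against the empty string is all spaces
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    # 1) + 3) recurrence relation, evaluated row by row (tabular computation)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,  # match or mismatch
                          D[i - 1][j] + gap,      # s[i] opposite a space
                          D[i][j - 1] + gap)      # t[j] opposite a space
    # 4) traceback: walk from D(n,m) back to D(0,0) along a path that gave the score
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))


print(global_align("ACGT", "AGT"))
```

With these assumed scores the call aligns ACGT against A-GT: three matches and one space.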
Figure: the three moves into a cell: match X with Y, delete X, delete Y.
Time Analysis: Dynamic Programming with Constant Gap Penalty
We have 2 strings- s of length n and t of length m. When computing the value for a
specific cell (i,j), only cells (i-1, j-1), (i,j-1) and (i-1,j) are examined along with the two
characters s[i] and t[j]. To fill in one cell takes a constant number of cell examinations,
arithmetic operations, and comparisons. There are O(nm) cells in the table so the score
D(n,m) can be computed in O(nm) time.
Alignment types
To align strings S1 and S2, consider the prefixes S1[1..i] of S1 and S2[1..j] of S2.
Any alignment of these two prefixes has to be of one of the following types:
Figure: the three alignment types, labeled E, F and G.
1) Alignments of S1[1..i] and S2[1..j] where character S1(i) is aligned to a
character strictly to the left of character S2(j). Therefore, the alignment ends with a
gap in S1.
2) Alignments of the two prefixes where S1(i) is aligned strictly to the right
of S2(j). The alignment ends with a gap in S2.
3) Alignments of the two prefixes where characters S1(i) and S2(j) are aligned
opposite each other. This includes both the case that S1(i) = S2(j) and the case that
S1(i) ≠ S2(j).
These 3 are the only possible cases.
Definition: Define E(i,j) as the maximum value of any alignment of type 1; define
F(i,j) as the maximum value of any alignment of type 2; define G(i,j) as the maximum
value of any alignment of type 3; and V(i,j) as the maximum value of the 3 terms
E(i,j), F(i,j) and G(i,j).
Recurrences for an arbitrary gap length
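Figure: the three alignment types E, F and G, each ending at S1(i) and S2(j).
$$V(i,j) = \max[\,E(i,j),\ F(i,j),\ G(i,j)\,]$$
$$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$$
$$E(i,j) = \max_{0 \le k \le j-1} [\,V(i,k) - w(j-k)\,]$$
$$F(i,j) = \max_{0 \le l \le i-1} [\,V(l,j) - w(i-l)\,]$$
where s is the scoring matrix and w(q) is the weight of a gap of length q.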
For Cost-free end gaps the base cases are:
$$V(i,0) = 0 \quad \text{and} \quad V(0,j) = 0$$
with the optimal value for the alignment found in any cell in row n or column m.
The base cases for a Global Align are:
$$V(i,0) = -w(i), \qquad V(0,j) = -w(j)$$
$$E(i,0) = -w(i), \qquad F(0,j) = -w(j)$$
with the optimal value for the alignment found in cell (n,m).
Time Analysis DP with Arbitrary Gap Penalty
We have 2 strings: S1 of length n and S2 of length m. If the gap weight is a completely
arbitrary function of gap length, the optimal alignment can be found in O(nm^2 + n^2 m)
time. To compute the value for a specific cell (i,j), all earlier cells in the same row and
column, (i-1,j), (i-2,j) ... (1,j) and (i,j-1), (i,j-2) ... (i,1), must be examined.
For mn cells, this gives on the order of mn(m + n) cell examinations.
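The cost of scanning a whole row and column per cell is easy to see in code. The sketch below is an illustration under assumptions: the names `align_arbitrary_gap`, `w` and `score` are made up for this example, `w(q)` is any user-supplied gap weight, and the table follows the E, F, G, V definitions of the earlier slides with global base cases.

```python
def align_arbitrary_gap(s1, s2, w, score):
    """Global similarity with an arbitrary gap weight w(q); O(nm(n+m)) time."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)              # one gap of length i
    for j in range(1, m + 1):
        V[0][j] = -w(j)              # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            # E: alignment ends with a gap in s1; scan every earlier cell of row i
            E = max(V[i][k] - w(j - k) for k in range(j))
            # F: alignment ends with a gap in s2; scan every earlier cell of column j
            F = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(E, F, G)
    return V[n][m]


def plus_minus_one(a, b):
    """Assumed toy scoring matrix: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


# A weight that ignores gap length: one gap of length 2 costs the same as length 1.
print(align_arbitrary_gap("AAAA", "AA", lambda q: 3, plus_minus_one))
```

The two inner `max(...)` scans are what add the factor (m + n) to the O(nm) table size.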
Figure: dynamic programming table for the strings ENOUGH and GENUG.
Affine Gap Functions
Linear gap functions of the form k1 + k * k2, where k is the length of the gap and
k1 >> k2. (The cost of opening a new gap is more than the cost of adding to an existing gap.)
Affine gap penalties are the most used by biologists today.
Can make k1 and k2 depend on the PAM distance.
Affine Gap Functions
The recurrence relations are:
$$V(i,j) = \max[\,E(i,j),\ F(i,j),\ G(i,j)\,]$$
$$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$$
$$E(i,j) = \max[\,E(i,j-1),\ V(i,j-1) - W_g\,] - W_s$$
$$F(i,j) = \max[\,F(i-1,j),\ V(i-1,j) - W_g\,] - W_s$$
where s is the scoring matrix.
The increase in the total weight of a gap contributed by each additional space is a
constant Ws independent of the size of a gap up to that point. Because the gap increases
by the same Ws for each space after the first one, when evaluating E(i,j) or F(i,j) we only
need to know if it is a new gap or a continuation of an old gap. The base relations are:
$$V(i,0) = E(i,0) = -W_g - i\,W_s$$
$$V(0,j) = F(0,j) = -W_g - j\,W_s$$
where Ws is the penalty for adding another space onto a gap and Wg is the penalty for
opening a new gap.
Consider the calculation of
$$E(i,j) = \max[\,E(i,j-1),\ V(i,j-1) - W_g\,] - W_s$$
By definition, S1(i) will be aligned to the left of S2(j).
Two possibilities exist:
1) S1(i) is exactly one place to the left of S2(j), in which case a gap begins in S1
opposite character S2(j) and E(i,j) = V(i,j-1) - Wg - Ws.
2) S1(i) is to the left of S2(j-1), in which case the same gap in S1 is opposite both
S2(j-1) and S2(j) and E(i,j) = E(i,j-1) - Ws.
The calculation of F(i,j) is similar.
Time Analysis Dynamic Programming Affine Gap Penalty
We have 2 strings- S1 of length n and S2 of length m. Examination of the recurrences
shows that for any pair (i,j), each of the terms V(i,j), E(i,j), F(i,j) and G(i,j) is evaluated by a
constant number of references to previously computed values, arithmetic operations and
comparisons. Hence O(nm) time suffices to fill in all of the (n+1)(m +1) cells in the
dynamic programming table.
The affine gap term makes the alignment much richer but does not increase the running
time used to find an optimal alignment (in an asymptotic worst-case sense).
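A minimal sketch of the affine-gap table makes the constant work per cell visible. The names `align_affine`, `wg`, `ws` and `plus_minus_one` are assumptions for this example. One deliberate difference from the base relations on the slides is labeled in the code: E(i,0) and F(0,j) are set to minus infinity, so that a gap in one string directly followed by a gap in the other is charged two opening penalties on the table border as well as in its interior.

```python
def align_affine(s1, s2, score, wg, ws):
    """Global similarity with gap weight wg + q*ws for a gap of length q; O(nm) time."""
    n, m = len(s1), len(s2)
    NEG = float("-inf")
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending with a gap in s1
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending with a gap in s2
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws   # E[i][0] stays -inf (assumption, see lead-in)
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws   # F[0][j] stays -inf (assumption, see lead-in)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend an existing gap (pay ws) or open a new one (pay wg + ws)
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            G = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]


def plus_minus_one(a, b):
    """Assumed toy scoring matrix: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


# One gap of length 2 (cost 2 + 2*1) beats two gaps of length 1 (cost 2 * (2 + 1)).
print(align_affine("AATT", "AT", plus_minus_one, wg=2, ws=1))
```

Each cell reads only a fixed number of neighbours in V, E and F, which is why the running time stays O(nm).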
Argument for a convex (concave) gap weight
The log-cost gap model is the most popular non-linear one.
The cost of a gap of length k is k1 + k2 log(k):
$$F(i,j) = \max_{1 \le k \le i-1} [\,V(k,j) - k_1 - k_2 \log(i-k)\,]$$
Benner, Cohen and Gonnet "a non-linear gap penalty is the only one that is grounded in
empirical data". J. Mol. Biol., 229:1065-82, 1993.
It makes sense that increasing a gap from length 99 to 100 is less costly than increasing a
gap from length 2 to length 3.
Benner et al. proposed that a gap of length q should be given the weight
$$35.03 - 6.88 \log_{10} d + 17.02 \log_{10} q$$
at d PAM units divergence:
• 35.03 to initiate a gap, which declines with increasing PAM distance (the -6.88 log10 d term)
• 17.02 log10 q for the gap of length q
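The formula can be evaluated directly to check the earlier claim that lengthening a long gap costs less than lengthening a short one. The function name `benner_gap_weight` is an assumption for this sketch; the constants are the ones quoted above.

```python
import math


def benner_gap_weight(q, d):
    """Weight of a gap of length q at d PAM units, using the constants quoted above."""
    return 35.03 - 6.88 * math.log10(d) + 17.02 * math.log10(q)


d = 100  # example PAM distance (assumed for illustration)
grow_long = benner_gap_weight(100, d) - benner_gap_weight(99, d)
grow_short = benner_gap_weight(3, d) - benner_gap_weight(2, d)
print(f"99 -> 100: {grow_long:.2f}   2 -> 3: {grow_short:.2f}")
```

At any fixed d the extra cost depends only on the ratio of the gap lengths: going from 99 to 100 adds about 0.07, going from 2 to 3 adds about 3.0.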
Crossover: A Crucial Observation
*Two cost curves intersect at most in one place*
$$V(i,j_1) - k_1 - k_2 \ln(j - j_1) = V(i,j_2) - k_1 - k_2 \ln(j - j_2)$$
where j1 and j2 are the positions at which the two candidate gaps start.
IF j < max( , ) - no crossover => discard one
IF j < length(t) - no crossover => discard one
Keep a stack of crossovers.
Check against the top of the stack. If there is no crossover, discard one and repeat if
necessary. If there is a crossover, place the new one on the stack. Update the stack.
An example
U[1,7] := -0.6792
k := 1 score := -0.6792, plot( 0 - 0.5 -0.1*ln(j-1),j=.1..7)
k := 2 score := -1.1609, plot( -0.5000 - 0.5 -0.1*ln(j-2),j=.1..7)
k := 3 score := -1.2079, plot( -0.5693 - 0.5 -0.1*ln(j-3),j=.1..7)
k := 4 score := -1.2197, plot( -0.6099 - 0.5 -0.1*ln(j-4),j=.1..7)
k := 5 score := -1.2079, plot( -0.6386 - 0.5 -0.1*ln(j-5),j=.1..7)
k := 6 score := -1.1609, plot( -0.6609 - 0.5 -0.1*ln(j-6),j=.1..7)
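The scores in this example can be recomputed from the plot expressions. The sketch assumes that the first number in each expression is the score already stored at position k of row 1, and that 0.5 and 0.1 are the constants k1 and k2 of the log-cost gap model; the dictionary `prev` simply copies those numbers from the slide.

```python
import math

K1, K2 = 0.5, 0.1   # gap-open and log coefficients, read off the plot commands
j = 7               # the cell U[1,7] is being filled

# Assumed meaning: score stored at position k of row 1 (values copied from the slide).
prev = {1: 0.0, 2: -0.5000, 3: -0.5693, 4: -0.6099, 5: -0.6386, 6: -0.6609}


def candidate(k, j):
    """Score of reaching column j with a gap of length j - k that starts after k."""
    return prev[k] - K1 - K2 * math.log(j - k)


scores = {k: candidate(k, j) for k in prev}
best_k = max(scores, key=scores.get)
for k, sc in scores.items():
    print(f"k = {k}  score = {sc:.4f}")
print("best k:", best_k)
```

The recomputed scores agree with the listed ones to within rounding of the stored inputs, and the maximum is taken at k = 1, which gives U[1,7].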
k[1,2]
Time Analysis DP with Convex gap weight
O(nm log m) time
Linear Space Dynamic Programming
Problem: for a 2000 x 4000 matrix, 8,000,000 computations may be OK, but
8,000,000 * 3 stored matrix cells may not be OK.
Our tolerance for time is not the same as our tolerance for storage.
Can we do the same thing with less memory?
Hirschberg developed this space reduction algorithm.
It reduces the required space from O(nm) to O(n) for n<m while doubling the
worst-case time bound.
Consider the following:
What if we only wanted the best score of the alignment and not the alignment? What if
we want only V(n,m)?
Then the space needed can be reduced to 2m.
To compute the scores V(i,j) of row i, the only values needed from previous rows are those
of row i-1; any earlier row can be discarded.
After a row has been computed, copy it to the space occupied by the previous row and
start over.
After n iterations, the last cell in the row holds V(n,m), the optimal score. In
this way, V(n,m) can be computed in O(m) space and O(nm) time. Any single row or
column can be calculated in these same time and space bounds.
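The two-row idea can be sketched as follows. The function name `score_linear_space` and the scores (match +1, mismatch -1, -2 per space) are assumptions for this example; only the score is returned, not the alignment.

```python
def score_linear_space(s1, s2, match=1, mismatch=-1, gap=-2):
    """V(n,m) for global alignment, keeping only the previous and current row."""
    m = len(s2)
    prev = [j * gap for j in range(m + 1)]            # row 0
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * m                     # column 0 of row i
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,           # diagonal, from row i-1
                         prev[j] + gap,               # from above, row i-1
                         cur[j - 1] + gap)            # from the left, row i
        prev = cur                                    # rows before i are discarded
    return prev[m]


print(score_linear_space("ACGT", "AGT"))
```

Only `prev` and `cur` are alive at any time, so the memory use is about 2m cells regardless of n.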
Finding the optimal alignment in linear space
In normal DP, we find the optimal alignment by traversing the pointers set when
computing the full DP table. But when using linear space we can’t store the whole set of
pointers or scores.
Define S1^r as the reverse of string S1, and S2^r as the reverse of string S2.
Define V^r(i,j) as the score of aligning the string consisting of the first i characters
of S1^r with the string consisting of the first j characters of S2^r.
The optimal path connects the start cell (0,0) with the end cell (n,m) and passes through
an unknown cell (n/2, k) of the middle row.
Let k* be a position that maximizes [V(n/2,k) + V^r(n/2,m-k)].
This position k* can be found in O(nm) time and O(m) space.
The optimal subpath that goes through row n/2 can be found and stored in those time and
space bounds.
The optimal alignment (traceback) goes through this cell (n/2,k*).
Linear Space DP
• computation of optimal value is trivial
• do backtracking
• select optimal cost in middle row
• split into two rows
To obtain the optimal alignment, do the same procedure recursively on each substring.
Each time the computational time reduces by a factor of two:
$$\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \dots\right) nm \le 2nm$$
This reduces the space from O(nm) to O(m) while only doubling the worst-case time
needed for the computation.
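The recursion can be sketched as follows. This is an illustration under assumptions rather than the lecture's code: the names `last_row` and `hirschberg` and the scores (match +1, mismatch -1, -2 per space) are chosen for the example, and a constant per-space penalty is used so that a single row per pass is enough.

```python
def last_row(s1, s2, match, mismatch, gap):
    """Last row of the global-alignment table of s1 vs s2, in O(len(s2)) space."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev


def hirschberg(s1, s2, match=1, mismatch=-1, gap=-2):
    """One optimal global alignment of s1 and s2 using linear working space."""
    n, m = len(s1), len(s2)
    if n == 0:
        return "-" * m, s2
    if m == 0:
        return s1, "-" * n
    if n == 1:
        # Base case: pair the single character with a matching position if any,
        # unless putting every character opposite a space scores higher.
        k = s2.find(s1)
        sub = match if k >= 0 else mismatch
        k = max(k, 0)
        if sub + (m - 1) * gap >= (m + 1) * gap:
            return "-" * k + s1 + "-" * (m - k - 1), s2
        return s1 + "-" * m, "-" + s2
    mid = n // 2
    # V(n/2, k): forward pass over the top half of s1
    left = last_row(s1[:mid], s2, match, mismatch, gap)
    # V^r(n/2, m-k): same pass over the reversed bottom half and reversed s2
    right = last_row(s1[mid:][::-1], s2[::-1], match, mismatch, gap)
    # k*: column where the optimal path crosses the middle row
    k_star = max(range(m + 1), key=lambda k: left[k] + right[m - k])
    a1, b1 = hirschberg(s1[:mid], s2[:k_star], match, mismatch, gap)
    a2, b2 = hirschberg(s1[mid:], s2[k_star:], match, mismatch, gap)
    return a1 + a2, b1 + b2


print(hirschberg("ACGT", "AGT"))
```

Each level of the recursion works on two subproblems whose table areas sum to about half of the parent's, which is where the geometric series above comes from.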