Lect2

advertisement
By Cruchemor, Landau and Ziv-ukelson
Abstract
We present an O(n²/log n) algorithm for computing the optimal
global alignment value of two strings,of size n each, over a
constant alphabet.the algorithm is even faster when the sequence
is compressible. In fact, for most texts, the complexity of the
algorithm is actually O(hn²/log n).
Let A and B be two run-length encoded strings of encoded
lengths m’ and n’, respectively.
we will show an algorithm for comparing A and B in
O(m’n+n’m) complexity, using any distance or similarity
schemes with additive gap.
Definition 1 – global alignment problem:
given a scoring matrix  over the alphabet , the
similarity of two strings A and B is defined as the value
maxV of the alignment of A and B, that maximizes the
total alignment value.
* The score value maxV is called the optimal global
alignment value of A and B.
* A description of a maxV – scoring transformation of A
into B is called a global alignment trace.
Global alignment via dynamic
programming
Reminder:
The classical dynamic programming algorithm
creates the alignment graph, row by row in a left to
right order :
V(I,j)= max [V(I,j-1) + (,Bj),
V(I-1,j) + (Ai,),
V(I-1,j-1) + (Ai,Bj)]
Complexity: O(n²).
Alignment graph
B
A c
t
a
c
g
a
g
a
0
a
1
a
2
c
3
g
4
a
5
c
6
g
7
1
2
3
4
5
6
7
8
Alignment graph for comparing the strings A=“ctacgaga” and B=“aacgacga”.
a
8
Scoring matrix
Scoring matrix , specifies the substitution score for
each pair of characters and the insertion/deletion
scores for each character from the alphabet.
-
a
c
g
t
-1
-1
-1
-1
a
-1
1
-1
-1
-1
c
-1
-1
1
-1
-1
g
-1
-1
-1
1
-1
t
-1
-1
-1
-1
1
A gap is the result of the deletion of one or more consecutive
characters in one of the sequences. Additive gap scores
assign a constant weight to each of the consecutive
characters.
The algorithm assume a scoring scheme with additive gap.
A block partition of the alignment
graph based on LZ
LZ Main idea:
based on the idea of self reference. while scanning text,
the text seen so far is parsed into phrases, where each
phrase is the longest matching phrase seen previously
plus one character. Each phrase is encoded as an index
to it’s prefix, plus the extra character.
LZ example
Example:
S=“aacgacga” is divided into four phrases:
a, ac, g, acg.
a
The trie for S:
c
g
4
2
1
0
g
3
LZ continuous
Theorem (Lempel and Ziv): given a sequence S of
size n over a constant alphabet. The maximal
number of distinct phrases in S is O(n/log n).
A tighter upper bound will be O(hn/log n) where h
is the entropy factor – a real number, 0 < h  1,
which measures the amount of disorder in a
sequence. Entropy is small when there is a lot of
order (I.e. the sequence is repetitive) and large when
there is a lot of disorder.
LZ – partitioned alignment graph
0
1
2
3
4
a
ac
g
acg
Trie for B
a
a
1 c
c
2
t
g
3
a
4
4
c
g
a
5
g
a
g
0
1
3
2
Trie for A
a
g
5
3
0
c
t
2
1
g
4
Computing the optimal global similarity value
• Let Xa and Yb denote phrases in A and B respectively , obtained
by extending a previous phrase X/Y with character a/b respectively.
• block G corresponds to the comparison of Xa and Yb.
• input border I – the left and top borders of G.
• output border O – the bottom and right borders of G.
• we define the following three prefix blocks of G:
1. The left prefix of G denotes the block comparing phrase Xa of A
and phrase Y of B.
2. The diagonal prefix of G denotes the block comparing phrase X
of A and phrase Y of B.
3. The top prefix of G denotes the block comparing phrase X of A
and phrase Yb of B.
Computing block G
0
1
2
3
4
a
ac
g
acg
a
1 c
a 0 g
c 1
g 2
4
2 t
3/2
3 a
3/4
a 0t
g 3c
2
3
1
5
g
4
c
4
g
5
Trie for A
Trie for B
a
a
g
c
g
a
5/2
5/4
g
a
a
c
a
g
left prefix (5,2)
a
c
a
diagonal prefix (3,2)
a
c
g
a
top prefix (3,4)
Graph G for
block (5,4)
Note that the graph for the left prefix of G is identical to the sub-graph
of G containing al columns but the last one.
a
a
c
c
g
a
a
g
g
left prefix
G
More specifically, both the structure and the weights of edges are
identical, but the weights to be assigned during the computation
may vary according to input border values.
Similarly for the top and diagonal prefix graphs.
The only new cell in G, which doesn’t appear in any of it’s prefix
blocks is the cell for comparing a and b. This cell consists of one
a c g
new vertex and three new edges.
a
g
I/O propagation Across G
The work for each block consists of two stages:
1. Encoding: study the structure of G and represent it in an efficient
way.
2. Propagation: given I and the encoding of G, compute O for G.
The structure of G is encoded by computing weights of optimal
path connecting each entry of its input border with each entry of
its output border.
DIST matrix is used.
DIST[i,j] stores the weight of the optimal path from entry i of the
input border of G to entry j of its output border.
OUT matrix – merges the information from input row I and
DIST. given I and DIST, OUT[r,j]=Ir + DIST[r,j].
Vertex Oj of O is the maximum of column j of OUT.
DIST and OUT matrix example
Block G – subsequences “acg”, “ag”
DIST matrix
0
-1
-2
-3


-1
-1
-2
-1
-3

-2
0
0
1
-1
-3

-2
-2
0
-2
2

-2
0
-1
-1



-2
-1
0
I (input borders)
I
0
1
0
O
OUT matrix
0
-2
-
-
I1=2
1
1
0
1
-1
-
I2=3
1
3
3
4
2
0
I3=2
-12
0
0
2
0
0
-13 -13
-1
1
0
0
-14 -14
-14
1
2
3
I5=3
O (output borders) O0
1
O1 O2
3 3
O3 O4
4 2
g
5
4
g
-1
I4=1
4
1
0
I0=1
c
5
1

I0 + DIST[0,j]
3
a
-2

a
O5
3
G
2
3
the rectangle problem
Since paths in an alignment graph can only assume a left-to-right,
top-to-bottom direction, connection between some input border
vertices and some output vertices are impossible. Therefore, the
matrices are missing both a lower-left triangle and upper-right
triangle.
0 -1 -2 -3  
-3

-1
-1
-2
-1
-2
0
0
1
-1 -3

-2
-2
0
-2
-2


-2
0
-1
-1



-2
-1
0
For the next stage, OUT matrix have to be a full, rectangular matrix.
Addressing the rectangle problem
The undefined entries of OUT can be complemented in constant
time each, as follows:
1. The missing upper-right triangle can be completed by setting all
entries to -.
2. Let k denote the maximal absolute value of a score in . The
missing lower-left triangle entries will be setting to: -(n+i+1)*k.
where n is the shortest path in the original alignment graph.
Lemma : complementing the undefined entries as
described above preserves the concave total monotonicity
condition of out, and does not introduce new row-maxima.
DIST and OUT properties
Definition: a matrix M[0…m,0…n] is Monge if either condition 1
or 2 below holds for all a,b=0…m; c,d=0…n:
1. Convex condition: M[a,c] + M[b,d]  M[b,c] + M[a,d] for all
a<b and c<d.
2. Concave condition: M[a,c] + M[b,d]  M[b,c] + M[a,d] for all
a<b and c<d.
c
e
d
a
u
v
w
b
x
y
z
Previous researches observed that DIST matrices are Monge arrays.
Since DIST is Monge, so is OUT, which is a DIST with constants
added to its rows.
An important property of Monge arrays is that of being totally
monotone.
DIST and OUT properties continuous
Definition: a matrix M[0…m,0…n] is totally monotone if either
condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
1. Convex condition: M[a,c]  M[b,c] M[a,d]  M[b,d] for all
a<b and c<d.
2. Concave condition: M[a,c]  M[b,c]  M[a,d]  M[b,d] for
all
a<b and c<d.
Both DIST and OUT are totally monotone by the concave
condition.
a
b
The concave condition
o
If b-c is better than a-c, than b-d is better than a-d.
c
d
Computing O
An algorithm called SMAWK can compute in O(n) time all
row and column maximum of an n x n totally monotone
matrix, by querying only O(n) elements of the array. Hence
we can use SMAWK on OUT to compute output row O.
clearly, if both the full DIST and I are available, than
computing an element of OUT is O(1) work. Thus, there is
no need to compute and store the full matrix OUT but only
the entries of OUT which are needed by SMAWK.
Computing O
In order to efficiently utilize Monge and Monotonicity
properties, we need to address two problems:
1. SMAWK is intended for a full, rectangular matrix –
already explained.
2. How to efficiently compute DIST and represent it in a
format which allows direct access to its entries.
Incremental update of the new DIST
information for G
As shown, in block G, there is only one “new” cell, in its
lowest – rightmost corner. When processing G, we compute
the scores of t (t= |I|) new optimal paths, leading from input
border i to the new vertex (lr,lc). This values correspond to
column lc of the DIST matrix for G and can be computed as
follows:
a
a
g
c
g
Incremental update of the new DIST
information for G
The weight of the optimal path from entry Ii to (lr,lc) is equal to the
maximum among:
1. Entry [i] of column lc –1 of the DIST for the left prefix of G, plus
the weight of the horizontal edge leading into (lr,lc).
a
Left prefix DIST[i , lc –1] + edge
a
g
c
g
Incremental update of the new DIST
information for G
The weight of the optimal path from entry Ii to (lr,lc) is equal to the
maximum among:
1. Entry [i] of column lc –1 of the DIST for the left prefix of G, plus
the weight of the horizontal edge leading into (lr,lc).
2. Entry [i] of column lc –1 of the DIST for the diagonal prefix of G,
plus the weight of the diagonal edge leading into (lr,lc).
a
diagonal prefix DIST[i , lc –1] + edge
a
g
c
g
Incremental update of the new DIST
information for G
The weight of the optimal path from entry Ii to (lr,lc) is equal to the
maximum among:
1. Entry [i] of column lc –1 of the DIST for the left prefix of G, plus
the weight of the horizontal edge leading into (lr,lc).
2. Entry [i] of column lc –1 of the DIST for the diagonal prefix of G,
plus the weight of the diagonal edge leading into (lr,lc).
3. Entry [i] of column lc of the DIST for the top prefix of G, plus the
weight of the vertical edge leading into (lr,lc).
a
top prefix DIST[i , lc] + edge
a
g
c
g
Maintaining direct access to DIST columns
As shown, for each block only one new DIST column has been computed and
stored. All other columns besides column lc of the DIST for G need to be
obtained from G’s prefix ancestor blocks.
Therefore, before the execution of SMAWK begins, a vector with pointers to
all columns of the DIST for G is constructed. This vector can be deleted after
the computation for G.
Column lc is set to the newly constructed DIST vector for G.
All columns of indices smaller than lc are obtained via lc recursive calls to left
prefix blocs of G. All columns of indices greater than lc are obtained via lc
recursive calls to top prefix blocs of G.
DIST(5,4)
(Block G)
0
-1
-2
-3
-1
-1
-2
-1
-3
-2
0
0
1
-1
-3
-2
-2
0
-2
-2
-2
0
-1
-1
-2
-1
0
Querying a prefix block and obtaining its
DIST column in constant time
Each block will be identified by a pair of numbers, composed of the
serial numbers for its corresponding phrases in the LZ tries for A
and B.
Trie for A
Trie for B
For example:
Comparison of “ag” and “acg” – (5,4).
g
4
a 0 g
c 1
3
2
a 0 t
g 3c
2
1
5
g
4
The left prefix of G can be identified in constant time as follows:
The first number identical to the serial number of Xa, and the
second corresponding to the serial number of y, which is the direct
ancestor of Yb in the trie for B.
Querying a prefix block and obtaining its
DIST column in constant time
Each block will be identified by a pair of numbers, composed of the
serial numbers for its corresponding phrases in the LZ tries for A
and B.
For example:
Comparison of “ag” and “acg” – (5,4).
Trie for B
a 0 g
c 1
3
g 2
4
Trie for A
a 0 t
g 3c
2
1
5
g
4
The left prefix of G can be identified in constant time as follows:
The first number identical to the serial number of Xa, and the
second corresponding to the serial number of y, which is the direct
ancestor of Yb in the trie for B.
Example: left prefix of G (5,4) will be: (5,2).
Similarly, Top prefix (3,4) can be identified in constant time.
Block table
Another data structured to be constructed is a Block Table,
containing an entry for each partitioned block of the alignment
graph. The entry for each block points to the start of its new
DIST column, and can directly accessed via the block’s phrase
number index pair.
0
1
2
3
4
Trie for B
a 0 g
c 1
3
g 2
4
0
1
2
3
4
5
-3
-1
New DIST column for (5,4)
1
0
0
-2
Trie for A
a 0 t
g 3c
2
1
5
g
4
Maintaining direct access scheme
Block Table
0
1
2
3
4
Trie for A
Trie for B
0
a 0 g
c 1
3
2
1
2
g
3
4
4
a
g
5
0 t
3c
g
2
1
4
5
DIST(5,4)
(Block G)
0
-1
-2
-3
-1
-1
-2
-1
-3
-2
0
0
1
-1
-3
-2
-2
0
-2
-2
-2
0
-1
-1
-2
-1
0
complexity
Assuming |A|=|B|=n.
LZ algorithm – O(n).
Number of phrases in both A and B – O(hn/log n).
Number of blocks in the alignment graph – O(h²n²/(log n)²)
For each block:
1. Updating the encoding structure for G – the new DIST column for G can
be constructed in o(t) (linear with the size of its I/O borders).
2. Maintaining direct access to DIST columns – the vector with pointers to
columns of the DIST for G can be set in O(t).
3. Propagating I/O values across the block – SMAWK in O(t).
Total complexity – since the work for every block is linear with its I/O
borders, the total complexity is linear with the total size of the borders of the
blocks. In the alignment graph there are O(hn/log n) rows of size |B|, and
O(hn/log n) columns of size |A|.
Therefore, the total complexity is O(hn²/(log n)).
Global similarity optimal alignment
trace recovery
The recovery of an optimal global alignment trace between A
and B goes back from vertex (n,n) to vertex (0,0).for each block
crossed, the internal alignment trace is reported.
In order to support the recovery of block crossing paths in linear
time, the following information for a given block G is required:
1. During propagation stage, input border entry i, which is the
source of the highest scoring path to output border entry j, is
saved for each output entry j.
2. During encoding, an additional O(t) sized vector of pointers
called ancestors vector, is computed for G. the value of
ancestor[lc] is set to G. all columns of indices smaller than lc are
obtained via lc recursive calls to left prefix blocks of G. columns of
indices greater than lc are obtained via lc recursive calls to top
prefix blocks of G.
2
Ancestor[3]=G
Ancestor[2]=“ag” x “ac”
a
3
c
4
g
5
5
a
1
4
0
3
g
0
O
1
G
2
2. During encoding, an additional O(t) sized vector of pointers
called ancestors vector, is computed for G. the value of
ancestor[lc] is set to G. all columns of indices smaller than lc are
obtained via lc recursive calls to left prefix blocks of G. columns
of indices greater than lc are obtained via lc recursive calls to top
prefix blocks of G.
Ancestor[3]=G
2
Ancestor[4]=“a” x “acg”
Ancestor[2]=“ag” x “ac” Ancestor[5]=“ ” x “acg”
a
Ancestor[1]=“ag” x “a”
g
Ancestor[0]= “ag” x “ “
a
3
c
4
g
5
5
1
4
0
3
0
O
1
G
2
3. During encoding, G’s new vertex is annotated with an additional
O(t) sized vector of pointers called direction. These pointers are
set during the DIST column computation, according to the
direction of the last edge in the optimal path from I[i] to the new
vertex.
Given that the optimal path enters through entry j of the output
border of G,The trace-back of the part of the path going through G
has two stages:
• destination and origin initialization: fetching I[i] which was
stored as the origin for the highest path to G’s O[j]. Entry i is the
destination for the trace-back. In addition, the ancestor prefix P
pointed by ancestor[j] is fetched. The edge recovery begins in
block P.
• trace-back the part of the path contained in P, from entry j on P’s
output (the new vertex of P), to entry i on its input border. This
is done by backtracking a dynasty of prefix ancestor blocks
internal to P, using the direction vector computed for each of the
traversed blocks. If direction[I] of the traversed block specifies a
horizontal edge, than the trace-back retreats to the left prefix of
P, and an “insertion” operation is reported in the alignment trace.
Correspondingly, “substitution” and “deletion” are reported when
backtracking to diagonal and top prefix blocks.
The recovery continues until the full optimal alignment trace is
recovered.
Trace-back example
Trace-back in G – j=3
Weight of optimal path = 1 (from I[2] to O[3])
2
Stage 1:
I
g
5
5
4
0
0
a
Direction[2]= diagonal
c
a
1
G
2
O
Alignment trace:
a
Direction[1]=horizontal
Done
4
g
Stage 2:
Direction[1]=diagonal
c
1
Ancestor[3] is fetched (which is G)
Left prefix is fetched
3
a
Entry I[2] is the destination.
Diagonal prefix is fetched
a
a
- substitution
- insertion
- substitution
3
Trace complexity
The vectors direction and ancestor, and the input border entry I,
can be computed during the encoding and propagation stages in
O(t).
The first stage in the trace-back can be done in constant time.
Each edge in The recovery stage can be computed in constant
time (including accessing to prefix blocks and its direction
vectors). The highest scoring global alignment between strings
A and B can be recovered in additional time linear in its size,
using the O(hn²/log n) of the encoding/propagation..
Comparing run length encoded strings
A string S is run-length encoded if it is described as an ordered
sequence of pairs (,i):
aaabbcccaa = (a,3),(b,2),(c,3),(a,2) or a³b²c³a²
Former algorithms achieved O(m’n+n’m) complexity for
computing edit distance between two run-length strings.
This algorithm runs at the same complexity and is suitable to
any similarity scoring scheme with additive gap.
Let m and n be the lengths of two run-length encoded strings
X and Y, of encoded length m’ and n’, respectively.
The alignment graph is partitioned into blocks, each block
consists of two runs – one of X and one of Y. therefore the
alignment graph is partitioned into m’ n’ blocks.
The algorithm also propagates accumulated scores from the
left-upper borders to the bottom-right borders of each block.
a
a
a
a
b
b
b
a
a
a
b
b
b
b
b
b
Consider the block R for comparing i of X and  j of Y.
An edge in R could be assigned one of three values: D(diagonal),
H (horizontal) and V (vertical) according to scoring scheme.
h – the difference in row index between entry i on the input
border and entry j on the output border of R.
w – the difference in column index between entry i on the input
border and entry j on the output border of R.
We can compute DIST[i,j] in constant time, given h and w, and the
values D, H and V.
Computing DIST[i,j]
• H + V  D. clearly, the optimal path can use all possible
diagonal edges and only than the minimal H and V edges.
DIST[i,j] obtains one of three values:
1. If w = h,
DIST[i,j]=
D x h.
2. If w > h,
DIST[i,j]=
D x h + H x (w - h).
3. If w < h,
DIST[i,j]=
D x w + V x (h - w).
• H + V >D. in this case an optimal path never uses any diagonal
edge.
DIST[i,j]= H x w + V x h.
Therefore DIST[i,j] can be easily computed in constant time
given a scoring scheme.
Correctness and complexity
Block R consists of   X and   Y.
The values D,H and V are the same for all edges in R, and can
be find in the scoring scheme in constant time.
The O vector for each block is computed using SMAWK.
Vector I for block R can be obtained from the O vectors for the
blocks above R and the block in its left.
Any value OUT[i,j]= I[i] + DIST[i,j], needed by SMAWK, can
be computed in constant time.
Since The work for each block is linear with its I/O borders, the
total time complexity is linear with the total size of the borders
of the blocks – O(m’n + n’m).
Note that by computing only the relevant DIST entries (and
relevant OUT entries) in constant time, and “on the fly”, these
values do not need to be stored, and so we can improve space
complexity.
Download