Greedy Algorithms for the Shortest Common Superstring
Overview by
• Anton Nesterov, Saint Petersburg State University, Russia
Original paper by
• A. Frieze, Carnegie Mellon University, USA
• W. Szpankowski, Purdue University, USA
Contents of the report
• Description of the problem
• Formulation of main results
• Ideas of the proof
Shortest common superstring problem (SCS)
• Given a collection of strings $x^1, x^2, \ldots, x^n$
• We want to find a shortest superstring that contains each $x^i$ as a substring
Example
• Three words: fear, nature, arena
• Superstring: fearenature
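The example can be checked mechanically. The sketch below is a brute-force search, not one of the algorithms from the paper: it tries every ordering of the words and glues neighbors together with the longest possible overlap. The function names `merge` and `shortest_superstring_bruteforce` are hypothetical, and the running time is exponential, so it is only meant for tiny inputs like this one.

```python
from itertools import permutations


def merge(x: str, y: str) -> str:
    """Append y to x, reusing the longest suffix of x that is a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return x + y[k:]
    return x + y


def shortest_superstring_bruteforce(words):
    """Try every ordering of the words and keep the shortest merged result."""
    best = None
    for order in permutations(words):
        s = order[0]
        for w in order[1:]:
            if w not in s:  # skip words that are already covered
                s = merge(s, w)
        if best is None or len(s) < len(best):
            best = s
    return best


result = shortest_superstring_bruteforce(["fear", "nature", "arena"])
print(result, len(result))
```

For the three words above, the search confirms that no superstring shorter than 11 letters exists.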
Algorithms
• The problem is known to be NP-hard.
• It is therefore of great interest to develop good approximation algorithms.
• Some greedy algorithms will be presented.
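Definitions
• Alphabet $\Sigma = \{\omega_1, \omega_2, \ldots, \omega_M\}$
• We define an overlap and a special sum of two strings $x = x_1 x_2 \ldots x_r$ and $y = y_1 y_2 \ldots y_s$:
$$o(x, y) = \max\{ j : y_i = x_{r-j+i},\ 1 \le i \le j \}$$
$$x \oplus y = x_1 x_2 \ldots x_r\, y_{k+1} y_{k+2} \ldots y_s, \quad \text{where } k = o(x, y)$$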
Example
• $x$ = arena
• $y$ = nature
• $o(x, y) = 2$
• $x \oplus y$ = arenature
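The two definitions translate directly into code. This is a minimal sketch that assumes strings are ordinary Python `str` values; the names `overlap` and `oplus` are chosen here for illustration. The overlap is found by trying suffix lengths from longest to shortest, which is simple but quadratic in the string length.

```python
def overlap(x: str, y: str) -> int:
    """o(x, y): length of the longest suffix of x that is also a prefix of y."""
    for j in range(min(len(x), len(y)), 0, -1):
        if x[len(x) - j:] == y[:j]:
            return j
    return 0


def oplus(x: str, y: str) -> str:
    """x (+) y: x followed by the part of y that is not covered by the overlap."""
    k = overlap(x, y)
    return x + y[k:]


print(overlap("arena", "nature"))  # 2
print(oplus("arena", "nature"))    # arenature
```

Note that the operation is not symmetric: the overlap of nature followed by arena is 0.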
Total optimal overlap
• Suppose we have a set of n strings.
• Assume that we have solved the SCS problem.
• Let S be the shortest superstring. The total optimal overlap is
$$O_n^{opt} = \sum_{i=1}^{n} |x^i| - |S|$$
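As a small worked instance, take the three example words fear, arena and nature. The total optimal overlap is the summed length of the strings minus the length of a shortest superstring; checking all six orderings of the three words shows that fearenature (11 letters) is a shortest one, so:

```latex
O_3^{opt} = |\text{fear}| + |\text{arena}| + |\text{nature}| - |\text{fearenature}|
          = 4 + 5 + 6 - 11
          = 4
```

The value 4 is the sum of the two pairwise overlaps used: "ar" between fear and arena, and "na" between arena and nature.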
Descriptions of algorithms
• Generic greedy algorithm:
    $I \leftarrow \{x^1, x^2, x^3, \ldots, x^n\}$; $O_n^{gr} \leftarrow 0$;
    repeat
        choose $x, y \in I$; $z = x \oplus y$;    (*)
        $I \leftarrow (I \setminus \{x, y\}) \cup \{z\}$;
        $O_n^{gr} \leftarrow O_n^{gr} + o(x, y)$;
    until $|I| = 1$
• GREEDY
– In step (*) we choose x, y so as to maximize o(x, y).
• RGREEDY
– In step (*) x is the string z from the previous iteration, and y is chosen to maximize o(x, y). This means that we have only one long string, which grows by appending strings at its right-hand end.
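Both variants can be sketched in a few lines. The code below is an illustrative, unoptimized reading of the generic scheme, not the authors' implementation: `greedy` recomputes all pairwise overlaps in every round, `rgreedy` starts from the first input string (an assumption made here), and ties are broken by input order. All function names are hypothetical.

```python
def overlap(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for j in range(min(len(x), len(y)), 0, -1):
        if x[len(x) - j:] == y[:j]:
            return j
    return 0


def greedy(strings):
    """GREEDY: always merge the ordered pair with the largest overlap."""
    pool = list(strings)
    total = 0
    while len(pool) > 1:
        k, i, j = max(
            ((overlap(pool[i], pool[j]), i, j)
             for i in range(len(pool))
             for j in range(len(pool)) if i != j),
            key=lambda t: t[0],
        )
        z = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [z]
        total += k
    return pool[0], total


def rgreedy(strings):
    """RGREEDY: grow one string to the right, picking the best next string."""
    current = strings[0]
    rest = list(strings[1:])
    total = 0
    while rest:
        j = max(range(len(rest)), key=lambda idx: overlap(current, rest[idx]))
        k = overlap(current, rest[j])
        current += rest[j][k:]
        total += k
        rest.pop(j)
    return current, total


words = ["fear", "nature", "arena"]
print(greedy(words))
print(rgreedy(words))
```

On the three example words both variants return fearenature with a total overlap of 4; on other inputs the two can differ.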
Bernoulli model
• All strings are of the same length $\ell$.
• Symbols are generated independently of the previous ones.
• Let
$$H = -\sum_{i=1}^{M} p_i \log p_i$$
be the associated entropy for the Bernoulli model, where $p_i = \Pr(x_j = \omega_i)$ is the probability that a symbol equals $\omega_i$.
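The entropy is a one-line computation once the symbol probabilities are known. The sketch below assumes natural logarithms (the slides do not fix the base) and skips symbols of probability zero; the name `entropy` is chosen here for illustration.

```python
import math


def entropy(probs):
    """H = -sum_i p_i log p_i for a list of symbol probabilities (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))  # uniform binary alphabet, equals log 2
print(entropy([0.9, 0.1]))  # a skewed alphabet has lower entropy
```

A uniform alphabet of size M gives the largest possible value, log M.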
Main result
• Theorem.
Let us consider the SCS problem under the Bernoulli model.
Then, with high probability:
$$\lim_{n \to \infty} \frac{O_n^{opt}}{n \log n} = \frac{1}{H}, \qquad \lim_{n \to \infty} \frac{O_n^{gr}}{n \log n} = \frac{1}{H},$$
provided
$$|x^i| > -\frac{4}{\log P} \log n,$$
where
$$P = \sum_{j=1}^{M} p_j^2$$
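To get a feeling for the quantities in the theorem, the sketch below evaluates them numerically. It reads the theorem as: the total overlap is about (1/H) n log n, provided every string is longer than -(4 / log P) log n, with P the sum of squared symbol probabilities. The function name `bernoulli_predictions` is hypothetical, and natural logarithms are assumed; since both results are ratios of logarithms, any base gives the same numbers as long as it is used consistently.

```python
import math


def bernoulli_predictions(n, probs):
    """Return (predicted total overlap, minimum string length) for n strings."""
    H = -sum(p * math.log(p) for p in probs if p > 0)  # entropy
    P = sum(p * p for p in probs)                      # sum of squared probabilities
    predicted_overlap = n * math.log(n) / H
    min_length = -4 * math.log(n) / math.log(P)
    return predicted_overlap, min_length


overlap_estimate, min_len = bernoulli_predictions(1024, [0.5, 0.5])
print(overlap_estimate, min_len)
```

For 1024 strings over a uniform binary alphabet this gives a predicted total overlap of about 10240 symbols (10 per string) and a length requirement of about 40 symbols.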
Other models
• Markovian model:
each symbol depends only on the previous one
• Mixing model
When we solve more complicated problems using SCS, the two
previous models are too unrealistic, and a mixing model is needed.
The main idea is as follows.
Suppose we have the string:
...100110110010
The farther back a symbol is, the weaker the dependence on it.
Compression
• SCS can be used to compress strings.
• Instead of storing all strings of total length $n\ell$, we can store
the SCS and n pointers indicating the beginning of each
original string, plus the lengths of all strings.
• The compression ratio will be:
$$C_n = \frac{n\ell - \frac{1}{H} n \log n + \log_2\left(n\ell - \frac{1}{H} n \log n\right)}{n\ell}$$
• But when the length of a string grows faster than $\log n$,
the compression ratio tends to 1.
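Ideas of the proof
• First of all we show that it is unlikely that there is a
pair of strings with an overlap of more than half of their length:
$$o(x^i, x^j) > \frac{\ell}{2}$$
• Let $E$ denote the event that there is no such pair.
• For two independent strings, a suffix of length $k$ of one matches the prefix of length $k$ of the other with probability $P^k$. Hence, if $\ell \ge K \log n$, then
$$\Pr(\bar{E}) \le n^2 \sum_{k=\ell/2}^{\ell} P^k = O\left(n^{2 + (K \log P)/2}\right) = o(1)$$
provided
$$K > -\frac{4}{\log P}$$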
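The claim that long overlaps are rare can be explored empirically. The following is a small Monte Carlo sketch, not part of the proof: it draws n random strings under the Bernoulli model and reports how often some ordered pair of distinct strings overlaps in more than half of the string length. The function names, the default number of trials, and the encoding of symbols as lowercase letters (so at most 26 symbols) are assumptions made here.

```python
import random


def overlap(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for j in range(min(len(x), len(y)), 0, -1):
        if x[len(x) - j:] == y[:j]:
            return j
    return 0


def has_long_overlap(strings) -> bool:
    """True if some ordered pair of distinct strings overlaps in more than half."""
    half = len(strings[0]) / 2
    return any(
        overlap(x, y) > half
        for i, x in enumerate(strings)
        for j, y in enumerate(strings)
        if i != j
    )


def estimate_long_overlap_probability(n, length, probs, trials=200, seed=0):
    """Fraction of random instances that contain a long overlap."""
    rng = random.Random(seed)
    symbols = [chr(ord("a") + i) for i in range(len(probs))]
    hits = 0
    for _ in range(trials):
        strings = [
            "".join(rng.choices(symbols, weights=probs, k=length))
            for _ in range(n)
        ]
        hits += has_long_overlap(strings)
    return hits / trials


print(estimate_long_overlap_probability(n=20, length=40, probs=[0.5, 0.5]))
print(estimate_long_overlap_probability(n=20, length=6, probs=[0.5, 0.5]))
```

With strings much longer than log n the estimate should be close to 0, while very short strings produce long overlaps frequently.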
Two halves of a string
• Let’s consider a part of our superstring:
...ancneanaosaasunanssana..
• Overlaps of neighboring strings are marked in red.
• Each string has two marked overlaps: the “tail” and the “nose”.
• Since such overlaps almost never take up more than
half of a string, we divide every string into two parts,
the first half and the second half, each of length $\ell/2$.
• Then, to consider the overlap of two strings a and b,
we need to work only with the first half of a and the second
half of b.
RGREEDY and trees
• We now consider a tree process that is equivalent to RGREEDY.
• The tree T is an infinite rooted M-ary tree.
• The M edges (M is the size of the alphabet) leading down from each vertex are labeled with $\omega_1, \omega_2, \ldots, \omega_M$.
• Thus, each vertex of depth d is identified with a string of length d.
Modeling RGREEDY
• We mark each vertex with the number of strings that have the prefix associated with this vertex.
• A mark “k” means that there are k strings starting with $\omega_M \omega_1$.
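The marked tree is a trie with a counter at every vertex. The sketch below is a hypothetical illustration: unlike the infinite M-ary tree of the slides, it only stores the vertices that some input string actually passes through, and the names `Node`, `build_prefix_tree` and `prefix_count` are chosen here.

```python
class Node:
    """A tree vertex: how many strings pass through it, and its children."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_prefix_tree(strings):
    """Build a trie in which every vertex counts the strings with that prefix."""
    root = Node()
    for s in strings:
        node = root
        node.count += 1  # every string has the empty prefix
        for ch in s:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def prefix_count(root, prefix):
    """Number of stored strings that start with the given prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


tree = build_prefix_tree(["0110", "0101", "1100"])
print(prefix_count(tree, "01"))  # 2
print(prefix_count(tree, "00"))  # 0
```

Descending the tree along the letters of a string and reading off the counters answers "how many strings start with this prefix" in time proportional to the prefix length.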
Example
1. We descend the tree, following the letters of the second half of the first string, down to a depth of $\ell/2$.
2. We stop at the highest positive integer; call it t.
3. So we can find t strings that have suffixes equal to the prefix of the first string.
4. We consider one of these t strings in the same way (see step 1). We repeat this procedure n times.
GREEDY
• Let D be the digraph ([n], A)
– with edge weights $w_{i,j} = o(b_i, a_j)$
• We sort the edges A into $e_1, e_2, \ldots, e_N$, where $N = n^2$,
so that $w(e_i) \ge w(e_{i+1})$.
• Thus the problem is to find a path along the edges
that has the maximum total weight.
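The edge-sorting view of GREEDY can be sketched as follows. This is an illustration under simplifying assumptions, not the construction analyzed in the paper: edge weights are overlaps of the full strings rather than of their halves, self-loops are left out, and a union-find structure (introduced here) prevents accepted edges from closing a cycle. Edges are scanned in order of decreasing weight and accepted whenever the tail has no successor yet and the head has no predecessor yet.

```python
def overlap(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for j in range(min(len(x), len(y)), 0, -1):
        if x[len(x) - j:] == y[:j]:
            return j
    return 0


def greedy_path(strings):
    """Return (vertex order, total weight) of the greedily built path."""
    n = len(strings)
    edges = sorted(
        ((overlap(strings[i], strings[j]), i, j)
         for i in range(n) for j in range(n) if i != j),
        key=lambda e: -e[0],
    )
    successor = {}
    has_predecessor = set()
    parent = list(range(n))  # union-find forest over the vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total = 0
    for w, i, j in edges:
        if i in successor or j in has_predecessor or find(i) == find(j):
            continue  # edge would break the path structure
        successor[i] = j
        has_predecessor.add(j)
        parent[find(i)] = find(j)
        total += w

    start = next(v for v in range(n) if v not in has_predecessor)
    order = [start]
    while order[-1] in successor:
        order.append(successor[order[-1]])
    return order, total


print(greedy_path(["fear", "nature", "arena"]))
```

For the three example words the path visits fear, arena, nature (indices 0, 2, 1) with total weight 4, matching the superstring fearenature.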