Document

advertisement
Combinatorial aspects of the
Burrows-Wheeler transform
Sabrina Mantaci
Antonio Restivo
Marinella Sciortino
University of Palermo
Burrows-Wheeler Transform
In 1994 M. Burrows and D. Wheeler introduced a new data
compression method based on a preprocessing on the input
string. Such a preprocessing, called after them the BurrowsWheeler Transform (BWT), produces a permutation of the letters
in the input string such that:
• the transformed string is easier to compress than the original
one.
• the original string can be recovered;
The use of this preprocessing allowed to define a class of lossless
data compression algorithms that:
• achieve speed comparable to the algorithms based on the
techniques by Lempel and Ziv;
• obtains a compression ratio close to the best statistical
modelling techniques.
How does BWT work ?
• INPUT: w = abraca
• Lexicographically sort the cyclic rotations of w
F
0
1
2
3
4
5
L
aabrac
abraca
acaabr
bracaa
caabra
racaab
The following properties hold:
I
• the character L[i] is followed in w
by F[i];
• for each character ch, the i-th
occurrence of ch in F corresponds to
the i-th occurrence of ch in L.
• OUTPUT: BWT(w)=L=caraab and the index I=1, denoting the
position of the original word w after the lexicographic
ordering.
Reversibility
The Burrows-Wheeler transform is reversible, in the sense that given
BWT(w) and an index I, it is possible to recover w.
Given L=BWT(w)=caraab and I=1:
• Construct F by alphabetically sorting the letters in L
F
L
I
:
0
1
2
3
4
5
a
a
a
b
c
r
c
a
r
a
a
b
0
1
2
3
4
5
=
012345
134502
w= a b r a c a
• Define a permutation  on {0,1,…,n-1}, establishing a correspondence
between the positions of the same letters in F and in L;
• Starting from position I, we can recover w=w0 … wn as follows:
wi =F[i(I)], where 0(x)=x, i+1(x)= (i(x))
We can deduce that:
REMARK: Two words x and y are conjugate  BWT(x)=BWT(y)
PROPOSITION:
d d
d
d
• If u  v and BWT(v)=a0a1…an-1 then BWT(u)= a 0 a1 ...an 1;
d d
d
• If BWT(v)=a0a1…an-1 and BWT(u)= a 0 a1 ...an 1 then there exists a
conjugate u’ of u such that u’=vd.
Therefore we can study combinatorial properties of the BWT by
studying the conjugacy classes of primitive words.
Standard Words
d1, d2,…,dn,…
a sequence of natural numbers
d10, d i >0
i =2,…,n
Consider the sequence {sn}n 0 defined as:
s0  b


s1  a

dn
s

s
 n 1
n s n 1 , n  0
sn  s
• lim
n 
s is a characteristic Sturmian word
• {sn} 0 is called approximating sequence of s
• (d1, d2,…,dn,… ) is the directive sequence of s
Each finite word sn is a standard word
Characterization of standard words
• A word w is standard if and only if it is a letter or w=vab (or
equivalently w=vba) and v has periods p,q such that gcd(p,q)=1
and |v|=p+q-2.
(extremal case of Fine and Wilf theorem)
• A word w is standard if and only if it is a letter or there exist
palindrome words P,Q,R, such that w = QR= Pxy where
{x,y}={a,b}.
• Standard words correspond to an extremal case of KnuthMorris-Pratt algorithm.
Rotations
Standard words can also be generated by rotations.
Let p,q2 such that gcd(p,q)=1 and n=p+q.
p:{0,1,…,n-1}{0,1,…,n-1} defined as p(z)=z+p (mod n)
b
7
a
0
1
b 6
b
5
Ia={0,1,…,q-1} Ib={q,q+1,…,n-1}
a
4
a
2 a
3
a
 : {0,1,…n-1} {a,b} defined as:
 (x )=a if x Ia, b otherwise.
If n=8, p=3, q=5,…
w=abaababa
THEOREM: Let w=x0x1…xn-1 in {a,b}* , |w|a=q and |w|b=p.
i
• w is a standard word with suffix ba  xi=  (  p ( p ))
• w is a standard word with suffix ab  xi=  (  pi ( p  1))
REMARK : Let u=u0u1…un-1 , v=v0v1…vn-1
If ui=  (  pi ( p )) and vi=  (  pi (t )) then u and v are conjugate.
A new characterization of standard words
THEOREM: Let u be a word over the alphabet {a,b}.
BWT(u)=bpaq with gcd(p,q)=1 if and only if u is a conjugate of a
standard word.
In particular, in order to reconstruct u from BWT(u) and the index I:
if I=p then u is a standard word with suffix ba
if I=p-1 then u is a standard word with suffix ab
COROLLARY: BWT(u) =bkah with gcd(k,h)=d if and only if u=vd where
v is a conjugate of a standard word.
Idea of the proof:
:
0
1
2
3
4
5
6
7
F
L
a
a
a
a
a
b
b
b
b
b
b
a
a
a
a
a
0
1
2
3
4
5
6
7
The permutation  giving the correspondence between the positions of
characters in F and L is (z)=z+p(mod n).
Starting, for example, from the position I=p we can recover the word u,
ui=F(i(p)).
Further Research
• Study extremal case of the BWT for k-letters alphabets with k>2.
For instance for k=3, characterize the words w such that BWT(w)
belongs to c*a*b* or b*c*a*.
This property does work neither with 3-Standard words nor with
balanced words.
• Does a relation between the complexity function of a word w and the
structure of BWT(w) exist?
• Given a language L, one can define BWT(L)={BWT(w) | w in L}. One
can ask whether BWT preserves some properties of a language L,
such as belonging to a certain family of languages in the Chomsky
Hierarchy.
We found negative results
L1=(ab)*, BWT(L1)={bnan | n≥0} a context free language
L2=(abc)*, BWT(L2)={cnanbn | n≥0} a context sensitive language
Further Research
• Is it possible to characterize interesting families of words in terms
of their BWT?
Consider for instance the words generated by finite iterations of the
Thue-Morse morphism m(a)=ab m(b)=ba.
Denote by vR the reversal word of v and by v the word obtained by
interchanging a with b and vice-versa.
Then:
BWT(mn(a))=vvR
Where
n-2
n-3
n-4
0
if n is even
n-2
n-3
n-4
0
if n is odd
v=b2 a2 b2 ...b2 a
v=b2 a2 b2 ...a2 b
Download