Combinatorial aspects of the Burrows-Wheeler transform Sabrina Mantaci Antonio Restivo Marinella Sciortino University of Palermo Burrows-Wheeler Transform In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on a preprocessing on the input string. Such a preprocessing, called after them the BurrowsWheeler Transform (BWT), produces a permutation of the letters in the input string such that: • the transformed string is easier to compress than the original one. • the original string can be recovered; The use of this preprocessing allowed to define a class of lossless data compression algorithms that: • achieve speed comparable to the algorithms based on the techniques by Lempel and Ziv; • obtains a compression ratio close to the best statistical modelling techniques. How does BWT work ? • INPUT: w = abraca • Lexicographically sort the cyclic rotations of w F 0 1 2 3 4 5 L aabrac abraca acaabr bracaa caabra racaab The following properties hold: I • the character L[i] is followed in w by F[i]; • for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L. • OUTPUT: BWT(w)=L=caraab and the index I=1, denoting the position of the original word w after the lexicographic ordering. Reversibility The Burrows-Wheeler transform is reversible, in the sense that given BWT(w) and an index I, it is possible to recover w. Given L=BWT(w)=caraab and I=1: • Construct F by alphabetically sorting the letters in L F L I : 0 1 2 3 4 5 a a a b c r c a r a a b 0 1 2 3 4 5 = 012345 134502 w= a b r a c a • Define a permutation on {0,1,…,n-1}, establishing a correspondence between the positions of the same letters in F and in L; • Starting from position I, we can recover w=w0 … wn as follows: wi =F[i(I)], where 0(x)=x, i+1(x)= (i(x)) We can deduce that: REMARK: Two words x and y are conjugate BWT(x)=BWT(y) PROPOSITION: d d d d • If u v and BWT(v)=a0a1…an-1 then BWT(u)= a 0 a1 ...an 1; d d d • If BWT(v)=a0a1…an-1 and BWT(u)= a 0 a1 ...an 1 then there exists a conjugate u’ of u such that u’=vd. Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words. Standard Words d1, d2,…,dn,… a sequence of natural numbers d10, d i >0 i =2,…,n Consider the sequence {sn}n 0 defined as: s0 b s1 a dn s s n 1 n s n 1 , n 0 sn s • lim n s is a characteristic Sturmian word • {sn} 0 is called approximating sequence of s • (d1, d2,…,dn,… ) is the directive sequence of s Each finite word sn is a standard word Characterization of standard words • A word w is standard if and only if it is a letter or w=vab (or equivalently w=vba) and v has periods p,q such that gcd(p,q)=1 and |v|=p+q-2. (extremal case of Fine and Wilf theorem) • A word w is standard if and only if it is a letter or there exist palindrome words P,Q,R, such that w = QR= Pxy where {x,y}={a,b}. • Standard words correspond to an extremal case of KnuthMorris-Pratt algorithm. Rotations Standard words can also be generated by rotations. Let p,q2 such that gcd(p,q)=1 and n=p+q. p:{0,1,…,n-1}{0,1,…,n-1} defined as p(z)=z+p (mod n) b 7 a 0 1 b 6 b 5 Ia={0,1,…,q-1} Ib={q,q+1,…,n-1} a 4 a 2 a 3 a : {0,1,…n-1} {a,b} defined as: (x )=a if x Ia, b otherwise. If n=8, p=3, q=5,… w=abaababa THEOREM: Let w=x0x1…xn-1 in {a,b}* , |w|a=q and |w|b=p. i • w is a standard word with suffix ba xi= ( p ( p )) • w is a standard word with suffix ab xi= ( pi ( p 1)) REMARK : Let u=u0u1…un-1 , v=v0v1…vn-1 If ui= ( pi ( p )) and vi= ( pi (t )) then u and v are conjugate. A new characterization of standard words THEOREM: Let u be a word over the alphabet {a,b}. BWT(u)=bpaq with gcd(p,q)=1 if and only if u is a conjugate of a standard word. In particular, in order to reconstruct u from BWT(u) and the index I: if I=p then u is a standard word with suffix ba if I=p-1 then u is a standard word with suffix ab COROLLARY: BWT(u) =bkah with gcd(k,h)=d if and only if u=vd where v is a conjugate of a standard word. Idea of the proof: : 0 1 2 3 4 5 6 7 F L a a a a a b b b b b b a a a a a 0 1 2 3 4 5 6 7 The permutation giving the correspondence between the positions of characters in F and L is (z)=z+p(mod n). Starting, for example, from the position I=p we can recover the word u, ui=F(i(p)). Further Research • Study extremal case of the BWT for k-letters alphabets with k>2. For instance for k=3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*. This property does work neither with 3-Standard words nor with balanced words. • Does a relation between the complexity function of a word w and the structure of BWT(w) exist? • Given a language L, one can define BWT(L)={BWT(w) | w in L}. One can ask whether BWT preserves some properties of a language L, such as belonging to a certain family of languages in the Chomsky Hierarchy. We found negative results L1=(ab)*, BWT(L1)={bnan | n≥0} a context free language L2=(abc)*, BWT(L2)={cnanbn | n≥0} a context sensitive language Further Research • Is it possible to characterize interesting families of words in terms of their BWT? Consider for instance the words generated by finite iterations of the Thue-Morse morphism m(a)=ab m(b)=ba. Denote by vR the reversal word of v and by v the word obtained by interchanging a with b and vice-versa. Then: BWT(mn(a))=vvR Where n-2 n-3 n-4 0 if n is even n-2 n-3 n-4 0 if n is odd v=b2 a2 b2 ...b2 a v=b2 a2 b2 ...a2 b