Backward Search FM

BACKWARD SEARCH: THE FM-INDEX (FULL-TEXT INDEX IN MINUTE SPACE)
Paper by Ferragina & Manzini. Presentation by Yuval Rikover.

Goals

• Combine text compression with indexing (the original text is discarded).
• Count and locate a pattern P[1..p] by looking at only a small portion of the compressed text.
• Do it efficiently: time $O(p)$ for counting, space $O(n H_k(T)) + o(n)$ bits.

Approach

• Exploit the relationship between the Burrows-Wheeler Transform (BWT) and the suffix array data structure.
• The result is a compressed suffix array that encapsulates both the compressed text and the full-text indexing information.
• It supports two basic operations:
  Count: return the number of occurrences of P in T.
  Locate: find all positions of P in T.

Compression pipeline

1. Process T[1..n] using the Burrows-Wheeler Transform. Receive the string L[1..n], a permutation of T.
2. Run Move-To-Front (MTF) encoding on L. Receive L^MTF[1..n].
3. Encode runs of zeroes in L^MTF using run-length encoding. Receive L^rle.
4. Compress L^rle using a variable-length prefix code. Receive Z, over the alphabet {0,1}.

The BWT matrix

Sort the rotations of T = mississippi#. F is the first column of the sorted matrix, L the last:

row  rotation      F  L
 1   #mississippi  #  i
 2   i#mississipp  i  p
 3   ippi#mississ  i  s
 4   issippi#miss  i  s
 5   ississippi#m  i  m
 6   mississippi#  m  #
 7   pi#mississip  p  p
 8   ppi#mississi  p  i
 9   sippi#missis  s  s
10   sissippi#mis  s  s
11   ssippi#missi  s  i
12   ssissippi#mi  s  i

• Every column is a permutation of T.
• Given row i, the char L[i] precedes F[i] in the original T.
• Consecutive chars in L are adjacent to similar strings in T.
• Therefore L usually contains long runs of identical chars.

Reminder: recovering T from L

1. Find F by sorting L.
2. What is the first char of T? It is m.
3. Find m in L.
4. L[i] precedes F[i] in T. Therefore we get mi.
5. How do we choose the correct i in L?
6. The i's are in the same order in L and F, as are the rest of the chars.
7. That i is followed by s, giving mis. And so on.
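The transform and the recovery steps above can be sketched in a few lines. This is a minimal illustration, not the paper's construction: it sorts rotations naively, and the function names `bwt` and `inverse_bwt` are made up for this sketch. It assumes the end marker `#` is unique and sorts before every other character, as in the example.

```python
def bwt(text):
    """Return (F, L): first and last columns of the sorted rotation matrix."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


def inverse_bwt(last, end="#"):
    """Recover T from L by walking forward through the text, as in the reminder."""
    n = len(last)
    # Stable sort of row indices by L's chars. order[r] is the row of L that
    # holds the same text character as F[r] (same char, same rank in L and F).
    order = sorted(range(n), key=lambda r: last[r])
    first = [last[r] for r in order]      # F = sorted L
    row = last.index(end)                 # L[row] is the end marker, so F[row] is T[1]
    out = []
    for _ in range(n):
        out.append(first[row])            # emit the current text character
        row = order[row]                  # find that same character in L
    return "".join(out)


F, L = bwt("mississippi#")
print(F, L)                # the two columns of the matrix above
print(inverse_bwt(L))      # the original text
```

The stable sort is what answers step 5: equal characters keep their relative order between L and F, so the k-th `i` in F is the k-th `i` in L.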
Move-To-Front (MTF) encoding

• Replace each char in L with the number of distinct chars seen since its last occurrence.
• Keep an array MTF[1..|Σ|], initially sorted lexicographically; each encoded char moves to the front.
• Runs of identical chars are transformed into runs of zeroes in L^MTF.

Example, starting from the list # i m p s (indices 0 1 2 3 4):

L      = i p s s m # p i s s i i
L^MTF  = 1 3 4 0 4 4 3 4 4 0 1 0

The list becomes i # m p s after the first char, p i # m s after the second, s p i # m after the third, and so on.

• This is a bad example.
• For larger, English texts we will receive more runs of zeroes and a dominance of smaller numbers.
• The reason is that the BWT creates clusters of similar chars.

To give a meatier example (not really), we'll change our text to T = pipeMississippi#:

L      = i p p p s s m e i p # i s s i i
L^MTF  = 2 4 0 0 5 0 5 5 4 4 5 2 5 0 1 0

Run-length encoding of zero runs

• Replace any sequence of m zeroes with (m+1) written in binary, LSB first, with the MSB discarded.
• This adds 2 new symbols, written [0] and [1] here to tell them apart from the MTF values 0 and 1.
• L^rle is defined over { [0], [1], 1, 2, …, |Σ| }.

zero run   m+1       binary   LSB first   MSB discarded
0          1+1 = 2   10       01          0
00         2+1 = 3   11       11          1
000        3+1 = 4   100      001         00
0000       4+1 = 5   101      101         10
00000      5+1 = 6   110      011         01
000000     6+1 = 7   111      111         11
0000000    7+1 = 8   1000     0001        000

For the example:

L^MTF  = 2 4 0 0 5 0 5 5 4 4 5 2 5 0 1 0
L^rle  = 2 4 [1] 5 [0] 5 5 4 4 5 2 5 [0] 1 [0]

How to retrieve m

• Given a binary code $b_0 b_1 \dots b_k$, replace each bit $b_j$ with a sequence of $(b_j + 1) \cdot 2^j$ zeroes.
• Example: 10 gives $(1+1) \cdot 2^0 + (0+1) \cdot 2^1 = 4$ zeroes.
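The Move-To-Front and run-length stages can be sketched as follows. This is a minimal illustration under assumptions: the function names are made up, and the two extra run symbols are represented as the Python strings `"0"` and `"1"` so they stay distinct from the integer MTF values.

```python
def mtf_encode(last, alphabet):
    """Move-To-Front: emit each char's current index, then move it to the front."""
    table = sorted(alphabet)
    out = []
    for ch in last:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out


def run_code(m):
    """Code for a run of m zeroes: (m+1) in binary, LSB first, MSB discarded."""
    bits = bin(m + 1)[2:]        # binary, MSB first
    return bits[::-1][:-1]       # reverse, then drop the MSB (now the last bit)


def run_length(code):
    """Inverse of run_code: bit b_j contributes (b_j + 1) * 2^j zeroes."""
    return sum((int(b) + 1) << j for j, b in enumerate(code))


def rle_encode(mtf):
    """Replace each maximal run of zeroes by its run code, one symbol per bit."""
    out, run = [], 0
    for value in mtf + [None]:   # the sentinel flushes a trailing run
        if value == 0:
            run += 1
            continue
        if run:
            out.extend(run_code(run))   # run symbols are the strings "0" / "1"
            run = 0
        if value is not None:
            out.append(value)
    return out


mtf = mtf_encode("ipppssmeip#issii", "#eimps")
print(mtf)
print(rle_encode(mtf))
```

Decoding works bit by bit because the run length is a plain sum over the code's bits, which is why no separator is needed between the bits of one run code.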
Variable-length prefix code

Compress L^rle into Z, over the alphabet {0,1}, as follows. The two run symbols added by run-length encoding are written [0] and [1].

• [1] → 11 and [0] → 10.
• For i = 1, 2, …, |Σ| − 1: write $\lfloor \log(i+1) \rfloor$ zeroes, followed by the binary representation of i+1, which takes $1 + \lfloor \log(i+1) \rfloor$ bits, for a total of $1 + 2\lfloor \log(i+1) \rfloor$ bits.

i   leading zeroes    bin(i+1)   code
1   ⌊log 2⌋ = 1       10         010
2   ⌊log 3⌋ = 1       11         011
3   ⌊log 4⌋ = 2       100        00100
4   ⌊log 5⌋ = 2       101        00101
5   ⌊log 6⌋ = 2       110        00110
6   ⌊log 7⌋ = 2       111        00111
7   ⌊log 8⌋ = 3       1000       0001000

Example, for T = pipeMississippi#:

L      = i p p p s s m e i p # i s s i i
L^MTF  = 2 4 0 0 5 0 5 5 4 4 5 2 5 0 1 0
L^rle  = 2 4 [1] 5 [0] 5 5 4 4 5 2 5 [0] 1 [0]
Z      = 011 00101 11 00110 10 00110 00110 00101 00101 00110 011 00110 10 010 10

Compression bounds

• In 1999, Manzini showed an upper bound on the output size of BWT-based compression whose leading term is $8\, n H_k(T)$; the remaining terms are of lower order (a term of about 0.08 n, a log n term, and a constant $g_k$).
• The article presents a bound using $H_k^*(T)$, the modified empirical entropy: the maximum compression we can achieve using, for each symbol, a codeword that depends on a context of size at most k (instead of always using a context of size exactly k).
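The prefix code used in the last compression stage can be sketched as below. It is a minimal illustration with made-up function names; run symbols are represented as the strings `"0"` and `"1"`, and MTF values as integers (an assumption of this sketch, not notation from the slides). The decoder is included only to show that the code is prefix-free: integer codes always start with 0, run symbols always start with 1.

```python
def prefix_code(symbol):
    """Encode one L^rle symbol: run symbols as 2 bits, integer i as zeroes + bin(i+1)."""
    if symbol == "1":
        return "11"
    if symbol == "0":
        return "10"
    bits = bin(symbol + 1)[2:]                 # 1 + floor(log2(i+1)) bits
    return "0" * (len(bits) - 1) + bits        # floor(log2(i+1)) leading zeroes


def prefix_decode(bits):
    """Split a bit string back into L^rle symbols."""
    out, pos = [], 0
    while pos < len(bits):
        if bits[pos] == "1":                   # run symbol: '11' -> "1", '10' -> "0"
            out.append(bits[pos + 1])
            pos += 2
        else:                                  # integer: count zeroes, then read bin(i+1)
            zeros = 0
            while bits[pos + zeros] == "0":
                zeros += 1
            value = int(bits[pos + zeros: pos + 2 * zeros + 1], 2)
            out.append(value - 1)
            pos += 2 * zeros + 1
    return out


L_RLE = [2, 4, "1", 5, "0", 5, 5, 4, 4, 5, 2, 5, "0", 1, "0"]   # example from the slides
Z = "".join(prefix_code(s) for s in L_RLE)
print(" ".join(prefix_code(s) for s in L_RLE))
print(prefix_decode(Z) == L_RLE)
```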
The modified and the standard empirical entropy are related by:

$n H_k(T) \le n H_k^*(T) \le n H_k(T) + O(\log n)$

In 2001, Manzini showed that for every k, the compression method above is bounded by:

$5\, n H_k^*(T) + g_k$

Backward-search algorithm

• Uses only L (the output of the BWT).
• Relies on 2 structures:
  C[1..|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars).
  Occ(c, q): the number of occurrences of char c in the prefix L[1, q].

Example, for T = mississippi# and L = i p s s m # p i s s i i:

c     #  i  m  p  s
C[c]  0  1  5  6  8

• Occ(s, 5) = 2
• Occ(s, 12) = 4

The algorithm works in p iterations, from p down to 1. For P = msi: i = 3 processes c = 'i', i = 2 processes c = 's', and i = 1 processes c = 'm'.

• Remember that the rows of the BWT matrix are the sorted suffixes of T.
• All suffixes prefixed by the pattern P occupy a contiguous set of rows.
• This set of rows has starting position First and ending position Last.
• So (Last − First + 1) gives the total number of pattern occurrences.
• At the end of the i-th phase, First points to the first row prefixed by P[i, p], and Last points to the last row prefixed by P[i, p].
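The two structures can be computed directly from L. The sketch below is deliberately naive (Occ scans the prefix, so it is O(q) rather than O(1)), and the names `build_c` and `occ` are made up for this illustration.

```python
from collections import Counter


def build_c(last):
    """C[c] = number of chars in the text that are alphabetically smaller than c."""
    counts = Counter(last)          # L is a permutation of T, so counts match T's
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def occ(last, ch, q):
    """Occurrences of ch in L[1..q] (1-based, inclusive). Naive O(q) scan."""
    return last[:q].count(ch)


L = "ipssm#pissii"                  # BWT of mississippi#
print(build_c(L))
print(occ(L, "s", 5), occ(L, "s", 12))
```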
SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)
(Slide credit: Paolo Ferragina, Università di Pisa.)

Example: P = si, T = mississippi#.

• First step: the rows prefixed by the last char, "i", are rows fr = 2 to lr = 5 (from C: C[i] + 1 = 2 and C[m] = 5).
• Inductive step: given fr, lr for P[j+1, p], take c = P[j].
  Find the first c in L[fr, lr].
  Find the last c in L[fr, lr].
  Apply the L-to-F mapping to these chars. The Occ() oracle is enough.
• For P = si this gives fr = 9, lr = 10, so occ = lr − fr + 1 = 2.

In terms of C and Occ, each step with char c computes:

First = C[c] + Occ(c, First − 1) + 1
Last  = C[c] + Occ(c, Last)

Example: P = pssi, with C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8.

step   c   First                                Last                             Last − First + 1
i = 4  i   C[i] + 1 = 2                         C[i + 1] = C[m] = 5              4
i = 3  s   C[s] + Occ(s,1) + 1 = 8+0+1 = 9      C[s] + Occ(s,5) = 8+2 = 10       2
i = 2  s   C[s] + Occ(s,8) + 1 = 8+2+1 = 11     C[s] + Occ(s,10) = 8+4 = 12      2
i = 1  p   C[p] + Occ(p,10) + 1 = 6+2+1 = 9     C[p] + Occ(p,12) = 6+2 = 8       0

Last < First after the final step, so pssi does not occur in T.

Cost

• Backward search makes p iterations, and is dominated by the Occ() calculations.
• Implement Occ() to run in O(1) time, using $5\, n H_k(T) + O\left(\frac{n}{\log n} \log\log n\right)$ bits (the first term is the compressed text).
• So Count will run in O(p) time, using $|Z| + O\left(\frac{n}{\log n} \log\log n\right)$ bits.
• We saw a partitioning of binary strings into buckets and superblocks for answering Rank() queries. We'll use a similar solution, with the help of some new structures.
• General idea: sum character occurrences in 3 stages: superblock, bucket, intra-bucket.
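The counting loop traced above for P = pssi can be sketched as follows. This is a minimal illustration with made-up names; Occ is a naive scan here, so each step costs O(n) instead of the O(1) the index aims for. Starting from First = 1, Last = n makes the first iteration produce C[c] + 1 and C[c + 1], matching the first step of the trace.

```python
def backward_search(pattern, last_col):
    """Return (First, Last), 1-based row range of suffixes prefixed by pattern.

    The range is empty when First > Last.
    """
    # C[c]: number of text chars alphabetically smaller than c.
    c_table = {ch: sum(x < ch for x in last_col) for ch in set(last_col)}

    def occ(ch, q):                  # occurrences of ch in L[1..q], naive scan
        return last_col[:q].count(ch)

    first, last = 1, len(last_col)
    for ch in reversed(pattern):     # p iterations, from P[p] down to P[1]
        if ch not in c_table:
            return 1, 0              # char absent from the text: empty range
        first = c_table[ch] + occ(ch, first - 1) + 1
        last = c_table[ch] + occ(ch, last)
        if first > last:
            break
    return first, last


def count(pattern, last_col):
    first, last = backward_search(pattern, last_col)
    return max(0, last - first + 1)


L = "ipssm#pissii"                   # BWT of mississippi#
print(backward_search("si", L), count("si", L))
print(backward_search("pssi", L), count("pssi", L))
```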
Buckets

• Partition L into substrings of $l = \Theta(\log n)$ chars each, denoted $BL_i$, for i = 1, …, n/l.
• This partition induces a partition on L^MTF, denoted $BL_i^{MTF}$.
• Applying run-length encoding and the prefix code to each bucket separately results in n/log n variable-length buckets, denoted $BZ_i$.

Example: T = pipeMississippi#, |T| = |L| = 16, l = 4.

         BL_1          BL_2                  BL_3                    BL_4
L      = i p p p     | s s m e             | i p # i               | s s i i
L^MTF  = 2 4 0 0     | 5 0 5 5             | 4 4 5 2               | 5 0 1 0
L^rle  = 2 4 [1]     | 5 [0] 5 5           | 4 4 5 2               | 5 [0] 1 [0]
Z      = 011 00101 11 | 00110 10 00110 00110 | 00101 00101 00110 011 | 00110 10 010 10
         BZ_1          BZ_2                  BZ_3                    BZ_4

Superblocks

• We also partition L into superblocks of $l^2 = \Theta(\log^2 n)$ chars each.
• Example: |T| = |L| = 256. Buckets $BL_i$ hold 8 chars each, i = 1, 2, …, 32. Superblocks $SuperB_j$ hold 64 chars each, j = 1, 2, 3, 4.
• Create a table $NO_j$ for each superblock, holding for each character c ∈ Σ the number of occurrences of c in $L[1, \dots, j \cdot l^2]$, that is, in everything before the start of superblock j+1.
• Space: $O\left(\frac{n}{l^2} \log n\right) = O\left(\frac{n}{\log n}\right)$ bits.

Back to buckets

• Create a similar table for buckets, but count only from the start of the current superblock. Denote these tables $NO'_i$.
• Space: $O\left(\frac{n}{l} \log l^2\right) = O\left(\frac{n}{\log n} \log\log n\right)$ bits.

Example (l = 8): SuperB_3 consists of buckets BL_17, …, BL_24, which cover positions 129 to 192 of L.

table    a    b    …   z
NO_2     32   25   …   7
NO'_17   2    1    …   0
NO'_18   2    1    …   1
…
NO'_23   21   13   …   1
NO'_24   23   13   …   1
NO_3     55   38   …   8

• The only thing left is searching inside a bucket. For example, Occ(c, 164) will require counting inside BL_21.
• But we only have the compressed string Z.
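The three-stage sum (superblock, bucket, intra-bucket) can be sketched over an uncompressed L. This is a simplification under stated assumptions: the class `TwoLevelOcc` and the helper `locate_indices` are made-up names, the tables are Python `Counter` objects instead of packed bit arrays, and the intra-bucket stage scans the plain characters of the bucket instead of working on the compressed bucket.

```python
from collections import Counter


def locate_indices(q, l):
    """For position q (1-based): bucket i, offset h inside it, complete superblocks t."""
    i = -(-q // l)                 # ceil(q / l)
    h = q - (i - 1) * l
    t = (i - 1) // l               # superblocks that end before bucket i starts
    return i, h, t


class TwoLevelOcc:
    def __init__(self, last, l):
        self.last, self.l = last, l
        self.no = [Counter()]          # no[j]: counts in L[1 .. j*l*l]
        self.no_prime = [Counter()]    # no_prime[i]: counts from the start of bucket
                                       # i's superblock through bucket i (index 0 unused)
        since_start, since_super = Counter(), Counter()
        n_buckets = -(-len(last) // l)
        for i in range(1, n_buckets + 1):
            if (i - 1) % l == 0:       # bucket i opens a new superblock
                since_super = Counter()
            chunk = last[(i - 1) * l: i * l]
            since_start.update(chunk)
            since_super.update(chunk)
            self.no_prime.append(Counter(since_super))
            if i % l == 0:             # bucket i closes superblock i / l
                self.no.append(Counter(since_start))

    def occ(self, c, q):
        """Occurrences of c in L[1..q]."""
        if q == 0:
            return 0
        i, h, t = locate_indices(q, self.l)
        total = self.no[t][c]                      # stage 1: superblock table
        if (i - 1) % self.l != 0:                  # stage 2: bucket table, unless
            total += self.no_prime[i - 1][c]       # bucket i is first in its superblock
        start = (i - 1) * self.l
        total += self.last[start: start + h].count(c)   # stage 3: inside the bucket
        return total


print(locate_indices(164, 8))      # bucket, offset and superblock for Occ(c, 164)
```

With l = 8, position 164 falls in bucket 21, as in the example above; only stage 3 touches the bucket contents, which is the part that still has to be made to work on Z.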
Need more structures.

Finding $BZ_i$'s starting position in Z, part 1

• Array $W[1, \dots, n/l^2]$ keeps, for every superblock $SuperB_j$, the sum of the sizes (in bits) of the compressed buckets $BZ_1, \dots, BZ_{j \cdot l}$.
  Space: $O\left(\frac{n}{l^2} \log n\right) = O\left(\frac{n}{\log n}\right)$ bits.
• Array $W'[1, \dots, n/l]$ keeps, for every bucket $BZ_i$, the sum of the bucket sizes up to and including it, counted from the beginning of its superblock.
  Space: $O\left(\frac{n}{l} \log l^2\right) = O\left(\frac{n}{\log n} \log\log n\right)$ bits.

Illustration: with two buckets per superblock and compressed bucket sizes of 10, 13, 12, 13, 16 and 11 bits, W = 23, 48, 75.

Finding $BZ_i$'s starting position in Z, part 2

Given Occ('c', q), with l = log n:

1. Find the bucket $BL_i$ that contains position q: $i = \lceil q / l \rceil$. Example: 166/8 = 20.75, so i = 21.
2. Find how many chars of $BL_i$ to count: $h = q - (i-1) \cdot l$. Example: h = 166 − 20·8 = 6.
3. Find the number t of complete superblocks that precede $BL_i$. Example: 21/8 = 2.625, so t = 2.
4. Locate the position of $BZ_i$ in Z: $W[t] + W'[i-1] + 1$. Example: for Occ(k, 166), the compressed bucket starts at W[2] + W'[20] + 1.

• We have the compressed bucket $BZ_i$, and we have h (how many chars to check in $BZ_i$).
• But $BZ_i$ contains compressed Move-To-Front information.
• Need more structures!

MTF states

For every i, before encoding bucket $BL_i$ with Move-To-Front, keep the state of the MTF table, denoted $MTF_i$.

Example: T = pipeMississippi#, |T| = |L| = 16, l = 4.

L      = i p p p | s s m e | i p # i | s s i i
L^MTF  = 2 4 0 0 | 5 0 5 5 | 4 4 5 2 | 5 0 1 0

MTF_1 = # e i m p s
MTF_2 = p i # e m s
MTF_3 = e m s p i #
MTF_4 = i # p e m s

Space: $O\left(\frac{n}{l} |\Sigma| \log |\Sigma|\right) = O\left(\frac{n}{\log n}\right)$ bits.

How do we use $MTF_i$ to count inside $BZ_i$? Final structure!
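One direct, if slow, way to count inside a bucket is to decode it. The sketch below undoes the prefix code, the run-length code and Move-To-Front for a single bucket, starting from the saved table state; it illustrates that a compressed bucket plus its MTF state is enough to rebuild the bucket's characters. The function names are made up, and decoding on every query is an assumption of this sketch, not the O(1) method the slides are building toward.

```python
def decode_bucket(bz, mtf_state):
    """Rebuild the chars of one bucket from its bits and its saved MTF table."""
    table = list(mtf_state)
    out = []
    pos = 0
    j = 0                     # bit position inside the current zero-run code

    def emit(idx):            # undo one Move-To-Front step
        ch = table.pop(idx)
        table.insert(0, ch)
        out.append(ch)

    while pos < len(bz):
        if bz[pos] == "1":                    # run bit b, stored as '1b'
            b = int(bz[pos + 1])
            for _ in range((b + 1) << j):     # bit b_j stands for (b_j + 1) * 2^j zeroes
                emit(0)
            j += 1
            pos += 2
        else:                                 # MTF value i: zeroes, then bin(i+1)
            zeros = 0
            while bz[pos + zeros] == "0":
                zeros += 1
            value = int(bz[pos + zeros: pos + 2 * zeros + 1], 2)
            emit(value - 1)
            j = 0                             # any run code has ended
            pos += 2 * zeros + 1
    return "".join(out)


def occ_in_bucket(c, h, bz, mtf_state):
    """Occurrences of c among the first h chars of the bucket."""
    return decode_bucket(bz, mtf_state)[:h].count(c)


BZ_2 = "00110 10 00110 00110".replace(" ", "")   # second bucket of the example
print(decode_bucket(BZ_2, "pi#ems"))
print(occ_in_bucket("s", 2, BZ_2, "pi#ems"))
```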
The table S

• Build a table $S[c, h, BZ_i, MTF_i]$, which stores the number of occurrences of 'c' among the first h characters of $BL_i$.
• This is possible because $BZ_i$ and $MTF_i$ together completely determine $BL_i$.
• The table is indexed by every possible pair (compressed bucket, MTF state); each such pair owns an inner table with one row per char c and one column per h = 1, 2, …, log n.

Size of S

• Max size of a compressed bucket $BZ_i$: $l' = l \cdot (1 + 2\lfloor \log |\Sigma| \rfloor)$ bits.
• Number of possible compressed buckets: $2^{l'}$.
• Number of possible MTF table states: $2^{|\Sigma| \log |\Sigma|}$.
• Size of each inner table: $|\Sigma| \cdot l$ entries.
• Size of each entry: $\log l$ bits.
• Total size of S: $O(2^{l'} \cdot l \log l) = O(2^{l'} \log n \log\log n)$ bits.

Keeping S sub-linear

• Having a linear-sized index is bad for practical uses when n is large.
• We want to keep the index size sub-linear, specifically $O(n^{\gamma})$ with $\gamma < 1$.
• Choose the bucket size l such that $l' = l \cdot (1 + 2\lfloor \log |\Sigma| \rfloor) = \gamma \log n$, so that $2^{l'} = n^{\gamma}$.
• We get $|S| = O(n^{\gamma} \log n \log\log n)$ bits.

Summing everything

• NO and W both take $O\left(\frac{n}{\log n}\right)$ bits.
• NO' and W' both take $O\left(\frac{n}{\log n} \log\log n\right)$ bits.
• MTF takes $O\left(\frac{n}{\log n}\right)$ bits.
• S takes $O(n^{\gamma} \log n \log\log n)$ bits.
• Total size: $|Z| + O\left(\frac{n}{\log n} \log\log n\right)$ bits.
• Total time for one Occ() query: O(1).

Locate

We now want to retrieve the positions in T of the (Last − First + 1) pattern occurrences. Meaning: for every i = First, First+1, …, Last, find the position in T of the suffix that prefixes the i-th row.
Denote this position by pos(i).

Example: T = mississippi# (positions 1 to 12), P = si, First = 9, Last = 10. Row 9 starts with the suffix sippi#, so pos(9) = 7.

backward_step

• We can't find pos(i) directly, but given row i (9) we can find the row j (11) such that pos(j) = pos(i) − 1.
• This algorithm is called backward_step(i).
• Running time: O(1). It uses the previous structures.

How it works

• L[i] precedes F[i] in T, and all chars appear in the same order in L and F.
• So ideally, we would just compute Occ(L[i], i) + C[L[i]].
• Problem: L is stored only in compressed form, so L[i] is not directly available.
• Solution: compare Occ(c, i) to Occ(c, i−1) for every c ∈ Σ. Obviously, they will differ only at c = L[i].
• Now we can compute Occ(L[i], i) + C[L[i]].
• Calling Occ() is O(1), and |Σ| = O(1). Therefore backward_step takes O(1) time.

Example: for i = 9, Occ(c, 8) and Occ(c, 9) differ only at c = s, so L[9] = s and backward_step(9) = C[s] + Occ(s, 9) = 8 + 3 = 11.

The final algorithm

• First, mark every $\Theta(\log^{1+\epsilon} n)$-th character of T and its corresponding row (suffix) in L.
• For each marked row $r_j$, store its position $Pos(r_j)$ in a data structure S. For example, querying S for $Pos(r_3)$ will return 8.

Given a row index i, find Pos(i) as follows:

• If $r_i$ is a marked row, return Pos(i) from S. Done!
• Otherwise, use backward_step(i) to find i' such that Pos(i') = Pos(i) − 1.
• Repeat t times until we find a marked row.
• Then retrieve Pos(i') from S and compute Pos(i) = Pos(i') + t.
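The trick of recovering L[i] from two Occ() calls per character can be sketched as below. This is a minimal illustration: `occ` is a naive scan over an uncompressed L and stands in for the O(1) oracle, and the table C is written out by hand for the example text.

```python
L = "ipssm#pissii"                                   # BWT of mississippi#
C = {"#": 0, "i": 1, "m": 5, "p": 6, "s": 8}


def occ(ch, q):
    """Occurrences of ch in L[1..q]; naive stand-in for the O(1) oracle."""
    return L[:q].count(ch)


def backward_step(i):
    """Row j with pos(j) = pos(i) - 1, found without reading L[i] directly."""
    for ch in C:                                     # |alphabet| pairs of Occ() calls
        count = occ(ch, i)
        if count != occ(ch, i - 1):                  # differs only for ch == L[i]
            return C[ch] + count
    raise ValueError("row index out of range")


print(backward_step(9), backward_step(11), backward_step(4))
```

Applying the step repeatedly visits every row once before returning to the start, since it maps rows to rows one text position at a time.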
Example: finding "si" (First = 9, Last = 10)

For i = Last = 10:
• $r_{10}$ is marked. Get Pos(10) from S: Pos(10) = 4.

For i = First = 9:
• $r_9$ isn't marked. backward_step(9) = 11 (t = 1).
• $r_{11}$ isn't marked either. backward_step(11) = 4 (t = 2).
• $r_4$ isn't marked either. backward_step(4) = 10 (t = 3).
• $r_{10}$ is marked. Get Pos(10) from S: Pos(10) = 4.
• Pos(9) = Pos(10) + t = 4 + 3 = 7.

Running time

• A marked row will be found in at most $\Theta(\log^{1+\epsilon} n)$ iterations.
• Each iteration uses backward_step, which is O(1).
• So finding a single position takes $O(\log^{1+\epsilon} n)$ time.
• Finding all occ occurrences of P in T takes $O(occ \cdot \log^{1+\epsilon} n)$ time, but only if querying S for membership is O(1)!

Implementing S

• Partition L's rows into buckets of $\Theta(\log^2 n)$ rows each.
• For each bucket, store all its marked rows in a Packed B-Tree (one tree per bucket), using their distance from the beginning of the bucket as the key, and also storing the mapping to Pos.
• A tree will contain at most $\Theta(\log^2 n)$ keys, of $\log \log^2 n = O(\log\log n)$ bits each.
• This gives O(1) access time.

Space of S

• The number of marked rows is $\Theta\left(\frac{n}{\log^{1+\epsilon} n}\right)$.
• Each key encoded in a tree takes $O(\log\log n)$ bits, and we need an additional $O(\log n)$ bits to keep the Pos(i) value.
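The locate procedure traced in the example can be sketched end to end. Everything here is a simplification under stated assumptions: the marked rows live in a plain dictionary instead of per-bucket Packed B-Trees, the sampling step `SAMPLE = 4` is a hypothetical value chosen so that rows 3 and 10 come out marked as in the example, `backward_step` reads L directly, and Occ is a naive scan.

```python
T = "mississippi#"                       # '#' is the unique end marker
N = len(T)

# Construction time: the text is still available, so rows can be sorted directly.
ROWS = sorted(range(N), key=lambda s: T[s:] + T[:s])   # ROWS[r-1]: 0-based start of row r
L = "".join(T[s - 1] for s in ROWS)                    # char that precedes each row
C = {ch: sum(x < ch for x in T) for ch in set(T)}

SAMPLE = 4                                             # hypothetical sampling step
MARKED = {r + 1: s + 1 for r, s in enumerate(ROWS) if (s + 1) % SAMPLE == 0}


def occ(ch, q):
    return L[:q].count(ch)               # naive stand-in for the O(1) oracle


def backward_step(i):
    ch = L[i - 1]                        # shortcut: L is uncompressed in this sketch
    return C[ch] + occ(ch, i)


def pos(i):
    """Text position (1-based) of the suffix that prefixes row i."""
    t = 0
    while i not in MARKED:               # walk backwards until a marked row is hit
        i = backward_step(i)
        t += 1
    # The modulo covers the walk wrapping from text position 1 back to N.
    return (MARKED[i] + t - 1) % N + 1


def locate(pattern):
    first, last = 1, N
    for ch in reversed(pattern):         # backward search, as in Count
        if ch not in C:
            return []
        first = C[ch] + occ(ch, first - 1) + 1
        last = C[ch] + occ(ch, last)
        if first > last:
            return []
    return sorted(pos(i) for i in range(first, last + 1))


print(MARKED)
print(pos(9), locate("si"))
```

A larger sampling step shrinks the dictionary and lengthens the walks, which is the time-space trade-off governed by ε.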
  1 2 3 4 5 6 7 8 9 10 11 12  n  log log n  log n  So S takes  1  log n  The structure we used to count P, uses  n  bits, so choose ε between 0 and Z    log log n   log n   n      log log n 1 (because going lower than  log n   doesn’t reduce the asymptotic space usage.) m i s s i s s i p p # P = si 1 2 3 4 5 6 7 8 9 10 11 12 i=9 i
