Backward Search FM

BACKWARD SEARCH
FM-INDEX
(FULL-TEXT INDEX IN MINUTE SPACE)
Paper by
Ferragina & Manzini
Presentation by
Yuval Rikover

Combine Text compression with indexing
(discard original text).

Count and locate P by looking at only a small
portion of the compressed text.

Do it efficiently:
Time: O(p)
 Space: O(n Hk(T)) + o(n)


Exploit the relationship between the Burrows-Wheeler Transform and the Suffix Array data structure.

Compressed suffix array that encapsulates both the
compressed text and the full-text indexing
information.

Supports two basic operations:
Count – return number of occurrences of P in T.
 Locate – find all positions of P in T.


1. Process T[1,…,n] using the Burrows-Wheeler Transform.
   Receive string L[1,…,n] (a permutation of T).
2. Run Move-To-Front encoding on L.
   Receive L^MTF[1,…,n].
3. Encode runs of zeroes in L^MTF using run-length encoding.
   Receive L^rle.
4. Compress L^rle using a variable-length prefix code.
   Receive Z (over alphabet {0,1}).
• Every column is a permutation of T.
• Given row i, char L[i] precedes F[i] in the original T.
• Consecutive chars in L are adjacent to similar strings in T.
• Therefore, L usually contains long runs of identical chars.

All rotations of T = mississippi#:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

row  F  (middle)    L
 1   #  mississipp  i
 2   i  #mississip  p
 3   i  ppi#missis  s
 4   i  ssippi#mis  s
 5   i  ssissippi#  m
 6   m  ississippi  #
 7   p  i#mississi  p
 8   p  pi#mississ  i
 9   s  ippi#missi  s
10   s  issippi#mi  s
11   s  sippi#miss  i
12   s  sissippi#m  i
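The matrix construction above can be sketched in a few lines of Python. This is a naive illustration that sorts all rotations explicitly (quadratic space), not how a real index is built. The function names bwt and first_column are hypothetical, and the sketch assumes the text already ends with a unique end marker '#' that sorts before every other character.

```python
def bwt(text: str) -> str:
    """Return column L (last characters) of the sorted rotation matrix.

    Assumption: text ends with a unique marker '#' that is smaller than
    every other character, as in the slides.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)


def first_column(last: str) -> str:
    """Column F is just the characters of L in sorted order."""
    return "".join(sorted(last))


L = bwt("mississippi#")
F = first_column(L)
print("F =", F)
print("L =", L)
```

Since every column is a permutation of T, F can always be recovered from L by sorting, which is why only L has to be stored.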
Reminder: Recovering T from L
1. Find F by sorting L.
2. First char of T? m (the row whose L char is # is T itself, and its F char is m).
3. Find m in L.
4. L[i] precedes F[i] in T. Therefore we get mi.
5. How do we choose the correct i in L?
   • The i's are in the same order in L and F.
   • As are the rest of the chars.
6. i is followed by s: mis
7. And so on…

F = # i i i i m p p s s s s
L = i p s s m # p i s s i i
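The recovery steps can be sketched as follows. This is a minimal illustration under the slides' assumptions (unique end marker '#', smaller than all other characters). The name inverse_bwt is hypothetical. The "same order in L and F" property is realised with a stable sort of the row indices of L.

```python
def inverse_bwt(last: str, end_marker: str = "#") -> str:
    """Rebuild T from L, walking forward through the text."""
    n = len(last)
    first = sorted(last)                      # column F
    # Stable sort: the k-th occurrence of c in F is the k-th occurrence of c in L.
    # order[f] = row of L holding the same text character as F[f].
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(end_marker)              # this row is T itself
    out = []
    for _ in range(n):
        out.append(first[row])                # F[row] follows L[row] in T
        row = order[row]                      # find that same character in L
    return "".join(out)


print(inverse_bwt("ipssm#pissii"))
```

For L = ipssm#pissii the walk starts at the row ending in '#', emits m, then i, then s, exactly as in steps 2 to 7.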

Move-To-Front encoding of L:
• Replace each char in L with the number of distinct chars seen since its last occurrence.
• Keep an MTF[1,…,|Σ|] array, initially sorted lexicographically.
• Runs of identical chars are transformed into runs of zeroes in L^MTF.

Example (MTF array indices 0 1 2 3 4):
L      =  i p s s m # p i s s i i
L^MTF  =  1 3 4 0 4 4 3 4 4 0 1 0

State of the MTF array:
start:    # i m p s
after i:  i # m p s
after p:  p i # m s
after s:  s p i # m
And so on…

• Bad example.
• For larger, English texts, we will receive more runs of zeroes, and a dominance of smaller numbers.
• The reason being that BWT creates clusters of similar chars.
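A minimal Python sketch of the Move-To-Front step, using a plain list as the MTF array. The names mtf_encode and mtf_decode are hypothetical, and the list-based table is chosen for readability rather than speed.

```python
def mtf_encode(s, alphabet):
    """Replace each char by its current index in the MTF table, then move it to the front."""
    table = list(alphabet)            # initially sorted lexicographically
    out = []
    for ch in s:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out


def mtf_decode(codes, alphabet):
    """Inverse transform: replay the same table updates."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


codes = mtf_encode("ipssm#pissii", "#imps")
print(codes)
print(mtf_decode(codes, "#imps"))
```

Every repeated character yields a 0, which is what turns the clusters produced by the BWT into runs of zeroes.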

Run-length encoding of L^MTF:
• Replace any sequence of m zeroes with (m+1) in binary, LSB first, discarding the MSB.
• Add 2 new symbols, 𝟎 and 𝟏 (bold, distinct from the MTF values 0 and 1).
• L^rle is defined over {𝟎, 𝟏, 1, 2, …, |Σ|}.

To give a meatier example (not really), we'll change our text to:
T = pipeMississippi#

L      =  i p p p s s m e i p # i s s i i
L^MTF  =  2 4 0 0 5 0 5 5 4 4 5 2 5 0 1 0
L^rle  =  2 4 𝟏 5 𝟎 5 5 4 4 5 2 5 𝟎 1 𝟎






How to retrieve m:
• Given a binary number b_0 b_1 … b_k (LSB first),
• replace each bit b_j with a sequence of (b_j + 1)·2^j zeroes.
• Example: 10 → (1+1)·2^0 + (0+1)·2^1 = 4 zeroes.
Example

     run        m+1        binary   LSB first   MSB discarded
1.   0          1+1 = 2    10       01          0
2.   00         2+1 = 3    11       11          1
3.   000        3+1 = 4    100      001         00
4.   0000       4+1 = 5    101      101         10
5.   00000      5+1 = 6    110      011         01
6.   000000     6+1 = 7    111      111         11
7.   0000000    7+1 = 8    1000     0001        000
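The run-length rule and its inverse can be sketched as below. The function names are hypothetical. As an assumption of this sketch only, the two new run symbols are represented as the Python strings "0" and "1", while ordinary MTF values stay integers, so the two kinds of symbol cannot be confused.

```python
def encode_run(m: int) -> str:
    """Encode a run of m >= 1 zeroes: (m+1) in binary, LSB first, MSB discarded."""
    bits = bin(m + 1)[2:]        # binary, MSB first
    return bits[1:][::-1]        # drop the MSB, then reverse to LSB first


def decode_run(code: str) -> int:
    """Bit b_j stands for (b_j + 1) * 2**j zeroes."""
    return sum((int(b) + 1) << j for j, b in enumerate(code))


def rle_zero_runs(mtf_codes):
    """Replace every maximal run of zeroes by its run symbols (strings "0"/"1")."""
    out, i = [], 0
    while i < len(mtf_codes):
        if mtf_codes[i] != 0:
            out.append(mtf_codes[i])
            i += 1
            continue
        j = i
        while j < len(mtf_codes) and mtf_codes[j] == 0:
            j += 1
        out.extend(encode_run(j - i))    # one list item per run symbol
        i = j
    return out


print([encode_run(m) for m in range(1, 8)])
print(decode_run("10"))
print(rle_zero_runs([2, 4, 0, 0, 5, 0, 5, 5, 4, 4, 5, 2, 5, 0, 1, 0]))
```

Because every bit contributes at least one zero, each non-empty code string maps to exactly one run length, so the encoding is reversible.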
Compress L^rle as follows, over alphabet {0,1}:
• 𝟏 → 11
• 𝟎 → 10
• For i = 1,2,…,|Σ|-1: ⌊log(i+1)⌋ zeroes, followed by the binary representation of i+1, which takes 1+⌊log(i+1)⌋ bits, for a total of 1+2⌊log(i+1)⌋ bits.

Example

     i      ⌊log(i+1)⌋ zeroes   bin(i+1)   codeword
1.   i=1    1                   10         010
2.   i=2    1                   11         011
3.   i=3    2                   100        00100
4.   i=4    2                   101        00101
5.   i=5    2                   110        00110
6.   i=6    2                   111        00111
7.   i=7    3                   1000       0001000

L      =  i p p p s s m e i p # i s s i i
L^MTF  =  2 4 0 0 5 0 5 5 4 4 5 2 5 0 1 0
L^rle  =  2 4 𝟏 5 𝟎 5 5 4 4 5 2 5 𝟎 1 𝟎
Z      =  011 00101 11 00110 10 00110 00110 00101 00101 00110 011 00110 10 010 10
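A small sketch of this prefix code and a matching decoder. The names encode_symbol, encode_stream and decode_stream are hypothetical. As an assumption of the sketch, the run symbols are represented as the strings "0" and "1" and MTF values as integers.

```python
def encode_symbol(sym) -> str:
    """Codeword for one L^rle symbol (run symbols are the strings "0"/"1")."""
    if sym == "1":
        return "11"
    if sym == "0":
        return "10"
    b = bin(sym + 1)[2:]               # binary representation of i+1
    return "0" * (len(b) - 1) + b      # floor(log2(i+1)) zeroes, then bin(i+1)


def encode_stream(symbols) -> str:
    return "".join(encode_symbol(s) for s in symbols)


def decode_stream(bits: str):
    """Count leading zeroes k; k == 0 means a 2-bit run symbol, else k+1 more bits."""
    out, pos = [], 0
    while pos < len(bits):
        k = 0
        while bits[pos + k] == "0":
            k += 1
        if k == 0:
            out.append(bits[pos + 1])  # "10" -> "0", "11" -> "1"
            pos += 2
        else:
            out.append(int(bits[pos + k: pos + 2 * k + 1], 2) - 1)
            pos += 2 * k + 1
    return out


lrle = [2, 4, "1", 5, "0", 5, 5, 4, 4, 5, 2, 5, "0", 1, "0"]
Z = encode_stream(lrle)
print(Z)
print(decode_stream(Z) == lrle)
```

The number of leading zeroes tells the decoder how many bits follow, so no codeword is a prefix of another.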

In 1999, Manzini showed the following upper bound for BWT
compression ratio:


8n H k T   0.08  1  n  log n  g k ,  
The article presents a bound using H*_k(T):
• Modified empirical entropy.
• The maximum compression ratio we can achieve, using for each symbol a codeword which depends on a context of size at most k (instead of always using a context of size k).

n·H_k(T) ≤ n·H*_k(T) ≤ n·H_k(T) + O(log n)

In 2001, Manzini showed that for every k, the above compression
method is bounded by:
5n H k T   g k ,  
*



Backward-search algorithm
Uses only L (output of BWT)
Relies on 2 structures:

C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars).

Occ(c,q): number of occurrences of char c in prefix L[1,q].

Example
• C[ ] for T = mississippi#:
    c      i  m  p  s
    C[c]   1  5  6  8
• Occ(s, 5) = 2
• Occ(s,12) = 4
  (L = i p s s m # p i s s i i, rows 1…12)
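The two structures can be sketched directly from their definitions. This is a naive version for illustration only: build_C and occ are hypothetical names, and occ scans the prefix of L instead of answering in O(1) as the later slides require.

```python
from collections import Counter


def build_C(text):
    """C[c] = number of characters in text that are alphabetically smaller than c."""
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C


def occ(L, c, q):
    """Occurrences of c in the prefix L[1..q] (1-based, inclusive); naive linear scan."""
    return L[:q].count(c)


C = build_C("mississippi#")
L = "ipssm#pissii"
print(C)
print(occ(L, "s", 5), occ(L, "s", 12))
```

Since L is a permutation of T, C can be built from either string.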

Works in p iterations, from p down to 1.

Example, P = msi:
• i = 3, c = 'i'
• i = 2, c = 's'
• i = 1, c = 'm'

Remember that the BWT matrix rows = sorted suffixes of T.
• All suffixes prefixed by pattern P occupy a contiguous set of rows.
• This set of rows has starting position First and ending position Last.
• So (Last - First + 1) gives the total number of pattern occurrences.
• At the end of the i-th phase, First points to the first row prefixed by P[i,p], and Last points to the last row prefixed by P[i,p].
SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)
(slide by Paolo Ferragina, Università di Pisa)

P = si

First step: the rows prefixed by char "i" form the range [fr, lr], which is read directly from C.

Inductive step: given fr, lr for P[j+1,p]
• Take c = P[j].
• Find the first c in L[fr, lr].
• Find the last c in L[fr, lr].
• The L-to-F mapping of these chars gives the new fr, lr.
• The Occ() oracle is enough.

At the end, occ = lr - fr + 1 = 2.

[Figure: sorted BWT matrix of mississippi# with last column L = i p s s m # p i s s i i; the row range fr…lr is highlighted at each step.]

P = pssi
C[ ] = 1 5 6 8 for i m p s; L = i p s s m # p i s s i i (rows 1…12)

i = 4, c = 'i':
  First = C['i'] + 1 = 2
  Last  = C['i' + 1] = C['m'] = 5
  (Last - First + 1) = 4

i = 3, c = 's':
  First = C['s'] + Occ('s',1) + 1 = 8+0+1 = 9
  Last  = C['s'] + Occ('s',5) = 8+2 = 10
  (Last - First + 1) = 2

i = 2, c = 's':
  First = C['s'] + Occ('s',8) + 1 = 8+2+1 = 11
  Last  = C['s'] + Occ('s',10) = 8+4 = 12
  (Last - First + 1) = 2

i = 1, c = 'p':
  First = C['p'] + Occ('p',10) + 1 = 6+2+1 = 9
  Last  = C['p'] + Occ('p',12) = 6+2 = 8
  (Last - First + 1) = 0

First > Last, so pssi does not occur in T.
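The iterations above can be sketched as a short Python function. Assumptions of this sketch: Occ is a naive prefix scan rather than the O(1) structure built later, rows are 1-based as in the slides, and the names build_C, occ, backward_search and count are hypothetical.

```python
from collections import Counter


def build_C(L):
    """C[c] = number of characters smaller than c (L is a permutation of T)."""
    counts, C, total = Counter(L), {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C


def occ(L, c, q):
    """Occurrences of c in L[1..q]; naive scan standing in for the O(1) oracle."""
    return L[:q].count(c)


def backward_search(L, P):
    """Return (First, Last), the 1-based row range prefixed by P; empty if First > Last."""
    C = build_C(L)
    first, last = 1, len(L)
    for c in reversed(P):                 # i = p down to 1
        if c not in C:
            return 1, 0                   # character absent from T
        first = C[c] + occ(L, c, first - 1) + 1
        last = C[c] + occ(L, c, last)
        if first > last:
            break
    return first, last


def count(L, P):
    first, last = backward_search(L, P)
    return max(0, last - first + 1)


L = "ipssm#pissii"
print(backward_search(L, "si"), count(L, "si"))
print(backward_search(L, "pssi"), count(L, "pssi"))
```

Starting from the full range (1, n) makes the first iteration reduce to First = C[c] + 1 and Last = C[c] + (number of c's), which matches the special first step on the slides.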

Backward-search makes p iterations, and is dominated by Occ( ) calculations.

Implement Occ( ) to run in O(1) time, using
  5n·H_k(T) + O(n · log log n / log n) bits.

So Count will run in O(p) time, using
  |Z| + O(n · log log n / log n) bits.

We saw a partitioning of binary strings into Buckets and Superblocks for answering Rank( ) queries.
We'll use a similar solution, with the help of some new structures.

General idea: sum character occurrences in 3 stages:
• Superblock
• Bucket
• Intrabucket

Buckets
• Partition L into substrings of l = Θ(log n) chars each, denoted BL_i, i = 1,…,n/l.
• This partition induces a partition on L^MTF, denoted BL_i^MTF.

Example: T = pipeMississippi#, |T| = |L| = 16, l = 4
L      =  i p p p | s s m e | i p # i | s s i i
L^MTF  =  2 4 0 0 | 5 0 5 5 | 4 4 5 2 | 5 0 1 0

• Applying run-length encoding and prefix-free encoding on each bucket will result in n/log(n) variable-length buckets, denoted BZ_i.

L^rle  =  2 4 𝟏 | 5 𝟎 5 5 | 4 4 5 2 | 5 𝟎 1 𝟎
Z      =  011 00101 11 | 00110 10 00110 00110 | 00101 00101 00110 011 | 00110 10 010 10
          BZ_1           BZ_2                   BZ_3                    BZ_4

Superblocks
We also partition L into superblocks of size l² = Θ(log² n) chars each, denoted SuperB_j.

Example: |T| = |L| = 256
• Buckets BL_i of 8 chars each, i = 1,2,…,32
• Superblocks SuperB_j of 64 chars each, j = 1,2,3,4

Create a table for each superblock, holding for each character c ∈ Σ the number of c's occurrences in L up to a superblock boundary.
• Meaning, table NO_j stores the occurrences of c in range L[1,…,j·l²], i.e. up to the start of superblock j+1.

[Figure: buckets BL16…BL24 (8 chars each, c1,..,c8) with boundaries at positions 128, 136, …, 192 of L.
  NO_2 (position 128): a = 32, b = 25, …, z = 7
  NO_3 (position 192): a = 55, b = 38, …, z = 8]

Space: (n / l²)·|Σ|·log n = O(n / log n) bits.
Back to Buckets

Create a similar table for buckets, but count only from the current superblock's start. Denote tables as NO'_i.

[Figure: buckets BL16…BL24 (8 chars each, c1,..,c8) with boundaries at positions 128, 136, …, 192 of L.
  NO_2:   a = 32, b = 25, …, z = 7
  NO'_17: a = 2,  b = 1,  …, z = 0
  NO'_18: a = 2,  b = 1,  …, z = 1
  …
  NO'_23: a = 21, b = 13, …, z = 1
  NO'_24: a = 23, b = 13, …, z = 1
  NO_3:   a = 55, b = 38, …, z = 8]

Space: (n / l)·|Σ|·log l² = O((n / log n)·log log n) bits.

 
Only thing left is searching inside a bucket.
• For example, Occ(c,164) will require counting in BL21 (chars c1,..,c8).
• But we only have the compressed string Z.
• Need more structures.
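The three-stage sum (superblock, bucket, intrabucket) can be sketched over an uncompressed L. This is only an illustration of the counting scheme: the class name TwoLevelOcc and its fields are hypothetical, the tables are plain dictionaries, and the in-bucket stage is a direct scan of L, which is exactly the part the compressed index has to replace with further structures.

```python
from collections import Counter


class TwoLevelOcc:
    """Occ(c, q) = superblock count + bucket count + scan inside one bucket."""

    def __init__(self, L, l):
        self.L, self.l = L, l
        self.sb_counts = []   # sb_counts[t][c]: occurrences of c in L[1 .. t*l*l]
        self.bk_counts = []   # bk_counts[b][c]: occurrences from the superblock start
                              # up to the end of the b-th bucket (b complete buckets)
        running_sb, running_bk = Counter(), Counter()
        for b in range((len(L) + l - 1) // l):
            if b % l == 0:                       # a new superblock starts here
                self.sb_counts.append(dict(running_sb))
                running_bk = Counter()
            self.bk_counts.append(dict(running_bk))
            chunk = L[b * l:(b + 1) * l]
            running_sb.update(chunk)
            running_bk.update(chunk)

    def occ(self, c, q):
        if q <= 0:
            return 0
        b = (q - 1) // self.l                    # complete buckets before position q
        h = q - b * self.l                       # chars to inspect in the current bucket
        t = b // self.l                          # complete superblocks before it
        inside = self.L[b * self.l: b * self.l + h].count(c)
        return self.sb_counts[t].get(c, 0) + self.bk_counts[b].get(c, 0) + inside


idx = TwoLevelOcc("ipssm#pissii", 2)
print(idx.occ("s", 5), idx.occ("s", 12))
```

Only the last term touches the text itself, and it never looks at more than l characters.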

Finding BZ_i's starting position in Z, part 1

• Array W[1,…,n/l²] keeps, for every SuperB_j, the sum of the sizes of the compressed buckets BZ_1,…,BZ_(j·log n) (in bits).
• Array W'[1,…,n/l] keeps, for every bucket BZ_i, the sum of bucket sizes up to it (inclusive), from the superblock's beginning.

[Figure: Z = 011 00101 11 | 00110 10 00110 | 00101 00101 011 | 00110 10 010 10 | 00110 10 00110 | 00101 00101 011, buckets BZ_1…BZ_6.
  W' values shown: 10 13 12 13 16 11
  W values shown:  23 48 75]

Space:
  |W|  = (n / l²)·log(n·(1 + 2·log|Σ|)) = O(n / log n) bits
  |W'| = O((n / l)·log l²) = O((n / log n)·log log n) bits

Finding BZ_i's starting position in Z, part 2

Given Occ(c,q):
• Find i of BL_i: i = ⌈q / log n⌉
  Example: 166/8 = 20.75 → i = 21
• Find the character in BL_i to count up to: h = q - (i-1)·log n
  Example: h = 166 - 20·8 = 6
• Find the number t of complete superblocks before BL_i: t = ⌈i / log n⌉ - 1
  Example: 21/8 = 2.625 → t = 2
• Locate the position of BZ_i in Z: W[t] + W'[i-1] + 1
  Example: for Occ(k,166), the compressed bucket is at W[2] + W'[20] + 1

[Figure: Z split into compressed buckets, with W[1], W[2], W[3] marking superblock boundaries and W'[…] the offsets inside a superblock; buckets BL16…BL24 at positions 128…192 of L, with BL21 highlighted.]
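The index arithmetic can be sketched as below. The names locate_bucket, build_W and bz_start are hypothetical. Two assumptions of this sketch: a superblock holds l buckets, and when BL_i is the first bucket of its superblock the W' term is taken as 0 (the slides do not spell out this boundary case). The bucket sizes in the usage lines are the illustrative values printed on the slide.

```python
def locate_bucket(q, l):
    """Return (i, h, t): bucket index, chars to count inside it, complete superblocks before it."""
    i = (q + l - 1) // l            # ceil(q / l)
    h = q - (i - 1) * l
    t = (i + l - 1) // l - 1        # ceil(i / l) - 1
    return i, h, t


def build_W(sizes, l):
    """W[t]: bits in superblocks 1..t (W[0] = 0).
    Wp[i]: bits from the start of bucket i's superblock through bucket i (Wp[0] = 0)."""
    W, Wp, total, within = [0], [0], 0, 0
    for i, size in enumerate(sizes, start=1):
        within += size
        total += size
        Wp.append(within)
        if i % l == 0:              # bucket i closes a superblock
            W.append(total)
            within = 0
    return W, Wp


def bz_start(W, Wp, i, l):
    """1-based bit position in Z where compressed bucket BZ_i starts."""
    t = (i + l - 1) // l - 1
    same_superblock = (i - 1) % l != 0   # is bucket i-1 in the same superblock as bucket i?
    return W[t] + (Wp[i - 1] if same_superblock else 0) + 1


print(locate_bucket(166, 8))
W, Wp = build_W([10, 13, 12, 13, 16, 11], 2)
print(W)
print([bz_start(W, Wp, i, 2) for i in range(1, 7)])
```

With two buckets per superblock, the sizes 10 13 12 13 16 11 give cumulative superblock sizes 23 48 75, the same values as the W row on the slide.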

We have the compressed bucket BZ_i, e.g. | 011 00101 11 |,
and we have h (how many chars to check in BZ_i).
• But BZ_i contains compressed Move-To-Front information…
• Need more structures!

For every i, before encoding bucket BL_i with Move-To-Front, keep the state of the MTF table, denoted MTF_i.

Example: T = pipeMississippi#, |T| = |L| = 16, buckets of 4 chars
L      =  i p p p | s s m e | i p # i | s s i i
L^MTF  =  2 4 0 0 | 5 0 5 5 | 4 4 5 2 | 5 0 1 0
MTF_1  =  # e i m p s
MTF_2  =  p i # e m s
MTF_3  =  e m s p i #
MTF_4  =  i # p e m s

Space: (n / log n)·|Σ|·log|Σ| = O(n / log n) bits.



How do we use MTF_i to count inside BZ_i?
Final structure!

• Build a table S[c, h, BZ_i, MTF_i], which stores the number of occurrences of 'c' among the first h characters of BL_i.
• Possible because BZ_i and MTF_i together completely determine BL_i.

[Figure: table S with one row per possible compressed bucket (bit strings such as | 011 00101 11 |) and one column per possible MTF state (such as # e i m p s); each cell holds an inner table indexed by character a, b, …, z and by h = 1, 2, …, log n.]
Max size of a compressed bucket BZ_i: l' = l·(1 + 2·log|Σ|) bits
Number of possible compressed buckets: 2^l'
Number of possible MTF table states: 2^(|Σ|·log|Σ|)
Size of inner table: l entries per character
Size of each entry: log l bits

Total size of S[ ]: O(2^l' · l · log l) = O(n · log n · log log n) bits
• But having a linear-sized index is bad for practical uses when n is large.
• We want to keep the index size sub-linear, specifically O(n^γ), γ < 1.
• Choose bucket size l such that: l·(1 + 2·log|Σ|) = γ·log n
• We get 2^(γ·log n) = n^γ = o(n), so
  |S| = O(n^γ · log n · log log n), γ < 1


Summing everything:
• NO & W both take O(n / log n) bits.
• NO' & W' both take O((n / log n)·log log n) bits.
• MTF takes O(n / log n) bits.
• S takes O(n^γ · log n · log log n) bits, γ < 1.

Total Size: |Z| + O((n / log n)·log log n) bits
Total Time: O(1) per Occ( ) query

We now want to retrieve the positions in T of the (Last - First + 1) pattern occurrences.
• Meaning: for every i = First, First+1, …, Last, find the position in T of the suffix which prefixes the i-th row.
• Denote the above as pos(i).

Example, P = si:
position  1 2 3 4 5 6 7 8 9 10 11 12
T         m i s s i s s i p p  i  #
pos(9) = 7

We can't find pos(i) directly, but we can do:
• Given row i (e.g. 9), we can find row j (e.g. 11) such that pos(j) = pos(i) - 1.
• This algorithm is called backward_step(i).
• Running time O(1).
• Uses previous structures.

Why it works:
• L[i] precedes F[i] in T.
• All chars appear in the same order in L and F.
• So ideally, we would just compute Occ(L[i],i) + C[L[i]].

Example, P = si, i = 9: L[9] = s is the 3rd s in L, so it maps to the 3rd s in F, which is row C[s] + 3 = 8 + 3 = 11.

Solution (L[i] is not stored explicitly, we only have Z):
• Compare Occ(c,i) to Occ(c,i-1) for every c ∈ Σ.
• Obviously, they will only differ at c = L[i].
• Now we can compute Occ(L[i],i) + C[L[i]].
• Calling Occ( ) is O(1), and |Σ| = O(1).
• Therefore backward_step takes O(1) time.

Example, P = si, i = 9: Occ(c,8) and Occ(c,9) differ only for c = s, so L[9] = s.
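A minimal sketch of backward_step that never reads L[i] directly and instead recovers it through the Occ oracle, as described above. The names make_occ, recover_char and backward_step are hypothetical, and the oracle here is a naive scan standing in for the O(1) structure.

```python
def make_occ(L):
    """Naive Occ oracle over L; the real index answers this in O(1) from Z."""
    def occ(c, q):
        return L[:q].count(c)
    return occ


def recover_char(occ, alphabet, i):
    """L[i] is the only character whose count changes between rows i-1 and i."""
    for c in alphabet:
        if occ(c, i) - occ(c, i - 1) == 1:
            return c
    raise ValueError("row index out of range")


def backward_step(occ, C, alphabet, i):
    """Return row j with pos(j) = pos(i) - 1."""
    c = recover_char(occ, alphabet, i)
    return C[c] + occ(c, i)


occ = make_occ("ipssm#pissii")
C = {"#": 0, "i": 1, "m": 5, "p": 6, "s": 8}
print(backward_step(occ, C, "#imps", 9))
print(backward_step(occ, C, "#imps", 11))
print(backward_step(occ, C, "#imps", 4))
```

Each step costs at most 2·|Σ| oracle calls plus one more, which is constant for a fixed alphabet.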

Now we are ready for the final algorithm.

• First, mark every log^(1+ε) n-th character from T and its corresponding row (suffix) in L.
• For each marked row r_j, store its position Pos(r_j) in data structure S.
• For example, querying S for Pos(r_3) will return 8.

Given row index i, find Pos(i) as follows:
• If r_i is a marked row, return Pos(i) from S. DONE!
• Otherwise, use backward_step(i) to find i' such that Pos(i') = Pos(i) - 1.
• Repeat t times until we find a marked row.
• Then retrieve Pos(i') from S and compute Pos(i) = Pos(i') + t.
Example: finding "si" (First = 9, Last = 10)

For i = Last = 10:
• r_10 is marked, so get Pos(10) from S: Pos(10) = 4

For i = First = 9:
• r_9 isn't marked → backward_step(9) = 11 (t = 1)
• r_11 isn't marked either → backward_step(11) = 4 (t = 2)
• r_4 isn't marked either → backward_step(4) = 10 (t = 3)
• r_10 is marked, so get Pos(10) from S: Pos(10) = 4
• Pos(9) = Pos(10) + t = 4 + 3 = 7
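The whole locate procedure can be sketched end to end as follows. Assumptions of this sketch, none of them from the slides: the index is built from a naive suffix sort, Occ is a prefix scan, the sampling step is the small constant 4 instead of log^(1+ε) n, the marked rows live in a dictionary rather than packed B-trees, and position 1 is marked in addition so that a backward walk never wraps around the start of T. All function names are hypothetical.

```python
from collections import Counter


def build_index(T, step):
    """Return (L, C, marked) for a text T that ends with a unique smallest marker '#'."""
    n = len(T)
    sa = sorted(range(n), key=lambda k: T[k:])        # naive suffix sort (0-based starts)
    L = "".join(T[k - 1] for k in sa)                 # char preceding each suffix
    C, total = {}, 0
    for c, cnt in sorted(Counter(T).items()):
        C[c] = total
        total += cnt
    # marked[row] = 1-based text position, for every step-th position (and position 1)
    marked = {row: k + 1 for row, k in enumerate(sa, start=1)
              if (k + 1) % step == 0 or k == 0}
    return L, C, marked


def occ(L, c, q):
    return L[:q].count(c)


def backward_search(L, C, P):
    first, last = 1, len(L)
    for c in reversed(P):
        if c not in C:
            return 1, 0
        first = C[c] + occ(L, c, first - 1) + 1
        last = C[c] + occ(L, c, last)
        if first > last:
            break
    return first, last


def pos(L, C, marked, i):
    """Walk backwards with backward_step until a marked row is reached."""
    t = 0
    while i not in marked:
        c = L[i - 1]
        i = C[c] + occ(L, c, i)      # backward_step(i)
        t += 1
    return marked[i] + t


def locate(T, P, step=4):
    L, C, marked = build_index(T, step)
    first, last = backward_search(L, C, P)
    return sorted(pos(L, C, marked, i) for i in range(first, last + 1))


print(locate("mississippi#", "si"))
```

With step 4 the marked rows include r_3 (position 8) and r_10 (position 4), so row 9 needs three backward steps, as in the example.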

• A marked row will be found in at most log^(1+ε) n iterations.
• Each iteration uses backward_step, which is O(1).
• So finding a single position takes O(log^(1+ε) n).
• Finding all occ occurrences of P in T takes O(occ · log^(1+ε) n),
  but only if querying S for membership is O(1)!!


• Partition L's rows into buckets of Θ(log² n) rows each.
• For each bucket, store all marked rows in a Packed B-Tree (one tree per bucket), using their distance from the beginning of the bucket as the key (also storing the mapping).
• A tree will contain at most O(log² n) keys, of size log log² n = O(log log n) bits each.
  → O(1) access time



• The number of marked rows is n / log^(1+ε) n.
• Each key encoded in a tree takes O(log log n) bits, and we need an additional O(log n) bits to keep the Pos(i) value.
• So S takes O( (n / log^(1+ε) n)·(log log n + log n) ) = O(n / log^ε n) bits.
• The structure we used to count P uses |Z| + O((n / log n)·log log n) bits, so choose ε between 0 and 1 (because going lower than O((n / log n)·log log n) doesn't reduce the asymptotic space usage).