Document

advertisement
≤ i ≤ n, Si denotes the i-th
suffix S[i..n] of S . The suffix array SA of the string S is an array of integers
in the range 1 to n specifying the lexicographic ordering of the n suffixes of
the string S . that is, it satisfies SSA[1] < SSA[2] < · · · < SSA[n] .
Let S be a string of length n. For every i, 1
ISA is an array of size n such that for any q with
1 ≤ q ≤ n the equality ISA[SA[q]] = q holds.
The inverse suffix array
1
i SA ISA SSA[i]
1
9
5 $
2
7
9 at$
3
5
4 atat$
4
3
8 atatat$
5
1
3 ctatatat$
6
8
7 t$
7
6
2 tat$
8
4
6 tatat$
9
2
1 tatatat$
2
The Burrows-Wheeler transform transforms a string S in three steps:
1. Form a conceptual matrix M 0 whose rows are the cyclic shifts of the
string S .
2. Compute the matrix M by sorting the rows of M 0 lexicographically.
3. Output the last column L of M .
3
F
ctatatat$
$ ctatata t
tatatat$c
a t$ctata t
atatat$ct
a tat$cta t
tatat$cta
ctatatat$
cyclic
−−−−−→
shif ts
L
atat$ctat
a tatat$c t
sort
−−−→
c tatatat $
tat$ctata
t $ctatat a
at$ctatat
t at$ctat a
t$ctatata
t atat$ct a
$ctatatat
t atatat$ c
M0
last
−−−−−−→
column
tttt$aaac
M
4
Definition: For a string S of length n having the sentinel character at the end
(and nowhere else), the string BWT[1..n] is defined by BWT[i]
= $ if
SA[i] = 1 and BWT[i] = S[SA[i] − 1] otherwise.
5
i SA ISA BWT SSA[i]
1
9
5
t $
2
7
9
t at$
3
5
4
t atat$
4
3
8
t atatat$
5
1
3
$ ctatatat$
6
8
7
a t$
7
6
2
a tat$
8
4
6
a tatat$
9
2
1
c tatatat$
6
Computing BWT from SA and the string S
← 1 to n do
if SA[i] = 1 then BWT[i] ← $
else BWT[i] ← S[SA[i] − 1]
for i
7
F
L
BWT
F
L
$ ctatata t
$
t
t
$
t
a t$ctata t
a t$
t
t
at$
t
a tat$cta t
a tat$
t
t
atat$
t
a tatat$
t
observe
t
atatat$
t
c tatatat$
$
L=BWT
$
ctatatat$
$
t $ctatat a
t$
a
a
t$
a
t at$ctat a
t at$
a
a
tat$
a
t atat$ct a
t atat$
a
a
tatat$
a
t atatat$ c
t atatat$
c
c
tatatat$
c
F
L
a tatat$c t
c tatatat $
truncate
−−−−−−−→
af ter $
−−−−−−→
8
: {1, . . . , n} → {1, . . . , n} is defined as
follows. If L[i] = c is the k -th occurrence of character c in L = BWT, then
LF (i) = j is the index so that F [j] is the k -th occurrence of c in F .
Definition: The function LF
The function LF is called last-to-first mapping because it maps the last
column L to the first column F .
9
i
1
2 3 4
5
L[i]
t
t
$ a a a c
LF (i) 6
F [i]
t
t
6 7 8 9
7 8 9
1
2 3 4 5
$ a a a
c
t
t
t
t
10
Computing the C -array of string S
cnt[1..σ] ← [0..0]
/* initialize cnt */
C[1..σ] ← [0..0]
/* initialize C */
for i ← 1 to n do
cnt[S[i]] ← cnt[S[i]] + 1
for j ← 2 to σ do
C[j] ← cnt[j − 1] + C[j − 1]
11
Computing LF from BWT
∈ Σ do
count[c] ← C[c]
for i ← 1 to n do
c ← BWT[i]
count[c] ← count[c] + 1
LF [i] ← count[c]
for all c
12
Lemma: The first row of the matrix M contains the suffix Sn
= $. If row i,
2 ≤ i ≤ n, of the matrix M contains the suffix Sj , then row LF (i) of M
contains the suffix Sj−1 .
13
Computing the string S from BWT and LF .
S[n] ← $
j←1
for i ← n − 1 downto 1 do
S[i] ← BWT[j]
j ← LF (j)
14
Theorem: If L
= BWT is the output of the Burrows-Wheeler transform
applied to the string S and LF is corresponding last-to-front mapping, then
the previous algorithm computes S .
15
String S
Burrows-Wheeler
Transform
(BWT)
Move-to-Front
Coding (MTF)
Huffman
Compression
Code Sc
16
Move-to-front coding of a string L
∈ Σ∗ of length n.
Initialize a list containing the characters from Σ in increasing order.
← 1 to n do
R[i] ← number of characters preceding character L[i] in list
move character L[i] to the front of list
for i
17
Move-to-front decoding of R.
Initialize a list containing the characters from Σ in increasing order.
← 1 to n do
L[i] ← character at position R[i] + 1 in list (numbering elements from 1)
move character L[i] to the front of list
for i
18
Construction of the suffix array of a string S
The induced sorting algorithm by Nong et al.:
Phase 0: Compute the type array T by a right-to-left scan of S .
Phase I: Compute all LMS-positions in S and sort the corresponding suffixes
in ascending lexicographic order (we will elaborate on this phase later).
Phase II: Using the output of phase I, induce the order of all suffixes.
19
The suffixes of S are classified into two types: Suffix Si is S-type if
Si < Si+1 and it is L-type if Si > Si+1 . The last suffix, the sentinel $, is
S-type. We use a bit array T [1..n + 1] to store the types of the suffixes:
T [i] = 0 means that suffix Si is L-type and T [i] = 1 means that it is
S-type. For better readability, we write T [i] = L instead of T [i] = 0 T [i] =
S instead of T [i] = 1.
Lemma: All suffixes can be classified by a right-to-left scan of S .
Si is S-type if (a) S[i] < S[i + 1] or (b) S[i] = S[i + 1] and Si+1
is S-type. Analogously, Si is L-type if (a) S[i] > S[i + 1] or (b)
S[i] = S[i + 1] and Si+1 is L-type.
Proof:
20
S
type
LMS
1 2 3 4 5 6 7
i m i m m m i
S L S L L L S
*
*
8
s
L
9 10 11 12 13 14 15 16
i s m i s i s s
S L L S L S L L
*
*
*
17 18 19 20 21
i i p i $
S S L L S
*
*
21
Lemma: An S-type suffix is lexicographically greater than any L-type suffix
starting with the same first character.
Corollary: In the suffix array of S , among all suffixes that begin with the
same character, the L-type suffixes appear before the S-type suffixes.
22
The c-bucket is the interval [i..j] with i
= C[c] + 1 and j = C[c + 1].
Every c-bucket [i..j] can further be divided into the interval [i..k] containing
all L-type suffixes starting with character c and the interval [k
+ 1..j]
containing all S-type suffixes starting with character c.
The interval [i..k] is called the L-type region and [k
S-type region of the c-bucket (k
+ 1..j] is called the
= i − 1 or k = j is possible, i.e., the
S-type region or the L-type region of the bucket may be empty).
23
< i ≤ n + 1, in S with T [i − 1] = L and T [i] = S (i.e.,
suffix Si−1 is L-type and suffix Si is S-type) is called LMS-position (leftmost
A position i, 1
S-type position).
S
type
LMS
1 2 3 4 5 6 7
i m i m m m i
S L S L L L S
*
*
8
s
L
9 10 11 12 13 14 15 16
i s m i s i s s
S L L S L S L L
*
*
*
17 18 19 20 21
i i p i $
S S L L S
*
*
24
Phase II:
1. Scan the sorted sequence of LMS-positions from right to left. For each
position encountered in the scan, move it to the current end of its bucket
in A, and shift the current end of the bucket by one position to the left.
2. Scan the array A from left to right. For each entry A[i] encountered in
the scan, if SA[i]−1 is an L-type suffix, move its start position A[i] − 1
in S to the current front of its bucket in A, and shift the current front of
the bucket by one position to the right.
3. Scan the array A from right to left. For each entry A[i] encountered in
the scan, if SA[i]−1 is an S-type suffix, move its start position A[i] − 1
in S to the current end of its bucket in A, and shift the current end of the
bucket by one position to the left.
25
S
type
1
2
3
4
5
6
i
S
m
L
i
S
m m m
L L L
*
LMS
L-type suffixes
S-type suffixes
$
21
21 20
i
17 3
17 3
7
8
9 10 11 12 13 14 15 16 17 18 19 20 21
i s i s m i s i s s i
i p i $
S L S L L S L S L L S S L L S
*
*
*
*
*
*
p
m
s
7 12 9 14
7 12 9 14 2 6 11 5 4 19 16 8 13 10 15
21 20 17 1 17 3 7 12 9 14 2
3 18 7 12 9 14
6 11 5
4 19 16 8 13 10 15
26
Phase I:
1. Scan the array T from left to right. For each j : if j is an LMS-position,
move it to the current end of its bucket in A, and shift the current end of
the bucket by one position to the left; move j to the current front of array
P , and shift the current front of P by one position to the right.
2. Scan the array A from left to right. For each entry A[i] encountered in
the scan, if SA[i]−1 is an L-type suffix, move its start position A[i] − 1
in S to the current front of its bucket in A, and shift the current front of
the bucket by one position to the right.
3. Scan the array A from right to left. For each entry A[i] encountered in
the scan, if SA[i]−1 is an S-type suffix, move its start position A[i] − 1
in S to the current end of its bucket in A, and shift the current end of the
bucket by one position to the left.
27
4. Initialize an array LN[1..n + 1] that will contain the new lexicographic
names of the LMS-substrings, and initialize the counter i for new
lexicographic names to 1. Because the sentinel is the lexicographically
smallest LMS-substring, it gets the smallest new lexicographic name,
i.e., LN[n + 1]
= 1. Furthermore, initialize prev = n + 1. Now,
starting at index i = 2, scan the array A from left to right. Whenever an
entry A[i] = j is encountered so that j is the start position of an
LMS-substring do the following:
• Compare the LMS-substring starting at position prev in S with the
LMS-substring starting at position j in S (character by character).
• If they are different, increment i by one.
• Set LN[j] = i and prev = j .
28
5. Scan the array P [1..m] from left to right and for each k with
1 ≤ k ≤ m set S[k] = LN[P [k]].
6. If i
= m, then the LMS-substrings are pairwise different. Compute the
suffix array SA of the string S directly by SA[S[k]] = k .
7. Otherwise, recursively compute the suffix array SA of the string S .
8. The ascending lexicographic order of all the suffixes of S that start at an
LMS-position is P [SA[1]], P [SA[2]], . . . , P [SA[m]].
29
30
P
S
type
LMS
L-type suffixes
S-type suffixes
LN
1
3
2
7
3 4 5 6 7
9 12 14 17 21
1
2
3
4
6
7
8
i
S
m
L
i
S
m m m
L L L
i
S
s
L
5
*
*
$
i
21
17 14 12
21 20
17 14 12
21 20 17 1 17 14 12
3 18 12
3
9
9
9
7
4
9 10 11 12 13 14 15 16 17 18 19 20 21
i s m i s i s s i
i p i $
S L L S L S L L S S L L S
*
*
*
*
*
m
p
s
7 3
7 3 11 6 2 5 4 19 16 13 8 10 15
7 3 11 6 2 5 4 19 16 13 8 10 15
9 14
5
4
6
2
1
recursion
1
3
S
2
4
S
3
5
L
4
4
S
*
2
2
3
3
4
LMS
L-type suffixes
S-type suffixes
1
1
7
7
7
6
6
1
P
1
3
2
7
3 4 5 6 7
9 12 14 17 21
_
S
type
LMS
posi"on
5
6
L
6
2
L
7
1
S
*
5
6
5
7
6
3
3
5
5
4
2
4
4
4
4
3. 4. 6. 5. 7. 2. 1.
31
Worst-case time-complexity:
It is not difficult to see that each step in phases I and II takes at most O(n)
time. By definition, position 1 is not an LMS-position and there must be at
least one position in between two consecutive LMS-positions. Hence
|S| = m ≤ b n+1
2 c. It follows from the master theorem that the whole
induced sorting algorithm has a worst-case time complexity of O(n).
32
The master theorem says that a recurrence of the form
T (n) =
m
X
T (αi · n) + Θ(nk )
i=1
where 0
≤ αi ≤ 1, m ≥ 1, k ≥ 0, has the solution

Pm k

k

Θ(n ),
if

i=1 αi < 1

Pm k
k
T (n) =
Θ(n log n), if i=1 αi = 1


Pm k

 Θ(nc ),
if
i=1 αi > 1
where c is the solution of the equation
Pm
c = 1.
α
i=1 i
33
1 2 3 4
21 20 17 1
5 6 7 8
3 18 12 7
9 10 11 12 13 14 15 16 17 18 19 20 21
9 14 11 6 2 5 4 19 16 13 8 10 15
21 17 3 12 7 9 14
sorted LMS-substrings
21 17 3 12 7
9 14
3
lexicographic names
4 5
4 6 2
1
_
S
21 17 3 12 7
9 14
3
4
5
4
6
2
1
34
Definition: Let S be a string of length n on an alphabet Σ.
The Lempel-Ziv factorization (or LZ-factorization for short) of S is a
factorization S
(a) a letter c
= s1 s2 · · · sm so that each factor sj , 1 ≤ j ≤ m, is either
∈ Σ that does not occur in s1 s2 · · · sj−1 , or
(b) the longest substring of S that occurs at least twice in s1 s2 · · · sj .
The Lempel-Ziv factorization can be represented by a sequence of pairs
(PrevOcc1 , LPS1 ), . . . , (PrevOccm , LPSm ), where in case (a)
PrevOccj = c and LPSj = 0, and in case (b) PrevOccj is a position in
s1 s2 · · · sj−1 , at which an occurrence of sj starts and LPSj = |sj |.
35
Reconstruction of the string S , given its LZ-factorization
(PrevOcc1 , LPS1 ), . . . , (PrevOccm , LPSm ).
i←1
for j ← 1 to m do
if LPSj = 0 then
S[i] ← PrevOccj
i←i+1
else
← 0 to LPSj − 1 do
S[i] ← S[PrevOccj + k]
i←i+1
for k
36
Definition: For a string S of length n, the longest previous substring (LPS)
array of size n is defined by LPS[1]
= 0 and
LPS[k] = max{` : 0 ≤ ` ≤ n−k+1; S[k..k+`−1] is a substring of S[1..k+`−2]}
≤ k ≤ n. That is, LPS[k] is the length of the longest
prefix of Sk that has another occurrence in S starting strictly before position
k.
for every k with 2
37
The LPS and PrevOcc arrays of the string S
S[i]
i
LPS[i]
PrevOcc[i]
a c a a a c a
= acaaacatat$.
t
a
t
1 2 3 4 5 6 7 8 9 10
0 0 1 2 3 2 1 0 2
1
a c 1 3 1 2 5
8
t
7
38
Computation of the LZ-factorization based on LPS and PrevOcc.
i←1
while i < n do
if LPS[i] = 0 then
PrevOcc[i] ← S[i]
output (PrevOcc[i], LPS[i])
i ← i + max{1, LPS[i]}
39
Lemma: For every k with 2
≤ k ≤ n, the following equality holds true:
LPS[k] = max{|lcp(Sp , Sk )| : 1 ≤ p < k}
where lcp(Sp , Sk ) denotes the longest common prefix between Sp and Sk .
With k
= SA[i] and p = SA[j], the lemma can be rephrased as:
LPS[SA[i]] = max{|lcp(SSA[j] , SSA[i] )| : SA[j] < SA[i]}
(1)
40
Definition: For any index i with 1
≤ i ≤ n, we define
PSVSA [i] = max{j : 0 ≤ j < i and SA[j] < SA[i]}
and
NSVSA [i] = min{j : i < j ≤ n + 1 and SA[j] < SA[i]}
41
Definition: The LCP-array is an array of size n + 1 with boundary elements
LCP[1] = 0 and LCP[n + 1] = 0, and for all i with 2 ≤ i ≤ n we have
LCP[i] = |lcp(SSA[i−1] , SSA[i] )|
42
i SA LCP SSA[i]
PSV[i] NSV[i] LPS[SA[i]] PrevOcc[SA[i]]
0
0
1
3
0 aaacatat
0
3
1
1
2
4
2 aacatat
1
3
2
3
3
1
1 acaaacatat
0
11
0
⊥
4
5
3 acatat
3
7
3
1
5
9
1 at
4
6
2
7
6
7
2 atat
4
7
1
5
7
2
0 caaacatat
3
11
0
⊥
8
6
2 catat
7
11
2
2
0 t
8
10
1
8
1 tat
8
11
0
9 10
10
8
ε
⊥
43
Lemma: For every i with 1
≤ i ≤ n, the following equality holds
LPS[SA[i]] = max{|lcp(SSA[PSV[i]] , SSA[i] )|, |lcp(SSA[i] , SSA[NSV[i]] )|}
where S0 is the empty string ε.
44
Consider the (conceptual) graph defined as follows:
The graph has n + 2 nodes each of which is labeled with (i, SA[i]); to deal
with boundary cases, the graph also contains the nodes labeled with
(0, SA[0]) = (0, 0) and (n + 1, SA[n + 1]) = (n + 1, 0).
Moreover, consecutive nodes (i, SA[i]) and (i + 1, SA[i + 1]) are
connected by an edge labeled with the value LCP[i + 1].
It is instructive to view each node (i, SA[i]) as a point in the plane R2 .
45
10
1
9
2
1
0
8
7
6
5
0
2
2
4
0
3
3
1
2
0
1
0
0
46
In the graph, |lcp(SSA[PSV[i]] , SSA[i] )| is the minimum of the edge labels on
the path from node (PSV[i], SA[PSV[i]]) to (i, SA[i]).
Analogously, |lcp(SSA[i] , SSA[NSV[i]] )| is the minimum of the edge labels on
the path from (i, SA[i]) to node (NSV[i], SA[NSV[i]]).
Now consider a peak in the graph, i.e., a node (i, SA[i]) so that
SA[i − 1] < SA[i] and SA[i + 1] < SA[i]. In this case,
(PSV[i], SA[PSV[i]]) = (i − 1, SA[i − 1]) and
(NSV[i], SA[NSV[i]]) = (i + 1, SA[i + 1]).
Thus, we have LPS[SA[i]]
= max{LCP[i], LCP[i + 1]}.
We delete node (i, SA[i]) and its edges from the graph, add an edge
labeled with min(LCP[i], LCP[i + 1]) between nodes (i − 1, SA[i − 1])
and (i + 1, SA[i + 1]), and continue in the same fashion with the
transformed graph.
47
Removal of the first peak in the graph:
10
1
9
2
1
0
8
7
6
5
0
2
2
4
0
3
3
1
2
0
1
0
0
48
Transformed graph:
10
1
9
2
1
0
8
7
6
5
0
2
0
3
3
1
2
0
1
0
0
49
Removal of the second peak in the graph:
10
1
9
2
1
0
8
7
6
5
0
2
0
3
3
1
2
0
1
0
0
50
Transformed graph:
10
1
9
2
1
0
8
7
6
5
0
2
0
3
2
0
0
1
0
51
Removal of the third peak in the graph:
10
1
9
2
1
0
8
7
6
5
0
2
0
3
2
0
0
1
0
52
Transformed graph:
10
1
0
8
7
1
6
5
0
2
0
3
2
0
0
1
0
53
SA[n + 1] ← 0
LCP[n + 1] ← 0
push(h0, ⊥i)
for i ← 1 to n + 1 do
lcp ← LCP[i]
while SA[i] < top.sa
last ← pop
LPS[last.sa] ← max(last.lcp, lcp)
lcp ← min(last.lcp, lcp)
if LPS[last.sa] = 0 then PrevOcc[last.sa] ← ⊥
else if last.lcp > lcp then PrevOcc[last.sa] ← top.sa
else PrevOcc[last.sa] ← SA[i]
push(hSA[i], lcpi)
54
As the suffix array, the LCP-array can be computed in linear time.
However, it is not necessary to precompute the LCP-array.
The preceding algorithm can be modified in such a way that computes LPS
and PrevOcc using the suffix array only.
55
Download