Document

advertisement
Computing Reversed
Lempel-Ziv Factorization Online
Shiho Sugimoto, Tomohiro I, Shunsuke Inenaga,
Hideo Bannai, Masayuki Takeda
Kyushu University, Japan
Everything is String.
HABATAKITAI Laboratory
Outline
• Reversed LZ factorization without self-references (RLZ)
• Online RLZ algorithm by Kolpakov and Kucherov
• New online RLZ algorithm using O(n log σ) bits of space
• Reversed LZ factorization with self-references (RLZS)
• New online RLZS algorithm using O(n log n) bits of space
• New online RLZS algorithm using O(n log σ) bits of space
n : the length of input string
σ : the alphabet size
Everything is String.
HABATAKITAI Laboratory
Background
• LZ factorization was proposed in 1977
[Ziv & Lempel, 1977].
– data compression etc.
• Reversed LZ factorization (RLZ in short) was
proposed in 2009 [Kolpakov & Kucherov, 2009].
– finding gapped palindromes etc.
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2 s3
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2 s3 s4
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2 s3 s4 s5
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2 s3 s4 s5 s6
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2 s3 s4 s5 s6 s7
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization without self-references
[Ziv & Lempel, 1977]
LZ factorization without self-references of string
w of length n is a factorization s1,s2,...,sm such that
• w = s1 s2…sm
• si is the longest non-empty prefix of
w[|s1…si−1|+1..n] that is also a substring of
w[1.. | s1…si−1|] if such exists
• si = w[|s1…si−1|+1] otherwise
s1 s2 s3 s4 s5 s6 s7
s8 s 9
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
without self-references (RLZ)
[Kolpakov & Kucherov, 2009]
RLZ without self-references of string w of length n is
a factorization f1,f2,...,fm such that
• w = f1 f2…fm
• fi is the longest non-empty prefix of w[|f1...fi−1|+1..n]
that is also a substring of w[1.. | f1...fi−1|]R if such
exists
reversed
• fi = w[|f1...fi−1|+1] otherwise
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
without self-references (RLZ)
[Kolpakov & Kucherov, 2009]
RLZ without self-references of string w of length n is
a factorization f1,f2,...,fm such that
• w = f1 f2…fm
• fi is the longest non-empty prefix of w[|f1...fi−1|+1..n]
that is also a substring of w[1.. | f1...fi−1|]R if such
exists
reversed
• fi = w[|f1...fi−1|+1] otherwise
f1 f2
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
without self-references (RLZ)
[Kolpakov & Kucherov, 2009]
RLZ without self-references of string w of length n is
a factorization f1,f2,...,fm such that
• w = f1 f2…fm
• fi is the longest non-empty prefix of w[|f1...fi−1|+1..n]
that is also a substring of w[1.. | f1...fi−1|]R if such
exists
reversed
• fi = w[|f1...fi−1|+1] otherwise
f1 f2 f3
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
without self-references (RLZ)
[Kolpakov & Kucherov, 2009]
RLZ without self-references of string w of length n is
a factorization f1,f2,...,fm such that
• w = f1 f2…fm
• fi is the longest non-empty prefix of w[|f1...fi−1|+1..n]
that is also a substring of w[1.. | f1...fi−1|]R if such
exists
reversed
• fi = w[|f1...fi−1|+1] otherwise
f1 f2 f3 f4
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
without self-references (RLZ)
[Kolpakov & Kucherov, 2009]
RLZ without self-references of string w of length n is
a factorization f1,f2,...,fm such that
• w = f1 f2…fm
• fi is the longest non-empty prefix of w[|f1...fi−1|+1..n]
that is also a substring of w[1.. | f1...fi−1|]R if such
exists
reversed
• fi = w[|f1...fi−1|+1] otherwise
f1 f2 f3 f4
f5
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
without self-references (RLZ)
[Kolpakov & Kucherov, 2009]
RLZ without self-references of string w of length n is
a factorization f1,f2,...,fm such that
• w = f1 f2…fm
• fi is the longest non-empty prefix of w[|f1...fi−1|+1..n]
that is also a substring of w[1.. | f1...fi−1|]R if such
exists
reversed
• fi = w[|f1...fi−1|+1] otherwise
f1 f2 f3 f4
f5
f6 f7
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
KK algorithm
[Kolpakov & Kucherov, 2009]
• Computes RLZ in an online manner
• Works in O(n log n) bits of space and O(n log σ)
time (on a word RAM model).
– Constructs suffix tree for reversed prefixes online.
– Computes RLZ factors from suffix tree.
– Blumer’s version of Weiner’s algorithm achieves
above complexity [Blumer et al, 1985] [Weiner, 1973].
Everything is String.
HABATAKITAI Laboratory
KK algorithm
f1
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree(ε)
Everything is String.
HABATAKITAI Laboratory
KK algorithm
f1 f2
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree(aR)
a
Everything is String.
HABATAKITAI Laboratory
KK algorithm
f1 f2
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree((ab)R)
a
Everything is String.
b
a
HABATAKITAI Laboratory
KK algorithm
f1 f2
f3
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree((ab)R)
a
Everything is String.
b
a
HABATAKITAI Laboratory
KK algorithm
f1 f2
f3
f4
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree((abba)R)
a
Everything is String.
b
b
a
b
a
b
a
HABATAKITAI Laboratory
KK algorithm
f1 f2
f3
f4
f5
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree((aabba)R)
a
a
Everything is String.
a
b
b
b
b
a
b
a
b
a
HABATAKITAI Laboratory
KK algorithm
f1 f2
f3
f4
f5
[Kolpakov & Kucherov, 2009]
Ex)w = a b b a a a a b b b a c
Stree((aabba)R)
This suffix tree
requires O(n log n)
bits of space
a
a
a
b
b
b
b
a
b
a
b
a
We propose a new online RLZ
algorithm which uses only O(n log σ)
bits of space. (σ≦n is the alphabet size)
Everything is String.
HABATAKITAI Laboratory
For O(n log σ) bits of space
• We utilize the idea of Starikovskaya’s algorithm.
– It computes LZ factorization online in O(n log σ) bits of
space and O(n log2n) time [Starikovskaya, 2012].
• We divide input string into blocks of length
r = O(logσn).
– Each block is replaced by a meta-character.
Everything is String.
HABATAKITAI Laboratory
For O(n log σ) bits of space
• We utilize the idea of Starikovskaya’s algorithm.
– It computes LZ factorization online in O(n log σ) bits of
space and O(n log2n) time [Starikovskaya, 2012].
• We divide input string into blocks of length
r = O(logσn).
– Each block is replaced by a meta-character.
Ex)w = a b b a a a a b b b a c ………
r=3
B
Everything is String.
A
B
C
………
HABATAKITAI Laboratory
For O(n log σ) bits of space
• We utilize the idea of Starikovskaya’s algorithm.
– It computes LZ factorization online in O(n log σ) bits of
space and O(n log2n) time [Starikovskaya, 2012].
• We divide input string into blocks of length
r = O(logσn).
– Each block is replaced by a meta-character.
Ex)w = a b b a a a a b b b a c ………
r=3
B
Everything is String.
A
B
C
………
HABATAKITAI Laboratory
Our online RLZ algorithm
• For fi of length shorter than r, we use suffix trie
of reversed subwords of length 2r.
– can find fi in o(n) bits of space and O(|fi| log σ) time.
• For fi of length at least r, we use suffix tree of
reversed blocks (meta-characters).
– can find fi in O(n log σ) bits of space and O(|fi| log2n)
time.
Everything is String.
HABATAKITAI Laboratory
Our online RLZ algorithm
• For fi of length shorter than r, we use suffix trie
of reversed subwords of length 2r.
– can find fi in o(n) bits of space and O(|fi| log σ) time.
• For fi of length at least r, we use suffix tree of
reversed blocks (meta-characters).
– can find fi in O(n log σ) bits of space and O(|fi| log2n)
time.
Theorem
We can compute RLZ without self-references online
in O(n log σ) bits of space and O(n log2n) time.
Everything is String.
HABATAKITAI Laboratory
Outline
• Reversed LZ factorization without self-references (RLZ)
• Online RLZ algorithm by Kolpakov and Kucherov
• New online RLZ algorithm using O(n log σ) bits of space
• Reversed LZ factorization with self-references (RLZS)
• New online RLZS algorithm using O(n log n) bits of space
• New online RLZS algorithm using O(n log σ) bits of space
n : the length of input string
σ : the alphabet size
Everything is String.
HABATAKITAI Laboratory
LZ factorization with self-references
[Ziv & Lempel, 1977]
LZ factorization with self-references of string w of
length n is a factorization t1,t2,...,tm such that
• w = t1 t2…tm
• ti is the longest non-empty prefix of
w[|t1…ti−1|+1..n] that is also a substring of
w[1.. | t1…ti|-1] if such exists
self-reference
• ti = w[|t1…ti−1|+1] otherwise.
Everything is String.
HABATAKITAI Laboratory
LZ factorization with self-references
[Ziv & Lempel, 1977]
LZ factorization with self-references of string w of
length n is a factorization t1,t2,...,tm such that
• w = t1 t2…tm
• ti is the longest non-empty prefix of
w[|t1…ti−1|+1..n] that is also a substring of
w[1.. | t1…ti|-1] if such exists
self-reference
• ti = w[|t1…ti−1|+1] otherwise.
t1 t2 t3
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization with self-references
[Ziv & Lempel, 1977]
LZ factorization with self-references of string w of
length n is a factorization t1,t2,...,tm such that
• w = t1 t2…tm
• ti is the longest non-empty prefix of
w[|t1…ti−1|+1..n] that is also a substring of
w[1.. | t1…ti|-1] if such exists
self-reference
• ti = w[|t1…ti−1|+1] otherwise.
t1 t2 t3 t4
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
LZ factorization with self-references
[Ziv & Lempel, 1977]
LZ factorization with self-references of string w of
length n is a factorization t1,t2,...,tm such that
• w = t1 t2…tm
• ti is the longest non-empty prefix of
w[|t1…ti−1|+1..n] that is also a substring of
w[1.. | t1…ti|-1] if such exists
self-reference
• ti = w[|t1…ti−1|+1] otherwise.
t1 t2 t3 t4
t5
t6
t7 t8
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
with self-references
RLZ with self-references (RLZS) of string w of length
n is a factorization g1,g2,...,gm such that
• w = g1 g2…gm
• gi is the longest non-empty prefix of
w[|g1...gi−1|+1..n] that is also a substring of
w[1.. | g1…gi|-1]R if such exists
self-reference
• gi = w[|g1…gi−1|+1] otherwise.
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
with self-references
RLZ with self-references (RLZS) of string w of length
n is a factorization g1,g2,...,gm such that
• w = g1 g2…gm
• gi is the longest non-empty prefix of
w[|g1...gi−1|+1..n] that is also a substring of
w[1.. | g1…gi|-1]R if such exists
self-reference
• gi = w[|g1…gi−1|+1] otherwise.
g1
g2
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
with self-references
RLZ with self-references (RLZS) of string w of length
n is a factorization g1,g2,...,gm such that
• w = g1 g2…gm
• gi is the longest non-empty prefix of
w[|g1...gi−1|+1..n] that is also a substring of
w[1.. | g1…gi|-1]R if such exists
self-reference
• gi = w[|g1…gi−1|+1] otherwise.
g1
g2
g3
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
with self-references
RLZ with self-references (RLZS) of string w of length
n is a factorization g1,g2,...,gm such that
• w = g1 g2…gm
• gi is the longest non-empty prefix of
w[|g1...gi−1|+1..n] that is also a substring of
w[1.. | g1…gi|-1]R if such exists
self-reference
• gi = w[|g1…gi−1|+1] otherwise.
g1
g2
g3
g4
g5
Ex)w = a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
online computation of RLZS
Ex)w = a b b a a a a b b b a c
w[1..1] = a
Everything is String.
HABATAKITAI Laboratory
online computation of RLZS
Ex)w = a b b a a a a b b b a c
w[1..1] = a
w[1..2] = a b
Everything is String.
HABATAKITAI Laboratory
online computation of RLZS
Ex)w = a b b a a a a b b b a c
w[1..1] = a
w[1..2] = a b
w[1..3] = a b b
Everything is String.
HABATAKITAI Laboratory
online computation of RLZS
Ex)w = a
w[1..1] = a
w[1..2] = a
w[1..3] = a
w[1..4] = a
Everything is String.
b b a a a a b b b a c
b
b b
b b a
HABATAKITAI Laboratory
online computation of RLZS
Ex)w = a
w[1..1] = a
w[1..2] = a
w[1..3] = a
w[1..4] = a
Everything is String.
b b a a a a b b b a c
b
b b
b b a
HABATAKITAI Laboratory
online computation of RLZS
Ex)w = a
w[1..1] = a
w[1..2] = a
w[1..3] = a
w[1..4] = a
w[1..5] = a
w[1..6] = a
w[1..7] = a
w[1..8] = a
w[1..9] = a
w[1..10] = a
w[1..11] = a
w[1..12] = a
Everything is String.
b b a a a a b b b a c
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
b
b
b
b
b
b
b b
b b a
b b a c
HABATAKITAI Laboratory
Reversed LZ factorization
with self-references
Every self-referencing factor is a suffix of a
palindrome.
g1
g2
g3
g4
g5
Ex)w = a b b a a a a b b b a c
palindrome
Everything is String.
HABATAKITAI Laboratory
Reversed LZ factorization
with self-references
Every self-referencing factor is a suffix of a
palindrome.
g1
g2
g3
g4
g5
Ex)w = a b b a a a a b b b a c
palindrome
Everything is String.
HABATAKITAI Laboratory
online RLZS in O(nlogn) bits of space
We can compute each RLZS factor gi by
• using KK algorithm, and
– In a total of O(n log n) bits of space and O(n log σ) time.
• computing the longest palindrome which ends at
each position, online
– In a total of O(n log n) bits of space and O(n) time, by
modifying Manachar’s algorithm [Manacher, 1975].
Theorem
We can compute RLZS online in O(n log n) bits
of space and O(n log σ) time.
Everything is String.
HABATAKITAI Laboratory
Outline
• Reversed LZ factorization without self-references (RLZ)
• Online RLZ algorithm by Kolpakov and Kucherov
• New online RLZ algorithm using O(n log σ) bits of space
• Reversed LZ factorization with self-references (RLZS)
• New online RLZS algorithm using O(n log n) bits of space
• New online RLZS algorithm using O(n log σ) bits of space
n : the length of input string
σ : the alphabet size
Everything is String.
HABATAKITAI Laboratory
Suffix palindromes
• All suffix palindromes of a string of length n
can be presented by O(log n) arithmetic
progressions [Apostolico,1995].
Everything is String.
HABATAKITAI Laboratory
Suffix palindromes
• All suffix palindromes of a string of length n
can be presented by O(log n) arithmetic
progressions [Apostolico,1995].
Ex) w = a b a b a c a b a b a d a b a b a c a b a b a
Everything is String.
HABATAKITAI Laboratory
Suffix palindromes
• All suffix palindromes of a string of length n
can be presented by O(log n) arithmetic
progressions [Apostolico,1995].
Ex) w = a b a b a c a b a b a d a b a b a c a b a b a
Everything is String.
HABATAKITAI Laboratory
online computation of suffix
palindromes
• What happens to the suffix palindromes when
a new character is appended?
Ex) w = a b a b a b a b
wa = a b a b a b a b a
wc = a b a b a b a b c
Everything is String.
HABATAKITAI Laboratory
online computation of suffix
palindromes
• What happens to the suffix palindromes when
a new character is appended?
w
Everything is String.
a
x
HABATAKITAI Laboratory
online computation of suffix
palindromes
• What happens to the suffix palindromes when
a new character is appended?
w
a
x
if x = a
w
Everything is String.
a
a
HABATAKITAI Laboratory
online computation of suffix
palindromes
• What happens to the suffix palindromes when
a new character is appended?
w
b
b
b
b
x
if x = b
w
Everything is String.
b
b
b
b
b
HABATAKITAI Laboratory
Computing RLZS in O(n log σ) bits of
space
We can compute each RLZS factor gi by
• using our RLZS algorithm, and
– In a total of O(n log σ) bits of space and O(n log2n) time.
• computing the longest palindrome which ends at
each position, online
– In a total of O(log2n) bits of space and O(n log n) time.
Theorem
We can compute RLZS online in O(n log σ) bits
of space and O(n log2n) time.
Everything is String.
HABATAKITAI Laboratory
The problems of RLZS
• RLZS was too difficult for us to factorize
Proof
There is a mistake in proceedings of the PSC.
Everything is String.
HABATAKITAI Laboratory
The problems of RLZS
• RLZS was too difficult for us to factorize
Proof
There is a mistake in proceedings of the PSC.
p114
a b b a a a a b b b a c
a b b a a a a b b b a c
Everything is String.
HABATAKITAI Laboratory
The problems of RLZS
• RLZS was too difficult for us to factorize
Proof
There is a mistake in proceedings of the PSC.
Everything is String.
HABATAKITAI Laboratory
The problems of RLZS
• RLZS was too difficult for us to factorize
Proof
There is a mistake in proceedings of the PSC.
• No idea for using RLZS
Everything is String.
HABATAKITAI Laboratory
Conclusion
RLZ online algorithms
space
self
-references
O(n log n) bits
O(n log σ) bits
without
O(n log σ) time
O(n log2n) time
[Kolpakov & Kucherov, 2009]
with
O(n log σ) time
O(n log2n) time
n : the length of input string
σ : the alphabet size
Everything is String.
HABATAKITAI Laboratory
Conclusion
RLZ online algorithms
space
self
-references
O(n log n) bits
O(n log σ) bits
without
O(n log σ) time
O(n log2n) time
[Kolpakov & Kucherov, 2009]
with
O(n log σ) time
n : the length of input string
σ : the alphabet size
Everything is String.
O(n log2n) time
This work
HABATAKITAI Laboratory
Download