A Categorization Theorem on Suffix
Arrays with Applications to Space
Efficient Text Indexes
Meng He, J. Ian Munro,
and S. Srinivasa Rao
University of Waterloo
The Problem
• Initial Problem

Text searching: Finding occurrences of a
pattern string in a large (static) document
• Solution

Text indexing: Trading space for time
• New Problem

Succinct Text indexes: Reducing the space
cost
Pattern Searching
• Given a text string T of length n and a
pattern string P of length m, we look for
the occurrences of P in T.
• Three types of Queries

Existential queries: Does P occur in T?
 Cardinality queries: How many times does P
occur in T?
 Listing queries: Where does P occur in T?
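The three query types can be illustrated with a naive scan of the text, with no index at all. This is only a baseline sketch to pin down what each query returns; the function names are hypothetical and positions are 1-based to match the T[1..n] notation used in these slides.

```python
def occurrences(text, pattern):
    """Listing query: 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def count(text, pattern):
    """Cardinality query: how many times pattern occurs in text."""
    return len(occurrences(text, pattern))


def exists(text, pattern):
    """Existential query: does pattern occur in text at all?"""
    return count(text, pattern) > 0
```

The scan takes O(nm) time per query, which is the cost a text index is meant to avoid.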
Text Indexing
• Inverted files

Word index
 Need to store the text as well as the index
• Suffix trees

Efficient full-text index
 4n lg n to 6n lg n bits!
• Suffix arrays

n lg n bits in basic form, but
 3n lg n bits (with LCP data)
Applications
• Text databases

electronic encyclopedias, dictionaries, books,
etc.
• Web search engines

Google, Altavista, etc.
• Bioinformatics

gene databases
• More…
Related Work
• Compressed Suffix Arrays

Grossi & Vitter 2000
 Sadakane 2000
 Grossi, Gupta & Vitter 2003
• FM-index

Ferragina & Manzini 2000 & 2001
Assumptions & Notation
• Alphabet: Σ = {a, b}
• Text: T[1..n]

T[n] = #, where a < # < b
• Pattern: P[1..m]
Permutations and Suffix Arrays
• An observation

Permutations: n!
 Suffix arrays: 2^(n-1)
 Not all permutations are suffix arrays
• An example

A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
• Text: abbaaba#

A permutation: 4, 7, 1, 5, 8, 2, 3, 6
• Not a suffix array of any binary text
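The example suffix array can be reproduced by sorting the suffixes directly under the order a < # < b. This is a minimal sketch for small inputs (it sorts whole suffixes, so it is far from an efficient construction); the names ORDER and suffix_array are illustrative only.

```python
# Character order assumed throughout the slides: a < # < b
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text):
    """Return the 1-based start positions of all suffixes of text, sorted."""
    def key(i):
        # Compare suffixes as sequences of character ranks.
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


print(suffix_array("abbaaba#"))  # [4, 7, 5, 1, 8, 3, 6, 2]
```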
Two Features of Suffix Arrays
Ascending-to-max
Non-nesting
Suffix Array: 4 7 5 1 8 3 6 2
Another Permutation: 4 7 1 5 8 2 3 6
A Categorization Theorem
• A permutation is a suffix array iff it is:

Ascending-to-max
 Non-nesting
• An immediate application:

Checking whether a permutation is a suffix
array in O(n) time using n + O(1) additional
words in memory.
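For small inputs, the claim can be checked without the theorem. The sketch below is not the O(n) ascending-to-max / non-nesting test from the talk; it is a simpler substitute that relies on one observation: since T[n] = # and a < # < b, every suffix ranked before position n must start with a and every suffix ranked after it must start with b, so a permutation determines at most one candidate text. Rebuilding that text and sorting its suffixes settles the question. All function names here are hypothetical.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text):
    """Naive suffix sorting; returns 1-based start positions."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def text_from_permutation(perm):
    """Return the only binary text that could have perm as its suffix array."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return None  # not a permutation of 1..n
    split = perm.index(n)  # rank of the suffix "#"
    chars = [""] * n
    for rank, pos in enumerate(perm):
        chars[pos - 1] = "a" if rank < split else ("#" if rank == split else "b")
    return "".join(chars)


def is_suffix_array(perm):
    """Brute-force check: rebuild the candidate text and compare suffix arrays."""
    text = text_from_permutation(perm)
    return text is not None and suffix_array(text) == list(perm)
```

Both permutations from the earlier example map to the same candidate text abbaaba#, but only 4, 7, 5, 1, 8, 3, 6, 2 is its suffix array, so the second permutation is rejected.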
Application:
Space Efficient Suffix Array
SA:
8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
Ba:
0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
Bb:
1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
Text: abaaabbaaabaabb#
Basic Searching Algorithm:
Answering Cardinality Queries
Basic Idea: backward search

Start from the end of the pattern P
 For i = m, m-1, …, 1, compute the interval [s,
e] of SA whose corresponding suffixes are
prefixed with P[i, m]
SA:
8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
P = aba
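The backward search can be sketched on this example as follows. The sketch assumes that Ba[i] = 1 exactly when the character preceding suffix SA[i] in the text is a (and likewise Bb for b), which agrees with the bit vectors shown above. Rank queries are answered from plain prefix-sum lists rather than a succinct rank structure, so this illustrates the interval computation only, not the space bounds. The function names are hypothetical.

```python
from itertools import accumulate

TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]


def preceding_char_bits(text, sa, c):
    """Bit i is 1 iff the suffix at sa[i] is preceded by character c."""
    return [1 if p > 1 and text[p - 2] == c else 0 for p in sa]


def backward_search(text, sa, pattern):
    """Return the 1-based interval (s, e) of sa for pattern, or None if absent.

    The pattern is assumed to contain only the characters a and b.
    """
    n = len(text)
    # Number of suffixes ranked before the block starting with each character
    # (order a < # < b, so the b block follows all a-suffixes and the # suffix).
    first = {"a": 0, "b": text.count("a") + 1}
    # rank[c][i] = number of 1 bits among the first i bits of the vector for c
    rank = {c: [0] + list(accumulate(preceding_char_bits(text, sa, c)))
            for c in "ab"}
    s, e = 1, n
    for c in reversed(pattern):  # i = m, m-1, ..., 1
        s = first[c] + rank[c][s - 1] + 1
        e = first[c] + rank[c][e]
        if s > e:
            return None
    return s, e


print(backward_search(TEXT, SA, "aba"))  # (6, 7)
```

For P = aba the intervals are [1, 9] for a, then [11, 13] for ba, then [6, 7] for aba, so the cardinality query returns 2; the entries SA[6] = 1 and SA[7] = 10 are the two occurrences.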
More Algorithms and Tradeoffs
• Answering listing queries
• Speeding up the reporting of occurrences
of long patterns
• Self-indexing
• Time-space tradeoff: multi-level structure
Putting it all together
Three index structures:
          space (bits)   pattern searching
Index 1   n + o(n)       O(m) (existential & cardinality queries only)
Index 2   2n + o(n)      O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg n) otherwise
Index 3   O(n)           O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg^λ n) otherwise
Conclusion
• Summary

A theorem that characterizes a permutation as
the suffix array of a binary string

An efficient algorithm checking whether a
permutation is a suffix array

Three space efficient text indexing methods
Conclusions (Continued)
• Related subsequent work

Generalization to larger alphabets
• Open problem

O(n)-bits text index supporting searching in
O(m+occ) time.
Thank You.