slides

A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes
Meng He, J. Ian Munro, and S. Srinivasa Rao
University of Waterloo

The Problem
• Initial Problem
  - Text searching: Finding occurrences of a pattern string in a large (static) document
• Solution
  - Text indexing: Trading space for time
• New Problem
  - Succinct text indexes: Reducing the space cost

Pattern Searching
• Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.
• Three types of queries
  - Existential queries: Does P occur in T?
  - Cardinality queries: How many times does P occur in T?
  - Listing queries: Where does P occur in T?

Text Indexing
• Inverted files
  - Word index
  - Need to store the text as well as the index
• Suffix trees
  - Efficient full-text index
  - 4n lg n to 6n lg n bits!
• Suffix arrays
  - n lg n bits in basic form, but
  - 3n lg n bits (with LCP data)

Applications
• Text databases
  - electronic encyclopedias, dictionaries, books, etc.
• Web search engines
  - Google, Altavista, etc.
• Bioinformatics
  - gene databases
• More…

Related Work
• Compressed Suffix Arrays
  - Grossi & Vitter 2000
  - Sadakane 2000
  - Grossi, Gupta & Vitter 2003
• FM-index
  - Ferragina & Manzini 2000 & 2001

Assumptions & Notation
• Alphabet: Σ = {a, b}
• Text: T[1..n]
  - T[n] = #, where a < # < b
• Pattern: P[1..m]

Permutations and Suffix Arrays
• An observation
  - Permutations: n!
  - Suffix arrays: 2^(n-1)
  - Not all permutations are suffix arrays
• An example
  - A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
    • Text: abbaaba#
  - A permutation: 4, 7, 1, 5, 8, 2, 3, 6
    • Not a suffix array of any binary text

Two Features of Suffix Arrays
• Ascending-to-max
• Non-nesting
Suffix Array:        4 7 5 1 8 3 6 2
Another Permutation: 4 7 1 5 8 2 3 6

A Categorization Theorem
• A permutation is a suffix array iff it is:
  - Ascending-to-max
  - Non-nesting
• An immediate application:
  - Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.

Application: Space Efficient Suffix Array
SA:   8  3  9  4 12  1 10  5 13 16  7  2 11 15  6 14
Ba:   0  0  1  1  0  0  1  1  1  0  0  1  1  0  1  1
Bb:   1  1  0  0  1  0  0  0  0  1  1  0  0  1  0  0
Text: abaaabbaaabaabb#

Basic Searching Algorithm: Answering Cardinality Queries
• Basic idea: backward search
  - Start from the end of the pattern P
  - For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]
SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
P = aba

More Algorithms and Tradeoffs
• Answering listing queries
• Speeding up the reporting of occurrences of long patterns
• Self-indexing
• Time-space tradeoff: multi-level structure

Putting It All Together
Three index structures:

Index     Space (bits)   Pattern searching
Index 1   n + o(n)       O(m) (existential & cardinality queries only)
Index 2   2n + o(n)      O(m + occ) (m = Ω(lg^(1+ε) n)); O(m + occ lg n) (otherwise)
Index 3   O(n)           O(m + occ) (m = Ω(lg^(1+ε) n)); O(m + occ lg^λ n) (otherwise)

Conclusions
• Summary
  - A theorem that characterizes a permutation as the suffix array of a binary string
  - An efficient algorithm checking whether a permutation is a suffix array
  - Three space efficient text indexing methods

Conclusions (Continued)
• Related subsequent work
  - Generalization to larger alphabets
• Open problem
  - O(n)-bit text index supporting searching in O(m + occ) time.

Thank You.
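
Worked sketch (not part of the original slides)

The two suffix-array examples and the backward search for P = aba can be reproduced with a short Python sketch. All function names here are hypothetical. The suffix array is built by naive sorting. The permutation check rebuilds the only candidate text and compares; it is a plain stand-in, not the O(n) theorem-based test from the slides. The reading of Ba and Bb (bit i is 1 when the character preceding suffix SA[i] is a, respectively b) is an assumption inferred from the example data, and the backward search is the standard rank-based formulation rather than necessarily the authors' exact procedure.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # the slides' ordering a < # < b


def suffix_array(text):
    """Naive construction: sort the 1-based start positions of all suffixes."""
    key = lambda i: [ORDER[ch] for ch in text[i:]]
    return [i + 1 for i in sorted(range(len(text)), key=key)]


def is_suffix_array(perm):
    """Reconstruct-and-compare check (NOT the O(n) test from the slides).

    In a suffix array, positions listed before n start with 'a' and
    positions listed after n start with 'b', so a permutation has exactly
    one candidate text; build it and compare suffix arrays.
    """
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False
    k = perm.index(n)  # rank of the suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        text[pos - 1] = "a" if rank < k else "#" if rank == k else "b"
    return suffix_array("".join(text)) == list(perm)


def preceding_char_bits(text, sa, ch):
    """Assumed meaning of Ba/Bb: bit is 1 iff suffix sa[i] is preceded by ch."""
    return [1 if pos > 1 and text[pos - 2] == ch else 0 for pos in sa]


def count_occurrences(text, pattern):
    """Backward search; returns (count, (s, e)) with a 1-based SA interval."""
    sa = suffix_array(text)
    ranks = {}
    for ch in "ab":
        acc = [0]  # acc[j] = number of 1s among the first j bits
        for bit in preceding_char_bits(text, sa, ch):
            acc.append(acc[-1] + bit)
        ranks[ch] = acc
    # Number of suffixes that sort before every suffix starting with ch.
    first = {"a": 0, "b": text.count("a") + 1}
    s, e = 1, len(text)
    for ch in reversed(pattern):  # i = m, m-1, ..., 1
        s = first[ch] + ranks[ch][s - 1] + 1
        e = first[ch] + ranks[ch][e]
        if s > e:
            return 0, None
    return e - s + 1, (s, e)
```

With these definitions, `suffix_array("abbaaba#")` gives 4, 7, 5, 1, 8, 3, 6, 2, while `is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6])` is false: the position of 8 forces the same text abbaaba#, whose suffix array is different. For the text abaaabbaaabaabb#, `count_occurrences(text, "aba")` narrows the interval from [1, 9] to [11, 13] to [6, 7], so P occurs twice, at positions SA[6] = 1 and SA[7] = 10.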
