A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes
Meng He, J. Ian Munro, and S. Srinivasa Rao
University of Waterloo

The Problem
• Initial problem
  Text searching: finding occurrences of a pattern string in a large (static) document
• Solution
  Text indexing: trading space for time
• New problem
  Succinct text indexes: reducing the space cost

Pattern Searching
• Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.
• Three types of queries
  Existential queries: Does P occur in T?
  Cardinality queries: How many times does P occur in T?
  Listing queries: Where does P occur in T?

Text Indexing
• Inverted files
  Word index
  Need to store the text as well as the index
• Suffix trees
  Efficient full-text index
  4n lg n to 6n lg n bits!
• Suffix arrays
  n lg n bits in basic form, but 3n lg n bits with LCP data

Applications
• Text databases
  Electronic encyclopedias, dictionaries, books, etc.
• Web search engines
  Google, AltaVista, etc.
• Bioinformatics
  Gene databases
• More…

Related Work
• Compressed suffix arrays
  Grossi & Vitter 2000
  Sadakane 2000
  Grossi, Gupta & Vitter 2003
• FM-index
  Ferragina & Manzini 2000 & 2001

Assumptions & Notation
• Alphabet: Σ = {a, b}
• Text: T[1..n], with T[n] = #, where a < # < b
• Pattern: P[1..m]

Permutations and Suffix Arrays
• An observation
  Permutations of {1, …, n}: n!
  Suffix arrays of binary texts of length n: 2^(n-1)
  Not all permutations are suffix arrays
• An example
  A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
    Text: abbaaba#
  A permutation: 4, 7, 1, 5, 8, 2, 3, 6
    Not a suffix array of any binary text

Two Features of Suffix Arrays
• Ascending-to-max
• Non-nesting
  Suffix array:        4 7 5 1 8 3 6 2
  Another permutation: 4 7 1 5 8 2 3 6

A Categorization Theorem
• A permutation is a suffix array iff it is both:
  Ascending-to-max
  Non-nesting
• An immediate application: checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words of memory.
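
The question of which permutations are suffix arrays can be explored with a brute-force sketch. This is not the O(n)-time check from the talk, and it does not test the ascending-to-max or non-nesting properties directly. It relies only on the deck's conventions (T[n] = #, a < # < b): the rank of position n in the permutation fixes which positions must hold a and which must hold b, so there is exactly one candidate text, and the permutation is a suffix array iff the naively sorted suffixes of that text reproduce it. All function names are hypothetical.

```python
from itertools import permutations

ORDER = {"a": 0, "#": 1, "b": 2}  # the deck's character ordering a < # < b


def suffix_array(text):
    """Naive suffix array with 1-based positions (simple sorting, not linear time)."""
    def key(start):
        return [ORDER[c] for c in text[start - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


def candidate_text(perm):
    """The only text over {a, b} ending in # that could have perm as its suffix array."""
    n = len(perm)
    pivot = perm.index(n)  # rank of the one-character suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        # Suffixes ranked before "#" must start with a, those after it with b.
        text[pos - 1] = "a" if rank < pivot else "#" if rank == pivot else "b"
    return "".join(text)


def is_suffix_array(perm):
    """True iff perm is the suffix array of some binary text ending in #."""
    perm = list(perm)
    if not perm or sorted(perm) != list(range(1, len(perm) + 1)):
        return False
    return suffix_array(candidate_text(perm)) == perm


def count_suffix_arrays(n):
    """Count the permutations of 1..n that are suffix arrays (exhaustive, small n only)."""
    return sum(is_suffix_array(p) for p in permutations(range(1, n + 1)))
```

With the two permutations from the example, `is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2])` is true and `is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6])` is false: both lead to the candidate text abbaaba#, but only the first matches its sorted suffixes. For n = 4, `count_suffix_arrays(4)` finds 8 = 2^3 suffix arrays among the 24 permutations.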

Application: Space Efficient Suffix Array
  Text: abaaabbaaabaabb#
  SA:   8  3  9  4 12  1 10  5 13 16  7  2 11 15  6 14
  Ba:   0  0  1  1  0  0  1  1  1  0  0  1  1  0  1  1
  Bb:   1  1  0  0  1  0  0  0  0  1  1  0  0  1  0  0

Basic Searching Algorithm: Answering Cardinality Queries
• Basic idea: backward search
  Start from the end of the pattern P
  For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]
• Example
  SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
  P = aba

More Algorithms and Tradeoffs
• Answering listing queries
• Speeding up the reporting of occurrences of long patterns
• Self-indexing
• Time-space tradeoff: multi-level structure

Putting it all together
Three index structures:

  Index     Space (bits)   Pattern searching time
  Index 1   n + o(n)       O(m) (existential & cardinality queries only)
  Index 2   2n + o(n)      O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg n) otherwise
  Index 3   O(n)           O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg^λ n) otherwise

Conclusion
• Summary
  A theorem that characterizes a permutation as the suffix array of a binary string
  An efficient algorithm checking whether a permutation is a suffix array
  Three space efficient text indexing methods

Conclusions (Continued)
• Related subsequent work
  Generalization to larger alphabets
• Open problem
  An O(n)-bit text index supporting searching in O(m + occ) time

Thank You.
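
As a supplement to the backward search slide, here is a minimal sketch of how the interval [s, e] can be narrowed one pattern character at a time. It assumes, as inferred from the example data, that Ba[i] (respectively Bb[i]) is 1 exactly when the character preceding suffix SA[i] in T is a (respectively b). Rank queries are answered with plain prefix-sum lists rather than the succinct o(n)-bit structures the indexes would use, the suffix array is built by naive sorting, and the suffix array itself is not consulted during the search. The function names are hypothetical, and the pattern is assumed to contain only a and b.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # the deck's character ordering a < # < b


def suffix_array(text):
    """Naive suffix array with 1-based positions (simple sorting, not linear time)."""
    def key(start):
        return [ORDER[c] for c in text[start - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


def bit_vectors(text, sa):
    """B_c[i] = 1 iff the character just before suffix SA[i] is c (assumed definition)."""
    bits = {"a": [], "b": []}
    for pos in sa:
        prev = text[pos - 2] if pos > 1 else None  # the suffix at position 1 has no predecessor
        for c in "ab":
            bits[c].append(1 if prev == c else 0)
    return bits


def backward_search(pattern, text):
    """Return the 1-based SA interval (s, e) of suffixes prefixed by pattern, or None."""
    n = len(text)
    bits = bit_vectors(text, suffix_array(text))
    # rank[c][i] = number of 1s in B_c[1..i], stored as a prefix-sum list
    rank = {}
    for c in "ab":
        sums = [0]
        for bit in bits[c]:
            sums.append(sums[-1] + bit)
        rank[c] = sums
    # Suffixes starting with a occupy ranks 1..count(a); "#" comes next; then b.
    first = {"a": 0, "b": text.count("a") + 1}
    s, e = 1, n
    for c in reversed(pattern):  # i = m, m-1, ..., 1
        s = first[c] + rank[c][s - 1] + 1
        e = first[c] + rank[c][e]
        if s > e:
            return None
    return s, e


def count_occurrences(pattern, text):
    """Cardinality query: number of occurrences of pattern in text."""
    interval = backward_search(pattern, text)
    return 0 if interval is None else interval[1] - interval[0] + 1
```

On the example text abaaabbaaabaabb# with P = aba, the interval goes from [1, 16] to [1, 9] (suffixes starting with a), then [11, 13] (starting with ba), then [6, 7] (starting with aba), so `count_occurrences("aba", "abaaabbaaabaabb#")` returns 2; SA[6] = 1 and SA[7] = 10 are the two occurrence positions.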