slides - Università di Pisa

advertisement
Fast Compressed Tries
through Path Decompositions
Roberto Grossi
Giuseppe Ottaviano*
Università di Pisa
* Part of the work done while at Microsoft Research Cambridge
Compacted tries
Node label
Branching character
three
trial
triangle
trie
triple
triply
t
h
r
ree
e
i
a
p
l
ε
e
ε
y
ε
n
ε
l
gle
ε
Applications
• String dictionaries
– With prefix lookup, predecessor, …
– Exploit prefix compression
• Monotone perfect hash functions
– “Hollow” or “Blind” tries [ALENEX 09]
– Binary tree (no need store branching chars)
– No need to store node labels, just lengths (skips)
Height vs. performance
• Tries can be deep – no guarantee on height
• Bad with pointer-based trees
– ~1 cache miss per child operation
• Worse with succinct tree encodings
– Need to access several directories
– Many cache misses per child operation
– Large constants hidden in the O(1)
Path decomposition
t
h
e
h
i
p
p
l
Recurse here with
suffix le
ε
y
ε
e
a
l
e
gle
r
ree
ε
t i
n
ε
l
gle
ε
Query: triple
Centroid path decomposition
• Decompose along the heavy paths
– choose the edge that has most descendants
• Height of the decomposed tree: O(log n)
– Usually lower
• Average height
Web Queries
Compacted trie
Centroid trie
Hollow trie
Centroid hollow trie
URLs
Synthetic
11.0
18.1
504.4
5.2
6.2
2.8
50.8
67.3
1005.3
8.0
9.2
2.8
Succinct encoding
• [PODS 08] presents a succinct data structure
for centroid path-decomposed tries
• Not practical: need complex operations on
succinct trees
• We introduce a simpler and practical encoding
• This encoding enables also simple
compression of the labels
Succinct encoding
t i
h
e
gle
p
l
L : t1ri2a1ngle
BP: ( (((
)
B : h epl
(spaces added for clarity)
• Node label written literally, interleaved with number of
other branching characters at that point in array L
• Corresponding branching characters in array B
• Tree encoded with DFUDS in bitvector BP
– Variant of Range Min-Max tree [ALENEX 10] to support
Find{Close,Open}, more space-efficient (Range Min tree)
Compression of L
...$...index.html$....html$....html$...index.html$
…
3 index
…
5 .html
…
Dictionary
...$...35$...5$...5$...35$
• Dictionary codewords shared among labels
• Codewords do not cross label boundaries ($)
• Use vbyte to compress the codeword ids
Compression of L
• Node labels (t1ri2a1ngle, l1e, …):
– each label is suffix of a string in the set
– interleaved with few “special characters” 1, 2, 3,…
• Compressible if strings are compressible
• Dictionary and parsing computed with
modified Re-Pair
– Domain-specific compression can be used instead
• Decompression overhead negligible
Experimental results (time)
• Experiments show gains in time comparable to
the gains in height
• Confirm that bottleneck is traversal operations
Web Queries
URLs
Synthetic
Trie
3.5
7.0
119.8
Centroid trie
2.4
4.3
5.1
16.6
22.4
462.7
Hollow trie
7.2
13.9
137.1
Centroid hollow trie
2.8
4.4
11.1
Hollow trie [ALENEX 09]
(microseconds, lower is better)
Code available at https://github.com/ot/path_decomposed_tries
Experimental results (space)
• For strings with many common prefixes, even
non-compressed trie is space-efficient
• Labels compression considerably increases
space-efficiency
• Decompression time overhead: ~10%
Web Queries
URLs
Synthetic
Hu-Tucker Front Coding
40.9%
24.4%
19.1%
Centroid trie
55.6%
22.4%
17.9%
Centroid trie + compression
31.5%
13.6%
0.4%
(compression ratio, lower is better)
Code available at https://github.com/ot/path_decomposed_tries
Thanks for your attention!
Questions?
References
• [ALENEX 10] D. Arroyuelo, R. Cánovas, G. Navarro, and K.
Sadakane. Succinct trees in practice. In ALENEX, pages
84–97, 2010.
• [ALENEX 09] D. Belazzougui, P. Boldi, R. Pagh, and S.
Vigna. Monotone minimal perfect hashing: searching a
sorted table with O(1) accesses. In SODA, pages 785–
794, 2009.
• [PODS 08] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and
J. S. Vitter. On searching compressed string collections
cache-obliviously. In PODS, pages 181–190, 2008.
Download