Indexable Compressed Bitvectors
02-714
Slides by Carl Kingsford

Raman, Raman, Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets, SODA 2002: 233-242

Operations on bit vectors

• rank1(S, i) := the number of 1 bits at or before position i in S.
• select1(S, j) := the position of the jth 1 bit in S.
• rank0(S, i) and select0(S, j) are defined analogously.

S[i] = "access bit i" = rank1(S, i) – rank1(S, i – 1)

Note: rank1(S, select1(S, j)) = j, so rank and select are inverses of each other.

Goal: rank and select in O(1) time while using small space.

RRR
Raman, Raman, Rao, SODA 2002

[Figure: the bit vector 010101001111010101010101010101110101110101010 cut into blocks of size u bits; block i is stored as the triple (wi, si, pi).]

wi = number of 1s in block i
si = space to represent pi
pi = index into tables of bit patterns

Tables of bit patterns for u = 3, one table per weight w. Each pattern is stored with its vector of rank1(i) values:

w    pattern    rank1(i), i = 1, 2, 3
0    000        0 0 0
1    001        0 0 1
1    010        0 1 1
1    100        1 1 1
2    011        0 1 2
2    110        1 2 2
2    101        1 1 2
3    111        1 2 3

RRR Space So Far

• Each wi is ≤ u, so it can be represented in $\lceil \log u \rceil$ bits.
• Each pi is an index into a table with $\binom{u}{w_i}$ entries, so it can be represented with $\left\lceil \log \binom{u}{w_i} \right\rceil$ bits.
• Each si is ≤ u, so it can be represented in $\lceil \log u \rceil$ bits.
• Together the tables contain $2^u$ entries. The rank vectors in table w = k are of size $u \log k$.

Prefix Sum Data Structure

Thm (Tarjan & Yao; Pagh; simplified). Let z1, ..., zk be integers such that $|z_i| = n^{O(1)}$ and $|z_i - z_{i-1}| = O(\log n)$. Then the list z1, ..., zk can be represented in O(k log log n) bits while allowing constant-time access.

Proof.
Use the following representation: store every (log n)-th integer in full as a "key frame" (z1, z5, z9, z13, ... in the illustration; k / log n integers in all). For every other zi, store only its difference from the nearest preceding key frame (for example |zi – z1| for z2, z3, z4).

[Figure: the list z1 z2 z3 ... z14 ..., with z1, z5, z9, z13 marked as key frames.]

• The "key frame" integers take at most $O\left(\frac{k}{\log n} \log n\right) = O(k)$ bits, because each is $n^{O(1)}$ and so fits in O(log n) bits.
• The ≈ k delta integers take O(k log log n) bits in total, because each takes O(log log n) bits: a delta spans fewer than log n consecutive steps of size O(log n), so its magnitude is $O(\log^2 n)$.

⟹ O(k + k log log n) = O(k log log n) bits total.

Prefix Sum Data Structure, 2

Apply the theorem to the per-block counts of 1s in the bit vector:

[Figure: the bit vector cut into blocks, with f1 f2 f3 f4 f5 ... written under the blocks.]

fi = number of 1s up through the end of block i

Condition 1: fi ≤ n
Condition 2: |fi+1 – fi| = O(log n) if u = O(log n)

⟹ k = n / u = n / log n
⟹ the prefix sums can be represented in O((n / log n) log log n) bits.

Summary: Prefix Sum Data Structure

Thm. The prefix-sum data structure used in RRR takes O((n / log n) log log n) space. It can answer prefix-sum queries in constant time.

Proof: To answer a prefixSum(x) query:
1. Find the key frame zi that is just before index x.
2. Return zi + (zx – zi), where the difference zx – zi is the value stored at position x.
Each step takes O(1) time.

(Nearly) Complete RRR Data Structure

[Figure: the bit vector cut into blocks of size u = O(log n) bits, with the arrays w, s, p, f and q written under the blocks.]

wi = number of 1s in block i
si = space to represent pi
pi = index into tables of bit patterns
f1 f2 f3 f4 f5 ... = prefix sums of wi, as in previous slides
q1 q2 q3 q4 q5 ... = prefix sums of si, as in previous slides

Plus the tables of bit patterns, one per weight w, as on the RRR slide.

Space Usage, 1

Blocks of size u = O(log n) bits, so there are n / log n blocks.

[Figure: the bit vector cut into blocks, with the arrays w1 w2 ..., s1 p1 s2 p2 ..., and f1 f2 ... beneath it.]
Space of each array:
• w array: wi ≤ u, so (n / log n) log log n bits
• s array: si ≤ u, so (n / log n) log log n bits
• p array: pi = index into tables of bit patterns (analyzed on the following slides)
• f array (prefix sums of wi): (n / log n) log log n bits
• q array (prefix sums of si): (n / log n) log log n bits

wi ≤ u because there are at most u 1s in a block of size u.

Let $B(w_i, u) = \left\lceil \log_2 \binom{u}{w_i} \right\rceil$ = # of bits needed to select a subset of wi elements from a universe of u elements.

si = B(wi, u) ≤ u because the plain u-long bit vector could store the subset.

Space Usage, 2

Space to store the p array, where s is the number of blocks:

$$\sum_{i=1}^{s} \left\lceil \log_2 \binom{u}{w_i} \right\rceil \;<\; s + \sum_{i=1}^{s} \log_2 \binom{u}{w_i} \;\le\; s + \log_2 \binom{\sum_{i=1}^{s} u}{\sum_{i=1}^{s} w_i} \;\le\; s + \left\lceil \log_2 \binom{n}{w} \right\rceil$$

First step (the ceiling): dropping each ceiling costs less than 1 bit per block, hence the extra s.

Second step (the sums):

$$\sum_{i=1}^{s} \log \binom{u}{w_i} = \log \prod_{i=1}^{s} \binom{u}{w_i}$$

and the product is
= # of ways to pick ∑wi objects from this line where we take wi from segment i
≤ # of ways to pick ∑wi objects from this line.

[Figure: a line of objects split into segments of length u, with w1, ..., w6 objects chosen in the respective segments.]

Last step: $\sum_{i=1}^{s} u = n$ and $\sum_{i=1}^{s} w_i = w$.

Space Usage, 3

s = m/u = m / log m (m denotes the length of the bit vector, the same quantity written n on the other slides).

So the total space for the p array is:

B(w, n) + m / log m

where B(w, n) is the minimum space to select a set of w 1 bits out of n.

Space Usage, 4: The Tables

The tables are small. The rank vector is u-long, and each entry is O(log w) bits. For every weight w there are $\binom{u}{w}$ entries, so the total is

$$\sum_{w=0}^{u} \binom{u}{w}\, u \log w \;\le\; \sum_{w=0}^{u} \binom{u}{w}\, u \log u \;=\; u \log u \sum_{w=0}^{u} \binom{u}{w} \;=\; u\, 2^{u} \log u$$

For this to be sublinear, the constant in u = O(log n) must be below 1. For example, with $u = \frac{1}{2} \log_2 n$ we get $2^u = \sqrt{n}$, and the tables take $O(\sqrt{n} \log n \log\log n) = o(n)$ bits.

Summary: Space Usage

Thm. The RRR data structure takes B(w, n) + O(m log log m / log m) space.

So: how do we solve rank1(S, i)?

rank1(S, i)

1. Find the block x = ⌊i/u⌋ that i is in.
2.
rank1(S, i) = prefixSum(x) + T[wx][i – xu]

• prefixSum(x): the number of 1s in the blocks before block x, computed in constant time.
• T[wx]: the rank table for weight wx, read at the entry for pattern px.
• i – xu: the appropriate bit in the xth block.

Each step takes constant time, so the entire rank computation takes O(1) time.

Summary

• Can store a bit vector in minimum space (B(w, m)) + O(m log log m / log m).
• Despite using asymptotically less space than the naive representation, you can answer
  - rank
  - select
  - access
  queries in constant time.
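The rank1 procedure from the slides can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not the structure from the paper: the names `build_tables` and `RRRSketch` are hypothetical, positions are 0-indexed, and the arrays w, p and f are ordinary Python lists rather than bit-packed fields. The s and q arrays are therefore not needed here, and the prefix sums are stored in full instead of as key frames plus deltas, so the sketch shows the query logic only and not the space bound.

```python
def build_tables(u):
    """For each weight w, list every u-bit pattern with w ones,
    paired with its vector of in-block rank1 values."""
    tables = {w: [] for w in range(u + 1)}
    for value in range(2 ** u):
        # Most significant bit first, so the pattern reads left to right.
        bits = [(value >> (u - 1 - k)) & 1 for k in range(u)]
        ranks, total = [], 0
        for b in bits:
            total += b
            ranks.append(total)
        tables[sum(bits)].append((bits, ranks))
    return tables


class RRRSketch:
    """Hypothetical, unpacked illustration of the block / table layout."""

    def __init__(self, bits, u):
        self.u = u
        self.tables = build_tables(u)
        # Reverse lookup: (weight, pattern) -> index p into tables[weight].
        index = {
            w: {tuple(pattern): j for j, (pattern, _) in enumerate(rows)}
            for w, rows in self.tables.items()
        }
        # Pad with zeros so the last block is full; this adds no 1 bits.
        padded = list(bits) + [0] * (-len(bits) % u)
        self.w = []    # w[x] = number of 1s in block x
        self.p = []    # p[x] = index of block x's pattern in tables[w[x]]
        self.f = [0]   # f[x] = number of 1s in the blocks before block x
        for start in range(0, len(padded), u):
            block = tuple(padded[start:start + u])
            weight = sum(block)
            self.w.append(weight)
            self.p.append(index[weight][block])
            self.f.append(self.f[-1] + weight)

    def rank1(self, i):
        """Number of 1 bits at positions 0..i inclusive."""
        x = i // self.u                                  # step 1: block of i
        _, ranks = self.tables[self.w[x]][self.p[x]]     # table row for block x
        return self.f[x] + ranks[i - x * self.u]         # step 2: prefix + in-block


if __name__ == "__main__":
    sketch = RRRSketch([0, 1, 0, 1, 0, 1, 0, 0, 1], u=3)
    print(sketch.rank1(5))  # prints 3
```

Each query touches one entry of w, p and f and one table cell, which mirrors the constant-time argument on the slide. A packed implementation would locate pattern px inside the variable-width p array using the prefix sums q of the si.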