Document

advertisement
Indexable Compressed
Bitvectors
02-714
Slides by Carl Kingsford
Ramen, Ramen, Rao. Succinct Indexable Dictionaries with Applications to Encoding
k-ary Trees, Prefix Sums and Multisets, SODA 2002: 233-242
Operations on bit vectors
•
•
•
rank1(S,i) := the number of 1 bits at or before position i in S.
select1(S,j) := the position of the jth 1 bit in S.
rank0(S,i) and select0(S,j) are defined analogously.
S[i] = “access bit i” = rank1(S, i) – rank1(S, i – 1)
Note: rank1(S, select1(S, j)) = j, so rank and select are inverses of
each other.
Goal: rank and select in O(1) time while using small space.
RRR
Ramen, Ramen, Rao, FOCS 2002
blocks of size u bits
010101001111010101010101010101110101110101010
w1 w2 w3 w4
s1 s2 s3 s4
p1 p2 p3 p4
w5
s5
p5
w = 0 rank1(i)
000 000
w = 1 rank1(i)
...
...
...
wi = number of 1s in block i
si = space to represent pi
pi = index into tables of bit
patterns
w = 2 rank1(i)
001
001
011
012
010
011
110
122
100
111
101
112
w = 3 rank1(i)
111 123
blocks of size u bits
RRR Space So Far
010101001111010101010101010101110101110101010
w1 w2 w3 w4
s1 s2 s3 s4
p1 p2 p3 p4
w5
s5
p5
w = 0 rank1(i)
000 000
w = 1 rank1(i)
...
...
...
wi = number of 1s in block i
si = space to represent pi
pi = index into tables of bit
patterns
w = 2 rank1(i)
001
001
011
012
010
011
110
122
100
111
101
112
w = 3 rank1(i)
111 123
Each wi is ≤ u, so can be represented in dlog ue bits.
Each pi is an index into a table with
represented with
⇠
log
✓
u
wi
◆⇡
✓
◆
u
entries, so can be
wi
bits.
Each si is ≤ u, so can be represented in dlog ue bits.
Tables contain 2u entries. The rank vectors in table w = k are of
size ulog k
Prefix Sum Data Structure
Thm (Tarjan & Yao; Pagh; simplified). Let z1, ..., zk be integers such
that |zi| = nO(1) and |zi – zi-1| = O(log n), then the list z1, ..., zk can be
represented in O(k log log n) bits allowing for constant access.
Proof. Use the following representation:
|zi – z1|
z1
• • •
z5
k / log n integers
z9
z13
z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 ...
The “key frame” integers take at most O
✓
k
log n
log n
◆
= O(k) bits.
The ≈ k delta integers take total O(k log log n) bits because each takes
O(log log n) bits.
⟹ O(k + klog log n) = O(k log log n) bits total.
Prefix Sum Data Structure, 2
Thm (Tarjan & Yao; Pagh; simplified). Let z1, ..., zk be integers such that |zi| =
nO(1) and |zi – zi-1| = O(log n), then the list z1, ..., zk can be represented in O(k
log log n) bits allowing for constant access.
010101001111010101010101010101110101110101010
f1
f2
f3
f4
f5
...
fi = number of 1s up through
the end of block i
Condition 1: fi ≤ n
Condition 2: |fi+1 – fi| = O(log n) if u = O(log n)
⟹ k = n / u = n / log n
⟹ prefix sums can be represented in (n / log n) log log n bits.
Summary: Prefix Sum Data Structure
Thm. The prefix-sum data structure used in RRR takes O((n / log n)
log log n) space. It can answer prefix-sum queries in constant time.
Proof: To answer a prefixSum(x) query:
1. find the zi that is just before index x.
2. return zi + the zx – zi that is stored at position x.
Each step takes O(1) time.
(Nearly) Complete RRR Data
Structure
blocks of size u =
O(log n) bits
010101001111010101010101010101110101110101010
w1 w2 w3 w4
s1 s2 s3 s4
p1 p2 p3 p4
w5
s5
p5
...
...
...
wi = number of 1s in block i
si = space to represent pi
pi = index into tables of bit patterns
f1
f2
f3
f4
f5
...
Prefix sums of wi as in previous slides
q1
q2
q3
q4
q5
...
Prefix sums of si as in previous slides
w = 0 rank1(i)
000 000
w = 1 rank1(i)
w = 2 rank1(i)
001
001
011
012
010
011
110
122
100
111
101
112
w = 3 rank1(i)
111 123
Space Usage, 1
blocks of size u =
O(log n) bits
n / log n blocks
010101001111010101010101010101110101110101010
w1
w2
w3
w4
w5
...
s1
p1
s2
p2
s3
p3
s4
p4
s5
p5
...
...
f1
f2
f3
f4
f5
...
wi < u (n/log n)log log n
si < u (n/log n)log log n
pi = index into tables of bit patterns
(n / log n) log log n
q1
q2
q3
q4
q5
...
(n / log n) log log n
wi < u because there are at most u 1s in a block of size u.
⇠
Let B(wi , u) = log2
✓
u
wi
◆⇡
= # of bits needed to select a subset
of wi elements from a universe of u
elements.
si = B(wi, u) < u b/c the plain u-long bit vector could store the subset.
Space Usage, 2
p1
s ⇠
X
i=1
p2
log2
p3
✓
u
wi
p4
◆⇡
p5
<s+
s
X
the floor
i=1
...
log2
pi = index into tables of bit patterns
✓
u
wi
◆
✓ Ps
◆
⇠
✓ ◆⇡
u
n
i=1
 s + log2 Ps
 s + log2
w
i=1 wi
the sums
Space Usage, 2
p1
p2
s ⇠
X
i=1
log2
p3
✓
u
wi
p4
◆⇡
p5
<s+
...
s
X
log2
i=1
s
the floor X
pi = index into tables of bit patterns
✓
u
wi
✓
◆
u
log
wi
i=1
◆
✓ Ps
◆
⇠
✓ ◆⇡
u
n
i=1
 s + log2 Ps
 s + log2
w
i=1 wi
= log
◆
s ✓
Y
u
i=1
the sums
wi
≤
= # of ways to pick ∑wi objects from this line
= # of ways to pick ∑wi objects from this line where we take wi from
segment i
•
u
•
w1
•
u
••
w2
u
• ••
u
•
u
••• •
w3
w4
w5
•
u
w6
Space Usage, 3
p1
s ⇠
X
i=1
p2
log2
p3
✓
u
wi
◆⇡
space to
store p array
p4
p5
<s+
s
X
i=1
...
pi = index into tables of bit patterns
log2
✓
u
wi
◆
✓ Ps
◆
⇠
✓ ◆⇡
u
n
i=1
 s + log2 Ps
 s + log2
w
i=1 wi
s = m/u = m/ log m
So the total space for the p array is: B(w, n) + m/ log m
minimum space
to select set of w
1 bits out of n.
Space Usage, 4: The Tables
w = 0 rank1(i)
000 000
w = 1 rank1(i)
w = 2 rank1(i)
001
001
011
012
010
011
110
122
100
111
101
112
The tables are tiny:
w = 3 rank1(i)
111 123
The rank vector is u-long, and
each entry is O(log w)
u ✓ ◆
X
u
w=0
w
for every weight, we
this have many entries
u log w  u
u ✓ ◆
X
u
w
w=0
log u
✓ ◆
u
= u log u
w
w=0
u
X
= u2u log u
= O ((log n)(log log n)(log n))
2
= O log n log log n
Summary: Space Usage
Thm. The RRR data structure takes O(B(w,n) + m log log m / log m) space.
So: how do we solve rank1(S, i)?
rank1(S, i)
1. Find the block x = bi/uc that i is in.
2. rank1(S, i ) = prefixSum(x) + T[wx][i – xu]
compute in constant
time
The rank table
for weight wx
The
appropriate bit
in the xth block.
Each step takes constant time, so the entire rank computation takes
O(1) time.
Summary
•
Can store bit vector in minimum space
(B(w,m)) + O(m log log m / log m)
•
Despite using asymptotically less space than
the naive representation, you can answer:
-
rank
select
access
queries in constant time
Download