Estimating Distinct Elements, Optimally

An Optimal Algorithm for the
Distinct Elements Problem
Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010
Problem Description
• Given a long stream of values from a universe of size n
– each value can occur any number of times
– count the number F0 of distinct values
• See values one at a time
• One pass over the stream
• Too expensive to store the set of distinct values exactly (a naive baseline is sketched after this list)
• Algorithms should:
– Use a small amount of memory
– Have fast update time (per value processing time)
– Have fast recovery time (time to report the answer)
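For reference (not part of the slides), a naive exact solution in Python makes the memory problem concrete: it is one-pass with fast updates, but its memory grows linearly with the number of distinct values.

    def exact_distinct(stream):
        # Exact one-pass counting: stores every distinct value seen,
        # so memory grows to ~F0 entries of ~log n bits each.
        seen = set()
        for value in stream:
            seen.add(value)
        return len(seen)

    # exact_distinct([3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3]) == 8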
Randomized Approximation Algorithms
3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
• Consider algorithms that store a subset S of distinct values
• E.g., S = {3, 9, 32, 265}
• The main drawback is that S must be large in order to know whether the next value is a new distinct value
• Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes F0 exactly must use ≈ F0 bits of memory
• Hence, algorithms must be randomized and settle for an approximate solution: output F ∈ [(1−ε)F0, (1+ε)F0] with good probability
Problem History
• Long sequence of work on the problem
• Flajolet and Martin introduced the problem (FOCS 1983)
• Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet,
Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram,
Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar,
Szegedy, Tirthapura, Trevisan, Varghese, Woodruff
• Previous best algorithm:
– O(ε⁻² log log n + log n) bits of memory and O(ε⁻²) update and reporting time
• Known lower bound on the memory:
– Ω(ε⁻² + log n) bits
• Our result:
– Optimal O(ε⁻² + log n) bits of memory and O(1) update and reporting time
Previous Approaches
• Suppose we randomly hash the F0 distinct values into a hash table of 1/ε² buckets and keep track of the number C of non-empty buckets
• If F0 < 1/ε², there is a way to estimate F0 up to (1 ± ε) from C (one such estimator is sketched below)
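A minimal sketch of one such estimator, assuming the classic "linear counting" inversion (the slides do not commit to a specific formula, so the estimator and hash choices below are illustrative):

    import math, random

    def estimate_from_buckets(stream, eps, seed=0):
        # Hash values into K = 1/eps^2 buckets; C = number of non-empty buckets.
        K = max(1, int(1 / eps ** 2))
        salt = random.Random(seed).getrandbits(64)
        occupied = [False] * K
        for value in stream:
            occupied[hash((salt, value)) % K] = True  # stand-in for a random hash
        C = sum(occupied)
        if C == K:
            return float("inf")  # table saturated: C carries no information
        # Each distinct value misses a fixed bucket with probability 1 - 1/K,
        # so E[C] = K * (1 - (1 - 1/K)^F0); solving for F0 gives:
        return math.log(1 - C / K) / math.log(1 - 1 / K)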
• Problem: if F0 ≫ 1/ε², then with high probability every bucket contains a value, so C carries no information
• Solution: randomly choose S_{log n} ⊆ S_{log n − 1} ⊆ S_{log n − 2} ⊆ … ⊆ S_1 ⊆ {1, 2, …, n}, where |S_i| ≈ n/2^i
• Problem: it takes (1/ε²)·log n bits of memory to keep track of this information
stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
S_i = {1, 3, 7, 9, 265}
i-th substream (the stream values that lie in S_i): 3, 265, 3, 9, 7, 9, 3, …
• Run the hashing procedure on each substream
• There is an i for which the number of distinct values in the i-th substream is ≈ 1/ε²
• The hashing procedure on the i-th substream then works; the full layered scheme is sketched below
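Putting the pieces together, a sketch of the layered scheme under illustrative assumptions (the subsampling rule, the 0.7 saturation threshold, and the hash choices are not the paper's exact procedure):

    import math, random

    def estimate_f0(stream, eps, seed=0, levels=32):
        # One table of K = 1/eps^2 buckets per level; membership in S_i is
        # simulated by "the hash has >= i trailing zero bits", so a value
        # survives to level i with probability 2^-i (|S_i| ~ n/2^i).
        K = max(1, int(1 / eps ** 2))
        rng = random.Random(seed)
        s1, s2 = rng.getrandbits(64), rng.getrandbits(64)
        tables = [[False] * K for _ in range(levels)]
        for value in stream:
            h = hash((s1, value)) & 0xFFFFFFFF
            depth = 0
            while depth + 1 < levels and (h >> depth) & 1 == 0:
                depth += 1
            b = hash((s2, value)) % K
            for i in range(depth + 1):
                tables[i][b] = True  # value lies in S_0, ..., S_depth
        for i in range(levels):
            C = sum(tables[i])
            if C < 0.7 * K:  # first level whose table is not saturated
                # invert E[C] as before, then undo the 2^-i subsampling
                est = math.log(1 - C / K) / math.log(1 - 1 / K)
                return est * (2 ** i)
        return float("inf")  # all levels saturated (does not happen w.h.p.)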
Our Techniques
Observation:
- Have 1/ε² global buckets
- In each bucket, keep track of the largest index i for which S_i contains a value hashed to the bucket
- This gives O(ε⁻² log log n) bits of memory (see the sketch after this list)
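A minimal sketch of this bucket structure (the trailing-zeros level rule and all names are illustrative):

    def update_bucket(indices, value, K, s1=1, s2=2):
        # indices has K = 1/eps^2 entries; entry b holds the largest level i
        # such that some value in S_i hashed to bucket b (-1 = empty bucket).
        # Each entry is a number in {0, ..., log n}, i.e. log log n bits,
        # for O(eps^-2 * log log n) bits in total.
        h = hash((s1, value)) & 0xFFFFFFFF
        level = (h & -h).bit_length() - 1 if h else 32  # trailing zeros
        b = hash((s2, value)) % K
        indices[b] = max(indices[b], level)

    # usage: indices = [-1] * K, then update_bucket(indices, v, K) per value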
New Ideas:
- Can show that, with high probability, at every point in the stream most buckets contain roughly the same index
- So we can just keep track of each bucket's offset from this common index
- We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to update offsets efficiently
- Occasionally we need to decrement all offsets; this work can be spread across multiple updates (see the sketch below)
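A sketch of the offset idea with deferred decrements; the cap value, the lazy sweep, and all names are illustrative assumptions (the paper additionally packs the offsets into machine words via fast variable-length-array algorithms):

    class OffsetTable:
        # Each bucket stores a small offset from a common base index rather
        # than a full level index; offsets above CAP are clamped, which is
        # safe when large deviations from the common index are rare.
        CAP = 7  # offsets fit in 3 bits

        def __init__(self, K):
            self.K = K
            self.base = 0    # the common index
            self.off = [0] * K
            self.delta = 0   # base raise not yet applied ...
            self.ptr = 0     # ... to buckets >= ptr

        def _pending(self, b):
            return self.delta if b >= self.ptr else 0

        def level(self, b):
            # true level of bucket b, accounting for the lazy sweep
            return self.base + self.off[b] - self._pending(b)

        def record(self, b, new_level):
            if new_level > self.level(b):
                self.off[b] = min(new_level - self.base + self._pending(b), self.CAP)
            self._step()  # O(1) deferred rebasing work per update

        def raise_base(self, by=1):
            # instead of decrementing all K offsets at once, defer the work
            assert self.delta == 0  # sketch: one sweep at a time
            self.base += by
            self.delta, self.ptr = by, 0

        def _step(self):
            # advance the lazy sweep by one bucket per stream update
            if self.delta and self.ptr < self.K:
                self.off[self.ptr] = max(self.off[self.ptr] - self.delta, 0)
                self.ptr += 1
                if self.ptr == self.K:
                    self.delta = 0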