An Optimal Algorithm for the Distinct Elements Problem
Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010

Problem Description
• Given a long stream of values from a universe of size n
  – each value can occur any number of times
  – count the number F0 of distinct values
• See values one at a time
• One pass over the stream
• Too expensive to store the set of distinct values
• Algorithms should:
  – Use a small amount of memory
  – Have fast update time (per-value processing time)
  – Have fast recovery time (time to report the answer)

Randomized Approximation Algorithms
stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
• Consider algorithms that store a subset S of distinct values
• E.g., S = {3, 9, 32, 265}
• Main drawback is that S needs to be large to know whether the next value is a new distinct value
• Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes F0 exactly must use ≈ F0 memory
• Hence, algorithms must be randomized and settle for an approximate solution: output an estimate F̃ ∈ [(1−ε)F0, (1+ε)F0] with good probability

Problem History
• Long sequence of work on the problem
• Flajolet and Martin introduced the problem, FOCS 1983
• Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet, Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram, Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar, Szegedy, Tirthapura, Trevisan, Varghese, Woodruff
• Previous best algorithm: O(ε⁻² log log n + log n) bits of memory and O(ε⁻²) update and reporting time
• Known lower bound on the memory: Ω(ε⁻² + log n) bits
• Our result: optimal O(ε⁻² + log n) bits of memory and O(1) update and reporting time

Previous Approaches
• Suppose we randomly hash the F0 distinct values into a hash table of 1/ε² buckets and keep track of the number C of non-empty buckets
• If F0 < 1/ε², there is a way to estimate F0 up to (1 ± ε) from C (see the first sketch below)
• Problem: if F0 ≫ 1/ε², with high probability every bucket contains a value, so C carries no information
• Solution: randomly choose nested subsets S_{log n} ⊆ S_{log n − 1} ⊆ ⋯ ⊆ S_1 ⊆ {1, 2, …, n}, where |S_i| ≈ n/2^i

stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
S_i = {1, 3, 7, 9, 265}
i-th substream: 3, 265, 3, 9, 7, 9, 3, …

• Run the hashing procedure on each substream
• There is an i for which the number of distinct values in the i-th substream is ≈ 1/ε²
• The hashing procedure on that i-th substream works
• Problem: it takes 1/ε² · log n bits of memory to keep track of all this information

Our Techniques
Observation:
- Keep 1/ε² global buckets
- In each bucket, keep track of the largest index i for which S_i contains a value hashed to the bucket
- Since each index fits in log log n bits, this gives O(1/ε² · log log n) bits of memory (see the second sketch below)

New Ideas:
- Can show that, with high probability, at every point in the stream most buckets contain roughly the same index
- So we can just keep track of each bucket's offset from this common index
- We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to update the offsets efficiently (see the third sketch below)
- Occasionally we need to decrement all offsets; this work can be spread across multiple updates
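
First sketch: bucket counting. To make the "estimate F0 from C" step concrete, here is a minimal Python sketch, illustrative only and not the paper's algorithm: it uses Python's built-in hash salted with a random value as a stand-in for a random hash function, and the name estimate_distinct is ours.

```python
import math
import random

def estimate_distinct(stream, eps=0.1, seed=0):
    """Toy bucket-counting estimator (assumes F0 < 1/eps^2).

    Hash each value into K = 1/eps^2 buckets, remember only which buckets
    are non-empty, and invert E[C] = K * (1 - (1 - 1/K)^F0) to recover an
    estimate of F0 from the number C of non-empty buckets.
    """
    K = int(1 / eps ** 2)
    salt = random.Random(seed).getrandbits(64)  # stand-in for a random hash
    nonempty = [False] * K
    for v in stream:
        nonempty[hash((salt, v)) % K] = True
    C = sum(nonempty)
    if C == K:
        return None  # table saturated: C carries no information
    # Solve C = K * (1 - (1 - 1/K)^F0) for F0.
    return math.log(1 - C / K) / math.log(1 - 1 / K)
```

On the example stream above, estimate_distinct([3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4], eps=0.05) should return a value close to 11, the true number of distinct elements, since with K = 400 buckets and 11 distinct values the table is far from saturated.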
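
Second sketch: one table, max index per bucket. This sketch (names ours, not the paper's) combines the subsampling levels with the observation above: a single table of K = 1/ε² buckets, each storing only the largest subsampling index seen, from which the non-empty count C at every level can be recovered. Membership in S_i is simulated by trailing zero bits of a hash, so the sets are nested with |S_i| ≈ n/2^i as required; the 0.7 saturation threshold is an illustrative constant, not the paper's.

```python
import math
import random

class MaxIndexSketch:
    """Illustrative: K = 1/eps^2 global buckets, each storing only the
    largest subsampling index i seen, i.e. O(log log n) bits per bucket.

    Value v lies in substream S_i when its hash has at least i trailing
    zero bits, so S_{i+1} is a subset of S_i and |S_i| is about n / 2^i.
    """

    def __init__(self, eps=0.1, logn=32, seed=0):
        self.K = int(1 / eps ** 2)
        self.logn = logn
        self.salt = random.Random(seed).getrandbits(64)
        self.max_index = [-1] * self.K  # -1 means "bucket never hit"

    def update(self, v):
        h = hash((self.salt, v)) & ((1 << self.logn) - 1)
        # deepest substream containing v = number of trailing zero bits of h
        i = self.logn if h == 0 else (h & -h).bit_length() - 1
        b = hash((self.salt, "bucket", v)) % self.K  # independent bucket hash
        if i > self.max_index[b]:
            self.max_index[b] = i

    def report(self):
        # Bucket b is non-empty for substream i exactly when max_index[b] >= i,
        # so this one table yields the non-empty count C at every level.
        for i in range(self.logn + 1):
            C = sum(1 for m in self.max_index if m >= i)
            if C < 0.7 * self.K:  # first unsaturated level (illustrative cutoff)
                # At this level the inversion from the first sketch
                # approximates F0 / 2^i; scale back up by 2^i.
                est = math.log(1 - C / self.K) / math.log(1 - 1 / self.K)
                return est * 2 ** i
        return None  # every level saturated (does not happen for i near logn)
```

Storing only the per-bucket maximum is what drops the space from 1/ε² · log n to O(1/ε² · log log n): each entry is a single index in {0, …, log n} rather than one bit per level.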
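
Third sketch: offsets from a common index. Finally, a cartoon of the last two bullets. This hypothetical OffsetTable (our name, not the paper's word-packed structure) only shows the invariant being maintained, stored index = base + offset, with each offset kept small; the eager O(K) decrement pass here is exactly the work the paper spreads across multiple updates to get O(1) worst-case time.

```python
class OffsetTable:
    """Cartoon of the offset trick. Most offsets stay small because, with
    high probability, most buckets track roughly the same common index."""

    OFFSET_CAP = 7  # keep offsets in [0, 7], i.e. 3 bits each (illustrative)

    def __init__(self, K):
        self.base = 0            # shared common index
        self.offsets = [0] * K   # small per-bucket offsets

    def get(self, b):
        # the index a bucket actually represents
        return self.base + self.offsets[b]

    def record(self, b, index):
        """Record that bucket b saw subsampling index `index`."""
        if index - self.base <= self.offsets[b]:
            return  # not a new maximum for this bucket
        self.offsets[b] = index - self.base
        if self.offsets[b] > self.OFFSET_CAP:
            # Raise the common index and decrement every offset. The paper
            # deamortizes this pass over later updates; here it is eager.
            shift = self.offsets[b] - self.OFFSET_CAP
            self.base += shift
            # Clamping at 0 only rounds up the few buckets that lag far
            # behind the common index.
            self.offsets = [max(0, off - shift) for off in self.offsets]
```

After a decrement, bucket b still reads back its true index (base + OFFSET_CAP equals the recorded value), and because offsets fit in O(1) bits each, all 1/ε² of them pack into O(ε⁻²) bits of machine words, matching the optimal space bound.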