Sketching, Sampling and other Sublinear Algorithms: Streaming
Alex Andoni (MSR SVC)

A scenario

[Figure: a stream of IP packets (131.107.65.14, 18.9.22.69, 80.97.56.20, ...) passing through a router.]

Challenge: compute something on this stream using small space.
Examples of "something":
• # distinct IPs
• max frequency
• other statistics...

IP               Frequency
131.107.65.14    3
18.9.22.69       2
80.97.56.20      2
128.112.128.81   9
127.0.0.1        8
257.2.5.7        0
7.8.20.13        1

Sublinear: a panacea?

A sub-linear space algorithm for solving the Travelling Salesperson Problem? Sorry, perhaps a different lecture.
Even very simple problems are hard to solve in sublinear space. Ex: what is the count of distinct IPs seen?
Will settle for:
• Approximate algorithms, with 1+ε approximation:
  true answer ≤ output ≤ (1+ε) · (true answer)
• Randomized: the above holds with probability 95%.
A quick and dirty way to get a sense of the data.

Streaming data

• Data passing through a router.
• Data stored on a hard drive, or streamed remotely: it is more efficient to do a linear scan over the hard drive, with the working memory being the (smaller) main memory.

Application areas

Data can come from:
• network logs, sensor data
• real-time data
• search queries, served ads
• databases (query planning)
• ...

Problem 1: # distinct elements

Problem: compute the number of distinct elements in the stream.
Example stream: 2 5 7 5 5

i   Frequency
2   1
5   3
7   1

• Trivial solution: O(m) space for m distinct elements.
• Will see: O(log m) space (approximate).

Distinct Elements: idea 1 [Flajolet-Martin'85, Alon-Matias-Szegedy'96]

Algorithm DISTINCT:
  Initialize: minHash = 1; hash function h: U → [0,1]
  Process(int i):
    if (h(i) < minHash)
      minHash = h(i);
  Output: 1/minHash - 1

Repeats of the same element i don't matter: they hash to the same value.
"Analysis": the hash values of m distinct elements partition [0,1] into segments of expected length E[segment] = 1/(m+1). Hence minHash ≈ 1/(m+1), and the output 1/minHash - 1 ≈ m.
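To see idea 1 in action, here is a minimal Python sketch (an illustration, not the lecture's code). The idealized random hash function h: U → [0,1] is simulated by caching one uniform draw per element; the cache exists only for the simulation and would defeat the small-space goal, so a real implementation would use a fixed hash function instead.

import random

class Distinct:
    """Idea 1: track the minimum hash value over the stream.

    If m distinct elements are hashed uniformly into [0,1], the minimum
    hash value has expectation 1/(m+1), so 1/minHash - 1 estimates m.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.h = {}            # simulated random hash function U -> [0,1]
        self.min_hash = 1.0

    def _hash(self, i):
        # repeats of the same element get the same hash, so they don't matter
        if i not in self.h:
            self.h[i] = self.rng.random()
        return self.h[i]

    def process(self, i):
        self.min_hash = min(self.min_hash, self._hash(i))

    def output(self):
        return 1.0 / self.min_hash - 1.0

d = Distinct()
for x in [2, 5, 7, 5, 5]:
    d.process(x)
print(d.output())   # a noisy estimate of the true count, 3

A single hash function gives a high-variance estimate; the accuracy boost discussed below (repeating with O(1/ε²) independent hash functions) applies here as well.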
Distinct Elements: idea 2

Store minHash approximately: keep only the number of zeros between the binary point and the first 1 of the hash value, e.g., ZEROS(0.0000001100101) = 6. Storing minHash exactly takes O(log n) bits; storing ZEROS(minHash) needs only O(log log n) bits.

Algorithm DISTINCT (idea 2):
  Initialize: minHash2 = 0; hash function h into [0,1]
  Process(int i):
    if (h(i) < 1/2^minHash2)
      minHash2 = ZEROS(h(i));
  Output: 2^minHash2

Randomness: 2-wise independence is enough!
Better accuracy using more space: 1+ε error by repeating O(1/ε²) times with different hash functions. HyperLogLog achieves this with just one hash function [FFGM'07].

Problem 2: max count (heavy hitters)

Problem: compute the maximum frequency of an element in the stream.
Example stream: 2 5 7 5 5

i   Frequency
2   1
5   3
7   1

Bad news: it is hard to distinguish whether some element repeated at all (max = 1 vs max = 2).
Good news: we can find the "heavy hitters", i.e., the elements with frequency > (total frequency)/s, using space proportional to s.

Heavy Hitters: CountMin [Charikar-Chen-Farach-Colton'04, Cormode-Muthukrishnan'05]

Algorithm CountMin:
  Initialize(w, L):
    array Sketch[L][w];
    L hash functions h[0..L-1], into {0,...,w-1}
  Process(int i):
    for (j = 0; j < L; j++)
      Sketch[j][h[j](i)] += 1;
  Output:
    foreach i in PossibleIP {
      freq[i] = int.MaxValue;
      for (j = 0; j < L; j++)
        freq[i] = min(freq[i], Sketch[j][h[j](i)]);
    }
    // freq[] is the frequency estimate

[Figure: the stream 2 5 7 5 5 fed into an L × w array of counters; each element increments one counter per row, and colliding elements share a counter. The recovered estimates are freq[2] = 1, freq[5] = 3, freq[7] = 1, and also freq[11] = 1 for an element that never appeared: CountMin only ever overestimates.]

Heavy Hitters: analysis

In each row, freq[i] = (frequency of i) + "extra mass" from colliding elements.
Expected "extra mass" ≤ (total mass)/w, so by Markov's inequality each row's extra mass is small, say ≤ 2·(total mass)/w, with probability > 1/2.
Taking the minimum over L = O(log m) rows makes the estimate good with high probability, simultaneously for all m elements.
Finally, compute the heavy hitters from freq[].

Problem 3: Moments

Problem: compute the second frequency moment F_2 = Σ_i f(i)² (a variance-like statistic), or the higher moments F_k = Σ_i f(i)^k for k > 2: skewness (k = 3), kurtosis (k = 4), etc.
Moments give a different proxy for the max: lim_{k→∞} F_k^{1/k} = max_i f(i).

i   f(i)   f(i)²   f(i)⁴
2   1      1       1
5   3      9       81
7   2      4       16

F_2 = 1 + 9 + 4 = 14, F_2^{1/2} ≈ 3.74; F_4 = 1 + 81 + 16 = 98, F_4^{1/4} ≈ 3.15.

F_2 moment

Use the Johnson-Lindenstrauss lemma (2nd lecture)!
Store the sketch S = Gf, where f is the frequency vector and G is a k × n matrix of Gaussian entries, with k = O(1/ε²) counters (words).
Update on element i: G(f + e_i) = Gf + Ge_i, i.e., O(k) time per element.
Better: ±1 entries and O(1) update time [AMS'96, TZ'04].
F_k: precision sampling => next.
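The ±1 version of this sketch is short enough to spell out. Below is a minimal Python rendition (an illustration under simplifying assumptions, not the lecture's code): cached truly random signs stand in for the 4-wise independent hash functions of [AMS'96], and updates take O(k) time rather than the O(1) achievable with [TZ'04].

import random

class F2Sketch:
    """Linear sketch S = Gf where G has random +/-1 entries.

    E[(sum_i G[row][i] * f[i])^2] = F2, so averaging the squared
    counters over k rows estimates F2 = sum_i f(i)^2.
    """

    def __init__(self, k=100, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        self.signs = {}          # simulated random sign G[row][i] in {-1, +1}
        self.S = [0.0] * k       # the sketch S = Gf

    def _sign(self, row, i):
        if (row, i) not in self.signs:
            self.signs[(row, i)] = self.rng.choice((-1, 1))
        return self.signs[(row, i)]

    def process(self, i):
        # update on element i: S = G(f + e_i) = Gf + G e_i
        for row in range(self.k):
            self.S[row] += self._sign(row, i)

    def estimate_f2(self):
        return sum(v * v for v in self.S) / self.k

sk = F2Sketch()
for x in [2, 5, 5, 5, 7, 7]:
    sk.process(x)
print(sk.estimate_f2())   # close to 1 + 9 + 4 = 14, up to sketch noise

Because the sketch is linear in f, subtracting two such sketches entrywise yields a sketch of the difference of the underlying frequency vectors, which is exactly the property exploited in the next scenario.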
Scenario 2: distributed traffic

Goal: statistics on the difference or aggregate of the traffic seen at two routers. E.g., by how many packets do the two traffic streams differ?

Router 1:
IP              Frequency
131.107.65.14   1
18.9.22.69      1

Router 2:
IP              Frequency
131.107.65.14   1
18.9.22.69      2
35.8.10.140     1

Linearity is the power!
  Sketch(data 1) + Sketch(data 2) = Sketch(data 1 + data 2)
  Sketch(data 1) - Sketch(data 2) = Sketch(data 1 - data 2)
So two sketches should be sufficient to compute something on the difference or the sum.

Common primitive: estimate a sum

Given: n quantities a_1, a_2, ..., a_n in the range [0,1].
Goal: estimate S = a_1 + a_2 + ... + a_n "cheaply".
Standard sampling: pick a random set J = {j_1, ..., j_m} of size m (e.g., estimate S from just a_1 and a_3 out of a_1, a_2, a_3, a_4) and use the estimator
  S̃ = (n/m) · (a_{j_1} + a_{j_2} + ... + a_{j_m}).
Chebyshev bound: with 90% success probability,
  S/2 - O(n/m) < S̃ < 2S + O(n/m).
For a constant additive error, this needs m = Ω(n) samples.

Precision Sampling Framework

Alternative "access" to the a_i's: for each term a_i we get a (rough) estimate ã_i, up to some precision u_i chosen in advance: |a_i - ã_i| < u_i.
Challenge: achieve a good trade-off between
• the quality of the approximation to S, and
• using only weak precisions u_i (minimizing the "cost" of estimating the ã_i's).
We then compute the estimate S̃ from ã_1, ã_2, ..., ã_n alone.

Formalization

A game between a sum estimator and an adversary:
1. The adversary fixes a_1, a_2, ..., a_n; the estimator fixes the precisions u_i (without seeing the a_i's).
2. The adversary fixes estimates ã_1, ã_2, ..., ã_n such that |a_i - ã_i| < u_i.
3. Given ã_1, ã_2, ..., ã_n, the estimator outputs S̃ such that |Σ_i a_i - S̃| < 1.

What is the cost? Here: average cost = (1/n) · Σ_i 1/u_i.
The rationale is that achieving precision u_i takes 1/u_i "resources": e.g., if a_i is itself a sum a_i = Σ_j a_ij computed by subsampling, then one needs Θ(1/u_i) samples to reach precision u_i.
For example, one can choose all u_i = 1/n, but then the average cost is ≈ n.

Precision Sampling Lemma [A-Krauthgamer-Onak'11]

Goal: estimate Σ a_i from {ã_i} satisfying |a_i - ã_i| < u_i.
Precision Sampling Lemma: with 90% success probability, one can get
• O(1) multiplicative and O(1) additive error, S/1.5 - O(1) < S̃ < 1.5·S + O(1), with average cost O(log n);
• more generally, 1+ε multiplicative and ε additive error, S/(1+ε) - ε < S̃ < (1+ε)·S + ε, with average cost O(ε⁻³ log n).

Example: distinguish Σ a_i = 3 vs Σ a_i = 0. Consider two extreme cases:
• if three of the a_i equal 1 and the rest are 0, crude approximations of all terms suffice (u_i = 0.1);
• if all a_i = 3/n, it is enough to have a few terms with good approximation (u_i = 1/n) and the rest with u_i = 1.

Precision Sampling Algorithm

Algorithm (for the O(1)-error guarantee):
• choose each u_i ∈ [0,1] i.i.d. uniformly;
• estimator: S̃ = #{i : ã_i/u_i > 6}, up to a normalization constant (namely, times 6).
For the 1+ε guarantee, choose each u_i as the minimum of O(ε⁻³) i.i.d. uniform random variables, and output a concrete function of the [ã_i/u_i - 4/ε]⁺'s and the u_i's.
Proof of correctness (sketch): we use only those ã_i which are 1.5-approximations to a_i; then E[#{i : ã_i/u_i > 6}] ≈ Σ_i Pr[a_i/u_i > 6] = Σ_i a_i/6. The average cost (1/n) · Σ_i 1/u_i = O(log n) w.h.p.
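Here is a toy Python rendition of the O(1)-error variant (my illustration: the rough estimates ã_i are simulated by adding noise of magnitude below u_i, whereas in a real application they would come from a cheap subroutine run at precision u_i):

import random

def precision_sample_sum(a, seed=0, t=6.0):
    """Estimate sum(a_i), a_i in [0,1], from rough estimates only.

    Draw u_i uniform in [0,1]; an estimate a~_i of precision u_i then
    suffices. Since Pr[u_i < a_i/t] = a_i/t, the count of indices with
    a~_i/u_i > t, scaled by t, has expectation ~= sum(a_i).
    """
    rng = random.Random(seed)
    count = 0
    for a_i in a:
        u_i = rng.random() or 1e-12                   # guard against u_i == 0
        a_tilde = a_i + rng.uniform(-u_i, u_i) / 10   # simulated precision-u_i estimate
        if a_tilde / u_i > t:
            count += 1
    return t * count

n = 10_000
print(precision_sample_sum([0.0] * n))      # always 0
runs = [precision_sample_sum([3.0 / n] * n, seed=s) for s in range(300)]
print(sum(runs) / len(runs))                # averages to roughly 3

A single run is coarse, since the output is a multiple of t; the full lemma reduces the variance by drawing each u_i as the minimum of O(ε⁻³) uniforms and using the smoother [ã_i/u_i - 4/ε]⁺ estimator, as stated above.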
Moments (F_k) via precision sampling

Theorem: there is a linear sketch for F_k with O(1) approximation and O(n^{1-2/k} log n) space (90% success probability).
Sketch: pick random u_i ∈ [0,1] and random signs r_i ∈ {±1}, and let y_i = x_i · r_i / u_i^{1/k}; throw the y_i's into one hash table H with m = O(n^{1-2/k} log n) cells, each cell storing the sum of the y_i's hashed into it. For example, for x = (x_1, ..., x_6) and three cells:
  H = (y_1 + y_4, y_3, y_2 + y_5 + y_6)
Estimator: max_j |H[j]|^k.
Randomness: O(1)-wise independence suffices.

Streaming++

LOTS of work in the area.
Surveys:
• Muthukrishnan: http://algo.research.googlepages.com/eight.ps
• McGregor: http://people.cs.umass.edu/~mcgregor/papers/08graphmining.pdf
• Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/Notes/lecnotes.pdf
Open problems: http://sublinear.info
Examples of further topics:
• moments, sampling
• median estimation, longest increasing sequence
• graph algorithms, e.g., dynamic graph connectivity [AGG'12, KKM'13, ...]
• numerical algorithms (e.g., regression, SVD approximation); fastest (sparse) regression [...CW'13, MM'13, KN'13, LMP'13], related to compressed sensing