Streaming Algorithms
Piotr Indyk, MIT

Data Streams
• A data stream is a sequence of data that is too large to be stored in available memory
• Examples:
– Network traffic
– Sensor networks
– Approximate query optimization and answering in large databases
– Scientific data streams
– … and this talk

Outline
• Data stream model, motivations, connections
• Streaming problems and techniques
– Estimating the number of distinct elements
– Other quantities: norms, moments, heavy hitters…
– Sparse approximations and compressed sensing

Basic Data Stream Model
• Single pass over the data: i_1, i_2, …, i_n
– Typically, we assume n is known
• Bounded storage (typically n^α or log^c n)
– Units of storage: bits, words, coordinates or “elements” (e.g., points, nodes/edges)
• Fast processing time per element
– Randomness OK (in fact, almost always necessary)
• Example stream: 8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 …

Connections
• External memory algorithms
– Linear scan works best
– Streaming algorithms yield efficient external memory algorithms
• Communication complexity/sketching
– Low-space algorithms yield low-communication protocols
[Figure: two parties A and B communicating]
• Metric embeddings, sub-linear time algorithms, pseudo-randomness, compressive sensing

Counting Distinct Elements
• Stream elements: numbers from {1…m}
• Goal: estimate the number of distinct elements DE in the stream
– Up to a factor of 1±ε
– With probability 1−P
• Simpler goal: for a given T > 0, provide an algorithm which, with probability 1−P:
– Answers YES, if DE > (1+ε)T
– Answers NO, if DE < (1−ε)T
• Run, in parallel, the algorithm with T = 1, 1+ε, (1+ε)^2, …, n
– Total space multiplied by log_{1+ε} n ≈ log(n)/ε

Vector Interpretation
• Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5
[Figure: the corresponding count vector x over coordinates 1…9]
• Initially, x = 0
• Insertion of i is interpreted as x_i = x_i + 1
• Want to estimate DE(x) = the number of non-zero elements of x

Estimating DE(x) (based on [Flajolet-Martin’85])
[Figure: vector x over coordinates 1…9, with a random set S of coordinates marked; T = 4]
• Choose a random set S of coordinates
– For each i, we have Pr[i∈S] = 1/T
• Maintain Sum_S(x) = ∑_{i∈S} x_i
• Estimation algorithm I:
– YES, if Sum_S(x) > 0
– NO, if Sum_S(x) = 0
• Analysis:
– Pr = Pr[Sum_S(x) = 0] = (1 − 1/T)^DE
– For T large enough: (1 − 1/T)^DE ≈ e^{−DE/T}
– Using calculus, for ε small enough:
• If DE > (1+ε)T, then Pr ≈ e^{−(1+ε)} < 1/e − ε/3
• If DE < (1−ε)T, then Pr ≈ e^{−(1−ε)} > 1/e + ε/3
[Plot: Pr = (1 − 1/T)^DE as a decreasing function of DE]

Estimating DE(x), ctd.
• We have an algorithm:
– If DE > (1+ε)T, then Pr < 1/e − ε/3
– If DE < (1−ε)T, then Pr > 1/e + ε/3
• Need to amplify the probability of correctness:
– Select sets S_1 … S_k, k = O(log(1/P)/ε^2)
– Let Z = number of the Sum_{S_j}(x) that are equal to 0
– By the Chernoff bound, with probability > 1−P:
• If DE > (1+ε)T, then Z < k/e
• If DE < (1−ε)T, then Z > k/e
• Total space:
– Decision version: O(log(1/P)/ε^2) numbers in range 0…n
– Estimation: O(log(n)/ε · log(1/P)/ε^2) numbers in range 0…n (the probability of error is O(P log(n)/ε))
• Better algorithms known:
– Theory: O(1/ε^2 + log n) bits [Kane-Nelson-Woodruff’10]
– Practice: need 128 bytes for all works of Shakespeare, ε ≈ 10% [Durand-Flajolet’03]
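As a concrete illustration, here is a minimal Python sketch of the decision procedure above. The name fm_decision and the use of a seeded PRNG in place of the uniform hash function h are illustrative assumptions, not part of the slides (a real implementation would hash i once per set rather than reseed a generator):

```python
import math
import random

def fm_decision(stream, T, k):
    """Decision version of the [Flajolet-Martin'85]-style algorithm above:
    with k = O(log(1/P)/eps^2) random sets, answer YES if DE > (1+eps)T
    and NO if DE < (1-eps)T, with probability > 1-P."""
    def in_S(j, i):
        # Coordinate i belongs to S_j with probability 1/T; a PRNG seeded
        # by (j, i) stands in for the uniform hash function h.
        return random.Random((j << 32) | i).random() < 1.0 / T

    sums = [0] * k                        # Sum_{S_j}(x), maintained in one pass
    for i in stream:
        for j in range(k):
            if in_S(j, i):
                sums[j] += 1

    Z = sum(1 for s in sums if s == 0)    # number of zero sums
    # Pr[Sum_S(x) = 0] = (1 - 1/T)^DE ~ e^{-DE/T}, so compare Z against k/e
    return "YES" if Z < k / math.e else "NO"

stream = [8, 2, 1, 9, 1, 9, 2, 4, 6, 3, 9, 4, 2, 3, 4, 2, 3, 8, 5, 2, 5, 6]
print(fm_decision(stream, T=4, k=300))    # DE = 8 > 4 here, so YES w.h.p.
```

For the estimation version, run this in parallel for T = 1, 1+ε, (1+ε)^2, …, n, as described above, and report the largest T that still answers YES.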
Chernoff bound
Let X_1 … X_n be a sequence of i.i.d. 0-1 random variables such that Pr[X_i = 1] = p. Let X = ∑_i X_i. Then for any ε < 1 we have

Pr[|X − pn| > ε·pn] < 2e^{−ε^2·pn/3}

Comments
• Implementing S:
– Choose a uniform hash function h: {1…m} → {1…T}
– Define S = {i: h(i) = 1}
– We have i∈S iff h(i) = 1
• Implementing h:
– In practice, to compute h(i) do:
• Seed = i
• Return random()

More Comments
• The algorithm uses “linear sketches” Sum_{S_j}(x) = ∑_{i∈S_j} x_i
• Can implement decrements x_i = x_i − 1
– I.e., the stream can contain deletions of elements (as long as x ≥ 0)
– Other names: dynamic model, turnstile model

More General Problem
• What other functions of a vector x can we maintain in small space?
• L_p norms: ||x||_p = (∑_i |x_i|^p)^{1/p}
– We also have ||x||_∞ = max_i |x_i|
– … and we can define ||x||_0 = DE(x), since ||x||_p^p = ∑_i |x_i|^p → DE(x) as p → 0
• Alternatively: frequency moments F_p = p-th power of the L_p norm (exception: F_0 = L_0)
• How much space do you need to estimate ||x||_p (for const. ε)?
• Theorem:
– For p ∈ [0,2]: polylog n space suffices
– For p > 2: n^{1−2/p} polylog n space suffices and is necessary
[Alon-Matias-Szegedy’96, Feigenbaum-Kannan-Strauss-Viswanathan’99, Indyk’00, Coppersmith-Kumar’04, Ganguly’04, Bar-Yossef-Jayram-Kumar-Sivakumar’02,’03, Saks-Sun’03, Indyk-Woodruff’05]

Estimating the L2 norm: AMS [Alon-Matias-Szegedy’96]
• Choose r_1 … r_m to be i.i.d. random variables with Pr[r_i = 1] = Pr[r_i = −1] = 1/2
• Maintain Z = ∑_i r_i·x_i under increments/decrements to x_i
• Algorithm I: Y = Z^2
• “Claim”: Y “approximates” ||x||_2^2 with “good” probability

Analysis
• We will use the Chebyshev inequality
– Need expectation, variance
• The expectation of Z^2 = (∑_i r_i x_i)^2 is
E[Z^2] = E[∑_{i,j} r_i x_i r_j x_j] = ∑_{i,j} x_i x_j E[r_i r_j]
• We have
– For i ≠ j: E[r_i r_j] = E[r_i]·E[r_j] = 0, so the term disappears
– For i = j: E[r_i r_j] = 1
• Therefore E[Z^2] = ∑_i x_i^2 = ||x||_2^2 (unbiased estimator)

Analysis, ctd.
• The second moment of Z^2 = (∑_i r_i x_i)^2 is the expectation of
Z^4 = (∑_i r_i x_i)(∑_i r_i x_i)(∑_i r_i x_i)(∑_i r_i x_i)
• This can be decomposed into a sum of
– ∑_i (r_i x_i)^4 → expectation = ∑_i x_i^4
– 6 ∑_{i<j} (r_i r_j x_i x_j)^2 → expectation = 6 ∑_{i<j} x_i^2 x_j^2
– Terms where some multiplier r_i x_i appears an odd number of times (e.g., r_1x_1·r_2x_2·r_3x_3·r_4x_4) → expectation = 0, since E[r_i] = 0
• Total: E[Z^4] = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2
• The variance of Z^2 is
E[Z^4] − E^2[Z^2] = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2 − (∑_i x_i^2)^2
= ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2 − ∑_i x_i^4 − 2 ∑_{i<j} x_i^2 x_j^2
= 4 ∑_{i<j} x_i^2 x_j^2 ≤ 2 (∑_i x_i^2)^2

Analysis, ctd.
• We have an estimator Y = Z^2
– E[Y] = ∑_i x_i^2
– σ^2 = Var[Y] ≤ 2 (∑_i x_i^2)^2
• Chebyshev inequality: Pr[|E[Y] − Y| ≥ cσ] ≤ 1/c^2
• Algorithm II:
– Maintain Z_1 … Z_k (and thus Y_1 … Y_k); define Y’ = ∑_j Y_j / k
– E[Y’] = k·∑_i x_i^2 / k = ∑_i x_i^2
– σ’^2 = Var[Y’] ≤ 2k(∑_i x_i^2)^2 / k^2 = 2(∑_i x_i^2)^2 / k
• Guarantee: Pr[|Y’ − ∑_i x_i^2| ≥ c·(2/k)^{1/2}·∑_i x_i^2] ≤ 1/c^2
• Setting c to a constant and k = O(1/ε^2) gives a (1±ε)-approximation with constant probability

Comments
• Only needed 4-wise independence of r_1 … r_m
– I.e., for any subset S of {1…m} with |S| = 4 and any b ∈ {−1,1}^4, we have Pr[r_S = b] = 1/2^4
– Can generate such variables from O(log m) random bits
• What we did:
– Maintain a “linear sketch” vector Z = [Z_1 … Z_k] = Rx
– Estimator for ||x||_2^2: (Z_1^2 + … + Z_k^2)/k = ||Rx||_2^2 / k
– “Dimensionality reduction”: x → Rx … but the tail is somewhat “heavy”
– Reason: we only used the second moment of the estimator
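To make Algorithm II concrete, here is a minimal Python sketch of the AMS estimator. The name ams_l2_squared and the (i, delta) update representation are illustrative assumptions; fully i.i.d. signs are used for simplicity, though 4-wise independence suffices, as noted above:

```python
import random

def ams_l2_squared(updates, m, k):
    """AMS sketch (Algorithm II above): maintain Z_j = sum_i r_j[i]*x_i for k
    independent sign vectors r_j; return Y' = (Z_1^2 + ... + Z_k^2)/k, which
    equals ||x||_2^2 in expectation and has Var[Y'] <= 2(||x||_2^2)^2 / k."""
    def sign(j, i):
        # r_j[i] in {-1,+1} with probability 1/2 each, recomputed on demand
        # from a seed so that the same (j, i) always yields the same sign
        return 1 if random.Random(j * (m + 1) + i).random() < 0.5 else -1

    Z = [0] * k
    for i, delta in updates:             # turnstile stream: delta in {+1, -1}
        for j in range(k):
            Z[j] += sign(j, i) * delta   # linear sketch Z = Rx, one update per element
    return sum(z * z for z in Z) / k

# x_3 is incremented twice; x_7 is inserted and then deleted: ||x||_2^2 = 4
updates = [(3, +1), (7, +1), (3, +1), (7, -1)]
print(ams_l2_squared(updates, m=9, k=500))   # ~4, within c*(2/k)^{1/2} deviations
```

Because the sketch is linear, sketches of two streams can be added coordinate-wise to get the sketch of the combined stream, which is exactly what the turnstile model exploits.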
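The last comment, that 4-wise independent signs can be generated from O(log m) random bits, is usually realized with a random degree-3 polynomial over a prime field. The sketch below shows this standard construction; it is not spelled out on the slides, and the specific prime and the parity-bit extraction (which carries a negligible O(1/p) bias) are illustrative choices:

```python
import random

def fourwise_signs(m, seed=0):
    """Generate r_1 ... r_m in {-1,+1} that are (nearly) 4-wise independent
    from just 4 random field elements, i.e. O(log m) random bits: the values
    of a random degree-3 polynomial over GF(p) are 4-wise independent."""
    p = (1 << 31) - 1                  # a prime larger than m (here 2^31 - 1)
    rng = random.Random(seed)
    a0, a1, a2, a3 = (rng.randrange(p) for _ in range(4))
    def r(i):
        v = (((a3 * i + a2) * i + a1) * i + a0) % p   # Horner's rule
        return 1 if v & 1 == 0 else -1                # one output bit
    return [r(i) for i in range(1, m + 1)]

signs = fourwise_signs(m=9)   # could replace the i.i.d. signs in the AMS sketch
```

The 4-wise independence follows because, for any 4 distinct evaluation points, the Vandermonde system maps the 4 uniformly random coefficients bijectively onto the 4 polynomial values.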