Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.) The Story Begins with ... The Model Alice observes A(t) by time t 5 4 3 1 2 4 1 t 2 1 2 5 3 All parties have infinite computing power Goal is to minimize communication Carole tries to compute f (A(t)UB(t)) for all t 2 Bob observes B(t) by time t A(t), B(t): multisets The Model Continuous Communication Model / Distributed Streaming Model 5 4 2 3 1 3 2 1 2 1 2 2 5 3 3 4 3 1 2 3 1 2 3 5 2 k sites Combination of Two Models 3 2 3 2 1 1 1 1 2 4 2 4 Continuous Communication Model Distributed Streaming Model 3 Communication model One-shot Model 1 2 4 Streaming model 1 Other Models [Gibbons and Tirthapura, 2001] 5 4 3 1 2 4 1 t Carole tries to compute f (AUB) in the end 2 1 2 5 3 2 All parties make one pass using small memory small communication Applied Motivation: Distributed Monitoring Network Operations Center (NOC) Query site S1 Query Q(S1 ∪ S2 ∪…) S3 1 1 1 1 S2 1 0 0 1 0 1 1 0 1 1 0 S6 S4 0 1 1 1 0 S5 1 0 Large-scale querying/monitoring: Inherently distributed! Streams physically distributed across remote sites E.g., stream of UDP packets through routers Challenge is “holistic” querying/monitoring Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …) Streaming data is spread throughout the network Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis] Applied Motivation: Distributed Monitoring Network Operations Center (NOC) Query site S1 Query Q(S1 ∪ S2 ∪…) S3 1 1 1 1 S2 1 0 1 0 0 1 1 0 1 1 0 S6 S4 0 1 1 1 Traditional approach: “pull” based Query all nodes once for a while Expensive communication, most is wasted Inaccurate Current trend: moving towards a “push” based approach The remote sites alert the coordinator when something interesting happens 0 S5 1 0 Theoretical Questions Upper bounds: Worst-case communication bounds for a given f ? Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model? The Frequency Moments Assume integer domain [n] = {1, …, n} i appears mi times The p-th frequency moment: F1 is the cardinality of A F0 is # unique items in A (define 00=0) F2 is Gini’s index of homogeneity in statistics self-join size in db Extensively studied since [Alon, Matias, and Szegedy, 1999] Approximate Monitoring Must trigger alarm when Fp > τ Cannot trigger alarm when Fp < (1 − ε) τ Fp τ (1 − ε) τ alarm time Why approximate: Exact monitoring is expensive and unnecessary Why monitoring Most applications only need monitoring Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)2, (1+ε)3, …, so at most an O(1/ε) factor away. Prior Work Several papers in the database literature Mostly heuristic based Bad worst-case bounds, no lower bounds F1: O(k/ε log(τ/k)) [SIGMOD’06] O(k log(1/ε)) F0: Õ(k2/ε3) [ICDE’06] Õ(k/ε2) F2: Õ(k2/ε4) [VLDB’05] Õ(k2/ε+k3/2/ε3) Õ() suppresses polylog factors Continuous vs One-Shot If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithms that communicates O(X+k) bits Our Results Good news: all continuous bounds (except F2) are close to their one-shot counterparts Bad news: all continuous bounds (except F2) are close to their one-shot counterparts Talk Outline Introduction Deterministic F1 algorithm: O(k log(1/ε)) Randomized F1 algorithm: O(1/ε2∙log(1/δ)) Randomized F0 algorithm: Õ(k/ε2) Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3) Conclusions Deterministic F1 Algorithm The first round: Terminates round after receiving k signals τ/2k · k = τ/2 < F1 < τ τ/2k coordinator Deterministic F1 Algorithm The second round: τ/4k coordinator Deterministic F1 Algorithm The second round: Terminates round after receiving k signals 3τ/4 < F1 < τ τ/4k coordinator Deterministic F1 Algorithm Each round communicates O(k) bits Continue until Δ=ετ O(log(1/ε)) rounds Δ=ετ After the last round, we have (1-ε)τ < F1 < τ Total communication: O(k log(1/ε)) Lower bound: Ω(k log(1/(εk))) One-Shot: O(k log(1/ε)) Lower bound: Ω(k log(1/(εk))) coordinator Talk Outline Introduction Deterministic F1 algorithm: O(k log(1/ε)) Randomized F1 algorithm: O(1/ε2∙log(1/δ)) Randomized F0 algorithm: Õ(k/ε2) Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3) Conclusions F0: # Distinct Items Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits Consider the one-shot case first Use “sketches”: small-space streaming algorithms “Combine” the sketches from the k sites FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999] FM Sketch Take a pair-wise independent random hash function h : {1,…,n} {1,…,2d}, where 2d > n For each incoming element x, compute h(x) e.g., h(5) = 10101100010000 Count how many trailing zeros Remember the maximum number of trailing zeroes in any h(x) Let Y be the maximum number of trailing zeroes Can show E[2Y] = # distinct elements FM Sketch So 2Y is an unbiased estimator for # distinct elements However, has a large variance Some recent techniques [Gibbons and Tirthapura, 2001; BarYossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] to produce a good estimator that has probability 1–δ to be within relative error ε Space increased to Õ(1/ε2) FM sketch has linearity Y1 from A, Y2 from B, then 2max{Y1, Y2} estimates # distinct items in AUB A one-shot algorithm with communication Õ(k/ε2) Continuously Monitoring F0 FM sketch is monotone Yi is non-decreasing, and Yi < log n Whenever Yi increases, notify the coordinator The coordinator can always have the up-todate combined FM sketch Total communication: Õ(k/ε2) Lower bound: Ω(k) Talk Outline Introduction Deterministic F1 algorithm: O(k log(1/ε)) Randomized F1 algorithm: O(1/ε2∙log(1/δ)) Randomized F0 algorithm: Õ(k/ε2) Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3) Conclusions F2: The One-Shot Case Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits Consider the one-shot case first Use “sketches”: small-space streaming algorithms “Combine” the sketches from the k sites AMS sketch [Alon, Matias, and Szegedy, 1999] AMS Sketch: “Tug-of-War” Take a 4-wise independent random hash function h : {1,…,n} {−1,+1} Compute Y = ∑ h(x) over all x Y2 is an unbiased estimator for F2 Use O(1/ε2 ∙ log(1/δ)) copies to guarantee a good estimator that has probability 1–δ to be within relative error ε Linearity still holds! o One-shot case can be solved with communication Õ(k/ε2) However… Y is not monotone! Can’t afford to send all changes of the local sketch to the coordinator F2 Monitoring: Multi-Round Algorithm Beginning of a round sketch Õ(1/ε2) sketch Õ(1/ε2) coordinator estimate for F2 F2 Monitoring: Multi-Round Algorithm During a round sends a signal whenever the F2 of the updates increases by t = (τ − F2)2/(64k2τ) coordinator estimate for F2 F2 Monitoring: Multi-Round Algorithm End of a round: when k signals are received coordinator estimate for F2 old F2 + (τ − old F2) ∙ ε/k < new F2 < τ # rounds: O(k/ε) Total cost: Õ(k2/ε3) F2: Round / Sub-Round Algorithm End of a sub-round: when k signals are received “rough” sketch of size Õ(1) “rough” sketch of size Õ(1) coordinator k estimate for F2 old F2 + (τ − old F2) ∙ ε/k < new F2 < τ Lower bound: Ω(k) combine sketches maintain an upper bound of F2 Total cost: Õ(k2/ε+k3/2/ε3) One-shot: Õ(k/ε2) Open Problems Still no clear separation between the one-shot model and the continuous model F2 is an interesting case Many other functions f Statistics: entropy, heavy hitters Geometric measures: diameter, width, … Variations of the model One-way vs two-way communication Does having a broadcast channel help? Sliding windows? “Continuous Communication Complexity”? Thank you!