Slide

Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.) The Story Begins with ... The Model Alice observes A(t) by time t 5 4 3 1 2 4 1 t 2 1 2 5 3 All parties have infinite computing power Goal is to minimize communication Carole tries to compute f (A(t)UB(t)) for all t 2 Bob observes B(t) by time t A(t), B(t): multisets The Model Continuous Communication Model / Distributed Streaming Model 5 4 2 3 1 3 2 1 2 1 2 2 5 3 3 4 3 1 2 3 1 2 3 5 2 k sites Combination of Two Models 3 2 3 2 1 1 1 1 2 4 2 4 Continuous Communication Model Distributed Streaming Model 3 Communication model One-shot Model 1 2 4 Streaming model 1 Other Models [Gibbons and Tirthapura, 2001] 5 4 3 1 2 4 1 t Carole tries to compute f (AUB) in the end 2 1 2 5 3 2 All parties make one pass using small memory  small communication Applied Motivation: Distributed Monitoring Network Operations Center (NOC) Query site S1 Query Q(S1 ∪ S2 ∪…) S3 1 1 1 1 S2 1 0 0 1 0 1 1 0 1 1 0 S6 S4 0 1 1 1 0 S5 1 0  Large-scale querying/monitoring: Inherently distributed!  Streams physically distributed across remote sites E.g., stream of UDP packets through routers  Challenge is “holistic” querying/monitoring  Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)  Streaming data is spread throughout the network Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis] Applied Motivation: Distributed Monitoring Network Operations Center (NOC) Query site S1 Query Q(S1 ∪ S2 ∪…) S3 1 1 1 1 S2 1 0 1 0 0 1 1 0 1 1 0 S6 S4 0 1 1 1  Traditional approach: “pull” based  Query all nodes once for a while  Expensive communication, most is wasted  Inaccurate  Current trend: moving towards a “push” based approach  The remote sites alert the coordinator when something interesting happens 0 S5 1 0 Theoretical Questions Upper bounds: Worst-case communication bounds for a given f ? Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model? The Frequency Moments  Assume integer domain [n] = {1, …, n}  i appears mi times  The p-th frequency moment:  F1 is the cardinality of A  F0 is # unique items in A (define 00=0)  F2 is Gini’s index of homogeneity in statistics self-join size in db  Extensively studied since [Alon, Matias, and Szegedy, 1999] Approximate Monitoring  Must trigger alarm when Fp > τ  Cannot trigger alarm when Fp < (1 − ε) τ Fp τ (1 − ε) τ alarm time  Why approximate: Exact monitoring is expensive and unnecessary  Why monitoring  Most applications only need monitoring  Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)2, (1+ε)3, …, so at most an O(1/ε) factor away. Prior Work Several papers in the database literature Mostly heuristic based Bad worst-case bounds, no lower bounds F1: O(k/ε log(τ/k)) [SIGMOD’06] O(k log(1/ε)) F0: Õ(k2/ε3) [ICDE’06] Õ(k/ε2) F2: Õ(k2/ε4) [VLDB’05] Õ(k2/ε+k3/2/ε3) Õ() suppresses polylog factors Continuous vs One-Shot If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithms that communicates O(X+k) bits Our Results Good news: all continuous bounds (except F2) are close to their one-shot counterparts Bad news: all continuous bounds (except F2) are close to their one-shot counterparts Talk Outline Introduction Deterministic F1 algorithm: O(k log(1/ε)) Randomized F1 algorithm: O(1/ε2∙log(1/δ)) Randomized F0 algorithm: Õ(k/ε2) Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3) Conclusions Deterministic F1 Algorithm The first round: Terminates round after receiving k signals τ/2k · k = τ/2 < F1 < τ τ/2k coordinator Deterministic F1 Algorithm The second round: τ/4k coordinator Deterministic F1 Algorithm The second round: Terminates round after receiving k signals 3τ/4 < F1 < τ τ/4k coordinator Deterministic F1 Algorithm Each round communicates O(k) bits Continue until Δ=ετ  O(log(1/ε)) rounds Δ=ετ After the last round, we have (1-ε)τ < F1 < τ Total communication: O(k log(1/ε)) Lower bound: Ω(k log(1/(εk))) One-Shot: O(k log(1/ε)) Lower bound: Ω(k log(1/(εk))) coordinator Talk Outline Introduction Deterministic F1 algorithm: O(k log(1/ε)) Randomized F1 algorithm: O(1/ε2∙log(1/δ)) Randomized F0 algorithm: Õ(k/ε2) Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3) Conclusions F0: # Distinct Items Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits Consider the one-shot case first Use “sketches”: small-space streaming algorithms “Combine” the sketches from the k sites FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999] FM Sketch  Take a pair-wise independent random hash function h : {1,…,n}  {1,…,2d}, where 2d > n  For each incoming element x, compute h(x) e.g., h(5) = 10101100010000 Count how many trailing zeros Remember the maximum number of trailing zeroes in any h(x)  Let Y be the maximum number of trailing zeroes Can show E[2Y] = # distinct elements FM Sketch  So 2Y is an unbiased estimator for # distinct elements  However, has a large variance Some recent techniques [Gibbons and Tirthapura, 2001; BarYossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] to produce a good estimator that has probability 1–δ to be within relative error ε Space increased to Õ(1/ε2)  FM sketch has linearity Y1 from A, Y2 from B, then 2max{Y1, Y2} estimates # distinct items in AUB  A one-shot algorithm with communication Õ(k/ε2) Continuously Monitoring F0 FM sketch is monotone Yi is non-decreasing, and Yi < log n Whenever Yi increases, notify the coordinator The coordinator can always have the up-todate combined FM sketch Total communication: Õ(k/ε2) Lower bound: Ω(k) Talk Outline Introduction Deterministic F1 algorithm: O(k log(1/ε)) Randomized F1 algorithm: O(1/ε2∙log(1/δ)) Randomized F0 algorithm: Õ(k/ε2) Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3) Conclusions F2: The One-Shot Case Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits Consider the one-shot case first Use “sketches”: small-space streaming algorithms “Combine” the sketches from the k sites AMS sketch [Alon, Matias, and Szegedy, 1999] AMS Sketch: “Tug-of-War”  Take a 4-wise independent random hash function h : {1,…,n}  {−1,+1}  Compute Y = ∑ h(x) over all x  Y2 is an unbiased estimator for F2  Use O(1/ε2 ∙ log(1/δ)) copies to guarantee a good estimator that has probability 1–δ to be within relative error ε  Linearity still holds! o One-shot case can be solved with communication Õ(k/ε2) However… Y is not monotone! Can’t afford to send all changes of the local sketch to the coordinator F2 Monitoring: Multi-Round Algorithm Beginning of a round sketch Õ(1/ε2) sketch Õ(1/ε2) coordinator estimate for F2 F2 Monitoring: Multi-Round Algorithm During a round sends a signal whenever the F2 of the updates increases by t = (τ − F2)2/(64k2τ) coordinator estimate for F2 F2 Monitoring: Multi-Round Algorithm End of a round: when k signals are received coordinator estimate for F2 old F2 + (τ − old F2) ∙ ε/k < new F2 < τ # rounds: O(k/ε) Total cost: Õ(k2/ε3) F2: Round / Sub-Round Algorithm End of a sub-round: when k signals are received “rough” sketch of size Õ(1) “rough” sketch of size Õ(1) coordinator  k estimate for F2 old F2 + (τ − old F2) ∙ ε/k < new F2 < τ Lower bound: Ω(k) combine sketches maintain an upper bound of F2 Total cost: Õ(k2/ε+k3/2/ε3) One-shot: Õ(k/ε2) Open Problems  Still no clear separation between the one-shot model and the continuous model  F2 is an interesting case  Many other functions f  Statistics: entropy, heavy hitters  Geometric measures: diameter, width, …  Variations of the model  One-way vs two-way communication  Does having a broadcast channel help?  Sliding windows?  “Continuous Communication Complexity”? Thank you!

Slide

Related documents

Products

Support

Slide

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib