Slide

advertisement
Algorithms for Distributed
Functional Monitoring
Ke Yi
HKUST
Joint work with
Graham Cormode (AT&T Labs)
S. Muthukrishnan (Google Inc.)
The Story Begins with ...
The Model
Alice observes
A(t) by time t
5
4
3
1
2
4
1
t
2
1
2
5
3
All parties have infinite computing power
Goal is to minimize communication
Carole tries to compute
f (A(t)UB(t)) for all t
2
Bob observes
B(t) by time t
A(t), B(t): multisets
The Model
Continuous Communication Model / Distributed Streaming Model
5
4
2
3
1
3
2
1
2
1
2
2
5
3
3
4
3
1
2
3
1
2
3
5
2
k sites
Combination of Two Models
3
2
3
2
1
1
1
1
2
4
2
4
Continuous Communication Model
Distributed Streaming Model
3
Communication model
One-shot Model
1
2
4
Streaming model
1
Other Models [Gibbons and Tirthapura, 2001]
5
4
3
1
2
4
1
t
Carole tries to compute
f (AUB) in the end
2
1
2
5
3
2
All parties make one pass using small memory
 small communication
Applied Motivation: Distributed Monitoring
Network
Operations
Center (NOC)
Query site
S1
Query
Q(S1 ∪ S2 ∪…)
S3
1
1 1
1
S2
1 0
0
1
0
1
1
0
1
1 0
S6
S4
0
1
1
1
0
S5
1
0
 Large-scale querying/monitoring: Inherently distributed!
 Streams physically distributed across remote sites
E.g., stream of UDP packets through routers
 Challenge is “holistic” querying/monitoring
 Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)
 Streaming data is spread throughout the network
Slide from the tutorial “Streaming in a connected world: Querying and tracking
distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]
Applied Motivation: Distributed Monitoring
Network
Operations
Center (NOC)
Query site
S1
Query
Q(S1 ∪ S2 ∪…)
S3
1
1 1
1
S2
1 0
1
0
0
1
1
0
1
1 0
S6
S4
0
1
1
1
 Traditional approach: “pull” based
 Query all nodes once for a while
 Expensive communication, most is wasted
 Inaccurate
 Current trend: moving towards a “push” based approach
 The remote sites alert the coordinator when something interesting
happens
0
S5
1
0
Theoretical Questions
Upper bounds: Worst-case communication
bounds for a given f ?
Lower bounds: Is there a gap in the
communication complexity between the
one-shot model and the continuous model?
The Frequency Moments
 Assume integer domain [n] = {1, …, n}
 i appears mi times
 The p-th frequency moment:
 F1 is the cardinality of A
 F0 is # unique items in A (define 00=0)
 F2 is
Gini’s index of homogeneity in statistics
self-join size in db
 Extensively studied since [Alon, Matias, and Szegedy, 1999]
Approximate Monitoring
 Must trigger alarm when Fp > τ
 Cannot trigger alarm when Fp < (1 − ε) τ
Fp
τ
(1 − ε) τ
alarm
time
 Why approximate: Exact monitoring is expensive and
unnecessary
 Why monitoring
 Most applications only need monitoring
 Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)2,
(1+ε)3, …, so at most an O(1/ε) factor away.
Prior Work
Several papers in the database literature
Mostly heuristic based
Bad worst-case bounds, no lower bounds
F1: O(k/ε log(τ/k)) [SIGMOD’06] O(k log(1/ε))
F0: Õ(k2/ε3) [ICDE’06]
Õ(k/ε2)
F2: Õ(k2/ε4) [VLDB’05]
Õ(k2/ε+k3/2/ε3)
Õ() suppresses polylog factors
Continuous vs One-Shot
If there is a continuous monitoring
algorithm that communicates X bits, then
there is a one-shot algorithms that
communicates O(X+k) bits
Our Results
Good news: all continuous bounds (except
F2) are close to their one-shot counterparts
Bad news: all continuous bounds (except
F2) are close to their one-shot counterparts
Talk Outline
Introduction
Deterministic F1 algorithm: O(k log(1/ε))
Randomized F1 algorithm: O(1/ε2∙log(1/δ))
Randomized F0 algorithm: Õ(k/ε2)
Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3)
Conclusions
Deterministic F1 Algorithm
The first round:
Terminates round after receiving k signals
τ/2k · k = τ/2 < F1 < τ
τ/2k
coordinator
Deterministic F1 Algorithm
The second round:
τ/4k
coordinator
Deterministic F1 Algorithm
The second round:
Terminates round after receiving k signals
3τ/4 < F1 < τ
τ/4k
coordinator
Deterministic F1 Algorithm
Each round communicates O(k) bits
Continue until Δ=ετ  O(log(1/ε)) rounds
Δ=ετ
After the last round, we have (1-ε)τ < F1 < τ
Total communication: O(k log(1/ε))
Lower bound: Ω(k log(1/(εk)))
One-Shot: O(k log(1/ε))
Lower bound: Ω(k log(1/(εk)))
coordinator
Talk Outline
Introduction
Deterministic F1 algorithm: O(k log(1/ε))
Randomized F1 algorithm: O(1/ε2∙log(1/δ))
Randomized F0 algorithm: Õ(k/ε2)
Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3)
Conclusions
F0: # Distinct Items
Lower bound: Any deterministic (or Las
Vegas randomized) algorithm has to
communicate Ω(n) bits
Consider the one-shot case first
Use “sketches”: small-space streaming
algorithms
“Combine” the sketches from the k sites
FM sketch [Flajolet and Martin 1985; Alon, Matias,
and Szegedy, 1999]
FM Sketch
 Take a pair-wise independent random hash
function h : {1,…,n}  {1,…,2d}, where 2d > n
 For each incoming element x, compute h(x)
e.g., h(5) = 10101100010000
Count how many trailing zeros
Remember the maximum number of trailing zeroes in
any h(x)
 Let Y be the maximum number of trailing zeroes
Can show E[2Y] = # distinct elements
FM Sketch
 So 2Y is an unbiased estimator for # distinct elements
 However, has a large variance
Some recent techniques [Gibbons and Tirthapura, 2001; BarYossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] to produce
a good estimator that has probability 1–δ to be within
relative error ε
Space increased to Õ(1/ε2)
 FM sketch has linearity
Y1 from A, Y2 from B, then 2max{Y1, Y2} estimates #
distinct items in AUB
 A one-shot algorithm with communication Õ(k/ε2)
Continuously Monitoring F0
FM sketch is monotone
Yi is non-decreasing, and Yi < log n
Whenever Yi increases, notify the coordinator
The coordinator can always have the up-todate combined FM sketch
Total communication: Õ(k/ε2)
Lower bound: Ω(k)
Talk Outline
Introduction
Deterministic F1 algorithm: O(k log(1/ε))
Randomized F1 algorithm: O(1/ε2∙log(1/δ))
Randomized F0 algorithm: Õ(k/ε2)
Randomized F2 algorithm: Õ(k2/ε+k3/2/ε3)
Conclusions
F2: The One-Shot Case
Lower bound: Any deterministic (or Las
Vegas randomized) algorithm has to
communicate Ω(n) bits
Consider the one-shot case first
Use “sketches”: small-space streaming
algorithms
“Combine” the sketches from the k sites
AMS sketch [Alon, Matias, and Szegedy, 1999]
AMS Sketch: “Tug-of-War”
 Take a 4-wise independent random hash function
h : {1,…,n}  {−1,+1}
 Compute
Y = ∑ h(x)
over all x
 Y2 is an unbiased estimator for F2
 Use O(1/ε2 ∙ log(1/δ)) copies to guarantee a good
estimator that has probability 1–δ to be within relative
error ε
 Linearity still holds!
o One-shot case can be solved with communication Õ(k/ε2)
However…
Y is not monotone!
Can’t afford to send all changes of the
local sketch to the coordinator
F2 Monitoring: Multi-Round Algorithm
Beginning of a round
sketch Õ(1/ε2)
sketch Õ(1/ε2)
coordinator
estimate for F2
F2 Monitoring: Multi-Round Algorithm
During a round
sends a signal whenever
the F2 of the updates increases
by t = (τ − F2)2/(64k2τ)
coordinator
estimate for F2
F2 Monitoring: Multi-Round Algorithm
End of a round: when k signals are received
coordinator
estimate for F2
old F2 + (τ − old F2) ∙ ε/k < new F2 < τ
# rounds: O(k/ε)
Total cost: Õ(k2/ε3)
F2: Round / Sub-Round Algorithm
End of a sub-round: when k signals are received
“rough” sketch
of size Õ(1)
“rough” sketch
of size Õ(1)
coordinator

k
estimate for F2
old F2 + (τ − old F2) ∙ ε/k < new F2 < τ
Lower bound: Ω(k)
combine sketches
maintain an upper
bound of F2
Total cost: Õ(k2/ε+k3/2/ε3)
One-shot: Õ(k/ε2)
Open Problems
 Still no clear separation between the one-shot model and
the continuous model
 F2 is an interesting case
 Many other functions f
 Statistics: entropy, heavy hitters
 Geometric measures: diameter, width, …
 Variations of the model
 One-way vs two-way communication
 Does having a broadcast channel help?
 Sliding windows?
 “Continuous Communication Complexity”?
Thank you!
Download