Time-Decaying Sketches for
Sensor Data Aggregation
Graham Cormode
AT&T Labs, Research
Srikanta Tirthapura
Dept. of Electrical and Computer Engineering
Iowa State University
Bojian Xu
Dept. of Electrical and Computer Engineering
Iowa State University
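Mean of the Temperatures
in the Last 30 Minutes
[Figure: five groups of timestamped temperature readings]
– 75F (11:39), 76F (11:34), 72F (11:29), 73F (11:19)
– 78F (11:41), 73F (11:39), 76F (11:38), 76F (11:26)
– 76F (11:45), 73F (11:40)
– 79F (11:30), 70F (11:22), 76F (11:15)
– 76F (11:45), 80F (11:38), 79F (11:30), 76F (11:25)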
2/30
Sketch
[Figure: the same timestamped temperature readings as on the previous slide]
3/30
Sketch Merging
Answer
4/30
General Time Decay
• General decay function: f(age)
• Time-decayed value of element (v, t, id)
at time c is: v · f(c − t)
[Plot: f against age, starting at age 0]
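To make the definition concrete, here is a minimal Python sketch of a few common decay functions. It assumes the decayed value has the form v · f(c − t); the function names and the three example models (sliding window, exponential, polynomial) are illustrative choices, not taken from the slides.

```python
import math

def sliding_window(window):
    """Decay function: weight 1 while age < window, 0 afterwards."""
    return lambda age: 1.0 if 0 <= age < window else 0.0

def exponential(rate):
    """Decay function: weight exp(-rate * age)."""
    return lambda age: math.exp(-rate * age)

def polynomial(exponent):
    """Decay function: weight (age + 1) ** -exponent."""
    return lambda age: (age + 1.0) ** (-exponent)

def decayed_value(v, t, c, f):
    """Decayed value at current time c of an element with value v created at time t."""
    return v * f(c - t)
```

Any non-negative function of the age can be passed as `f`, which is what "general" decay amounts to in this sketch.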
5/30
Formal Model of the Data
(on One Sensor)
Data stream: e0=(v0,t0,id0), e1=(v1,t1,id1), …
– v: value
– t: timestamp of creation
– id: a unique id of the observation
• User-defined time decay function: f()
• Asynchronous arrival: ti > tj is possible even when i < j
• Duplicates: idi = idj is possible
– Assume: if idi = idj, then vi = vj and ti = tj
6/30
Contribution
First mergeable sketch that combines all of the following:
• Properties: space logarithmic in the universe size, guaranteed accuracy, any time decay model, asynchronous arrival, duplicate insensitivity
• Aggregates: sum, quantile, frequent elements
• Application: data aggregation under any multi-path routing protocol
7/30
Related Work
[Table: prior work 1–4 and our work compared on six features: any time decay model, asynchronous arrival, duplicate insensitivity, sum, quantile, frequent elements. Our work supports all six.]
1. S. Nath, P. B. Gibbons, S. Seshan and Z. R. Anderson, “Synopsis diffusion
for robust aggregation in sensor networks”, SenSys 2004
2. J. Considine, F. Li, G. Kollios and J. Byers, “Approximate Aggregation
Techniques for Sensor Databases”, ICDE 2004
3. E. Cohen and M. Strauss, “Maintaining time-decaying stream aggregates”,
PODS 2003; Journal of Algorithms 2006
4. S. Tirthapura, B. Xu and C. Busch, “Sketching Asynchronous
Streams Over Sliding Windows”, PODC 2006
8/30
Outline
• Problem: Time decayed sum of distinct
elements over an asynchronous stream.
• Focus on integral decay model:
the decayed value v · f(c − t) is always an integer
9/30
Estimate of the Sum
(on One Sensor)
• Given:
– Stream: R = (v0,t0,id0),…, (vn,tn,idn), …
– User defined decay function: f()
• Maintain: the decayed sum of v · f(c − t) over all (v, t, id) in D
– c: current time
– D: set of distinct elements in R
10/30
Estimate of the Sum (cont’d)
• Linear space lower bound on duplicate-insensitive
sum (Alon, Matias and Szegedy, STOC 1996) holds for:
– Deterministic approximate algorithms
– Randomized algorithms giving exact results
• Goal: Continuously maintain an (ε, δ)-estimate of the time-decayed sum over D
– User inputs: ε, δ
– D: set of distinct elements in R
An (ε, δ)-estimate for X is a random
variable Y such that Pr[|Y − X| > εX] < δ.
11/30
Algorithm for Sum
(High Level Picture)
[Figure: the sum is reduced to a count. Elements v1=4 and v2=8 are expanded into integers, which are then randomly sampled (√ = selected) with SampleRate = p]
• Count the number of selected integers
• Multiply by 1/p
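The two steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each element of value v is treated as v separate integers, and an ordinary pseudo-random generator stands in for whatever sampling mechanism the actual algorithm uses. The function name is hypothetical.

```python
import random

def estimate_sum(values, p, rng):
    """Estimate sum(values): sample each unit with probability p, then scale by 1/p."""
    selected = 0
    for v in values:
        # An element of value v stands for v integers; each is kept with probability p.
        selected += sum(1 for _ in range(v) if rng.random() < p)
    return selected / p

exact = estimate_sum([4, 8], 1.0, random.Random(0))           # p = 1 keeps every integer
approx = estimate_sum([4000, 8000], 0.5, random.Random(1))    # close to 12000, not exact
```

With p = 1 the count is exact; smaller p trades accuracy for a smaller sample.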
12/30
Duplicate Detection
[Figure: random sampling driven by a hash function. Two copies of the same element x (Copy 1, Copy 2) receive the same selection decision (√)]
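One way to see why a hash function makes sampling duplicate-insensitive: the selection decision becomes a fixed function of the observation id, so every copy of an observation is treated identically. The sketch below uses SHA-256 purely as a convenient stand-in for the hash family of the actual algorithm; `selected` and `sampled_count` are hypothetical names.

```python
import hashlib

def selected(obs_id, k, p):
    """Deterministic coin flip: keep the k-th integer of observation obs_id with probability p."""
    digest = hashlib.sha256(f"{obs_id}:{k}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")    # pseudo-uniform in [0, 2**64)
    return h < p * 2**64

def sampled_count(stream, p):
    """Number of distinct selected integers; a repeated observation adds nothing."""
    chosen = set()
    for v, t, obs_id in stream:
        chosen.update((obs_id, k) for k in range(v) if selected(obs_id, k, p))
    return len(chosen)

stream = [(22, 16, 6), (32, 17, 9), (7, 16, 11), (21, 18, 8)]
with_duplicate = stream + [(32, 17, 9)]
```

Here `sampled_count(with_duplicate, p)` equals `sampled_count(stream, p)` for any p, because the second copy of id 9 selects exactly the same integers as the first.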
13/30
Intuition - I
[Figure: element (v, t, id) entering a sample at a given sample rate]
By Chebyshev inequality, for an ε-approximation of
the count with constant probability:
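A standard way to carry out the Chebyshev argument is shown below. The symbols N, p, X and Y, and the constant 3, are assumptions introduced for this derivation; the slide's own bound may use different constants.

```latex
% Assumptions: N integers in total, each selected (pairwise) independently with
% probability p; X is the number selected and Y = X / p is the estimate of N.
\begin{align*}
\mathbb{E}[X] &= pN, \qquad \operatorname{Var}[X] = Np(1-p) \le pN, \\
\Pr\bigl[\,|Y - N| \ge \varepsilon N\,\bigr]
  &= \Pr\bigl[\,|X - pN| \ge \varepsilon pN\,\bigr]
  \le \frac{\operatorname{Var}[X]}{\varepsilon^{2} p^{2} N^{2}}
  \le \frac{1}{\varepsilon^{2}\, pN}.
\end{align*}
% Hence an expected sample size pN >= 3 / \varepsilon^2 keeps the failure probability at most 1/3.
```

In words: the sample rate must be high enough that the expected number of selected integers is on the order of 1/ε².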
14/30
Intuition - II
• t
• t+
• Sample rate ?
15/30
Maintain Multiple Samples
SIZE ??
SampleRate pj
p0 = 1
p1 = 1/2
p2 = 1/4
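A minimal sketch of the multi-level idea, assuming level j samples at rate 2^-j and that the samples are nested (anything kept at level j+1 is also kept at level j). SHA-256 is a stand-in hash, the samples are unbounded Python sets for simplicity, and all names are hypothetical; bounding each sample's size is exactly the open "SIZE ??" question on the slide.

```python
import hashlib

def top_level(item, num_levels):
    """Highest level j at which item is sampled (rate 2**-j), capped at num_levels - 1."""
    h = int.from_bytes(hashlib.sha256(repr(item).encode()).digest()[:8], "big")
    level = 0
    # Each additional trailing zero bit of the hash halves the sampling probability.
    while level < num_levels - 1 and (h >> level) & 1 == 0:
        level += 1
    return level

def build_samples(items, num_levels):
    """One sample per level; sets make repeated items harmless."""
    samples = [set() for _ in range(num_levels)]
    for item in items:
        for j in range(top_level(item, num_levels) + 1):
            samples[j].add(item)
    return samples

def estimate_count(samples, capacity):
    """Use the lowest level whose sample fits in capacity; scale its size by 1/p_j = 2**j."""
    for j, sample in enumerate(samples):
        if len(sample) <= capacity:
            return j, len(sample) * (1 << j)
    last = len(samples) - 1
    return last, len(samples[last]) * (1 << last)

samples = build_samples(range(1000), 8)
level, estimate = estimate_count(samples, 100)
```

Keeping all levels at once means a suitable sample rate does not have to be guessed in advance.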
16/30
Faster Sampling
• RangeSample (Pavan & Tirthapura, SICOMP 2007)
– Efficiently compute the number of selected integers
[Figure: selected (√) integers in samples with SampleRate pj: p0 = 1, p1 = 1/2, p2 = 1/4; SIZE ??]
17/30
Expiry Time
e = (v, t, id)
[Figure: selected (√) integers of e at time t and at its later expiry time]
Binary search over [t, tmax] using RangeSample
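The binary search can be sketched as follows. This is an illustration under several assumptions: the expiry time is taken to be the first time at which none of the element's integers is selected at the given level; `weight(c)` is a hypothetical function returning the element's integral decayed weight at time c (non-increasing in c); and `any_selected` is a brute-force stand-in for RangeSample that uses SHA-256 in place of the real hash.

```python
import hashlib

def any_selected(obs_id, w, level):
    """Brute-force stand-in for RangeSample: is any of the first w integers sampled at rate 2**-level?"""
    for k in range(w):
        h = int.from_bytes(hashlib.sha256(f"{obs_id}:{k}".encode()).digest()[:8], "big")
        if h % (1 << level) == 0:
            return True
    return False

def expiry_time(obs_id, t, t_max, weight, level, probe=any_selected):
    """Smallest time in [t, t_max + 1] at which no integer of the element is selected."""
    lo, hi = t, t_max + 1              # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(obs_id, weight(mid), level):
            lo = mid + 1               # still sampled at mid, so it expires later
        else:
            hi = mid                   # already gone at mid
    return lo

# Hypothetical integral decay for an element with v = 22 created at t = 16.
weight = lambda c: max(0, 22 - 4 * (c - 16))
expiry_level0 = expiry_time(6, 16, 30, weight, 0)
```

The search is valid because the weight only shrinks over time, so "some integer is still selected" switches from true to false exactly once.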
18/30
Sketch Structure
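• The sketch holds one sample per level: Sample 0, Sample 1, Sample 2, …
• The sample rate halves at each level: p=1 at Level 0, 1/2 at Level 1, 1/4 at Level 2, 1/8 at the next level
• Each level j also stores tj: the largest expiry time of all the elements discarded from the sample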
19/30
current time   data (v, t, id)    Expiry0   Expiry1   Expiry2
17             e1 = (22, 16, 6)   22        19        17

Level 0 (p=1):   (e1,22)
Level 1 (p=1/2): (e1,19)
Level 2 (p=1/4): (empty)
20/30
current time   data (v, t, id)    Expiry0   Expiry1   Expiry2
17             e1 = (22, 16, 6)   22        19        17
18             e2 = (32, 17, 9)   23        21        18
18             e3 = (7, 16, 11)   21        16        16

Level 0 (p=1):   (e1,22) (e2,23) (e3,21)
Level 1 (p=1/2): (e1,19) (e2,21)
Level 2 (p=1/4): (empty)
21/30
current time   data (v, t, id)    Expiry0   Expiry1   Expiry2
17             e1 = (22, 16, 6)   22        19        17
18             e2 = (32, 17, 9)   23        21        18
18             e3 = (7, 16, 11)   21        16        16
20             e4 = (21, 18, 8)   23        21        20

Level 0 (p=1):   (e3,21) (e1,22) (e2,23) (e4,23)
                 Discard the element with smallest expiry time
Level 1 (p=1/2): (e1,19) (e2,21) (e4,21)
Level 2 (p=1/4): (empty)
22/30
current time   data (v, t, id)    Expiry0   Expiry1   Expiry2
17             e1 = (22, 16, 6)   22        19        17
18             e2 = (32, 17, 9)   23        21        18
18             e3 = (7, 16, 11)   21        16        16
20             e4 = (21, 18, 8)   23        21        20

Level 0 (p=1), t0 = 21: (e1,22) (e2,23) (e4,23)
Level 1 (p=1/2):        (e1,19) (e2,21) (e4,21)
Level 2 (p=1/4):        (empty)
23/30
current time   data (v, t, id)    Expiry0   Expiry1   Expiry2
17             e1 = (22, 16, 6)   22        19        17
18             e2 = (32, 17, 9)   23        21        18
18             e3 = (7, 16, 11)   21        16        16
20             e4 = (21, 18, 8)   23        21        20
20             e5 = (32, 17, 9)   23        21        18       Duplicate

Level 0 (p=1), t0 = 21: (e1,22) (e2,23) (e4,23)
Level 1 (p=1/2):        (e1,19) (e2,21) (e4,21)
Level 2 (p=1/4):        (empty)
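The walk-through on the last few slides can be replayed with a small Python sketch. It takes the per-level expiry times from the slides as given inputs and assumes each sample holds at most 3 elements, that an element is stored only while its expiry time is later than the current time, and that duplicates are recognised by id. The class `LevelSample` and its fields are hypothetical names.

```python
class LevelSample:
    """One level of the sketch: a bounded sample plus the threshold t."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}     # observation id -> expiry time at this level
        self.t = None       # largest expiry time discarded so far (None: nothing discarded)

    def insert(self, elem_id, expiry, now):
        if expiry <= now or elem_id in self.items:
            return          # already expired at this level, or a duplicate
        self.items[elem_id] = expiry
        if len(self.items) > self.capacity:
            # Discard the element with the smallest expiry time and remember it in t.
            victim = min(self.items, key=self.items.get)
            dropped = self.items.pop(victim)
            self.t = dropped if self.t is None else max(self.t, dropped)

# (arrival time, observation id, expiry times at levels 0, 1, 2), as on the slides.
arrivals = [
    (17, 6, (22, 19, 17)),    # e1
    (18, 9, (23, 21, 18)),    # e2
    (18, 11, (21, 16, 16)),   # e3
    (20, 8, (23, 21, 20)),    # e4
    (20, 9, (23, 21, 18)),    # e5, a duplicate of e2
]
levels = [LevelSample(3) for _ in range(3)]
for now, elem_id, expiries in arrivals:
    for level, expiry in zip(levels, expiries):
        level.insert(elem_id, expiry, now)
```

After the replay, level 0 holds ids 6, 9 and 8 with t0 = 21, level 1 holds the same ids with no discard, and level 2 is empty, which matches the final slide of the walk-through.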
24/30
Answer a Query
for the Decayed Sum
Current time = 20
Level 0 (p=1), t0 = 21: (e1,22) (e2,23) (e4,23)
Level 1 (p=1/2):        (e1,19) (e2,21) (e4,21)   [level used to answer the query; e2 √, e4 √]
Level 2 (p=1/4):        (empty)
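The choice of level in this example can be expressed as a short Python sketch. It assumes that a level is usable once its threshold tj is no later than the current time (everything it ever discarded has expired by then) and that the lowest usable level is preferred. Only the level selection and the live elements are shown; turning them into a sum estimate would additionally count each live element's selected integers and scale by 1/p. `choose_level` is a hypothetical name.

```python
def choose_level(levels, now):
    """levels: list of (t_j, {element: expiry}) ordered by level. Returns (level, live elements)."""
    for j, (t_j, items) in enumerate(levels):
        if t_j is None or t_j <= now:
            # Nothing discarded from this level is still alive at time now.
            live = sorted(e for e, expiry in items.items() if expiry > now)
            return j, live
    return None, []

levels = [
    (21, {"e1": 22, "e2": 23, "e4": 23}),     # level 0, p = 1
    (None, {"e1": 19, "e2": 21, "e4": 21}),   # level 1, p = 1/2
    (None, {}),                               # level 2, p = 1/4
]
answer = choose_level(levels, 20)
```

At time 20, level 0 is skipped because t0 = 21 is still in the future, so level 1 is used with e2 and e4 as the live elements.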
25/30
Over the Whole Sensor N/W
Sketch 1: (e1,6) (e2,9) (e3,13)
Sketch 2: (e4,6) (e5,10) (e3,13)
Result of merging sketch 1&2 (union): (e2,9) (e5,10) (e3,13)
Each sample keeps 3 distinct items with largest expiry time.
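The merge on this slide can be reproduced with the following Python sketch for a single level; merging two full sketches would apply it level by level. The treatment of the threshold t (maximum of both inputs' thresholds and of every expiry time dropped during the merge) is an assumption based on the earlier definition of t, and `merge_samples` is a hypothetical name.

```python
def merge_samples(a, b, capacity):
    """a, b: (t, {element: expiry}) for the same level. Returns the merged (t, items)."""
    t_a, items_a = a
    t_b, items_b = b
    union = {**items_a, **items_b}    # a duplicate has the same id and expiry in both inputs
    ranked = sorted(union.items(), key=lambda kv: kv[1], reverse=True)
    kept, dropped = ranked[:capacity], ranked[capacity:]
    # The new threshold covers everything either input or this merge has discarded.
    candidates = [t for t in (t_a, t_b) if t is not None]
    candidates += [expiry for _, expiry in dropped]
    t = max(candidates) if candidates else None
    return t, dict(kept)

sketch1 = (None, {"e1": 6, "e2": 9, "e3": 13})
sketch2 = (None, {"e4": 6, "e5": 10, "e3": 13})
merged = merge_samples(sketch1, sketch2, 3)
```

The shared element e3 appears once in the result, and the two elements with expiry time 6 are dropped, so the merged sample is (e2,9) (e5,10) (e3,13).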
26/30
Algorithm Complexity
• Space complexity:
• Time complexity
– expected time for processing one item
– Time for answering a query
– Time for merging two sketches
27/30
Conclusion
First sketch that combines all of the following:
• Properties: space logarithmic in the universe size, guaranteed accuracy, any time decay model, asynchronous arrival, duplicate insensitivity
• Aggregates: sum, quantile, frequent elements
• Application: data aggregation under any multi-path routing protocol
28/30
Ongoing and Future Work
• Implementation
– Observed results better than theoretical predictions
• Better duplicate-insensitive sketches for specific decay models?
• Other aggregates, such as variance and clustering?
29/30
THANKS
30/30