Time-Decaying Sketches for Sensor Data Aggregation

Graham Cormode, AT&T Labs, Research
Srikanta Tirthapura, Dept. of Electrical and Computer Engineering, Iowa State University
Bojian Xu, Dept. of Electrical and Computer Engineering, Iowa State University

Mean of the Temperatures in the Last 30 Minutes
[Figure: timestamped temperature readings, grouped as in the original figure]
• 75F (11:39), 76F (11:34), 72F (11:29), 73F (11:19)
• 78F (11:41), 73F (11:39), 76F (11:38), 76F (11:26)
• 76F (11:45), 73F (11:40)
• 79F (11:30), 70F (11:22), 76F (11:15)
• 76F (11:45), 80F (11:38), 79F (11:30), 76F (11:25)

Sketch
[Figure: the same readings as on the previous slide]

[Figure labels: Sketch, Merging, Answer]

General Time Decay
• General decay function: f(age)
• Time-decayed value of an element (v, t, id) at time c: v · f(c − t)
[Plot: f against age, starting at age 0]

Formal Model of the Data (on One Sensor)
• Data stream: e0 = (v0, t0, id0), e1 = (v1, t1, id1), …
  – v: value
  – t: timestamp of creation
  – id: a unique id of the observation
• User-defined time decay function: f
• Asynchronous arrival: ti > tj is possible even when i < j
• Duplicates: idi = idj is possible
  – Assume: if idi = idj, then vi = vj and ti = tj

Contribution
First mergeable sketch that combines the following:
• Space logarithmic in the universe size
• Guaranteed accuracy
• Any time decay model
• Asynchronous arrival
• Duplicate insensitive
• Aggregates: sum, quantile, frequent elements
• Data aggregation under any multi-path routing protocol

Related Work
[Table: compares prior work 1 to 4 and our work on six properties: any time decay model, asynchronous arrival, duplicate insensitive, sum, quantile, frequent elements. The placement of the check marks was lost in extraction.]
1. S. Nath, P. B. Gibbons, S. Seshan and Z. R. Anderson, “Synopsis diffusion for robust aggregation in sensor networks”, SenSys 2004
2. J. Considine, F. Li, G. Kollios and J. Byers, “Approximate Aggregation Techniques for Sensor Databases”, ICDE 2004
3. E. Cohen and M. Strauss, “Maintaining time-decaying stream aggregates”, PODS 2003; Journal of Algorithms, 2006
4. S. Tirthapura, B. Xu and C. Busch, “Sketching Asynchronous Streams Over Sliding Windows”, PODC 2006

Outline
• Problem: time-decayed sum of distinct elements over an asynchronous stream
• Focus on the integral decay model: the time-decayed value is always an integer

Estimate of the Sum (on One Sensor)
• Given:
  – Stream: R = (v0, t0, id0), …, (vn, tn, idn), …
  – User-defined decay function: f()
• Maintain:
  – c: current time
  – D: set of distinct elements in R

Estimate of the Sum (cont’d)
• Linear-space lower bounds for duplicate-insensitive sum (Alon, Matias and Szegedy, STOC 1996) apply to:
  – Deterministic approximation algorithms
  – Randomized algorithms that give exact results
• Goal: continuously maintain an (ε, δ)-estimate of the decayed sum V = Σ v · f(c − t), taken over all (v, t, id) in D
  – User inputs: ε, δ
  – D: set of distinct elements in R
An (ε, δ)-estimate for X is a random variable Y such that Pr[|Y − X| > εX] < δ.

Algorithm for Sum (High Level Picture)
[Figure: values v1 = 4 and v2 = 8; the sum is reduced to a count via random sampling with SampleRate = p; check marks show the selected integers]
• Count the number of selected integers
• Multiply by 1/p

Duplicate Detection
[Figure labels: Hash Function, Random Sampling, Select, x, Copy 1, Copy 2]

Intuition - I
[Figure labels: element (v, t, id), sample rate, sample]
By the Chebyshev inequality, for an ε-approximation of the count with constant probability: [formula lost in extraction]

Intuition - II
• t
• t + [symbol lost in extraction]
• Sample rate?

Maintain Multiple Samples
[Figure: samples of size "SIZE ??" at sample rates pj: p0 = 1, p1 = 1/2, p2 = 1/4]

Faster Sampling
• RangeSample (Pavan & Tirthapura, SICOMP 2007)
  – Efficiently computes the number of selected integers
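The sampling estimator from the "Algorithm for Sum" and "Duplicate Detection" slides can be illustrated with a minimal Python sketch. It makes several assumptions that are not in the slides: the names `selected` and `estimate_sum` are hypothetical, SHA-256 stands in for the hash function, the stream carries only (value, id) pairs with no timestamps or decay, and the inner loop visits every unit integer one by one. That naive loop is the step RangeSample is cited as making efficient; this code does not implement RangeSample.

```python
import hashlib

def selected(obs_id: int, k: int, level: int) -> bool:
    """True if the k-th unit integer of observation obs_id is sampled
    at rate 2**-level. The hash is deterministic, so every copy of the
    same observation selects exactly the same unit integers."""
    digest = hashlib.sha256(f"{obs_id}:{k}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")   # 64-bit hash value
    return h % (1 << level) == 0            # holds for about a 2**-level fraction

def estimate_sum(stream, level: int) -> int:
    """Estimate the sum of values over distinct ids.

    stream: iterable of (value, obs_id) pairs; duplicates are allowed.
    level:  sample rate p = 2**-level (level 0 means p = 1).
    """
    sample = set()                          # selected (id, k) pairs
    for value, obs_id in stream:
        for k in range(value):              # naive scan over the value's unit integers
            if selected(obs_id, k, level):
                sample.add((obs_id, k))     # a duplicate adds nothing new
    return len(sample) * (1 << level)       # count, multiplied by 1/p

# Values from the slide: v1 = 4 and v2 = 8, with the second one sent twice.
stream = [(4, 1), (8, 2), (8, 2)]
print(estimate_sum(stream, 0))              # p = 1: every unit integer is kept
print(estimate_sum(stream, 1))              # p = 1/2: an estimate, not exact
```

Because selection depends only on the hash of (id, k), the estimate is unchanged by repeated or reordered deliveries, which is the property multi-path routing needs.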
[Figure: samples at sample rates pj: p0 = 1, p1 = 1/2, p2 = 1/4]

Expiry Time
[Figure: element e = (v, t, id) shown at time t with check marks, and again at time t + [symbol lost in extraction] = expiry time]
• Binary search for the expiry time over [t, tmax] using RangeSample

Sketch Structure
• Level 0: p = 1, t0 (Sample 0)
• Level 1: p = 1/2, t1
• Level 2: p = 1/4, t2
• Further levels continue with p = 1/8
• tj: largest expiry time of all the elements discarded from the sample at level j

current time      17
data (v, t, id)   e1 = (22, 16, 6)
Expiry0           22
Expiry1           19
Expiry2           17

Level 0 (p = 1):    (e1,22)
Level 1 (p = 1/2):  (e1,19)
Level 2 (p = 1/4):

current time      17            18            18
data (v, t, id)   e1            e2            e3
                  (22, 16, 6)   (32, 17, 9)   (7, 16, 11)
Expiry0           22            23            21
Expiry1           19            21            16
Expiry2           17            18            16

Level 0 (p = 1):    (e1,22) (e2,23) (e3,21)
Level 1 (p = 1/2):  (e1,19) (e2,21)
Level 2 (p = 1/4):

current time      17            18            18            20
data (v, t, id)   e1            e2            e3            e4
                  (22, 16, 6)   (32, 17, 9)   (7, 16, 11)   (21, 18, 8)
Expiry0           22            23            21            23
Expiry1           19            21            16            21
Expiry2           17            18            16            20

Discard the element with the smallest expiry time.
Level 0 (p = 1):    (e3,21) (e1,22) (e2,23) (e4,23)
Level 1 (p = 1/2):  (e1,19) (e2,21) (e4,21)
Level 2 (p = 1/4):

After the discard (same stream as on the previous slide):
Level 0 (p = 1), t0 = 21:  (e1,22) (e2,23) (e4,23)
Level 1 (p = 1/2):         (e1,19) (e2,21) (e4,21)
Level 2 (p = 1/4):

current time      17            18            18            20            20
data (v, t, id)   e1            e2            e3            e4            e5 (duplicate)
                  (22, 16, 6)   (32, 17, 9)   (7, 16, 11)   (21, 18, 8)   (32, 17, 9)
Expiry0           22            23            21            23            23
Expiry1           19            21            16            21            21
Expiry2           17            18            16            20            18

Level 0 (p = 1), t0 = 21:  (e1,22) (e2,23) (e4,23)
Level 1 (p = 1/2):         (e1,19) (e2,21) (e4,21)
Level 2 (p = 1/4):

Answer a Query for the Decayed Sum
Current time = 20
Level 0 (p = 1), t0 = 21:  (e1,22) (e2,23) (e4,23)
Level 1 (p = 1/2):         (e1,19) (e2,21) (e4,21)   [level used to answer the query; e2 and e4 are checked]
Level 2 (p = 1/4):

Over the Whole Sensor Network
• Sketch 1: (e1,6) (e2,9) (e3,13)
• Sketch 2: (e4,6) (e5,10) (e3,13)
• Result of merging sketches 1 and 2 (union): (e2,9) (e5,10) (e3,13)
Each sample keeps the 3 distinct items with the largest expiry times.

Algorithm Complexity
• Space complexity: [bound lost in extraction]
• Time complexity:
  – Expected time for processing one item: [bound lost in extraction]
  – Time for answering a query: [bound lost in extraction]
  – Time for merging two sketches: [bound lost in extraction]

Conclusion
First sketch that combines the following:
• Space logarithmic in the universe size
• Guaranteed accuracy
• Any time decay model
• Asynchronous arrival
• Duplicate insensitive
• Aggregates: sum, quantile, frequent elements
• Data aggregation under any multi-path routing protocol

Ongoing and Future Work
• Implementation
  – Observed results are better than the theoretical predictions
• Better duplicate-insensitive sketches for specific decay models?
• Other aggregates, such as variance or clustering?

THANKS
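The maintenance, query, and merge steps shown in the worked example slides can be approximated by the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Level`, `Sketch`, `query_level`, and `merge` are hypothetical, and the per-level expiry times are passed in as inputs (copied from the slides) rather than computed from a decay function with RangeSample. The query only reports which level would be used and which elements are still live; turning that into a decayed-sum estimate needs the per-element counts of selected integers, which are omitted here.

```python
class Level:
    """One sample of the sketch: at most `capacity` elements, keyed by id."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = {}   # id -> expiry time at this level
        self.t = None     # largest expiry time among discarded elements

    def insert(self, elem_id, expiry, now):
        if expiry <= now or elem_id in self.items:
            return        # already expired here, or a duplicate
        if self.t is not None and expiry <= self.t:
            return        # would never count once this level is usable
        self.items[elem_id] = expiry
        if len(self.items) > self.capacity:
            victim = min(self.items, key=self.items.get)  # smallest expiry
            gone = self.items.pop(victim)
            self.t = gone if self.t is None else max(self.t, gone)


class Sketch:
    def __init__(self, num_levels=3, capacity=3):
        self.levels = [Level(capacity) for _ in range(num_levels)]

    def insert(self, elem_id, expiries, now):
        """expiries[j] is the element's expiry time at level j (assumed given)."""
        for level, expiry in zip(self.levels, expiries):
            level.insert(elem_id, expiry, now)

    def query_level(self, now):
        """Lowest level whose discarded elements have all expired by `now`."""
        for j, level in enumerate(self.levels):
            if level.t is None or level.t <= now:
                live = sorted(i for i, e in level.items.items() if e > now)
                return j, live
        return None


def merge(a, b):
    """Union of two samples at the same level, truncated to capacity."""
    out = Level(a.capacity)
    known = [t for t in (a.t, b.t) if t is not None]
    out.t = max(known) if known else None
    union = {**a.items, **b.items}          # same id implies same expiry
    ranked = sorted(union.items(), key=lambda kv: kv[1], reverse=True)
    out.items = dict(ranked[:out.capacity])
    for _, expiry in ranked[out.capacity:]:  # record what was dropped
        out.t = expiry if out.t is None else max(out.t, expiry)
    return out
```

Replaying the example stream (ids 6, 9, 11, 8, then the duplicate id 9) with `Sketch.insert` and the expiry times from the slides leaves level 0 with t0 = 21 and three elements, and `query_level(20)` selects level 1 with the elements whose ids are 8 and 9, that is e4 and e2. Keying each sample by id is what makes both insertion and merging duplicate insensitive.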