Introduction to Data Streams
Yannis Theodoridis
Information Systems Laboratory (InfoLab), Dept of Informatics, Univ. Piraeus (http://infolab.cs.unipi.gr)

Outline
• Introduction - Applications
• Data Stream Model
  - Data Model
  - Query Model
  - Synopses
• Data Stream Projects
• Wireless Sensor Networks

Introduction
Data streams differ from the conventional stored relation model:
• Data elements in the stream arrive online
• The system has no control over the order in which data elements arrive to be processed
• Data streams are potentially unbounded in size
• Once an element from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily unless it is stored in memory, which is small relative to the size of the data streams
Operating in the data stream model does not preclude the use of data in conventional stored relations.

Applications
• Web-based financial search engines that evaluate queries over real-time streaming financial data such as stock tickers and news feeds (e.g., Traderbot)
• Modern security applications (e.g., iPolicy Networks)
  - Provide an integrated security platform with services such as firewall support and intrusion detection over multi-gigabit network packet streams
  - Need to perform complex stream processing, including URL filtering based on table lookups and correlation across multiple network traffic flows
• Large web sites monitor web logs (clickstreams) online to enable applications such as personalization, performance monitoring, and load balancing.
  (e.g., Yahoo)
• Sensor monitoring
• Network traffic management

The Database Model
[Figure labels: User/Application; Register Query; Results; Database Management System (DBMS) containing the Query Processor, S/W to access Stored Data, Memory, Stored Data]

The Data Stream Model
[Figure labels: User/Application; Register Query; Results; Data Stream Management System (DSMS) containing the Stream Query Processor, S/W to access Streaming Data, Scratch Space (Memory and/or Disk)]

DBMS vs. DSMS
DBMS                                              | DSMS
Persistent relations                              | Transient streams
One-time queries                                  | Continuous queries
Random access                                     | Sequential access
"Unbounded" disk store                            | Bounded main memory
Only current state matters                        | History/arrival order is critical
Passive repository                                | Active stores
Relatively low update rate                        | Possibly multi-GB arrival rate
No real-time services                             | Real-time requirements
Assume precise data                               | Data stale/imprecise
Access plan determined by query processor,        | Unpredictable/variable data arrival
physical DB design                                | and characteristics

Data Model
• Append-only: call records
• Updates: stock tickers
• Deletes: transactional data
• Meta-data: control signals, punctuations

Query Model
[Figure (PODS 2002): User/Application, Query Processor, DSMS, annotated along three dimensions]
• Query Registration: predefined; ad hoc; predefined, inactive until invoked
• Answer Availability: one-time; event/timer based; multiple-time, periodic; continuous (stored or streamed)
• Stream Access: arbitrary; weighted history; sliding window (special case: size = 1)

Queries: One-time Queries and Continuous Queries
• One-time queries
  - Evaluated once over a point-in-time snapshot of the data set
• Continuous queries
  - Evaluated continuously as data streams continue to arrive
  - The answer may be stored and updated as new data arrives, or may itself be produced as a data stream

Queries: Predefined and Ad Hoc Queries
• Predefined
  - Supplied to the data stream management system before any relevant data has arrived
  - Usually continuous queries
  - Scheduled one-time queries are possible
• Ad hoc
  - Can be either one-time or continuous queries
  - Complicate the design of a data stream management system (DSMS): they are not known in advance for purposes of query optimization, and answering them correctly may require referencing data that has already arrived on the data streams and has potentially already been discarded

Making Things Concrete
[Figure: a caller and a callee connected through two Central Offices; each office sends a stream to the DSMS]
  Outgoing (call_ID, caller, time, event)
  Incoming (call_ID, callee, time, event)
  event = start or end

Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
  SELECT O1.call_ID, O1.caller
  FROM Outgoing O1, Outgoing O2
  WHERE (O2.time - O1.time > 2
    AND O1.call_ID = O2.call_ID
    AND O1.event = 'start'
    AND O2.event = 'end')
• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end

Query 2 (join)
Pair up callers and callees
  SELECT O.caller, I.callee
  FROM Outgoing O, Incoming I
  WHERE O.call_ID = I.call_ID
• Can still provide result as data stream
• Requires unbounded temporary storage ...
• ... unless streams are near-synchronized

Query 3 (group-by aggregation)
Total connection time for each caller
  SELECT O1.caller, sum(O2.time - O1.time)
  FROM Outgoing O1, Outgoing O2
  WHERE (O1.call_ID = O2.call_ID
    AND O1.event = 'start'
    AND O2.event = 'end')
  GROUP BY O1.caller
• Cannot provide result in (append-only) stream
  - Output updates?
  - Provide current value on demand?
  - Memory?
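As a rough illustration of how Query 1 and Query 3 could be evaluated in one pass over the Outgoing stream, here is a minimal Python sketch. It is not taken from any DSMS; the function name `process_outgoing`, the tuple layout, and the assumption that `time` is measured in minutes are all hypothetical choices made for the example.

```python
from collections import defaultdict


def process_outgoing(events):
    """One pass over Outgoing tuples (call_id, caller, time, event).

    Assumes time is in minutes and event is "start" or "end".
    Returns (long_calls, totals):
      long_calls: (call_id, caller) pairs for calls longer than 2 minutes (Query 1)
      totals:     caller -> total connection time of finished calls (Query 3)
    """
    open_calls = {}              # call_id -> (caller, start time); one entry per unfinished call
    totals = defaultdict(float)  # running aggregate per caller
    long_calls = []

    for call_id, caller, time, event in events:
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end" and call_id in open_calls:
            start_caller, start_time = open_calls.pop(call_id)
            duration = time - start_time
            totals[start_caller] += duration      # Query 3: the value is updated in place
            if duration > 2:                      # Query 1: an append-only result
                long_calls.append((call_id, start_caller))
    return long_calls, dict(totals)


# Hypothetical sample stream, ordered by arrival.
stream = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (2, "bob", 2, "end"),
    (1, "alice", 5, "end"),
]
long_calls, totals = process_outgoing(stream)
```

The `open_calls` dictionary holds one entry per call whose end has not yet been seen, which is where the unbounded storage noted for Query 1 shows up. The sketch reports a long call only when its end event arrives; reporting it as soon as it has been open for 2 minutes, as the notes under Query 1 suggest is possible, would need a clock or timer and is left out. The `totals` values change over time, which is why Query 3 cannot be delivered as an append-only stream.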
Approximate Query Answering
• Data streams are potentially unbounded in size, so the amount of storage required to compute an exact answer to a query may grow without bound
• Since we are limited to a bounded amount of memory, it may not be possible to produce exact answers
• High-quality approximate answers can be an acceptable solution
• Techniques for data reduction and synopsis construction

Synopses
• Histograms
• Wavelets
• Sketches

Synopses - Histograms
Equi-Depth Histograms
• Idea: select buckets such that counts per bucket are equal
[Figure: count per bucket over domain values 1 to 20]
V-Optimal Histograms
• Idea: select buckets to minimize the frequency variance within buckets
    \min \sum_{B} \sum_{v \in B} \left( f_v - \frac{f_B}{V_B} \right)^2
  (f_v: frequency of value v; f_B / V_B: average frequency of the values in bucket B)
[Figure: count per bucket over domain values 1 to 20]

Synopses - Wavelets
• Wavelets: mathematical tool for hierarchical decomposition of functions/signals
• Haar wavelets: simplest wavelet basis, easy to understand and implement
  - Recursive pairwise averaging and differencing at different resolutions

  Resolution | Averages                 | Detail Coefficients
  3          | [2, 2, 0, 2, 3, 5, 4, 4] | ----
  2          | [2, 1, 4, 4]             | [0, -1, -1, 0]
  1          | [1.5, 4]                 | [0.5, 0]
  0          | [2.75]                   | [-1.25]

  Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
• Compression by ignoring small coefficients

Haar Wavelet Coefficients
• Hierarchical decomposition structure
[Figure: coefficient tree with root 2.75, then -1.25, then 0.5 and 0, then 0, -1, -1, 0, above the original data 2 2 0 2 3 5 4 4; each left branch is labeled + and each right branch -]
• Reconstruct data values d(i)
  - d(i) = ∑ (+/-1) * (coefficient on path)
  - For example, the first data value is 2.75 + (-1.25) + 0.5 + 0 = 2
• Coefficient thresholding: only B << |D| coefficients can be kept
  - B is determined by the available synopsis space
  - Keep the B largest coefficients in absolute normalized value
  - Provably optimal in terms of the overall Sum Squared (L2) Error

Sketches
e.g. [Alon96]: inner product of α with O(log(N/δ)/ε^2) pseudorandom {-1,+1} vectors

  i      | 1   2   3   4   5   6   7   8 | inner product with α
  α[i]   | 4   2   7   1   0   3   5   4 |
  r1[i]  | 1  -1  -1   1  -1   1   1   1 |  8
  r2[i]  | -1  1   1  -1   1   1  -1  -1 | -2
  r3[i]  | -1 -1   1   1  -1  -1   1  -1 |  0

  sketch(α) = (8, -2, 0)
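The Haar decomposition and the sketch table above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input length is assumed to be a power of two, and the three {-1,+1} vectors are copied from the table rather than generated from seeds as a real streaming sketch would do.

```python
def haar_decompose(values):
    """Recursive pairwise averaging and differencing.

    Returns [overall average, detail coefficients from coarsest to finest].
    Assumes len(values) is a power of two.
    """
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details   # coarser levels go in front
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: each average splits into (avg + d, avg - d)."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        values = [v for avg, d in zip(values, details) for v in (avg + d, avg - d)]
        pos += len(details)
    return values


def sketch(alpha, random_vectors):
    """Inner product of alpha with each {-1,+1} vector."""
    return [sum(a * r for a, r in zip(alpha, vec)) for vec in random_vectors]


data = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(data)      # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
restored = haar_reconstruct(coeffs)

alpha = [4, 2, 7, 1, 0, 3, 5, 4]
r = [
    [1, -1, -1, 1, -1, 1, 1, 1],
    [-1, 1, 1, -1, 1, 1, -1, -1],
    [-1, -1, 1, 1, -1, -1, 1, -1],
]
sk = sketch(alpha, r)              # [8, -2, 0]
```

Coefficient thresholding is not shown. It would zero all but the B coefficients that are largest in absolute normalized value before calling `haar_reconstruct`.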
[Figure: streaming model. Labels: incoming data stream, sketch, seeds, queries, wavelet/histogram approximation]

Stream Projects
• Amazon/Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet XML databases
• OpenCQ (Georgia): triggers, incr. view maintenance
• STREAM (Stanford): general-purpose DSMS
• Tapestry (Xerox): pub/sub content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tribeca (Bellcore): network monitoring

Aurora/STREAM Overview
[Figure labels: input streams; query plans with operators (π, σ) marked as running, ready, or waiting; synopses; historical storage; output streams]
• Applications register continuous queries
• Users issue continuous and ad-hoc queries
• Administrator monitors query execution and adjusts run-time parameters

Adaptivity (Telegraph)
[Figure labels: input streams R, S, T; EDDY; grouped filter (R.A); grouped filter (S.B); STeMs for join; output queues]
• Runtime adaptivity
• Multi-query optimization
• Framework: implements arbitrary schemes

Query-Split Scheme (Niagara)
[Figure labels: Quotes.XML; scan; join on Symbol = Const.Value; constant table (IBM: file i, MSFT: file j); split; file scans for file i and file j; trig.Act.i; trig.Act.j]
• Aggregate subscription for efficiency
• Split: evaluate trigger only when file updated
• Triggers: multi-query optimization

Shared Predicates [Niagara, Telegraph]
Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
[Figure: the predicates indexed by operator (>: 1, 7, 11; <: 3, 5; =: 6, 8; ≠: 9) and probed with a tuple having A = 8]

WSNs - An Introduction (1/3)
The vision:
• Push connectivity out of the PC and into the real world
• Billions of sensors and actuators everywhere
• Zero configuration and administrative cost
• Build everything out of CMOS so that each device costs pennies
• Enable new sensing paradigms
• New challenges in data stream management

WSNs - An Introduction (2/3)
Wireless Sensor Networks utility:
• Scatter cheap, tiny motes in an area of interest
• Perform querying operations
• Obtain reports of the physical quantities under study
• Support sampling procedures, alert mechanism infrastructures, decision-making processes, etc.

WSNs - An Introduction (3/3)
Mote features
• Low power supply, low power, low power...
• Low processing capabilities
• Constrained memory capacity
Network features
• Wireless, multi-hop communication using ISM radio bands (433 MHz to 2.4 GHz)
• Ad-hoc network topologies

Sensor Net Sample Apps
• Habitat monitoring: storm petrels on Great Duck Island, microclimates on James Reserve
• Vehicle detection: sensors along a road collect data about passing vehicles
• Earthquake monitoring in shake-test sites
• Traditional monitoring apparatus

Communication In Sensor Nets
• Radio communication has high link-level losses: typically about 20% at 5 m
• Ad-hoc neighbor discovery
• TAG → tree-based routing
[Figure: nodes A to F connected in a routing tree]

Declarative Queries for Sensor Nets
Examples:
Query 1
  SELECT nodeid, light
  FROM sensors
  WHERE light > 400
  EPOCH DURATION 1s

  Sensors
  Epoch | Nodeid | Light | Temp | Accel | Sound
  0     | 1      | 455   | x    | x     | x
  0     | 2      | 389   | x    | x     | x
  1     | 1      | 422   | x    | x     | x
  1     | 2      | 405   | x    | x     | x

Aggregation Queries
Query 2
  SELECT AVG(sound)
  FROM sensors
  EPOCH DURATION 10s

  Epoch | AVG(sound)
  0     | 440
  1     | 445

Query 3 (rooms with sound > 200)
  SELECT roomNo, AVG(sound)
  FROM sensors
  GROUP BY roomNo
  HAVING AVG(sound) > 200
  EPOCH DURATION 10s

  Epoch | roomNo | AVG(sound)
  0     | 1      | 360
  0     | 2      | 520
  1     | 1      | 370
  1     | 2      | 520

Example: Network Organization into Clusters
[Figure: base station and sensor nodes organized into clusters]
  SELECT f_aggr(attr), attr,…
  FROM sensors
  WHERE predicate
  OUTPUT ACTION action
  SAMPLE PERIOD t sec FOR d sec
Query dissemination:
• The base station poses a query of interest
• The query is received by nearby clusterheads
• Clusterheads propagate the query to their peers
• ... and to nodes within their cluster
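In a clustered network like the one above, an aggregate such as Query 3 does not require every raw reading to reach the base station: each clusterhead can reduce its readings to a small partial state and forward only that. The Python sketch below illustrates the idea for AVG with (sum, count) partial states. It is an assumption-laden illustration, not the method of any particular system; the function names, the two clusters, and the individual readings are hypothetical and were chosen only so that the result matches the epoch 0 row of the Query 3 table.

```python
from collections import defaultdict


def partial_avg(readings):
    """Clusterhead step: fold (room_no, sound) readings into room -> (sum, count)."""
    state = defaultdict(lambda: [0.0, 0])
    for room_no, sound in readings:
        state[room_no][0] += sound
        state[room_no][1] += 1
    return {room: (s, c) for room, (s, c) in state.items()}


def merge(partials):
    """Forwarder/base station step: combine partial states from several clusterheads."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for room, (s, c) in partial.items():
            merged[room][0] += s
            merged[room][1] += c
    return {room: (s, c) for room, (s, c) in merged.items()}


def finalize(state, threshold=200):
    """Base station step: compute AVG per room and apply HAVING AVG(sound) > threshold."""
    return {room: s / c for room, (s, c) in state.items() if s / c > threshold}


# Hypothetical readings for one epoch, split across two clusters.
cluster_a = [(1, 350), (2, 500)]
cluster_b = [(1, 370), (2, 540)]

answer = finalize(merge([partial_avg(cluster_a), partial_avg(cluster_b)]))
# answer == {1: 360.0, 2: 520.0}
```

Forwarding (sum, count) rather than a per-cluster average keeps the final result exact even when clusters contribute different numbers of readings for the same room.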
Query answer:
• Clusterheads collect mote measurements
• Send data to forwarders
• Forwarders propagate the data towards the base station
• The answer reaches the base station

Summary
• Novel data management issues arise when dealing with data streams
  - Data modeling, query processing, etc.
• Exact answers are hard to provide, so approximate answers with theoretical error bounds are used instead
• There is a lot of ongoing work that deals with streams, while wireless sensor networks pose new challenges in handling them

Reading list
• N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. Proc. ACM STOC, 1996.
• A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. VLDB, 2001.
• S. Madden, M. J. Franklin, J. M. Hellerstein, W. Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. Proc. OSDI, 2002.
• D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik. Aurora: a new model and architecture for data stream management. VLDB Journal, 2003.
• S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, M. A. Shah. TelegraphCQ: continuous dataflow processing. VLDB Journal, 2003.