HiFi Systems: Network-Centric Query Processing for the Physical World
Michael Franklin, UC Berkeley, 2.13.04

Introduction
Continuing improvements in sensor devices:
– Wireless motes
– RFID
– Cellular-based telemetry
Cheap devices can monitor the environment at a high rate. Connectivity enables remote monitoring at many different scales. There are widely different concerns at each of these levels and scales.

Plan of Attack
– Motivation/Applications/Examples
– Characteristics of HiFi Systems
– Foundational Components: TelegraphCQ and TinyDB
– Research Issues
– Conclusions

The Canonical HiFi System
[Figure: the canonical high fan-in HiFi system architecture.]

RFID – Retail Scenario
"Smart shelves" continuously monitor item addition and removal. This information is sent back through the supply chain.

"Extranet" Information Flow
[Figure: Manufacturers C and D and Retailers A and B exchanging data through an aggregation/distribution service.]

M2M – Telemetry/Remote Monitoring
– Energy monitoring and demand response
– Traffic
– Power generation
– Remote equipment

Time-Shift Trend Prediction
National companies can exploit East Coast/West Coast time differentials to optimize West Coast operations.

Virtual Sensors
Sensors don't have to be physical sensors:
– Network monitoring algorithms for detecting viruses, spam, DoS attacks, etc.
– Disease outbreak detection

Properties
– High fan-in, globally distributed architecture.
– Large data volumes generated at the edges; filtering and cleaning must be done there.
– Successive aggregation as you move inwards: summaries/anomalies continually, details later.
– Strong temporal focus.
– Strong spatial/geographic focus.
– Streaming data and stored data.
– Integration within and across enterprises.

One View of the Design Space (time scale)
– Seconds: filtering, cleaning, alerts (on-the-fly processing).
– Recent history: monitoring, time-series, data mining (combined stream/disk processing).
– Years: archiving, with provenance and schema evolution (disk-based processing).

Another View of the Design Space (geographic scope)
– Local (several readers): filtering, cleaning, alerts.
– Regional centers: monitoring, time-series, data mining over recent history.
– Global (central office): archiving, with provenance and schema evolution.

One More View of the Design Space (aggregate data volume / degree of detail)
– Duplicate elimination (history: hours): filtering, cleaning, alerts.
– Interesting events (history: days): monitoring, time-series, data mining over recent history.
– Trends/archive (history: years): archiving, with provenance and schema evolution.

Building Blocks
– TinyDB
– TelegraphCQ

TelegraphCQ: Monitoring Data Streams
Streaming data:
– Network monitors
– Sensor networks
– News feeds
– Stock tickers
B2B and enterprise apps:
– Supply-chain, CRM, RFID
– Trade reconciliation, order processing, etc.
These are (quasi) real-time flows of events and data. We must manage these flows to drive business (and other) processes, and we can mine the flows to create/adjust business rules or to perform on-line analysis.
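To make "continuous query over a stream" concrete before turning to TelegraphCQ itself, here is a minimal, illustrative sketch — not TelegraphCQ code; the Reading type, threshold, and window length are hypothetical — of a standing filter plus a time-based sliding-window aggregate over an unbounded input:

```python
# Illustrative sketch only (not TelegraphCQ code): a standing filter plus a
# sliding-window aggregate over an unbounded stream of readings.
# The Reading type, threshold, and window length are hypothetical.
from collections import deque
from dataclasses import dataclass

@dataclass
class Reading:
    ts: float      # timestamp (seconds)
    value: float   # e.g., an RFID read count or a sensor magnitude

def continuous_avg(stream, threshold=0.0, window_seconds=60.0):
    """Filter readings, keep a time-based sliding window, emit running averages."""
    window = deque()
    for r in stream:                       # never-ending input
        if r.value <= threshold:           # WHERE-style predicate, applied on the fly
            continue
        window.append(r)
        while window and window[0].ts < r.ts - window_seconds:
            window.popleft()               # slide the window forward
        yield r.ts, sum(x.value for x in window) / len(window)
```

A system like TelegraphCQ expresses this kind of standing computation declaratively, shares work across many concurrent queries, and adapts the plan while it runs.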
TelegraphCQ (Continuous Queries)
An adaptive system for large-scale shared dataflow processing, based on an extensible set of operators:
1) Ingress (data access) operators: wrappers, file readers, sensor proxies.
2) Non-blocking data processing operators: selections (filters), XJoins, …
3) Adaptive routing operators: Eddies, SteMs, Flux, etc.
Operators are connected through "Fjords":
– A queue-based framework unifying push and pull.
– Fjords also allow us to easily mix and match streaming and stored data sources.

Extreme Adaptivity
A spectrum of adaptivity, from static to per-tuple:
– Static plans: current DBMSs.
– Late binding: Dynamic, Parametric, Competitive.
– Inter-operator: Query Scrambling, Mid-query Re-optimization.
– Intra-operator: XJoin, DPHJ, Convergent QP.
– Per-tuple: Eddies, CACQ, PSoup, …
Traditional query optimization depends on statistical knowledge of the data and a stable environment. The streaming world has neither. The extreme end of this spectrum is the region that we are exploring in the Telegraph project.

Adaptivity Overview [Avnur & Hellerstein 2000]
[Figure: a static dataflow over operators A, B, C, D contrasted with an eddy that routes tuples among the same operators.]
• How do we order and reorder operators over time?
– Traditionally, via performance and economic/administrative feedback.
– That won't work for never-ending queries over volatile streams.
• Instead, use adaptive record routing: re-optimization = a change in routing policy.

The TelegraphCQ Architecture
[Architecture figure, summarized:]
– TelegraphCQ Front End: Listener, Parser, Planner, Mini-Executor, Proxy, Catalog.
– TelegraphCQ Back End: CQEddy, with Scan, Split, and other modules.
– Shared memory: query plan queue, eddy control queue, query result queues, buffer pool.
– TelegraphCQ Wrapper ClearingHouse: wrappers over external sources, plus the disk.
A single CQEddy can encode multiple queries.

The StreaQuel Query Language
SELECT projection_list
FROM from_list
WHERE selection_and_join_predicates
ORDEREDBY …
TRANSFORM … TO …
WINDOW … BY …
StreaQuel is the target language for TelegraphCQ. Windows can be applied to individual streams, and window movement is expressed using a "for loop" construct in the TRANSFORM clause. We're not completely happy with our syntax at this point.

Example Window Query: Landmark
[Figure: a landmark window with a fixed start time ST; as NOW advances along the timeline (t = 40, 41, 45, 50, …), the window grows to cover everything from ST up to NOW.]

Current Status – TelegraphCQ
System developed by modifying PostgreSQL. Initial version released Aug 03:
– Open source (PostgreSQL license).
– Shared joins with windows and aggregates.
– Archived/unarchived streams.
– Next major release planned this summer.
Initial users include:
– Network monitoring project at LBL (NetLogger).
– Intrusion detection project at Eurecom (France).
– Our own project on sensor data processing.
– Class projects at Berkeley, CMU, and ???
Visit http://telegraph.cs.berkeley.edu for more information.

TinyDB
A query-based interface to sensor networks, developed on TinyOS/motes. Benefits:
– Ease of programming and retasking.
– Extensible aggregation framework.
– Power-sensitive optimization and adaptivity.
Sam Madden (Ph.D. thesis) in collaboration with Wei Hong (Intel).
http://telegraph.cs.berkeley.edu/tinydb

SELECT MAX(mag)
FROM sensors
WHERE mag > thresh
SAMPLE PERIOD 64ms

[Figure: an application issues queries and triggers to TinyDB, which runs over the sensor network and returns data.]
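As a conceptual illustration of what such a query asks of each node — this is a sketch, not TinyDB source; read_mag and send_to_parent are hypothetical stand-ins for TinyOS sensing and radio calls — a mote-side loop for the query above might boil down to:

```python
# Conceptual sketch only (not TinyDB source): roughly what a mote does for
# "SELECT MAX(mag) FROM sensors WHERE mag > thresh SAMPLE PERIOD 64ms".
# read_mag() and send_to_parent() are hypothetical stand-ins for TinyOS calls.
import time

def run_query(read_mag, send_to_parent, thresh, sample_period_s=0.064):
    while True:                            # continuous, periodic query
        epoch_start = time.monotonic()
        mag = read_mag()                   # sample the sensor once per epoch
        if mag > thresh:                   # apply the WHERE predicate locally
            send_to_parent(("MAX", mag))   # partial aggregate, combined en route
        # sleep (ideally with radio/CPU powered down) until the next epoch
        elapsed = time.monotonic() - epoch_start
        time.sleep(max(0.0, sample_period_s - elapsed))
```

The point of TinyDB is that this behavior comes from the declarative query itself rather than from hand-written embedded code, which is what makes retasking the network as simple as issuing new SQL.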
Declarative Queries in Sensor Nets
Many sensor network applications can be described using query language primitives.
– Potential for tremendous reductions in development and debugging effort.

Example: "Report the light intensities of the bright nests."

SELECT nestNo, light
FROM sensors
WHERE light > 400
EPOCH DURATION 1s

sensors:
Epoch  nestNo  Light  Temp  Accel  Sound
0      1       455    x     x      x
0      2       389    x     x      x
1      1       422    x     x      x
1      2       405    x     x      x

Aggregation Query Example
"Count the number of occupied nests in each loud region of the island."

SELECT region, CNT(occupied), AVG(sound)
FROM sensors
GROUP BY region
HAVING AVG(sound) > 200
EPOCH DURATION 10s

Regions with AVG(sound) > 200:
Epoch  region  CNT(…)  AVG(…)
0      North   3       360
0      South   3       520
1      North   3       370
1      South   3       520

Query Language (TinySQL)
SELECT <aggregates>, <attributes>
[FROM {sensors | <buffer>}]
[WHERE <predicates>]
[GROUP BY <exprs>]
[SAMPLE PERIOD <const> | ONCE]
[INTO <buffer>]
[TRIGGER ACTION <command>]

Sensor Queries @ 10000 Ft
[Figure: a query injected at the root is disseminated down a routing tree of nodes A–F; results flow back up.]
(Almost) all queries are continuous and periodic. They are written in SQL with extensions for:
– Sample rate
– Offline delivery
– Temporal aggregation

In-Network Processing: Aggregation
SELECT COUNT(*) FROM sensors
[Figure sequence: over the intervals of an epoch, each of the five nodes in the routing tree transmits a partial count during its assigned interval; the partial counts are combined as they flow up the tree, so the root reports the count for the whole network.]

In-Network Aggregation: Example Benefits
Simulation: 2500 nodes, 50x50 grid, depth ≈ 10, neighbors ≈ 20.
[Chart: total bytes transmitted vs. aggregation function (EXTERNAL, MAX, AVERAGE, COUNT, MEDIAN).]

Taxonomy of Aggregates
TinyDB insight: classify aggregates according to various functional properties. This yields a general set of optimizations that can be applied automatically. (A small sketch of the partial-state idea appears after the status slide below.)
– Partial state (e.g., MEDIAN: unbounded; MAX: one record) — affects the effectiveness of TAG.
– Duplicate sensitivity (e.g., MIN: duplicate-insensitive; AVG: duplicate-sensitive) — affects routing redundancy.
– Exemplary vs. summary (e.g., MAX: exemplary; COUNT: summary) — affects the applicability of sampling and the effect of loss.
– Monotonicity (e.g., COUNT: monotonic; AVG: non-monotonic) — affects hypothesis testing and snooping.

Current Status – TinyDB
System built on top of TinyOS (~10K lines of embedded C code). Latest release 9/2003. Several deployments, including the redwoods at the UC Botanical Garden.
[Charts: relative humidity (%) and temperature (°C) vs. time, 7/7/03–9/7/03, for sensors 101–111 mounted at heights from 10m to 33m in the tree.]
Visit http://telegraph.cs.berkeley.edu/tinydb for more information.
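As promised above, here is a minimal sketch of the partial-state idea from the aggregate taxonomy — in the spirit of TAG, but not TinyDB code. AVG carries a (sum, count) pair so that partial results can be merged anywhere in the routing tree; MAX needs only a single value.

```python
# Minimal sketch of "partial state" (in the spirit of TAG, not TinyDB code).
# AVG's partial state is a (sum, count) pair; MAX's is a single value.
def avg_init(v):     return (v, 1)
def avg_merge(a, b): return (a[0] + b[0], a[1] + b[1])
def avg_final(a):    return a[0] / a[1]

def max_init(v):     return v
def max_merge(a, b): return a if a >= b else b
def max_final(a):    return a

def aggregate_up(node_value, child_states, init, merge):
    """What each node does per epoch: fold its children's partial states
    into its own reading and forward one partial state to its parent."""
    state = init(node_value)
    for s in child_states:
        state = merge(state, s)
    return state

# Example: a node with reading 20 and children reporting (55, 3) and (30, 2)
# forwards (105, 6) for AVG; the root finalizes it to 17.5.
```

This is why bounded-state aggregates keep in-network traffic to one small record per node, while a holistic aggregate such as MEDIAN (unbounded partial state) cannot be compressed this way.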
Putting It All Together?
[Figure: the two building blocks, TinyDB and TelegraphCQ, combined.]

Ursa – A HiFi Implementation
A current effort towards building an integrated infrastructure that spans the large scale in:
– Time
– Geography
– Resources
The layers: Ursa-Minor (TinyDB-based), a mid-tier (???), and Ursa-Major (TelegraphCQ with archiving).

TelegraphCQ/TinyDB Integration
Fjords [Madden & Franklin 02] provide the dataflow plumbing necessary to use TinyDB as a data stream (a minimal sketch of the Fjord idea appears at the end of this deck). The main issues revolve around what to run where:
– TCQ is a query processor; TinyDB is also a query processor.
– Optimization criteria include total cost, response time, answer quality, answer likelihood, power conservation on the motes, …
The project is on-going and should work by summer.
Related work: the Gigascope work at AT&T.

TCQ-based Overlay Network
TCQ is primarily a single-node system; Flux operators [Shah et al 03] support cluster-based processing. We want to run TCQ at each internal node. The primary issue is support for wide-area temporal and geographic aggregation — in an adaptive manner, of course. Currently under design.
Related work: Astrolabe, IRISNet, DBIS, …

Querying the Past, Present, and Future
We need to handle archived data:
– Adaptive compression can reduce processing time.
– Historical queries.
– Joins of live and historical data.
– Dealing with detail information that arrives later.
Archiving Storage Manager: a split-stream storage manager for combined stream and disk-based processing. An initial version of the new storage manager is running.
Related work: temporal and time-travel DBs.

XML, Integration, and Other Realities
– Eventually we need to support XML.
– We must integrate with existing enterprise apps; in many areas, standardization is well underway.
– Augmenting moving data: high fan-in, high fan-out.
Related work: YFilter [Diao & Franklin 03], Mutant Queries [Papadimos et al., OGI], 30+ years of data integration research, 10+ years of XML research, …

Conclusions
– Sensors, RFIDs, and other data collection devices enable real-time enterprises.
– These will create high fan-in systems.
– We can exploit recent advances in streaming and sensor data management.
– Lots to do!
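As referenced under TelegraphCQ/TinyDB Integration, here is a minimal illustrative sketch — not the Fjords implementation; all class and function names are hypothetical — of the Fjord idea: a queue abstraction that lets one dataflow mix push-based sources (sensor proxies delivering tuples as they arrive) with pull-based sources (iterators over stored data), without any operator blocking on a silent source.

```python
# Illustrative sketch only (not the Fjords implementation): queues that unify
# push and pull behind one non-blocking get() interface.
import queue

class PushQueue:
    """Producer-driven: a sensor proxy deposits tuples as they arrive."""
    def __init__(self):
        self._q = queue.Queue()
    def push(self, tup):
        self._q.put(tup)
    def get(self):
        try:
            return self._q.get_nowait()   # None means "nothing yet, don't block"
        except queue.Empty:
            return None

class PullQueue:
    """Consumer-driven: tuples are fetched on demand from a stored source."""
    def __init__(self, iterator):
        self._it = iterator
    def get(self):
        return next(self._it, None)       # None means "source exhausted"

def operator_step(inputs, process):
    """One scheduling step: poll each input queue and process whatever is
    ready, never blocking on a slow or silent source."""
    for q in inputs:
        tup = q.get()
        if tup is not None:
            process(tup)
```

With both kinds of sources behind the same non-blocking interface, the same operator and scheduler code can combine live sensor streams with archived tables — the mix of streaming and stored data that the HiFi architecture calls for.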