Implementation and Research Issues in Query Processing for Wireless Sensor Networks
Wei Hong, Intel Research, Berkeley (whong@intel-research.net)
Sam Madden, MIT (madden@csail.mit.edu)
Adapted by L.B.

Declarative Queries
• Programming sensor network apps is hard:
  – Limited power budget
  – Lossy, low-bandwidth communication
  – Long-lived, zero-administration deployments required
  – Distributed algorithms
  – Limited tools and debugging interfaces
• Queries abstract away much of the complexity
  – The burden shifts to the database developers
  – Users get:
    • Safe, optimizable programs
    • Freedom to think about apps instead of details

TinyDB: Prototype Declarative Query Processor
• Platform: Berkeley Motes + TinyOS
• Continuous variant of SQL: TinySQL
• Power- and data-acquisition-based in-network optimization framework
• Extensible interface for aggregates and new types of sensors

TinyDB Revisited
• High-level abstraction:
  – Data-centric programming
  – Interact with the sensor network as a whole
  – Extensible framework
• Under the hood:
  – Intelligent query processing: query optimization, power-efficient execution
  – Fault mitigation: automatically introduce redundancy, avoid problem areas
• Example query:
    SELECT MAX(mag)
    FROM sensors
    WHERE mag > thresh
    SAMPLE PERIOD 64ms
• The app submits queries and triggers to TinyDB, which returns data from the sensor network

Feature Overview
• Declarative SQL-like query interface
• Metadata catalog management
• Multiple concurrent queries
• Network monitoring (via queries)
• In-network, distributed query processing
• Extensible framework for attributes, commands, and aggregates
• In-network, persistent storage

Architecture
• PC side: TinyDB GUI and TinyDB client API (JDBC)
• Mote side: the TinyDB query processor runs on every node of the sensor network — in effect, a sensor network DBMS

Data Model
• The entire sensor network is one single, infinitely long logical table: sensors
• Columns consist of all the attributes defined in the network
• Typical attributes:
  – Sensor readings
  – Metadata: node id, location, etc.
  – Internal states: routing tree parent, timestamp, queue length, etc.
• Nodes return NULL for unknown attributes
• On the server, all attributes are defined in catalog.xml
• Discussion: other alternative data models?

Query Language (TinySQL)
    SELECT <aggregates>, <attributes>
    [FROM {sensors | <buffer>}]
    [WHERE <predicates>]
    [GROUP BY <exprs>]
    [SAMPLE PERIOD <const> | ONCE]
    [INTO <buffer>]
    [TRIGGER ACTION <command>]

Comparison with SQL
• Single table in the FROM clause
• Only conjunctive comparison predicates in WHERE and HAVING
• No subqueries
• No column aliases in the SELECT clause
• Arithmetic expressions limited to column op constant
• Only fundamental difference: the SAMPLE PERIOD clause

TinySQL Examples
"Find the sensors in bright nests."

    SELECT nodeid, nestNo, light
    FROM sensors
    WHERE light > 400
    EPOCH DURATION 1s

Result:
    Epoch | nodeid | nestNo | light
    0     | 1      | 17     | 455
    0     | 2      | 25     | 389
    1     | 1      | 17     | 422
    1     | 2      | 25     | 405

TinySQL Examples (cont.)

    SELECT AVG(sound)
    FROM sensors
    EPOCH DURATION 10s

"Count the number of occupied nests in each loud region of the island."

    SELECT region, CNT(occupied), AVG(sound)
    FROM sensors
    GROUP BY region
    HAVING AVG(sound) > 200
    EPOCH DURATION 10s

Result (regions w/ AVG(sound) > 200):
    Epoch | region | CNT(…) | AVG(…)
    0     | North  | 3      | 360
    0     | South  | 3      | 520
    1     | North  | 3      | 370
    1     | South  | 3      | 520

Event-based Queries
• ON event SELECT …
• Run the query only when an interesting event happens
• Event examples:
  – Button pushed
  – Message arrival
  – Bird enters nest
• Analogous to triggers, but events are user-defined

Query over Stored Data
• Named buffers in Flash memory
• Store query results in buffers
• Query over named buffers
• Analogous to materialized views
• Example:
    CREATE BUFFER name SIZE x (field1 type1, field2 type2, …)
    SELECT a1, a2 FROM sensors SAMPLE PERIOD d INTO name
    SELECT field1, field2, … FROM name SAMPLE PERIOD d

Inside TinyDB
• Queries (e.g., SELECT AVG(temp) WHERE light > 400) go in; results (e.g., T:1, AVG:225; T:2, AVG:250) come back over the multihop network
• The mote-side query processor combines aggregation operators (e.g., avg over temp), filters (e.g., light > 400), sampling, tables, and a schema
• ~10,000 lines of embedded C code on the mote; ~5,000 lines of Java on the PC side
• Schema/catalog metadata per attribute, e.g. for temp: time to sample 50 µs, cost to sample 90 µJ, calibration table, units (deg. F), error (±5 deg. F), accessor getTempFunc(…)
• Footprint: ~3,200 bytes of RAM (w/ a 768-byte heap); ~58 kB of compiled TinyOS code (3x larger than the 2nd-largest TinyOS program)

TinyDB Tree-based Routing
• Tree-based routing is used in:
  – Query delivery
  – Data collection
  – In-network aggregation
• Queries (Q) flood down the routing tree from the root; results (R:{…}) flow back up, hop by hop
• Relationship to indexing?

Sensor Network Research
• Very active research area — we can't summarize it all
• Focus: database-relevant research topics
  – Some outside of Berkeley
  – Other topics that are itching to be scratched
  – But some bias towards work that we find compelling

Topics
• In-network aggregation
• Acquisitional query processing
• Heterogeneity
• Intermittent connectivity
• In-network storage
• Statistics-based summarization and sampling
• In-network joins
• Adaptivity and sensor networks
• Multiple queries

Tiny Aggregation (TAG)
• In-network processing of aggregates
  – A common data-analysis operation
    • Aka a gather operation, or a reduction in parallel programming
  – Communication-reducing
    • Operator-dependent benefit
  – Across nodes during the same epoch
• Exploit query semantics to improve efficiency!
Madden, Franklin, Hellerstein, Hong. Tiny AGgregation (TAG), OSDI 2002.

Basic Aggregation
• In each epoch:
  – Each node samples its local sensors once
  – Generates a partial state record (PSR) from
    • local readings
    • readings from children
  – Outputs its PSR during its assigned communication
interval (slot) within the epoch
• At the end of the epoch, the PSR for the whole network is output at the root
• A new result is produced on each successive epoch
• Extras:
  – Predicate-based partitioning via GROUP BY

Illustration: Aggregation
(Series of figures: for SELECT COUNT(*) FROM sensors, a 5-node routing tree computes the count over one epoch. In each communication interval one level of the tree transmits — leaves report 1, interior nodes add their children's counts to their own — and in the final interval the root outputs the network-wide COUNT of 5.)

Aggregation Framework
• As in extensible databases, TinyDB supports any aggregation function conforming to:
    Agg_n = {f_init, f_merge, f_evaluate}
  – f_init(a0) -> <a0>: produces a partial state record (PSR)
  – f_merge(<a1>, <a2>) -> <a12>: combines two PSRs
  – f_evaluate(<a1>) -> the aggregate value
• Example: AVERAGE
  – AVG_init(v) -> <v, 1>
  – AVG_merge(<S1, C1>, <S2, C2>) -> <S1 + S2, C1 + C2>
  – AVG_evaluate(<S, C>) -> S/C
• Restriction: merge must be associative and commutative

Taxonomy of Aggregates
• TAG insight: classify aggregates according to various functional properties
  – Yields a general set of optimizations that can be applied automatically
  – Drives an API!

    Property              | Examples                                   | Affects
    Partial state         | MEDIAN: unbounded; MAX: 1 record           | Effectiveness of TAG
    Duplicate sensitivity | MIN: dup. insensitive; AVG: dup. sensitive | Routing, redundancy
    Exemplary vs. summary | MAX: exemplary; COUNT: summary             | Applicability of sampling, effect of loss
    Monotonicity          | COUNT: monotonic; AVG: non-monotonic       | Hypothesis testing, snooping
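The three-function aggregate interface above can be made concrete with a short sketch. This is illustrative Python, not TinyDB's actual embedded C implementation; the function and variable names are my own.

```python
# Illustrative sketch of the TAG aggregate interface: an aggregate is a
# triple {f_init, f_merge, f_evaluate} operating on partial state
# records (PSRs). Shown here for AVERAGE.

def avg_init(v):
    # A single reading v becomes the PSR <sum, count>.
    return (v, 1)

def avg_merge(psr1, psr2):
    # Merging PSRs is associative and commutative, as TAG requires,
    # so PSRs can be combined in any order as they flow up the tree.
    s1, c1 = psr1
    s2, c2 = psr2
    return (s1 + s2, c1 + c2)

def avg_evaluate(psr):
    # Only the root turns the final PSR into the aggregate value.
    s, c = psr
    return s / c

def aggregate_epoch(readings):
    """Fold one epoch's readings (node id -> local reading) into the
    network-wide AVG; merge order does not matter."""
    psrs = [avg_init(v) for v in readings.values()]
    merged = psrs[0]
    for p in psrs[1:]:
        merged = avg_merge(merged, p)
    return avg_evaluate(merged)

result = aggregate_epoch({1: 20.0, 2: 22.0, 3: 24.0})  # -> 22.0
```

Implementing a new aggregate (say, MAX) only requires supplying the same three functions, which is what makes the framework extensible.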
Use Multiple Parents
• Use the graph structure of the network
  – Increases delivery probability with no communication overhead
• Applies to duplicate-insensitive aggregates, or aggregates expressible as a sum of parts
  – Send (part of) the aggregate to all parents — in just one message, via multicast
  – Assuming independent losses, this decreases variance:
      With P(link xmit successful) = p, a count c over one parent (two hops to the root R):
        E(cnt) = c · p²;  Var(cnt) = c² · p² · (1 − p²) = V
      With n parents, sending c/n to each:
        E(cnt) = n · (c/n) · p²;  Var(cnt) = n · (c/n)² · p² · (1 − p²) = V/n

Multiple Parents Results
(Chart: benefit of result splitting on a COUNT query; 2,500 nodes, lossy radio model, 6 parents per node. With splitting, the average COUNT is much higher than without, which collapses at a critical link.)
• Better than the previous analysis expected!
• Losses aren't independent!
• Insight: splitting spreads data over many links

Acquisitional Query Processing (ACQP)
• TinyDB acquires AND processes data
  – Could generate an infinite number of samples
• An acquisitional query processor controls
  – when,
  – where,
  – and with what frequency data is collected!
• Versus traditional systems, where data is provided a priori
Madden, Franklin, Hellerstein, and Hong. The Design of an Acquisitional Query Processor. SIGMOD 2003.

ACQP: What's Different?
• How should the query be processed?
  – Sampling as a first-class operation
• How does the user control acquisition?
  – Rates or lifetimes
  – Event-based triggers
• Which nodes have relevant data?
  – Index-like data structures
• Which samples should be transmitted?
  – Prioritization, summary, and rate control

Operator Ordering: Interleave Sampling + Selection

    SELECT light, mag
    FROM sensors
    WHERE pred1(mag) AND pred2(light)
    EPOCH DURATION 1s

• E(sampling mag) >> E(sampling light): 1500 µJ vs. 90 µJ
• At 1 sample/sec, total power savings could be as much as 3.5 mW — comparable to the processor!
• A traditional DBMS samples both attributes for every tuple, then applies σ(pred1) and σ(pred2)
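The energy argument for interleaving can be checked with simple arithmetic. The per-sample costs below are the slide's figures (mag 1500 µJ, light 90 µJ); the 10% pass rate is an assumed value for illustration.

```python
# Back-of-envelope check of why ACQP interleaves sampling with
# selection instead of sampling every attribute up front.

MAG_UJ = 1500   # cost to sample the magnetometer (from the slide)
LIGHT_UJ = 90   # cost to sample the light sensor (from the slide)

def traditional_cost():
    # Traditional DBMS: sample both attributes for every tuple,
    # then filter.
    return MAG_UJ + LIGHT_UJ

def acqp_cost(sel_light):
    # ACQP: sample cheap light first; sample mag only for the
    # fraction sel_light of epochs where pred2(light) passes.
    return LIGHT_UJ + sel_light * MAG_UJ

# If pred2(light) passes 10% of the time (assumed):
#   traditional_cost() -> 1590 uJ/epoch
#   acqp_cost(0.1)     -> 240 uJ/epoch, a ~6.6x saving
```

The saving grows as the cheap predicate gets more selective, which is why the ordering decision belongs to the query optimizer.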
• The correct ACQP ordering (unless pred1 is very selective and pred2 is not): sample cheap light first and apply σ(pred2); only sample costly mag, and apply σ(pred1), for tuples that pass

Exemplary Aggregate Pushdown

    SELECT WINMAX(light, 8s, 8s)
    FROM sensors
    WHERE mag > x
    EPOCH DURATION 1s

• A traditional DBMS samples mag and light, filters on mag > x, then computes WINMAX
• ACQP pushes the exemplary aggregate down: sample light first, and only sample mag — the most expensive operation! — when light > the current window MAX
• A novel, general pushdown technique

Heterogeneous Sensor Networks
• Leverage small numbers of high-end nodes to benefit large numbers of inexpensive nodes
• Must still be transparent and ad hoc
• Key to the scalability of sensor networks
• Interesting heterogeneities:
  – Energy: battery vs. outlet power
  – Link bandwidth: Chipcon vs. 802.11x
  – Computing and storage: ATmega128 vs. XScale
  – Pre-computed results
  – Sensing nodes vs. QP nodes

Computing Heterogeneity with TinyDB
• Separate query processing from sensing
  – Provide query processing on a small number of nodes
  – Attract packets to query processors based on "service value"
• Compare the total energy consumption of the network under:
  – No aggregation
  – All aggregation
  – Opportunistic aggregation
  – HSN proactive aggregation
Mark Yarvis and York Liu, Intel's Heterogeneous Sensor Network Project, ftp://download.intel.com/research/people/HSN_IR_Day_Poster_03.pdf.

5x7 TinyDB/HSN Mica2 Testbed

Data Packet Saving
• How many aggregators are desired?
• Does placement matter?
(Charts: % change in data packet count vs. number of aggregators, and vs. aggregator location.)
• 11% aggregators achieve 72% of the maximum data reduction
• Optimal placement is about 2/3 of the distance from the sink

Occasionally Connected Sensornets
(Figure: a TinyDB server on the internet reaches multiple TinyDB query processors in the field through fixed and mobile gateways (GTWY).)

Occasionally Connected Sensornets: Challenges
• Networking support
  – Trade-off between reliability, power consumption, and delay
  – Data custody transfer: duplicates?
  – Load shedding
  – Routing of mobile gateways
• Query processing
  – Operator placement: in-network vs. on mobile gateways
  – Proactive pre-computation and data movement
• Tight interaction between networking and QP
Fall, Hong and Madden, Custody Transfer for Reliable Delivery in Delay Tolerant Networks, http://www.intel-research.net/Publications/Berkeley/081220030852_157.pdf.

Distributed In-network Storage
• Collectively, sensornets have large amounts of in-network storage
• Good for in-network consumption or caching
• Challenges:
  – Distributed indexing for fast query dissemination
  – Resilience to node or link failures
  – Graceful adaptation to data skews
  – Minimizing index insertion/maintenance cost

Example: DIM
• Functionality
  – Efficient range queries for multidimensional data, e.g. Q1 = <.5-.7, .5-1> over events such as E1 = <0.7, 0.8> and E2 = <0.6, 0.7>
• Approach
  – Divide the sensor field into bins
  – Locality-preserving mapping from m-d space to geographic locations
  – Use geographic routing such as GPSR
• Assumptions
  – Nodes know their locations and the network boundary
  – No node mobility
Xin Li, Young Jin Kim, Ramesh Govindan and Wei Hong, Distributed Index for Multi-dimensional Data (DIM) in Sensor Networks, SenSys 2003.
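DIM's locality-preserving mapping can be sketched in a few lines. This is a toy simplification under assumed details (bit-interleaving into zone codes; the real system descends a zone tree and routes with GPSR, and the helper names here are invented):

```python
# Toy sketch of DIM's core idea: map 2-D attribute values in [0,1)^2
# to zone codes by interleaving one bit of each dimension per level,
# so a multidimensional range query touches only a few zones.

def bin_of(event, bits=2):
    """Zone code of an event <x, y>: interleave `bits` bits of x and y
    (a k-d-tree-style split of the field)."""
    x, y = event
    code = ""
    for _ in range(bits):
        x *= 2; code += str(int(x)); x -= int(x)
        y *= 2; code += str(int(y)); y -= int(y)
    return code

def bins_for_range(xlo, xhi, ylo, yhi, bits=2, samples=20):
    # Conservatively enumerate candidate zones by sampling the query
    # rectangle; the real system walks the zone tree instead.
    def step(lo, hi):
        return [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return sorted({bin_of((x, y), bits)
                   for x in step(xlo, xhi) for y in step(ylo, yhi)})

# Events and query from the slide:
e1, e2 = (0.7, 0.8), (0.6, 0.7)
q_bins = bins_for_range(0.5, 0.7, 0.5, 0.999)  # Q1 = <.5-.7, .5-1>
# Both events' zones fall inside Q1's small set of candidate zones.
```

Because nearby values share code prefixes, Q1 only needs to visit two zones rather than flooding the whole field — the locality property the slide relies on.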
Statistical Techniques
• Approximations, summaries, and sampling based on statistics and statistical models
• Applications:
  – Limited bandwidth and large numbers of nodes -> data reduction
  – Lossiness -> predictive modeling
  – Uncertainty -> tracking correlations and changes over time
  – Physical models -> improved query answering

Correlated Attributes
• Data in sensor networks is correlated; e.g.:
  – Temperature and voltage
  – Temperature and light
  – Temperature and humidity
  – Temperature and time of day
  – etc.

IDSQ
• Idea: task sensors in order of best improvement to the estimate of some value:
  – Choose leader(s)
    • Suppress subordinates
    • Task subordinates, one at a time, until some measure of goodness (error bound) is met
      – E.g. Mahalanobis distance — accounts for correlations in the axes; tends to favor minimizing the principal axis
See "Scalable Information-Driven Sensor Querying and Routing for Ad Hoc Heterogeneous Sensor Networks." Chu, Haussecker and Zhao. Xerox TR P2001-10113, May 2001.

Graphical Representation
• Model the location estimate as a point with 2-dimensional Gaussian uncertainty
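The Mahalanobis distance mentioned above is easy to state concretely. A minimal sketch, assuming a diagonal covariance (variances aligned with the axes) — the full definition uses the inverse covariance matrix:

```python
# Mahalanobis distance for a 2-D Gaussian with diagonal covariance:
# displacement along a high-variance (principal) axis counts as
# "closer" than the same displacement across it, which is why IDSQ
# favors sensors that cut error along the principal axis.
import math

def mahalanobis(point, mean, variances):
    return math.sqrt(sum((p - m) ** 2 / v
                         for p, m, v in zip(point, mean, variances)))

mean = (0.0, 0.0)
variances = (9.0, 1.0)  # principal axis along x (std dev 3 vs. 1)

d_along = mahalanobis((3.0, 0.0), mean, variances)   # -> 1.0
d_across = mahalanobis((0.0, 3.0), mean, variances)  # -> 3.0
# The same Euclidean displacement is 3x "farther" off the principal axis.
```

Ranking candidate sensor readings by how much they shrink this distance is the "measure of goodness" the slide alludes to.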
(Figure: two candidate sensors S1 and S2 produce residuals of equal area, but S1 is preferred because it reduces error along the principal axis.)

MQSN: Model-based Probabilistic Querying over Sensor Networks
Joint work with Amol Deshpande, Carlos Guestrin, and Joe Hellerstein
• A probabilistic query arrives at the query processor, e.g.
    SELECT NodeID, Temp ± 0.1C
    WHERE NodeID in [1..9]
    WITH conf(0.95)
• The query processor consults the model and produces an observation plan, e.g. [Temp, 3], [Temp, 9]: observe only nodes 3 and 9 of the nine-node network
• The observed data comes back, updates the model, and the query results (temperatures for all requested nodes, with confidence) are returned

Challenges
• What kinds of models to use?
• Optimization problem:
  – Given a model and a query, find the best set of attributes to observe
  – Cost is not easy to measure
    • Non-uniform network communication costs
    • Changing network topologies
  – Large plan space
    • It might be cheaper to observe attributes not in the query — e.g.
Voltage instead of Temperature
• Conditional plans:
  – Change the observation plan based on observed values

MQSN: Current Prototype
• Multivariate Gaussian models
  – Kalman filters to capture correlations across time
• Handles:
  – Range-predicate queries
    • sensor value within [x, y], w/ confidence
  – Value queries
    • sensor value = x, w/in epsilon, w/ confidence
  – Simple aggregate queries
    • AVG(sensor value), w/in epsilon, w/ confidence
• Uses a greedy algorithm to choose the observation plan

In-Net Regression
• Linear regression: a simple way to predict future values and identify outliers
  (Example curve fit, X vs. Y over many nodes: y = 0.9703x − 0.0067, R² = 0.947)
• Regression can be across local or remote values, multiple dimensions, or with high-degree polynomials
  – E.g., node A's readings vs. node B's
  – Or location (X, Y) versus temperature
Guestrin, Thibaux, Bodik, Paskin, Madden. "Distributed Regression: an Efficient Framework for Modeling Sensor Network Data." Under submission.

In-Net Regression (Continued)
• Problem: may require data from all sensors to build the model
• Solution: partition sensors into overlapping "kernels" that influence each other
  – Run regression in each kernel, requiring just local communication
  – Blend data between kernels
  – Requires some clever matrix manipulation
• End result: a regressed model at every node
  – Useful in failure detection and missing-value estimation

Exploiting Correlations in Query Processing
• Simple idea:
  – Given predicate P(A) over an expensive attribute A, replace it with P' over a cheap attribute A' such that P' evaluates to P
  – Problem: unless A and A' are perfectly correlated, P' ≠ P for all time
    • So we could incorrectly accept or reject some readings
• Alternative: use correlations to improve selectivity estimates in query optimization
  – Construct conditional plans that vary predicate order based on prior observations

Exploiting Correlations (Cont.)
• Insight: by observing a (cheap and correlated) variable not involved in the query, it may be possible to improve query performance
  – Improves estimates of selectivities
• Use conditional plans
• Example: evaluating Light > 100 Lux AND Temp < 20 °C, where each predicate costs 100:
  – Unconditioned, each predicate has selectivity .5, so either order has expected cost = 100 + .5 × 100 = 150
  – Conditioning on Time in [6pm, 6am]:
    • True (night): evaluate Light > 100 Lux first (selectivity .1), then Temp < 20 °C (selectivity .9); expected cost = 100 + .1 × 100 = 110
    • False (day): evaluate Temp < 20 °C first (selectivity .1), then Light > 100 Lux (selectivity .9); expected cost = 110

In-Network Join Strategies
• Types of joins:
  – non-sensor -> sensor
  – sensor -> sensor
• Optimization questions:
  – Should the join be pushed down?
  – If so, where should it be placed?
  – What if a join table exceeds the memory available on one node?

Choosing Where to Place Operators
• Idea: choose a "join node" to run the operator
• Over time, explore other candidate placements:
  – Nodes advertise data rates to their neighbors
  – Neighbors compute the expected cost of running the join based on these rates
  – Neighbors advertise costs
  – The current join node selects a new, lower-cost node
Bonfils + Bonnet, Adaptive and Decentralized Operator Placement for In-Network Query Processing, IPSN 2003.

Adaptivity in Sensor Networks
• Queries are long-running
• Selectivities change
  – E.g. night vs. day
• Network load and available energy vary
• All suggest that some adaptivity is needed
  – Of data rates or granularity of aggregation when optimizing for lifetimes
  – Of operator orderings or placements when selectivities change (cf. conditional plans for correlations)
• As far as we know, this is an open problem!
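The conditional-plan arithmetic on the correlations slides reduces to a one-line expected-cost formula. A minimal sketch using the slide's numbers (costs of 100 per predicate, selectivities .5 unconditioned vs. .1/.9 conditioned on time of day):

```python
# Expected cost of evaluating two conjunctive predicates in order:
# predicate 1 runs on every tuple; predicate 2 runs only on the
# fraction sel1 of tuples that pass predicate 1.

def expected_cost(cost1, sel1, cost2):
    return cost1 + sel1 * cost2

# Unconditioned plan: each predicate has selectivity .5,
# so either order costs the same.
naive = expected_cost(100, 0.5, 100)        # -> 150

# Conditional plan: in each time-of-day branch, the predicate that
# rarely passes (selectivity .1) is evaluated first.
conditional = expected_cost(100, 0.1, 100)  # -> 110
```

The observed variable (time of day) never appears in the query itself; it only sharpens the selectivity estimates, which is exactly the insight above.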
Multiple Queries and Work Sharing
• As sensornets evolve, users will run many queries simultaneously
  – E.g., traffic monitoring
• Queries are likely to be similar
  – But with different endpoints, parameters, etc.
• Would like to share processing and routing as much as possible
• But how? Again, an open problem.

Concluding Remarks
• Sensor networks are an exciting emerging technology with a wide variety of applications
• Many research challenges in all areas of computer science
  – Database community included
  – Some agreement that a declarative interface is right
• TinyDB and other early work are an important first step
• But there's lots more to be done!