Implementation and Research
Issues in Query Processing
for Wireless Sensor Networks
Wei Hong
Intel Research, Berkeley
whong@intel-research.net
Sam Madden
MIT
madden@csail.mit.edu
1
Adapted by L.B.
Declarative Queries
• Programming Apps is Hard
  – Limited power budget
  – Lossy, low-bandwidth communication
  – Require long-lived, zero-admin deployments
  – Distributed algorithms
  – Limited tools, debugging interfaces
• Queries abstract away much of the complexity
– Burden on the database developers
– Users get:
• Safe, optimizable programs
• Freedom to think about apps instead of details
2
TinyDB: Prototype declarative
query processor
• Platform: Berkeley Motes + TinyOS
• Continuous variant of SQL: TinySQL
• Power- and data-acquisition-based in-network optimization framework
• Extensible interface for aggregates, new
types of sensors
3
TinyDB Revisited
• High level abstraction:
– Data centric programming
– Interact with sensor
network as a whole
– Extensible framework
• Under the hood:
– Intelligent query processing:
query optimization, power
efficient execution
– Fault Mitigation:
automatically introduce
redundancy, avoid problem
areas
SELECT MAX(mag)
FROM sensors
WHERE mag > thresh
SAMPLE PERIOD 64ms
App
Query,
Trigger
Data
TinyDB
Sensor Network
4
Feature Overview
• Declarative SQL-like query interface
• Metadata catalog management
• Multiple concurrent queries
• Network monitoring (via queries)
• In-network, distributed query processing
• Extensible framework for attributes, commands and aggregates
• In-network, persistent storage
5
Architecture
[Figure: on the PC side, the TinyDB GUI, TinyDB Client API, JDBC, and a DBMS; on the mote side, the TinyDB query processor running over a sensor network of numbered nodes.]
6
Data Model
• Entire sensor network as one single, infinitely-long logical
table: sensors
• Columns consist of all the attributes defined in the network
• Typical attributes:
– Sensor readings
– Meta-data: node id, location, etc.
– Internal states: routing tree parent, timestamp, queue length,
etc.
• Nodes return NULL for unknown attributes
• On server, all attributes are defined in catalog.xml
• Discussion: other alternative data models?
7
Query Language (TinySQL)
SELECT <aggregates>, <attributes>
[FROM {sensors | <buffer>}]
[WHERE <predicates>]
[GROUP BY <exprs>]
[SAMPLE PERIOD <const> | ONCE]
[INTO <buffer>]
[TRIGGER ACTION <command>]
8
Comparison with SQL
• Single table in FROM clause
• Only conjunctive comparison predicates in
WHERE and HAVING
• No subqueries
• No column alias in SELECT clause
• Arithmetic expressions limited to column
op constant
• Only fundamental difference: SAMPLE
PERIOD clause
9
TinySQL Examples

1 “Find the sensors in bright nests.”

  SELECT nodeid, nestNo, light
  FROM sensors
  WHERE light > 400
  EPOCH DURATION 1s

Epoch  Nodeid  nestNo  Light
0      1       17      455
0      2       25      389
1      1       17      422
1      2       25      405
10
TinySQL Examples (cont.)

2 SELECT AVG(sound)
  FROM sensors
  EPOCH DURATION 10s

3 “Count the number of occupied nests in each loud region of the island.”

  SELECT region, CNT(occupied), AVG(sound)
  FROM sensors
  GROUP BY region
  HAVING AVG(sound) > 200
  EPOCH DURATION 10s

Epoch  region  CNT(…)  AVG(…)
0      North   3       360
0      South   3       520
1      North   3       370
1      South   3       520

Regions w/ AVG(sound) > 200
11
Event-based Queries
• ON event SELECT …
• Run query only when interesting events happen
• Event examples
  – Button pushed
  – Message arrival
  – Bird enters nest
• Analogous to triggers, but events are user-defined
12
Query over Stored Data
• Named buffers in Flash memory
• Store query results in buffers
• Query over named buffers
• Analogous to materialized views
• Example:
  – CREATE BUFFER name SIZE x (field1 type1, field2 type2, …)
  – SELECT a1, a2 FROM sensors SAMPLE PERIOD d INTO name
  – SELECT field1, field2, … FROM name SAMPLE PERIOD d
13
Inside TinyDB
[Figure: the app sends queries (e.g., SELECT AVG(temp) WHERE light > 400) into TinyDB and receives results (T:1, AVG: 225; T:2, AVG: 250) over a multihop network; the plan runs Agg_avg(temp) over a filter on light > 400 over temp samples obtained via get(‘temp’).]
• Query processor: ~10,000 lines embedded C code
• PC-side components: ~5,000 lines Java
• Schema/catalog (attribute tables and calibration, e.g. getTempFunc(); units: Deg. F; error: ±5 Deg F; time to sample: 50 uS; cost to sample: 90 uJ): ~3,200 bytes RAM (w/ 768-byte heap)
• TinyOS code: ~58 kB compiled (3x larger than the 2nd-largest TinyOS program)
14
Tree-based Routing
• Used in:
  – Query delivery
  – Data collection
  – In-network aggregation
• Relationship to indexing?
[Figure: a routing tree rooted at node A over nodes B–F; queries (Q) are flooded down the tree and results (R:{…}) are forwarded back up.]
15
Sensor Network Research
• Very active research area
– Can’t summarize it all
• Focus: database-relevant research topics
– Some outside of Berkeley
– Other topics that are itching to be scratched
– But, some bias towards work that we find
compelling
16
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
17
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
18
Tiny Aggregation (TAG)
• In-network processing of aggregates
– Common data analysis operation
• Aka gather operation or reduction in parallel programming
– Communication reducing
• Operator dependent benefit
– Across nodes during same epoch
• Exploit query semantics to improve
efficiency!
Madden, Franklin, Hellerstein, Hong. Tiny AGgregation (TAG), OSDI 2002.
19
Basic Aggregation
• In each epoch:
– Each node samples local sensors once
– Generates partial state record (PSR)
• local readings
• readings from children
– Outputs PSR during assigned comm. interval
• At end of epoch, PSR for whole network
output at root
• New result on each successive epoch
[Figure: example five-node routing tree (nodes 1–5).]
• Extras:
– Predicate-based partitioning via GROUP BY
20
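As a sketch of the mechanics above, the following Python (illustrative, not TinyDB code; the tree topology and function names are invented) computes a COUNT over one epoch by combining each node's local contribution with its children's partial state records:

```python
# Sketch: one epoch of TAG-style in-network COUNT aggregation.
# Each node produces a partial state record (PSR) combining its own
# reading with its children's PSRs; the root's PSR is the result.

def count_psr(node, children):
    """PSR for COUNT is an integer; merging PSRs is addition."""
    psr = 1  # local contribution: this node sampled once this epoch
    for child in children.get(node, []):
        psr += count_psr(child, children)  # child's PSR sent upward
    return psr

# Hypothetical routing tree rooted at node 1.
tree = {1: [2, 3], 2: [4, 5], 3: []}
total = count_psr(1, tree)
```

At the root, the PSR equals the number of nodes that reported during the epoch.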
Illustration: Aggregation

SELECT COUNT(*)
FROM sensors

[Figure sequence (slides 21–25): a five-node routing tree computes COUNT(*) over one epoch. Each node is assigned a communication interval: in interval 4 a leaf transmits its partial count; in intervals 3 and 2 interior nodes merge their children's partial state records with their own readings and forward the growing counts; in interval 1 the root receives the final PSR, yielding the network-wide count of 5. The next epoch then begins a new result.]
25
Aggregation Framework
• As in extensible databases, TinyDB supports any aggregation function conforming to:
  Agg_n = {f_init, f_merge, f_evaluate}
  f_init{a0} → <a0>   (a partial state record, PSR)
  f_merge{<a1>, <a2>} → <a12>
  f_evaluate{<a1>} → aggregate value
• Example: Average
  AVG_init{v} → <v, 1>
  AVG_merge{<S1, C1>, <S2, C2>} → <S1 + S2, C1 + C2>
  AVG_evaluate{<S, C>} → S/C
• Restriction: Merge must be associative and commutative
26
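The three-function interface can be sketched directly in Python, mirroring the AVG example above (the names avg_init/avg_merge/avg_evaluate follow the slide's f_init/f_merge/f_evaluate; this is an illustration, not TinyDB's API):

```python
# TAG aggregate interface for AVG: init builds a PSR, merge combines
# two PSRs, evaluate turns the root's PSR into the final value.

def avg_init(v):
    return (v, 1)                # PSR is <sum, count>

def avg_merge(a, b):
    s1, c1 = a
    s2, c2 = b
    return (s1 + s2, c1 + c2)    # associative and commutative

def avg_evaluate(psr):
    s, c = psr
    return s / c
```

Because merge is associative and commutative, PSRs can be combined in any order as they flow up the routing tree.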
Taxonomy of Aggregates
• TAG insight: classify aggregates according to various functional properties
  – Yields a general set of optimizations that can automatically be applied
  – Drives an API!

Property               Examples                                     Affects
Partial State          MEDIAN: unbounded, MAX: 1 record             Effectiveness of TAG
Duplicate Sensitivity  MIN: dup. insensitive, AVG: dup. sensitive   Routing Redundancy
Exemplary vs. Summary  MAX: exemplary, COUNT: summary               Applicability of Sampling, Effect of Loss
Monotonicity           COUNT: monotonic, AVG: non-monotonic         Hypothesis Testing, Snooping
27
Use Multiple Parents
• Use graph structure
  – Increase delivery probability with no communication overhead
• For duplicate-insensitive aggregates, or aggregates expressible as a sum of parts
  – Send (part of) the aggregate to all parents
    • In just one message, via multicast
  – Assuming independence, decreases variance
[Figure: SELECT COUNT(*) — node A sends its count c two hops up to root R; with n = 2 parents B and C, A sends c/n to each.]
  P(link xmit successful) = p
  P(success from A -> R) = p^2
  One parent:  E(cnt) = c * p^2;  Var(cnt) = c^2 * p^2 * (1 − p^2) = V
  n parents:   E(cnt) = n * (c/n) * p^2;  Var(cnt) = n * (c/n)^2 * p^2 * (1 − p^2) = V/n
28
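A quick numeric check of the variance claim above, under the slide's assumptions (independent links, each successful with probability p, two hops from A to the root; splitting sends c/n to each of n parents):

```python
# Expectation is unchanged by splitting; variance drops by a factor n.

def expectation(c, p, n=1):
    return n * (c / n) * p ** 2                      # = c * p^2 for any n

def variance(c, p, n=1):
    return n * (c / n) ** 2 * p ** 2 * (1 - p ** 2)  # = V / n
```

With n = 2 the expected count is identical but the variance is halved, matching the V/n result on the slide.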
Multiple Parents Results
• Better than the previous analysis expected!
• Losses aren't independent!
• Insight: splitting spreads data over many links
[Chart: “Benefit of Result Splitting (COUNT query)” — avg. COUNT (0–1400) with and without splitting; without splitting the count collapses at a critical link. 2500 nodes, lossy radio model, 6 parents per node.]
29
Acquisitional Query Processing (ACQP)
• TinyDB acquires AND processes data
  – Could generate an infinite number of samples
• An acquisitional query processor controls
  – when,
  – where,
  – and with what frequency data is collected!
• Versus traditional systems where data is provided a priori

Madden, Franklin, Hellerstein, and Hong. The Design of an Acquisitional Query Processor. SIGMOD, 2003.
30
ACQP: What’s Different?
• How should the query be processed?
– Sampling as a first class operation
• How does the user control acquisition?
– Rates or lifetimes
– Event-based triggers
• Which nodes have relevant data?
– Index-like data structures
• Which samples should be transmitted?
– Prioritization, summary, and rate control
31
Operator Ordering: Interleave Sampling + Selection

SELECT light, mag
FROM sensors
WHERE pred1(mag)
  AND pred2(light)
EPOCH DURATION 1s

• E(sampling mag) >> E(sampling light): 1500 uJ vs. 90 uJ
• At 1 sample/sec, total power savings could be as much as 3.5 mW — comparable to the processor!
• Traditional DBMS: σ(pred1) over σ(pred2) over [mag, light] — both attributes are sampled before any filtering.
• ACQP — correct ordering (unless pred1 is very selective and pred2 is not): sample cheap light, apply σ(pred2), then sample costly mag only for survivors and apply σ(pred1).
32
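The plan-cost arithmetic behind this ordering can be sketched as follows (only the per-sample energies, 1500 uJ and 90 uJ, come from the slide; the selectivity values are illustrative):

```python
# An acquisitional optimizer treats sampling as an operator and orders
# it against selections: sample the cheap attribute first and pay for
# the expensive sample only for tuples that survive its predicate.

MAG_UJ, LIGHT_UJ = 1500, 90

def cost_mag_first(sel_pred1):
    """Sample mag, apply pred1, then sample light only for survivors."""
    return MAG_UJ + sel_pred1 * LIGHT_UJ

def cost_light_first(sel_pred2):
    """Sample light, apply pred2, then sample mag only for survivors."""
    return LIGHT_UJ + sel_pred2 * MAG_UJ
```

Sampling light first wins for moderate selectivities; only when pred1 is very selective and pred2 is not does the mag-first order become cheaper, as the slide notes.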
Exemplary Aggregate Pushdown

SELECT WINMAX(light,8s,8s)
FROM sensors
WHERE mag > x
EPOCH DURATION 1s

• Novel, general pushdown technique
• Mag sampling is the most expensive operation!
• Traditional DBMS: WINMAX over σ(mag > x) over [mag, light].
• ACQP: push a (light > MAX) filter below the mag sampling — sample light first, and only sample mag when the reading could change the window max.
33
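The pushdown idea can be sketched in a few lines (an illustration under our own assumptions, not TinyDB code: readings carry a pre-known mag value so we can count how many mag samples the pushdown avoids):

```python
# Exemplary-aggregate pushdown for a windowed MAX(light) with a
# predicate on expensive mag: a reading can only change the answer if
# light exceeds the window's current max, so the (light > MAX) filter
# runs below mag sampling and mag is sampled only when necessary.

def winmax_pushdown(readings, mag_pred):
    """readings: list of (light, mag). Returns (window max, #mag samples)."""
    cur_max = None
    mag_samples = 0
    for light, mag in readings:
        if cur_max is not None and light <= cur_max:
            continue                 # cannot change MAX: skip mag entirely
        mag_samples += 1             # only now pay for the mag sample
        if mag_pred(mag):
            cur_max = light if cur_max is None else max(cur_max, light)
    return cur_max, mag_samples

window = [(10, 5), (8, 9), (12, 9)]          # illustrative readings
result = winmax_pushdown(window, lambda m: m > 4)
```

Here the second reading's mag sample is skipped because light = 8 cannot beat the running max of 10.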
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
34
Heterogeneous Sensor
Networks
• Leverage small numbers of high-end nodes
to benefit large numbers of inexpensive
nodes
• Still must be transparent and ad-hoc
• Key to scalability of sensor networks
• Interesting heterogeneities
  – Energy: battery vs. outlet power
  – Link bandwidth: Chipcon vs. 802.11x
  – Computing and storage: ATMega128 vs. XScale
  – Pre-computed results
  – Sensing nodes vs. QP nodes
35
Computing Heterogeneity with
TinyDB
• Separate query processing from sensing
– Provide query processing on a small number of nodes
– Attract packets to query processors based on “service
value”
• Compare the total energy consumption of the network under:
  – No aggregation
  – All aggregation
  – Opportunistic aggregation
  – HSN proactive aggregation
Mark Yarvis and York Liu, Intel’s Heterogeneous Sensor
Network Project,
ftp://download.intel.com/research/people/HSN_IR_Day_Poster_03.pdf.
36
5x7 TinyDB/HSN Mica2 Testbed
37
Data Packet Saving
• How many aggregators are desired?
• Does placement matter?
[Chart: “Data Packet Saving” — % change in data packet count vs. number of aggregators (1–6, and all 35); 11% aggregators achieve 72% of the maximum data reduction.]
[Chart: “Data Packet Saving – Aggregator Placement” — % change in data packet count vs. aggregator location (nodes 25–31, and all 35); optimal placement is about 2/3 of the distance from the sink.]
38
Occasionally Connected Sensornets
[Figure: a TinyDB server on the internet reaches several sensor patches, each running a TinyDB QP, through fixed gateways (GTWY) and mobile gateways that ferry data between patches.]
39
Occasionally Connected
Sensornets Challenges
• Networking support
– Tradeoff between reliability, power consumption
and delay
– Data custody transfer: duplicates?
– Load shedding
– Routing of mobile gateways
• Query processing
– Operation placement: in-network vs. on mobile
gateways
– Proactive pre-computation and data movement
• Tight interaction between networking and QP
Fall, Hong and Madden, Custody Transfer for Reliable Delivery in Delay Tolerant
Networks, http://www.intel-research.net/Publications/Berkeley/081220030852_157.pdf .
40
Distributed In-network Storage
• Collectively, sensornets have large amounts
of in-network storage
• Good for in-network consumption or
caching
• Challenges
– Distributed indexing for fast query
dissemination
– Resilience to node or link failures
– Graceful adaptation to data skews
– Minimizing index insertion/maintenance cost
41
Example: DIM
• Functionality
  – Efficient range queries for multidimensional data
• Approaches
  – Divide the sensor field into bins
  – Locality-preserving mapping from m-d space to geographic locations
  – Use geographic routing such as GPSR
• Assumptions
  – Nodes know their locations and the network boundary
  – No node mobility
[Figure: events E1 = <0.7, 0.8> and E2 = <0.6, 0.7> stored in bins; range query Q1 = <.5–.7, .5–1>.]

Xin Li, Young Jin Kim, Ramesh Govindan and Wei Hong. Distributed Index for Multi-dimensional Data (DIM) in Sensor Networks. SenSys 2003.
42
Statistical Techniques
• Approximations, summaries, and sampling
based on statistics and statistical models
• Applications:
– Limited bandwidth and large number of nodes ->
data reduction
– Lossiness -> predictive modeling
– Uncertainty -> tracking correlations and
changes over time
– Physical models -> improved query answering
43
Correlated Attributes
• Data in sensor networks is correlated; e.g.,
  – Temperature and voltage
  – Temperature and light
  – Temperature and humidity
  – Temperature and time of day
  – etc.
44
IDSQ
• Idea: task sensors in order of best
improvement to estimate of some value:
– Choose leader(s)
• Suppress subordinates
• Task subordinates, one at a time
– Until some measure of goodness (error bound) is met
» E.g. “Mahalanobis Distance” -- Accounts for
correlations in axes, tends to favor minimizing
principal axis
See “Scalable Information-Driven Sensor Querying and Routing for ad hoc Heterogeneous Sensor Networks.” Chu, Haussecker and Zhao. Xerox TR P2001-10113, May 2001.
45
Graphical Representation
Model the location estimate as a point with 2-dimensional Gaussian uncertainty.
[Figure: candidate sensors S1 and S2 each leave a residual uncertainty region of equal area; the sensor that reduces error along the principal axis is preferred.]
MQSN: Model-based Probabilistic Querying over Sensor Networks
Joint work with Amol Deshpande, Carlos Guestrin, and Joe Hellerstein
[Figure: a query processor backed by a probabilistic model fronts a nine-node sensor network.]
47
MQSN: Model-based Probabilistic Querying over Sensor Networks

Probabilistic Query:
  select NodeID, Temp ± 0.1C
  where NodeID in [1..9]
  with conf(0.95)

Consult Model → Observation Plan: [Temp, 3], [Temp, 9]
[Figure: the query processor consults the model and emits an observation plan that samples only nodes 3 and 9.]
48
MQSN: Model-based Probabilistic Querying over Sensor Networks

Data: [Temp, 3] = …, [Temp, 9] = …
Update Model → Query Results
[Figure: the observed data updates the model; the query processor returns a bar chart of estimated temperature (axis 10–30) by node ID.]
50
Challenges
• What kind of models to use ?
• Optimization problem:
– Given a model and a query, find the best set of
attributes to observe
– Cost not easy to measure
• Non-uniform network communication costs
• Changing network topologies
– Large plan space
• Might be cheaper to observe attributes not in query
– e.g. Voltage instead of Temperature
• Conditional Plans:
– Change the observation plan based on observed values
51
MQSN: Current Prototype
• Multi-variate Gaussian Models
– Kalman Filters to capture correlations across time
• Handles:
– Range predicate queries
• sensor value within [x,y], w/ confidence
– Value queries
• sensor value = x, w/in epsilon, w/ confidence
– Simple aggregate queries
• AVG(sensor value) = n, w/in epsilon, w/ confidence
• Uses a greedy algorithm to choose the observation plan
52
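A toy sketch of the greedy observation-plan choice (our own illustration, not the MQSN implementation: for a Gaussian model, observing attribute o shrinks the variance of target t by cov(t,o)²/var(o), and the greedy step picks the observation with the best variance reduction per unit acquisition cost; all numbers are made up):

```python
# Greedy step: score each candidate observation by Gaussian-conditioning
# variance reduction divided by its acquisition cost.

def var_reduction(var_o, cov_to):
    return cov_to ** 2 / var_o           # standard Gaussian conditioning

def greedy_pick(candidates):
    """candidates: {name: (var_o, cov_to, cost)} -> best candidate name."""
    return max(candidates,
               key=lambda o: var_reduction(candidates[o][0],
                                           candidates[o][1]) / candidates[o][2])

plan = greedy_pick({
    "temp_3":    (4.0, 3.0, 10.0),   # informative but expensive to sample
    "voltage_3": (1.0, 0.9, 1.0),    # weakly correlated but very cheap
})
```

Note the cheap, correlated voltage reading wins here, echoing the earlier point that it may be cheaper to observe attributes not in the query.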
In-Net Regression
• Linear regression: a simple way to predict future values, identify outliers
• Regression can be across local or remote values, multiple dimensions, or with high-degree polynomials
  – E.g., node A's readings vs. node B's
  – Or, location (X,Y) versus temperature, e.g., over many nodes
[Chart: “X vs Y w/ Curve Fit” — scatter plot with fitted line y = 0.9703x − 0.0067, R² = 0.947.]

Guestrin, Thibaux, Bodik, Paskin, Madden. “Distributed Regression: an Efficient Framework for Modeling Sensor Network Data.” Under submission.
53
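The local fit each node (or kernel, on the next slide) performs reduces to a few running sums; a minimal closed-form least-squares sketch (an illustration, not the distributed algorithm itself):

```python
# Ordinary least squares for y = a*x + b via the closed-form normal
# equations; each sum is exactly the kind of aggregate a sensor node
# can maintain incrementally.

def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # noiseless y = 2x + 1
```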
In-Net Regression (Continued)
• Problem: may require data from all sensors to
build model
• Solution: partition sensors into overlapping
“kernels” that influence each other
– Run regression in each kernel
• Requiring just local communication
– Blend data between kernels
– Requires some clever matrix manipulation
• End result: regressed model at every node
– Useful in failure detection, missing value estimation
54
Exploiting Correlations in
Query Processing
• Simple idea:
– Given predicate P(A) over expensive attribute A
– Replace it with P’ over cheap attribute A’ such that P’ evaluates
to P
– Problem: unless A and A’ are perfectly correlated, P’ ≠ P for all
time
• So we could incorrectly accept or reject some readings
• Alternative: use correlations to improve selectivity
estimates in query optimization
– Construct conditional plans that vary predicate order based on
prior observations
55
Exploiting Correlations (Cont.)
• Insight: by observing a (cheap and correlated) variable not involved in the query, it may be possible to improve query performance
  – Improves estimates of selectivities
• Use conditional plans
• Example (each predicate costs 100):
  – First observe whether the time is in [6pm, 6am]
  – If true (night): evaluate Light > 100 Lux first (selectivity .1), then Temp < 20°C (selectivity .9) — expected cost drops from 150 to 110
  – If false (day): evaluate Temp < 20°C first (selectivity .1), then Light > 100 Lux (selectivity .9) — expected cost drops from 150 to 110
56
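The expected-cost arithmetic behind the conditional plan is just one line: the second predicate only runs on the fraction of tuples that survive the first.

```python
# Expected cost of evaluating predicate 1 then predicate 2:
# pay cost1 always, cost2 only for the sel1 fraction that survives.

def expected_cost(cost1, sel1, cost2):
    return cost1 + sel1 * cost2
```

Conditioning on time of day turns an unconditioned selectivity of .5 (expected cost 150) into a per-branch selectivity of .1 for the first predicate (expected cost 110), matching the slide's numbers.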
In-Network Join Strategies
• Types of joins:
– non-sensor -> sensor
– sensor -> sensor
• Optimization questions:
– Should the join be pushed down?
– If so, where should it be placed?
– What if a join table exceeds the memory
available on one node?
57
Choosing Where to Place
Operators
• Idea : choose a “join node” to run the operator
• Over time, explore other candidate placements
– Nodes advertise data rates to their neighbors
– Neighbors compute expected cost of running the
join based on these rates
– Neighbors advertise costs
– Current join node selects a new, lower cost node
Bonfils + Bonnet, Adaptive and Decentralized Operator Placement for In-Network Query Processing. IPSN 2003.
58
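The placement loop above can be sketched as follows (our own illustration of the idea, with invented names and numbers: each neighbor estimates the cost of hosting the join from advertised data rates, and the join migrates only on strict improvement):

```python
# Decentralized operator placement: cost = expected messages/epoch if
# the join runs at a given node, from input/output rates and hop counts.

def host_cost(rate_left, rate_right, rate_out,
              hops_left, hops_right, hops_out):
    return (rate_left * hops_left
            + rate_right * hops_right
            + rate_out * hops_out)

def choose_host(current, candidates):
    """candidates: {node: advertised cost}. Move only if strictly cheaper."""
    best = min(candidates, key=candidates.get)
    return best if candidates[best] < candidates[current] else current
```

Exploring neighbors over time lets the join node drift toward a low-cost placement as rates change.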
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
59
Adaptivity In Sensor Networks
• Queries are long running
• Selectivities change
– E.g. night vs day
• Network load and available energy vary
• All suggest that some adaptivity is needed
– Of data rates or granularity of aggregation
when optimizing for lifetimes
– Of operator orderings or placements when
selectivities change (c.f., conditional plans for
correlations)
• As far as we know, this is an open problem!
60
Multiple Queries and Work
Sharing
• As sensornets evolve, users will run many
queries simultaneously
– E.g., traffic monitoring
• Likely that queries will be similar
– But have different end points, parameters, etc
• Would like to share processing, routing as
much as possible
• But how? Again, an open problem.
61
Concluding Remarks
• Sensor networks are an exciting emerging technology,
with a wide variety of applications
• Many research challenges in all areas of computer science
– Database community included
– Some agreement that a declarative interface is right
• TinyDB and other early work are an important first step
• But there’s lots more to be done!
62