Power Conserving Computation of Order-Statistics over Sensor Networks

Michael B. Greenwald & Sanjeev Khanna
Dept. of Computer & Information Science
University of Pennsylvania
PODS 04, June 2004
Sensor Networks
• Cheap => plentiful, large-scale networks
• Low power => long life
• Easily deployed, wireless => hazardous/inaccessible locations
• Base station w/connection to outside world => info flows to station & we can extract data
• Self-configuring => just deploy, and the network sets itself up
Management & Use of Sensor Nets:
Power Consumption and Network Lifetime
• Battery replacement is difficult. Network lifetime may be a function of worst-case battery life.
• Dominant cost is the per-byte cost of transmission.
• Minimize worst-case power consumption at any node; maximize network lifetime.
Management & Use of Sensor Nets:
Primary operation is data extraction
• Declarative DB-like interface
• Issue query; answers stream towards basestation
• Not necessarily consistent with power concerns
• COUNT, MIN, MAX easy to optimize: aggregate in network.
Example of simple aggregate query: MAX
• Aggregate in network
• Transmit only the max observation of all your children
• O(1) per node.
But what about complicated, but natural, queries, e.g. MEDIAN?
Richer queries: Order-statistics
• Exact order-statistic query:
  φ-quantile(S, φ) returns the element with rank r = φ|S|
• Approximate order-statistic query:
  approx-φ-quantile(S, φ, ε) returns an element with rank r,
  (φ − ε)|S| <= r <= (φ + ε)|S|
• Quantile summaries query:
  quantile-summary(S, ε) returns a summary Q, such that
  φ-quantile(Q, φ) returns an element with rank r,
  (φ − ε)|S| <= r <= (φ + ε)|S|
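For example, with |S| = 100, φ = .5, and ε = .1, the exact query must return the element of rank 50, while the approximate query may return any element whose rank lies in [40, 60].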
Goals
• Respond accurately to quantile queries
• Balance power consumption over network (load on any two nodes differs by at most a poly-log factor)
• Topology independent
• Structure solution in a manner amenable to standard optimizations: e.g. non-holistic/decomposable
Naïve approaches
1. Send all |S| values to basestation at root
   – If root has few children, then Ω(|S|) cost
2. Sample w/probability ~1/(ε²|S|)
   – Uniform sampling
   – Lost samples
   – Only probabilistic "guarantee"
Some way to handle summaries so that they can be aggregated in the network?
Preliminaries: Representation
• Record min possible rank (rmin) and max rank (rmax) for each stored value:
Full data S (every value, with exact ranks):
  values:       10    21    29    34    40    46   …  189
  (rmin,rmax): (1,1) (2,2) (3,3) (4,4) (5,5) (6,6) … (10,10)
Summary Q (stored values, with rank bounds):
  values:       10    24     34      40      44     …  341
  (rmin,rmax): (1,1) (4,20) (23,23) (37,37) (51,54) … (100,100)
• Proposition: If for all 1 <= j < |Q|, rmax(j+1) − rmin(j) <= 2ε|S|, then Q is an ε-approximate summary.
Example: ε = .1, |S| = 100. Median = φ-quantile(Q, .5):
  return an entry with rank r = .5 × 100 = 50;
  acceptable range: (.5 − ε) × 100 = 40, (.5 + ε) × 100 = 60.
  Choose a tuple whose rank bounds lie within [r − ε|S|, r + ε|S|]:
  the entry 44 with (rmin, rmax) = (51, 54) qualifies, since
  40 <= 51 <= rank(44) <= 54 <= 60.
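A minimal Python sketch (ours, not the authors' code) of this query rule, assuming a summary is stored as a sorted list of (value, rmin, rmax) tuples:

```python
# Hypothetical sketch: answer a phi-quantile query from a summary Q,
# where Q is a sorted list of (value, rmin, rmax) tuples.

def query(Q, phi, eps, n):
    """Return a stored value whose true rank is within eps*n of phi*n."""
    r = phi * n                          # desired rank
    lo, hi = r - eps * n, r + eps * n    # acceptable rank range
    for value, rmin, rmax in Q:
        if lo <= rmin and rmax <= hi:    # true rank of value lies in [lo, hi]
            return value
    raise ValueError("Q is not an eps-approximate summary for this query")

# The slide's example: eps = .1, |S| = 100, median (phi = .5).
Q = [(10, 1, 1), (24, 4, 20), (34, 23, 23), (40, 37, 37), (44, 51, 54)]
print(query(Q, 0.5, 0.1, 100))           # -> 44 (rank in [51, 54] ⊆ [40, 60])
```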
Preliminaries: COMBINE and PRUNE Operations
• COMBINE: merge n quantile summaries into one
• PRUNE: reduce number of entries in Q to B+1
• Q = COMBINE({Qi})
  – Sort entries, compute new rmin and rmax for each entry.
  – Let the predecessor set for a given entry (v, rmin, rmax) consist of the max entry v' from each input summary such that v' <= v (the successor set has the min v' such that v' >= v)
  – rmin = Σ rmin(v') over v' in the predecessor set
  – rmax = 1 + Σ (rmax(v') − 1) over v' in the successor set
• Q' = PRUNE(Q, B)
  – Query Q for each of the [0, 1/B, 2/B, …, 1] quantiles, and return that set as Q'.
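A hedged Python sketch of both operations under the same list-of-tuples representation; the handling of a value that lies beyond a summary's largest entry is our assumption, as the slide does not spell it out:

```python
# Sketch of COMBINE and PRUNE (not the authors' code); summaries are
# sorted lists of (value, rmin, rmax), with the min and max values stored.

def combine(summaries):
    """Merge several quantile summaries into one."""
    merged = []
    for value, _, _ in sorted(e for Q in summaries for e in Q):
        rmin, rmax = 0, 1
        for Q in summaries:
            preds = [e for e in Q if e[0] <= value]
            succs = [e for e in Q if e[0] >= value]
            if preds:
                rmin += preds[-1][1]     # max v' <= value: add its rmin
            if succs:
                rmax += succs[0][2] - 1  # min v' >= value: add rmax' - 1
            else:
                rmax += Q[-1][2]         # value beyond Q: all of S' may precede it
        merged.append((value, rmin, rmax))
    return merged

def prune(Q, B, n):
    """Keep B+1 entries by querying Q at quantiles 0, 1/B, ..., 1."""
    pruned = []
    for j in range(B + 1):
        r = (j / B) * n
        # take the entry whose rank bounds are centered closest to rank r
        best = min(Q, key=lambda e: abs((e[1] + e[2]) / 2 - r))
        if not pruned or pruned[-1] is not best:
            pruned.append(best)
    return pruned
```

Sanity check: combining two summaries this way gives rmin = rmin₁ + rmin₂ from the predecessors and rmax = rmax₁ + rmax₂ − 1 from the successors, matching the definitions above.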
Combining summaries
[Figure: three input summaries (|S| = 400, ε = .01; |S| = 300, ε = .01; |S| = 1000, ε = .02) are merged into one summary with |S| = 1700, ε = .02; each retained value gets merged (rmin, rmax) bounds, e.g. 51 with (149, 166).]
• COMBINE({Qi}): merge n quantile summaries into one
  – Sort entries, and compute "merged" rmin and rmax.
• Theorem: Let Q = COMBINE({Qi}); then |S| = Σ |Si|, |Q| = Σ |Qi|, and ε <= max(εi).
Pruning Summary
[Figure: PRUNE on the combined summary (|S| = 1700, ε = .02) selects one entry per quantile range [(1/B) − ε, (1/B) + ε], [(2/B) − ε, (2/B) + ε], …, e.g. 60 with (160, 183) and 81 with (334, 353).]
• PRUNE(Q, B): reduce number of entries in Q to B+1
• Theorem: Let Q' = PRUNE(Q, B); then |S| = Σ |Si|, |Q'| = B + 1, and ε' <= ε + 1/(2B)
  – 2ε'|S| = (2ε + 1/B)|S|
  – ε' = ε + 1/(2B)
Naïve approaches
1. Send all |S| values to basestation at root
   – If root has few children, then Ω(|S|) cost
2. Sample w/probability ~1/(ε²|S|)
   – Uniform sampling
   – Lost samples
   – Only probabilistic "guarantee"
3. COMBINE all children, and PRUNE before sending to parent.
   – How do you choose B?
   – If h is known and small (say, log |S|), then B = h/ε and this has cost O(h/ε)
   – But if the topology is deep (linear?), this can be O(|S|).
Need an algorithm that is impervious to pathological topologies.
Topology Independence
• Intuition: PRUNE based on the number of observations, not on the topology. Only call PRUNE when |S| doubles; then the # of PRUNE operations is bounded by log(|S|).
• If the number of PRUNE operations is known, then we can choose a value of B that is guaranteed to never violate the precision guarantee. If there are log(|S|) PRUNEs, then B = log(|S|)/(2ε).
• If there are restrictions on when we can perform PRUNE, then the number/size of the summaries we transmit to our parents will be larger than B.
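Worked check: leaves hold exact data, so all error comes from PRUNE operations: log(|S|) of them, each adding at most 1/(2B), for a total of log(|S|)/(2B) = ε exactly when B = log(|S|)/(2ε).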
Algorithm
• The class of a quantile summary Q' is ⌊log(|S'|)⌋
• Algorithm:
  – Receive all summaries from children.
  – COMBINE and PRUNE all summaries of each class with each other.
  – Transmit the resulting summaries to your parent.
• Analysis
  – At most log(|S|) classes
  – B = log(|S|)/(2ε)
  – Total communication cost to parent = log(|S|) × log(|S|)/(2ε)
Theorem: There is a distributed algorithm to compute an ε-approximate summary with a maximum transmission load of O(log²(|S|)/(2ε)) at each node.
An Improved Algorithm
• New Algorithm:
  If we know h in advance, then keep only the log(8h/ε) largest classes.
  Look only at the min and max of each deleted summary, and merge them into the largest class, introducing an error of at most ε/(2h).
  Maximum transmission load is O((log(|S|) log(h/ε))/ε)
  Details in paper….
Computing Exact Order-Statistics
• Initially (can choose any ):
lbound = -, hbound = +, offset = 0,
rank = |S|, p = 1;
• Each round:
– query for
1. Q = an -approximate summary of [lbound,hbound]
2. offset = exact rank of lbound.
– r = rank-offset;
lbound = Q(r-p|S|/4), hbound = Q(r+p|S|/4); p = p+1;
• After at most p = log1/(|S|) passes, Q is exact,
and we can return the element with rank rank.
Theorem: There is a distributed algorithm to compute
exact order statistics with max transmission load of
O(log3(|S|)) values at each node.
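A self-contained Python sketch of this narrowing loop, with a local sorted array standing in for the network; the stopping test and index arithmetic are our assumptions:

```python
import math

def exact_order_statistic(S, rank, eps=0.25):
    """Find the element of the given rank (1-indexed) by repeatedly
    narrowing [lbound, hbound] around it, as in the round structure above."""
    S = sorted(S)
    lbound, hbound = float("-inf"), float("inf")
    p = 1
    while True:
        # stand-ins for the two per-round network queries:
        window = [x for x in S if lbound <= x <= hbound]  # "summary" of range
        offset = sum(1 for x in S if x < lbound)          # exact rank of lbound
        r = rank - offset                                 # target rank in window
        if len(window) <= 1 / eps:        # window small enough to be exact
            return window[r - 1]
        half = (eps ** p) * len(S) / 4
        lo = max(0, math.floor(r - half) - 1)
        hi = min(len(window) - 1, math.ceil(r + half) - 1)
        lbound, hbound = window[lo], window[hi]
        p += 1

print(exact_order_statistic(range(1, 101), rank=50))      # -> 50
```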
Practical Concerns
• Amenable to general optimization techniques
• Exact: more precise ε increases msg size, but reduces # of passes
• Simulator of Madden & Stanek
• Worst-case assumptions for our algorithm
[Plots: per-node load for Median and Summary queries]
Related Work
• Power mgmt: e.g. [Madden et al, SIGMOD '03], QUERY LIFETIME, sample rate adjusted to meet lifetime requirement.
• Multi-resolution histograms: [Hellerstein et al, IPSN '03], wavelet histograms, no guarantees, multiple passes for finer resolution.
• Streaming data: [GK, SIGMOD '01], [MRL: Manku et al, SIGMOD '98] avg-case cost, [CM: Cormode, Muthukrishnan, LATIN '04] avg-case cost, probabilistic, space blowup.
• Multi-pass algorithms for exact order-statistics, based on [Munro & Paterson, Theoretical Computer Science, '80]
Conclusions
• Substantial improvement in worst-case per-node transmission cost over existing algorithms.
• Topology independent
• In simulation, nodes consume less power than our worst-case analysis predicts
• Some observations on streaming data vs. sensor networks: …
Sensor networks vs. Streaming data
• If sensor values change slowly => multiple passes.
• Uniform sampling of sensor networks is hard; unknown density, and loss can eliminate an entire subtree.
• Sensor nets are inherently parallel; no single processor sees all values. To compare 2 disjoint subtrees, either (a) pay a cost in communication and memory, or (b) aggregate values in one subtree before seeing the other --- discarding info.
• Sensor net model is strictly harder than streaming: input at any node is a summary of prior data plus a new datum.
Questions?
Power Consumption of Nodes in Sensor Net
• 4000-4500 nJ to send a single bit.
• No noticeable startup penalty.
• A 5 mW processor running at 4 MHz uses 4-5 nJ per instruction.
• Transmitting a single bit costs as much as 800-1000 instructions.
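Worked check: 4000 nJ ÷ 5 nJ per instruction = 800, and 4500 nJ ÷ 4.5 nJ per instruction = 1000, giving the 800-1000 range.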
Optimizations
• Conservative Pruning: Discard superfluous i/B quantiles if rmax(i+1) − rmin(i−1) < 2ε|S| (see sketch below)
• Early Combining: COMBINE and PRUNE different classes, as long as the sum of |Si| increases the class of the largest input summary
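A small Python sketch of Conservative Pruning, applying the slide's condition to the (value, rmin, rmax) representation:

```python
def conservative_prune(Q, eps, n):
    """Drop entry i when its neighbours' bounds already satisfy
    rmax(i+1) - rmin(i-1) < 2*eps*n, so entry i carries no needed precision."""
    kept = list(Q)
    i = 1
    while i < len(kept) - 1:
        if kept[i + 1][2] - kept[i - 1][1] < 2 * eps * n:
            del kept[i]      # the proposition still holds without entry i
        else:
            i += 1
    return kept
```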