Fishing for Patterns in (Shallow) Geometric Streams Subhash Suri UC Santa Barbara

and ETH Zurich
IIT Kanpur Workshop on Data Streams, Dec 18-20
Geometric Streams
• A stream of points (dim 1, 2, 3, …)
• Abstract view of multi-attribute data:
– IP packets, database transactions, geographic sensor data,
processor instruction stream etc.
[Figures: DDoS and worm patterns in IP traffic (NetViewer); sensor eScan; code profiling]
Shape of a Point Stream
• Form informs about the function
• Identifying visually interesting patterns (“shape”) of point stream
– Areas of high density (hot spots).
– Large empty areas (cold spots).
– Population estimates of geometric ranges
– A geometric summary of the distribution of the stream.
• Deliberately vague and ill-posed; some specifics later.
Outline
• No attempt to survey
• Adaptive Spatial Partitioning
– Generic summary structure (Algorithmica ‘06)
– Q-Digest: sensornet data aggregation (SenSys’04)
– Range Adaptive Profiling for Programs (CGO ‘06)
• Specialized geometric patterns and queries:
– Range queries (SoCG ‘04)
– Hierarchical Heavy-hitters (PODS ‘05)
• Shape of the stream: ClusterHulls (Alenex‘06)
• Conclusions
Adaptive Spatial Partitioning
• A subdivision of space into square cells.
• Each cell maintains O(1) size info, essentially the count of points in it.
• Tension between coverage and precision:
– Large cells cover a lot, but with poor precision
– Small cells have good precision, but poor coverage
• Dynamically adapt the subdivision to the distribution of points in the stream.
• Adaptive zoom: more precision (cells) where the action is, and fewer elsewhere.
• [HSST], ISAAC ‘04, Algorithmica ‘06
ASP Structure
• Data structure size is a function of the accuracy parameter ε
• Initially, a single box (L x L), and its counter.
• When the count of a box b exceeds εn (n = number of points seen so far):
– Freeze b’s counter
– Split b into 4 sub-boxes
– Introduce a new counter for each sub-box
• This hierarchically defined structure of boxes (a streaming quadtree) is our ASP.
Adaptivity: Refine and Unrefine
• The structure must adapt to the changing distribution of points:
– New regions become heavy
– Previously heavy regions may become light/cold.
• The refine operation puts new counters where the action is increasing.
• Stream processing: for each item x
– Locate the smallest box v containing x, increment its count
– Refine: if count of v > εn
   – Split v into 4 children sub-boxes, each with a new counter, initialized to 0.
   – Old counter of v is frozen.
[Figure: refine operation]
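The stream-processing loop with refinement can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Box`, `ASP`, `total`, and `depth` are hypothetical, and `min_size` is an assumed resolution limit that stops splitting (the slides bound depth by O(log L) without naming such a parameter). Unrefinement is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    x: float                 # lower-left corner
    y: float
    size: float              # side length
    count: int = 0           # points counted at this box (frozen once split)
    children: list = field(default_factory=list)

    def child_for(self, px, py):
        """Return the sub-box containing (px, py)."""
        half = self.size / 2
        ix = 1 if px >= self.x + half else 0
        iy = 1 if py >= self.y + half else 0
        return self.children[2 * iy + ix]


class ASP:
    def __init__(self, L, eps, min_size=1.0):
        self.root = Box(0.0, 0.0, L)
        self.eps = eps
        self.n = 0                  # stream length so far
        self.min_size = min_size    # assumption: do not split below this side length

    def insert(self, px, py):
        self.n += 1
        v = self.root
        while v.children:           # locate the smallest box containing the point
            v = v.child_for(px, py)
        v.count += 1
        # Refine: split a heavy leaf; its own counter is frozen from now on.
        if v.count > self.eps * self.n and v.size > self.min_size:
            half = v.size / 2
            v.children = [Box(v.x + dx * half, v.y + dy * half, half)
                          for dy in (0, 1) for dx in (0, 1)]


def total(v):
    """Sum of all counters in the subtree; equals the number of points inserted."""
    return v.count + sum(total(c) for c in v.children)


def depth(v):
    return 0 if not v.children else 1 + max(depth(c) for c in v.children)
```

Every point is counted in exactly one box, so `total(asp.root)` always equals `asp.n`. Early in the stream εn is below 1, so the first few points each trigger a split along their own path.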
Unrefine Operation
• To conserve memory, boxes with low counts must be deleted.
• A previously heavy box may become light because n, the size of the stream, has increased, and so its count is below εn.
• Unrefine: if the combined count of box v and its children < εn/2
– Delete the children boxes and
– Add their counts to the count of v
– (v’s old counter revived)
• Refinement occurs only at the node of a new insert; unrefinement can occur anywhere (non-locality).
• A heap for fast unrefine ops.
[Figure: unrefine operation]
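The unrefine rule can be sketched as a bottom-up pass over the tree. The slides use a heap to find candidates quickly; this sketch swaps in a plain full-tree traversal for brevity, and the `Node` class and `unrefine` function are hypothetical names.

```python
class Node:
    def __init__(self, count=0, children=None):
        self.count = count
        self.children = children or []


def unrefine(v, eps, n):
    """Collapse light boxes bottom-up; return the number of boxes deleted."""
    deleted = 0
    for c in v.children:
        deleted += unrefine(c, eps, n)
    # Only a box whose children are all leaves can absorb them.
    if v.children and all(not c.children for c in v.children):
        combined = v.count + sum(c.count for c in v.children)
        if combined < eps * n / 2:
            v.count = combined          # v's old counter is revived
            deleted += len(v.children)
            v.children = []
    return deleted
```

Because children are visited first, a collapse can cascade: once a box has absorbed its children it becomes a leaf, and its parent may then qualify in the same pass.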
The Data Structure
• ASP represented as a 4-ary tree
[Figure: ASP-tree over the L x L box]
Analysis of ASP
• (Space Bound):
– For each node v, the combined count of v, its siblings, and its parent > εn/2
– Total number of boxes at most O(1/ε)
• (Per-point Processing Time):
– Naïve will be O(lg L): tree height
– With heap and centroid tree, (amortized) time O(lg 1/ε)
• (Count Bound):
– Each point counted in exactly one box
– Points contained in a box b are counted at b or one of its ancestors
– Depth of the tree by the binary partitioning rule is O(log L)
– Error in a leaf’s count is O(εn * log L).
– Using memory = O(1/ε * log L), the count error is bounded by εn.
[Figure: ASP-tree]
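The count bound can be unpacked in one line. A point lying in leaf box b may have been counted at any ancestor of b, and each frozen ancestor counter holds roughly at most εn points. The rescaled parameter ε' below is notation introduced for this note only.

```latex
\mathrm{error}(b) \;\le\; \sum_{a \,\in\, \mathrm{anc}(b)} \mathrm{count}(a)
\;\le\; \varepsilon n \cdot O(\log L),
\qquad
\varepsilon' = \frac{\varepsilon}{\log L}
\;\Longrightarrow\;
\mathrm{error}(b) = O(\varepsilon n),
\quad
\mathrm{memory} = O\!\left(\frac{1}{\varepsilon'}\right) = O\!\left(\frac{\log L}{\varepsilon}\right).
```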
Spatial Summary
• A partition into O(1/ε * log L) boxes, with auto-adaptive zoom.
• No undivided box has more than εn points: only leaf nodes can.
• Gives a qualitative summary of the stream’s spatial distribution: a visual sense of hot and cold regions.
Two applications and two theorems
• Data aggregation in sensor networks
– Distributed version of ASP structure
• Code profiling in processor streams
– Hardware implementation of ASP
• Theoretical bounds for range searching
– Worst-case guarantees for rectangle range searching
• Lower bounds on hierarchical heavy hitters
– Space complexity
Geometric Summaries in Sensornets
• Self-organizing networks of tiny, cheap sensors
– Integrated sensing, computing, radio communication
– Continuous, real-time monitoring of remote, hard to reach areas
– Limited power (battery), bandwidth, memory
• Communication is typically the biggest drain on energy
• Perform as much local processing as possible, and transmit smart summaries.
• Similar to synopses: distributed data, rather than one-pass.
• Active area: in-network aggregation, compressed sensing.
Distributed ASP
• Q-Digest: an approximate histogram
– Shrivastava, Buragohain, Agrawal, S. [SenSys ‘04]
• ASP for 1-dim data signal (measurements of sensors): vibration data, acoustics, toxin levels, etc.
• Going beyond min, max, or average, and approximating quantiles.
• Sensors form an aggregation tree, rooted at the base station.
• Data flows from leaves to the base station, always reduced to a size-K summary (K is a user parameter).
• The key point is that ASP is efficiently mergeable:
– Given q-digests of children, a node can compute the merged q-digest.
• Space/quality bounds of ASP carry over.
[Figure: aggregation tree rooted at the base station]
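Mergeability can be illustrated with a small 1-D sketch in the spirit of q-digest. It is an illustration under assumptions, not the SenSys ‘04 code: nodes of the binary tree over the value range [0, σ) use heap numbering (root 1, children 2i and 2i+1, leaf for value x at σ + x), a digest is a dict from node id to count, and a (node, sibling, parent) triple is folded into the parent when its combined count is at most ⌊n/k⌋. The function names are hypothetical.

```python
import heapq
from collections import Counter


def compress(digest, n, k):
    """Fold light (node, sibling, parent) triples into the parent, deepest first."""
    thresh = n // k
    out = dict(digest)
    heap = [-v for v in out]        # max-heap on node id: deeper nodes have larger ids
    heapq.heapify(heap)
    done = set()
    while heap:
        v = -heapq.heappop(heap)
        if v == 1 or v in done:     # the root has no parent; skip repeats
            continue
        s, p = v ^ 1, v // 2        # sibling and parent in heap numbering
        done.update((v, s))
        combined = out.get(v, 0) + out.get(s, 0) + out.get(p, 0)
        if combined <= thresh:
            out.pop(v, None)
            out.pop(s, None)
            out[p] = combined
            heapq.heappush(heap, -p)
    return out


def from_values(values, sigma, k):
    """Build a digest from raw readings x with 0 <= x < sigma (sigma a power of 2)."""
    leaves = Counter(sigma + x for x in values)
    return compress(leaves, len(values), k)


def merge(d1, n1, d2, n2, k):
    """Merge two digests: add counts node by node, then compress again."""
    combined = Counter(d1)
    combined.update(d2)
    return compress(combined, n1 + n2, k), n1 + n2
```

A parent node in the aggregation tree would call `merge` on the digests received from its children and forward only the result, so the message size stays governed by k rather than by the number of readings below it.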
A simulation
• 8000 sensors, each generating a 2-byte integer (Death Valley elevation data)
• Error in rank, (true - est):
– < 5% with 160 byte Q-Digest
– < 2% with 400 byte Q-Digest
Code Profiling
• Stream of program instructions
• Profiling: Understand code behavior
– Access patterns, cache behavior, load value distributions
– Example: which program segments are hot, and how hot?
• Challenges
– Large item space: programs with 1M basic blocks
– Profiling should take little space and add little overhead
– ASP adaptation to profile high frequency code segments
Code
Basic Blocks
[Figure: the basic block below repeated several times, with blocks marked Frequent or Rare]
    push %ebp
    mov  %esp,%ebp
    sub  $0x38,%esp
    and  $0xfffffff0,%esp
    mov  $0x0,%eax
    sub  %eax,%esp
    sub  $0x8,%esp
    push $0x28
    push $0x8048468
    call 80482b0
    add  $0x10,%esp
Range Adaptive Profiling [CGO ‘06]
• Small fixed memory (counters)
• Dynamically zoom onto high frequency code segments.
• 1-d adaptation of ASP with various “optimizations” to reduce memory and processing time.
– A lot of constant squeezing
– Batching of unrefinements
– Branching factor choices
• Design specs for specialized hardware for profiling (www.cs.ucsb.edu/~arch/rap)
[Figure: hot range and cold range]
Range Adaptive Profiling
• Use RAP to estimate frequency of arbitrary ranges.
– Count errors due to not splitting early enough
– Regions undergoing hot/cold spells
• Typical performance: 8K memory sufficient for 97% accuracy.
[Figure: percent error (axis 0 to 3) on SPEC benchmarks: gcc, gzip, mcf, parser, vortex, vpr, average]
Yes, but…
That’s all well and good in practice, but how does it work in theory?
Range Searching in Streams
• A stream of k-dimensional points.
– Summary to approximate counts of geometric ranges.
• VC dimension, ε-nets and ε-approximation.
• “Nice” geometric ranges have small (bounded) VC dimension: e.g. rectangles, balls, half planes etc.
• ε-approximation Theorem: For every range space (X, R) of fixed VC dim, there exists a subset A of X of size O((1/ε²) lg (1/ε)) s.t. for every range r in R, | |r ∩ X| / |X| − |r ∩ A| / |A| | ≤ ε.
• Iceberg error (εn) unavoidable
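A random sample is the simplest summary of this kind: a sample of size on the order of ε⁻² (up to log factors) is an ε-approximation with good probability. The sketch below maintains such a sample in one pass with reservoir sampling and uses it to estimate the fraction of points in a rectangle. It is a randomized stand-in, not a deterministic construction, and the function names and rectangle encoding are assumptions.

```python
import random


def reservoir(stream, m, seed=0):
    """One-pass uniform sample of m items (Algorithm R)."""
    rng = random.Random(seed)
    res = []
    for i, p in enumerate(stream):
        if i < m:
            res.append(p)
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < m:
                res[j] = p
    return res


def estimate_fraction(points, rect):
    """Fraction of the given points inside rect = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    hits = sum(1 for (x, y) in points if x0 <= x <= x1 and y0 <= y <= y1)
    return hits / len(points)
```

Multiplying the estimated fraction by n gives a count with additive error proportional to n, which is the εn "iceberg" error: ranges holding fewer than εn points cannot be estimated reliably.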
ε-Approximations: challenges
• Large summary size: on the order of ε⁻²
– Would prefer O(1/ε)
– ε-nets are small but can't estimate ranges
• Deterministic construction a space hog.
• The best streaming algorithm for ε-approximation requires working space O( (1/ε)^(d+1) lg^O(d+1) n ) [BCEG ‘04]
Some Theorems [STZ, SoCG ‘04]
• Deterministic Multipass:
With d passes over data, can build a deterministic data structure for rectangular queries of size O((1/ε) lg^(2d-2) (1/ε)).
• Randomized Single pass:
A data structure for rectangular range queries in 2d with error at most εn, with prob > 1 - o(1), of size
O(lg n
The data structure size is only slightly sub-quadratic for d > 2:
Another Theorem
• An implicit desire in ASP is to spot “pockets” of high population.
• Think of such a spatially correlated set as a “spatial heavy hitter”: many
different formal definitions possible.
• An important concept is hierarchical heavy hitter (HHH).
– Popularized by Estan-Varghese, Graham-Muthukrishnan
– Non-redundant heavy hitters
• Ranges often form a natural hierarchy (IP addresses, time, space, etc)
• Stream of points and a (hierarchical) set of boxes.
• Report boxes whose “discounted” frequency is above threshold.
[Figure: discounted frequency of nested boxes A, B, C]
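Discounted frequency is easiest to see in an exact, offline 1-D version over binary prefixes (as with IP addresses). The sketch below is not a streaming algorithm and not the method of the cited papers; it only shows the definition at work. A node's discounted frequency counts the points below it that were not already claimed by a reported descendant. The function name and the `(level, prefix, count)` output format are assumptions.

```python
from collections import Counter


def hhh(values, bits, threshold):
    """Exact hierarchical heavy hitters over `bits`-bit values.

    Returns (level, prefix, discounted_count) triples, deepest level first.
    Level `bits` holds single values; level 0 is the whole range.
    """
    residual = Counter(values)          # unclaimed counts at the current level
    report = []
    for level in range(bits, -1, -1):
        parent_residual = Counter()
        for prefix, c in residual.items():
            if c > threshold:
                report.append((level, prefix, c))    # claimed here, not passed up
            elif level > 0:
                parent_residual[prefix >> 1] += c    # pass unclaimed mass to parent
        residual = parent_residual
    return report


vals = [0] * 5 + [1] + [2, 2] + [3, 3] + [7]
print(hhh(vals, bits=3, threshold=4))
# → [(3, 0, 5), (1, 0, 5)]
```

In the example the value 0 is heavy on its own. The range [0, 4) contains 10 points in total, but only 5 of them remain after discounting the heavy descendant, and those 5 make it a second, non-redundant heavy hitter.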
Space Complexity of HHH
• Elegant applications to IP network monitoring, and clever algorithms by EV and CKMS
• Unlike flat heavy hitters, however, 2-sided approximation guarantees seem difficult to achieve:
– Every HHH (with discounted freq > εn) should be caught
– Every box reported must have discounted frequency > cεn
• HHH Space Theorem: Any ε-HHH algorithm in d dim with fixed accuracy factor c requires Ω(1/ε^(d+1)) memory. [HSST, PODS ‘05]
[Figure: information loss in aggregation, boxes A, B, C]
Shape of a Point Stream
Caution:
entering highly speculative zone!!!
Shape of a Point Stream
[HSS, Alenex ‘06]
• What is a natural summary to describe the geometric shape of a streaming point set?
• A simple first approximation is the convex hull, which preserves basic extremal properties:
– Diameter, width, separation, containment, distance, etc.
– Efficient streaming hulls [AHV, CM, Chan, FKZ, HS].
– Max error O(Diam/r²) for summary size r
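A size-r hull summary can be sketched by remembering, for each of r fixed directions, the most extreme point seen so far. Uniformly spaced directions are used here as a simpler stand-in for the adaptive sampling of the cited streaming-hull work, so the sketch makes no claim to match the O(Diam/r²) bound. The class and method names are hypothetical.

```python
import math


class DirectionalHull:
    def __init__(self, r):
        # r unit vectors, evenly spaced around the circle
        self.dirs = [(math.cos(2 * math.pi * i / r), math.sin(2 * math.pi * i / r))
                     for i in range(r)]
        self.best = [None] * r      # extreme point seen so far in each direction

    def add(self, p):
        for i, (dx, dy) in enumerate(self.dirs):
            q = self.best[i]
            if q is None or p[0] * dx + p[1] * dy > q[0] * dx + q[1] * dy:
                self.best[i] = p

    def extremes(self):
        """Distinct stored points, in order of direction angle."""
        seen, out = set(), []
        for q in self.best:
            if q is not None and q not in seen:
                seen.add(q)
                out.append(q)
        return out

    def diameter(self):
        pts = self.extremes()
        return max((math.dist(a, b) for a in pts for b in pts), default=0.0)
```

Each arriving point costs O(r) work and the summary never stores more than r points, whatever the stream length. Points are expected as `(x, y)` tuples.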
Shape of a Point Stream
• Convex hull is a crude summary when the point stream has a richer structure, especially in the interior.
• Consider the simple example of an L-shaped set.
• A powerful technique for shape extraction is α-hulls
– area left after subtracting all empty disks of radius 1/α
• Unfortunately, α-hulls can have linear size and we don’t know how to build a streaming approximation.
Cluster Hulls (ALENEX ‘06)
• Generalizes the streaming convex hull algorithm to represent the shape as a collection of hulls.
• Mimics α-hull by using minimum area coverage as metric.
• It is not clustering:
– Objective is to approximate well the boundary shape of components
– 2 dimensions only
– Problems with noise
• But could be coupled with clustering.
Algorithm: ClusterHulls
• k convex hulls, H = {h1, h2, …, hk}
• A cost function w(h) = area(h) + μ · (perimeter(h))²
• Minimize w(H) = Σ w(hi)
• For each point p in sequence
– If p is inside some hi, assign p to hi without modifying hi; else create a new hull containing only p and add it to H
– If |H| > k, choose a pair hi, hj to merge into a single hull, s.t. the increase to w(H) is minimized.
– Revise the assignment of adaptive sampling directions to hulls in H to minimize the overall error.
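The merge loop above can be sketched with exact convex hulls. This is a simplification: the algorithm in the slides keeps sampled hulls and reassigns adaptive sampling directions, a step omitted here, so the vertex lists below are not bounded in size. The default `mu=0.1` and all function names are assumptions.

```python
import math
from itertools import combinations


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(pts):
    """Andrew's monotone chain; returns vertices in counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]

    return half(pts) + half(reversed(pts))


def area(h):
    if len(h) < 3:
        return 0.0
    return abs(sum(cross((0, 0), h[i], h[(i + 1) % len(h)]) for i in range(len(h)))) / 2


def perimeter(h):
    return sum(math.dist(h[i], h[(i + 1) % len(h)]) for i in range(len(h)))


def cost(h, mu):
    return area(h) + mu * perimeter(h) ** 2      # w(h) from the slide


def inside(h, p):
    if len(h) < 3:
        return p in h
    return all(cross(h[i], h[(i + 1) % len(h)], p) >= 0 for i in range(len(h)))


def cluster_hulls(stream, k, mu=0.1):
    hulls = []                                   # each hull is a list of vertices
    for p in stream:
        if any(inside(h, p) for h in hulls):
            continue                             # p is covered; no hull changes
        hulls.append([p])
        if len(hulls) > k:
            def increase(pair):
                a, b = pair
                merged = convex_hull(hulls[a] + hulls[b])
                return cost(merged, mu) - cost(hulls[a], mu) - cost(hulls[b], mu)

            a, b = min(combinations(range(len(hulls)), 2), key=increase)
            merged = convex_hull(hulls[a] + hulls[b])
            hulls = [h for i, h in enumerate(hulls) if i not in (a, b)] + [merged]
    return hulls
```

Trying every pair at each merge costs O(k²) hull constructions per uncovered point, which is acceptable for small k but is a design shortcut of this sketch.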
Choosing the cost function
• Area only: merges pairs of points from different clusters and intersecting hulls.
• Perimeter only: favors merging of large hulls to reduce cost.
• The combined area+perimeter works well at both extremes.
Some Pictures
[Figures: input West Nile virus data; ClusterHulls output with m = 256 and m = 512]
Why not Plain Clustering
[Figures: ClusterHulls (m = 45) compared with k-median and CURE for k = 5 and k = 45]
Extreme Examples
• Early choices can be fatal.
• Recover by discarding sparse CHs.
• Process points in rounds whose length doubles each time.
• Discard hulls h whose count(h) or density(h) = count(h)/area(h) is small.
• On these extreme examples, most clustering algs fail
[Figures: input; ClusterHulls output; period-doubling cleanup]
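The period-doubling cleanup can be sketched as two small helpers: one that lists where rounds end, and one that drops sparse hulls at the end of a round. The slides do not give the thresholds, so `min_count` and `min_density` are assumed parameters, and the dict layout with `count` and `area` keys is hypothetical.

```python
def round_ends(first_len, total):
    """Stream positions at which rounds end; each round is twice as long as the last."""
    ends, pos, length = [], 0, first_len
    while pos + length <= total:
        pos += length
        ends.append(pos)
        length *= 2
    return ends


def cleanup(hulls, min_count, min_density):
    """Keep hulls whose point count and density are both large enough."""
    kept = []
    for h in hulls:
        # A zero-area hull (single point or segment) is treated as infinitely dense.
        density = h["count"] / h["area"] if h["area"] > 0 else float("inf")
        if h["count"] >= min_count and density >= min_density:
            kept.append(h)
    return kept


print(round_ends(4, 40))
# → [4, 12, 28]
```

Discarding a hull frees one of the k slots, which gives the algorithm a chance to recover from an early merge that joined points from different components.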
Conclusions, Open Problems
• Is ClusterHull a good idea?
– Too early to tell. The problem seems interesting.
• Open theoretical questions:
– Complexity of covering a set of points with convex polygons:
at most k vertices, minimize the area.
– Covering by rectangles (arbitrarily oriented).
– Streaming versions?
• Other notions of stream shape.
• Space-efficient streaming range searching.
Danke schön!
The Lower Bound in 1-D
• r intervals of length 2 each (call them literals)
• Union of the r intervals is B.
• Each interval split into two unit length sub-intervals.
• If stream points fall in the left (resp. right) subinterval, we say the literal has orientation 0 (resp. 1).
[Figure: B of length 2r; a literal with halves labeled 0 and 1]
The Construction
• Stream arrives in 2 phases.
• In 1st phase: Put 3εN/r points in each interval, either in left or right half.
• In 2nd phase: Adversary chooses either the left or right half of each interval and puts εN points in it. Call these half-intervals sticks.
• Heavy hitters:
– Each stick is an ε-HHH
– Discounted frequency of B (the union interval) depends on literals whose orientations in 1st and 2nd phase differ
• Algorithms must keep track of Ω(r) orientations after 1st phase
The Lower Bound
• Suppose an algorithm A uses < 0.01r bits of space.
– After phase 1, orientations of the r literals are encoded in 0.01r bits.
– There are 2^r distinct orientations
– Two orientations that differ in at least r/3 literals map to the same (0.01r)-bit code ==> indistinguishable to A.
• If orientations in 1st and 2nd phase are the same, discounted frequency of B = 0, not an HHH.
• If r/3 literals differ, discounted frequency of B = r/3 * 3εN/r = εN, so B is an ε-HHH
• A misclassifies B in one of the two sequences.
Completing the Lower Bound
• Make r independent copies of the construction
• Use only one of them to complete the construction in the 2nd phase
• Need Ω(r²) bits to track all orientations
• For r = 1/(4ε), this gives Ω(ε⁻²) lower bound
[Figure: r copies of B, each of length 2r]
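The arithmetic behind the 1-D bound, written out. This reads the heavy-hitter threshold as εN and the first-phase load as 3εN/r per interval, which is the reading under which the numbers on the slides are consistent.

```latex
\underbrace{\frac{r}{3}}_{\text{differing literals}} \cdot
\underbrace{\frac{3\varepsilon N}{r}}_{\text{points per literal}} = \varepsilon N,
\qquad
\underbrace{r}_{\text{copies}} \times \underbrace{\Omega(r)}_{\text{bits per copy}} = \Omega(r^{2}),
\qquad
r = \frac{1}{4\varepsilon}
\;\Longrightarrow\;
\Omega\!\left(\frac{1}{16\,\varepsilon^{2}}\right) = \Omega\!\left(\varepsilon^{-2}\right).
```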
Multi-dimensional lower bound
• The 1-D lower bound is information-theoretic; applies to all algorithms.
• For higher dimensions, need a more restrictive model of algorithms.
• Box Counter Model.
– Algorithm with memory m has m counters
– These counters maintain frequency of boxes
– All deterministic heavy hitter algorithms fit this model
• In the box counter model, finding ε-HHH in d-dim with any fixed approximation requires Ω(1/ε^(d+1)) memory
2D (Multi-Dim) Construction
• A box B and a set of descendants.
• B has side length 2r.
• 1st phase
– 2x2 (literal) boxes in upper left quadrant (orientation 0 or 1)
• 2nd phase
– Diagonal: boxes in upper left quadrant; all orientation 0
– Sticks: 1xr (or rx1) boxes
– Uniform: lower right quadrant
[Figure: box B of side 2r with literal boxes (orientation 0 or 1), diagonal boxes, sticks, and the uniform quadrant]
Multi-dimensional lower bound
• Intuition:
– Each stick combines with a diagonal box to form a skinny ε-HHH box
– Diagonal boxes pair up to form ε-HHH
• Skinny boxes form a checker-board pattern in upper left quadrant
– Each literal is either fully covered or half covered
• As in 1-D, adversary picks sticks
• Discounted frequency of B has
– Half covered literals and
– Points in the Uniform quadrant
[Figure: box B of side 2r showing fully covered and half covered literals, diagonal boxes, sticks, and the uniform quadrant]
The Lower Bound
• The algorithm must remember the Ω(r²) literal orientations.
• Otherwise, it cannot distinguish between two sequences, where discounted frequency of B is m or 3m/2, resp. (for m = 20/29 N).
• Like before, by making r copies of the construction, we get the lower bound of Ω(r³).
• The basic construction generalizes to d dimensions.
• Adjusting the hierarchy gives a lower bound for any arbitrary approximation