Introduction to Data Streams

Yannis Theodoridis
Information Systems Laboratory (InfoLab)
Dept of Informatics, Univ. Piraeus
(http://infolab.cs.unipi.gr)
Outline
• Introduction - Applications
• Data Stream Model
  – Data Model
  – Query Model
  – Synopses
• Data Stream Projects
• Wireless Sensor Networks
Introduction
• Data streams differ from the conventional stored relation model:
  – Data elements in the stream arrive online
  – The system has no control over the order in which data elements arrive to be processed
  – Data streams are potentially unbounded in size
  – Once an element from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily unless it is stored in memory, which is small relative to the size of data streams
• Operating in the data stream model does not preclude the use of data in conventional stored relations.
Applications
• Web-based financial search engines that evaluate queries over real-time streaming financial data such as stock tickers and news feeds (e.g., Traderbot)
• Modern security applications (e.g., iPolicy Networks)
  – Provide an integrated security platform offering services such as firewall support and intrusion detection over multi-gigabit network packet streams
  – Need to perform complex stream processing, including URL filtering based on table lookups and correlation across multiple network traffic flows
• Large web sites monitor web logs (clickstreams) online to enable applications such as personalization, performance monitoring, and load balancing (e.g., Yahoo)
• Sensor monitoring
• Network traffic management
The Database Model
[Figure: Data Base Management System (DBMS). Labels: User/Application, Register Query, Results, Database Query Processor, S/W to access Stored Data, Memory, Stored Data]
The Data Stream Model
[Figure: Data Stream Management System (DSMS). Labels: User/Application, Register Query, Results, Stream Query Processor, S/W to access Streaming Data, Scratch Space (Memory and/or Disk)]
DBMS vs. DSMS
DBMS                                                            DSMS
Persistent relations                                            Transient streams
One-time queries                                                Continuous queries
Random access                                                   Sequential access
"Unbounded" disk store                                          Bounded main memory
Only current state matters                                      History/arrival-order is critical
Passive repository                                              Active stores
Relatively low update rate                                      Possibly multi-GB arrival rate
No real-time services                                           Real-time requirements
Assume precise data                                             Data stale/imprecise
Access plan determined by query processor, physical DB design   Unpredictable/variable data arrival and characteristics
Data Model
• Append-only
  – Call records
• Updates
  – Stock tickers
• Deletes
  – Transactional data
• Meta-Data
  – Control signals, punctuations
Query Model
[Figure (PODS 2002): a User/Application interacts with the Query Processor of a DSMS; the model is characterized along three dimensions]
• Query Registration
  – Predefined
  – Ad hoc
  – Predefined, inactive until invoked
• Answer Availability
  – One-time
  – Event/timer based
  – Multiple-time, periodic
  – Continuous (stored or streamed)
• Stream Access
  – Arbitrary
  – Weighted history
  – Sliding window (special case: size = 1)
Queries
One-time queries and continuous queries:
• One-time queries
  – Evaluated once over a point-in-time snapshot of the data set
• Continuous queries
  – Evaluated continuously as data streams continue to arrive
  – The answer may be stored and updated as new data arrives, or may itself be produced as a data stream
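The difference between the two query types can be sketched in a few lines of Python. This is only an illustration under assumptions: the stream of (symbol, price) ticks, the "highest price per symbol" query, and the names `one_time_max` and `continuous_max` are all hypothetical and do not come from any particular DSMS.

```python
def one_time_max(snapshot):
    """One-time query: evaluated once over a point-in-time snapshot."""
    best = {}
    for symbol, price in snapshot:
        best[symbol] = max(price, best.get(symbol, price))
    return best


def continuous_max(stream):
    """Continuous query: yields an updated answer as elements arrive."""
    best = {}
    for symbol, price in stream:
        if symbol not in best or price > best[symbol]:
            best[symbol] = price
            yield symbol, price  # the answer is itself a stream of updates


# Hypothetical ticks, used both as a snapshot and as a stream.
ticks = [("IBM", 10), ("MSFT", 7), ("IBM", 12), ("MSFT", 6)]
print(one_time_max(ticks))
print(list(continuous_max(ticks)))
```

The continuous version never sees the whole input at once; it keeps only the current best value per symbol and emits a new result whenever that value changes.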
Queries
Predefined and ad hoc queries:
• Predefined
  – Supplied to the data stream management system (DSMS) before any relevant data has arrived
  – Usually continuous queries
  – Scheduled one-time queries are also possible
• Ad hoc
  – Can be either one-time or continuous queries
  – Complicate the design of a DSMS: they are not known in advance for purposes of query optimization, and answering them correctly may require data that has already arrived on the streams and may already have been discarded
Making Things Concrete
[Figure: a caller and a callee, each attached to a Central Office; the central offices feed two streams to the DSMS]
    Outgoing (call_ID, caller, time, event)
    Incoming (call_ID, callee, time, event)
    event = start or end
Query 1 (self-join)
• Find all outgoing calls longer than 2 minutes
    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
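The last point can be made concrete with a small Python sketch. It assumes that tuples of the Outgoing stream arrive in non-decreasing time order and that time is measured in minutes; the function name `long_calls` and the sample tuples are hypothetical. A call is reported as soon as the stream's clock shows that it has been open for more than the threshold, so the matching end tuple is not needed.

```python
from collections import deque


def long_calls(outgoing, threshold=2):
    """Yield (call_ID, caller) for calls longer than `threshold` minutes."""
    open_calls = {}      # call_ID -> caller, for calls with no 'end' yet
    by_start = deque()   # (start_time, call_ID) in arrival order
    for call_id, caller, time, event in outgoing:
        # Report every still-open call that already exceeds the threshold.
        while by_start and time - by_start[0][0] > threshold:
            _, cid = by_start.popleft()
            if cid in open_calls:
                yield cid, open_calls.pop(cid)
        if event == "start":
            open_calls[call_id] = caller
            by_start.append((time, call_id))
        elif event == "end":
            open_calls.pop(call_id, None)  # short call or already reported


# Hypothetical Outgoing tuples: (call_ID, caller, time, event)
stream = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (2, "bob", 2, "end"),
    (1, "alice", 5, "end"),
    (3, "carol", 6, "start"),
    (3, "carol", 8, "end"),
]
result = list(long_calls(stream))
print(result)
```

Under these assumptions the operator only has to remember calls that started within the last `threshold` minutes, while the result itself is emitted as an unbounded stream.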
Query 2 (join)
• Pair up callers and callees
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID
• Can still provide result as data stream
• Requires unbounded temporary storage …
• … unless streams are near-synchronized
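A symmetric hash join is one way to picture why the temporary storage depends on how closely the two streams are synchronized. The sketch below is a simplification under stated assumptions: both streams are presented as one interleaved sequence, each call_ID appears once per stream (so a matched tuple can be dropped), and the name `pair_up` and the tuple layout are hypothetical.

```python
def pair_up(events):
    """events: iterable of (stream, call_ID, party), stream is 'O' or 'I'."""
    waiting = {"O": {}, "I": {}}  # unmatched tuples, one hash table per stream
    for stream, call_id, party in events:
        other = "I" if stream == "O" else "O"
        if call_id in waiting[other]:
            match = waiting[other].pop(call_id)
            # Emit in (caller, callee) order regardless of arrival order.
            yield (party, match) if stream == "O" else (match, party)
        else:
            waiting[stream][call_id] = party  # buffered until its partner arrives


# Hypothetical interleaving of Outgoing ('O') and Incoming ('I') tuples.
events = [("O", 1, "alice"), ("I", 1, "bob"), ("I", 2, "dave"), ("O", 2, "carol")]
pairs = list(pair_up(events))
print(pairs)
```

If matching tuples arrive close together, the `waiting` tables stay small; if one stream lags arbitrarily far behind the other, they grow without bound.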
Query 3 (group-by aggregation)
• Total connection time for each caller
    SELECT O1.caller, SUM(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller
• Cannot provide result in (append-only) stream
  – Output updates?
  – Provide current value on demand?
  – Memory?
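The two options raised above, emitting updates or serving the current value on demand, can both be sketched with a small stateful operator. The class name `ConnectionTime`, its methods, and the sample tuples are hypothetical; the memory question shows up as the two dictionaries, which grow with the number of open calls and distinct callers.

```python
from collections import defaultdict


class ConnectionTime:
    """Running total connection time per caller over the Outgoing stream."""

    def __init__(self):
        self.started = {}              # call_ID -> (caller, start_time)
        self.total = defaultdict(int)  # caller -> total connection time

    def process(self, call_id, caller, time, event):
        """Consume one tuple; return an update (caller, new_total) or None."""
        if event == "start":
            self.started[call_id] = (caller, time)
        elif event == "end" and call_id in self.started:
            who, t0 = self.started.pop(call_id)
            self.total[who] += time - t0
            return who, self.total[who]  # "output updates" option
        return None

    def current(self, caller):
        """'Current value on demand' option."""
        return self.total.get(caller, 0)


op = ConnectionTime()
updates = []
for tup in [(1, "alice", 0, "start"), (1, "alice", 5, "end"),
            (2, "alice", 6, "start"), (2, "alice", 9, "end")]:
    out = op.process(*tup)
    if out is not None:
        updates.append(out)
print(updates, op.current("alice"))
```

Each update supersedes the previous value for the same caller, which is why the result does not fit an append-only output stream.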
Approximate Query Answering
• Because data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a query may grow without bound
• Since we are limited to a bounded amount of memory, it may not be possible to produce exact answers
• High-quality approximate answers can be an acceptable solution
• Techniques for data reduction and synopsis construction
Synopses
• Histograms
• Wavelets
• Sketches
Synopses - Histograms
• Equi-Depth Histograms
  – Idea: Select buckets such that counts per bucket are equal
  [Figure: count per bucket over domain values 1 to 20]
• V-Optimal Histograms
  – Idea: Select buckets to minimize frequency variance within buckets
    $$\text{minimize} \quad \sum_{B} \sum_{v \in B} \left( f_v - \frac{f_B}{V_B} \right)^2$$
    where $f_v$ is the frequency of value $v$, $f_B$ is the total frequency in bucket $B$, and $V_B$ is the number of values in $B$
  [Figure: count per bucket over domain values 1 to 20]
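Both ideas can be illustrated on a small in-memory data set. The sketch below is not a streaming algorithm: it assumes all values are available for sorting, and the function names `equi_depth` and `bucket_sse` as well as the sample data are hypothetical. `equi_depth` picks bucket boundaries with equal counts; `bucket_sse` evaluates the V-Optimal objective (sum of squared deviations from each bucket's average frequency) for a given bucketing, without searching for the optimum.

```python
def equi_depth(values, num_buckets):
    """Return (low, high, count) per bucket with (near-)equal counts."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(num_buckets):
        lo = b * n // num_buckets
        hi = (b + 1) * n // num_buckets
        if lo < hi:
            buckets.append((data[lo], data[hi - 1], hi - lo))
    return buckets


def bucket_sse(freqs, buckets):
    """V-Optimal objective for buckets given as (start, end) index ranges."""
    total = 0.0
    for start, end in buckets:
        chunk = freqs[start:end]
        mean = sum(chunk) / len(chunk)  # f_B / V_B
        total += sum((f - mean) ** 2 for f in chunk)
    return total


print(equi_depth([1, 1, 2, 3, 5, 8, 13, 20], 4))
print(bucket_sse([2, 2, 0, 2], [(0, 2), (2, 4)]))
```

A V-Optimal construction would search over bucket boundaries for the bucketing that minimizes `bucket_sse`; that search is omitted here.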
Synopses - Wavelets
• Wavelets: mathematical tool for hierarchical decomposition of functions/signals
• Haar wavelets: simplest wavelet basis, easy to understand and implement
  – Recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
• Compression by ignoring small coefficients
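The table above can be reproduced with a short Python function. It is a minimal sketch that assumes the input length is a power of two and uses the same convention as the slide (average = (a + b) / 2, detail = (a - b) / 2); the name `haar_decompose` is illustrative.

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients from coarse to fine]."""
    averages = list(data)
    coefficients = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        coefficients = details + coefficients  # coarser levels go in front
    return averages + coefficients


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
```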
Haar Wavelet Coefficients
• Hierarchical decomposition structure
• Reconstruct data values d(i)
  – d(i) = ∑ (+/-1) * (coefficient on path)
[Figure: error tree with the average 2.75 at the root, then coefficient -1.25, then 0.5 and 0, then 0, -1, -1, 0, above the original data 2, 2, 0, 2, 3, 5, 4, 4; each left branch carries + and each right branch carries -]
• Coefficient thresholding: only B << |D| coefficients can be kept
  – B is determined by the available synopsis space
  – B largest coefficients in absolute normalized value
  – Provably optimal in terms of the overall Sum Squared (L2) Error
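Reconstruction and thresholding can be sketched as follows. The code assumes the coefficient layout [average, coarsest detail, ..., finest details] for a power-of-two input length, and it normalizes each coefficient by the square root of the number of data values it influences, which is one common convention for unnormalized Haar coefficients. The names `haar_reconstruct` and `threshold` are hypothetical.

```python
import math


def haar_reconstruct(coeffs):
    """Invert the decomposition: each average a and detail d give a+d, a-d."""
    averages = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(averages)]
        pos += len(averages)
        averages = [x for a, d in zip(averages, details) for x in (a + d, a - d)]
    return averages


def threshold(coeffs, b):
    """Keep the b coefficients largest in absolute normalized value."""
    n = len(coeffs)

    def normalized(k):
        level = 0 if k == 0 else k.bit_length() - 1
        support = n / 2 ** level  # number of data values the coefficient covers
        return abs(coeffs[k]) * math.sqrt(support)

    keep = set(sorted(range(n), key=normalized, reverse=True)[:b])
    return [c if k in keep else 0 for k, c in enumerate(coeffs)]


coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
print(haar_reconstruct(coeffs))                # exact data
print(haar_reconstruct(threshold(coeffs, 4)))  # approximation from 4 coefficients
```

For example, d(4) = 3 is obtained along its path as 2.75 - (-1.25) + 0 + (-1).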
Sketches
• e.g. [Alon96]: inner product of α with O(log(N/δ)/ε²) pseudorandom {-1,+1} vectors

i         1   2   3   4   5   6   7   8    inner product with α
α[i]      4   2   7   1   0   3   5   4
r1[i]     1  -1  -1   1  -1   1   1   1     8
r2[i]    -1   1   1  -1   1   1  -1  -1    -2
r3[i]    -1  -1   1   1  -1  -1   1  -1     0

sketch(α) = (8, -2, 0)
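The numbers in the table can be checked with a few lines of Python. The vectors are copied from the slide rather than generated from seeds, and the function names `sketch` and `update` are illustrative. `update` shows why this synopsis suits streams: each arriving element only adds to the counters, so α itself never has to be stored.

```python
alpha = [4, 2, 7, 1, 0, 3, 5, 4]
r = [
    [1, -1, -1, 1, -1, 1, 1, 1],
    [-1, 1, 1, -1, 1, 1, -1, -1],
    [-1, -1, 1, 1, -1, -1, 1, -1],
]


def sketch(vector, random_vectors):
    """Inner product of the frequency vector with each {-1,+1} vector."""
    return [sum(a * x for a, x in zip(vector, rv)) for rv in random_vectors]


def update(sk, random_vectors, i, count=1):
    """Streaming maintenance: value i arrives `count` times."""
    for j, rv in enumerate(random_vectors):
        sk[j] += count * rv[i]


streamed = [0, 0, 0]
for i, count in enumerate(alpha):
    update(streamed, r, i, count)

print(sketch(alpha, r), streamed)
```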
Streaming Model
[Figure labels: Incoming Data Stream, Seeds, Sketch, Queries, Wavelet/Histogram Approximation]
Stream Projects
• Amazon/Cougar (Cornell) – sensors
• Aurora (Brown/MIT) – sensor monitoring, dataflow
• Hancock (AT&T) – telecom streams
• Niagara (OGI/Wisconsin) – Internet XML databases
• OpenCQ (Georgia) – triggers, incremental view maintenance
• Stream (Stanford) – general-purpose DSMS
• Tapestry (Xerox) – pub/sub content-based filtering
• Telegraph (Berkeley) – adaptive engine for sensors
• Tribeca (Bellcore) – network monitoring
Aurora/STREAM Overview
[Figure labels: Input streams, Query Plans (σ and π operators marked Running Op, Ready Op, Waiting Op), Synopses, Historical Storage, Output streams]
• Applications register continuous queries
• Users issue continuous and ad-hoc queries
• Administrator monitors query execution and adjusts run-time parameters
Adaptivity (Telegraph)
[Figure labels: Input Streams R, S, T; EDDY; grouped filter (R.A); grouped filter (S.B); STeMs for join; Output Queues]
• Runtime Adaptivity
• Multi-query Optimization
• Framework – implements arbitrary schemes
Query-Split Scheme (Niagara)
[Figure labels: scan of Quotes.XML; scan of constant table (IBM: file i, MSFT: file j); join on Symbol = Const.Value; split; file i, file j; scan; trig.Act.i, trig.Act.j]
• Aggregate subscription for efficiency
• Split – evaluate trigger only when file updated
• Triggers – multi-query optimization
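Shared Predicates [Niagara, Telegraph]
Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
[Figure: the predicates are grouped by operator (>, <, =, ≠) with their constants in shared structures; an incoming tuple with A = 8 is checked against them]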
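One way to picture such a shared structure is an index that groups the constants by operator, so a tuple is tested against all registered predicates in one pass instead of one predicate at a time. The sketch below is an assumption-laden illustration, not the Niagara or Telegraph implementation; the class name `PredicateIndex` and its layout are hypothetical.

```python
import bisect


class PredicateIndex:
    """Index of single-attribute predicates given as (operator, constant)."""

    def __init__(self, predicates):
        self.gt = sorted(c for op, c in predicates if op == ">")
        self.lt = sorted(c for op, c in predicates if op == "<")
        self.eq = {c for op, c in predicates if op == "="}
        self.ne = sorted(c for op, c in predicates if op == "!=")

    def matches(self, value):
        """Return every registered predicate that `value` satisfies."""
        out = []
        # "> c" holds for all constants strictly below the value.
        out += [(">", c) for c in self.gt[:bisect.bisect_left(self.gt, value)]]
        # "< c" holds for all constants strictly above the value.
        out += [("<", c) for c in self.lt[bisect.bisect_right(self.lt, value):]]
        if value in self.eq:
            out.append(("=", value))
        out += [("!=", c) for c in self.ne if c != value]
        return out


index = PredicateIndex([(">", 1), (">", 7), (">", 11), ("<", 3), ("<", 5),
                        ("=", 6), ("=", 8), ("!=", 9)])
print(index.matches(8))
```

For the tuple with A = 8, the satisfied predicates are R.A > 1, R.A > 7, R.A = 8, and R.A ≠ 9.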
WSNs – An Introduction (1/3)
• The vision:
  – Push connectivity out of the PC and into the real world
  – Billions of sensors and actuators everywhere
  – Zero configuration and administrative cost
  – Build everything out of CMOS so that each device costs pennies
  – Enable new sensing paradigms
• New challenges in data stream management
WSNs – An Introduction (2/3)
Wireless Sensor Networks utility:
„
ƒ
Scatter cheap, tiny motes in an area of interest
ƒ
Perform querying operations
ƒ
Obtain reports of physical quantities under study
ƒ
Support sampling procedures, alert mechanism infrastructures, decision making processes etc WSNs – An Introduction (3/3)
„
„
Mote Features
ƒ
Low Power Supply, Low Power, Low Power...
ƒ
Low processing capabilities
ƒ
Constrained memory capacity
Network Features
ƒ
Wireless, multi‐hop communication using ISM radio zones (433MHz – 2,4GHz)
ƒ
Ad‐hoc network topologies
Sensor Net Sample Apps
• Habitat monitoring: storm petrels on Great Duck Island, microclimates on James Reserve.
• Vehicle detection: sensors along a road collect data about passing vehicles.
• Earthquake monitoring in shake-test sites.
[Figure: traditional monitoring apparatus]
Communication In Sensor Nets
• Radio communication has high link-level losses
  – typically about 20% @ 5m
• Ad-hoc neighbor discovery
• TAG → Tree-based routing
[Figure: routing tree over nodes A, B, C, D, E, F]
Declarative Queries for Sensor Nets
• Examples:
Query 1:
    SELECT nodeid, light
    FROM sensors
    WHERE light > 400
    EPOCH DURATION 1s

Sensors
Epoch   Nodeid   Light   Temp   Accel   Sound
0       1        455     x      x       x
0       2        389     x      x       x
1       1        422     x      x       x
1       2        405     x      x       x
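The effect of Query 1 on the sample table can be traced in Python. This is only a sketch of the query semantics on the four rows shown, not of how a sensor network executes the query; the function name `bright_nodes` and the tuple layout are hypothetical, and the unspecified columns (x) are left out.

```python
# (epoch, nodeid, light) readings from the Sensors table above.
sensors = [
    (0, 1, 455),
    (0, 2, 389),
    (1, 1, 422),
    (1, 2, 405),
]


def bright_nodes(readings, min_light=400):
    """Per epoch, report (epoch, nodeid, light) where light > min_light."""
    for epoch, nodeid, light in readings:
        if light > min_light:
            yield epoch, nodeid, light


answer = list(bright_nodes(sensors))
print(answer)
```

Node 2 is filtered out in epoch 0 (light = 389) and reported in epoch 1 (light = 405).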
Aggregation Queries
Query 2:
    SELECT AVG(sound)
    FROM sensors
    EPOCH DURATION 10s

Epoch   AVG(sound)
0       440
1       445

Query 3 (rooms w/ sound > 200):
    SELECT roomNo, AVG(sound)
    FROM sensors
    GROUP BY roomNo
    HAVING AVG(sound) > 200
    EPOCH DURATION 10s

Epoch   roomNo   AVG(sound)
0       1        360
0       2        520
1       1        370
1       2        520
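With tree-based routing, an aggregate such as AVG can be computed inside the network: each node forwards a small partial state instead of its raw readings. The sketch below illustrates this for one epoch using (sum, count) pairs. The tree topology, the readings, and the function names `merge` and `aggregate` are hypothetical and chosen only for illustration.

```python
def merge(a, b):
    """Combine two partial states (sum, count)."""
    return a[0] + b[0], a[1] + b[1]


def aggregate(node, children, readings):
    """Partial state for the subtree rooted at `node`."""
    state = (readings[node], 1)  # the node's own reading
    for child in children.get(node, []):
        state = merge(state, aggregate(child, children, readings))
    return state


# Hypothetical routing tree rooted at node A and one sound reading per node.
children = {"A": ["B", "C"], "B": ["D", "E"]}
readings = {"A": 400, "B": 420, "C": 460, "D": 440, "E": 480}

total, count = aggregate("A", children, readings)
print(total / count)  # AVG(sound) evaluated at the root
```

Every link carries one (sum, count) pair per epoch regardless of subtree size, which matters when radio communication is the dominant energy cost.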
Example: Network Organization into Clusters
[Figure: base station and a sensor network organized into clusters]
    SELECT faggr(attr), attr,…
    FROM sensors
    WHERE predicate
    OUTPUT ACTION action
    SAMPLE PERIOD t sec FOR d sec
Query dissemination:
• The basestation poses a query of interest
• The query is received by nearby clusterheads
• Clusterheads propagate the query to their peers
• …and to nodes within their cluster
Example (cont.)
Query answer:
• Clusterheads collect mote measurements
• Send data to forwarders
• Forwarders propagate the data towards the basestation
• Answer reaches the basestation
Summary
• Novel data management issues arise when dealing with data streams
  – Data modeling, query processing, etc.
• Exact answers are hard to give, so approximation backed by theoretical bounds suffices
• There is a lot of ongoing work that deals with streams, while wireless sensor networks pose new challenges in handling them
Reading list
• N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. Proc. ACM STOC, 1996.
• A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. VLDB, 2001.
• S. Madden, M. J. Franklin, J. M. Hellerstein, W. Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. Proc. OSDI, 2002.
• D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik. Aurora: a new model and architecture for data stream management. VLDB Journal, 2003.
• S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, M. A. Shah. TelegraphCQ: continuous dataflow processing. VLDB Journal, 2003.