Streaming Queries over Streaming Data

Sirish Chandrasekaran (UC Berkeley)
Michael J. Franklin (UC Berkeley)
Presented by Andy Williamson
About Me
3rd Year ISYE major
 Minor in Computer Science
 From Austin, TX
 Have visited every state but Alaska
 Intern at Deloitte Consulting focusing
on SAP implementation

Agenda
 Background/Motivation
 PSoup
   Introduction
   System Overview
   Query Processing Techniques
   Implementation
   Performance
   Aggregation Queries
   Conclusions
 Critique
Background/Motivation

Continuous Query (CQ) Systems
Treat queries as fixed entities and
stream data over them
 Previous systems only allowed
streaming of either data or queries
 Continuously deliver results as they
are computed, which can be infeasible or inefficient for:

• Data Recharging
• Monitoring
PSoup: Introduction
Query processor based on Telegraph
query processing framework
 Allows both data and queries to be
streamed
 Partially stores results to support
disconnected operation and improve
data throughput and response time

PSoup: System Overview
User initially registers query specification with system
 System returns handle which can be used to invoke results
of query later
 Example Query:
SELECT *
FROM Data_Stream D_s
WHERE (D_s.a < x AND D_s.b > y)
BEGIN(NOW - 10)
END(NOW);
 Begin-End Clause allows:
   Snapshot (constant beginning and ending time)
   Landmark (constant beginning and variable ending time)
   Sliding window (variable beginning and ending time)
 Windows are limited by the size of memory
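
The three window types can be sketched as follows. This is only an illustration, assuming integer timestamps; the `Window` class and its field names are hypothetical and do not come from PSoup itself. Each bound is either an absolute time or an offset back from NOW.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Window:
    """Hypothetical model of a BEGIN-END clause (names are assumptions)."""
    begin: int
    end: int
    begin_relative: bool  # True: bound is NOW - begin, False: absolute time
    end_relative: bool    # True: bound is NOW - end, False: absolute time

    def bounds(self, now: int) -> Tuple[int, int]:
        """Resolve the clause to concrete [lo, hi] timestamps at time `now`."""
        lo = now - self.begin if self.begin_relative else self.begin
        hi = now - self.end if self.end_relative else self.end
        return lo, hi

    def kind(self) -> str:
        """Classify the window by which of its bounds move with NOW."""
        if self.begin_relative and self.end_relative:
            return "sliding"
        if not self.begin_relative and self.end_relative:
            return "landmark"
        if not self.begin_relative and not self.end_relative:
            return "snapshot"
        return "unsupported"


snapshot = Window(5, 20, begin_relative=False, end_relative=False)
landmark = Window(5, 0, begin_relative=False, end_relative=True)
sliding = Window(10, 0, begin_relative=True, end_relative=True)  # BEGIN(NOW - 10) END(NOW)
```

Under these assumptions, only the sliding and landmark windows return different bounds as NOW advances; the snapshot window always resolves to the same interval.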
PSoup: System Overview



 PSoup treats execution of query streams as a join of query and data streams
 Maintains State Modules (SteMs) for queries and data
 One query SteM for all queries in the system, and one data SteM for each data stream
PSoup: Query Processing
Techniques

Overview




PSoup assigns unique queryID that it returns
to the user
Client can disconnect, reconnect and
execute query to obtain updated results
PSoup continuously matches data to query
predicates in background and stores the
results in its Results Structure
When a query is invoked, PSoup applies the
appropriate input window to the Results
Structure to return the current results
PSoup: Query Processing
Techniques

Entry of new Query specs

New queries split into two parts:
• Standing Query Clause (SQC): consists of the
SELECT-FROM-WHERE clauses
• BEGIN-END clause, stored in separate
WindowsTable structure



SQC inserted into Query SteM
Used to probe Data SteMs corresponding to
tables in FROM clause
Resulting tuples stored in Results Structure
PSoup: Query Processing
Techniques

Entry of new data
New tuples assigned globally unique
tupleID and physical timestamp
(physicalID) based on system clock
 Inserted into appropriate Data SteM
 Then used to probe Query SteM to
determine which SQCs it satisfies
 TupleIDs and physicalIDs stored in
Results Structure
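
The symmetric treatment of new queries and new data can be sketched roughly as below. This is a toy model under stated assumptions, not PSoup's implementation: `MiniPSoup`, `register_query`, and `insert_tuple` are hypothetical names, SQCs are modeled as plain Python predicates, a counter stands in for the system clock, and linear scans replace the indexed SteMs.

```python
from typing import Callable, Dict, List

Predicate = Callable[[dict], bool]  # an SQC modeled as a predicate (assumption)


class MiniPSoup:
    """Toy model: one data stream, one query SteM, one Results Structure."""

    def __init__(self) -> None:
        self.data_stem: Dict[int, dict] = {}        # tupleID -> data tuple
        self.physical_id: Dict[int, int] = {}       # tupleID -> timestamp
        self.query_stem: Dict[int, Predicate] = {}  # queryID -> SQC
        self.results: Dict[int, List[int]] = {}     # queryID -> matching tupleIDs
        self._clock = 0                             # stand-in for the system clock

    def register_query(self, sqc: Predicate) -> int:
        """Insert the SQC into the Query SteM, then probe the stored data."""
        query_id = len(self.query_stem)
        self.query_stem[query_id] = sqc
        self.results[query_id] = [
            tid for tid, row in self.data_stem.items() if sqc(row)
        ]
        return query_id

    def insert_tuple(self, row: dict) -> int:
        """Insert the tuple into the Data SteM, then probe the stored queries."""
        tuple_id = len(self.data_stem)
        self._clock += 1
        self.data_stem[tuple_id] = row
        self.physical_id[tuple_id] = self._clock
        for query_id, sqc in self.query_stem.items():
            if sqc(row):
                self.results[query_id].append(tuple_id)
        return tuple_id


psoup = MiniPSoup()
psoup.insert_tuple({"a": 1})                          # arrives before the query
qid = psoup.register_query(lambda row: row["a"] < 5)  # sees the older tuple
psoup.insert_tuple({"a": 9})                          # does not match
psoup.insert_tuple({"a": 3})                          # arrives after the query
```

In this sketch the query ends up with matches from both before and after its registration, which is the property the query-data join is meant to provide.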

PSoup: Query Processing
Techniques

Selection Queries over a single
stream
PSoup: Query Processing
Techniques

Join Queries Over Multiple Streams
PSoup: Query Processing
Techniques

Query Invocation and Result Construction




Results Structure maintains info about which
tuples in Data SteM(s) satisfy which SQCs in
Query SteM
For each result tuple of each query, it stores
tupleID and physicalID of all constituent
base tuples of result tuple
Results of a query can be accessed by its
queryID
Ordered by timestamp (physicalID)
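
Because results are ordered by timestamp, applying a window at invocation time can be a pair of binary searches. The sketch below assumes the results for one query are held as two parallel lists sorted by physicalID; the function name `invoke` and this layout are illustrative assumptions, not the structure PSoup actually uses.

```python
from bisect import bisect_left, bisect_right
from typing import List


def invoke(timestamps: List[int], tuple_ids: List[int],
           begin: int, end: int) -> List[int]:
    """Return tupleIDs whose physicalID lies in [begin, end].

    `timestamps` must be sorted ascending and aligned with `tuple_ids`.
    """
    lo = bisect_left(timestamps, begin)   # first entry with timestamp >= begin
    hi = bisect_right(timestamps, end)    # one past last entry with timestamp <= end
    return tuple_ids[lo:hi]


# Stored matches for one SQC, ordered by physicalID.
stored_timestamps = [1, 4, 7, 9, 12]
stored_tuple_ids = [10, 11, 12, 13, 14]

now = 12
current = invoke(stored_timestamps, stored_tuple_ids, now - 5, now)
```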
PSoup: Implementation

Eddy



Each tuple has a predicate attribute and an
Interest List dictating where it is to be routed
Provides Stream Prefix Consistency by
storing new and temporary tuples separately
in New Tuple Pool and Temporary Tuple
Pool
Begins by selecting a tuple from the NTP
and then processing everything in the TTP
before picking another tuple from the NTP
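
The NTP/TTP processing order can be illustrated with two queues. This is a minimal sketch of the ordering only: `run_eddy` and the `route` callback are hypothetical, and `route` simply returns whatever temporary tuples a tuple produces when it is processed.

```python
from collections import deque
from typing import Callable, Iterable, List


def run_eddy(new_tuple_pool: Iterable[str],
             route: Callable[[str], List[str]]) -> List[str]:
    """Process one new tuple at a time, draining all temporary tuples
    it generates before the next new tuple is picked."""
    ntp = deque(new_tuple_pool)  # New Tuple Pool
    ttp = deque()                # Temporary Tuple Pool
    trace = []                   # order in which tuples were processed
    while ntp:
        current = ntp.popleft()
        trace.append(current)
        ttp.extend(route(current))
        while ttp:               # empty the TTP before touching the NTP again
            temp = ttp.popleft()
            trace.append(temp)
            ttp.extend(route(temp))
    return trace


def example_route(t: str) -> List[str]:
    """Made-up routing: which temporary tuples each tuple spawns."""
    spawned = {"d1": ["d1.a", "d1.b"], "d1.a": ["d1.a.x"]}
    return spawned.get(t, [])


order = run_eddy(["d1", "d2"], example_route)
```

In this sketch every tuple derived from `d1` is handled before `d2` is picked up.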
PSoup: Implementation

Data SteM
Use tree-based index for data to
provide efficient access to probing
queries
 One red-black tree for every attribute
 Maintains hash-based index over
tupleIDs for fast access
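
A rough sketch of this layout follows. Python has no built-in red-black tree, so a sorted list searched with `bisect` is swapped in as a stand-in for each per-attribute tree, and a dict plays the role of the hash index over tupleIDs. The class and method names are assumptions, and attribute values are assumed to be numeric.

```python
from bisect import bisect_left, bisect_right, insort
from typing import Dict, List, Tuple


class DataSteM:
    """Toy Data SteM: one ordered index per attribute plus a tupleID hash index."""

    def __init__(self, attributes: List[str]) -> None:
        self.by_id: Dict[int, dict] = {}  # hash index: tupleID -> tuple
        # attribute -> sorted list of (value, tupleID); stand-in for a tree
        self.index: Dict[str, List[Tuple[float, int]]] = {a: [] for a in attributes}

    def insert(self, tuple_id: int, row: dict) -> None:
        self.by_id[tuple_id] = row
        for attr, entries in self.index.items():
            insort(entries, (row[attr], tuple_id))  # keep each index ordered

    def range_probe(self, attr: str, low: float, high: float) -> List[int]:
        """Return tupleIDs with low <= row[attr] <= high."""
        entries = self.index[attr]
        lo = bisect_left(entries, (low,))
        hi = bisect_right(entries, (high, float("inf")))
        return [tid for _, tid in entries[lo:hi]]


stem = DataSteM(["a", "b"])
stem.insert(0, {"a": 5, "b": 1})
stem.insert(1, {"a": 2, "b": 9})
stem.insert(2, {"a": 8, "b": 4})
matches = stem.range_probe("a", 2, 5)
```

Note that inserting into a sorted list costs linear time, unlike a balanced tree; the sketch only mirrors the lookup behavior.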

PSoup: Implementation

Query SteM


Allows sharing of work between queries that have
overlapping FROM clauses
Use red-black trees to index single-attribute single-relation boolean factors of a query
PSoup: Implementation

Query SteM
 For queries involving joins of multiple attributes, tree
structure doesn’t work
 Instead, a linked list called the predicateList is used
 Query SteM contains an array in which each cell
represents a query
 At beginning of probe by a data tuple, each cell is set
to the number of boolean factors in corresponding
query
 Every time tuple satisfies a boolean factor, cell value
is decremented
 At end of probe, if cell = 0, that means the data tuple
satisfies the given query
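
The counting scheme above can be sketched as follows. The names are hypothetical, and boolean factors are modeled as plain predicates checked one by one, whereas PSoup finds the satisfied factors through its predicate indexes.

```python
from typing import Callable, List, Tuple

Factor = Tuple[int, Callable[[dict], bool]]  # (query index, boolean factor)


def probe_query_stem(row: dict, factor_counts: List[int],
                     factors: List[Factor]) -> List[int]:
    """Return the indices of queries whose boolean factors are all satisfied."""
    # At the start of a probe, each cell holds the query's number of factors.
    counters = list(factor_counts)
    for query_index, factor in factors:
        if factor(row):
            counters[query_index] -= 1  # one fewer factor left to satisfy
    # A cell that reached zero means every factor of that query matched.
    return [qi for qi, remaining in enumerate(counters) if remaining == 0]


# Query 0: a < 5 AND b > 2; query 1: a < 5; query 2: b > 10.
factor_counts = [2, 1, 1]
factors: List[Factor] = [
    (0, lambda r: r["a"] < 5),
    (0, lambda r: r["b"] > 2),
    (1, lambda r: r["a"] < 5),
    (2, lambda r: r["b"] > 10),
]
satisfied = probe_query_stem({"a": 3, "b": 4}, factor_counts, factors)
```

Copying `factor_counts` at the start of each probe keeps the per-query totals intact for the next data tuple.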
PSoup: Implementation

Results Structure




Stores metadata indicating which tuples
satisfy which SQCs
This can be done either with the previously mentioned bitmap or by associating with
each query a linked list of the data tuples that satisfy it
Ordering by timestamp is simple for singletable queries
For Join queries, typically use oldest
timestamp
PSoup: Performance


Implemented in Java with customized
versions of Eddy and SteMs
Examined performance of two versions, plus a baseline (NoMat):



PSoup-Partial (PSoup-P): Maintain results
corresponding to SQCs in Results Structure,
and apply BEGIN-END clauses to retrieve
current results on query invocation
PSoup-Complete (PSoup-C): Continuously
maintains results corresponding to current
input window for each query in linked lists
NoMat: Measurements of a system that
doesn’t materialize results
PSoup: Performance

Storage Requirements



NoMat: Storage cost = space taken to store
base data streams within maximum window
over which queries are supported, plus size
of structures
PSoup-P: Storage cost = storage cost of
NoMat + size of Results Structure (either
bitarray or linked-list)
PSoup-C: Storage cost >> storage cost of
PSoup-P since C always stores current
results of standing queries at a given time
PSoup: Performance

Experimental Setup





Varied window sizes (2^7-2^16) and number (18) / type of boolean factors
Measured response time and maximum
supportable data arrival rate
Examined both P and C with and without
predicate indexes
Tested scheme to remove redundancies
arising from joins
Used synthetically generated query (2^7-2^12) and data
streams
PSoup: Performance

Response Time vs. Window Size
PSoup: Performance

Response Time vs. # Interval
Predicates
PSoup: Performance

Data Arrival Rate vs. # SQCs
PSoup: Performance

Summary of Results





Materializing results of queries supports
higher query invocation rates
Indexing queries and lazily applying windows
improves maximum data throughput
PSoup-C requires more memory
PSoup-C optimizes query invocation rate
PSoup-P optimizes data arrival rate
PSoup: Performance

Removing Redundancy in Join Processing
 Entry of a query specification or new data
 Composite tuples in joins

PSoup: Aggregation Queries
PSoup can support aggregate
functions
 Only possible to share data structures
across queries with identical SELECT-PROJECT-JOIN clause

PSoup: Conclusions




Treats data and query streams analogously
Can support queries that require access to data that
arrived before and after the query
Materializes results to cut down on response time and
to support disconnected operation
 Enables data recharging and monitoring
Future work:
 Write data streams to disk and execute queries over
them
 Transfer queries between disk and memory, allowing
query execution to be scheduled
 Confront resource constraints when dealing with
infinite streams
 Query browser for temporal data
Critique


Strengths
 Very well written, easy to follow
 Clear examples, excellent explanation of performance
results
 Strong method that reduces processing time with
increase in interval predicates
Weaknesses
 Lacking sufficient data on storage costs
 Experimentation only tested one multiple-relation
boolean factor for joins; unrealistic
 Didn’t address whether same (or similar) query could
be entered twice and accidentally given two IDs