Efficient Query Processing for Modern Data Management
Utkarsh Srivastava, Stanford University
Stanford InfoSeminar, July 26, 2016
Data Management System: User's View

[Figure: the user or client issues a query to the data management system; its query processor runs the query over the stored data and returns results.]
Behind the Scenes

[Figure: inside the query processor. A Planning Component uses a Cost Model and Statistics to choose a plan; an Execution Component runs the plan over the Data and returns Results; a Profiling Component observes execution and maintains the Statistics.]
A Conventional Relational DBMS

[Figure: the same architecture instantiated for a relational DBMS. Statistics: histograms, available indices, cardinalities. Cost model: disk I/O.]
Sensor Data Management

[Figure: queries are posed at a coordinator node, which gathers data from the sensor network and returns results.]
Sensor Data Management

[Figure: the same architecture for sensor networks. Statistics: sensor and network power consumption, sensor capabilities (in place of histograms, indices, and cardinalities). Cost model: total power consumed and latency (in place of disk I/O).]
Data Stream Management

[Figure: continuous queries (e.g., find all flows that are using > 0.1% of the bandwidth) run over data streams (e.g., a stream of network packet headers) and produce streamed results.]
Data Stream Management

[Figure: the same architecture for data streams. Statistics: 1-pass histograms, stream arrival rates, burst characteristics (in place of histograms, indices, and cardinalities). Cost model: throughput and accuracy (in place of disk I/O). The Execution Component runs over the incoming data streams.]
Query Processing over Web Services

[Figure: an example query over web services at a web/enterprise site. WS1 (at NASDAQ) maps company info to a stock symbol; WS2 (at NYSE) maps the symbol to stock price info; the query is answered by composing the two and returning the results.]
Query Processing over Web Services

[Figure: the same architecture over web services. Statistics: web service response times, WS costs, histograms. Cost model: latency and monetary cost. The Execution Component invokes WS1 through WSn.]
Summary of Contributions

• Query Processing over Web Services
  - Finding the optimal query execution plan [TR05]
• Data Stream and Sensor Data Management
  - Optimal operator placement in the network to minimize power consumption [PODS05]
  - Near-optimal sharing strategies for queries with common expensive filters [TR05]
  - Maximizing accuracy of stream joins in the presence of memory or CPU constraints [VLDB04, ICDE06]
• Profiling Component
  - Collecting statistics from execution feedback [ICDE06]
  - Collecting statistics from block-level sampling [SIGMOD04]
A Web Service Management System

[Figure: WSMS architecture. The client submits a query and input data through a declarative interface and receives results. A Metadata Component handles WS registration and schema mapping. A Query Processing Component performs plan selection and plan execution, issuing WS invocations to WS1, WS2, ..., WSn. A Profiling and Statistics Component contains a response-time profiler and a statistics tracker.]
Query over Web Services – An Example

• A credit card company wishes to send offers to those who
  a) have a high credit rating, and
  b) have a good payment history on a prior card.
• The company has at its disposal:
  - L: list of potential recipient SSNs
  - WS1: SSN → credit rating
  - WS2: SSN → card no(s)
  - WS3: card no. → payment history
Class of Queries Considered

• "Select-Project-Join" queries over:
  - a set of web services, each with interface Xi → Yi
  - input data
• Each input attribute of WSi is provided either by
  - an attribute of the input data, or
  - an output attribute of another web service WSj
    ⇒ precedence constraint WSj → WSi
• In general, the precedence constraints form a DAG (see the sketch below).
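As a concrete illustration, here is a minimal Python sketch (class and function names are mine, not from the talk) that derives the precedence constraints directly from each service's input and output attributes:

    # Minimal sketch (illustrative names): derive precedence constraints
    # WSj -> WSi whenever WSi needs an attribute that only WSj produces.
    from dataclasses import dataclass

    @dataclass
    class WebService:
        name: str
        inputs: set      # attributes the service needs
        outputs: set     # attributes the service produces

    def precedence_edges(services, input_data_attrs):
        """Return edges (producer, consumer); together they form a DAG."""
        edges = []
        for consumer in services:
            for attr in consumer.inputs:
                if attr in input_data_attrs:
                    continue  # provided directly by the input data
                for producer in services:
                    if producer is not consumer and attr in producer.outputs:
                        edges.append((producer.name, consumer.name))
        return edges

    # Example from the talk: L(SSN), WS1: SSN->cr, WS2: SSN->ccn, WS3: ccn->ph
    ws1 = WebService("WS1", {"SSN"}, {"cr"})
    ws2 = WebService("WS2", {"SSN"}, {"ccn"})
    ws3 = WebService("WS3", {"ccn"}, {"ph"})
    print(precedence_edges([ws1, ws2, ws3], {"SSN"}))  # [('WS2', 'WS3')]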
Plan 1

[Figure: the WSMS pipelines L (SSN) through the services in sequence. SSN → WS1 (SSN → cr), giving (SSN, cr); filter on cr, keep SSN; → WS2 (SSN → ccn), giving (SSN, ccn); → WS3 (ccn → ph), giving (SSN, ccn, ph); filter on ph, keep SSN; results are returned to the client. Note the pipelined processing.]
Simple Representation of Plan 1

L → WS1 → WS2 → WS3 → Results
Plan 2

[Figure: the WSMS sends L (SSN) both to WS1 (SSN → cr; filter on cr, keep SSN) and, in parallel, to WS2 (SSN → ccn) followed by WS3 (ccn → ph; filter on ph, keep SSN); the two branches are joined at the WSMS and the results returned to the client.]
Simple Representation of Plan 2

L → WS1, and L → WS2 → WS3; the two branches are joined to give the Results.
Which is More Efficient?

[Figure: Plan 1 (L → WS1 → WS2 → WS3 → Results) vs. Plan 2 (L fed to WS1 and to WS2 → WS3 in parallel, joined to give Results).]

In Plan 1, WS2 has to process only the filtered SSNs.
In Plan 2, WS2 has to process all SSNs.
Query Planning Recap

• Possible plans P1, …, Pn
• Statistics S
• Cost model cost(Pi, S)
• Want to find the least-cost plan
Statistics

• ci: per-tuple response time of WSi
  - Assume independent response times
• si: selectivity of WSi
  - Average number of output tuples per input tuple to WSi
  - Assume independent selectivities
  - May be ≤ 1 or > 1
Bottleneck Cost Metric

[Figure: a lunch buffet with Dishes 1 through 4 as pipeline stages.]

Overall per-item processing time = response time of the slowest (bottleneck) stage in the pipeline.
Cost Expression for Plan P

Ri(P): predecessors of WSi in plan P

Fraction of input tuples seen by WSi = ∏_{WSj ∈ Ri(P)} sj
Response time per original input tuple at WSi = ci · ∏_{WSj ∈ Ri(P)} sj

cost(P) = max_i ( ci · ∏_{WSj ∈ Ri(P)} sj )

Assumption: the WSMS cost is not the bottleneck.
Contrast with the sum cost metric.
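To make the metric concrete, here is a small Python sketch (the numbers are illustrative assumptions, not measurements from the talk) that evaluates the bottleneck cost of Plan 1 and Plan 2 from the example:

    # A small sketch (not the paper's code) of the bottleneck cost metric:
    # cost(P) = max_i ( c_i * product of s_j over all predecessors WS_j of WS_i ).
    from math import prod

    def bottleneck_cost(costs, selectivities, predecessors):
        """costs[i], selectivities[i] per service; predecessors[i] = set of
        services whose output WS_i consumes in this plan."""
        return max(
            costs[i] * prod(selectivities[j] for j in predecessors[i])
            for i in costs
        )

    # Illustrative numbers (assumed, not from the talk):
    c = {"WS1": 1.0, "WS2": 5.0, "WS3": 2.0}
    s = {"WS1": 0.1, "WS2": 1.0, "WS3": 0.5}

    plan1 = {"WS1": set(), "WS2": {"WS1"}, "WS3": {"WS1", "WS2"}}  # L -> WS1 -> WS2 -> WS3
    plan2 = {"WS1": set(), "WS2": set(), "WS3": {"WS2"}}           # WS1 parallel to WS2 -> WS3

    print(bottleneck_cost(c, s, plan1))  # 1.0  (WS2 sees only the 10% of tuples that pass WS1)
    print(bottleneck_cost(c, s, plan2))  # 5.0  (WS2 sees every input tuple and becomes the bottleneck)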
Problem Statement

Input:
• Set of web services WS1, …, WSn
• Response times c1, …, cn
• Selectivities s1, …, sn
• Precedence constraints among the web services

Output:
• Plan P arranging the web services into a DAG such that
  - P respects all precedence constraints, and
  - cost(P) by the bottleneck metric is minimized.
No Precedence Constraints

• All selectivities ≤ 1
  - Theorem: it is optimal to linearly order the web services by increasing ci (sketched below).
• General case
  - [Figure: the selective web services form a chain in increasing cost order; the proliferative web services are handled separately, with a local join at the WSMS producing the Results.]
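A minimal sketch of the all-selectivities-≤-1 case (illustrative numbers, not the paper's implementation): sort by increasing per-tuple cost and evaluate the resulting chain.

    # Sketch: with no precedence constraints and every s_i <= 1,
    # chain the services in increasing-cost order and compute the bottleneck cost.
    def order_and_cost(services):
        """services: list of (name, c_i, s_i) tuples with every s_i <= 1."""
        chain = sorted(services, key=lambda ws: ws[1])   # increasing c_i
        cost = 0.0
        seen_fraction = 1.0                              # fraction of tuples reaching this stage
        for _, c_i, s_i in chain:
            cost = max(cost, c_i * seen_fraction)        # bottleneck so far
            seen_fraction *= s_i                         # tuples passed on downstream
        return [name for name, _, _ in chain], cost

    # Illustrative (assumed) numbers:
    chain, cost = order_and_cost([("WS1", 1.0, 0.1), ("WS3", 2.0, 0.5), ("WS2", 5.0, 1.0)])
    print(chain, cost)   # ['WS1', 'WS3', 'WS2'] 1.0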
With Precedence Constraints

Greedy Algorithm:

    while (not all web services added to P) {
        from all web services that can be added to P,
        choose the WSx that can be added to P at least cost
        add WSx to P
    }

The least-cost choice of WSx is a linear program, solved using network flow techniques.
Running time: O(n^5)
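The loop can be sketched in Python as below. Only the skeleton is shown: picking the least-cost way to add a service is the subproblem the talk says is solved as a linear program via network-flow techniques, and it appears here only as an assumed incremental_cost callback.

    # Skeleton of the greedy loop only (a sketch, not the paper's algorithm in full);
    # incremental_cost stands in for the network-flow-based least-cost-addition step.
    def greedy_plan(services, precedence, incremental_cost):
        """services: iterable of names; precedence: dict mapping each service to
        the set of services that must already be in the plan before it."""
        plan = []
        remaining = set(services)
        while remaining:
            # services whose precedence constraints are already satisfied
            eligible = [ws for ws in remaining if precedence.get(ws, set()) <= set(plan)]
            best = min(eligible, key=lambda ws: incremental_cost(plan, ws))
            plan.append(best)
            remaining.remove(best)
        return plan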
Implementation

• Building a prototype general-purpose WSMS
  - Written in Java
  - Uses Apache Axis, an open-source implementation of SOAP
  - Axis generates a Java class for each web service, dynamically loaded using Java Reflection
• Experiments so far primarily show the validity of the theoretical results [TR05].
Summary of Contributions

• Query Processing over Web Services
  - Finding the optimal query execution plan [TR05]
• Data Stream and Sensor Data Management
  - Optimal operator placement in the network to minimize power consumption [PODS05]
  - Near-optimal sharing strategies for queries with common expensive filters [TR05]
  - Maximizing accuracy of stream joins in the presence of memory or CPU constraints [VLDB04, ICDE06]
• Profiling Component
  - Collecting statistics from execution feedback [ICDE06]
  - Collecting statistics from block-level sampling [SIGMOD04]
Gathering Statistics from Query Feedback

[Figure: the query-processing architecture again (Planning, Execution over the Data and WS1 … WSn, Cost Model, Statistics); feedback from the Execution Component flows to the Profiling Component, which maintains the Statistics used by the Planning Component.]
Basic Idea

• Executing a user query with predicate P ⇒ feedback on cardinality(P, database)
• Over time, predicates P1, P2, … are queried
• This feedback can be used to build and refine a histogram over time
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK          ?        ?
  India       ?        ?

Prior knowledge: total number of tuples = 100
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         25        25
  India      25        25

From table stats: total number of tuples = 100
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         25        25
  India      25        25

SELECT Name FROM Fans WHERE Country = 'UK'   →   feedback: 20 tuples
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         10        10
  India      25        25

SELECT Name FROM Fans WHERE Country = 'UK'   →   feedback: 20 tuples
(the UK cells are scaled down to match the feedback)
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         10        10
  India      40        40

SELECT Name FROM Fans WHERE Country = 'UK'   →   feedback: 20 tuples
(the India cells are scaled up so the total stays at 100)
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         10        10
  India      40        40

Accuracy improves even for India.
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         10        10
  India      40        40

SELECT Name FROM Fans WHERE Sport = 'Soccer'   →   feedback: 70 tuples
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         35        10
  India      35        40

SELECT Name FROM Fans WHERE Sport = 'Soccer'   →   feedback: 70 tuples

Not possible! Scaling the Soccer column uniformly to 35 + 35 would put 45 tuples in the UK row, contradicting the earlier UK = 20 feedback.
Example

Fans(Name, Country, Sport)

           Soccer   Cricket
  UK         10        10     (uniform)
  India      60        40     (non-uniform)

SELECT Name FROM Fans WHERE Sport = 'Soccer'   →   feedback: 70 tuples
Solution

• Goals
  - Maintain consistency with observed feedback
  - Maintain uniformity when feedback gives no information
• Use the maximum-entropy principle
  - Entropy is a measure of the uniformity of a distribution
• Leads to an optimization problem (written out below)
  - Variables: number of tuples in the histogram buckets
  - Constraints: feedback observed so far
  - Objective: maximize entropy
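Written out, with n_b denoting the number of tuples in histogram bucket b and N the total tuple count (this notation is mine, not from the talk), the problem is roughly:

    maximize     − Σ_b (n_b / N) · log(n_b / N)
    subject to   Σ_{b ∈ buckets(P_k)} n_b = card(P_k)   for each observed predicate P_k
                 Σ_b n_b = N,   n_b ≥ 0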
Solution to Example

Constraints from feedback:
  Total tuples = 100
  Total tuples with Country = 'UK' = 20
  Total tuples with Sport = 'Soccer' = 70

Maximum-entropy solution:

           Soccer   Cricket
  UK         14         6
  India      56        24

• Maintains consistency and uniformity
  - 70-30 Soccer-Cricket ratio at both countries
  - 20-80 UK-India ratio for both sports
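One standard way to compute such a solution for simple row/column constraints is iterative proportional fitting; the sketch below (illustrative only, not the system's actual solver, which handles general max-entropy programs) reproduces exactly the numbers above:

    # Sketch: iterative proportional fitting (raking) toward the max-entropy
    # table satisfying the row and column totals implied by the feedback.
    def ipf(table, row_targets, col_targets, iters=50):
        for _ in range(iters):
            # scale each row to its target total
            for r, target in enumerate(row_targets):
                row_sum = sum(table[r])
                table[r] = [x * target / row_sum for x in table[r]]
            # scale each column to its target total
            for c, target in enumerate(col_targets):
                col_sum = sum(row[c] for row in table)
                for row in table:
                    row[c] *= target / col_sum
        return table

    # Start uniform (100 tuples over 4 buckets); rows: UK, India; cols: Soccer, Cricket.
    # Row targets follow from total = 100 and UK = 20; column targets from Soccer = 70.
    start = [[25.0, 25.0], [25.0, 25.0]]
    print(ipf(start, row_targets=[20, 80], col_targets=[70, 30]))
    # [[14.0, 6.0], [56.0, 24.0]]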
Making it Work in a Real System (IBM DB2)

• Maintaining a controlled-size histogram
  - Solving the maximum-entropy problem yields an importance measure for feedback
  - Discard "unimportant" feedback by this measure
• Updates may lead to inconsistent feedback
  - Heuristic to eliminate inconsistent feedback
  - Preferentially discard older feedback
Experimental Evaluation

[Chart: static data. Average relative error vs. number of queries (0 to 500) for ISOMER, STGrid, and the Optimizer.]

[Chart: dynamic data, uniform to correlated. Average relative error vs. number of queries (0 to 2000) for ISOMER, STGrid, and the Optimizer.]
Related Work

• Web Service Composition
  - No formal cost model or provably optimal strategies
  - Targeted towards workflow-oriented applications
• Parallel and Distributed Query Optimization
  - Freedom to move query operators around
  - Much larger space of execution plans
• Data Integration, Mediators
  - Integrate general sources of data
  - Optimize the cost at the integration system itself
• Feedback-based statistics
  - Self-tuning histograms, STHoles
Summary of Contributions

• Query Processing over Web Services
  - Finding the optimal query execution plan [TR05]
• Data Stream and Sensor Data Management
  - Optimal operator placement in the network to minimize power consumption [PODS05]
  - Near-optimal sharing strategies for queries with common expensive filters [TR05]
  - Maximizing accuracy of stream joins in the presence of memory or CPU constraints [VLDB04, ICDE06]
• Profiling Component
  - Collecting statistics from execution feedback [ICDE06]
  - Collecting statistics from block-level sampling [SIGMOD04]
Operator Placement in Sensor Networks [PODS05]

[Figure: the sensor-network hierarchy above the data acquisition layer, annotated with "increasing computational power" and "cheaper links".]

Problem: tradeoff between processing and communication; what is the best operator placement?
Solution: an operator placement algorithm for filter-join queries guaranteeing minimum total cost.
Sharing Among Continuous Queries [TR05]

    SELECT *       FROM S WHERE F1 ∧ F2 ∧ F3
    SELECT AVG(a)  FROM S WHERE F1 ∧ F2 ∧ F4
    SELECT MAX(b)  FROM S WHERE F2 ∧ F3

• Problem: what is the best way to save the redundant work due to shared expensive filters?
• Solution:
  - The optimal plan must be adaptive
  - Achieve within a factor O(log² m · log n) of the best sharing
  - Implementation shows large gains at little overhead.
Maximizing Accuracy of Stream Joins [VLDB04]

    SELECT AVG(p2.ts - p1.ts)            -- aggregation over the join
    FROM Packets1 p1, Packets2 p2
    WHERE p1.id = p2.id

Input rates may be very high; the join state may be very large.

• Problem: maximizing join accuracy under resource constraints.
• Solution: given memory or CPU constraints,
  - find the maximum subset of the join result, or
  - find the largest possible random sample of the join result.
Statistics using Block-Level Sampling [SIGMOD04]

Cost of reading a random byte ≈ cost of reading the entire block (100 tuples per block).
  Uniform random sampling:  1000 samples ≈ 1000 I/Os
  Block-level sampling:     1000 samples =   10 I/Os

Problem: naïvely applying techniques meant for uniform samples to block-level samples can cause huge errors.
Solution:
• New techniques for computing histograms and distinct values from a block-level sample
• Robust and optimal for all layouts
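For concreteness, a tiny sketch of the two sampling modes (illustrative only; the paper's contribution is the estimators that stay accurate on block-level samples, not the sampling itself):

    # Sketch: uniform row-level sampling vs. block-level sampling.
    # With ~100 tuples per block, each uniformly sampled row typically costs one
    # block read, while a block-level sample yields ~100 tuples per read; the
    # catch is that tuples within a block may be correlated.
    import random

    def uniform_row_sample(blocks, n_rows):
        all_rows = [row for block in blocks for row in block]
        return random.sample(all_rows, n_rows)        # roughly n_rows block reads

    def block_level_sample(blocks, n_blocks):
        chosen = random.sample(blocks, n_blocks)      # exactly n_blocks block reads
        return [row for block in chosen for row in block]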
Future Directions

• Development of a comprehensive WSMS
  - Web service reliability/quality
  - Extension to handle workflows
  - Apply feedback-based profiling
  - Web services with monetary cost
• Query processing in other new data management scenarios
  - Infrastructures that enforce data privacy [CIDR05]