Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan*

advertisement
Primitives for Workload
Summarization and Implications
for SQL
Prasanna Ganesan*
Stanford University
Surajit Chaudhuri
Vivek Narasayya
Microsoft Research
*Work done at Microsoft Research
1
Motivation
• Workload: Set of SQL Statements
• Many tasks exploit workload information
– DB Admin, Index Tuning, Statistics building,
Approximate Query Processing
• DBMS profilers produce large workloads
(+additional info)
• Most tasks need small workloads
• Goal: Summarization - Find a “representative”
subset of a given, large workload.
– Sometimes a weighted subset
2
Why Not Random Sampling?
• One Size does not fit all
– Different definitions of “representative subset”
– Random sampling may lose valuable info
• Ignores additional info associated with
statements
• Shown to work poorly, e.g., for Index Selection
[chaudhuri02]
– May oversample queries on some tables, while
ignoring less frequent queries on other tables
3
Our Solution
1. Treat input as a relation
•
Each SQL statement (+associated info) is a tuple
2. Extend SQL with new language primitives
•
•
Allow declarative specification of desired subset
Usable on arbitrary relations, not just workloads
3. Implement extensions inside query engine
•
•
Why? Primitives appear widely applicable
Other implementation options available
4
The Architecture
Query
ID
SQL
String
FROM
Tables
Q1
SELECT *
FROM R1, R2
{R1, R2}
Q2
…
…
..
…
…
Execution
Engine
SELECT *, DOMSUM(Count) FROM WkldTbl
DOMINATE WITH PARTITIONING BY FromTables, JoinConds,
WhereCols
(SLAVE.GroupByCols  MASTER.GroupByCols) AND
(SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
REPRESENT WITH PARTITIONING BY FromTables,
JoinConds, WhereCols MAXIMIZING SUM(DOM_Count)
GLOBAL CONSTRAINT Count(*) ≤ 200
LOCAL CONSTRAINT
2.5 Count(*) ≥
3.03
int(200*LOCAL.Count(*)/GLOBAL.Count(*))
…… Estimated
Cost
..
Execution
Cost
…
…
…
…
Summary
Application
5
Outline
• New Primitives for Summarization (Subsetting)
– Dominance
– Representation
• Implementing summarization primitives in SQL
• Experiments
6
Dominance
• Idea: Filter and aggregate using a partial
order on tuples
• Specify condition for one tuple to dominate
another
– Transitive condition
– Encapsulates application knowledge
• Output: Keep throwing away tuples that are
dominated
– Retain aggregate info about dominated tuples
7
A Graphical Representation
2Cattivo
Vendor
Quality
Price
6Buono
375
25
3
50
2
50
8
Applying Dominance to Workloads
• Example: Index Selection
Q2
Q1
SELECT … FROM R
SELECT ... FROM R
GROUP BY A, B, C
dominates
GROUP BY A, B
– An index useful for Q1 likely to be useful for Q2
MASTER.FromTables=SLAVE.FromTables AND
MASTER.GroupByCols  SLAVE.GroupByCols AND
MASTER.OrderByCols PREFIX SLAVE.OrderByCols
9
Outline
• New Primitives for Summarization (Subsetting)
– Dominance
– Representation
• Implementing Summarization Primitives in SQL
• Experiments
10
Representation
• Dominance only gets us so far
– Need a “lossier” way to select a subset
•
Idea: Pick a subset that solves
a Linear Program
– Optimize some criterion
– Satisfy lots of constraints
– Support concept of partitioning
11
Details
• Partition tuples by a set of attributes
A
1
1
1
•
•
B
10
..
..
C
..
..
..
A
2
2
2
B
5
..
B..
C
..
..
..
A
C
1
10
..
2
5
..
Criterion: Maximize/Minimize
Aggregate
3
7
– E.g., Minimize1 Count(*)
…
2
…
..
Global Constraints
3
…
… > 60% Sum(B)
..
– E.g., Sum(B) ..in chosen subset
A
3
3
3
B
7
..
..
C
..
..
..
in input
• Local Constraints - apply to each partition
– E.g., Sum(B) in chosen subset > 40% Sum(B) in that partition
12
An Index Selection Example
• Partition by Tables, Join Conditions and
attributes in WHERE clause
• Criterion: Maximize Sum(ExecutionCost)
– Need best “coverage”
• Global Constraint: Count(*) ≤ 200
• Local Constraint: Proportionate representation
– A partition with 20% of input should have 20% of
output
– Count(*) ≥int(200*LOCAL.Count(*)/GLOBAL.Count(*))
13
Putting it all together
1. Apply dominance criterion (as earlier).
2. Apply representation (as earlier, but maximize
SUM(DOM_Count) ).
3. Weight each tuple by the number of tuples it dominates.
SELECT SqlString, DOMSUM(Count) FROM WkldTbl
DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
(SLAVE.GroupByCols  MASTER.GroupByCols) AND (SLAVE.OrderByCols PREFIX
MASTER.OrderByCols)
REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
MAXIMIZING SUM(DOM_Count)
GLOBAL CONSTRAINT Count(*) ≤ 200
LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))
14
Outline
• New Primitives for Summarization (Subsetting)
– Dominance
– Representation
• Implementing Summarization Primitives in SQL
• Experiments
15
Implementing Summarization
Primitives in SQL
• Assume set and sequence support in SQL
– The mills of the standards bodies…
• Partitioning useful for both primitives
– Hashing, Sort-based, Index-based…
• Implementing Dominance
– Naïve O(n2) algorithm
– Techniques from group-wise processing
– Leverage Skyline optimizations
16
Representation
• Implementing directly is LP-hard
• Many queries are much simpler
– Fall into one of two special cases
• Other queries are handled by a simple heuristic
– User-guided search
• Implement as multiple operators
17
User-Guided Search
• Scan tuples in a specific order
– User-specified, or heuristically chosen
• Will always minimize/maximize Count(*)
– Use ordering to transform other objectives
– Slightly different algorithms for the two cases
18
A Minimization Example
F
E
Satisfied
D
C
Output
B
A
Violated
19
Two Special Cases
• Maximize SUM(Attr)
– All constraints are on Count(*)
– Use partitioning and sort-order access
• Minimize Count(*)
– Single constraint: Again easily solved
– More special cases also solvable
– Multiple constraints: Approximation algorithm
20
Experiments
• Evaluate utility for index selection
• Compare to sophisticated Wkld. Compression
[chaudhuri02]
– Clusters using a complex distance function
• Simple query as described earlier
– Constrained to output same number of statements as
Workload Compression
– Orders of magnitude faster
• TPC-H 1GB database
– Multiple synthetic workloads introduced in
[chaudhuri02]
21
Experiments (Contd.)
Workload
Compress
Tuning
Wizard
Evaluate
Total Estimated Cost
22
Comparing Estimated Costs
Wkld Compression
Proportionate(Syntactic)
Estimated Cost
90000
80000
70000
60000
50000
40000
30000
20000
10000
0
SPJ
SPJ-GB
SPJ-GBOB
SingleTable
Workloads
23
Conclusion
• Our contributions
– Summarization can be expressed declaratively
– Introduction of new operators for summarization
– Discussion of SQL implementation
• The Future
– An automatic monitoring and tuning infrastructure?
– More workload-sensitive tasks?
24
Download