dynamic-opt

advertisement
Dynamic Query Optimization
2
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Problems with static optimization



Cost function instability: cardinality error of n-way join grows exponentially
with n
Unknown run-time bindings for host variables
Changing environment parameters: amount of available space, concurrency
rate, etc
Static optimization comes in two flavours:
1. Optimize query Q, store the plan, run it whenever Q is posed
2. Every time when Q is posed, optimize it and run it
3
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Early Solutions
1. run several plans simultaneously for a short time, and then select one “best”
plan and run it for a long time
2. at every point in a standard query plan where the optimizer cannot accurately
estimate the selectivity of an input, a choose-plan operator is inserted
Select
Unbound predicate
Choose-Plan
File Scan
B-tree-scan
Get-Set R
4
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Dynamic Mid-Query Reoptimization
Features of the algorithm:





5
Annotated query execution plan
Runtime collection of statistics
Dynamic resource reallocation
Query plan modification
Keeping overhead low
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Motivating Example
select
from
where
and
and
and
6
avg(Rel1.selectattr1),
avg(Rel1.selectattr2),
Rel1.groupattr
Rel1, Rel2, Rel3
Rel1.selectatrr1 <: value1
Rel1.selectatrr2 <: value2
Rel1.jointatrr2 = Rel2.jointatrr2
Rel1.jointatrr3 = Rel3.jointatrr3
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Collection of Statistics
Limitations:


7
Can only collect statistics that can be gathered in one pass
Not useful for pipelined execution
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Dynamic Resource Reallocation




8
Assume 8MB memory available and 4.2MB necessary for each hash-join
The optimizer allocates 4.2MB for the first hash-join and 250KB for the second
(causing it to execute in two passes)
During execution, the statistics collector find out that only 7,500 tuples produced
by the filter
The memory manager allocates each of the two hash-joins 2.05MB
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Query Plan Modification

Once the statistics are available, modify the plan on the fly
– Hard to implement!
Original plan
9
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Modified plan – optimal solution
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Query Plan Modification: practical solution
• Store a partially computed query to disk
• Submit a new query using the partial results
select
avg(Temp1.selectattr1),
avg(Temp1.selectattr2),
Temp1.groupattr
from
Temp1, Rel3
where
Temp1.joinatrr3=Rel3.joinattr3
group by Temp1.groupattr
10
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Robust Query Processing through Progressive
Optimization
11
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Motivation
 Estimation errors in query optimization
– Due to correlations in data
SELECT count(*) from cars, accidents, owners
WHERE c.id = a.cid and c.id=o.cid and
c.make=‘Honda’ and c.model=‘Accord’
– Over-specified queries
SELECT * from customers where SSN=blah and name=blah’
– Mis-estimated single-predicate selectivity
SELECT count(*) from cars where c.make=?
– Out-of-date statistics
 Can cause bad plans
 Leads to unpredictable performance
12
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Traditional Query Processing
SQL Compilation
Statistics
Optimizer
Optimizer
Best
Best
Plan
Plan
Plan
Execution
13
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
LEO: DB2’s Learning Optimizer
Statistics
SQL Compilation
Optimizer
Optimizer
4. Exploit
3. Feedback
Best
Best
Plan
Plan
Adjustments
Adjustments
2. Analyze
Plan
Plan
Execution
Execution
1. Monitor
14
Estimated
Estimated
Cardinalities
Cardinalities
Actual
Actual
Cardinalities
Cardinalities
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Use feedback from
cardinality errors to
improve future plans
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Progressive Optimization (POP)
Statistics
SQL Compilation
Optimizer
Optimizer
3
4
New Best Plan
“MQT”with
Actual
Cardinality
Best Plan
Best
Plan
With
CHECK
5
2
New
knl Plan
Execution
6
15
Plan
Execution
with CHECK
Partial
Results
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
1
Confidential
Re-optimize If
CHECK fails
Use feedback from
cardinality errors to
improve current plan
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Outline


16
Progressive Optimization
– Solution overview
– Checkpoint placement
– Validity range computation
Performance Results
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Progressive Optimization


Why wait till query is finished to correct problem?
– Can detect problem early!
– Correct the plan dynamically before we waste any more time!
May never execute this exact query again
– Parameter markers
– Rare correlations
– Complex predicates

Long-running query won’t notice re-optimization overhead
Result: Plan more robust to optimizer mis-estimates
17
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Solution Overview


18
Add CHECKpoints to Query Execution Plans
– Check Estimated cardinalities vs. Actuals at runtime
When checking fails:
– Treat already computed (intermediate) results as materialized views
– Correct the cardinality estimates based on the actual cardinalities
– Re-optimize the query, possibly exploiting already performed work

Questions:
– Where to add checkpoints?
– When is an error big enough to be worth reoptimizing?

Tradeoff between opportunity (# reoptimization points) and
risk
(performance regression)
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
CHECK Placement (1)
 Three constraints
 Must not have performed side-effects
– Given out results to application
– Performed updates
 Want to reuse as much as possible
 Don’t reoptimize if the plan is almost finished
19
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
CHECK Placement (2)
 Lazy CHECK:
NLJN
– Just above a dam: TEMP, SORT, HSJN inner
– Very low risk of regression
Lazy Check
– Provides safeguard for hash-join, merge-join, etc.
 Lazy Checking with Eager Materialization
DAM
– Pro-actively add dams to enable checkpointing
– E.g. outer of nested-loops join
Eager Check
 Eager Checking
– It may be too late to wait until the dam is complete
– Check cardinalities before tuples are inserted into the dam
 Can extrapolate to estimate final cardinality
20
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
CHECK Operator Execution
 IF actual cardinality not in [low, high]):
– Save as a “view match structure” whose
 Definition (“matching”) was pre-computed at compile time
 Cardinality is actual cardinality
– Terminate execution & return special error code
– Re-invoke query compiler
 ELSE continue execution
 How to set the [low,high] range?
21
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Outline


22
Progressive Query Processing
– Solution overview
– Checkpoint placement
– Validity range computation
Performance Results
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Validity Range Determination (1)



23
At a given operator, what input cardinality change will cause a plan change? i.e. when is
this plan valid
In general, equivalent to parametric optimization
– Super-exponential explosion of alternative plans to consider
– Finds optimal plan for each value range, for each subset of predicates,
So we focus on changes in a single operator
– Local decision
– E.g. NLJN HSJN
– Not join order changes
– Advantage: Can be tracked during original optimization
– Disadvantage: Pessimistic model, since it misses reoptimization opportunities
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Validity Range Determination (2)
L2
L1
P2
P1
P
inner
outer
inner
outer
P
Q
Q

Suppose P1 and P2 considered during optimizer pruning
– cost(P1, est_cardouter) < cost(P2, est_cardouter)
– Estimate upper and lower bounds on cardouter s.t. P2 dominates P1
– Use bounds to update (narrow) the validity range of outer (likewise for inner)
 Applies to arbitrary operators
 Can be applied all the way up the plan tree
24
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Example of a Cost Analysis
 Lineitem × Orders query
– Vary selectivity of o_orderdate < ‘date’ predicate
 N1,M1,H1: Orders as outer
– N1, M1: SORT on outer
– N1: ISCAN on inner
 N2,M2,H2: Lineitem as outer
 Optimal Plan: N1H2M1
25
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
H2
M1
N1
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Upper Bounds from pruning M1 with N1
 Upper bounds vary
 Misses pruning with H2 because outer/inner reversed
 Still upper bounds set conservatively; no false reoptimization
26
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Lower Bounds from pruning N1 with M1
M1
H2
N1
27
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Outline


28
Progressive Query Processing
– Solution overview
– Checkpoint placement
– Validity range computation
Performance Results
– Parameter markers (TPCH query)
– Correlations (customer workload for a motor vehicles department)
– Re-optimization Opportunities with POP
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Robustness for Parameter Marker in TPC-H Query 10
800
Default Selectivity Estimate, with POP
Default Selectivity Estimate
Correct Selectivity Estimate
Execution Time
700
600
500
4-way Join:
goes thru 5
different
optimal plans
400
300
200
100
0
0
20
40
60
80
100
Actual Selectivity
29
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Response Time of DMV with and without POP
1400
response time
1200
1000
800
600
400
200
0
standard
POP
Box: 25th to 75th percentile of queries
30
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Speed-Up (+) vs. Regression (-) of DMV with POP
Speedup (+) vs. Regression (-)
Factor
90
80
70
60
50
40
30
20
10
0
-10
39 Real-World Complex Queries
31
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Scatter Plot of Response Times for DMV
1500
Response time with POP
1250
Degradation
1000
750
Improvement
500
250
0
0
250
500
750
1000
1250
1500
Response Time without POP
32
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Reoptimization Opportunities with POP
Fraction of Query Execution Completed
1.2
LC (above HJ)
LCEM
LC (above TMP/SORT)
1
0.8
0.6
0.4
0.2
0
Q2 Q3
33
Q4
Q5 Q7
Q8
Queries
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
Q11 Q18
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Related Work





34
Choose-Plans: Graefe/Cole, Redbrick, …
Parametric Query Optimization
Least-expected cost optimization
Kabra/DeWitt Mid-query re-optimization,
Query Scrambling
Runtime Adaptation
– Adaptive Operators:
 DB2/zOS, DEC RDB, …: adaptive selection of access methods
 Ingres: adaptive nested loop join
 XJoin, Tukwila: adaptive hash join
 Pang/Carey/Livny, Zhang/Larson: dynamic memory adjustment
 …
– Convergent query processing
– Eddies: adaptation of join orders
– SteMs: adaptation of join algorithms, spanning trees, …
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Conclusions
35




POP makes plans for complex queries more robust to optimizer misestimates
Significant performance improvement on real workloads
Overhead of re-optimization is very low, scales with DB size
Validity ranges tell us how risky a plan is
– Can be used for many applications to act upon cardinality sensitivity

Future Work:
– CHECK estimates other than cardinality
 # concurrent applications
 Memory available in buffer pool, sort heap
 Actual run time, actual # I/Os
– Avoid re-optimization too late in plan of if cost of optimization too high
– Re-optimization in shared-nothing query plans
– Extend validity ranges to more general plan robustness measures
Progressive
Query
Processing
Transparent
Access
to Grid| ACM
DataSIGMOD
Objects2004
| IBM
Confidential
© 2004
IBM
Corporation
© 2003
IBM
Corporation
Download