Query Optimization

CS 540
Database Management Systems
Lecture 8: Query Optimization
DBMS Architecture
[Figure: DBMS architecture diagram. Queries and transactions come from users, web forms, applications, and the DBA. Components shown: Query Parser, Query Rewriter, Query Optimizer, Query Executor, Transaction Manager, Lock Manager, Logging & Recovery, Files & Access Methods, Buffer Manager, Storage Manager, Storage, and Main Memory (Buffers, Lock Tables). Labels in the figure: "Today’s lecture" and "Past lectures".]
Many query plans to execute a SQL query
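• Compute the join R(A,B) ⋈ S(B,C) ⋈ T(C,D) ⋈ U(D,E)
[Figure: three alternative join trees over R, S, T, and U]
• Even more plans: multiple algorithms to execute each operation
[Figure: one join tree annotated with physical operators: hash join and sort-merge joins for the joins, index scans and table scans for access to R, S, T, and U]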
Query optimization: picking the fastest plan
• Optimal approach
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What’s wrong?
• Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently
• Optimization itself should be fast!
Cost-based optimization
• Plan space
– which plans to consider?
– exploring all alternatives is time consuming.
• Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like accurate estimates
• Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans
Reduce plan space by query rewriting
• Multiple logical query plans for each SQL query
Star(name, birthdate), StarsIn(movie, name, year)
SELECT movie
FROM Star, StarsIn
WHERE Star.name = StarsIn.name AND year = 1950
Plan 1: π_{movie}(σ_{year=1950}(StarsIn ⋈_{StarsIn.name = Star.name} Star))
Plan 2: π_{movie}(σ_{year=1950}(StarsIn) ⋈_{StarsIn.name = Star.name} Star)
Plan 2 is generally faster.
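A minimal sketch of why pushing the selection below the join helps, using made-up rows for Star and StarsIn (the data and the helper `join` are assumptions for illustration, not part of the lecture). Both plans return the same movies, but the second one feeds fewer rows into the join.

```python
# Toy in-memory tables; the rows are invented for illustration only.
star = [("Ann", 1920), ("Bob", 1925), ("Cy", 1930)]        # (name, birthdate)
stars_in = [("M1", "Ann", 1950), ("M2", "Bob", 1950),
            ("M3", "Ann", 1960), ("M4", "Cy", 1970)]       # (movie, name, year)

def join(stars_in_rows, star_rows):
    """Nested-loop join on StarsIn.name = Star.name."""
    return [(m, n, y, b)
            for (m, n, y) in stars_in_rows
            for (sn, b) in star_rows
            if n == sn]

# Plan 1: join first, then select year = 1950, then project movie.
joined = join(stars_in, star)
plan1 = sorted(m for (m, n, y, b) in joined if y == 1950)

# Plan 2: select year = 1950 first, then join, then project movie.
filtered = [row for row in stars_in if row[2] == 1950]
plan2 = sorted(m for (m, n, y, b) in join(filtered, star))

# Number of StarsIn rows that each plan passes into the join.
rows_into_join_plan1 = len(stars_in)
rows_into_join_plan2 = len(filtered)
```

The gap between the two row counts grows with the selectivity of the predicate, which is the reason the rewriter prefers the pushed-down form.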
Reduce plan space by query rewriting
• Push selection down to reduce # of rows
• Push projection down to reduce # of columns
SELECT movie, name
FROM Star, StarsIn
WHERE Star.name = StarsIn.name
[Figure: two logical plans. In the first, π_{movie, name} is applied once on top of StarsIn ⋈_{StarsIn.name = Star.name} Star. In the second, projections are also pushed below the join, onto the base tables StarsIn and Star.]
Less effective than pushing down selection.
Cost estimation
• Cost of an operator depends on its input size
– sort-merge join of R ⋈ S: 3 B(R) + 3 B(S), where B(R) is the number of blocks of R
– the inputs may be the outputs of other operators
• sort-merge join for (R ⋈ S) ⋈ T: 3 B(T) + 3 B(R ⋈ S) + 3 B(R) + 3 B(S)
• For each operator in a plan compute
– input size
– cost
– output size (for the next operator)
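The bottom-up computation above can be sketched as follows. The block counts, including the estimate for B(R ⋈ S), are hypothetical numbers chosen only to make the arithmetic concrete; a real optimizer would derive the intermediate size from statistics.

```python
def sort_merge_cost(blocks_left, blocks_right):
    """I/O cost of one sort-merge join: 3 B(left) + 3 B(right)."""
    return 3 * blocks_left + 3 * blocks_right

# Hypothetical block counts, for illustration only.
B_R, B_S, B_T = 1000, 500, 200
B_RS = 300  # assumed estimate of B(R ⋈ S), the output size of the inner join

inner = sort_merge_cost(B_R, B_S)    # 3 B(R) + 3 B(S)
outer = sort_merge_cost(B_RS, B_T)   # 3 B(R ⋈ S) + 3 B(T)
total = inner + outer                # cost of the plan (R ⋈ S) ⋈ T
```

Note that the output size of the inner operator (B_RS) becomes the input size of the outer one, which is why every operator needs an output-size estimate.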
Cost estimation
• Input: stats on each table
– NCARD(R): Number of tuples in R
– TCARD(R): Number of blocks in R
• TCARD(R) = NCARD(R) / (number of tuples per block)
– ICARD(R,A): Number of distinct values of attribute A in R
• too much information to keep per value; use a histogram
• Assumptions on attribute and predicate independence
– we need relative accuracy, not exact values.
Output size estimation: selection
• For relation R(A, B)
– S = σ_{A=a}(R)
• NCARD(S) ranges from 0 to NCARD(R) – ICARD(R,A) + 1
• consider its mean: NCARD(S) = NCARD(R) / ICARD(R,A)
– S = σ_{A<a}(R)
• NCARD(S) = NCARD(R) * (a – min(A)) / (max(A) – min(A))
• If A is not arithmetic, use a magic number
– NCARD(S) = NCARD(R) / 3
– S = σ_{b<A<a}(R)
• NCARD(S) = NCARD(R) * (a – b) / (max(A) – min(A))
• If A is not arithmetic, use a magic number
– NCARD(S) = NCARD(R) / 3
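The selection estimates above can be written as small functions. This is a sketch under the uniform-distribution assumption; the function names are hypothetical, and clamping the fraction to [0, 1] is an added safeguard that the slides do not mention.

```python
def est_equality(ncard, icard):
    """σ_{A=a}: mean output size, NCARD(R) / ICARD(R,A)."""
    return ncard / icard

def est_range(ncard, low, high, min_a, max_a):
    """σ_{low<A<high}: fraction of the value range covered, else the 1/3 magic number."""
    numeric = all(isinstance(v, (int, float)) for v in (low, high, min_a, max_a))
    if not numeric or max_a == min_a:
        return ncard / 3                      # non-arithmetic attribute: magic number
    frac = (high - low) / (max_a - min_a)
    return ncard * min(max(frac, 0.0), 1.0)   # clamp (assumption, not from the slides)

def est_less_than(ncard, a, min_a, max_a):
    """σ_{A<a}: the range from min(A) up to a."""
    return est_range(ncard, min_a, a, min_a, max_a)
```

For example, with NCARD(R) = 10,000 and A ranging over [0, 100], `est_less_than(10_000, 25, 0, 100)` covers a quarter of the range, while a string-valued attribute falls back to NCARD(R) / 3.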
Output size estimation: selection
• S = σ_{A=1 AND B<10}(R)
– multiply the selectivities: 1/ICARD(R,A) for the equality and 1/3 for the inequality
– NCARD(R) = 10,000, ICARD(R,A) = 50
– NCARD(S) = 10,000 / (50 * 3) ≈ 66
• S = σ_{A=1 OR B<10}(R)
– sum of the estimates for the two predicates minus the estimate for their conjunction
– NCARD(R) = 10,000, ICARD(R,A) = 50
– NCARD(S) = 200 + 3333 – 66 = 3467
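The AND/OR arithmetic above can be checked with a few lines. The variable names are illustrative; the numbers are the ones on the slide, and the slide rounds the intermediate values to whole tuples.

```python
NCARD_R = 10_000
ICARD_R_A = 50

sel_eq = 1 / ICARD_R_A   # selectivity of A = 1
sel_lt = 1 / 3           # magic number for B < 10

# AND: multiply selectivities (predicate independence assumption).
sel_and = sel_eq * sel_lt
# OR: sum of the selectivities minus the selectivity of the conjunction.
sel_or = sel_eq + sel_lt - sel_and

ncard_and = NCARD_R * sel_and   # about 66.7 tuples
ncard_or = NCARD_R * sel_or     # about 3466.7 tuples
```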
Output size estimation: join
• Containment of values assumption
– if ICARD(S,A) <= ICARD(R,A), the A values in S are a subset of the A values in R
– special case: A is a key in R and a foreign key in S
• Let’s assume ICARD(S,A) <= ICARD(R,A)
– Each tuple t in S joins x tuple(s) in R
– consider its mean: x = NCARD(R) / ICARD(R,A)
– NCARD(R ⋈_A S) = NCARD(R) * NCARD(S) / ICARD(R,A)
• In general:
– NCARD(R ⋈_A S) = NCARD(R) * NCARD(S) / max(ICARD(R,A), ICARD(S,A))
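A short sketch of the join-size formula above. The function name and the example statistics are assumptions made for illustration.

```python
def est_join_size(ncard_r, ncard_s, icard_r_a, icard_s_a):
    """NCARD(R ⋈_A S) = NCARD(R) * NCARD(S) / max(ICARD(R,A), ICARD(S,A))."""
    return ncard_r * ncard_s / max(icard_r_a, icard_s_a)

# Key / foreign-key case with hypothetical statistics:
# A is a key in R (1000 tuples, 1000 distinct values) and a foreign key in S (5000 tuples).
key_fk_estimate = est_join_size(1000, 5000, 1000, 800)
```

In the key / foreign-key case the estimate equals NCARD(S), which matches the intuition that each tuple of S joins exactly one tuple of R.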
Cost estimation
• System R cost model
– weighted sum of I/O and CPU cost
– cost = (page fetches) + W * (RSI calls), where W weights CPU cost against I/O cost
• Current cost formulas?
Search the plan space
• Baseline: exhaustive search
– enumerate all combinations and compare their costs
– huge space!
[Figure: three alternative join trees over R, S, T, and U]
• System-R style (Selinger style)
– dynamic programming
– bottom up construction of the plan
• start from the base tables and work up the tree to form a plan
• compute the cost of larger plans from the costs of their sub-trees
• greedily prune sub-trees that are costly (keep the locally optimal ones)
Dynamic programming
• Step 1: best plans for {R1}, {R2}, …, {Rn}
• Step 2: best plans for {R1,R2}, {R1,R3}, …, {Rn-1,Rn}
• …
• Step n: best plan for {R1, …, Rn}
Dynamic Programming
• Step 1: For each {Ri}:
– size({Ri}) = B(Ri)
– plan({Ri}) = Ri
– cost({Ri}) = cost of access to Ri
• e.g. B(Ri) if no index on Ri
• Step 2: For each {Ri, Rj}:
– size({Ri,Rj}) = estimate of the size of the join
– cost({Ri,Rj}) = cost function of the sizes of Ri and Rj
• e.g., the number of I/O accesses of the chosen join algorithm
– plan({Ri,Rj}) = the join algorithm with the smallest cost
Dynamic programming
• Step i: For each S ⊆ {R1, …, Rn} of cardinality i do:
– compute size(S)
– for every S1, S2 s.t. S = S1 ∪ S2 (S1 and S2 non-empty and disjoint):
c = cost(S1) + cost(S2) + cost(S1 ⋈ S2)
– cost(S) = the smallest c
– plan(S) = the plan that achieves cost(S)
• Return plan({R1, …, Rn})
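The inner loop of step i needs every way to split S into two non-empty, disjoint parts. A minimal sketch of that enumeration follows; the helper name `two_way_splits` is hypothetical, and treating (S1, S2) and (S2, S1) as the same split is a simplification that ignores which input is the left one.

```python
from itertools import combinations

def two_way_splits(relations):
    """Yield each unordered split of `relations` into two non-empty, disjoint subsets."""
    items = sorted(relations)
    first, rest = items[0], items[1:]
    # Pin the first relation to the left side so each split is produced exactly once.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            right = frozenset(items) - left
            if right:  # skip the degenerate split where the right side is empty
                yield left, right

splits_of_rst = list(two_way_splits("RST"))
```

A set of i relations has 2^(i-1) - 1 such splits, so the work per subset grows quickly with i.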
Dynamic programming: example
• Let’s assume that the cost of each join is the size of its intermediate results.
– to simplify the example
– other cost measures, such as the number of I/O accesses, are possible.
• cost(R) = 0 (no intermediate results)
• cost(R ⋈ S) = 0 (no intermediate results)
• cost((R ⋈ S) ⋈ T) = cost(R ⋈ S) + cost(T) + size(R ⋈ S) = size(R ⋈ S)
Dynamic programming: example
• Relations: R, S, T, U
• Number of tuples: T(R) = 2000, T(S) = 5000, T(T) = 3000, T(U) = 1000
• We use a toy size estimation method:
– size(A ⋈ B) = 0.01 * T(A) * T(B), where T(X) is the number of tuples in X
The table is filled in bottom up: pairs first, then triples, then the full query.

Query   Size    Cost    Plan
RS      100k    0       RS
RT      60k     0       RT
RU      20k     0       UR
ST      150k    0       TS
SU      50k     0       US
TU      30k     0       UT
RST     3M      60k     S(RT)
RSU     1M      20k     S(UR)
RTU     0.6M    20k     T(UR)
STU     1.5M    30k     S(UT)
RSTU    30M     110k    (US)(RT)
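The worked example can be reproduced with a short dynamic-programming sketch. It assumes the toy model from the example (cost = total size of intermediate results, size(A ⋈ B) = 0.01 * T(A) * T(B)); the function name `optimize` and the nested-tuple plan representation are choices made here for illustration, not the System R implementation.

```python
from itertools import combinations

TUPLES = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}

def optimize(tuples):
    """Bottom-up DP over subsets of relations; returns (size, cost, plan) dictionaries."""
    names = sorted(tuples)
    size, cost, plan = {}, {}, {}
    for r in names:                      # step 1: single relations
        key = frozenset([r])
        size[key], cost[key], plan[key] = tuples[r], 0, r
    for i in range(2, len(names) + 1):   # step i: subsets of cardinality i
        for subset in combinations(names, i):
            s = frozenset(subset)
            first = frozenset([subset[0]])
            # Toy estimate 0.01 * T(A) * T(B), kept in integers to avoid rounding.
            size[s] = size[first] * size[s - first] // 100
            best = None
            for k in range(1, i):
                for left in combinations(subset, k):
                    s1 = frozenset(left)
                    s2 = s - s1
                    # Only inputs that are themselves joins count as intermediate results.
                    inter = sum(size[x] for x in (s1, s2) if len(x) > 1)
                    c = cost[s1] + cost[s2] + inter
                    if best is None or c < best:
                        best = c
                        plan[s] = (plan[s1], plan[s2])
            cost[s] = best
    return size, cost, plan

size, cost, plan = optimize(TUPLES)
full = frozenset("RSTU")
```

Here `cost[full]` and `plan[full]` hold the cheapest cost and the join tree for all four relations; the search is not restricted to left-deep trees, so bushy plans such as joining two pairs first are also considered.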
Reducing search space
• The algorithm requires exponential computation!
• System-R style considers only left-deep joins
[Figure: three join trees over R, S, T, and U]
• Left-deep trees allow us to generate all fully pipelined plans
– Intermediate results not written to temporary files.
• Not all left-deep plans are fully pipelined (e.g., plans that use sort-merge join).
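To give a feel for how much the left-deep restriction shrinks the plan space, the sketch below counts join orders for n relations. It uses two standard combinatorial facts: there are n! left-deep orders, and (2n-2)!/(n-1)! binary join trees when the left/right order of inputs matters. The function names are hypothetical, and the counts ignore the choice of join algorithm.

```python
from math import factorial

def left_deep_orders(n):
    """Number of left-deep join orders over n relations: n!."""
    return factorial(n)

def all_join_trees(n):
    """Number of binary join trees over n relations, left/right order counted:
    (2n-2)! / (n-1)!."""
    return factorial(2 * n - 2) // factorial(n - 1)

for n in (4, 6, 8, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

Both counts still grow very fast with n, so the restriction narrows the space that must be searched but does not make the problem polynomial.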
Reducing search space
• System R style ignores plans with Cartesian products
– The result of a Cartesian product is generally larger than that of a (natural) join.
• Example: R(A,B), S(B,C), U(C,D)
– (R ⋈ U) ⋈ S has a Cartesian product
– pick (R ⋈ S) ⋈ U instead
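A minimal sketch of how a left-deep join order can be checked for Cartesian products, assuming natural joins on shared attribute names. The function name and the schema dictionary are illustrative, using the example relations above.

```python
SCHEMAS = {"R": {"A", "B"}, "S": {"B", "C"}, "U": {"C", "D"}}

def has_cartesian_product(order, schemas):
    """True if some relation in the left-deep order shares no attribute
    with the relations joined before it."""
    seen = set(schemas[order[0]])
    for rel in order[1:]:
        if not (seen & schemas[rel]):
            return True          # nothing to join on: this step is a Cartesian product
        seen |= schemas[rel]
    return False

bad = has_cartesian_product(["R", "U", "S"], SCHEMAS)    # (R ⋈ U) ⋈ S
good = has_cartesian_product(["R", "S", "U"], SCHEMAS)   # (R ⋈ S) ⋈ U
```

An optimizer can apply a test like this while enumerating orders and simply skip the ones that fail it.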
What you should know
• The importance of query optimization
– difference between fast and slow plans
• Query optimization problem
– find the fast plans efficiently.
• The components of a cost-based (System R style) query optimizer:
– plan space definition
– cost estimation
– search algorithm