CS 540 Database Management Systems: Query Optimization

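DBMS Architecture
[Figure: DBMS architecture diagram. Users, web forms, applications, and the DBA issue queries and transactions. Queries pass through the Query Parser, Query Rewriter, Query Optimizer, and Query Executor; transactions go to the Transaction Manager, Lock Manager, and Logging & Recovery. Below them are Files & Access Methods, the Buffer Manager, and the Storage Manager over storage; buffers and lock tables reside in main memory. The diagram marks one part as "Today's lecture" and another as "Past lectures".]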
Many query plans to execute a SQL query
• Compute the join R(A,B) ⋈ S(B,C) ⋈ T(C,D) ⋈ U(D,E)
[Figure: three different join trees over R, S, T, and U.]
• Even more plans: multiple algorithms to execute each operation
[Figure: a physical plan for the same join that uses a hash join and two sort-merge joins, with index-scans and table-scans as access methods for R, S, T, and U.]
Query optimization: picking the fastest plan
• Optimal approach
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What’s wrong?
• Rule-based optimization
– Use a set of pre-defined rules to generate a fast plan
• e.g. If there is an index over a table, use it for scan and join.
Definitions
• Statistics on table R:
– NCARD(R): Number of tuples in R
– TCARD(R): Number of blocks in R
• TCARD(R) = NCARD(R) / block size (block size measured in tuples per block)
– ICARD(R,A): Number of distinct values of attribute A in R
Definitions
• Clustered index:
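– The relation is stored on disk in the order of the index.
– Sorted relation.
[Figure: a clustered index with keys 10, 30, 50, 70, 90, 110 pointing into data blocks that store the tuples in sorted order: (10, 20), (30, 40), (50, 60), (70, 80).]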
Plans to select tuples from R: σ_{A=a}(R)
• We have a clustered index on R
• Plans:
– (Clustered) index-based scan
– Table-scan (sequential access)
• Statistics on R
– TCARD(R) = 5000, NCARD(R) = 200,000
– ICARD(R,A) = 2; one value appears in 95% of the tuples.
• Clustered index scan vs. table-scan?
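The comparison above can be made concrete with a small sketch. It assumes a common estimate for a clustered index scan, F * (NINDX + TCARD), where F is the fraction of matching tuples and NINDX is the number of index blocks. NINDX = 50 is a made-up value for illustration (the slide does not give one), and the function names are hypothetical.

```python
def table_scan_cost(tcard: int) -> float:
    """A sequential scan reads every block of the relation."""
    return float(tcard)


def clustered_index_scan_cost(fraction: float, tcard: int, nindx: int) -> float:
    """Blocks read when matching tuples are stored contiguously.

    fraction: share of the tuples that satisfy A = a
    nindx:    number of index blocks (assumed, not from the lecture)
    """
    return fraction * (nindx + tcard)


TCARD = 5000
NINDX = 50  # assumption for illustration only

full_scan = table_scan_cost(TCARD)
# What a uniform estimate 1 / ICARD(R,A) = 1/2 would predict.
uniform_guess = clustered_index_scan_cost(1 / 2, TCARD, NINDX)
# What the skewed data actually implies for each of the two values.
frequent_value = clustered_index_scan_cost(0.95, TCARD, NINDX)
rare_value = clustered_index_scan_cost(0.05, TCARD, NINDX)

for label, cost in [
    ("table scan", full_scan),
    ("index scan, uniform estimate", uniform_guess),
    ("index scan, frequent value", frequent_value),
    ("index scan, rare value", rare_value),
]:
    print(f"{label:30s} {cost:8.1f} blocks")
```

Under these assumptions the index scan for the frequent value reads almost as many blocks as the table scan, while the rare value needs only a small fraction of them. A static rule that ignores the data distribution cannot tell the two cases apart.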
Query optimization methods
• Rule-based optimizer fails
– It uses static rules
– The rules do not consider the distribution of the data.
• Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently
• Optimization itself should be fast!
Cost-based optimization
• Plan space
– which plans to consider?
– it is time consuming to explore all alternatives.
• Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like to have accurate estimation
• Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans
Space of query plans
• Selection
– algorithms: sequential, index-based
– ordering: why does it matter?
• Join
– algorithms: nested loop, sort-merge, hash
– ordering
• Ordering/ Grouping
– can an “interesting order” be produced by join/selection?
– algorithms: sorting, hash-based
Reducing plan space
• Multiple logical query plans for each SQL query
Star(name, birthdate), StarsIn(movie, name, year)
SELECT movie
FROM Star, StarsIn
WHERE Star.name = StarsIn.name AND year = 1950
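[Figure: two logical plans for the query. One applies the selection σ year=1950 on top of the join StarsIn.name = Star.name and then projects movie. The other pushes σ year=1950 down onto StarsIn before the join and then projects movie. The plan with the selection pushed down is labeled "Generally Faster".]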
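The two logical plans can be compared on a tiny in-memory example. This is only a sketch: the rows are invented for illustration, and list comprehensions stand in for the join and selection operators.

```python
# Hypothetical sample data, not from the lecture.
star = [("Ann", "1920-01-01"), ("Bob", "1925-05-05"), ("Cy", "1930-09-09")]
stars_in = [  # (movie, name, year)
    ("M1", "Ann", 1950),
    ("M2", "Bob", 1950),
    ("M3", "Bob", 1960),
    ("M4", "Cy", 1970),
]

# Plan 1: join first, then select year = 1950, then project movie.
joined = [(s, si) for si in stars_in for s in star if s[0] == si[1]]
plan1 = sorted(si[0] for s, si in joined if si[2] == 1950)

# Plan 2: select year = 1950 on StarsIn first, then join, then project movie.
filtered = [si for si in stars_in if si[2] == 1950]
plan2 = sorted(si[0] for si in filtered for s in star if s[0] == si[1])

print("rows produced by the join in plan 1:", len(joined))
print("rows fed into the join in plan 2:  ", len(filtered))
print("same answer:", plan1 == plan2)
```

Both plans return the same movies, but the second one feeds fewer StarsIn rows into the join, which is why pushing the selection down is generally faster.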
Reducing plan space
• Push selection down to reduce # of rows
• Push projection down to reduce # of columns
SELECT movie, name
FROM Star, StarsIn
WHERE Star.name = StarsIn.name
[Figure: two logical plans for the query. One applies the projection π movie, name only on top of the join StarsIn.name = Star.name. The other also pushes projections down below the join, onto its inputs.]
• Less effective than pushing down selection.
Reducing plan space
• The algorithm requires exponential computation!
• System-R style considers only left-deep joins
[Figure: three join trees of different shapes over R, S, T, and U.]
• Left-deep trees allow us to generate all fully pipelined plans
– Intermediate results are not written to temporary files.
• Not all left-deep trees are fully pipelined (e.g., with a sort-merge join).
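To get a feel for how much the left-deep restriction shrinks the search space, the sketch below counts join orders. It treats A ⋈ B and B ⋈ A as different plans and ignores the choice of join algorithm; the function names are hypothetical. A left-deep tree is fixed by the order of its n relations, so there are n! of them. The number of binary trees of any shape over n labeled relations is (2(n-1))! / (n-1)!.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join trees over n relations (one per ordering)."""
    return factorial(n)


def all_join_trees(n: int) -> int:
    """Number of binary join trees of any shape over n labeled relations."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(f"n={n:2d}  left-deep={left_deep_orders(n):>10d}  all shapes={all_join_trees(n):>14d}")
```

Both counts still grow very quickly, which is why the search algorithm matters even after the plan space has been restricted.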
Reducing plan space
• System R-style avoids plans with Cartesian products
– A Cartesian product is generally larger than a (natural) join.
• Example: R(A,B), S(B,C), U(C,D)
(R ⋈ U) ⋈ S has a Cartesian product
pick (R ⋈ S) ⋈ U instead
• If Cartesian products cannot be avoided, delay them.
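A minimal sketch of how an optimizer could detect such orders, assuming natural joins on shared attribute names and left-deep plans. The helper name has_cartesian_product is hypothetical.

```python
from itertools import permutations

# Schemas from the example: R(A,B), S(B,C), U(C,D)
schemas = {"R": {"A", "B"}, "S": {"B", "C"}, "U": {"C", "D"}}


def has_cartesian_product(order, schemas):
    """True if some relation in the left-deep order shares no attribute
    with the relations joined before it."""
    seen = set(schemas[order[0]])
    for name in order[1:]:
        if not (seen & schemas[name]):
            return True  # no common attribute: this join is a Cartesian product
        seen |= schemas[name]
    return False


# Keep only the left-deep orders that never need a Cartesian product.
safe_orders = [
    "".join(order)
    for order in permutations(schemas)
    if not has_cartesian_product(order, schemas)
]
print(safe_orders)
```

For this example, the orders that start with R and U together are rejected, which leaves four of the six left-deep orders.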
Cost estimation
• Relative accuracy
– Goal is to compare plans, not to predict exact cost
– More of an art than an exact science
• Each operator: input size, cost, output size
– estimate cost based on input size
• Example:
– the cost of a sort-merge join R ⋈ S is 3 TCARD(R) + 3 TCARD(S) block I/Os
– estimate output size (for next operator) or selectivity
• selectivity: ratio of output size to input size
Cost estimation: Selinger Style
• Input: stats on each table
– NCARD(R): Number of tuples in R
– TCARD(R): Number of blocks in R
• TCARD(R) = NCARD(R) / block size (block size measured in tuples per block)
– ICARD(R,A): Number of distinct values of attribute A in R
• Assumptions on attribute and predicate independence
• When no estimate available, use magic numbers.
• New alternative approach
– Histogram of database
Selectivity factors: selection
• Point selection: S = σ_{A=a}(R)
– NCARD(S) ranges from 0 to NCARD(R) – ICARD(R,A) + 1
– consider its mean: F = 1 / ICARD(R,A)
• Range selection: S = σ_{A<a}(R)
– F = (a – min(A)) / (max(A) – min(A))
– non-arithmetic inequality: use magic number
• F = 1/3
• Range selection: S = σ_{b<A<a}(R)
– F = (a – b) / (max(A) – min(A))
– if not arithmetic, use magic number
• F = 1/4
Selectivity factors: selection
• Set selection: column IN (set of values)
– F: union of point selections, i.e. (number of values) × 1 / ICARD(R,A)
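The selectivity factors for selections can be collected into a few helper functions. This is a sketch under the uniformity assumption behind the formulas above; the function names and sample numbers are hypothetical. The magic numbers 1/3 and 1/4 are used when the bounds are not usable for arithmetic.

```python
def point_selectivity(icard: int) -> float:
    """A = a: each distinct value is assumed equally likely."""
    return 1.0 / icard


def less_than_selectivity(a=None, lo=None, hi=None) -> float:
    """A < a: fraction of the value range [lo, hi] that lies below a."""
    if a is None or lo is None or hi is None or hi == lo:
        return 1.0 / 3.0  # magic number when no arithmetic is possible
    return min(1.0, max(0.0, (a - lo) / (hi - lo)))


def between_selectivity(b=None, a=None, lo=None, hi=None) -> float:
    """b < A < a: fraction of the value range covered by (b, a)."""
    if None in (a, b, lo, hi) or hi == lo:
        return 1.0 / 4.0  # magic number when no arithmetic is possible
    return min(1.0, max(0.0, (a - b) / (hi - lo)))


def in_set_selectivity(n_values: int, icard: int) -> float:
    """A IN (v1, ..., vn): union of n disjoint point selections."""
    return min(1.0, n_values * point_selectivity(icard))


print(point_selectivity(50))
print(less_than_selectivity(a=25, lo=0, hi=100))
print(less_than_selectivity())
print(between_selectivity(b=10, a=30, lo=0, hi=100))
print(in_set_selectivity(5, 50))
```

Multiplying a factor by NCARD(R) gives the estimated number of output tuples.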
Selectivity factors: selection
• S = σ_{A=1 AND B<10}(R)
– multiply 1/ICARD(R,A) for equality and 1/3 for inequality
– NCARD(R) = 10,000, ICARD(R,A) = 50
– NCARD(S) = 10,000 / (50 × 3) ≈ 66.7
• S = σ_{A=1 OR B<10}(R)
– sum of the selectivity factors of the predicates minus their product
– NCARD(R) = 10,000, ICARD(R,A) = 50
– NCARD(S) ≈ 200 + 3333.3 – 66.7 ≈ 3467
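The arithmetic on this slide can be checked with a short sketch. It assumes the two predicates are independent, which is what multiplying their factors implies; the function names are hypothetical.

```python
def and_selectivity(f1: float, f2: float) -> float:
    """Conjunction of independent predicates: multiply the factors."""
    return f1 * f2


def or_selectivity(f1: float, f2: float) -> float:
    """Disjunction: sum of the factors minus their product."""
    return f1 + f2 - f1 * f2


NCARD_R = 10_000
f_eq = 1 / 50  # A = 1, using 1 / ICARD(R,A)
f_lt = 1 / 3   # B < 10, using the magic number for inequalities

and_rows = NCARD_R * and_selectivity(f_eq, f_lt)  # about 66.7 tuples
or_rows = NCARD_R * or_selectivity(f_eq, f_lt)    # about 3466.7 tuples

print(f"A=1 AND B<10: {and_rows:.1f} tuples")
print(f"A=1 OR  B<10: {or_rows:.1f} tuples")
```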
Selectivity factors: join predicates
• Containment of values assumption
– if ICARD(S,A) <= ICARD(R,A), the A values in S are a subset of the A values in R
• Let’s assume ICARD(S,A) <= ICARD(R,A)
– Each tuple t in S joins x tuple(s) in R
– consider its mean: x = NCARD(R) / ICARD(R,A)
– NCARD(R ⋈A S) = NCARD(S) * NCARD(R) / ICARD(R,A)
• In general:
NCARD(R ⋈A S) = NCARD(R) * NCARD(S) / max(ICARD(R,A), ICARD(S,A))
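A minimal sketch of the join size estimate. The statistics in the usage example are made up for illustration, and join_size is a hypothetical helper name.

```python
def join_size(ncard_r: int, ncard_s: int, icard_r: int, icard_s: int) -> float:
    """Estimated NCARD(R join S on A) under the containment-of-values assumption."""
    return ncard_r * ncard_s / max(icard_r, icard_s)


# Hypothetical statistics: R has 10,000 tuples with 100 distinct A values,
# S has 2,000 tuples with 40 distinct A values.
estimate = join_size(10_000, 2_000, icard_r=100, icard_s=40)

# Same result by the per-tuple argument: each S tuple joins
# NCARD(R) / ICARD(R,A) = 100 tuples of R on average.
per_tuple = 2_000 * (10_000 / 100)

print(estimate, per_tuple)
```

The estimate is symmetric in R and S, so it does not matter which relation has more distinct values.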
Cost estimation
• System R cost model
– Sum of I/O and CPU
– #(BLOCK) + W * (RSI calls)
• W: weighting factor between I/O cost (blocks fetched) and CPU cost (RSI calls)
• Current cost formulas?
Search the plan space
• Baseline: exhaustive search
– enumerate all combinations and compare their costs
– enormous space!
[Figure: three alternative join trees over R, S, T, and U.]
• Search method parameters
– plan tree development
• construction: bottom-up, top-down
• modification: improve a somehow-connected tree
– algorithms
• heuristic selections: make choices based on heuristics
• branch and bound: search bounded by current best tree
• hill climbing: find “nearby” plans with lowest cost
• dynamic programming: construction by greedy selection
Plan search: System-R style
• A.K.A. Selinger-style optimization
• Bottom-up
– start from the base relations (in FROM)
– work up the tree to form a plan
– compute the cost of larger plans based on their sub-trees.
• Dynamic programming
– greedily remove sub-trees that are costly (useless)
Dynamic programming
• Step 1: For each {Ri}:
– size({Ri}) = TCARD(Ri)
– plan({Ri}) = Ri
– cost({Ri}) = cost of access to Ri
• e.g. TCARD(Ri) if no index on Ri
• Step 2: For each {Ri, Rj}:
– size({Ri,Rj}) = estimate of the size of the join
– cost({Ri,Rj}) = cost function of the sizes of Ri and Rj
• e.g. #I/O accesses of the chosen join algorithm
– plan({Ri,Rj}) = the join algorithm with the smallest cost
Dynamic programming
• Step i: For each S ⊆ {R1, …, Rn} of cardinality i do:
– Compute size(S)
– for every S1, S2 s.t. S = S1 ∪ S2 and S1 ∩ S2 = ∅
c = cost(S1) + cost(S2) + cost(S1 ⋈ S2)
– cost(S) = the smallest c
– plan(S) = the plan for cost(S)
• Return plan({R1, …, Rn})
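The steps above can be sketched in a few lines of Python. This is an illustration, not System R's implementation. It borrows the toy cost model of the example in this lecture (relations R, S, T, U with 2000, 5000, 3000, and 1000 tuples, join size = 0.01 × the input sizes, cost = size of the intermediate results). It tries every split of S, so it also finds bushy plans. The names best_plans, size, and intermediate are hypothetical.

```python
from itertools import combinations
from math import prod

TUPLES = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}


def size(rels) -> int:
    """Toy estimate: every join multiplies the input sizes and scales by 0.01."""
    return prod(TUPLES[r] for r in rels) // 100 ** (len(rels) - 1)


def intermediate(rels) -> int:
    """A base relation is not an intermediate result, so it adds no cost."""
    return size(rels) if len(rels) > 1 else 0


def best_plans(names):
    cost, plan = {}, {}
    for r in names:  # Step 1: single relations
        cost[frozenset([r])], plan[frozenset([r])] = 0, r
    for k in range(2, len(names) + 1):  # Step k: subsets of cardinality k
        for combo in combinations(names, k):
            s = frozenset(combo)
            first, rest = combo[0], combo[1:]
            best = None
            # Keep `first` in S1 so that each split {S1, S2} is tried once.
            for m in range(len(rest)):
                for extra in combinations(rest, m):
                    s1 = frozenset((first,) + extra)
                    s2 = s - s1
                    c = cost[s1] + cost[s2] + intermediate(s1) + intermediate(s2)
                    if best is None or c < best[0]:
                        best = (c, (plan[s1], plan[s2]))
            cost[s], plan[s] = best
    return cost, plan


cost, plan = best_plans(["R", "S", "T", "U"])
full = frozenset("RSTU")
print("size:", size(full), "cost:", cost[full], "plan:", plan[full])
```

Each subset stores only its cheapest plan, so larger subsets reuse the results of smaller ones instead of re-enumerating whole trees.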
Dynamic programming: example
• Let’s assume that the cost of each join is the size of
its intermediate results.
– to simplify the example
– other cost measures, e.g. #I/O accesses, are possible.
• cost(R) = 0 (no intermediate results)
• cost(R ⋈ S) = 0 (no intermediate results)
• cost((R ⋈ S) ⋈ T)
= cost(R ⋈ S) + cost(T) + size(R ⋈ S)
= size(R ⋈ S)
Dynamic programming: example
• Relations: R, S, T, U
• Number of tuples: 2000, 5000, 3000, 1000
• We use a toy size estimation method:
– size(A ⋈ B) = 0.01 * T(A) * T(B), where T(X) is the number of tuples in X
Query   Size    Cost    Plan
RS      100k    0       RS
RT      60k     0       RT
RU      20k     0       UR
ST      150k    0       TS
SU      50k     0       US
TU      30k     0       UT
RST     3M      60k     S(RT)
RSU     1M      20k     S(UR)
RTU     0.6M    20k     T(UR)
STU     1.5M    30k     S(UT)
RSTU    30M     110k    (US)(RT)
Plan search: all operations
• Base relations access
– find all plans for accessing each base relation
– push down selections and projections
– choose good plans, discard bad ones
• keep the cheapest plan for unordered output and for each interesting order
• Join ordering
– use the bottom-up dynamic programming
– consider only left-deep join trees: n! orderings for n tables
– postpone Cartesian products
• Finally: grouping/ordering
– use interesting orders
– additional sorting
Nested subqueries
• Subqueries are optimized separately
• Correlation: order of evaluation
– uncorrelated queries
• nested subqueries do not reference outer subqueries
• evaluate the most deeply nested subquery first
– correlated queries: nested subqueries reference the outer subqueries
Select name
From employee X
Where salary > (Select salary
From employee
Where employee_num = X.manager)
Nested subqueries – cont.
• correlated queries: nested subqueries reference the outer subqueries
Select name
From employee X
Where salary > (Select salary
From employee
Where employee_num = X.manager)
• The nested subquery is evaluated once for each tuple in the outer query.
• If there is a small number of distinct values in the outer relation, it is worth sorting the tuples.
– reduces the number of evaluations of the nested query.
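The effect of sorting on the correlated column can be sketched as follows. The employee rows are invented for illustration, and manager_salary stands in for the nested subquery; a counter shows how many times it runs.

```python
# Hypothetical data: (employee_num, name, salary, manager)
employees = [
    (1, "Ann", 90, None),
    (2, "Bob", 95, 1),
    (3, "Cy", 60, 1),
    (4, "Dee", 70, 2),
    (5, "Eli", 99, 2),
]

inner_calls = 0


def manager_salary(manager_num):
    """The nested subquery: salary of the employee whose number is manager_num."""
    global inner_calls
    inner_calls += 1
    for num, _, salary, _ in employees:
        if num == manager_num:
            return salary
    return None  # no such employee, like an empty subquery result


def names_earning_more_than_manager():
    result = []
    cached_key, cached_salary = object(), None
    # Sorting on the correlated column puts equal manager values next to
    # each other, so the subquery result can be reused within a group.
    by_manager = sorted(employees, key=lambda e: (e[3] is None, e[3] or 0))
    for _, name, salary, manager in by_manager:
        if manager != cached_key:
            cached_key, cached_salary = manager, manager_salary(manager)
        if cached_salary is not None and salary > cached_salary:
            result.append(name)
    return result


names = names_earning_more_than_manager()
print(names, "subquery evaluations:", inner_calls, "outer tuples:", len(employees))
```

With the sort, the subquery runs once per distinct manager value rather than once per employee.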
Summary: optimization
• Plan space
– Huge number of alternatives, semantically equivalent
• Why important
– Difference between good/bad plans could be orders of magnitude
• Ideal goal
– map a declarative query to the most efficient plan
• Conventional wisdom: at least avoid bad plans
State of the art
• Academic: always a core database research topic
– Optimizing for interactive querying
– Optimizing for novel parallel frameworks
• Industry: most optimizers use System-R style
– They started with rule-based.
• Oracle 7 and its prior versions used rule-based optimization
• Oracle 7 – 10: rule-based and cost-based
• Oracle 10g (2003): cost-based
What you should know
• The importance of query optimization
– difference between fast and slow plans
• Query optimization problem
– find the fast plans efficiently.
• The components of a cost-based (System R style) query optimizer:
– plan space definition
– cost estimation
– search algorithm
Carry away messages
• Hard problems can be solved practically!
• Many times, our goal is to find “good enough” solutions.