Query Optimization R&G, Chapter 15 Lecture 17

Administrivia
• Homework 3 available from class website
– Due date: Tuesday, March 20 by end of class period
• Homework 4 available today
– Implement nested loops and hash join operators for (new!) minibase
– Due date: April 10 (after Spring Break)
• Midterm 2 is 3/22, 2 weeks from today
– In class, covers lectures 10-17
– Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall
• Internships at Google this summer…
– See http://www.postgresql.org/developer/summerofcode
– Booth at the UCB TechExpo this Thursday: http://csba.berkeley.edu/tech_expo.html
– Contact josh@postgresql.org
Review
[Figure: DBMS layers, top to bottom: Query Optimization and Execution (we are here), Relational Operators, Files and Access Methods, Buffer Management, Disk Space Management, DB]
• A query plan is a tree of operators that computes the result of a query
• Optimization is the process of picking the best plan
• Execution is the process of executing the plan
Query Plans: turning text into tuples
Query:
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid AND
(A.species = 'Big Cat' or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1
Query Result:
Name          Shift
Aslan         3
Bageera       3
Elsa          3
Shere Khan    3
Tigger        2
[Figure: query plan (a tree of operators) over sample rows of the Animals and Feeding tables]
Query Optimization steps
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid AND
(A.species = 'Big Cat' or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1
1. Parse query from text to ‘intermediate model’
2. Traverse ‘intermediate model’ and produce alternative query plans
– Query plan = relational algebra tree
– Plan cost = cumulative cost of tree
– Consider equivalent but alternative relational algebra trees
– Optimizer keeps track of cost and properties of plans
3. Pick the cheapest plan
[Figure: query parser -> query blocks (Block 1, Block 2, Block 3) -> query optimizer -> candidate plans (Cost = 200, Cost = 150, Cost = 500) -> to execution engine]
Optimized query is 124x cheaper than the original!
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
Original plan (500,500 IOs), top to bottom:
π sname (On-the-fly)
  σ bid=100 ∧ rating > 5 (On-the-fly)
    ⋈ sid=sid (Page-Oriented Nested loops)
      Sailors, Reserves
Optimized plan (4010 IOs), top to bottom:
π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ rating>5 (On-the-fly) over Sailors
    σ bid=100 (Scan & Write to temp T2) over Reserves
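The two I/O totals can be reproduced with a little arithmetic. What follows is a minimal sketch, not the slide's own derivation. It assumes the page counts from the Schema for Examples slide (Reserves: 1000 pages, 100 distinct bids; Sailors: 500 pages, 10 ratings), Sailors as the outer relation, and uniformly distributed values. The breakdown of the 4010 figure is inferred from those numbers.

```python
# Page counts and value counts assumed from the lecture's example schema.
RESERVES_PAGES = 1000
SAILORS_PAGES = 500
DISTINCT_BIDS = 100
DISTINCT_RATINGS = 10


def page_nested_loops(outer_pages, inner_pages):
    """Page-oriented nested loops: read the outer once, scan the inner per outer page."""
    return outer_pages + outer_pages * inner_pages


# Original plan: join first (Sailors as outer), selections applied on the fly.
original_cost = page_nested_loops(SAILORS_PAGES, RESERVES_PAGES)

# Optimized plan: both selections are pushed below the join.
t2_pages = RESERVES_PAGES // DISTINCT_BIDS            # bid=100 keeps 1/100 of Reserves
build_t2 = RESERVES_PAGES + t2_pages                  # scan Reserves, write temp T2
outer_pages = SAILORS_PAGES * 5 // DISTINCT_RATINGS   # rating>5 keeps 5 of 10 ratings
optimized_cost = build_t2 + SAILORS_PAGES + outer_pages * t2_pages

print(original_cost, optimized_cost, original_cost // optimized_cost)
```

Under these assumptions the selection on Sailors costs nothing extra because it is applied while the outer relation is being scanned; only the filtered Reserves tuples are materialized.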
System R-Style Optimization
• Impact:
– Most widely used currently; works well for < 10 joins.
• Cost estimation:
– Plan cost = cumulative cost of plan tree
– Cost = function(I/O, CPU) costs
– Very inexact, but works ok in practice.
– Statistics, maintained in system catalogs, used to estimate
cost of operations and result sizes.
• Plan Space: Too large, must be pruned.
– Many plans share common, “overpriced” subtrees
• ignore them all!
– Only left-deep plans are considered.
• May cause optimizer to miss good plans
– Avoid Cartesian products.
Query Optimization steps
1. Parse query from text to ‘intermediate model’
-> A list of query blocks
2. For each query block, pick the best plan
– Convert block to tree of relational algebra operators
– Build alternative plans bottom up:
• Leaf nodes represent access paths to relations
• Consider reordering relational algebra tree
• Push selections and projections down
• Consider left-deep join plans
• Avoid cross products
• Consider all possible join methods
• Prune expensive plans with the same properties
[Figure: for Block 1, candidate joins over the Rel 1 and Rel 2 access paths, with pruned plans marked X. Candidates shown: BNL (Cost = 200, no sorted order), BNL (Cost = 150, no sorted order), HJ (Cost = 125, no sorted order), SM (Cost = 300, sorted by bid)]
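Pruning "expensive plans with the same properties" can be sketched as keeping one cheapest plan per output order. This is an illustration only: the `Plan` class and `prune` function are hypothetical names, and the costs simply mirror the labels in the figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    method: str            # join method, e.g. "BNL", "HJ", "SM"
    cost: int              # estimated cost
    order: Optional[str]   # sort order of the output, or None if unsorted


def prune(plans):
    """Keep only the cheapest plan for each distinct output order."""
    best = {}
    for plan in plans:
        kept = best.get(plan.order)
        if kept is None or plan.cost < kept.cost:
            best[plan.order] = plan
    return sorted(best.values(), key=lambda p: p.cost)


candidates = [
    Plan("BNL", 200, None),
    Plan("BNL", 150, None),
    Plan("HJ", 125, None),
    Plan("SM", 300, "bid"),
]
survivors = prune(candidates)
for plan in survivors:
    print(plan.method, plan.cost, plan.order)
```

The sort-merge plan survives even though it is the most expensive, because it is the only one whose output is sorted by bid, and that order may save a sort later.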
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000
pages. 100 distinct bids.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
10 ratings, 40,000 sids.
Translating SQL to Relational Algebra
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
AND S.rating > 5
GROUP BY S.sid
HAVING COUNT (*) >= 2
For each sailor with a rating > 5 that has reserved
at least 2 red boats, find the sailor id and the
earliest date on which the sailor has a reservation
for a red boat.
Translating SQL to Relational Algebra
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
AND S.rating > 5
GROUP BY S.sid
HAVING COUNT (*) >= 2
Relational algebra tree, top to bottom:
π S.sid, MIN(R.day)
  HAVING COUNT(*) >= 2
    GROUP BY S.sid
      σ B.color = “red” ∧ S.rating > 5
        (Sailors ⋈ Reserves) ⋈ Boats
Relational Algebra Equivalences
• Allow us to choose different join orders and to ‘push’ selections and projections ahead of joins.
• Selections can be cascaded:
$\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots(\sigma_{c_n}(R))\dots)$
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid
AND B.color = “red” AND S.rating > 5
GROUP BY S.sid
HAVING COUNT (*) >= 2
• Can apply these predicates separately
• Can ‘push’ S.rating > 5 down to Sailors
Relational algebra tree, top to bottom:
π S.sid, MIN(R.day)
  HAVING COUNT(*) >= 2
    GROUP BY S.sid
      σ B.color = “red”
        (σ S.rating > 5 (Sailors) ⋈ Reserves) ⋈ Boats
Relational Algebra Equivalences
• Selections can be commuted:
$\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid
AND B.color = “red” AND S.rating > 5
GROUP BY S.sid
HAVING COUNT (*) >= 2
• Can apply these predicates in different order
Relational algebra tree, top to bottom:
π S.sid, MIN(R.day)
  HAVING COUNT(*) >= 2
    GROUP BY S.sid
      σ S.rating > 5
        (Sailors ⋈ Reserves) ⋈ σ B.color = “red” (Boats)
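The cascade and commute rules for selections can be checked on a small in-memory example. This is a minimal sketch: `select` is a hypothetical helper standing in for the σ operator, and the rows are made-up data, not from the lecture.

```python
def select(pred, rows):
    """Relational selection: keep the rows that satisfy pred."""
    return [row for row in rows if pred(row)]


# Hypothetical joined tuples; the values are made up for illustration.
rows = [
    {"sid": 1, "rating": 7, "color": "red"},
    {"sid": 2, "rating": 3, "color": "red"},
    {"sid": 3, "rating": 9, "color": "green"},
    {"sid": 4, "rating": 8, "color": "red"},
]


def is_red(row):
    return row["color"] == "red"


def high_rating(row):
    return row["rating"] > 5


# One selection with a conjunction, versus two cascaded selections in either order.
combined = select(lambda r: is_red(r) and high_rating(r), rows)
cascaded = select(is_red, select(high_rating, rows))
commuted = select(high_rating, select(is_red, rows))

print([r["sid"] for r in combined])
print(combined == cascaded == commuted)
```

All three forms return the same tuples; the optimizer is free to choose whichever order (and placement in the tree) is cheapest.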
Relational Algebra Equivalences
• Projections can be cascaded:
$\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_1, \dots, a_n}(R))\dots)$
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid
AND B.color = “red” AND S.rating > 5
GROUP BY S.sid
HAVING COUNT (*) >= 2
• Can project S.sid to reduce size of tuples
Relational algebra tree, top to bottom:
π S.sid, MIN(R.day)
  HAVING COUNT(*) >= 2
    GROUP BY S.sid
      σ B.color = “red”
        (π S.sid (σ S.rating > 5 (Sailors)) ⋈ Reserves) ⋈ Boats
Relational Algebra Equivalences
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid
AND B.color = “red” AND S.rating > 5
GROUP BY S.sid
HAVING COUNT (*) >= 2
• Eager projection
– Can cascade and “push” some projections through selection
– Can cascade and “push” some projections below one side of a join
– Rule of thumb: can project anything not needed “downstream”
Relational algebra tree, top to bottom (?? marks a candidate projection):
π S.sid, MIN(R.day)
  HAVING COUNT(*) >= 2
    GROUP BY S.sid
      π S.sid, R.day ??
        σ B.color = “red”
          join of:
            π S.sid ?? (σ S.rating > 5 (Sailors))
            π R.sid, R.bid, R.day ?? (Reserves)
            π B.bid, B.color ?? (Boats)
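The "not needed downstream" rule of thumb can be sketched by collecting, per relation, the columns that some remaining clause still refers to. This is an illustration under assumptions: the clause sets below were written by hand from the example query, and `needed_columns` is a hypothetical helper.

```python
# Attributes referenced by each clause of the example query, qualified by alias.
select_list = {"S.sid", "R.day"}
join_preds = {"S.sid", "R.sid", "R.bid", "B.bid"}
selection_preds = {"S.rating", "B.color"}
group_by = {"S.sid"}


def needed_columns(alias, *clauses):
    """Columns of one relation that at least one of the given clauses refers to."""
    referenced = set().union(*clauses)
    return sorted(col for col in referenced if col.startswith(alias + "."))


# At the leaves, every clause is still downstream of the scan.
all_clauses = (select_list, join_preds, selection_preds, group_by)
at_scan = {alias: needed_columns(alias, *all_clauses) for alias in ("S", "R", "B")}

# Once a relation's own selection has run, its selection-only columns can be dropped.
after_selection = {
    alias: needed_columns(alias, select_list, join_preds, group_by)
    for alias in ("S", "B")
}

print(at_scan)
print(after_selection)
```

In the slide's tree the selection on B.color sits above the join, so B.color has to survive the projection over Boats; only S.rating can be dropped early, right after the selection on Sailors.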
Relational Algebra Equivalences
• Cartesian Product and Joins
– $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (associative)
– $R \bowtie S \equiv S \bowtie R$ (commutative)
– This means we can do joins in any order.
• But…beware of cartesian product!
– Only consider R X S if there is a join predicate between R & S.
Select S.sid, R.bid, B.bid
From Sailors S, Reserves R, Boats B
Where S.sid = R.sid and R.bid = B.bid
(No need to consider plans with S X B.)
What about this query?
Select S.sid, B.bid
From Sailors S, Reserves R, Boats B
Where S.sid = R.sid
(User asked for Cartesian product so you have to compute it!)
Cost Estimation
• For each plan considered, must estimate total cost:
– Must estimate cost of each operation in plan tree.
– Plan cost = cumulative cost of the operators
– Operator cost is computed from:
• Input cardinalities.
• Cost of each operator
– e.g. (sequential scan, index scan, joins, etc.)
– Must estimate size of result for each operation in tree!
• It will contribute to the input cardinality for the next operator up the tree!
• Computed from cardinality of input relations + selectivity of the predicates.
• For selections and joins, assume predicates are independent.
– Q: Is “cost” the same as estimated “run time”?
Statistics and Catalogs
• Optimizer needs statistics about relations and their indexes to
compute cardinality and selectivity estimates.
• System Catalogs contain metadata about tables and relations
– Just another set of relations; you can query them like any other
table!
• For optimizer, they typically contain:
– # tuples (cardinality) and # pages (NPages) per rel’n.
– # distinct key values (NKeys) for each index.
– low/high key values (Low/High) for each index.
– Index height (IHeight) for each tree index.
– # index pages (INPages) for each index.
• Statistics must be kept up to date
– Updating whenever data changes is too expensive; lots of
approximation anyway, so slight inconsistency ok.
– Updated periodically
• E.g. DB2: runstats on table feeding
– Statistics out of date -> optimizer choosing very bad plans!
The optimizer relies on good statistics
Table Stats (catalog contents):
Name       NumPages
Sailors    100 (previously 1)
Reserves   1
T1: Create table Sailors(…)
T2: Create table Reserves(…)
T3: Runstats on table Sailors
T4: Runstats on table Reserves
T5: Insert 100 pages of Sailors into Sailors
T6: Runstats on Sailors
T7: Insert 1000 pages of Reservations into Reserves
T8: Run Query:
SELECT S.sname, R.bid
FROM Sailors S, Reserves R
WHERE S.sid = R.sid
• Assume 7 pages of memory and only BNL
BNLJ cost estimates from the catalog:
Sailors X Reserves: $100 + \lceil 100/5 \rceil \times 1 = 120$
Reserves X Sailors: $1 + \lceil 1/5 \rceil \times 100 = 101$ (Optimizer would incorrectly pick this plan!)
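The effect of the stale catalog entry can be reproduced with a short sketch. It assumes the usual block nested loops cost formula with B - 2 = 5 pages of the outer held per block (7 buffer pages, one reserved for the inner and one for output); `bnl_cost` and `pick_outer` are hypothetical names.

```python
from math import ceil


def bnl_cost(outer_pages, inner_pages, buffer_pages=7):
    """Block nested loops: read the outer once, scan the inner once per outer block."""
    block = buffer_pages - 2  # one page for the inner input, one for output
    return outer_pages + ceil(outer_pages / block) * inner_pages


def pick_outer(stats):
    """Return the outer relation with the lower estimated cost, plus both costs."""
    s, r = stats["Sailors"], stats["Reserves"]
    costs = {"Sailors": bnl_cost(s, r), "Reserves": bnl_cost(r, s)}
    return min(costs, key=costs.get), costs


stale = {"Sailors": 100, "Reserves": 1}      # what the catalog says
actual = {"Sailors": 100, "Reserves": 1000}  # what is really on disk

chosen, estimated = pick_outer(stale)
_, true_costs = pick_outer(actual)

print(chosen, estimated)
print(true_costs)
```

With the stale statistics the optimizer believes Reserves as the outer costs 101 I/Os; with the real page counts that same join order costs 21,000 I/Os, more than the 20,100 of the order it rejected.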
Size Estimation and Reduction Factors
• Consider a query block:
SELECT attribute list
FROM relation list
WHERE pred1 AND ... AND predk
• Maximum Cardinality = product of the cardinalities of relations
in the FROM clause.
• Reduction factor (RF) associated with each predicate reflects
the impact of the predicate in reducing result size.
– Result cardinality = Max # tuples * product of all RF’s.
• RF usually called “selectivity”
– only R&G seem to call it Reduction Factor
Result Size Estimation
• Result cardinality =
Max # tuples * product of all RF’s.
• Term col=value (given index I on col, or
knowledge about # column values)
RF = 1/NKeys(I)
• Term col1=col2 (This is handy for joins too…)
RF = 1/MAX(NKeys(I1), NKeys(I2))
• Term col>value
RF = (High(I)-value)/(High(I)-Low(I))
(Implicit assumptions: values are uniformly
distributed and predicates are independent!)
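The three reduction-factor formulas can be combined into a result-size estimate. The sketch below applies them to the earlier Reserves/Sailors query (R.sid = S.sid AND R.bid = 100 AND S.rating > 5). It assumes the example schema's sizes, 40,000 distinct sid values on both sides, and ratings ranging from 1 to 10; the function names are hypothetical, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def rf_col_eq_value(nkeys):
    """col = value: one of NKeys equally likely values."""
    return Fraction(1, nkeys)


def rf_col_eq_col(nkeys1, nkeys2):
    """col1 = col2: 1 / MAX(NKeys(I1), NKeys(I2))."""
    return Fraction(1, max(nkeys1, nkeys2))


def rf_col_gt_value(value, low, high):
    """col > value: fraction of the [low, high] range above value."""
    return Fraction(high - value, high - low)


def result_cardinality(max_tuples, reduction_factors):
    """Max # tuples times the product of all reduction factors."""
    card = Fraction(max_tuples)
    for rf in reduction_factors:
        card *= rf
    return card


reserves_tuples = 100 * 1000  # 100 tuples per page, 1000 pages
sailors_tuples = 80 * 500     # 80 tuples per page, 500 pages

rfs = [
    rf_col_eq_col(40_000, 40_000),  # R.sid = S.sid (NKeys for R.sid is an assumption)
    rf_col_eq_value(100),           # R.bid = 100
    rf_col_gt_value(5, 1, 10),      # S.rating > 5 (range 1..10 is an assumption)
]
estimate = result_cardinality(reserves_tuples * sailors_tuples, rfs)
print(float(estimate))
```

Multiplying the factors together is exactly where the independence assumption enters: correlated predicates would make the true result larger or smaller than this estimate.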
• Question: What if the optimizer has no statistics?
No statistics is a common case
•RF = 1/NKeys(I) for a column with an index
• What about a non-index column,
e.g. R.day = ’01/31/2007’?
•The original System R paper suggested 1/10 for this case,
and many systems today still use that!
•Here is what POSTGRES does:
/* default selectivity estimate for equalities such as "A = b" */
#define DEFAULT_EQ_SEL 0.005
/* default selectivity estimate for inequalities such as "A < b" */
#define DEFAULT_INEQ_SEL 0.3333333333333333
/* default selectivity estimate for range inequalities "A > b AND A < c" */
#define DEFAULT_RANGE_INEQ_SEL 0.005
/* default selectivity estimate for pattern-match operators such as LIKE */
#define DEFAULT_MATCH_SEL 0.005
/* default number of distinct values in a table */
#define DEFAULT_NUM_DISTINCT 200
/* default selectivity estimate for boolean and null test nodes */
#define DEFAULT_UNK_SEL 0.005
#define DEFAULT_NOT_UNK_SEL (1.0 - DEFAULT_UNK_SEL)
Estimating Join Cardinality
• Q: Given a join of R and S, what is the range of possible
result sizes (in #of tuples)?
– Suppose the join is on a key for R and S
• e.g. Students(sid, sname, did), Dorms(did, address)
Select S.sid, D.address
From Students S, Dorms D
Where S.did = D.did
What is the cardinality?
A student can only live in at most 1 dorm
-> each S tuple can match with at most 1 D tuple
-> cardinality (S join D) = cardinality of S
Estimating Join Cardinality
• General case: join on {A} ({A} is key for neither)
– estimate that each tuple r of R generates a uniform number of matches in S and each tuple s of S generates a uniform number of matches in R, e.g.
Estimated result size = min(NTuples(R) * NTuples(S)/NKeys(A,S),
NTuples(S) * NTuples(R)/NKeys(A,R))
Sailors: 100 tuples, 75 unique names -> about 1.3 tuples for each sailor name
Boats: 20 tuples, 10 unique names -> 2 tuples for each boat name
e.g. SELECT S.name, B.name
FROM Sailors S, Boats B
WHERE S.name = B.name
Estimate using Boats names = 100*20/10 = 200
Estimate using Sailors names = 20*100/75 ≈ 26.7
Estimated result size = min(200, 26.7) ≈ 26.7 tuples
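The general-case estimate can be written as a small function. This is a minimal sketch with a hypothetical function name, applied to the Sailors/Boats numbers from the slide.

```python
def join_size_estimate(ntuples_r, nkeys_r, ntuples_s, nkeys_s):
    """Estimated join result size when the join column is a key for neither input."""
    via_s = ntuples_r * ntuples_s / nkeys_s  # each R tuple matches NTuples(S)/NKeys(A,S) tuples
    via_r = ntuples_s * ntuples_r / nkeys_r  # each S tuple matches NTuples(R)/NKeys(A,R) tuples
    return min(via_s, via_r)


# Sailors: 100 tuples, 75 distinct names; Boats: 20 tuples, 10 distinct names.
estimate = join_size_estimate(100, 75, 20, 10)
print(round(estimate, 1))
```

Taking the minimum of the two quotients is the same as dividing the cross-product size by the larger NKeys value, which matches the RF = 1/MAX(NKeys(I1), NKeys(I2)) rule for col1 = col2.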
Plan Enumeration
• A heuristic decision in System R:
only left-deep join trees are considered.
– As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
– Left-deep trees allow us to generate all fully pipelined plans.
• Intermediate results not written to temporary files.
• Not all left-deep trees are fully pipelined (e.g., SM join).
[Figure: example join trees over relations A, B, C, and D]
Enumeration of Left-Deep Plans
• Left-deep plans differ:
– Order of relations
– Access method for each relation
– Join method for each join
• Enumerated using N passes (if N relations joined):
– Pass 1: Find best 1-relation plan for each relation.
– Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
– Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.)
• For each subset of relations, retain only:
– Cheapest plan overall, plus
– Cheapest plan for each interesting order of the tuples.
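The N-pass enumeration can be sketched as dynamic programming over subsets of relations. This is a simplified illustration, not System R itself: the cardinalities, selectivities, and the toy nested-loops cost formula are all made up, only one access path and one join method are considered, and interesting orders are ignored, so a single plan is kept per subset.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics: tuples per relation and selectivity per join predicate.
card = {"A": 1000, "B": 100, "C": 10}
join_sel = {frozenset("AB"): Fraction(1, 100), frozenset("BC"): Fraction(1, 10)}


def connected(left, rel):
    """True if some join predicate links rel to a relation already in left."""
    return any(frozenset((l, rel)) in join_sel for l in left)


def out_card(left_size, left, rel):
    """Output size: product of the input sizes times the applicable selectivities."""
    size = left_size * card[rel]
    for l in left:
        size *= join_sel.get(frozenset((l, rel)), 1)
    return size


def best_left_deep(relations):
    # Pass 1: best 1-relation plan (here simply a scan that costs its cardinality).
    best = {frozenset([r]): (card[r], card[r], r) for r in relations}
    # Pass k: extend a best (k-1)-relation plan, as the outer, with one more relation.
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for rel in subset:
                left = subset - {rel}
                if left not in best or not connected(left, rel):
                    continue  # no plan for the outer, or a Cartesian product
                l_cost, l_size, l_plan = best[left]
                cost = l_cost + l_size * card[rel]  # toy nested-loops cost
                if subset not in best or cost < best[subset][0]:
                    size = out_card(l_size, left, rel)
                    best[subset] = (cost, size, f"({l_plan} JOIN {rel})")
    return best[frozenset(relations)]


cost, size, plan = best_left_deep(["A", "B", "C"])
print(plan, cost, size)
```

A fuller version would key the table on (subset, interesting order) and try every access path and join method in each pass, which is where most of the work in a real optimizer goes.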
A Note on “Interesting Orders”
• An intermediate result has an
“interesting order” if it is sorted
by any of:
– ORDER BY attributes
– GROUP BY attributes
– Join attributes of yet-to-be-added (downstream)
joins
Enumeration of Plans (Contd.)
• Avoid Cartesian products!
– An N-1 way plan is not combined with an additional
relation unless there is a join condition between them,
or all predicates in WHERE have been used up.
– i.e., consider a Cartesian product only if the user
specified one!
• ORDER BY, GROUP BY, aggregates
– handled as a final step, using either an ‘interestingly ordered’ plan or an additional sort/hash operator.
• In spite of pruning plan space, System R approach is
still exponential in the # of tables.