Database Systems (資料庫系統) December 13, 2004 Chapter 15

By Hao-hua Chu (朱浩華)
1
Announcement
• Assignment #9 is due next Thursday.
• Read chapter 16 for next lecture.
2
Cool Ubicomp Project
Hyperdragging (Sony, 1999)
• Not enough working surface on your computer screen…
• No shared working surface when collaborating with other people…
3
A “Typical” Query Optimizer
Chapter 15
4
How does a query optimizer work
in general?
• Decompose a SQL query (can have nested queries) into
query blocks (without nested queries).
• Translate each query block into relational algebra
expressions.
• Optimize the relational algebra expression, one query
block at a time:
– Enumerate a subset of possible evaluation plans (also called
exploring the plan space)
– Estimate the cost (disk I/Os) of each explored plan using system
catalogs and statistics
– Pick the plan with the least estimated cost
5
Decompose a Query into Query
Blocks (Example)
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid=R.sid AND R.bid=B.bid AND B.color=‘red’
AND S.rating = (SELECT MAX(S2.rating)
FROM Sailors S2)
GROUP BY S.sid
HAVING COUNT(*) > 1
Outer block:
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid=R.sid AND R.bid=B.bid
AND B.color=‘red’
AND S.rating = (reference to the nested block)
GROUP BY S.sid
HAVING COUNT(*) > 1
Nested block:
SELECT MAX(S2.rating)
FROM Sailors S2
6
Optimizing A Query Block
• Query blocks are optimized one block at a time.
– Nested blocks are usually treated as calls to a subroutine, made
once per tuple in the outer block.
• This is an over-simplification, but good enough for now.
• To estimate I/O cost, the optimizer estimates the size of
(intermediate) results, using:
– System catalog information about the lengths of (projected) fields
– Statistics about referenced relations (file size & # tuples)
– Available access methods (indexes & selection conditions) for
each relation in the FROM clause
7
Translate Query Block into
Relational Algebra Expr (1)
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid=R.sid AND R.bid=B.bid
AND B.color=‘red’
AND S.rating = Reference to the nested block
GROUP BY S.sid
HAVING COUNT(*) > 1
• Assume that GROUP BY and HAVING are also operators.
π S.sid, MIN(R.day) (
HAVING COUNT(*)>1 (
GROUP BY S.sid (
σ S.sid=R.sid AND R.bid=B.bid
AND B.color=‘red’
AND S.rating=value_from_nested_block (
Sailors × Reserves × Boats))))
8
Translate Query
Block (2)
• Try to simplify the query block further into a σπ× expression.
– Why do this? It is easier to find equivalent σπ× expressions
(alternative plans), and they may have cheaper costs.
• How about GROUP BY and HAVING?
– They are carried out on the result of the σπ× expression.
– Add attributes specified in GROUP BY and HAVING
into the projection list.
Before:
π S.sid, MIN(R.day) (
HAVING COUNT(*)>1 (
GROUP BY S.sid (
σ S.sid=R.sid AND R.bid=B.bid
AND B.color=‘red’
AND S.rating=value_from_nested_block (
Sailors × Reserves × Boats))))
After (the inner π ... σ ... × part is the σπ× expression):
π S.sid, MIN(R.day) (
HAVING COUNT(*)>1 (
GROUP BY S.sid (
π S.sid, R.day (
σ S.sid=R.sid
AND R.bid=B.bid
AND B.color=‘red’
AND S.rating=value_from_nested_block (
Sailors × Reserves × Boats)))))
9
Relational Algebra Tree
• Represent a plan, which is a relational algebra (RA)
expression, as a RA tree.
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
RA tree (root first, each operator's input indented below it):
π sname
  σ bid=100 ∧ rating>5
    ⋈ sid=sid
      Reserves
      Sailors
10
Estimate Cost of a Plan
• For each enumerated plan, estimate its cost:
– Each node in the tree involves a relational operator. We must
estimate the cost of evaluating a relational operator.
• Size of inputs (#pages), table statistics (selection conditions),
available indexes, and chosen algorithms for evaluating operators
(Chapter 14).
– For each node in the tree, we need to estimate the size of the
results and whether the results are sorted or not.
• Since the results are inputs to the upper node, they are used to
estimate the cost of the upper node (operator).
11
Notes on Query Optimizer
• Cost estimation is only an approximation.
– Consider the cost of Disk I/Os.
• Plan Space:
– Too large -> too many possible plans to enumerate and too
expensive to estimate the costs for all of them, must be restricted.
– Consider only the space of left-deep plans. Why?
• Left-deep plans allow output of each operator to be pipelined into the
next (parent) operator without physically storing it in a temporary
relation.
• Avoid cartesian products.
12
Left Deep Join Trees
• Restrict the plan search space to only left-deep join trees.
– As the number of joins increases, the number of alternative plans
grows rapidly; we need to restrict the search space.
– Left-deep trees allow us to generate all fully pipelined plans (if we
so choose).
• Intermediate results not written to temporary files.
• Not all left-deep trees are fully pipelined, depending on the
choice of join algorithm (e.g., Sort-Merge join).
[Figure: example join trees over relations A, B, C, and D]
13
Estimating Result Sizes
SELECT attribute list
FROM
relation list
WHERE term_1 AND term_2 AND term_3 … AND term_n
• How to estimate the size of result by an operator on given
inputs?
– Use information from system catalogs and statistics.
– For each term, find the tuple reduction factor (# expected qualifying
tuples / # expected input tuples).
Very rough estimation of reduction factors:
Column = value: 1 / NKeys(I) if there is an index I on the column; otherwise 10%.
Column1 = Column2: 1 / MAX(NKeys(I1), NKeys(I2)) if both columns are indexed; 1 / NKeys(I) if only one is; otherwise 10%.
Column > value: (High(I) – value) / (High(I) – Low(I)) if there is an index I on the column; otherwise some fraction below 50%.
Column IN (list of values): (reduction factor for Column = value) * # values.
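The rough reduction factors above can be turned into a small calculation. The following is only a sketch: the function names, the 0.4 stand-in for "below 50%", and the statistics used in the example are assumptions made for illustration. It also assumes the usual rule that the estimated result size is the input size multiplied by the reduction factor of every term.

```python
from math import prod

def rf_equals_value(nkeys=None):
    """Column = value: 1/NKeys(I) if an index exists, else the 10% default."""
    return 1.0 / nkeys if nkeys else 0.10

def rf_equals_column(nkeys1=None, nkeys2=None):
    """Column1 = Column2: 1/MAX(NKeys) over the available indexes, else 10%."""
    known = [n for n in (nkeys1, nkeys2) if n]
    return 1.0 / max(known) if known else 0.10

def rf_greater_than(value, low=None, high=None, default=0.4):
    """Column > value: (High - value) / (High - Low) if an index exists.

    `default` is an arbitrary pick below 50% (an assumption of this sketch).
    """
    if low is None or high is None or high == low:
        return default
    return min(1.0, max(0.0, (high - value) / (high - low)))

def rf_in_list(num_values, nkeys=None):
    """Column IN (list): reduction factor for equality times the list length."""
    return min(1.0, rf_equals_value(nkeys) * num_values)

def estimate_result_size(input_tuples, reduction_factors):
    """Input size times the product of the per-term reduction factors."""
    return input_tuples * prod(reduction_factors)

# Hypothetical statistics: 1000 input tuples, an index with 10 distinct keys
# on the equality column, and an index with Low=1, High=11 on the range column.
size = estimate_result_size(
    1000, [rf_equals_value(nkeys=10), rf_greater_than(5, low=1, high=11)]
)
```

Multiplying the factors treats the terms as independent, which is part of why the estimate is only rough.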
14
Improved Statistics: Histograms
• The rough estimation assumes uniform distributions of
values.
– What if that assumption is not true? For more accurate
estimations, use histograms.
• Histogram is a data structure to approximate data
distribution.
– Term # children with (age > 3): result size = 2 tuples.
[Figure: histogram of tuple counts by age]
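As a rough illustration of how a histogram replaces the uniform-distribution assumption, the sketch below estimates the number of tuples satisfying `column > value`. The bucket layout `(low, high, count)`, the function name, and the counts are all made up for this example and are not the data from the slide's figure; values are assumed to be spread evenly inside each bucket.

```python
def estimate_greater_than(buckets, value):
    """Estimate how many tuples have column > value.

    buckets: list of (low, high, count) covering the range [low, high).
    Buckets entirely above `value` count in full, the bucket containing
    `value` contributes proportionally, and lower buckets contribute nothing.
    """
    total = 0.0
    for low, high, count in buckets:
        if value <= low:
            total += count
        elif value < high:
            total += count * (high - value) / (high - low)
    return total

# Hypothetical histogram with three buckets of width 5.
histogram = [(0, 5, 10), (5, 10, 30), (10, 15, 20)]
estimate = estimate_greater_than(histogram, 7.5)
```

A uniform assumption over the whole range 0 to 15 would instead give 60 * (15 - 7.5) / 15 = 30 tuples, so the two estimates differ whenever the data is skewed.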
15
Relational Algebra Equivalences
• Allow us to choose different join orders and to push
selections and projections ahead of joins.
• Selections (cascading of selections):
– Break a selection condition into many smaller selections.
– Combine several selections into one selection.
σ c1 AND c2 AND … AND cn (R) ≡ σ c1 (σ c2 (… σ cn (R) …))
• Selection (commutative):
– Test conditions in either order.
σ c1 (σ c2 (R)) ≡ σ c2 (σ c1 (R))
16
Relational Algebra Equivalences
(Projections, Joins, and Cross-Products)
• Projections (cascading projections)
– Successively eliminating columns is same as eliminating all but
the columns of final projection.
π a1 (R) ≡ π a1 (π a2 (… (π an (R)) …)), where each ai is a set of attributes
of relation R, and ai is a subset of ai+1
π sid (Reserves) ≡ π sid (π sid, bid (Reserves))
• Joins and Cross-Products (commutative)
– Freedom to choose inner or outer relations
R × S ≡ S × R
• Joins and Cross-Products (associative)
– Join pairs of relations in any order
R × (S × T) ≡ (R × S) × T
17
Relational Algebra Equivalences
Involving Two or More Operators (1)
• Commute a selection with a projection
– when the selection condition c involves only attributes retained
by the projection a.
π a (σ c (R)) ≡ σ c (π a (R))
π sid (σ sid=10 (R)) ≡ σ sid=10 (π sid (R))
π sid (σ bid=10 (R)) ≠ σ bid=10 (π sid (R))
• Combine a selection with a cross-product to form a join.
• Push a selection into a cross-product (join)
– When the selection condition c involves only attributes of one of
the arguments to the cross-product (join).
σ c (R × S) ≡ σ c (R) × S
σ R.bid=10 (R × S) ≡ σ R.bid=10 (R) × S
σ S.rating=10 (R ⋈ S ) ≠ σ S.rating=10 (R) ⋈ S
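The selection push-down rule can be checked on toy data. This is a minimal sketch with made-up rows; relations are modelled as lists of dictionaries, and `cross` and `select` are hypothetical helpers written for this example.

```python
from itertools import product

def cross(r, s):
    """Cross-product: every tuple of r paired with every tuple of s."""
    return [{**a, **b} for a, b in product(r, s)]

def select(pred, rel):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]

# Hypothetical rows; attribute names are prefixed to keep them distinct.
reserves = [{"R.sid": 1, "R.bid": 10}, {"R.sid": 2, "R.bid": 20}]
sailors = [{"S.sid": 1, "S.rating": 7}, {"S.sid": 2, "S.rating": 10}]

def only_bid_10(t):
    # The condition mentions only attributes of Reserves.
    return t["R.bid"] == 10

late = select(only_bid_10, cross(reserves, sailors))   # select after the product
early = cross(select(only_bid_10, reserves), sailors)  # select pushed into R
same = late == early
```

Both forms return the same tuples, but the pushed-down form builds a 2-tuple product instead of filtering a 4-tuple one, which is why the optimizer prefers it.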
18
Relational Algebra Equivalences
Involving Two or More Operators (2)
• Push a selection into a cross-product (join)
– Replace a selection with cascading selections, and commute selections.
– c1 involves attributes of both R & S; c2 involves only attributes of R; c3 involves only attributes of S.
σ c (R × S) ≡ σ c1 ∧ c2 ∧ c3 (R × S)
≡ σ c1 (σ c2 (σ c3 (R × S)))
≡ σ c1 (σ c2 (R) × σ c3 (S))
σ sid=10 ∧ rname=‘Jane’ ∧ sname=‘Paul’ (R × S)
≡ σ sid=10 (σ rname=‘Jane’ (R) × σ sname=‘Paul’ (S))
• Push a projection into a cross-product (join)
– when the subsets a1 and a2 of the projection attributes a each involve only
attributes of one of the arguments to the cross-product (join).
– Analogous to pushing a selection into a cross-product (join).
π a (R × S) ≡ π a1 (R) × π a2 (S)
π R.sid, S.sname (R × S) ≡ π R.sid (R) × π S.sname (S)
19
Relational Algebra Equivalences
Involving Two or More Operators (3)
• Push a projection into a join
– when a1 is the subset of a in R, a2 is the subset of a in S, and every
attribute used in c appears in a.
π a (R ⋈c S) ≡ π a1 (R) ⋈c π a2 (S)
π R.sid, S.sname, S.sid (R ⋈ R.sid=S.sid S)
≡ (π R.sid (R)) ⋈ R.sid=S.sid (π S.sname, S.sid (S))
• Push a projection into a join in the general case, if
– a1 is the subset of attributes of R that appear in a or c, and a2 is the
subset of attributes of S that appear in a or c.
π a (R ⋈c S) ≡ π a (π a1 (R) ⋈c π a2 (S))
π R.sid, S.sname (R ⋈ R.sid=S.sid S) ≡
π R.sid, S.sname (π R.sid (R) ⋈ R.sid=S.sid π S.sname, S.sid (S))
20
Enumeration of Alternative Plans
• There are two main cases:
– Single-relation plans: (SELECT … FROM Sailors S …)
– Multiple-relation plans: (SELECT … FROM Sailors S, Reserves R …)
• For queries over a single relation, queries consist of a
combination of selections, projections, and aggregate ops
(no joins).
• The general strategy has two parts:
– For selections, consider each available access path (file scan / index)
and pick the one with the least estimated cost.
– Projections and aggregations are carried out together with selections.
• For example, if an index is used for a selection, projection is done for
each retrieved tuple, and the resulting tuples are pipelined into the
aggregate computation.
21
Single-Relation Queries
• Example:
– “For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such
sailors with different names.”
• Plan without indexes:
– Scan Sailors, apply selections and projections.
– Write out tuples.
– Sort tuples based on S.rating (GROUP BY).
– Apply HAVING on the fly at the last sorting step.
π S.rating, COUNT(*) (
HAVING COUNT DISTINCT(S.sname)>2 (
GROUP BY S.rating (
π S.rating, S.sname (
σ S.rating>5 AND S.age=20 (Sailors)))))
(the innermost π and σ form the σπ× expression)
22
Single-Relation Queries: Plans
Utilizing an Index
• Single-Index Access Path
– If several indexes match selection condition(s), pick the best access
path. Apply projections and non-primary selections. Sort by grouping
attributes. Do aggregations.
• Multiple-Index Access Path
– If several indexes match selection condition(s), use them to retrieve sets
of RIDs. Take the intersection of the RID sets. Sort the resulting RIDs by page ID.
Retrieve all tuples on the same page together, while applying projections and
selections. Sort by grouping attributes. Do aggregations.
• Sorted Index Access Path
– If the GROUP BY attributes match a tree index, use the index to retrieve
tuples in order. Then apply selections, projections, and aggregation
operations.
• Index-Only Access Path
– If all attributes mentioned in (SELECT, WHERE, GROUP BY, HAVING)
are in the index, can do an index-only scan.
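The multiple-index access path can be sketched in a few lines. This is an illustration only: it assumes a RID is a `(page_id, slot)` pair and that each index lookup has already produced a set of RIDs; the function name and the sample RIDs are hypothetical.

```python
def rids_to_fetch(*rid_sets):
    """Intersect the RID sets from each index and order them by page.

    Sorting (page_id, slot) pairs puts RIDs on the same page next to each
    other, so each data page needs to be read only once.
    """
    common = set.intersection(*(set(rids) for rids in rid_sets))
    return sorted(common)

# Hypothetical index lookup results for two selection conditions.
rids_from_age_index = {(3, 1), (1, 2), (3, 4)}
rids_from_rating_index = {(3, 4), (1, 2), (2, 0)}
fetch_order = rids_to_fetch(rids_from_age_index, rids_from_rating_index)
```

Only tuples whose RID appears in every set can satisfy all the indexed conditions, so the intersection avoids fetching pages that hold no qualifying tuple.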
23
Example Using Single-Index Access
Path
• B+ tree index on rating, hash index on age, and B+ tree index on
<rating, sname, age>
• Retrieve Sailors tuples (age=20) using the hash index on age (the most
selective path).
• For each retrieved tuple, apply (rating > 5) and keep only the attributes
rating and sname (projection).
• Write results to temp table.
• Sort temp table on rating. At the last sorting step, apply
HAVING and final projection.
π S.rating, COUNT(*) (
HAVING COUNT DISTINCT(S.sname)>2 (
GROUP BY S.rating (
π S.rating, S.sname (
σ S.rating>5 AND S.age=20 (Sailors)))))
(the innermost π and σ form the σπ× expression)
24
Queries Over Multiple Relations
• The general strategy has three parts:
– Consider and enumerate only left-deep join trees. Why?
• Restrict the search space.
• Left-deep trees allow us to generate all fully pipelined plans.
– Consider selections and projections as early as possible (push
selections and projections into lower joins).
– Estimate the cost for each left-deep plan. Pick the one with the
lowest cost.
[Figure: example join trees over relations A, B, C, and D]
25
Enumeration of Left-Deep Plans
• Left-deep plans differ from each other in
– The order of relations
– The access method for each relation
– The join algorithm for each join operation
• We have discussed how to enumerate different access
methods and estimate their costs for one relation.
• Different join algorithms (e.g., nested loop join, sort-merge
join, hash join, …) and their cost analysis are discussed in
Chapter 14.
26
Plan Enumeration Algorithm
• Enumerated using N passes (if N relations joined):
– Pass 1: Find best 1-relation plan for each relation.
• Push selection terms & projection attributes to that 1-relation (using
equivalences).
• Consider all different access paths, and pick the one with lowest cost.
This is the same algorithm as for a 1-relation query.
– Pass 2: Find best way to join result of each 1-relation plan (as outer)
to another relation. (All 2-relation plans.)
• Again push selection terms & projection attributes into the inner relation
using equivalences.
• Try to pipeline the selected/projected tuples from the inner relation into
the join operation with the outer relation. (Sometimes this is not possible, e.g., with
sort-merge join.)
– Pass N: Find best way to join result of a (N-1)-relation plan (as
outer) to the N’th relation. (All N-relation plans.)
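The N-pass enumeration can be sketched as dynamic programming over subsets of relations. This is a simplified illustration, not a full optimizer: it keeps only the single cheapest plan per subset (no interesting orders), considers one join cost formula instead of several join algorithms, and does not avoid cartesian products. The cardinalities, the fixed selectivity, and the cost functions are all hypothetical.

```python
from itertools import combinations
from math import prod

# Hypothetical statistics: tuples per relation and one fixed join selectivity.
CARD = {"A": 1000, "B": 10, "C": 100}
SELECTIVITY = 0.01

def result_size(relations):
    """Estimated tuples after joining the given relations (toy model)."""
    return prod(CARD[r] for r in relations) * SELECTIVITY ** (len(relations) - 1)

def scan_cost(relation):
    """Toy cost of the best 1-relation plan: read every tuple once."""
    return CARD[relation]

def join_cost(outer, inner):
    """Toy cost of joining an intermediate result (outer) with one relation."""
    return result_size(outer) * CARD[inner]

def best_left_deep_plan(relations):
    # Pass 1: best plan for each single relation.
    best = {frozenset([r]): (scan_cost(r), (r,)) for r in relations}
    # Pass k: extend each retained (k-1)-relation plan, used as the outer,
    # with one more relation as the inner.
    for k in range(2, len(relations) + 1):
        for subset in combinations(relations, k):
            candidates = []
            for inner in subset:
                outer = frozenset(subset) - {inner}
                outer_cost, outer_order = best[outer]
                cost = outer_cost + join_cost(outer, inner)
                candidates.append((cost, outer_order + (inner,)))
            best[frozenset(subset)] = min(candidates)
    return best[frozenset(relations)]

cost, order = best_left_deep_plan(["A", "B", "C"])
```

With these made-up numbers the sketch starts from the smallest relation, B, as the first outer. Because every subset keeps only its best plan, the work grows with the number of subsets rather than with the number of join orders.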
27
Plan Enumeration Algorithm Illustration
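[Figure: Pass 1 is the same as a single-relation query. Pass 2 joins pairs of relations A, B, and C: the outer is pipelined into the join, alternative join algorithms are considered, and selections/projections are pushed to the inner. Pass 3 joins each retained 2-relation plan with the remaining relation.]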
28
Enumeration of Plans (Contd.)
• For each subset of relations, retain only:
– Cheapest plan overall, plus
– Cheapest plan for each interesting order of the tuples.
• ORDER BY, GROUP BY, aggregates etc. handled as a final
step. If tuples going into them are not sorted, apply
additional sorting.
• Even when only left-deep trees are considered, this approach is still
exponential in the # of relations.
– N relations -> O( N! ) plans
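To see where the N! comes from, note that a left-deep tree is fixed by the order in which the relations are joined. The sketch below lists those orders for a small case; the helper names are made up for this example, and access methods and join algorithms (which multiply the count further) are ignored.

```python
from itertools import permutations
from math import factorial

def left_deep_orders(relations):
    """Every join order; each one corresponds to exactly one left-deep tree."""
    return list(permutations(relations))

def as_left_deep(order):
    """Render a join order as a left-deep expression."""
    expr = order[0]
    for relation in order[1:]:
        expr = f"({expr} ⋈ {relation})"
    return expr

orders = left_deep_orders(["A", "B", "C", "D"])
count_matches = len(orders) == factorial(4)
example = as_left_deep(orders[0])
```

For four relations this gives 24 orders; for ten relations it is already 3,628,800, which is why plans are pruned per subset of relations instead of being listed exhaustively.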
29
Example
Plan: π sname ( σ bid=100 (Reserves) ⋈ sid=sid σ rating>5 (Sailors) )
Available indexes:
Sailors: B+ tree index on rating; hash index on sid
Reserves: B+ tree index on bid
• Pass 1:
– Sailors: B+ tree matches rating>5, and is probably
cheapest. However, if this selection is expected to
retrieve a lot of tuples, and the index is unclustered, a file
scan may be cheaper.
• Still, the B+ tree plan is kept (because tuples are in rating
order).
– Reserves: B+ tree on bid matches bid=100; cheapest.
• Pass 2:
– We consider each plan retained from Pass 1 as the outer, and consider
how to join it with the (only) other relation.
• Reserves as outer (better): the hash index can be used to get Sailors tuples that
match sid = outer tuple’s sid value (an index nested loops join using the hash index).
• Sailors as outer (worse): the B+ tree index can be used to get Reserves tuples that
match bid. Then pipeline tuples into the join, or write them out for a sort-merge join.
30
Nested Subqueries
• Nested block is optimized independently
(sometimes just evaluated once).
SELECT S.sname
FROM Sailors S
WHERE S.rating =
(SELECT MAX(S2.rating)
FROM Sailors S2)
• In general, nested queries are dealt with
using some form of nested loops
evaluation.
– Main query as the outer loop, the
subquery as the inner loop
– This is necessary for correlated queries such as the one
below (S is used both in the main query
and the subquery).
SELECT S.sname
FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.bid=103 AND
S.sid=R.sid)
The same query written with an uncorrelated IN subquery:
SELECT S.sname
FROM Sailors S
WHERE S.sid IN
(SELECT R.sid
FROM Reserves R
WHERE R.bid=103)
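The difference between evaluating a nested block once and re-evaluating it per outer tuple can be sketched as follows. The rows are invented sample data and the variable names are hypothetical; the loops mirror the nested-loops strategy described above rather than any particular DBMS.

```python
# Hypothetical sample rows.
sailors = [
    {"sid": 1, "sname": "Dustin", "rating": 7},
    {"sid": 2, "sname": "Lubber", "rating": 8},
    {"sid": 3, "sname": "Rusty", "rating": 10},
]
reserves = [
    {"sid": 1, "bid": 103},
    {"sid": 3, "bid": 103},
    {"sid": 2, "bid": 101},
]

# Uncorrelated subquery (MAX rating): the nested block is evaluated once.
max_rating = max(s["rating"] for s in sailors)
top_rated = [s["sname"] for s in sailors if s["rating"] == max_rating]

# Correlated subquery (EXISTS): the nested block refers to the outer tuple S,
# so it is re-evaluated for every Sailors tuple (outer loop over Sailors,
# inner loop over Reserves).
with_exists = [
    s["sname"]
    for s in sailors
    if any(r["bid"] == 103 and r["sid"] == s["sid"] for r in reserves)
]

# IN version: the nested block does not mention S, so its result can be
# computed once and then probed for each outer tuple.
sids_for_103 = {r["sid"] for r in reserves if r["bid"] == 103}
with_in = [s["sname"] for s in sailors if s["sid"] in sids_for_103]
```

On this data the EXISTS and IN versions return the same sailors, but the IN version scans Reserves once instead of once per sailor.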
31
Summary
• Query optimization is an important task in a relational
DBMS.
• Must understand optimization in order to understand the
performance impact of a given database design (relations,
indexes) on a workload (set of queries).
• Two parts to optimizing a query:
– Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, choice of operator algorithms.
32
Summary (Contd.)
• Single-relation queries:
– All access paths considered, cheapest is chosen.
• Multiple-relation queries:
– All single-relation plans are first enumerated.
• Selections/projections considered as early as possible.
– Next, for each 1-relation plan, all different ways of joining another
relation (as inner) are considered.
– Next, for each 2-relation plan that is ‘retained’, all ways of joining
another relation (as inner) are considered, etc.
– At each level, for each subset of relations, only best plan for each
interesting order of tuples is ‘retained’.
33