Relational Query Optimization

Chapter 15
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks.
• A query block is an SQL query with no nesting and exactly one SELECT clause and one FROM clause, and at most one WHERE, GROUP BY, and HAVING clause.
• The WHERE clause is assumed to be in conjunctive normal form.

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. Optimization is done one block at a time.

    SELECT S.sname
    FROM Sailors S
    WHERE S.age IN
        (SELECT MAX (S2.age)
         FROM Sailors S2
         GROUP BY S2.rating)

Outer block:

    SELECT S.sname
    FROM Sailors S
    WHERE S.age IN (reference to nested block)

Nested block:

    SELECT MAX (S2.age)
    FROM Sailors S2
    GROUP BY S2.rating

Query Block

For each block, the plans considered are:
– All available access methods, for each relation in the FROM clause.
– All left-deep join trees: all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods.
Cost Estimation

For each plan considered, must estimate cost:

Must estimate cost of each operation in plan tree.
• Depends on input cardinalities.
• Depends on algorithm (sequential scan, index scan).

Must also estimate size of result for each operation.
• Use information about the input relations.
• Must make assumptions about effect of predicates.

Cost of plan = sum of cost of each operator in tree.
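
The rule that a plan's cost is the sum of its operators' costs can be sketched as a recursive walk over the plan tree. This is only an illustration: the `PlanNode` class and the per-operator cost numbers below are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """One operator in a plan tree (hypothetical structure for illustration)."""
    name: str
    cost: float  # estimated cost of this operator alone, e.g. in page I/Os
    children: List["PlanNode"] = field(default_factory=list)


def plan_cost(node: PlanNode) -> float:
    """Cost of a plan = sum of the costs of all operators in the tree."""
    return node.cost + sum(plan_cost(child) for child in node.children)


# Made-up operator costs, purely to exercise the function.
scan_sailors = PlanNode("file scan Sailors", 500)
scan_reserves = PlanNode("index scan Reserves", 10)
join = PlanNode("join", 120, [scan_reserves, scan_sailors])
print(plan_cost(join))  # 630
```

In a real optimizer each operator's cost would itself be derived from the estimated input cardinalities and the chosen algorithm.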

Size Estimation and Reduction Factors

Goal: estimate the result size of a query block such as:

    SELECT attribute list
    FROM relation list
    WHERE term1 AND ... AND termk

• The maximum number of tuples in the result is the product of the cardinalities of the relations in the FROM clause.
• The reduction factor (RF) associated with each term reflects the impact of that term in reducing the result size.
• Result size = (product of the cardinalities of the relations in FROM) * (product of the reduction factors of the terms in WHERE).
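
As a minimal sketch, the result-size formula is a product of cardinalities times a product of reduction factors. The function name and the sample numbers are assumptions made for illustration.

```python
from math import prod


def estimate_result_size(cardinalities, reduction_factors):
    """Estimated result size of a query block.

    cardinalities:     number of tuples in each relation of the FROM clause
    reduction_factors: one RF per term of the WHERE clause
    """
    return prod(cardinalities) * prod(reduction_factors)


# Hypothetical block: two relations of 100 and 200 tuples, two WHERE terms.
size = estimate_result_size([100, 200], [0.5, 0.1])
print(round(size))
```

With no WHERE terms the estimate falls back to the maximum, the plain product of the cardinalities, because the product of an empty list is 1.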

Assumptions
• Uniform distribution of values in the domain.
• Independent distribution of values in different columns.
• For selections and joins, assume independence of predicates.

Reduction Factors
• column = value
  – Given an index I on column, assume a uniform distribution: RF = 1/NKeys(I).
  – Otherwise, use a fixed reduction factor of 1/10.
• column1 = column2
  – Given indexes I1 and I2 on column1 and column2, and assuming each key value in the smaller index has a matching value in the other: RF = 1/MAX(NKeys(I1), NKeys(I2)).
  – Given only one index I: RF = 1/NKeys(I).
  – Otherwise: RF = 1/10.
• column > value
  – Given an index I on column, and a column of arithmetic type: RF = (High(I) – value) / (High(I) – Low(I)).
  – If the column is not of arithmetic type, or there is no index: a fraction less than one half is chosen arbitrarily.
• column IN (list of values)
  – RF = (reduction factor for column = value) * (number of items in the list).
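
The reduction-factor rules can be sketched as small functions. The function names, the default of 1/3 for the "less than one half" case, and the clamping to the range [0, 1] are assumptions added for illustration; the slides only give the formulas.

```python
DEFAULT_RF = 1 / 10  # fixed reduction factor when no index statistics exist


def rf_equals_value(nkeys=None):
    """column = value: 1/NKeys(I) if an index I exists, else 1/10."""
    return 1 / nkeys if nkeys else DEFAULT_RF


def rf_equals_column(nkeys1=None, nkeys2=None):
    """column1 = column2: 1/MAX(NKeys) over the available indexes, else 1/10."""
    known = [n for n in (nkeys1, nkeys2) if n]
    return 1 / max(known) if known else DEFAULT_RF


def rf_greater_than(value, low=None, high=None, default=1 / 3):
    """column > value: (High - value) / (High - Low) with an index on an
    arithmetic column; otherwise an arbitrary fraction below one half
    (1/3 here is an assumption)."""
    if low is None or high is None or high == low:
        return default
    rf = (high - value) / (high - low)
    return min(max(rf, 0.0), 1.0)  # clamp: an added sanity check


def rf_in_list(nitems, nkeys=None):
    """column IN (list): RF for column = value, times the list length.
    Capped at 1.0 as an added sanity check."""
    return min(1.0, nitems * rf_equals_value(nkeys))


print(rf_equals_value(10), rf_equals_column(10, 40), rf_greater_than(5, 1, 10))
```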

More on Estimation
• The uniform-distribution assumption is often inaccurate, since real data is rarely uniformly distributed.
• Histogram: a data structure maintained by a DBMS to approximate a data distribution.

Estimation

• Equi-width: divide the range of column values into subranges (buckets) of equal width. The distribution within each histogram bucket is assumed to be uniform.
• Equi-depth: choose the buckets so that the number of tuples within each bucket is (almost) equal.
• Compressed equi-depth: maintain separate counts for a small number of very frequent values, and maintain an equi-depth histogram to cover the remaining values.
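
Example (figure): histograms over a column with values 0 to 14.

    Value:      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
    Frequency:  2  3  3  1  2  1  3  4  8  2  0  1  2  4  9

Equi-width histogram:

    Bucket                    1      2      3      4      5
    Values                  0-2    3-5    6-8   9-11  12-14
    Count                     8      4     15      3     15
    Avg. count per value   2.67   1.33    5.0    1.0    5.0

Equi-depth histogram:

    Bucket                    1      2      3      4      5
    Values                  0-3    4-7    8-9  10-13     14
    Count                     9     10     10      7      9
    Avg. count per value   2.25    2.5    5.0   1.75    9.0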
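
The two histograms in the example can be reproduced with a short sketch. The bucket boundaries are given as input rather than computed, and the function names are hypothetical; the frequencies are the ones from the example distribution over values 0 to 14.

```python
# Frequency of each value 0..14 in the example distribution.
freq = [2, 3, 3, 1, 2, 1, 3, 4, 8, 2, 0, 1, 2, 4, 9]

# Buckets as inclusive (low, high) value ranges.
equiwidth = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14)]
equidepth = [(0, 3), (4, 7), (8, 9), (10, 13), (14, 14)]


def bucket_counts(freq, buckets):
    """Total number of tuples that fall into each bucket."""
    return [sum(freq[lo:hi + 1]) for lo, hi in buckets]


def estimate_equals(value, freq, buckets):
    """Estimated number of tuples with column = value, assuming a uniform
    distribution inside the bucket that contains the value."""
    for lo, hi in buckets:
        if lo <= value <= hi:
            return sum(freq[lo:hi + 1]) / (hi - lo + 1)
    return 0.0


print(bucket_counts(freq, equiwidth))       # [8, 4, 15, 3, 15]
print(bucket_counts(freq, equidepth))       # [9, 10, 10, 7, 9]
print(estimate_equals(14, freq, equiwidth)) # 5.0
print(estimate_equals(14, freq, equidepth)) # 9.0
```

For the frequent value 14 (true count 9), the equi-depth histogram gives the exact answer because that value has a bucket to itself, while the equi-width estimate averages it with its less frequent neighbours.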

Relational Algebra Equivalences

• Allow us to choose different join orders.
• Allow us to 'push' selections and projections ahead of joins.

Selections:
• Cascade: $\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots(\sigma_{c_n}(R))\dots)$
  – Read right to left: combine several selections into one selection that evaluates the conjunctive condition on each tuple.
  – Read left to right: separate a conjunction into smaller selection operations, which can then be combined with a join.
• Commute: $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$

Projections:
• Cascade: $\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R))\dots)$, where $a_i \subseteq a_{i+1}$ for $i = 1 \dots n-1$.

Joins:
• Associative: $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$
• Commute: $R \bowtie S \equiv S \bowtie R$
– When joining several relations, we are free to join the relations in any order we choose.

Selects, Projects and Joins

• Case 1: A projection commutes with a selection that only uses attributes retained by the projection.
  $\pi_a(\sigma_c(R)) \equiv \sigma_c(\pi_a(R))$
• Case 2: Combine a selection with a cross-product to form a join.
  $R \bowtie_c S \equiv \sigma_c(R \times S)$
• Case 3: A selection on just attributes of R commutes with a join, i.e.,
  $\sigma(R \bowtie S) \equiv \sigma(R) \bowtie S$
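
• Selections and joins:
  $\sigma_{c_1 \wedge c_2 \wedge c_3}(R \bowtie S) \equiv \sigma_{c_1}(\sigma_{c_2}(R) \bowtie \sigma_{c_3}(S))$
  (where $c_2$ involves only attributes of R and $c_3$ involves only attributes of S)
• Projections and joins:
  $\pi_a(R \times S) \equiv \pi_{a_1}(R) \times \pi_{a_2}(S)$
  $\pi_a(R \bowtie_c S) \equiv \pi_{a_1}(R) \bowtie_c \pi_{a_2}(S)$
  (where $a_1$ and $a_2$ are the attributes of $a$ that belong to R and S respectively, and the attributes used in $c$ appear in $a$)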
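
A quick way to convince oneself that a selection on attributes of one relation commutes with a join is to evaluate both sides on a small instance. The toy `select` and `join` helpers and the sample tuples below are hypothetical; relations are plain lists of dictionaries.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def join(r, s, cond):
    """Join: concatenate every pair of tuples that satisfies cond."""
    return [{**a, **b} for a in r for b in s if cond(a, b)]


# Made-up sample data.
sailors = [{"sid": 1, "rating": 7}, {"sid": 2, "rating": 3}, {"sid": 3, "rating": 9}]
reserves = [{"sid": 1, "bid": 100}, {"sid": 3, "bid": 101}, {"sid": 3, "bid": 100}]


def on_sid(a, b):
    return a["sid"] == b["sid"]


def high_rating(t):
    return t["rating"] > 5


# Selection applied after the join ...
late = select(high_rating, join(sailors, reserves, on_sid))
# ... versus pushed below the join, onto Sailors only.
early = join(select(high_rating, sailors), reserves, on_sid)

print(late == early)  # True
```

Both forms return the same tuples, but the pushed-down form feeds fewer Sailors tuples into the join, which is why optimizers prefer it.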

Enumeration of Alternative Plans

There are two main cases:
• Single-relation plans
• Multiple-relation plans

Single Relation Plans

Queries over a single relation consist of a combination of selects, projects, and aggregate operations.

• Main decision: which access path to use for retrieving tuples.
  – If only a single operator is considered, choose the most selective access path (file scan or index).
• The different operations are essentially carried out together.
  – E.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation.

Single Relation Plans without Index

    SELECT S.rating, COUNT(*)
    FROM Sailors S
    WHERE S.rating > 5 AND S.age = 20
    GROUP BY S.rating
    HAVING COUNT (DISTINCT S.sname) > 2

Sailors = 500 pages

Plan:
$\pi_{S.rating,\ COUNT(*)}(\text{HAVING}_{COUNT\ DISTINCT(S.sname) > 2}(\text{GROUP BY}_{S.rating}(\pi_{S.rating,\ S.sname}(\sigma_{S.rating>5 \wedge S.age=20}(Sailors)))))$

• File scan to retrieve tuples and apply the selections and projections.
• Write out the tuples that remain after the selections and projections.
• Sort these tuples to implement the GROUP BY clause.
• GROUP BY and HAVING are then done on-the-fly.

E.g., Cost = cost1 (scan) + cost2 (writing <S.rating, S.sname> pairs) + cost3 (sorting as per the GROUP BY clause).
– cost1 = 500 pages
– cost2 = 500 * ratio of tuple size * RFs = 500 * 0.8 * 0.5 * 0.1 = 20 pages
  – ratio of tuple size = pair size / tuple size = 0.8
  – RF(S.rating>5) = 0.5, RF(S.age=20) = 0.1
– cost3 = 3 * NPages = 3 * 20 = 60 pages (assuming two passes)
– Total cost = 500 + 20 + 60 = 580 pages
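
The arithmetic of this example can be sketched as a function. The function name and parameter names are hypothetical; the factor of 3 for a two-pass sort is taken from the example as given.

```python
from math import prod


def no_index_plan_cost(npages, size_ratio, rfs, sort_factor=3):
    """Cost (in page I/Os) of scan + write temp + sort, as in the example.

    npages:      pages in the relation (file scan cost)
    size_ratio:  size of a projected tuple relative to a full tuple
    rfs:         reduction factors of the selection terms
    sort_factor: I/Os per temp page for sorting (3, assuming two passes)
    """
    scan = npages
    temp_pages = npages * size_ratio * prod(rfs)  # pages written out
    sort = sort_factor * temp_pages
    return scan, temp_pages, sort, scan + temp_pages + sort


scan, write, sort, total = no_index_plan_cost(500, 0.8, [0.5, 0.1])
print(round(scan), round(write), round(sort), round(total))  # 500 20 60 580
```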
Single-Relation Plans with Index

Single-index access path
• When several indexes match the selection conditions, choose
the most selective access path.

Multiple-index access path
• Several indexes using Alternatives (2) or (3) for data entries match the selection condition.
• Retrieve rids using each index individually.
• Intersect the resulting sets of rids, then sort them by page id.

Sorted-index access path
• The grouping attributes are a prefix of the search key of a tree index.
• Use the index to retrieve tuples in the order required by GROUP BY.

Index-only access path
• All attributes mentioned in the query are included in the search key of some dense index on the relation in the FROM clause.
• Use an index-only scan to compute the answer.
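
The multiple-index access path can be sketched as a set intersection followed by a sort. Representing a rid as a `(page_id, slot)` pair is an assumption made for illustration, as is the function name.

```python
def multi_index_rids(rid_lists):
    """Intersect the rid sets returned by several indexes and sort the
    result by page id, so that each data page is fetched at most once.

    rid_lists: one list of (page_id, slot) pairs per matching index
    """
    if not rid_lists:
        return []
    common = set(rid_lists[0]).intersection(*rid_lists[1:])
    return sorted(common)  # tuples sort by page_id first, then slot


# Made-up rids returned by two hypothetical indexes.
from_index_1 = [(3, 1), (1, 2), (2, 5)]
from_index_2 = [(1, 2), (3, 1), (4, 0)]
print(multi_index_rids([from_index_1, from_index_2]))  # [(1, 2), (3, 1)]
```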

Single-Relation Plans with Index (Contd.)

• Index I on primary key matches selection:
  – Cost is Height(I)+1 for a B+ tree, about 1.2+1 for a hash index.
• Clustered index I matching one or more selects:
  – (NPages(I)+NPages(R)) * product of RFs of matching selects.
• Non-clustered index I matching one or more selects:
  – (NPages(I)+NTuples(R)) * product of RFs of matching selects.
• Sequential scan of file:
  – NPages(R).

Single-Relation Example

    SELECT S.sid
    FROM Sailors S
    WHERE S.rating=8

Sailors: 500 pages, 80 tuples/page.

• Case 2: index on rating.
  – Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(S)) = (1/10) * (50+500) = 55 pages.
  – Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(S)) = (1/10) * (50+40000) = 4005 pages.
• Case 3: index on sid.
  – Would have to retrieve all tuples/pages. With a clustered index, the cost is 50+500; with an unclustered index, 50+40000.
• Case 4: file scan.
  – We retrieve all file pages (500).
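
The access-path cost formulas can be sketched as functions and applied to the numbers of this example (50 index pages, 500 data pages, 80 tuples per page, 10 distinct rating values). The function names are hypothetical.

```python
def clustered_index_cost(npages_index, npages_rel, nkeys):
    """(NPages(I) + NPages(R)) * RF, with RF = 1/NKeys(I)."""
    return (npages_index + npages_rel) / nkeys


def unclustered_index_cost(npages_index, ntuples_rel, nkeys):
    """(NPages(I) + NTuples(R)) * RF: up to one page fetch per tuple."""
    return (npages_index + ntuples_rel) / nkeys


def file_scan_cost(npages_rel):
    """A sequential scan reads every page of the relation."""
    return npages_rel


npages, tuples_per_page, index_pages, nkeys = 500, 80, 50, 10
ntuples = npages * tuples_per_page  # 40000

costs = {
    "clustered index on rating": clustered_index_cost(index_pages, npages, nkeys),
    "unclustered index on rating": unclustered_index_cost(index_pages, ntuples, nkeys),
    "file scan": file_scan_cost(npages),
}
print(min(costs, key=costs.get))  # clustered index on rating
```

Under these estimates the unclustered index is far worse than a plain file scan, which is why clustering matters so much when a selection retrieves many tuples.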
Queries Over Multiple Relations

As the number of joins increases, the number of alternative plans grows rapidly, so we need to restrict the search space.
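
To get a feel for how quickly the plan space grows, the sketch below counts join orders only, ignoring the choice of access method and join method. It relies on two standard counting results rather than anything stated in the slides: n relations have n! left-deep join orders and (2(n-1))!/(n-1)! binary join trees of any shape.

```python
from math import factorial


def left_deep_orders(n):
    """Number of left-deep join orders for n relations: n!."""
    return factorial(n)


def all_join_trees(n):
    """Number of binary join trees (any shape, ordered children)
    over n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

For 4 relations this gives 24 left-deep orders against 120 trees overall, and the gap widens quickly as n grows.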

Fundamental decision in System R (IBM):
• Only left-deep join trees are considered.
• Left-deep trees can generate all fully pipelined plans.
  – Intermediate results are not written to temporary files.
  – Not all left-deep trees are fully pipelined (e.g., sort-merge join).

[Figure: example join trees over relations A, B, C, and D.]

Enumeration of Left-Deep Plans

Left-deep plans differ in:
• the order of relations,
• the access method for each relation, and
• the join method for each join.

Enumerated using N passes (if N relations are joined):
• Pass 1: Find the best 1-relation plan for each relation.
• Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
• Pass N: Find the best way to join the result of an (N-1)-relation plan (as outer) to the N'th relation. (All N-relation plans.)

For each subset of relations, retain:
• the cheapest plan overall, plus
• the cheapest plan for each interesting order of the tuples.

[Figure: Passes 1 to 3 building left-deep plans over relations A, B, C, and D.]
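
The pass-by-pass enumeration can be sketched as a dynamic program over subsets of relations. This is a heavily simplified illustration, not the System R algorithm itself: the cost model (sum of intermediate result sizes) is an assumption, access and join methods are ignored, interesting orders are not tracked, and all names are hypothetical.

```python
from itertools import combinations
from math import prod


def best_left_deep_order(cards, join_rf):
    """Return (cost, size, order) of the cheapest left-deep plan, or None.

    cards:   relation name -> cardinality
    join_rf: frozenset({a, b}) -> reduction factor of the join term on a, b
    Toy cost model (an assumption): cost = sum of intermediate result sizes.
    """
    rels = sorted(cards)
    # Pass 1: one plan per relation (access cost ignored in this toy model).
    best = {frozenset([r]): (0.0, float(cards[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):  # Pass k: all k-relation plans
        for subset in map(frozenset, combinations(rels, k)):
            for inner in sorted(subset):
                outer = subset - {inner}
                if outer not in best:
                    continue
                rfs = [join_rf[frozenset((o, inner))]
                       for o in outer if frozenset((o, inner)) in join_rf]
                if not rfs:
                    continue  # no join term: avoid the Cartesian product
                o_cost, o_size, o_order = best[outer]
                size = o_size * cards[inner] * prod(rfs)
                cost = o_cost + size
                # Retain only the cheapest plan for each subset.
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, size, o_order + (inner,))
    return best.get(frozenset(rels))


# Made-up statistics for three relations joined in a chain A - B - C.
cards = {"A": 1000, "B": 100, "C": 10}
join_rf = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
result = best_left_deep_order(cards, join_rf)
print(result[2])  # ('C', 'B', 'A')
```

Because only the best plan per subset is kept, the work is proportional to the number of subsets rather than the number of orderings, although it remains exponential in the number of relations.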

Enumeration of Left-Deep Plans: Pass 1
• Identify selection terms in the WHERE clause that mention only attributes of A. (These can be applied when A is first accessed, before any join.)
• Identify attributes of A not mentioned in the SELECT or WHERE clauses. (These can be projected out when A is first accessed, before any join.)
• Keep the cheapest overall plan for fetching all tuples:
  – a file scan
• Keep the cheapest plan that produces tuples in search key order:
  – a B+ tree index

Left-Deep Plans: All 2-relation Plans
• Take each single-relation plan from Pass 1 as the outer relation, and every other relation as the inner relation.
• Examine the terms of the WHERE clause:
  – Selections involving only attributes of the inner relation (apply before the join).
  – Selections defining the join.
  – Selections involving attributes of other relations (apply after the join).
• The only selections that are really applied before the join are those that match the chosen access paths for A and B.
• Depending on the join algorithm chosen, the cost may include materializing the outer relation.

Enumeration of Plans (Contd.)

• ORDER BY, GROUP BY, aggregates, etc. are handled as a final step, using either an 'interestingly ordered' plan or an additional sorting operator.
• An (N-1)-way plan is not combined with an additional relation unless there is a join condition between them, i.e., Cartesian products are avoided if possible.
• In spite of pruning the plan space, this approach is still exponential in the number of tables.

Example: Pass 1

[Figure: query plan projecting sname over the join of Reserves and Sailors on sid=sid, with selections bid=100 and rating > 5.]

Available indexes:
• Sailors: B+ tree on rating, hash on sid.
• Reserves: B+ tree on bid.

Pass 1:
• Sailors: the B+ tree matches rating>5 and is probably cheapest.
  – If the selection is expected to retrieve a lot of tuples and the index is unclustered, a file scan may be cheaper.
  – Decision: the B+ tree plan is kept (because tuples are in rating order).
• Reserves: the B+ tree on bid matches bid=100; cheapest.

Example: Pass 2

Recap of Pass 1:
• Sailors: the B+ tree matches rating>5.
• Reserves: the B+ tree on bid matches bid=100.

Pass 2:
– Consider each plan retained from Pass 1 as the outer.
– Consider how to join it with the (only) other relation.
– E.g., Reserves as outer: the hash index can be used to get Sailors tuples that satisfy sid = outer tuple's sid value.
– E.g., Sailors as outer: could possibly use sort-merge join, etc.

Highlights of System R Optimizer

• Impact of the System R optimizer:
  – Most widely used currently; works well for < 10 joins.
• Cost estimation: approximate art at best.
  – Statistics, maintained in system catalogs, are used to estimate the cost of operations and result sizes.
  – Considers a combination of CPU and I/O costs.
• Plan space: too large, must be pruned.
  – Only the space of left-deep plans is considered. Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
  – Cartesian products are avoided.

Other Approaches to Query Optimization

• Exhaustive search is not suitable for a large number of joins.
• Rule-based optimizers: a set of rules guides the generation of candidate plans.
• Randomized plan generation: probabilistic algorithms explore a large space of plans.
• Estimating the size of intermediate relations accurately.
• Parametric query optimization: find good plans for a given query for each of several different conditions that might be encountered at run-time.
• Multiple-query optimization: take concurrent execution of several queries into account.

Summary

Two parts to optimizing a query:
• Consider a set of alternative plans.
  – Must prune the search space.
  – Typically, left-deep plans only.
• Estimate the cost of each plan that is considered.
  – Estimate the size of the result.
  – Estimate the cost of each plan node.
  – Key issues: statistics, indexes, operator implementations.