Ddb-L42

Distributed Query Optimization Algorithms
System R and R*
 Hill Climbing and SDD-1

L4.2.2. Distributed Query Optimization Algorithms -- 1
System R (Centralized) Algorithm
Simple (one relation) queries are executed
according to the best access path.
 Execute joins

 Determine the possible ordering of joins
 Determine the cost of each ordering
 Choose the join ordering with the minimal cost

For joins, two join methods are considered:
 Nested loops
 Merge join
System R Algorithm -- Example
Names of employees working on the CAD/CAM
project
 Assume
 EMP has an index on ENO,
 ASG has an index on PNO,
 PROJ has an index on PNO and an index on PNAME
System R Algorithm -- Example

Choose the best access paths to each relation
 EMP: sequential scan (no selection on EMP)
 ASG: sequential scan (no selection on ASG)
 PROJ: index on PNAME (there is a selection on PROJ based on
PNAME)

Determine the best join ordering







 EMP ⋈ ASG ⋈ PROJ
 ASG ⋈ PROJ ⋈ EMP
 PROJ ⋈ ASG ⋈ EMP
 ASG ⋈ EMP ⋈ PROJ
 (EMP × PROJ) ⋈ ASG
 (PROJ × EMP) ⋈ ASG
Select the best ordering based on the join costs evaluated according
to the two methods
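
The ordering step can be illustrated with a small sketch. It costs every left-deep ordering of the three relations exhaustively and keeps the cheapest, which is simpler than a real optimizer's search. The cardinalities, the selectivity factors, and the cost measure (sum of intermediate result sizes) are assumptions made up for illustration; they are not statistics from the example.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical statistics, chosen only for illustration.
# PROJ is assumed to shrink to one tuple after the selection on PNAME.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 1}

# Join predicates of the example: EMP-ASG on ENO and ASG-PROJ on PNO.
# The selectivity factors are assumptions; a missing pair means a Cartesian product.
SELECTIVITY = {
    frozenset({"EMP", "ASG"}): Fraction(1, 400),
    frozenset({"ASG", "PROJ"}): Fraction(1, 100),
}


def plan_cost(order):
    """Toy cost of a left-deep join order: the sum of intermediate result sizes."""
    joined = [order[0]]
    card = Fraction(CARD[order[0]])
    total = Fraction(0)
    for rel in order[1:]:
        factor = Fraction(1)
        for other in joined:
            factor *= SELECTIVITY.get(frozenset({rel, other}), Fraction(1))
        card = card * CARD[rel] * factor
        total += card
        joined.append(rel)
    return total


def best_order(relations):
    """Cost every ordering exhaustively and keep the cheapest one."""
    return min(permutations(relations), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(CARD):
        print(order, plan_cost(order))
    print("best:", best_order(CARD))
```

Under these assumed numbers the orderings that start with a Cartesian product or with EMP and ASG cost more than those that join ASG and PROJ first, because the selection on PROJ keeps the intermediate result small.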
System R Example (cont'd)
[Figure: search tree of join orderings. Level 1: EMP, ASG, PROJ. Level 2: EMP ⋈ ASG, EMP × PROJ, ASG ⋈ EMP, ASG ⋈ PROJ, PROJ ⋈ ASG, PROJ × EMP. Level 3: (ASG ⋈ EMP) ⋈ PROJ, (PROJ ⋈ ASG) ⋈ EMP.]
Best total join order is one of
 (ASG ⋈ EMP) ⋈ PROJ
 (PROJ ⋈ ASG) ⋈ EMP
System R Algorithm

(PROJ ⋈ ASG) ⋈ EMP has a useful index on the select
attribute and direct access to the join attributes of ASG and
EMP.

Final plan:
 select PROJ using index on PNAME
 then join with ASG using index on PNO
 then join with EMP using index on ENO
System R* Distributed Query Optimization

Total-cost minimization. Cost function includes
local processing as well as transmission.

Algorithm
 For each relation in query tree find the best access path
 For the join of n relations find the optimal join order
strategy
 each local site optimizes the local query processing
Data Transfer Strategies

Ship-whole: the entire relation is shipped and stored as a
temporary relation. If the merge join algorithm is used,
temporary storage is not needed and the join can be done in
pipeline mode.

Fetch-as-needed: tuples of the inner relation are fetched only
as needed, one outer tuple at a time. This method is equivalent
to a semijoin of the inner relation with the outer relation
tuple.
Join Strategy 1

Join an external (outer) relation R with an internal (inner)
relation S. Let LC be the local processing cost, CC the data
transfer cost, and s the average number of tuples of S that
match one tuple of R.

Strategy 1. Ship the entire outer relation to the site
of the inner relation
TC = LC(get R)
+ CC(size(R))
+ LC(get s tuples from S)*card(R)
Join Strategy 2

Ship the entire inner relation to the site of the outer
relation
TC = LC(get S)
+ CC(size(S))
+ LC(store S)
+ LC(get R)
+ LC(get s tuples from S)*card(R)
Join Strategy 3

Fetch tuples of the inner relation for each tuple of
the outer relation
TC = LC(get R)
+ CC(len(A)) * card(R)
+ LC(get s tuples from S) * card(R)
+ CC(s*len(S))*card(R)
Join Strategy 4
Move both relations to a third site and join there
TC = LC(get R)
+ LC(get S)
+ CC(size(S))
+ LC(store S)
+ CC(size(R))
+ LC(get s tuples from S)*card(R)
 Conceptually, the algorithm does an exhaustive
search among all alternatives and selects one that
minimizes total cost
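
The four total-cost formulas can be compared side by side with a minimal sketch. The linear `lc` and `cc` functions, the `JoinStats` fields, and the treatment of LC(store S) as one unit per tuple are assumptions introduced here for illustration; only the structure of each formula comes from the strategies above.

```python
from dataclasses import dataclass


@dataclass
class JoinStats:
    card_r: int   # card(R): number of tuples in the outer relation R
    card_s: int   # card(S): number of tuples in the inner relation S
    len_r: int    # tuple length of R
    len_s: int    # tuple length of S
    len_a: int    # length of the join attribute A
    s: float      # average number of S tuples matching one R tuple


def lc(tuples, unit=1.0):
    """Hypothetical local cost: proportional to the number of tuples touched."""
    return unit * tuples


def cc(data, unit=1.0):
    """Hypothetical transfer cost: proportional to the amount of data shipped."""
    return unit * data


def strategy_costs(st):
    """Total cost TC of join strategies 1 to 4, keyed by strategy number."""
    size_r = st.card_r * st.len_r
    size_s = st.card_s * st.len_s
    probe = lc(st.s) * st.card_r  # LC(get s tuples from S) * card(R)
    return {
        1: lc(st.card_r) + cc(size_r) + probe,
        2: lc(st.card_s) + cc(size_s) + lc(st.card_s) + lc(st.card_r) + probe,
        3: (lc(st.card_r) + cc(st.len_a) * st.card_r + probe
            + cc(st.s * st.len_s) * st.card_r),
        4: (lc(st.card_r) + lc(st.card_s) + cc(size_s) + lc(st.card_s)
            + cc(size_r) + probe),
    }


def cheapest(st):
    """Exhaustive choice: the strategy number with the smallest total cost."""
    costs = strategy_costs(st)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    stats = JoinStats(card_r=10, card_s=100, len_r=10, len_s=10, len_a=2, s=2.0)
    print(strategy_costs(stats), "cheapest:", cheapest(stats))
```

With a small outer relation and a large inner relation, as in the made-up statistics of the usage example, shipping the outer relation (strategy 1) comes out cheapest under these unit costs.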

Hill Climbing Algorithm - Algorithm
Inputs
query graph, locations of relations, and relation statistics
Initial solution
for each candidate result site, consider shipping all relations to that
site; the least costly of these strategies is the initial solution ES0,
and its site is the chosen site
Split ES0 into
ES1: ship one relation of a join to the site of the other relation
ES2: join these two relations locally and transmit the result to the
chosen site
If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0;
otherwise select ES1 and ES2.
The process can be applied recursively to ES1 and ES2 until no
further benefit is obtained
Hill Climbing Algorithm - Example
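Relation  Size  Site
EMP       8     1
PAY       4     2
PROJ      1     3
ASG       10    4

[Figure: query graph over PAY, EMP, ASG, and PROJ with join edges labeled TITLE, ENO, and PNO, the selection PNAME=“CAD/CAM”, and the result attribute SAL.]

ES0: ship EMP(8) from Site1, PAY(4) from Site2, and PROJ(1) from Site3 to Site4, where ASG(10) is stored
Cost = 8 + 4 + 1 = 13

Ignore the local processing cost
Length of tuples is 1 for all relations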
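
The choice of ES0 in this example can be checked with a short sketch. It uses the relation sizes and sites given above, ignores local processing, and counts one unit of transfer cost per unit of size. The function names are hypothetical, and `keep_split` only restates the comparison rule of the algorithm.

```python
# Relation sizes and sites from the example: name -> (size, site).
RELATIONS = {"EMP": (8, 1), "PAY": (4, 2), "PROJ": (1, 3), "ASG": (10, 4)}


def ship_all_cost(result_site, relations=RELATIONS):
    """Transfer cost of sending every relation not stored at result_site to it."""
    return sum(size for size, site in relations.values() if site != result_site)


def initial_solution(relations=RELATIONS):
    """Return (chosen site, cost) of the cheapest ship-everything strategy, ES0."""
    sites = sorted({site for _, site in relations.values()})
    costs = {site: ship_all_cost(site, relations) for site in sites}
    best = min(costs, key=costs.get)
    return best, costs[best]


def keep_split(cost_es1, cost_es2, local_cost, cost_es0):
    """Keep ES1 and ES2 unless together they cost more than ES0."""
    return cost_es1 + cost_es2 + local_cost <= cost_es0


if __name__ == "__main__":
    for site in (1, 2, 3, 4):
        print("result site", site, "cost", ship_all_cost(site))
    print("ES0:", initial_solution())
```

The four candidate result sites cost 15, 19, 22, and 13, so Site4 is the chosen site, in agreement with the cost of 13 shown for ES0.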
HCA - Example
[Figure: ES0 (cost = 13) ships EMP(8) from Site1, PAY(4) from Site2, and PROJ(1) from Site3 to Site4, where ASG(10) is stored. Two alternative splits of ES0, Solution 1 and Solution 2, are drawn as sub-strategies ES1, ES2, and ES3; their costs are left blank on the slide.]
ES0 is the “BEST”
Hill Climbing Algorithm - Comments



Greedy algorithm: determines an initial feasible solution and
iteratively tries to improve it.
If there are local minima, it may not find the global minimum.
If the optimal solution has a high initial cost, it won’t be found,
since it won’t be chosen as the initial feasible solution.
[Figure: Site1 EMP(8), Site2 PAY(4), Site3 PROJ(1), Site4 ASG(10), illustrating an alternative strategy; its cost is left blank on the slide.]
SDD-1 Algorithm
The SDD-1 algorithm generalizes the hill-climbing
algorithm to determine an ordering of beneficial
semijoins. It uses statistics on the database, called
database profiles.
 Cost of a semijoin (C_MSG is the fixed cost of a message, C_TR the
transfer cost per unit of data):
Cost(R SJ_A S) = C_MSG + C_TR * size(PJ_A(S))
 Benefit is the cost of transferring the irrelevant tuples of R
that the semijoin eliminates:
Benefit(R SJ_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR

A semijoin is beneficial if cost < benefit.
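
As a minimal sketch, the two formulas and the benefit test translate directly into functions. The parameter names and the default values C_MSG = 0 and C_TR = 1 are assumptions for illustration, and the numbers in the usage example are invented.

```python
def semijoin_cost(size_proj_s_a, c_msg=0.0, c_tr=1.0):
    """Cost(R SJ_A S) = C_MSG + C_TR * size(PJ_A(S))."""
    return c_msg + c_tr * size_proj_s_a


def semijoin_benefit(sf_s_a, size_r, c_tr=1.0):
    """Benefit(R SJ_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR."""
    return (1.0 - sf_s_a) * size_r * c_tr


def is_beneficial(sf_s_a, size_r, size_proj_s_a, c_msg=0.0, c_tr=1.0):
    """A semijoin is beneficial if its cost is strictly below its benefit."""
    cost = semijoin_cost(size_proj_s_a, c_msg, c_tr)
    return cost < semijoin_benefit(sf_s_a, size_r, c_tr)


if __name__ == "__main__":
    # Invented numbers: half of R is filtered out, the projection of S.A is small.
    print(is_beneficial(sf_s_a=0.5, size_r=1000, size_proj_s_a=100))
    # A large fixed message cost can make the same semijoin not worthwhile.
    print(is_beneficial(sf_s_a=0.5, size_r=1000, size_proj_s_a=100, c_msg=450.0))
```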
SDD-1: The Algorithm
Initialization phase: generate all beneficial
semijoins and an execution strategy that includes
only local processing
 Select the most beneficial semijoin, update the statistics,
and select new beneficial semijoins
 Repeat the above step until no beneficial semijoins
are left
 Assembly site selection: choose the site at which to perform local operations
 Postoptimization: remove unnecessary semijoins

SDD1 - Example
SELECT *
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO

Relation  Card  Tup_Len  Rel_size
EMP       30    50       1500
ASG       100   30       3000
PROJ      50    40       2000

[Figure: join graph. EMP at Site 1 is linked to ASG at Site 2 on ENO; ASG is linked to PROJ at Site 3 on PNO.]

Attribute  SFsj  Size(PJ(attr))
EMP.ENO    0.3   120
ASG.ENO    0.8   400
ASG.PNO    1.0   400
PROJ.PNO   0.4   200
SDD1 - First Iteration



SJ1: ASG SJ EMP   benefit = (1-0.3)*3000 = 2100; cost = 120
SJ2: ASG SJ PROJ  benefit = (1-0.4)*3000 = 1800; cost = 200
SJ3: EMP SJ ASG   benefit = (1-0.8)*1500 = 300; cost = 400
SJ4: PROJ SJ ASG  benefit = 0; cost = 400

SJ1 is selected
ASG’ = ASG SJ EMP; the size of ASG is reduced to 3000*0.3 = 900
The semijoin selectivity factors and attribute sizes of ASG’ are
reduced as well; they are approximated by
SF_SJ(ASG’.ENO) = 0.8*0.3 = 0.24
SF_SJ(ASG’.PNO) = 1.0*0.3 = 0.3
size(ASG’.ENO) = 400*0.3 = 120
size(ASG’.PNO) = 400*0.3 = 120
SDD-1 - Second & Third Iterations
Second iteration
 SJ2: ASG’ SJ PROJ  benefit = (1-0.4)*900 = 540; cost = 200
 SJ3: EMP SJ ASG’   benefit = (1-0.24)*1500 = 1140; cost = 120
 SJ4: PROJ SJ ASG’  benefit = (1-0.3)*2000 = 1400; cost = 120
 SJ4 is selected
PROJ’ = PROJ SJ ASG’
size(PROJ’) = 2000*0.3 = 600
SF_SJ(PROJ’.PNO) = 0.4*0.3 = 0.12
size(PROJ’.PNO) = 200*0.3 = 60
Third iteration
 SJ2: ASG’ SJ PROJ’  benefit = (1-0.12)*900 = 792; cost = 60
 SJ3: EMP SJ ASG’    benefit = (1-0.24)*1500 = 1140; cost = 120
 SJ3 is selected; it reduces the size of EMP to 1500*0.24 = 360
Finally SJ2 is selected, reducing the size of ASG’ to 900*0.12 = 108
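
The iterations above can be replayed with a short greedy loop. This is a sketch under stated assumptions, not the SDD-1 implementation: C_MSG = 0 and C_TR = 1 (which the costs in the example imply), the "most beneficial" semijoin is taken to be the one with the largest benefit minus cost, and after each semijoin every selectivity factor and attribute size of the reduced relation is scaled by the same factor, as the example approximates. The function and variable names are hypothetical.

```python
# Database profile from the example.
SIZE = {"EMP": 1500, "ASG": 3000, "PROJ": 2000}
SF = {"EMP.ENO": 0.3, "ASG.ENO": 0.8, "ASG.PNO": 1.0, "PROJ.PNO": 0.4}
PROJ_SIZE = {"EMP.ENO": 120, "ASG.ENO": 400, "ASG.PNO": 400, "PROJ.PNO": 200}

# Candidate semijoins R SJ_A S, written as (name, R, S, A).
CANDIDATES = [
    ("SJ1", "ASG", "EMP", "ENO"),
    ("SJ2", "ASG", "PROJ", "PNO"),
    ("SJ3", "EMP", "ASG", "ENO"),
    ("SJ4", "PROJ", "ASG", "PNO"),
]


def sdd1_reduce(size, sf, proj_size, candidates):
    """Greedily apply beneficial semijoins; return their order and the final sizes."""
    size, sf, proj_size = dict(size), dict(sf), dict(proj_size)
    remaining, chosen = list(candidates), []

    def net(candidate):
        _, r, s, a = candidate
        benefit = (1 - sf[f"{s}.{a}"]) * size[r]  # C_TR = 1 assumed
        cost = proj_size[f"{s}.{a}"]              # C_MSG = 0 assumed
        return benefit - cost

    while remaining:
        best = max(remaining, key=net)
        if net(best) <= 0:
            break  # no beneficial semijoin is left
        name, r, s, a = best
        factor = sf[f"{s}.{a}"]
        size[r] *= factor
        # Approximation used in the example: scale all statistics of R alike.
        for attr in sf:
            if attr.startswith(r + "."):
                sf[attr] *= factor
                proj_size[attr] *= factor
        remaining.remove(best)
        chosen.append(name)
    return chosen, size


if __name__ == "__main__":
    order, final = sdd1_reduce(SIZE, SF, PROJ_SIZE, CANDIDATES)
    print(order, {rel: round(val) for rel, val in final.items()})
```

Under these assumptions the loop picks SJ1, SJ4, SJ3, and then SJ2, and the reduced sizes round to 360 for EMP, 108 for ASG, and 600 for PROJ.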
Local Optimization
Each site optimizes the plan to be executed at the
site
 A centralized query optimization problem

SDD-1 - Assembly Site Selection



After reduction
 EMP is at site 1 with size 360
 ASG is at site 2 with size 108
 PROJ is at site 3 with size 600
Site 3 is chosen as the assembly site
SJ4 is removed in postoptimization
 (ASG SJ EMP) SJ PROJ → site 3
 (EMP SJ ASG) → site 3
 join at site 3
[Figure: Site1 EMP and Site2 ASG send their reduced relations to Site3 PROJ.]
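
The site choice can be reproduced with a few lines, assuming the selection rule is "pick the site that already holds the most data", which is consistent with site 3 being chosen here. The function name and the rule itself are assumptions for illustration; the sizes are the reduced sizes listed above.

```python
# Reduced relations from the example: name -> (site, size).
REDUCED = {"EMP": (1, 360), "ASG": (2, 108), "PROJ": (3, 600)}


def choose_assembly_site(relations):
    """Pick the site holding the most data; return it with the data shipped to it."""
    data_at = {}
    for site, size in relations.values():
        data_at[site] = data_at.get(site, 0) + size
    best = max(data_at, key=data_at.get)
    shipped = sum(size for site, size in data_at.items() if site != best)
    return best, shipped


if __name__ == "__main__":
    print(choose_assembly_site(REDUCED))
```

With the reduced sizes, site 3 holds 600 units, so the remaining 360 + 108 = 468 units are shipped to it.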