Ddb-L42

L4.2.2. Distributed Query Optimization Algorithms

- System R and R*
- Hill Climbing and SDD-1

System R (Centralized) Algorithm

- Simple (one relation) queries are executed according to the best access path.
- Execute joins
  - Determine the possible ordering of joins
  - Determine the cost of each ordering
  - Choose the join ordering with the minimal cost
- For joins, two join methods are considered:
  - Nested loops
  - Merge join

System R Algorithm -- Example

Names of employees working on the CAD/CAM project

Assume
- EMP has an index on ENO,
- ASG has an index on PNO,
- PROJ has an index on PNO and an index on PNAME

System R Algorithm -- Example

- Choose the best access paths to each relation
  - EMP: sequential scan (no selection on EMP)
  - ASG: sequential scan (no selection on ASG)
  - PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
- Determine the best join ordering
  - EMP ⋈ ASG ⋈ PROJ
  - ASG ⋈ PROJ ⋈ EMP
  - PROJ ⋈ ASG ⋈ EMP
  - ASG ⋈ EMP ⋈ PROJ
  - EMP × PROJ ⋈ ASG
  - PROJ × EMP ⋈ ASG
- Select the best ordering based on the join costs evaluated according to the two methods

System R Example (cont'd)

[Figure: tree of alternative join orders starting from EMP, ASG, and PROJ, including EMP ⋈ ASG, EMP × PROJ, ASG ⋈ EMP, ASG ⋈ PROJ, PROJ ⋈ ASG, PROJ × EMP, (ASG ⋈ EMP) ⋈ PROJ, and (PROJ ⋈ ASG) ⋈ EMP]

Best total join order is one of
- (ASG ⋈ EMP) ⋈ PROJ
- (PROJ ⋈ ASG) ⋈ EMP

System R Algorithm

- (PROJ ⋈ ASG) ⋈ EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP.
- Final plan:
  - select PROJ using index on PNAME
  - then join with ASG using index on PNO
  - then join with EMP using index on ENO

System R* Distributed Query Optimization

- Total-cost minimization. The cost function includes local processing as well as transmission.
- Algorithm
  - For each relation in the query tree, find the best access path
  - For the join of n relations, find the optimal join order strategy
  - Each local site optimizes the local query processing

Data Transfer Strategies

- Ship-whole. The entire relation is shipped and stored as a temporary relation. If the merge join algorithm is used, no temporary storage is needed and the join can be done in pipeline mode.
- Fetch-as-needed. This method is equivalent to a semijoin of the inner relation with the outer relation tuple.

Join Strategy 1

- Outer (external) relation R is joined with inner (internal) relation S. Let LC be the local processing cost, CC the data transfer cost, and s the average number of tuples of S that match one tuple of R.
- Strategy 1. Ship the entire outer relation to the site of the inner relation
  TC = LC(get R) + CC(size(R)) + LC(get s tuples from S) * card(R)

Join Strategy 2

- Ship the entire inner relation to the site of the outer relation
  TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S) * card(R)

Join Strategy 3

- Fetch tuples of the inner relation for each tuple of the outer relation
  TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s * len(S)) * card(R)

Join Strategy 4

- Move both relations to a third site and join there
  TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S) * card(R)
- Conceptually, the algorithm does an exhaustive search among all alternatives and selects the one that minimizes total cost.
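
The four total-cost formulas above can be compared side by side with a small sketch. The slides do not say how LC and CC are computed, so the code takes them as caller-supplied functions of a data volume; the names `JoinStats`, `strategy_costs`, and `cheapest`, the linear cost functions, and all the numbers in the example are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class JoinStats:
    card_r: int  # card(R): number of tuples in the outer relation R
    size_r: int  # size(R)
    size_s: int  # size(S)
    len_a: int   # len(A): length of one join-attribute value
    len_s: int   # len(S): length of one tuple of S
    s: int       # average number of S tuples matching one R tuple


Cost = Callable[[int], int]


def strategy_costs(st: JoinStats, lc: Cost, cc: Cost) -> Dict[str, int]:
    """Evaluate TC for the four join strategies listed on the slides.

    Assumption: LC(get X) and LC(store X) depend only on size(X), and
    LC(get s tuples from S) depends only on s * len(S).
    """
    get_r = lc(st.size_r)
    get_s = lc(st.size_s)
    store_s = lc(st.size_s)
    probe = lc(st.s * st.len_s) * st.card_r  # LC(get s tuples from S) * card(R)
    return {
        "1: ship outer to inner site": get_r + cc(st.size_r) + probe,
        "2: ship inner to outer site": get_s + cc(st.size_s) + store_s + get_r + probe,
        "3: fetch as needed": get_r + cc(st.len_a) * st.card_r + probe
                              + cc(st.s * st.len_s) * st.card_r,
        "4: move both to third site": get_r + get_s + cc(st.size_s) + store_s
                                      + cc(st.size_r) + probe,
    }


def cheapest(costs: Dict[str, int]) -> Tuple[str, int]:
    """Exhaustive search over the alternatives: pick the minimum total cost."""
    return min(costs.items(), key=lambda kv: kv[1])


# Hypothetical statistics and cost functions, chosen only to exercise the formulas.
example = JoinStats(card_r=10, size_r=100, size_s=1000, len_a=4, len_s=20, s=2)
costs = strategy_costs(example, lc=lambda n: n, cc=lambda n: 5 + 2 * n)
best_name, best_cost = cheapest(costs)
```

With a small outer relation and a large inner relation, as in these made-up numbers, shipping the outer relation wins; changing the statistics or the cost functions changes the ranking.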
Hill Climbing Algorithm - Algorithm

- Inputs: query graph, locations of relations, and relation statistics
- Initial solution: for each candidate result site, consider sending all relations to that site; the least costly of these alternatives is denoted ES0, and its site is the chosen site
- ES0 is split into
  - ES1: ship one relation of a join to the site of the other relation
  - ES2: these two relations are joined locally and the result is transmitted to the chosen site
- If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0; else select ES1 and ES2.
- The process can be applied recursively to ES1 and ES2 until no more benefit occurs

Hill Climbing Algorithm - Example

Relation  Size  Site
EMP       8     1
PAY       4     2
PROJ      1     3
ASG       10    4

[Figure: query graph joining PAY and EMP on TITLE, EMP and ASG on ENO, and ASG and PROJ on PNO, with the selection PNAME = "CAD/CAM" on PROJ and SAL as the result attribute]

[Figure: ES0 ships EMP(8) from Site 1, PAY(4) from Site 2, and PROJ(1) from Site 3 to Site 4, which holds ASG(10)]

ES0 Cost = 8 + 4 + 1 = 13

- Ignore the local processing cost
- Length of tuples is 1 for all relations

HCA - Example

[Figure: ES0 (Cost = 13) compared with two alternative solutions, Solution 1 and Solution 2, each split into ES1, ES2, and ES3 over Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1), and Site 4 ASG(10); some transfer sizes are marked "?" and the costs of the two alternatives are not filled in]

ES0 is the "BEST"

Hill Climbing Algorithm - Comments

- Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it.
- If there are local minima, it may not find the global minimum.
- If the optimal solution has a high initial cost, it won't be found, since it won't be chosen as the initial feasible solution.

[Figure: alternative schedule over Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1), and Site 4 ASG(10); its cost is not filled in]

SDD-1 Algorithm

- The SDD-1 algorithm generalizes the hill-climbing algorithm to determine the ordering of beneficial semijoins, and uses statistics on the database, called database profiles.
- Cost of semijoin: Cost(R SJ_A S) = C_MSG + C_TR * size(PJ_A(S))
- Benefit is the cost of transferring irrelevant tuples: Benefit(R SJ_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR
- A semijoin is beneficial if cost < benefit.

SDD-1: The Algorithm

- The initialization phase generates all beneficial semijoins and an execution strategy that includes only local processing
- The most beneficial semijoin is selected; statistics are modified and new beneficial semijoins are selected
- The above step is repeated until no more beneficial semijoins are left
- Assembly site selection to perform local operations
- Postoptimization removes unnecessary semijoins

SDD1 - Example

SELECT *
FROM   EMP, ASG, PROJ
WHERE  EMP.ENO = ASG.ENO
AND    ASG.PNO = PROJ.PNO

Relation  Card  Tup_Len  Rel_size
EMP       30    50       1500
ASG       100   30       3000
PROJ      50    40       2000

[Figure: query graph with EMP at Site 1 joined to ASG at Site 2 on ENO, and ASG joined to PROJ at Site 3 on PNO]

Attribute  SF_SJ  Size(PJ(attr))
EMP.ENO    0.3    120
ASG.ENO    0.8    400
ASG.PNO    1.0    400
PROJ.PNO   0.4    200

SDD1 - First Iteration

- SJ1: ASG SJ EMP; benefit = (1-0.3)*3000 = 2100; cost = 120
- SJ2: ASG SJ PROJ; benefit = (1-0.4)*3000 = 1800; cost = 200
- SJ3: EMP SJ ASG; benefit = (1-0.8)*1500 = 300; cost = 400
- SJ4: PROJ SJ ASG; benefit = 0; cost = 400
- SJ1 is selected
- ASG' = ASG SJ EMP; the size of ASG is reduced to 3000*0.3 = 900
- The semijoin selectivity factors of ASG' are reduced; they are approximated by SF_SJ(ASG'.ENO) = 0.8*0.3 = 0.24, SF_SJ(ASG'.PNO) = 1.0*0.3 = 0.3, size(ASG'.ENO) = 400*0.3 = 120, size(ASG'.PNO) = 120
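
The first-iteration figures can be reproduced with a short sketch of the cost and benefit formulas. The slide's numbers are consistent with C_MSG = 0 and C_TR = 1, so the code assumes those values; the names `REL_SIZE`, `PROFILE`, `evaluate`, and `select_most_beneficial` are hypothetical, and ranking beneficial semijoins by benefit minus cost is one reasonable reading of "most beneficial", not something the slides state.

```python
C_MSG = 0  # assumed fixed message cost (the slide's costs imply 0)
C_TR = 1   # assumed transfer cost per unit of data (the slide's costs imply 1)

# Relation sizes from the example table.
REL_SIZE = {"EMP": 1500, "ASG": 3000, "PROJ": 2000}

# Database profile: (relation, attribute) -> (SF_SJ, size of projection).
PROFILE = {
    ("EMP", "ENO"): (0.3, 120),
    ("ASG", "ENO"): (0.8, 400),
    ("ASG", "PNO"): (1.0, 400),
    ("PROJ", "PNO"): (0.4, 200),
}

# Candidate semijoins R SJ_A S as (name, R, S, A).
CANDIDATES = [
    ("SJ1", "ASG", "EMP", "ENO"),
    ("SJ2", "ASG", "PROJ", "PNO"),
    ("SJ3", "EMP", "ASG", "ENO"),
    ("SJ4", "PROJ", "ASG", "PNO"),
]


def evaluate(candidates):
    """Return (name, benefit, cost) for each candidate semijoin R SJ_A S."""
    rows = []
    for name, r, s, a in candidates:
        sf, proj_size = PROFILE[(s, a)]
        benefit = (1 - sf) * REL_SIZE[r] * C_TR  # irrelevant tuples not shipped
        cost = C_MSG + C_TR * proj_size          # shipping PJ_A(S) to R's site
        rows.append((name, benefit, cost))
    return rows


def select_most_beneficial(rows):
    """Keep semijoins with cost < benefit and pick the largest net benefit."""
    beneficial = [row for row in rows if row[2] < row[1]]
    if not beneficial:
        return None
    return max(beneficial, key=lambda row: row[1] - row[2])


rows = evaluate(CANDIDATES)
chosen = select_most_beneficial(rows)
for name, benefit, cost in rows:
    print(f"{name}: benefit={benefit:.0f} cost={cost}")
print("selected:", chosen[0])
```

Only SJ1 and SJ2 pass the cost < benefit test here, and SJ1 has the larger net benefit, which matches the selection on the slide. Later iterations would rerun the same evaluation after updating the sizes and selectivity factors of the reduced relation.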
SDD-1 - Second & Third Iterations

Second iteration
- SJ2: ASG' SJ PROJ; benefit = (1-0.4)*900 = 540; cost = 200
- SJ3: EMP SJ ASG'; benefit = (1-0.24)*1500 = 1140; cost = 120
- SJ4: PROJ SJ ASG'; benefit = (1-0.3)*2000 = 1400; cost = 120
- SJ4 is selected
- PROJ' = PROJ SJ ASG'; size(PROJ') = 2000*0.3 = 600; SF_SJ(PROJ'.PNO) = 0.4*0.3 = 0.12; size(PROJ'.PNO) = 200*0.3 = 60

Third iteration
- SJ2: ASG' SJ PROJ'; benefit = (1-0.12)*900 = 792; cost = 60
- SJ3: EMP SJ ASG'; benefit = (1-0.24)*1500 = 1140; cost = 120
- SJ3 is selected; it reduces the size of EMP to 1500*0.24 = 360
- Finally SJ2 is selected, reducing the size of ASG to 900*0.12 = 108

Local Optimization

- Each site optimizes the plan to be executed at the site
- A centralized query optimization problem

SDD-1 - Assembly Site Selection

- After reduction
  - EMP is at site 1 with size 360
  - ASG is at site 2 with size 108
  - PROJ is at site 3 with size 600
- Site 3 is chosen as the assembly site
- SJ4 is removed in postoptimization.
- Resulting strategy:
  - (ASG SJ EMP) SJ PROJ → site 3
  - (EMP SJ ASG) → site 3
  - join at site 3

[Figure: EMP at Site 1 and ASG at Site 2 are shipped to Site 3, which holds PROJ]
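
The choice of site 3 can be checked with a few lines of arithmetic. This sketch assumes that the cost of assembling the result at a site is just the total size of the reduced relations shipped there from the other sites (unit transfer cost, no per-message cost); that cost model and the names `REDUCED`, `transfer_cost_to`, and `choose_assembly_site` are assumptions for illustration, not part of the slides.

```python
# Reduced relation sizes after the semijoin program, keyed by site.
REDUCED = {1: ("EMP", 360), 2: ("ASG", 108), 3: ("PROJ", 600)}


def transfer_cost_to(site, reduced=REDUCED):
    """Total size that must be shipped if `site` is the assembly site."""
    return sum(size for s, (_, size) in reduced.items() if s != site)


def choose_assembly_site(reduced=REDUCED):
    """Pick the site that minimizes the data shipped to it."""
    return min(reduced, key=lambda site: transfer_cost_to(site, reduced))


assembly_site = choose_assembly_site()
for site in REDUCED:
    print(f"site {site}: ship {transfer_cost_to(site)} units")
print("assembly site:", assembly_site)
```

Under this model the site holding the largest reduced relation is always the cheapest assembly site, because the total volume is fixed and only the local relation avoids being shipped.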
