Distributed Query Optimization Algorithms System R and R* Hill Climbing and SDD-1 L4.2.2. Distributed Query Optimization Algorithms -- 1 System R (Centralized) Algorithm Simple (one relation) queries are executed according to the best access path. Execute joins Determine the possible ordering of joins Determine the cost of each ordering Choose the join ordering with the minimal cost For joins, two join methods are considered: Nested loops Merge join L4.2.2. Distributed Query Optimization Algorithms -- 2 System R Algorithm -- Example Names of employees working on the CAD/CAM project Assume EMP has an index on ENO, ASG has an index on PNO, PROJ has an index on PNO and an index on PNAME L4.2.2. Distributed Query Optimization Algorithms -- 3 System R Algorithm -- Example Choose the best access paths to each relation EMP: sequential scan (no selection on EMP) ASG: sequential scan (no selection on ASG) PROJ: index on PNAME (there is a selection on PROJ based on PNAME) Determine the best join ordering EMP ASG PROJ ASG PROJ EMP PROJ ASG EMP ASG EMP PROJ EMP PROJ ASG PROJ EMP ASG Select the best ordering based on the join costs evaluated according to the two methods L4.2.2. Distributed Query Optimization Algorithms -- 4 System R Example (cont'd) EMP EMP ASG EMP × PROJ ASG (ASG PROJ ASG EMP ASG EMP) PROJ PROJ PROJ ASG PROJ × EMP (PROJ ASG) ASG) EMP EMP Best total join order is one of (ASG EMP) PROJ (PROJ L4.2.2. Distributed Query Optimization Algorithms -- 5 System R Algorithm (PROJ ASG) EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP. Final plan: select PROJ using index on PNAME then join with ASG using index on PNO then join with EMP using index on ENO L4.2.2. Distributed Query Optimization Algorithms -- 6 System R* Distributed Query Optimization Total-cost minimization. Cost function includes local processing as well as transmission. Algorithm For each relation in query tree find the best access path For the join of n relations find the optimal join order strategy each local site optimizes the local query processing L4.2.2. Distributed Query Optimization Algorithms -- 7 Data Transfer Strategies Ship-whole. entire relation is shipped and stored as temporary relation. If merge join algorithm is used, no need for temporary storage, and can be done in pipeline mode Fetch-as-needed. this method is equivalent to semijoin of the inner relation with the outer relation tuple L4.2.2. Distributed Query Optimization Algorithms -- 8 Join Strategy 1 External relation R with internal relation S, let LC be local processing cost, CC be data transfer cost, let average number of tuples of S that match one tuple of R be s Strategy 1. Ship the entire outer relation to the site of internal relation TC = LC(get R) + CC(size(R)) + LC(get s tuples from S)*card(R) L4.2.2. Distributed Query Optimization Algorithms -- 9 Join Strategy 2 Ship the entire inner relation to the site of the outer relation TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S)*card(R) L4.2.2. Distributed Query Optimization Algorithms -- 10 Join Strategy 3 Fetch tuples of the inner relation for each tuple of the outer relation TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s*len(S))*card(R) L4.2.2. Distributed Query Optimization Algorithms -- 11 Join Strategy 4 Move both relations to 3rd site and join there TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S)*card(R) Conceptually, the algorithm does an exhaustive search among all alternatives and selects one that minimizes total cost L4.2.2. Distributed Query Optimization Algorithms -- 12 Hill Climbing Algorithm - Algorithm Inputs query graph, locations of relations, and relation statistics Initial solution the least costly among all when the relations are sent to a candidate result site denoted by ES0, and the site as chosen site Splits ES0 into ES1: ship one relation of join to the site of other relation ES2: these two relations are joined locally and the result is transmitted to the chosen site If cost(ES1) + cost(ES2) + LC > cost (ES0) select ES0, else select ES1 and ES2. The process can be recursively applied to ES1 and ES2 till no more benefit occurs L4.2.2. Distributed Query Optimization Algorithms -- 13 Hill Climbing Algorithm - Example Relation Size Site EMP 8 1 PAY 4 2 PROJ 1 3 ASG 10 4 SAL PROJ PAY EMP PNO Site1 EMP(8) Site2 PAY(4) TITLE ENO PNAME=“CAD/CAM” ES0 Cost = 13 4 Site3 PROJ(1) 1 8 Site4 ASG(10) ASG Ignore the local processing cost Length of tuples is 1 for all relation L4.2.2. Distributed Query Optimization Algorithms -- 14 ES1 HCA - Example Solution 1 Cost = ES0 Cost = 13 Site1 EMP(8) Site2 PAY(4) 4 Site3 PROJ(1) 1 Site4 ASG(10) Solution 2 Cost = Site2 PAY(4 ) TITLE Site3 PROJ(1) 8 ES1 Site2 PAY(4) ? Site3 PROJ(1) ES2 ? Site4 ? ASG(10) ES3 Site1 EMP(8) ES2 ES3 Site1 EMP(8) Site4 ASG(10) ESo is the “BEST” L4.2.2. Distributed Query Optimization Algorithms -- 15 Hill Climbing Algorithm - Comments Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it. If there are local minimas, it may not find the global minima If the optimal solution has a high initial cost, it won’t be found since it won’t be chosen as the initial feasible solution. Site1 EMP(8) Site2 PAY(4) Site3 PROJ(1) Site4 ASG(10) COST = L4.2.2. Distributed Query Optimization Algorithms -- 16 SDD-1 Algorithm SDD-1 algorithm generalized the hill-climbing algorithm to determine ordering of beneficial semijoins; and uses statistics on the database, called database profiles. Cost of semijoin: Cost (R SJA S) = CMSG + CTR*size(A(S)) Benefit is the cost of transferring irrelevant tuple Benefit(R SJA S) = (1-SFSJ(S.A)) * size(R) * CTR A semijoin is beneficial if cost < benefit. L4.2.2. Distributed Query Optimization Algorithms -- 17 SDD-1: The Algorithm initialization phase generates all beneficial semijoins, and an execution strategy that includes only local processing most beneficial semijoin is selected; statistics are modified and new beneficial semijoins are selected the above step is done until no more beneficial joins are left assembly site selection to perform local operations postoptimization removes unnecessary semijoins L4.2.2. Distributed Query Optimization Algorithms -- 18 SDD1 - Example SELECT FROM WHERE AND * Relation Card Tup_Len Rel_size EMP, ASG, PROJ 30 50 1500 EMP.ENO = ASG.ENO EMP ASG 100 30 3000 ASG.PNO = PROJ.PNO PROJ Site 2 ASG ENO Site 1 EMP PNO Site 3 PROJ 50 40 2000 Relation SFsj Size(PJ(attr)) EMP.ENO 0.3 120 ASG.ENO 0.8 400 ASG.PNO 1.0 400 PROJ.PNO 0.4 200 L4.2.2. Distributed Query Optimization Algorithms -- 19 SDD1 - First Iteration SJ1: ASG SJ EMP benefit = (1-0.3)*3000 = 2100; cost = 120 SJ2: ASG SJ PROJ benefit = (1-0.4)*3000 = 1800 cost = 200 SJ3: EMP SJ ASG benefit = (1-0.8)*1500 = 300; cost = 400 SJ4: PROJ SJ ASG benefit = 0; cost = 400 SJ1 is selected ASG size is reduced to 3000*0.3=900 ASG’ = ASG SJ EMP Semijoin selectivity factor is reduced; it is approximated by SFSJ(G’.ENO)= 0.8*0.3 = 0.24, SFSJ(G’PNO)=1.0*0.3 =0.3, size(G’.ENO)= 400*0.3=120, size(G’.PNO) = 120 L4.2.2. Distributed Query Optimization Algorithms -- 20 SDD-1 - Second & Third Iterations Second iteration SJ2: ASG’ SJ PROJ benefit=(10.4)*900=540 cost=200; SJ3: EMP SJ ASG’; benefit=(10.24)*1500=1140 cost=120 SJ4: PROJ SJ ASG’, benefit=(10.3)*2000=1400 cost=120 SJ4 is selected PROJ’ = PROJ SJ ASG’ size(PROJ’) = 2000*0.3 = 600 SFSJ(J’)=0.4*0.3=0.12 size(J’.PNO)=200*0.3=60 Third Iteration SJ2: ASG’ SJ PROJ benefit=(1-0.12)*900=792 cost=60; SJ3: EMP SJ ASG’; benefit=(10.24)*1500=1140 cost=120 SJ3 is selected reduces size of E to 1500*0.24=360 Finally SJ2 is selected, with size of G as 108 L4.2.2. Distributed Query Optimization Algorithms -- 21 Local Optimization Each site optimizes the plan to be executed at the site A centralized query optimization problem L4.2.2. Distributed Query Optimization Algorithms -- 22 SDD-1 - Assembly Site Selection After reduction EMP is at site 1 with size 360 ASG is at site 2 with size 108 PROJ is at site 3 with size 600 Site 3 is chosen as assembly site Site1 EMP SJ4 is removed in post optimization. (ASG SJ EMP) SJ PROJ site 3 (EMP SJ ASG) site 3 join at site 3 Site3 PROJ Site2 ASG L4.2.2. Distributed Query Optimization Algorithms -- 23