CS 540 Database Management Systems Lecture 8: Query Optimization 1 DBMS Architecture User/Web Forms/Applications/DBA query Today’s lecture Query Parser transaction Transaction Manager Query Rewriter Query Optimizer Lock Manager Logging & Recovery Query Executor Files & Access Methods Buffer Manager Buffers Lock Tables Main Memory Storage Manager Storage Past lectures Many query plans to execute a SQL query • Compute the join of R(A,B) S(B,C) T(C,D) U(D,E) U U T T R U S S R T R S • Even more plans: multiple algorithms to execute each operation hash join Sort-merge Sort-merge index-scan R U Table-scan T index-scan S Table-scan 3 Query optimization: picking the fastest plan • Optimal approach plan – – – – enumerate each possible plan measure its performance by running it pick the fastest one What’s wrong? • Cost-based optimization – predict the cost of each plan – search the plan space to find the fastest one – do it efficiently • Optimization itself should be fast! 4 Cost-based optimization • Plan space – which plans to consider? – it is time consuming to explore all alternatives. • Cost estimator – how to estimate the cost of each plan without executing it? – like to have accurate estimation • Search algorithm – how to search the plan space fast? – like to avoid checking inefficient plans 5 Reduce plan space by query rewriting • Multiple logical query plan for each SQL query Star(name, birthdate), StarsIn(movie, name, year) SELECT movie FROM Stars, StarsIn WHERE Star.name = StarsIn.name AND year = 1950 movie movie s year=1950 StarsIn.name = Star.name StarsIn.name = Star.name StarsIn Star year=1950 StarsIn Star Generally Faster 6 Reduce plan space by query rewriting • Push selection down to reduce # of rows • Push projection down to reduce # of columns SELECT movie, name FROM Stars, StarsIn WHERE Star.name = StarsIn.name movei, name movie, name StarsIn.name = Star.name StarsIn.name = Star.name movie, name StarsIn movie, name Star StarsIn Less effective than pushing down selection. Star 7 Cost estimation • Cost of an operator depends on its input size – sort-merge join of R ⋈ S : 3 B(R) + 3 B(S) – The inputs may be output of other operators sort-merge join for (R ⋈ S) ⋈ T 3 B(T) + 3 B(R ⋈ S) + 3 B(R) + 3 B(S) • For each operator in a plan compute – input size – cost – output size (for the next operator) 8 Cost estimation • Input: stats on each table – NCARD(R): Number of tuples in R – TCARD(R): Number of blocks in R • TCARD(R) = NCARD(R ) / block size – ICARD(R,A): Number of distinct values of attribute A in R • too much information to keep, use histogram • Assumptions on attribute and predicate independence – we need relative accuracy not exact values. 9 Output size estimation: selection • For relation R(A, B) – S = sA=a(R) • NCARD(S) ranges from 0 to NCARD(R) – ICARD(R,A) + 1 • consider its mean: NCARD (S) = NCARD(R) / ICARD (R,A) – S = sA<a(R) • NCARD(S) = NCARD (R) * (max(A) - a) / (max(A) – min(A)) • If not athematic use magic number – NCARD(S) = NCARD (R)/3 – S = s b <A<a(R) • NCARD(S) = NCARD (R) * (a - b) / (max(A) – min(A)) • If not athematic use magic number – NCARD(S) = NCARD (R)/3 10 Output size estimation: selection • S = sA=1 AND B<10(R) – multiply 1/ICARD(R,A) for equality and 1/3 for inequality – NCARD(R) = 10,000, ICARD(R,A) = 50 – T(S) = 10000 / (50 * 3) = 66 • S = sA=1 OR B<10(R) – sum of estimates of predicates minus their product – NCARD(R) = 10,000, ICARD(R,A) = 50 – NCARD(S) = 200 + 3333 – 66 = 3467 11 Output size estimation: join • Containment of values assumption ICARD(S,A) <= ICARD (R,A): A values in S is a subset of A values in R – special case: A is a a key in R and a foreign key in S • Let’s assume ICARD (S,A) <= ICARD (R,A) – Each tuple t in S joins x tuple(s) in R – consider its mean: x = NCARD(R) / ICARD (R,A) – NCARD(R ⋈A S) = NCARD (R) * NCARD(S) / ICARD(R,A) NCARD(R ⋈A S) = NCARD (R) * NCARD(S) / max(ICARD(R,A), ICARD(S,A)) 12 Cost estimation • System- R cost model – Sum of I/O and CPU – #(PAGE) + W * (RSI calls) • Current cost formulas? 13 Search the plan space • Baseline: exhaustive search – enumerate all combinations and compare their costs – huge space! U U T R S T U R S T R S • System-R style (Selinger style) – dynamic programming – bottom up construction of the plan • start from the base tables and work up the tree to form a plan • compute the cost of larger plans based on its sub-trees. • greedily remove sub-trees that are costly (locally optimal) 14 Dynamic programming • • • • Step 1: best plans for {R1}, {R2}, …, {Rn} Step 2: best plans for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} … Step n: best plan for {R1, …, Rn} 15 Dynamic Programming • Step 1: For each {Ri}: – size({Ri}) = B(Ri) – plan({Ri}) = Ri – cost({Ri}) = cost of access to Ri • e.g. B(Ri) if no index on Ri • Step 2: For each {Ri, Rj}: – size({Ri,Rj}) = estimate of the size of join – plan({Ri,Rj}) = join algorithm – cost = cost function of size of Ri and Rj • #I/O access of the chosen join algorithm – plan({Ri,Rj}): the join algorithm with smallest cost 16 Dynamic programming • Step i: For each S ⊆ {R1, …, Rn} of cardinality i do: – compute Size(S) – for every S1 ,S2 s.t. S = S1 S2 c = cost(S1) + cost(S2) + cost(S1 ⋈ S2) – cost(S) = the smallest C – plan(S) = the plan for cost(S) • Return Plan({R1, …, Rn}) 17 Dynamic programming: example • Let’s assume that the cost of each join is the size of its intermediate results. – to simplify the example – other cost measures, #I/O access, are possible. • cost( R ) = 0 (no intermediate results) • cost(R ⋈ S) = 0 (no intermediate results) • cost( (R ⋈ S) ⋈ T) = cost(R ⋈ S) + cost(T) + size( R ⋈ S ) = size(R ⋈ S) 18 Dynamic programming: example • Relations: R, S, T, U • Number of tuples: 2000, 5000, 3000, 1000 • We use a toy size estimation method: – size (A ⋈ B) = 0.01 * T(A) * T(B) 19 Query Size Cost Plan RS RT RU ST SU TU RST RSU RTU STU RSTU 20 Query Size Cost Plan RS 100k 0 RS RT 60k 0 RT RU 20k 0 UR ST 150k 0 TS SU 50k 0 US TU 30k 0 UT RST RSU RTU STU RSTU 21 Query Size Cost Plan RS 100k 0 RS RT 60k 0 RT RU 20k 0 UR ST 150k 0 TS SU 50k 0 US TU 30k 0 UT RST 3M 60k S(RT) RSU 1M 20k S(UR) RTU 0.6M 20k T(UR) STU 1.5M 30k S(UT) RSTU 22 Query Size Cost Plan RS 100k 0 RS RT 60k 0 RT RU 20k 0 UR ST 150k 0 TS SU 50k 0 US TU 30k 0 UT RST 3M 60k S(RT) RSU 1M 20k S(UR) RTU 0.6M 20k T(UR) STU 1.5M 30k S(UT) RSTU 30M 110k (US)(RT) 23 Reducing search space • The algorithm requires exponential computation! • System-R style considers only left-deep joins U U T R S T T U R S S R • Left-deep trees allow us to generate all fully pipelined plans – Intermediate results not written to temporary files. • Not all left-deep trees are fully pipelined (e.g., SM join). • 24 Reducing search space • System R-style ignores plans with Cartesian products – The size of a Cartesian product is generally larger than (natural) joins. • Example: R(A,B), S(B,C), U(C,D) (R ⋈ U) ⋈ S has a Cartesian product pick (R ⋈ S) ⋈ U instead 25 What you should know • The importance of query optimization – difference between fast and slow plans • Query optimization problem – find the fast plans efficiently. • The components of a cost-based (system R style) query optimizer: – plan space definition – cost estimation – search algorithm 26