Query Processing CS 245: Database System Principles Q Query Plan Notes 6: Query Processing Hector Garcia-Molina CS 245 Notes 6 1 CS 245 Notes 6 2 Query Processing Example Q Query Plan Select B,D From R,S Where R.A = “c” S.E = 2 R.C=S.C Focus: Relational System • Others? CS 245 Notes 6 R A B C a 1 10 b 1 c 3 S C CS 245 Notes 6 D E R A B C 10 x 2 a 1 10 20 20 y 2 b 1 2 10 30 z 2 c d 2 35 40 x 1 e 3 45 50 y 3 D E 10 x 2 20 20 y 2 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 Answer CS 245 Notes 6 5 CS 245 4 S C B 2 Notes 6 D x 6 1 RXS • How do we execute query? - Do Cartesian product - Select tuples - Do projection One idea CS 245 RXS R.A R.B R.C S.C S.D S.E Notes 6 7 Bingo! Got one... 1 10 10 x 2 a . . 1 10 20 y 2 C . . 2 1 10 10 x 2 a . . 1 10 20 y 2 C . . 2 10 10 x 2 CS 245 Notes 6 8 Relational Algebra - can be used to describe plans... R.A R.B R.C S.C S.D S.E a a Ex: Plan I B,D R.A=“c” S.E=2 R.C=S.C 10 10 x 2 X R CS 245 Notes 6 9 Relational Algebra - can be used to describe plans... S CS 245 Notes 6 10 Another idea: Ex: Plan I Plan II B,D R.A=“c” S.E=2 R.C=S.C R.A = “c” X R B,D R S S.E = 2 natural join S OR: B,D [ R.A=“c” S.E=2 R.C = S.C (RXS)] CS 245 Notes 6 11 CS 245 Notes 6 12 2 R S A B C (R) a 1 10 A B C b 1 20 c 2 10 (S) C D E C D E 10 x 2 10 x 2 20 y 2 c 2 10 20 y 2 30 z 2 d 2 35 30 z 2 40 x 1 e 3 45 Plan III Use R.A and S.C Indexes (1) Use R.A index to select R tuples with R.A = “c” (2) For each R.C value found, use S.C index to find matching tuples 50 y 3 CS 245 Notes 6 13 CS 245 Plan III Notes 6 R Use R.A and S.C Indexes (1) Use R.A index to select R tuples with R.A = “c” (2) For each R.C value found, use S.C index to find matching tuples A B C (3) Eliminate S tuples S.E 2 (4) Join matching R,S tuples, project B,D attributes and place in result CS 245 Notes 6 R A B C C S C I1 I2 C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 CS 245 Notes 6 S R A=“c” 16 S C C D E A B C 10 x 2 a 1 10 20 y 2 b 1 20 c 2 10 30 z 2 c 2 10 30 z 2 d 2 35 40 x 1 d 2 35 40 x 1 e 3 45 50 y 3 e 3 45 50 y 3 a 1 10 b 1 20 CS 245 A=“c” 15 A 14 I1 I2 <c,2,10> Notes 6 17 CS 245 I1 I2 <c,2,10> <10,x,2> Notes 6 C D E 10 x 2 20 y 2 18 3 R A B C a 1 10 b 1 20 A=“c” I1 I2 <c,2,10> <10,x,2> check=2? c 2 10 d 2 35 C output: <2,x> e 3 45 S R C D E A B C A=“c” S C I1 10 x 2 a 1 10 20 y 2 b 1 20 30 z 2 c 2 10 40 x 1 d 2 35 50 y 3 e 3 45 C D E I2 10 x 2 <c,2,10> <10,x,2> 20 y 2 check=2? 30 z 2 40 x 1 output: <2,x> 50 y 3 next tuple: <c,7,15> CS 245 Notes 6 19 CS 245 Notes 6 20 SQL query parse Overview of Query Optimization parse tree convert answer logical query plan apply laws “improved” l.q.p estimate result sizes l.q.p. +sizes execute statistics Pi pick best {(P1,C1),(P2,C2)...} estimate costs consider physical plans {P1,P2,…..} CS 245 Notes 6 21 22 <Query> SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ ); <SFW> SELECT <SelList> FROM <FromList> <Attribute> <RelName> title StarsIn WHERE <Condition> <Tuple> IN <Query> <Attribute> starName SELECT (Find the movies with stars born in 1960) Notes 6 Notes 6 Example: Parse Tree Example: SQL query CS 245 CS 245 23 CS 245 <SelList> FROM <FromList> <Attribute> <RelName> name MovieStar Notes 6 WHERE ( <Query> ) <SFW> <Condition> <Attribute> LIKE <Pattern> birthDate ‘%1960’ 24 4 Example: Generating Relational Algebra Example: Logical Query Plan title title starName=name StarsIn <condition> <tuple> IN name birthdate LIKE ‘%1960’ <attribute> starName birthdate LIKE ‘%1960’ MovieStar MovieStar Fig. 7.15: An expression using a two-argument , midway between a parse tree and relational algebra CS 245 Notes 6 Fig. 7.18: Applying the rule for IN conditions 25 Example: Improved Logical Query Plan CS 245 Notes 6 Example: title 26 Estimate Result Sizes Question: Push project to StarsIn? starName=name StarsIn name StarsIn Need expected size name StarsIn birthdate LIKE ‘%1960’ MovieStar MovieStar Fig. 7.20: An improvement on fig. 7.18. CS 245 Notes 6 Example: 27 One Physical Plan CS 245 Notes 6 28 Example: Estimate costs L.Q.P Hash join SEQ scan StarsIn Parameters: join order, memory size, project attributes,... index scan Parameters: Select Condition,... P1 P2 …. Pn C1 C2 …. Cn MovieStar Pick best! CS 245 Notes 6 29 CS 245 Notes 6 30 5 Textbook outline Chapter 15 5 Algebra for queries [Ch 5] Chapter 16 16.1[16.1] 16.2[16.2] 16.3[16.3] Parsing Algebraic laws Parse tree -> logical query plan 16.4[16.4] Estimating result sizes 16.5-7[16.5-7] Cost based optimization [bags vs sets] - Select, project, join, …. [project list a,a+b->x,…] - Duplicate elimination, grouping, sorting 15.1 Physical operators [15.1] - Scan,sort, … 15.2 - 15.6 Implementing operators + [15.2-15.6] estimating their cost CS 245 Notes 6 31 CS 245 Notes 6 Reading textbook - Chapters 15, 16 Query Optimization - In class order Optional: • Relational algebra level (A) • Detailed query plan level – Sections 15.7, 15.8, 15.9 [15.7, 15.8] – Sections 16.6, 16.7 [16.6, 16.7] 32 – Estimate Costs (B) Optional: Duplicate elimination operator grouping, aggregation operators • without indexes • with indexes – Generate and compare plans (C) CS 245 Notes 6 33 CS 245 Notes 6 34 SQL query Relational algebra optimization parse parse tree convert answer logical query plan (A) apply laws “improved” l.q.p (B) estimate result sizes l.q.p. +sizes consider physical plans execute statistics Pi • Transformation rules (preserve equivalence) • What are good transformations? pick best {(P1,C1),(P2,C2)...} (C) (C) estimate costs {P1,P2,…..} CS 245 Notes 6 35 CS 245 Notes 6 36 6 Rules: R (R Note: Natural joins & cross products & union S S) = T S R =R (S • Carry attribute names in results, so order is not important • Can also write as trees, e.g.: T) T R CS 245 Notes 6 Rules: R (R 37 Natural joins & cross products & union S S) = T S R =R (S R S CS 245 S T Notes 6 38 Notes 6 40 Notes 6 42 Rules: Selects T) p1p2(R) = p1vp2(R) = RxS=SxR (R x S) x T = R x (S x T) RUS=SUR R U (S U T) = (R U S) U T CS 245 Notes 6 39 Bags vs. Sets Rules: Selects p1p2(R) = p1vp2(R) = CS 245 CS 245 R = {a,a,b,b,b,c} S = {b,b,c,c,d} RUS = ? p1 [ p2 (R)] [ p1 (R)] U [ p2 (R)] Notes 6 41 CS 245 7 Bags vs. Sets Option 2 (MAX) makes this rule work: R = {a,a,b,b,b,c} S = {b,b,c,c,d} RUS = ? Example: R={a,a,b,b,b,c} P1 satisfied by a,b; P2 satisfied by b,c p1vp2 (R) = p1(R) U p2(R) • Option 1 SUM RUS = {a,a,b,b,b,b,b,c,c,c,d} • Option 2 MAX RUS = {a,a,b,b,b,c,c,d} CS 245 Notes 6 43 Option 2 (MAX) makes this rule work: 44 Senators (……) Example: R={a,a,b,b,b,c} P1 satisfied by a,b; P2 satisfied by b,c T1 = p1vp2 (R) = {a,a,b,b,b,c} p1(R) = {a,a,b,b,b} p2(R) = {b,b,b,c} p1(R) U p2 (R) = {a,a,b,b,b,c} Notes 6 Notes 6 “Sum” option makes more sense: p1vp2 (R) = p1(R) U p2(R) CS 245 CS 245 T1 Rep (……) yr,state Senators; Yr 14 16 15 State CA CA AZ T2 = T2 yr,state Reps Yr 16 16 15 State CA CA CA Union? 45 CS 245 Notes 6 46 Rules: Project Executive Decision Let: X = set of attributes Y = set of attributes XY = X U Y -> Use “SUM” option for bag unions -> Some rules cannot be used for bags xy (R) = CS 245 Notes 6 47 CS 245 Notes 6 48 8 Rules: Project Rules: Project Let: X = set of attributes Y = set of attributes XY = X U Y Let: X = set of attributes Y = set of attributes XY = X U Y xy (R) = x [y (R)] xy (R) = x [y (R)] CS 245 Notes 6 49 CS 245 Notes 6 50 Rules: combined Rules: combined Let p = predicate with only R attribs q = predicate with only S attribs m = predicate with only R,S attribs Let p = predicate with only R attribs q = predicate with only S attribs m = predicate with only R,S attribs p (R S) = q (R S) = CS 245 Notes 6 Rules: combined 51 CS 245 S) = q (R S) = CS 245 (R)] R [ [ p S q (S)] Notes 6 52 pq (R S) = [p (R)] [q (S)] pqm (R S) = m [(p R) (q S)] pvq (R S) = [(p R) S] U [R (q S)] S) = S) = S) = Notes 6 (R Do one, others for homework: (continued) Some Rules can be Derived: pq (R pqm (R pvq (R p 53 CS 245 Notes 6 54 9 --> Derivation for first one: --> Derivation for first one: pq (R pq (R S) = p [q (R S) ] = p [ R q (S) ] = [p (R)] [q (S)] S) = CS 245 Rules: Notes 6 55 combined CS 245 Rules: Notes 6 56 combined Let x = subset of R attributes z = attributes in predicate P (subset of R attributes) Let x = subset of R attributes z = attributes in predicate P (subset of R attributes) x[p (R) ] = x[p (R) ] = CS 245 Rules: Notes 6 57 combined CS 245 Rules: {p [ x (R) ]} Notes 6 combined Let x = subset of R attributes z = attributes in predicate P (subset of R attributes) Let x = subset of R attributes y = subset of S attributes z = intersection of R,S attributes xz x[p (R) ] = x {p [ x xy (R CS 245 Notes 6 (R) ]} 59 CS 245 58 S) = Notes 6 60 10 Rules: combined xy {p (R Let x = subset of R attributes y = subset of S attributes z = intersection of R,S attributes xy (R CS 245 [yz (S) ]} Notes 6 xy {p (R S)} 61 yz’ (S)]} z’ = z U {attributes used in P CS 245 Notes 6 62 Rules for combined with X = xy {p [xz’ (R) Rules = S) = xy{[xz (R) ] CS 245 S)} similar... e.g., } Notes 6 63 p (R X S) = ? CS 245 Notes 6 64 Which are “good” transformations? U combined: p1p2 (R) p1 [p2 (R)] p (R S) [p (R)] S p(R U S) = p(R) U p(S) p(R - S) = p(R) - S = p(R) - p(S) R S S R x [p (R)] x {p [xz (R)]} CS 245 Notes 6 65 CS 245 Notes 6 66 11 Conventional wisdom: do projects early But What if we have A, B indexes? Example: R(A,B,C,D,E) x={E} P: (A=3) (B=“cat”) x {p (R)} vs. B = “cat” A=3 E {p{ABE(R)}} Intersect pointers to get pointers to matching tuples CS 245 Notes 6 67 CS 245 Notes 6 Bottom line: In textbook: more transformations • No transformation is always good • Usually good: early selections • Eliminate common sub-expressions • Other operations: duplicate elimination CS 245 Notes 6 69 CS 245 Notes 6 Outline - Query Processing • Estimating cost of query plan • Relational algebra level (1) Estimating size of results (2) Estimating # of IOs – transformations – good transformations 68 70 • Detailed query plan level – estimate costs – generate and compare plans CS 245 Notes 6 71 CS 245 Notes 6 72 12 Example R Estimating result size A B C D cat 1 10 a • Keep statistics for relation R cat 1 20 b dog 1 30 a – T(R) : # tuples in R – S(R) : # of bytes in each R tuple – B(R): # of blocks to hold all R tuples – V(R, A) : # distinct values in R for attribute A CS 245 Example R Notes 6 A B C D cat 1 10 a cat 1 20 b dog 1 30 a dog 1 40 c dog 1 40 c A: 20 byte string B: 4 byte integer C: 8 byte date D: 5 byte string bat 1 50 d 73 A: 20 byte string B: 4 byte integer C: 8 byte date D: 5 byte string bat 1 50 d CS 245 Notes 6 74 Size estimates for W = R1 x R2 T(W) = S(W) = T(R) = 5 S(R) = 37 V(R,A) = 3 V(R,C) = 5 V(R,B) = 1 V(R,D) = 4 CS 245 Notes 6 75 Size estimates for W = R1 x R2 CS 245 Size estimate for W = T(W) = T(R1) T(R2) S(W) = S(R) S(W) = S(R1) + S(R2) T(W) = ? CS 245 Notes 6 Notes 6 77 CS 245 Notes 6 76 A=a (R) 78 13 Example R A B C D cat 1 10 a cat 1 20 b dog 1 30 a dog 1 40 c Example R V(R,A)=3 V(R,B)=1 V(R,C)=5 V(R,D)=4 z=val(R) A B C D cat 1 10 a cat 1 20 b dog 1 30 a dog 1 40 c W= 79 CS 245 z=val(R) T(W) = CS 245 Notes 6 80 T(R) V(R,Z) Notes 6 81 CS 245 Notes 6 Example R A B C D cat 1 10 a cat 1 20 b Values in select expression Z = val are uniformly distributed over domain with DOM(R,Z) values. dog 1 30 a dog 1 40 c 82 Alternate assumption V(R,A)=3 DOM(R,A)=10 V(R,B)=1 DOM(R,B)=10 V(R,C)=5 DOM(R,C)=10 V(R,D)=4 DOM(R,D)=10 bat 1 50 d W= Notes 6 T(W) = Values in select expression Z = val are uniformly distributed over possible V(R,Z) values. Alternate Assumption: CS 245 z=val(R) what is probability this tuple will be in answer? Assumption: V(R,A)=3 V(R,B)=1 V(R,C)=5 V(R,D)=4 bat 1 50 d W= V(R,A)=3 V(R,B)=1 V(R,C)=5 V(R,D)=4 D bat 1 50 d Notes 6 Example R C dog 1 40 c T(W) = CS 245 B cat 1 20 b dog 1 30 a bat 1 50 d W= A cat 1 10 a 83 CS 245 z=val(R) T(W) = ? Notes 6 84 14 Example R A B C D cat 1 10 a cat 1 20 b dog 1 30 a dog 1 40 c Alternate assumption V(R,A)=3 DOM(R,A)=10 V(R,B)=1 DOM(R,B)=10 V(R,C)=5 DOM(R,C)=10 V(R,D)=4 DOM(R,D)=10 bat 1 50 d W= z=val(R) 85 Selection cardinality Notes 6 What about W = z val (R) C D dog 1 40 c Alternate assumption V(R,A)=3 DOM(R,A)=10 V(R,B)=1 DOM(R,B)=10 V(R,C)=5 DOM(R,C)=10 V(R,D)=4 DOM(R,D)=10 bat 1 50 d z=val(R) T(W) = T(R) DOM(R,Z) CS 245 Notes 6 What about W = z val (R) SC(R,A) = average # records that satisfy equality condition on R.A T(R) V(R,A) SC(R,A) = T(R) DOM(R,A) CS 245 B cat 1 20 b W= Notes 6 A cat 1 10 a dog 1 30 a what is probability this tuple will be in answer? T(W) = ? CS 245 Example R 86 ? T(W) = ? 87 ? CS 245 Notes 6 What about W = z val (R) T(W) = ? T(W) = ? • Solution # 1: T(W) = T(R)/2 • Solution # 1: T(W) = T(R)/2 88 ? • Solution # 2: T(W) = T(R)/3 CS 245 Notes 6 89 CS 245 Notes 6 90 15 • Solution # 3: Estimate values in range • Solution # 3: Estimate values in range Example R Example R Z Min=1 Z V(R,Z)=10 Min=1 W= z 15 (R) V(R,Z)=10 W= z 15 (R) Max=20 Max=20 f = 20-15+1 = 6 20-1+1 20 (fraction of range) T(W) = f T(R) CS 245 Notes 6 91 Equivalently: fV(R,Z) = fraction of distinct values T(W) = [f V(Z,R)] T(R) = f T(R) V(Z,R) CS 245 Notes 6 Size estimate for W = R1 Notes 6 92 Size estimate for W = R1 R2 Let x = attributes of R1 y = attributes of R2 93 R2 CS 245 Notes 6 Case 2 R1 Let x = attributes of R1 y = attributes of R2 Case 1 CS 245 A W = R1 B C 94 R2 R2 A XY=A D XY= Same as R1 x R2 CS 245 Notes 6 95 CS 245 Notes 6 96 16 W = R1 Case 2 R1 A B C XY=A R2 R2 A Computing T(W) when V(R1,A) V(R2,A) R1 D A B C Take 1 tuple R2 A D Match Assumption: V(R1,A) V(R2,A) Every A value in R1 is in R2 V(R2,A) V(R1,A) Every A value in R2 is in R1 “containment of value sets” Sec. 7.4.4 CS 245 Notes 6 97 Computing T(W) when V(R1,A) V(R2,A) R1 A B Take 1 tuple C R2 A D CS 245 Notes 6 • V(R1,A) V(R2,A) 98 T(W) = T(R2) T(R1) V(R2,A) Match • V(R2,A) V(R1,A) T(W) = T(R2) T(R1) V(R1,A) 1 tuple matches with T(R2) tuples... V(R2,A) so T(W) = CS 245 Notes 6 In general T(W) = [A is common attribute] T(R2) T(R1) V(R2, A) W = R1 99 CS 245 Notes 6 100 Case 2 with alternate assumption R2 Values uniformly distributed over domain T(R2) T(R1) max{ V(R1,A), V(R2,A) } R1 A B C R2 A D This tuple matches T(R2)/DOM(R2,A) so T(W) = T(R2) T(R1) = T(R2) T(R1) DOM(R2, A) DOM(R1, A) Assume the same CS 245 Notes 6 101 CS 245 Notes 6 102 17 Using similar ideas, we can estimate sizes of: In all cases: S(W) = S(R1) + S(R2) - S(A) size of attribute A AB (R) ….. Sec. 16.4.2 (same for either edition) A=aB=b (R) …. Sec. 16.4.3 R S with common attribs. A,B,C Sec. 16.4.5 Union, intersection, diff, …. Sec. 16.4.7 CS 245 Notes 6 103 Note: for complex expressions, need intermediate T,S,V results. E.g. W = [A=a (R1) ] R2 Treat as relation U T(U) = T(R1)/V(R1,A) S(U) = S(R1) CS 245 Notes 6 104 To estimate Vs E.g., U = A=a (R1) Say R1 has attribs A,B,C,D V(U, A) = V(U, B) = V(U, C) = V(U, D) = Also need V (U, *) !! CS 245 Notes 6 Example R1 A B C 105 D cat 1 20 20 dog 1 30 10 dog 1 40 30 bat 1 50 10 U= Notes 6 Example R1 A V(R1,A)=3 V(R1,B)=1 V(R1,C)=5 V(R1,D)=3 cat 1 10 10 CS 245 B C V(R1,A)=3 V(R1,B)=1 V(R1,C)=5 V(R1,D)=3 D cat 1 10 10 cat 1 20 20 dog 1 30 10 dog 1 40 30 A=a (R1) 106 bat 1 50 10 U= A=a (R1) V(U,A) =1 V(U,B) =1 V(U,C) = T(R1) V(R1,A) V(D,U) ... somewhere in between CS 245 Notes 6 107 CS 245 Notes 6 108 18 Possible Guess V(U,A) V(U,B) U= For Joins A=a (R) =1 = V(R,B) U = R1(A,B) R2(A,C) V(U,A) = min { V(R1, A), V(R2, A) } V(U,B) = V(R1, B) V(U,C) = V(R2, C) [called “preservation of value sets” in section 7.4.4] CS 245 Notes 6 109 Example: Z = R1(A,B) R1 R2 R3 R2(B,C) Z=U Notes 6 Notes 6 110 Partial Result: U = R1 R3(C,D) T(U) = 10002000 200 T(R1) = 1000 V(R1,A)=50 V(R1,B)=100 T(R2) = 2000 V(R2,B)=200 V(R2,C)=300 T(R3) = 3000 V(R3,C)=90 V(R3,D)=500 CS 245 CS 245 111 CS 245 R2 V(U,A) = 50 V(U,B) = 100 V(U,C) = 300 Notes 6 112 A Note on Histograms R3 40 T(Z) = 100020003000 V(Z,A) = 50 200300 V(Z,B) = 100 V(Z,C) = 90 V(Z,D) = 500 number of tuples in R with A value in given range 30 20 10 10 20 30 40 A=val(R) = ? CS 245 Notes 6 113 CS 245 Notes 6 114 19 Summary Outline • Estimating size of results is an “art” • Estimating cost of query plan • Don’t forget: Statistics must be kept up to date… (cost?) CS 245 Notes 6 115 – Estimating size of results – Estimating # of IOs done! next… • Generate and compare plans CS 245 Notes 6 116 20