CS 245: Database System Principles Notes 6: Query Processing Query Processing

advertisement
Query Processing
CS 245: Database System
Principles
Q  Query Plan
Notes 6: Query Processing
Hector Garcia-Molina
CS 245
Notes 6
1
CS 245
Notes 6
2
Query Processing
Example
Q  Query Plan
Select B,D
From R,S
Where R.A = “c”  S.E = 2  R.C=S.C
Focus: Relational System
• Others?
CS 245
Notes 6
R A
B
C
a
1
10
b
1
c
3
S C
CS 245
Notes 6
D
E
R A
B
C
10
x
2
a
1
10
20
20
y
2
b
1
2
10
30
z
2
c
d
2
35
40
x
1
e
3
45
50
y
3
D
E
10
x
2
20
20
y
2
2
10
30
z
2
d
2
35
40
x
1
e
3
45
50
y
3
Answer
CS 245
Notes 6
5
CS 245
4
S C
B
2
Notes 6
D
x
6
1
RXS
• How do we execute query?
- Do Cartesian product
- Select tuples
- Do projection
One idea
CS 245
RXS
R.A R.B R.C S.C S.D S.E
Notes 6
7
Bingo!
Got one...
1
10
10
x
2
a
.
.
1
10
20
y
2
C
.
.
2
1
10
10
x
2
a
.
.
1
10
20
y
2
C
.
.
2
10
10
x
2
CS 245
Notes 6
8
Relational Algebra - can be used to
describe plans...
R.A R.B R.C S.C S.D S.E
a
a
Ex: Plan I
B,D
R.A=“c” S.E=2  R.C=S.C
10
10
x
2
X
R
CS 245
Notes 6
9
Relational Algebra - can be used to
describe plans...
S
CS 245
Notes 6
10
Another idea:
Ex: Plan I
Plan II
B,D
R.A=“c” S.E=2  R.C=S.C
R.A = “c”
X
R
B,D
R
S
S.E = 2
natural join
S
OR: B,D [ R.A=“c” S.E=2  R.C = S.C (RXS)]
CS 245
Notes 6
11
CS 245
Notes 6
12
2
R
S
A B C
(R)
a 1 10
A B C
b 1 20
c 2 10
(S)
C D E
C D E
10 x 2
10 x 2
20 y 2
c 2 10
20 y 2
30 z 2
d 2 35
30 z 2
40 x 1
e 3 45
Plan III
Use R.A and S.C Indexes
(1) Use R.A index to select R tuples
with R.A = “c”
(2) For each R.C value found, use S.C
index to find matching tuples
50 y 3
CS 245
Notes 6
13
CS 245
Plan III
Notes 6
R
Use R.A and S.C Indexes
(1) Use R.A index to select R tuples
with R.A = “c”
(2) For each R.C value found, use S.C
index to find matching tuples
A B C
(3) Eliminate S tuples S.E  2
(4) Join matching R,S tuples, project
B,D attributes and place in result
CS 245
Notes 6
R
A B C
C
S
C
I1
I2
C D E
a 1 10
10 x 2
b 1 20
20 y 2
c 2 10
30 z 2
d 2 35
40 x 1
e 3 45
50 y 3
CS 245
Notes 6
S
R
A=“c”
16
S
C
C D E
A B C
10 x 2
a 1 10
20 y 2
b 1 20
c 2 10
30 z 2
c 2 10
30 z 2
d 2 35
40 x 1
d 2 35
40 x 1
e 3 45
50 y 3
e 3 45
50 y 3
a 1 10
b 1 20
CS 245
A=“c”
15
A
14
I1
I2
<c,2,10>
Notes 6
17
CS 245
I1
I2
<c,2,10> <10,x,2>
Notes 6
C D E
10 x 2
20 y 2
18
3
R
A B C
a 1 10
b 1 20
A=“c”
I1
I2
<c,2,10> <10,x,2>
check=2?
c 2 10
d 2 35
C
output: <2,x>
e 3 45
S
R
C D E
A B C
A=“c”
S
C
I1
10 x 2
a 1 10
20 y 2
b 1 20
30 z 2
c 2 10
40 x 1
d 2 35
50 y 3
e 3 45
C D E
I2
10 x 2
<c,2,10> <10,x,2>
20 y 2
check=2?
30 z 2
40 x 1
output: <2,x>
50 y 3
next tuple:
<c,7,15>
CS 245
Notes 6
19
CS 245
Notes 6
20
SQL query
parse
Overview of Query Optimization
parse tree
convert
answer
logical query plan
apply laws
“improved” l.q.p
estimate result sizes
l.q.p. +sizes
execute
statistics
Pi
pick best
{(P1,C1),(P2,C2)...}
estimate costs
consider physical plans
{P1,P2,…..}
CS 245
Notes 6
21
22
<Query>
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’
);
<SFW>
SELECT <SelList>
FROM
<FromList>
<Attribute>
<RelName>
title
StarsIn
WHERE
<Condition>
<Tuple> IN <Query>
<Attribute>
starName
SELECT
(Find the movies with stars born in 1960)
Notes 6
Notes 6
Example: Parse Tree
Example: SQL query
CS 245
CS 245
23
CS 245
<SelList>
FROM
<FromList>
<Attribute>
<RelName>
name
MovieStar
Notes 6
WHERE
( <Query> )
<SFW>
<Condition>
<Attribute> LIKE <Pattern>
birthDate
‘%1960’
24
4
Example: Generating Relational Algebra
Example: Logical Query Plan
title
title

starName=name
StarsIn

<condition>
<tuple>
IN
name
birthdate LIKE ‘%1960’
<attribute>
starName
birthdate LIKE ‘%1960’
MovieStar
MovieStar
Fig. 7.15: An expression using a two-argument , midway between a parse tree
and relational algebra
CS 245
Notes 6
Fig. 7.18: Applying the rule for IN conditions
25
Example: Improved Logical Query Plan
CS 245
Notes 6
Example:
title
26
Estimate Result Sizes
Question:
Push project to
StarsIn?
starName=name
StarsIn
name
StarsIn
Need expected size
name
StarsIn
birthdate LIKE ‘%1960’


MovieStar
MovieStar
Fig. 7.20: An improvement on fig. 7.18.
CS 245
Notes 6
Example:
27
One Physical Plan
CS 245
Notes 6
28
Example: Estimate costs
L.Q.P
Hash join
SEQ scan
StarsIn
Parameters: join order,
memory size, project attributes,...
index scan
Parameters:
Select Condition,...
P1
P2
….
Pn
C1
C2
….
Cn
MovieStar
Pick best!
CS 245
Notes 6
29
CS 245
Notes 6
30
5
Textbook outline
Chapter 15
5
Algebra for queries
[Ch 5]
Chapter 16
16.1[16.1]
16.2[16.2]
16.3[16.3]
Parsing
Algebraic laws
Parse tree -> logical query
plan
16.4[16.4]
Estimating result sizes
16.5-7[16.5-7] Cost based optimization
[bags vs sets]
- Select, project, join, ….
[project list
a,a+b->x,…]
- Duplicate elimination, grouping, sorting
15.1 Physical operators
[15.1]
- Scan,sort, …
15.2 - 15.6 Implementing operators +
[15.2-15.6]
estimating their cost
CS 245
Notes 6
31
CS 245
Notes 6
Reading textbook - Chapters 15, 16
Query Optimization - In class order
Optional:
• Relational algebra level (A)
• Detailed query plan level
– Sections 15.7, 15.8, 15.9 [15.7, 15.8]
– Sections 16.6, 16.7 [16.6, 16.7]
32
– Estimate Costs (B)
Optional: Duplicate elimination operator
grouping, aggregation operators
• without indexes
• with indexes
– Generate and compare plans (C)
CS 245
Notes 6
33
CS 245
Notes 6
34
SQL query
Relational algebra optimization
parse
parse tree
convert
answer
logical query plan
(A) apply laws
“improved” l.q.p
(B) estimate result sizes
l.q.p. +sizes
consider physical plans
execute
statistics
Pi
• Transformation rules
(preserve equivalence)
• What are good transformations?
pick best
{(P1,C1),(P2,C2)...}
(C)
(C)
estimate costs
{P1,P2,…..}
CS 245
Notes 6
35
CS 245
Notes 6
36
6
Rules:
R
(R
Note:
Natural joins & cross products & union
S
S)
=
T
S
R
=R
(S
• Carry attribute names in results, so
order is not important
• Can also write as trees, e.g.:
T)
T
R
CS 245
Notes 6
Rules:
R
(R
37
Natural joins & cross products & union
S
S)
=
T
S
R
=R
(S
R
S
CS 245
S
T
Notes 6
38
Notes 6
40
Notes 6
42
Rules: Selects
T)
p1p2(R) =
p1vp2(R) =
RxS=SxR
(R x S) x T = R x (S x T)
RUS=SUR
R U (S U T) = (R U S) U T
CS 245
Notes 6
39
Bags vs. Sets
Rules: Selects
p1p2(R) =
p1vp2(R) =
CS 245
CS 245
R = {a,a,b,b,b,c}
S = {b,b,c,c,d}
RUS = ?
p1 [ p2 (R)]
[ p1 (R)] U [ p2 (R)]
Notes 6
41
CS 245
7
Bags vs. Sets
Option 2 (MAX) makes this rule work:
R = {a,a,b,b,b,c}
S = {b,b,c,c,d}
RUS = ?
Example: R={a,a,b,b,b,c}
P1 satisfied by a,b; P2 satisfied by b,c
p1vp2 (R) = p1(R) U p2(R)
• Option 1 SUM
RUS = {a,a,b,b,b,b,b,c,c,c,d}
• Option 2 MAX
RUS = {a,a,b,b,b,c,c,d}
CS 245
Notes 6
43
Option 2 (MAX) makes this rule work:
44
Senators (……)
Example: R={a,a,b,b,b,c}
P1 satisfied by a,b; P2 satisfied by b,c
T1 =
p1vp2 (R) = {a,a,b,b,b,c}
p1(R) = {a,a,b,b,b}
p2(R) = {b,b,b,c}
p1(R) U p2 (R) = {a,a,b,b,b,c}
Notes 6
Notes 6
“Sum” option makes more sense:
p1vp2 (R) = p1(R) U p2(R)
CS 245
CS 245
T1
Rep (……)
yr,state Senators;
Yr
14
16
15
State
CA
CA
AZ
T2 =
T2
yr,state Reps
Yr
16
16
15
State
CA
CA
CA
Union?
45
CS 245
Notes 6
46
Rules: Project
Executive Decision
Let: X = set of attributes
Y = set of attributes
XY = X U Y
-> Use “SUM” option for bag unions
-> Some rules cannot be used for bags
xy (R) =
CS 245
Notes 6
47
CS 245
Notes 6
48
8
Rules: Project
Rules: Project
Let: X = set of attributes
Y = set of attributes
XY = X U Y
Let: X = set of attributes
Y = set of attributes
XY = X U Y
xy (R) = x [y (R)]
xy (R) = x [y (R)]
CS 245
Notes 6
49
CS 245
Notes 6
50
Rules: combined
Rules: combined
Let p = predicate with only R attribs
q = predicate with only S attribs
m = predicate with only R,S attribs
Let p = predicate with only R attribs
q = predicate with only S attribs
m = predicate with only R,S attribs




p
(R
S) =
q
(R
S) =
CS 245
Notes 6
Rules: combined
51
CS 245
S) =
q
(R
S) =
CS 245

(R)]
R
[
[
p
S

q
(S)]
Notes 6
52
pq (R S) = [p (R)] [q (S)]
pqm (R S) =
m [(p R) (q S)]
pvq (R S) =
[(p R) S] U [R (q S)]
S) =
S) =
S) =
Notes 6
(R
Do one, others for homework:
(continued)
Some Rules can be Derived:
pq (R
pqm (R
pvq (R
p
53
CS 245
Notes 6
54
9
--> Derivation for first one:
--> Derivation for first one:
pq (R
pq (R S) =
p [q (R S) ] =
p [ R q (S) ] =
[p (R)]
[q (S)]
S) =
CS 245
Rules:
Notes 6
55
combined
CS 245
Rules:
Notes 6
56
combined
Let x = subset of R attributes
z = attributes in predicate P
(subset of R attributes)
Let x = subset of R attributes
z = attributes in predicate P
(subset of R attributes)
x[p (R) ] =
x[p (R) ] =
CS 245
Rules:
Notes 6
57
combined
CS 245
Rules:
{p [ x
(R)
]}
Notes 6
combined
Let x = subset of R attributes
z = attributes in predicate P
(subset of R attributes)
Let x = subset of R attributes
y = subset of S attributes
z = intersection of R,S attributes
xz
x[p (R) ] = x {p [ x
xy (R
CS 245
Notes 6
(R)
]}
59
CS 245
58
S) =
Notes 6
60
10
Rules:
combined
xy {p (R
Let x = subset of R attributes
y = subset of S attributes
z = intersection of R,S attributes
xy (R
CS 245
[yz (S) ]}
Notes 6
xy {p (R
S)}
61
yz’ (S)]}
z’ = z U {attributes used in P
CS 245
Notes 6
62
Rules for  combined with X
=
xy {p [xz’ (R)
Rules
=
S) =
xy{[xz (R) ]
CS 245
S)}
similar...
e.g.,
}
Notes 6
63
p (R X S) = ?
CS 245
Notes 6
64
Which are “good” transformations?
 U combined:
p1p2 (R)  p1 [p2 (R)]
p (R S)  [p (R)] S
p(R U S) = p(R) U p(S)
p(R - S) = p(R) - S = p(R) - p(S)
R
S  S
R
x [p (R)]  x {p [xz (R)]}
CS 245
Notes 6
65
CS 245
Notes 6
66
11
Conventional wisdom:
do projects early
But What if we have A, B indexes?
Example: R(A,B,C,D,E) x={E}
P: (A=3)  (B=“cat”)
x {p (R)}
vs.
B = “cat”
A=3
E {p{ABE(R)}}
Intersect pointers to get
pointers to matching tuples
CS 245
Notes 6
67
CS 245
Notes 6
Bottom line:
In textbook: more transformations
• No transformation is always good
• Usually good: early selections
• Eliminate common sub-expressions
• Other operations: duplicate elimination
CS 245
Notes 6
69
CS 245
Notes 6
Outline - Query Processing
• Estimating cost of query plan
• Relational algebra level
(1) Estimating size of results
(2) Estimating # of IOs
– transformations
– good transformations
68
70
• Detailed query plan level
– estimate costs
– generate and compare plans
CS 245
Notes 6
71
CS 245
Notes 6
72
12
Example
R
Estimating result size
A
B
C
D
cat 1 10 a
• Keep statistics for relation R
cat 1 20 b
dog 1 30 a
– T(R) : # tuples in R
– S(R) : # of bytes in each R tuple
– B(R): # of blocks to hold all R tuples
– V(R, A) : # distinct values in R
for attribute A
CS 245
Example
R
Notes 6
A
B
C
D
cat 1 10 a
cat 1 20 b
dog 1 30 a
dog 1 40 c
dog 1 40 c
A: 20 byte string
B: 4 byte integer
C: 8 byte date
D: 5 byte string
bat 1 50 d
73
A: 20 byte string
B: 4 byte integer
C: 8 byte date
D: 5 byte string
bat 1 50 d
CS 245
Notes 6
74
Size estimates for W = R1 x R2
T(W) =
S(W) =
T(R) = 5
S(R) = 37
V(R,A) = 3
V(R,C) = 5
V(R,B) = 1
V(R,D) = 4
CS 245
Notes 6
75
Size estimates for W = R1 x R2
CS 245
Size estimate for W =
T(W) =
T(R1)  T(R2)
S(W) = S(R)
S(W) =
S(R1) + S(R2)
T(W) = ?
CS 245
Notes 6
Notes 6
77
CS 245
Notes 6
76
A=a (R)
78
13
Example
R
A
B
C
D
cat 1 10 a
cat 1 20 b
dog 1 30 a
dog 1 40 c
Example
R
V(R,A)=3
V(R,B)=1
V(R,C)=5
V(R,D)=4
z=val(R)
A
B
C
D
cat 1 10 a
cat 1 20 b
dog 1 30 a
dog 1 40 c
W=
79
CS 245
z=val(R)
T(W) =
CS 245
Notes 6
80
T(R)
V(R,Z)
Notes 6
81
CS 245
Notes 6
Example
R
A
B
C
D
cat 1 10 a
cat 1 20 b
Values in select expression Z = val
are uniformly distributed
over domain with DOM(R,Z) values.
dog 1 30 a
dog 1 40 c
82
Alternate assumption
V(R,A)=3 DOM(R,A)=10
V(R,B)=1 DOM(R,B)=10
V(R,C)=5 DOM(R,C)=10
V(R,D)=4 DOM(R,D)=10
bat 1 50 d
W=
Notes 6
T(W) =
Values in select expression Z = val
are uniformly distributed
over possible V(R,Z) values.
Alternate Assumption:
CS 245
z=val(R)
what is probability this
tuple will be in answer?
Assumption:
V(R,A)=3
V(R,B)=1
V(R,C)=5
V(R,D)=4
bat 1 50 d
W=
V(R,A)=3
V(R,B)=1
V(R,C)=5
V(R,D)=4
D
bat 1 50 d
Notes 6
Example
R
C
dog 1 40 c
T(W) =
CS 245
B
cat 1 20 b
dog 1 30 a
bat 1 50 d
W=
A
cat 1 10 a
83
CS 245
z=val(R)
T(W) = ?
Notes 6
84
14
Example
R
A
B
C
D
cat 1 10 a
cat 1 20 b
dog 1 30 a
dog 1 40 c
Alternate assumption
V(R,A)=3 DOM(R,A)=10
V(R,B)=1 DOM(R,B)=10
V(R,C)=5 DOM(R,C)=10
V(R,D)=4 DOM(R,D)=10
bat 1 50 d
W=
z=val(R)
85
Selection cardinality
Notes 6
What about W =
z  val (R)
C
D
dog 1 40 c
Alternate assumption
V(R,A)=3 DOM(R,A)=10
V(R,B)=1 DOM(R,B)=10
V(R,C)=5 DOM(R,C)=10
V(R,D)=4 DOM(R,D)=10
bat 1 50 d
z=val(R)
T(W) =
T(R)
DOM(R,Z)
CS 245
Notes 6
What about W =
z  val (R)
SC(R,A) = average # records that satisfy
equality condition on R.A
T(R)
V(R,A)
SC(R,A) =
T(R)
DOM(R,A)
CS 245
B
cat 1 20 b
W=
Notes 6
A
cat 1 10 a
dog 1 30 a
what is probability this
tuple will be in answer?
T(W) = ?
CS 245
Example
R
86
?
T(W) = ?
87
?
CS 245
Notes 6
What about W =
z  val (R)
T(W) = ?
T(W) = ?
• Solution # 1:
T(W) = T(R)/2
• Solution # 1:
T(W) = T(R)/2
88
?
• Solution # 2:
T(W) = T(R)/3
CS 245
Notes 6
89
CS 245
Notes 6
90
15
• Solution # 3: Estimate values in range
• Solution # 3: Estimate values in range
Example R
Example R
Z
Min=1
Z
V(R,Z)=10
Min=1
W= z  15 (R)
V(R,Z)=10
W= z  15 (R)
Max=20
Max=20
f = 20-15+1 = 6
20-1+1
20
(fraction of range)
T(W) = f  T(R)
CS 245
Notes 6
91
Equivalently:
fV(R,Z) = fraction of distinct values
T(W) = [f  V(Z,R)] T(R) = f  T(R)
V(Z,R)
CS 245
Notes 6
Size estimate for W = R1
Notes 6
92
Size estimate for W = R1
R2
Let x = attributes of R1
y = attributes of R2
93
R2
CS 245
Notes 6
Case 2
R1
Let x = attributes of R1
y = attributes of R2
Case 1
CS 245
A
W = R1
B
C
94
R2
R2
A
XY=A
D
XY=
Same as R1 x R2
CS 245
Notes 6
95
CS 245
Notes 6
96
16
W = R1
Case 2
R1
A
B
C
XY=A
R2
R2
A
Computing T(W) when V(R1,A)  V(R2,A)
R1
D
A
B
C
Take
1 tuple
R2
A
D
Match
Assumption:
V(R1,A)  V(R2,A)  Every A value in R1 is in R2
V(R2,A)  V(R1,A)  Every A value in R2 is in R1
“containment of value sets” Sec. 7.4.4
CS 245
Notes 6
97
Computing T(W) when V(R1,A)  V(R2,A)
R1
A
B
Take
1 tuple
C
R2
A
D
CS 245
Notes 6
• V(R1,A)
 V(R2,A)
98
T(W) = T(R2) T(R1)
V(R2,A)
Match
• V(R2,A)
 V(R1,A)
T(W) = T(R2) T(R1)
V(R1,A)
1 tuple matches with T(R2) tuples...
V(R2,A)
so
T(W) =
CS 245
Notes 6
In general
T(W) =
[A is common attribute]
T(R2)  T(R1)
V(R2, A)
W = R1
99
CS 245
Notes 6
100
Case 2 with alternate assumption
R2
Values uniformly distributed over domain
T(R2) T(R1)
max{ V(R1,A), V(R2,A) }
R1
A
B
C
R2 A
D
This tuple matches T(R2)/DOM(R2,A) so
T(W) = T(R2) T(R1) = T(R2) T(R1)
DOM(R2, A)
DOM(R1, A)
Assume the same
CS 245
Notes 6
101
CS 245
Notes 6
102
17
Using similar ideas,
we can estimate sizes of:
In all cases:
S(W) = S(R1) + S(R2) - S(A)
size of attribute A
AB (R) ….. Sec. 16.4.2 (same for either edition)
A=aB=b (R) ….
Sec. 16.4.3
R
S with common attribs. A,B,C
Sec. 16.4.5
Union, intersection, diff, ….
Sec. 16.4.7
CS 245
Notes 6
103
Note: for complex expressions, need
intermediate T,S,V results.
E.g. W = [A=a (R1) ]
R2
Treat as relation U
T(U) = T(R1)/V(R1,A)
S(U) = S(R1)
CS 245
Notes 6
104
To estimate Vs
E.g., U = A=a (R1)
Say R1 has attribs A,B,C,D
V(U, A) =
V(U, B) =
V(U, C) =
V(U, D) =
Also need V (U, *) !!
CS 245
Notes 6
Example
R1 A
B
C
105
D
cat 1 20 20
dog 1 30 10
dog 1 40 30
bat 1 50 10
U=
Notes 6
Example
R1 A
V(R1,A)=3
V(R1,B)=1
V(R1,C)=5
V(R1,D)=3
cat 1 10 10
CS 245
B
C
V(R1,A)=3
V(R1,B)=1
V(R1,C)=5
V(R1,D)=3
D
cat 1 10 10
cat 1 20 20
dog 1 30 10
dog 1 40 30
A=a (R1)
106
bat 1 50 10
U=
A=a (R1)
V(U,A) =1 V(U,B) =1 V(U,C) =
T(R1)
V(R1,A)
V(D,U) ... somewhere in between
CS 245
Notes 6
107
CS 245
Notes 6
108
18
Possible Guess
V(U,A)
V(U,B)
U=
For Joins
A=a (R)
=1
= V(R,B)
U = R1(A,B)
R2(A,C)
V(U,A) = min { V(R1, A), V(R2, A) }
V(U,B) = V(R1, B)
V(U,C) = V(R2, C)
[called “preservation of value sets” in
section 7.4.4]
CS 245
Notes 6
109
Example:
Z = R1(A,B)
R1
R2
R3
R2(B,C)
Z=U
Notes 6
Notes 6
110
Partial Result: U = R1
R3(C,D)
T(U) = 10002000
200
T(R1) = 1000 V(R1,A)=50 V(R1,B)=100
T(R2) = 2000 V(R2,B)=200 V(R2,C)=300
T(R3) = 3000 V(R3,C)=90 V(R3,D)=500
CS 245
CS 245
111
CS 245
R2
V(U,A) = 50
V(U,B) = 100
V(U,C) = 300
Notes 6
112
A Note on Histograms
R3
40
T(Z) = 100020003000 V(Z,A) = 50
200300
V(Z,B) = 100
V(Z,C) = 90
V(Z,D) = 500
number of tuples
in R with A value
in given range
30
20
10
10
20
30
40
A=val(R) = ?
CS 245
Notes 6
113
CS 245
Notes 6
114
19
Summary
Outline
• Estimating size of results is an “art”
• Estimating cost of query plan
• Don’t forget:
Statistics must be kept up to date…
(cost?)
CS 245
Notes 6
115
– Estimating size of results
– Estimating # of IOs
done!
next…
• Generate and compare plans
CS 245
Notes 6
116
20
Download