Relational Algebra and Query Optimization

advertisement
Relational Algebra
SQL specifies what to retrieve but not
how to retrieve it.
Need a process to translate a
descriptive statement into a collection
of activities.
This is “behind the scenes”, but
important to know nonetheless.
If a CS student doesn’t know it – who
will?
SQL Statement
Low level logic
involving B-Tree,
index, hash table
access, etc.
Result
High level logic
involving table
(relation)
manipulation
Mid level logic
involving loops
and comparisons
SQL queries relations and generates
a new relation.
Need something that manipulates
relations in order to create other
relations.
Need to do it efficiently.
Relational Algebra:
a collection of operations (activities) to
construct new relations from given
relations.
Typical process:
SQL  relational algebra operations (or
something similar)  mid/low level
logic  result
Reference: First few slides of
[http://www.cs.wayne.edu/~shiyong/c
sc6710/slides/kiferComp_348761_ppt
11.ppt], though the notation may be
intimidating.
Background:
Tuple (n-tuple): collection of n things
written as (a1, a2,…, an).
Typically a tuple represents an entity
in a relation (a row in a table).
Set operations:
Definition:
2 relations are Union-compatible if they
have same degree (number of attributes)
and the i th attributes of each are defined
on same domains.
The first three definitions that follow
apply only to Union-compatible
relations.
Union
A  B is the set of elements
belonging to either A or B.
Venn Diagram
AB
SQL Server:
SELECT *
FROM
S
WHERE status = 20
UNION
SELECT *
FROM
S
WHERE status = 50
Difference (Minus):
A – B is the set of elements
belonging to A but not to B.
Venn Diagram.
A–B
SQL Server uses Except instead of
Minus
Supplier table: Create a view defined
by that below (all suppliers not in
Paris).
SELECT *
FROM
S
EXCEPT
SELECT *
FROM
S
WHERE City = 'Paris'
Intersection
A  B is the set of elements belonging
to both A and B.
Venn Diagram
AB
SQL Server:
SELECT *
FROM
S
WHERE status > 20
INTERSECT
SELECT *
FROM
S
WHERE status < 50
Cartesian Product:
A×B is the set of all elements of the
form (x, y) where x belongs to A and y
belongs to B.
A = {a, b}
B = {1, 2, 3}
A×B = {(a, 1), (a, 2), (a, 3), (b, 1),
(b, 2), (b, 3) }
Supplier table: Create a view defined
by
SELECT * FROM S CROSS JOIN SP
Selection (Restriction)
R where predicate where the
predicate is some condition that
evaluates to true or false.
A row subset of relation R.
Select * from R where predicate.
Projection
Let R be a relation and X, Y, …, Z be
among R’s attributes.
Then R[X, Y, …, Z] is the set of
elements from R restricted to
attributes from X, Y, … , Z with
duplicates removed.
Column subset of R
Select distinct attribute list from R
Join
Cartesian Product followed by a
selection
A×B where predicate.
SELECT * FROM S CROSS JOIN SP
where S.S# = SP.S#
Or
SELECT * FROM S, SP where
S.S#=SP.S#
Natural Join:
A join B
Let Ci, … , Cj be attribute names
common to A and B. Then A JOIN B
is
(A ×B)[all attributes with duplicates
removed] where
A.Ci = B.Ci and
:
:
A.Cj = B.Cj
Outer Join:
Example, Consider S join SP.
Suppose S9 is a supplier but S9
supplies no parts, yet. An entry for S9
will NOT appear in S join SP.
S left outerjoin SP will include every
row from the left table (S) with NULLS
filled in if there’s no matching S# in
the SP table.
Inner Join
Conventional join operation but the
phrase is used if there’s the potential
for confusion with outer join.
Examples
select * from S inner join SP on
S.S#=SP.S#
select * from S left outer join SP on
S.S#=SP.S#
Get supplier names for suppliers who
supply part P2.
( ( SP where P# = ‘P2) join S) [SName]
Find a row subset (Selection) of SP
where P#=‘P2’
2. Do a natural join with S
3. Find a column subset of the result
(project on Sname)
1.
Get supplier names for suppliers who
supply at least one red part.
( ( ( P where color = ‘Red’) Join SP) [S#] Join S)
[Sname]
1.
2.
3.
4.
5.
Find a row subset (Selection) of P where
color=‘Red’
Do a natural join with SP
Project on S#
Do a natural join with S
Project on SName
Get supplier names for suppliers who
do not supply part P2.
S[Sname] Minus ( ( SP where P# = ‘P2) join S)
[Sname]
1.
2.
3.
4.
5.
Project S on SName
Find a row subset of SP where P#=‘P2’
Join the result of 2 with S
Project the result of 3 on Sname
Calculate the difference between the result of 1 and the
result of 2
Division:
Let A be a relation of degree m+n and B be a
relation of degree n
Visualize an m+n tuple (entry) of A as a pair
(x, y) where
x is an m-tuple (1st m attributes) and y is an
n-tuple (last n attributes)
Also suppose the n attributes of B are
defined on the same domains as the last n
attributes of A.
Then C = A dividedby B is a relation
consisting of m-tuples x where for all y in B
there is a pair (x, y) in A.
NOTE: only 5 operations: restriction,
projection, product, union, and
difference are primitive.
AB = A – (A – B)
A dividedby B = A[x] – (A[x]×B – A)[x].
Example:
SP dividedby B
B is: 1) P1 or 2) P2 and P4 or 3) P1
through P6
Supplier numbers that supply all parts in B
SP Dividedby B is: 1) S1 and S2 or 2) S1
and S4 or 3) S1
S#
S1
S1
S1
S1
S1
S1
S2
S2
S2
S3
S3
S4
S4
S4
P#
P1
P2
P3
P4
P5
P6
P1
P2
P3
P2
P3
P2
P3
P4
A dividedby B = A[x] – (A[x]×B – A)[x].
Apply to previous example where A =
SP.
B = P1
B = P2 and P4
B = P1 thru P6
S1 thru S4
(Si, P1) i = 1…4
S1 thru S4
(Si, P2) i = 1…4
(Si, P4) i = 1…4
S1 thru S4
(S1, Pi) i = 1…6
(S2, Pi) i = 1…6
(S3, Pi) i = 1…6
(S4, Pi) i = 1…6
A[x] ×B - A
(S3, P1)
(S4, P1)
(S2, P4)
(S3, P4)
(S2, Pi) i=3,4,5,6
(S3, Pi) i=1,3,4,5,6
(S4, Pi) i=1,3,6
(A[x] ×B – A)[x]
S3 and S4
S1 and S2
S2 and S3
S1 and S4
S2, S3, S4
S1
A[x]
A[x] ×B
A[x] -(A[x] × B – A)[x]
Example
Get supplier names for suppliers who
supply all parts.
( ( SP dividedby P[P#]) Join S)[Sname]
or
( ( SP[S#] - (SP[S#] times P[P#] – SP)[S#])
Join S)[Sname]
Recall previous notes in SQL where finding
ALL of something required double “not
exist” subqueries.
Query Optimization: Introduction to
Database Systems by C. J. Date 8th ed
Select S.name
From S, SP
Where S.S# = SP.S#
And SP.P# = ‘P2’
Suppose 100 suppliers and 10,000 shipments
Consider
(S times SP) where S.S#=SP.S# and SP.P#=’P2’
vs
(S times (SP where SP.P#=’P2’) where S.S# = SP.S#
First option creates a VERY large intermediate table.
Stages of query optimization:
Slide 4 of
[http://www.cs.wayne.edu/~shiyong/csc6710/
slides/kiferComp_348761_ppt11.ppt]
Internal representation of a relational algebra
expression (e.g. a query tree)
Project on name
Project on name
join
Where SP.P# = ‘P2
Where SP.P# = ‘P2
join
S
SP
S
(S join (SP where P#=’P2’))[name]
SP
((S join SP) where P#=’P2) [name]
Canonical forms
Set of queries, called C, such that for
every possible query there is a query in
C that is equivalent.
Reason for this?
Get suppliers who supply P2 can be
expressed in 8 different ways. Ideally,
the efficiency should not depend on the
query form.
Examples:
Where p or (q and r)  where (p or q)
and (p or r) -- Conjunctive normal form
(A where p) where q  A where (p and q)
(A[attributes]) where p  (A where
p)[attributes]
A.F1 > B.F2 and B.F2 = 3  A.F1 > 3
Consider: Get suppliers not in London
or Paris.
SELECT
FROM
WHERE
S#, Sname, Status, City
dbo.S
(City <> 'Paris') OR
(City <> 'London')
Does this yield the expected result?
NOT (p and q)  NOT p or NOT q
(DeMorgan’s laws
NOT (p or q)  NOT p and NOT q
(DeMorgan’s laws
If system knew that SP.P# is a foreign key
matching a primary key P.P# in P then
(SP join P) [S#]  SP[S#]
Query optimizers do not optimize –
just try to find “reasonably good”
evaluation strategies. Might take
longer to find optimal strategy than to
do brute force method.
[http://www.cs.wayne.edu/~shiyong/c
sc6710/slides/kiferComp_348761_ppt
11.ppt]
Summaries and References:
“MySQL's query optimization engine
isn't always the most efficient when it
comes to subqueries.”
often a good idea to convert a
subquery to a join
[http://www.databasejournal.com/features/mysq
l/article.php/3813821/Five-QueryOptimizations-in-MySQL.htm]
Some additional information
[http://avid.cs.umass.edu/courses/645/s2
009/lectures/Lec12-QueryOptimizerx6.pdf]
SQL Server: Query Analysis:
Select a database
Create a query window (right click on
the database and select new query)
and enter an query
Query  Display Estimated
Execution Plan
Query  Include Actual Execution
Plan.
When the query is executed there will be
an execution plan tab on the result pane.
Some example queries to analyze:
select * from S
select * from S order by S#
select * from S order by status
select * from SP where Qty = 200
select * from S, SP where S.S#=SP.S# and P#='P2'
select * from S where S# in (select S# from SP where P#='P2')
S2 Replacement View
Any Red part View.
Only Red Parts1 view.
Do some views from the genealogy database (see next slide for family
tree)
Mike
Tom
Jim
John
Joanne
Jason
Alice
Kathy
Sally
Jeanne
Brian
Linda
Ira
Sarah
Ellen
Joseph
Daniel
Tim
Intro to SQL Server Optimization
[http://www.csharpcorner.com/UploadFile/john_cha
rles/QueryoptimizationinSQLServer20
0512112007154303PM/Queryoptimiz
ationinSQLServer2005.aspx]
Some useful terms:
[http://technet.microsoft.com/enus/library/cc917672.aspx]
Clustered index
A clustered index is organized as a Btree, where the nonleaf nodes are index
pages and the leaf nodes are data
pages.
A table can have only 1.
Nonclustered index
A nonclustered index is organized as a
B-tree. Unlike a clustered index, a
nonclustered index consists of only
index pages. The leaf nodes in a
nonclustered index are not data pages,
but contain row locators for individual
rows in the data pages.
Hash match inner join:
[http://blogs.msdn.com/craigfr/archive/2006
/08/10/687630.aspx].
Left Semi Join
“returns rows from one table that would
join with another table without
performing a complete join”
[http://blogs.msdn.com/craigfr/archive/20
06/07/19/671712.aspx]
Left Anti Semi Join
Similar to above but returns those that
would NOT join.
Stream Aggregates:
[http://blogs.msdn.com/craigfr/archive/
2006/09/13/752728.aspx]
Index scan vs index seek:
[http://blogs.msdn.com/craigfr/archive/
2006/06/26/647852.aspx
Spooling:
Temporary “caching” of data for use
by another activity:
[http://technet.microsoft.com/enus/library/ms191221.aspx]
Summaries at
[http://blogs.msdn.com/craigfr/attach
ment/8508493.ashx] and
[http://technet.microsoft.com/enus/library/ms191158.aspx]
Generating low level routines
Construct query plans from low level
routines and make some attempt at
optimizing.
True optimization is sometimes more
costly than the savings.
Join of relations R and S (on attribute
C): Intro to Database Systems by C. J.
Date
Brute Force:
Cardinality of R is m, cardinality of S is n
For i = 1 to m do
For j = 1 to n do
If R[i].C = S[j].C then
Add joined tuple, R[i]:S[j]
to result table
Time is proportional to m*n
Worse
depending on many factors, number of
disk I/Os could be proportional to m*n
Potentially large if tens or hundreds of
thousands of tuples in each table or if
tuples are very large and require a lot
of disk I/Os.
Indexed method
Assume index on S.C
For i = 1 to m do
{
Find all records S[j], in S, where S[j].C=R[i].C
//requires a search through the index which is
//far less work than looking through all records
in S
Join R[i] with corresponding tuples and add to
result table.
}
Hash Table:
Assume (or build) a hash table on S.C
For i = 1 to m do
{
k = hash(R[i].C)
search hash list at H[k] looking for
matches to R[i].C
if found, add joined tuple to result
table.
}
Projection:
R… relation; A…attributes; R[i] tuple i of R;
Chosen is a boolean vector, one bit for each
R[i]; initially all False.
Nested Loop :
For i = 1 to m do
//look for alternative
If Chosen[i] is true then
{
add R[i].A to result table
for j = i+1 to n
if R[j].A = R[i].A then set Chosen[j]=true
}
Hashing: using a hash table H
For i = 1 to m do
{
dup = false
k = hash(R[i].A)
search all tuples at H[k], looking for R[i].A;
if not found add R[i].A to tuples at H[k]
}
Gather all tuples from hash table to store
into result table.
Find London suppliers who supply red
parts with weight < 20 in quantities of
200
CreateView P’
As Select * from P where Color = ‘Red’ and Weight < 20
CreateView SP’
As Select * from SP where Qty > 200
CreateView S’
As Select * from S where City = ‘London’
Select name from S’, SP’, P’
Where S’.S# = SP’.S# and
SP’.P# = P’.P#
Could create these three views concurrently with multiprocessors or
multi-core processors.
Download