Extended Relational Algebra for SQL Slides

advertisement
SQL, RA, Sets, Bags
• Fundamental difference between theoretical RA
and practical application of it in DBMSs and SQL
– RA  uses sets
– SQL  uses bags (multisets)
• There are good performance reasons for using
bags:
– Queries involve 2+ join, union, etc., which would
require an extra pass through the relation being built
– There are times we WANT every instance, particularly
for aggregate functions (e.g. taking an average)
• Downside:
– Extra memory
• Section 5.1 Topics include:
– Union, Difference, Intersection and how they are
affected by operation over bags
– Projection operator over bags
– Selection operator over bags
– Product and join over bags
• All the above follow what you would expect
• Other topics in 5.1:
– Algebraic laws of set operators applied to bags
Examples: set operators over bags
• {1,2,1} ∪ {1,1,2,3,1} =
– {1,1,1,1,1,2,2,3}
• {1,2,1,1} ∩ {1,2,1,3} =
– {1, 1, 2}
• {1,2,1,1,1} – {1,1,2,3} =
– {1,1}
Exercise 5.1.3a
Exercise 5.1.3b
• πbore(Ships |><| Classes)
More relational algebra
δ – Duplicate elimination
• δ(R)
– Eliminate duplicates from relation R
– (i.e. converts a relation from a bag to set
representation)
• R2 := δ(R1)
– R2 consists of one copy of each tuple that appears
in R2 one or more times
• DISTINCT modifier in SELECT stmt
δ - Example
R= ( A
1
3
1
δ(R) =
B)
2
4
2
A
1
3
B
2
4
τ – Sorting
• R2 := τL(R1)
– L – list of some attributes of R1
– L specifies the order of sorting
• Increasing order
– Tuples with identical components in L specify no order
• Benefit:
– Obvious – ordered output
– Not so obvious – stored sorted relations can have
substantial query benefit
• Recall running time for binary search
• O(log n) is far superior than O(n)
Aggregation Operators
• Use to summarize something about the values
in attribute of a relation
– Produces a single value as a result
•
•
•
•
•
SUM(attr)
AVG(attr)
MIN(attr)
MAX(attr)
COUNT(attr)
Example: Aggregation
R= ( A
1
3
3
B )
3
4
2
SUM(A) = 7
COUNT(A) = 3
MAX(B) = 4
AVG(B) = 3
SUM(A),
COUNT(A),
MAX(B), AVG(B)
=?
Grouping Operator
• R2 := γL(R1)
• L is a list of elements that are:
– Individual attributes of R1
• Called grouping attributes
– Aggregated attribute of R1
• Use an arrow and a new name to rename the
component
– R2 projects only what is in L
How does γL(R) work?
1. Form one group for each distinct list of values
for those attributes in R
2. Within each group, compute AGG(A) for each
aggregation on L
3. Result has one tuple for each group
– The grouping attributes' values for the group
– The aggregations over all tuples of the group (for
the aggregated attributes)
Example: Grouping / Aggregation
R=
γ
( A
1
4
1
1
B
2
5
2
3
C)
3
6
5
5
(R) = ??
First, partition R by A and B :
A
B
C
1
2
3
1
2
5
4
5
6
1
3
5
A,B,AVG(C)->X
Then, average C
within groups:
A
1
4
1
B
2
5
3
X
4
6
5
Note about aggregation
• If R is a relation, and R has attributes A1…An,
then
– δ(R) == γA1,A2,…,An(R)
– Grouping on ALL attributes in R eliminates
duplicates
– i.e. δ is not really necessary
• Also, if relation R is also a set, then
– πA1,A2,…,An(R) = γA1,A2,…,An(R)
Extended Projection
• Recall R2 := πL(R1)
– R2 contains only L attributes from R1
• L can be extended to allow arbitrary
expressions:
– Renaming (e.g., A -> B)
– Arithmetic expressions (e.g., A + B -> SUM)
– Duplicate attributes (i.e., include in L multiple
times)
Example: Extended Projection
R= ( A
1
3
B)
2
4
πA+B->C,A,A (R) =
C
3
7
A1
1
3
A2
1
3
Outer joins
• Recall that the standard natural join occurs only if
there is a match from both relations
• A tuple of R that has NO tuple of S with which it
can join is said to be dangling
– Vice versa applies
• Outer join: preserves dangling tuples in join
– Missing components set to NULL
• R |>◦<|C S.
– This is a bad approximation of the symbol – see text
– NO C? Natural outer join
Example: Outer Join
R=
(A
1
4
B)
2
5
S=
(B
2
6
C)
3
7
(1,2) joins with (2,3), but the other two tuples
are dangling.
R |>◦<| S =
A
B
1
4
NULL
C
2
5
6
3
NULL
7
Types of outer joins
• R |>◦<| S
– No condition, requires matching attributes
– Pads dangling tuples from both side
• R |>◦<| L S
– Pad dangling tupes of R only
• R |>◦<| R S
– Pad dangling tuples of S only
• SQL:
– R NATURAL {LEFT | RIGHT} JOIN S
– R {LEFT | RIGHT} JOIN S
– NOTE MySQL does not allow a FULL OUTER JOIN! Only LEFT or
RIGHT
– Just UNION a left outer join and a right outer join… mostly
A+B
1
5
1
6
7
A2
0
4
0
4
9
B2
1
9
1
16
16
B+1
1
3
3
4
1
4
C-1
0
3
4
3
1
3
A
0
2
2
3
B
1
3
4
4
A
0
2
3
SUM(B)
2
7
4
SELECT A,SUM(B)
FROM R
GROUP BY A
A
0
2
3
SELECT A FROM R GROUP BY A;
SELECT DISTINCT A FROM R;
A
2
MAX(C)
4
SELECT A,MAX(C)
FROM R NATURAL JOIN S
GROUP BY A;
What if MAX(C)
was SUM(C)?
A
2
2
0
0
2
3
B
3
3
1
1
4
4
C
4
4
┴
┴
┴
┴
SELECT * FROM R NATURAL LEFT JOIN S;
A
2
2
┴
┴
┴
┴
B
3
3
0
2
2
0
C
4
4
1
4
5
2
SELECT * FROM R NATURAL RIGHT JOIN S;
A
2
2
0
0
2
3
┴
┴
┴
┴
B
3
3
1
1
4
4
0
2
2
0
C
4
4
┴
┴
┴
┴
1
4
5
2
SELECT *
FROM R
NATURAL LEFT JOIN S
UNION
SELECT *
FROM R
NATURAL RIGHT JOIN S;
Right?
• SELECT *
FROM R
NATURAL LEFT JOIN S
UNION ALL
SELECT *
FROM R
NATURAL RIGHT JOIN S
WHERE A IS NULL;
A
0
0
0
0
0
0
0
0
2
2
3
┴
┴
R.B
1
1
1
1
1
1
1
1
3
4
4
┴
┴
S.B
2
2
3
3
2
2
3
3
┴
┴
┴
0
0
C
4
5
4
4
4
5
4
4
┴
┴
┴
1
2
Back to SQL
Aggregations
• SUM, AVG, COUNT, MIN, and MAX can be
applied to a column in a SELECT clause
– Produces an aggregation on the attribute
• COUNT(*) count the number of tuples
• Use DISTINCT inside of an aggregation to
eliminate duplicates in the function
Example:
• Sells(bar, beer, price)
• Find the average price of Guinness
– SELECT AVG(price)
– FROM Sells
– WHERE beer = 'Guinness';
• Find the number of different prices charged for
Guinness
– SELECT COUNT(DISTINCT price)
AS "# Prices"
– FROM Sells
– WHERE beer = 'Guinness';
Grouping
• SELECT
attr(s)
FROM
tbls
WHERE
cond_expr
GROUP BY attr(s)
• The resulting SELECT-FROM-WHERE relation
determined FIRST, then grouped according to
GROUP BY clause
– MySQL will also sort the relations according to
attributes listed in GROUP BY clause
• Therefore, allows optional ASC or DESC (just like ORDER BY)
• Aggregations are applied only within each group
Grouping and NULLS
Note on NULL and Aggregation
• NULL values in a tuple:
– never contribute to a sum, average or count
– can never be a min or max of an attribute
• If all values for an attribute are NULL, then
the result of an aggregation is NULL
– Exception: COUNT of an empty set is 0
• NULL values are treated as ordinary values
when forming groups
Example: Grouping
• Sells(bar, beer, price)
Frequents(drinker, bar)
• Find the average price for each beer
– SELECT beer, AVG(price)
– FROM Sells
– GROUP BY beer;
• Find for each drinker the average price of Guinness at
the bars they frequent
–
–
–
–
–
SELECT drinker, AVG(price)
FROM Frequents
NATURAL JOIN Sells
WHERE beer = 'Guinness'
GROUP BY drinker;
Restrictions
• Example:
– Find the bar that sells Guinness the cheapest
– SELECT bar, MIN(price)
FROM Sells
WHERE beer = 'Guinness';
– Is this correct?
• Book states that this is illegal SQL
– if an aggregation used, then each SELECT element
should be aggregated or be an attribute in GROUP BY
– MySQL allows the above, but such queries will give
meaningless results
Example of confusing aggregation
• Find the country of the ship with bore of 15
with the smallest displacement
• SELECT country,
MIN(displacement)
FROM Classes
WHERE bore = 15;
Not quite the correct answer!
Be sure to follow the rules for
aggregation.
HAVING Clause
• HAVING cond
– Follows a GROUP BY clause
– Condition applies to each possible group
– Groups not satisfying condition are eliminated
• Rules for conditions in HAVING clause:
– Aggregated attributes:
• Any attribute in relation in FROM clause can be aggregated
• Only applies to the group being tested
– Unaggregated attributes
• Only attributes in GROUP BY list
• mySQL is more lenient with this, though they result in
meaningless information
Example: HAVING
• Sells(bar, beer, price)
• Find the average price of those beers that are
served in at least three bars
• SELECT beer, AVG(price)
FROM Sells
GROUP BY beer
HAVING COUNT(*) >= 3;
Example: HAVING
• Sells(bar, beer, price)
Beers(name, manf)
• Find the average price of beers that are either served in at
least three bars or are manufactured by Sam Adams
• SELECT beer, AVG(price)
• FROM Sells
• GROUP BY beer
• HAVING COUNT(*) >= 3 OR
•
beer IN
•
(SELECT name
FROM Beers
WHERE manf = 'Sam Adams');
• Find the average displacement of ships from
each country having at least two classes
•
•
•
•
SELECT country, AVG(displacement)
FROM Classes
GROUP BY country
HAVING count(*) >= 2;
Summary so far
•
•
•
•
•
•
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
–
–
–
–
S
R1,…,Rn
C1
a1,…,ak
C2
b1,…,bk;
S attributes from R1,…,Rn or aggregates
C1 are conditions on R1,…,Rn
a1,…,ak are attributes from R1,…,Rn
C2 are conditions based on any attribute, or on any
aggregation in GROUP BY clause
– b1,…,bk are attributes on R1,…,Rn
Exercises
Exercise 6.2.3f
SELECT battle
FROM Outcomes
INNER JOIN Ships
ON Outcomes.ship = Ships.name
NATURAL JOIN Classes
GROUP BY country, battle
HAVING COUNT(ship) >= 3;
Exercise 6.4.7a
• SELECT COUNT(type)
FROM
Classes
WHERE type = 'bb';
Exercise 6.4.7b
• SELECT AVG(numGuns) AS 'Avg
Guns'
FROM
Classes
WHERE type = 'bb';
Exercise 6.4.7c
• SELECT AVG(numGuns) AS 'Avg
Guns'
FROM
Classes
NATURAL JOIN Ships
WHERE type = 'bb';
Exercise 6.4.7d
• SELECT class,
MIN(launched) AS First_Launched
FROM
Classes
NATURAL JOIN Ships
GROUP BY class;
Exercise 6.4.7e
•
•
•
•
•
•
•
SELECT C.class, COUNT(O.ship) AS '# sunk'
FROM
Classes AS C
NATURAL JOIN Ships AS S
INNER JOIN Outcomes AS O
ON S.name = O.ship
WHERE O.result = 'sunk'
GROUP BY C.class;
Download