SQL, RA, Sets, Bags
• Fundamental difference between theoretical RA and its practical application in DBMSs and SQL
  – RA uses sets
  – SQL uses bags (multisets)
• There are good performance reasons for using bags:
  – Queries often involve two or more joins, unions, etc.; eliminating duplicates after each one would require an extra pass through the relation being built
  – There are times we WANT every instance, particularly for aggregate functions (e.g., taking an average)
• Downside:
  – Extra memory
• Section 5.1 topics include:
  – Union, difference, intersection and how they are affected by operating over bags
  – Projection operator over bags
  – Selection operator over bags
  – Product and join over bags
• All of the above behave as you would expect
• Other topics in 5.1:
  – Algebraic laws of set operators applied to bags

Examples: set operators over bags
• {1,2,1} ∪ {1,1,2,3,1} = {1,1,1,1,1,2,2,3}
• {1,2,1,1} ∩ {1,2,1,3} = {1,1,2}
• {1,2,1,1,1} – {1,1,2,3} = {1,1}

Exercise 5.1.3a

Exercise 5.1.3b
• π_bore(Ships |><| Classes)

More relational algebra

δ – Duplicate elimination
• δ(R)
  – Eliminates duplicates from relation R
  – (i.e., converts a relation from a bag to a set representation)
• R2 := δ(R1)
  – R2 consists of one copy of each tuple that appears in R1 one or more times
• Corresponds to the DISTINCT modifier in a SELECT statement

δ – Example
R =
  A  B
  1  2
  3  4
  1  2

δ(R) =
  A  B
  1  2
  3  4

τ – Sorting
• R2 := τ_L(R1)
  – L is a list of some attributes of R1
  – L specifies the order of sorting
    • Increasing order
  – Tuples that agree on all attributes in L may appear in any order
• Benefit:
  – Obvious: ordered output
  – Not so obvious: relations stored in sorted order can substantially speed up queries
    • Recall the running time of binary search
    • O(log n) is far superior to O(n)

Aggregation Operators
• Used to summarize the values in an attribute of a relation
  – Produces a single value as a result
• SUM(attr)
• AVG(attr)
• MIN(attr)
• MAX(attr)
• COUNT(attr)

Example: Aggregation
R =
  A  B
  1  3
  3  4
  3  2

SUM(A), COUNT(A), MAX(B), AVG(B) = ?
• SUM(A) = 7
• COUNT(A) = 3
• MAX(B) = 4
• AVG(B) = 3

Grouping Operator
• R2 := γ_L(R1)
• L is a list of elements that are either:
  – Individual attributes of R1
    • Called grouping attributes
  – Aggregated attributes of R1
    • Use an arrow and a new name to rename the component
  – R2 projects only what is in L

How does γ_L(R) work?
1. Form one group for each distinct list of values for the grouping attributes in R
2. Within each group, compute AGG(A) for each aggregation in L
3. Result has one tuple for each group:
  – The grouping attributes' values for the group
  – The aggregations over all tuples of the group (for the aggregated attributes)

Example: Grouping / Aggregation
R =
  A  B  C
  1  2  3
  4  5  6
  1  2  5
  1  3  5

γ_{A,B,AVG(C)->X}(R) = ??

First, partition R by A and B:
  A  B  C
  1  2  3
  1  2  5
  4  5  6
  1  3  5

Then, average C within groups:
  A  B  X
  1  2  4
  4  5  6
  1  3  5

Note about aggregation
• If R is a relation, and R has attributes A1…An, then
  – δ(R) == γ_{A1,A2,…,An}(R)
  – Grouping on ALL attributes in R eliminates duplicates
  – i.e., δ is not really necessary
• Also, if relation R is a set, then
  – π_{A1,A2,…,An}(R) = γ_{A1,A2,…,An}(R)

Extended Projection
• Recall R2 := π_L(R1)
  – R2 contains only the attributes of R1 listed in L
• L can be extended to allow arbitrary expressions:
  – Renaming (e.g., A -> B)
  – Arithmetic expressions (e.g., A + B -> SUM)
  – Duplicate attributes (i.e., the same attribute included in L multiple times)

Example: Extended Projection
R =
  A  B
  1  2
  3  4

π_{A+B->C,A,A}(R) =
  C  A1  A2
  3  1   1
  7  3   3

Outer joins
• Recall that the standard natural join produces a tuple only when there is a match from both relations
• A tuple of R that has NO tuple of S with which it can join is said to be dangling
  – Vice versa applies
• Outer join: preserves dangling tuples in the join
  – Missing components are set to NULL
• R |>◦<|_C S
  – This is a bad approximation of the symbol – see text
  – No C? Natural outer join

Example: Outer Join
R =
  A  B
  1  2
  4  5

S =
  B  C
  2  3
  6  7

(1,2) joins with (2,3), but the other two tuples are dangling.

R |>◦<| S =
  A     B  C
  1     2  3
  4     5  NULL
  NULL  6  7

Types of outer joins
• R |>◦<| S
  – No condition, requires matching attributes
  – Pads dangling tuples from both sides
• R |>◦<|_L S
  – Pads dangling tuples of R only
• R |>◦<|_R S
  – Pads dangling tuples of S only
• SQL:
  – R NATURAL {LEFT | RIGHT} JOIN S
  – R {LEFT | RIGHT} JOIN S
  – NOTE: MySQL does not allow a FULL OUTER JOIN! Only LEFT or RIGHT
  – Just UNION a left outer join and a right outer join… mostly

  A+B  A²  B²
  1    0   1
  5    4   9
  1    0   1
  6    4   16
  7    9   16

  B+1  C-1
  1    0
  3    3
  3    4
  4    3
  1    1
  4    3

  A  B
  0  1
  2  3
  2  4
  3  4

  A  SUM(B)
  0  2
  2  7
  3  4
SELECT A, SUM(B) FROM R GROUP BY A;

  A
  0
  2
  3
SELECT A FROM R GROUP BY A;
SELECT DISTINCT A FROM R;

  A  MAX(C)
  2  4
SELECT A, MAX(C) FROM R NATURAL JOIN S GROUP BY A;
What if MAX(C) were SUM(C)?
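
The closing question above (what changes if MAX(C) becomes SUM(C)) can be checked directly. The following is a minimal sketch, not part of the original material: it runs both queries through Python's standard `sqlite3` module rather than MySQL. The contents of R and S are not listed in this section, so the rows below are inferred from the result tables and should be treated as an assumption; `make_db` and `run` are hypothetical helper names.

```python
import sqlite3

# ASSUMPTION: these rows are inferred from the result tables shown above;
# the section itself never lists R(A, B) and S(B, C) explicitly.
R_ROWS = [(0, 1), (2, 3), (0, 1), (2, 4), (3, 4)]
S_ROWS = [(0, 1), (2, 4), (2, 5), (3, 4), (0, 2), (3, 4)]


def make_db():
    """Build an in-memory database holding the two bags R and S."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
    conn.execute("CREATE TABLE S (B INTEGER, C INTEGER)")
    conn.executemany("INSERT INTO R VALUES (?, ?)", R_ROWS)
    conn.executemany("INSERT INTO S VALUES (?, ?)", S_ROWS)
    return conn


def run(conn, sql):
    """Execute one query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


conn = make_db()

# Only R's (2,3) finds partners in S, and it matches the tuple (3,4) twice,
# so the join is the bag {(2,3,4), (2,3,4)}.
max_c = run(conn, "SELECT A, MAX(C) FROM R NATURAL JOIN S GROUP BY A ORDER BY A")
sum_c = run(conn, "SELECT A, SUM(C) FROM R NATURAL JOIN S GROUP BY A ORDER BY A")

print(max_c)  # MAX is unaffected by the duplicate tuple
print(sum_c)  # SUM counts the duplicate tuple twice
```

Under these assumed rows, MAX(C) gives the same answer whether or not duplicates are kept, while SUM(C) adds the duplicated tuple twice. This is the earlier point about bags: for aggregates such as SUM and AVG, every instance matters.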

  A  B  C
  2  3  4
  2  3  4
  0  1  ⊥
  0  1  ⊥
  2  4  ⊥
  3  4  ⊥
SELECT * FROM R NATURAL LEFT JOIN S;

  A  B  C
  2  3  4
  2  3  4
  ⊥  0  1
  ⊥  2  4
  ⊥  2  5
  ⊥  0  2
SELECT * FROM R NATURAL RIGHT JOIN S;

  A  B  C
  2  3  4
  2  3  4
  0  1  ⊥
  0  1  ⊥
  2  4  ⊥
  3  4  ⊥
  ⊥  0  1
  ⊥  2  4
  ⊥  2  5
  ⊥  0  2
SELECT * FROM R NATURAL LEFT JOIN S
UNION
SELECT * FROM R NATURAL RIGHT JOIN S;
Right?
• Not quite: UNION eliminates duplicates, so the repeated tuples collapse. Instead:
  SELECT * FROM R NATURAL LEFT JOIN S
  UNION ALL
  SELECT * FROM R NATURAL RIGHT JOIN S WHERE A IS NULL;

Outer join result with separate R.B and S.B columns:
  A  R.B  S.B  C
  0  1    2    4
  0  1    2    5
  0  1    3    4
  0  1    3    4
  0  1    2    4
  0  1    2    5
  0  1    3    4
  0  1    3    4
  2  3    ⊥    ⊥
  2  4    ⊥    ⊥
  3  4    ⊥    ⊥
  ⊥  ⊥    0    1
  ⊥  ⊥    0    2

Back to SQL Aggregations
• SUM, AVG, COUNT, MIN, and MAX can be applied to a column in a SELECT clause
  – Produces an aggregation on the attribute
• COUNT(*) counts the number of tuples
• Use DISTINCT inside an aggregation to eliminate duplicates before the function is applied

Example:
• Sells(bar, beer, price)
• Find the average price of Guinness
  SELECT AVG(price)
  FROM Sells
  WHERE beer = 'Guinness';
• Find the number of different prices charged for Guinness
  SELECT COUNT(DISTINCT price) AS "# Prices"
  FROM Sells
  WHERE beer = 'Guinness';

Grouping
• SELECT attr(s) FROM tbls WHERE cond_expr GROUP BY attr(s)
• The relation resulting from SELECT-FROM-WHERE is determined FIRST, then grouped according to the GROUP BY clause
  – MySQL (before version 8.0) also sorts the result by the attributes listed in the GROUP BY clause
    • Therefore, it allows an optional ASC or DESC (just like ORDER BY)
• Aggregations are applied only within each group

Grouping and NULLs

Note on NULL and Aggregation
• NULL values in a tuple:
  – never contribute to a sum, average, or count
  – can never be the min or max of an attribute
• If all values for an attribute are NULL, then the result of an aggregation is NULL
  – Exception: COUNT of an empty set is 0
• NULL values are treated as ordinary values when forming groups

Example: Grouping
• Sells(bar, beer, price)
  Frequents(drinker, bar)
• Find the average price for each beer
  SELECT beer, AVG(price)
  FROM Sells
  GROUP BY beer;
• Find for each drinker the average price of Guinness at the bars they frequent
  SELECT drinker, AVG(price)
  FROM Frequents NATURAL JOIN Sells
  WHERE beer = 'Guinness'
  GROUP BY drinker;

Restrictions
• Example:
  – Find the bar that sells Guinness the cheapest
  – SELECT bar, MIN(price) FROM Sells WHERE beer = 'Guinness';
  – Is this correct?
• Book states that this is illegal SQL
  – If an aggregation is used, then each SELECT element should be aggregated or be an attribute in GROUP BY
  – MySQL allows the above (when the ONLY_FULL_GROUP_BY mode is disabled), but such queries will give meaningless results

Example of confusing aggregation
• Find the country of the ship with bore of 15 with the smallest displacement
• SELECT country, MIN(displacement) FROM Classes WHERE bore = 15;
Not quite the correct answer! Be sure to follow the rules for aggregation.

HAVING Clause
• HAVING cond
  – Follows a GROUP BY clause
  – Condition applies to each possible group
  – Groups not satisfying the condition are eliminated
• Rules for conditions in HAVING clause:
  – Aggregated attributes:
    • Any attribute of a relation in the FROM clause can be aggregated
    • Only applies to the group being tested
  – Unaggregated attributes:
    • Only attributes in the GROUP BY list
    • MySQL is more lenient with this, though such conditions yield meaningless information

Example: HAVING
• Sells(bar, beer, price)
• Find the average price of those beers that are served in at least three bars
  SELECT beer, AVG(price)
  FROM Sells
  GROUP BY beer
  HAVING COUNT(*) >= 3;

Example: HAVING
• Sells(bar, beer, price)
  Beers(name, manf)
• Find the average price of beers that are either served in at least three bars or are manufactured by Sam Adams
  SELECT beer, AVG(price)
  FROM Sells
  GROUP BY beer
  HAVING COUNT(*) >= 3 OR
         beer IN
         (SELECT name FROM Beers WHERE manf = 'Sam Adams');
• Find the average displacement of ships from each country having at least two classes
  SELECT country, AVG(displacement)
  FROM Classes
  GROUP BY country
  HAVING COUNT(*) >= 2;

Summary so far
  SELECT    S
  FROM      R1,…,Rn
  WHERE     C1
  GROUP BY  a1,…,ak
  HAVING    C2
  ORDER BY  b1,…,bk;
  – S are attributes from R1,…,Rn or aggregates
  – C1 are conditions on R1,…,Rn
  – a1,…,ak are attributes from R1,…,Rn
  – C2 are conditions on any aggregation, or on any attribute in the GROUP BY clause
  – b1,…,bk are attributes of R1,…,Rn

Exercises

Exercise 6.2.3f
  SELECT battle
  FROM Outcomes INNER JOIN Ships ON Outcomes.ship = Ships.name
       NATURAL JOIN Classes
  GROUP BY country, battle
  HAVING COUNT(ship) >= 3;

Exercise 6.4.7a
• SELECT COUNT(type) FROM Classes WHERE type = 'bb';

Exercise 6.4.7b
• SELECT AVG(numGuns) AS 'Avg Guns' FROM Classes WHERE type = 'bb';

Exercise 6.4.7c
• SELECT AVG(numGuns) AS 'Avg Guns' FROM Classes NATURAL JOIN Ships WHERE type = 'bb';

Exercise 6.4.7d
• SELECT class, MIN(launched) AS First_Launched FROM Classes NATURAL JOIN Ships GROUP BY class;

Exercise 6.4.7e
  SELECT C.class, COUNT(O.ship) AS '# sunk'
  FROM Classes AS C NATURAL JOIN Ships AS S
       INNER JOIN Outcomes AS O ON S.name = O.ship
  WHERE O.result = 'sunk'
  GROUP BY C.class;
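
The HAVING example and the NULL rules from earlier in this section can be tried on a small table. This is a rough sketch under stated assumptions, not something taken from the course material: it uses Python's standard `sqlite3` module instead of MySQL, and the bar names and prices in `SELLS_ROWS` are invented purely for illustration.

```python
import sqlite3

# ASSUMPTION: hypothetical sample data; only the schema Sells(bar, beer, price)
# comes from the section. One Guinness price is deliberately NULL (None).
SELLS_ROWS = [
    ("Joe", "Guinness", 5.0),
    ("Sue", "Guinness", 6.0),
    ("Moe", "Guinness", None),
    ("Joe", "Bud", 3.0),
    ("Sue", "Bud", 4.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL)")
conn.executemany("INSERT INTO Sells VALUES (?, ?, ?)", SELLS_ROWS)

# Average price of beers served in at least three bars.
# WHERE/GROUP BY run first; HAVING then discards whole groups.
having_rows = conn.execute(
    "SELECT beer, AVG(price) FROM Sells "
    "GROUP BY beer HAVING COUNT(*) >= 3 ORDER BY beer"
).fetchall()

# COUNT(*) counts tuples, while COUNT(price) and AVG(price) skip NULLs.
null_rows = conn.execute(
    "SELECT COUNT(*), COUNT(price), AVG(price) "
    "FROM Sells WHERE beer = 'Guinness'"
).fetchall()

print(having_rows)
print(null_rows)
```

With this made-up data, Bud is served in only two bars, so its group is removed by HAVING. Guinness keeps its group because COUNT(*) counts the tuple with the NULL price, even though that tuple contributes nothing to AVG(price) or COUNT(price).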