Week 5

Contents

Objectives
Introduction to Week 5
    Textbook coverage
The Parts database
    Introducing the Parts database
Understanding SQL queries – the basics
    Query evaluation
    Grouped queries
    Explicitly grouped queries
    Implicitly grouped queries
A motivating information request
Joining tables
    Cartesian or implicit joins
    Understanding explicit joins
    SQL null – first encounter!
Answering complex information requests using views
    Divide and conquer
    SQL null – second encounter!
Answering complex information requests without views
    Temporary tables
    Common table expressions
    Using a scalar-valued subquery in SELECT
    Using a table-valued sub-query in FROM
Missing Information - the SQL null
    Where did nulls come from?
    SQL nulls – the good news!
    SQL nulls – why all the fuss?
    The effect of Unknown during query processing
    Nulls produced by Aggregate Functions
    Is one type of null enough?
    Handling Nulls in SQL-92
    Recommendation
Recursive queries
    Introduction
    Identifying a recursive query
    Processing a recursive query
    A user-defined function
Other useful things to know
    Query optimisation
    Relational algebra and SQL
    Relational calculus and SQL
Working on Assignment 1
    Making an early start on Assignment 2
Review Objectives
Solutions to exercises in module

Objectives

On completion of this module you should be able to:

- comment on the accuracy of SQL being described as a nonprocedural, declarative, or set-at-a-time language
- identify and describe the steps involved in evaluating an SQL query
- avoid errors when using grouped queries
- explain the difference between an explicitly grouped query and an implicitly grouped query, and provide examples of each
- state the rule that applies to a column that appears in the SELECT clause of an explicitly grouped query
- identify advantages and disadvantages of using SQL views to answer complex information requests
- explain how temporary tables or common table expressions provide an alternative to views when answering complex information requests
- explain differences between the Cartesian join of two tables, the inner join of two tables, and the outer join of two tables
- use SQL inner joins and outer joins to answer information requests
- demonstrate how a scalar-valued subquery can be used in the SELECT clause of an SQL query
- demonstrate how a table-valued subquery can be used in the FROM clause of an SQL query
- explain why SQL includes a null, and describe problems associated with using a null to represent missing information
- explain why Ted Codd has proposed two types of null
- explain why Chris Date has rejected the proposal for two types of null
- identify conveniences and risks posed by the SQL null
- identify which of the aggregate functions MIN, MAX, COUNT, SUM, AVG can evaluate to null, and describe conditions under which this occurs
- identify situations in which a null can arise during query
  processing
- describe SQL’s 3-valued logic system and explain how the value Unknown can arise during query processing
- identify and fix queries that may fail to fully answer an information request as a consequence of nulls that arise during query processing
- demonstrate how the COALESCE function can be used to manage nulls that arise when processing an SQL query
- describe the concept of a common table expression (CTE), and explain how a CTE can be used to express a recursive query
- explain how the recursive signature of self-reference appears in the SQL implementation of recursive queries
- explain the processing steps involved in producing a result for a recursive SQL query
- briefly explain the process of SQL query optimisation
- explain the relevance of relational algebra to the SQL user
- describe the extent to which SQL is based on relational calculus

Introduction to Week 5

The focus of interest this week is SQL queries. This module starts out with a revision of SQL query processing. A solid understanding of query processing is essential to avoid queries that produce incorrect results. Managers are very trusting of numbers that emerge from computers.

One common cause of incorrect query results is a poor understanding of the SQL null. The SQL null takes us from the comfortable world of 2-valued logic (True and False) into the brave new world of 3-valued logic (True, False and Unknown).

The SQL query language supports a number of powerful features that you may not have met previously. We will explore some in this module. As well as powerful queries, we are also interested in features of the SQL query language that help to simplify complex queries. Some features covered in this module will help.
With the topics covered this week you will be able to:

- finish the views needed for Assignment 1
- develop the queries needed for Assignment 2
- develop the CREATE ASSERTION statement for Assignment 2

Textbook coverage

The textbook Chapter 7 material of interest to this module is in the section New Forms of Join (p 234).

----------------------------------------
Textbook Chapter 7, pages 234 to 239
----------------------------------------

The textbook Chapter 8 material of interest to this module is in the section Additional SQL Statements (p 266).

----------------------------------------
Textbook Chapter 8, pages 266 to 271
----------------------------------------

The Parts database

One of the nice things about relational database technology is that its origins are so well defined. From your previous studies you will know that relational database technology is underpinned by a set of ideas called the relational model. The relational model was first proposed by E.F. Codd (http://en.wikipedia.org/wiki/Edgar_F._Codd) in 1969.

The extract below comes from the March 08 version of the Wikipedia article on the relational model (http://en.wikipedia.org/wiki/Relational_model):

    “The relational model was invented by E.F. (Ted) Codd as a general model of data, and subsequently maintained and developed by Chris Date and Hugh Darwen among others. In The Third Manifesto (first published in 1995) Date and Darwen show how the relational model can accommodate certain desired object-oriented features without compromising its fundamental principles.”

As mentioned in a Wikipedia article on the topic, SQL has its critics: http://en.wikipedia.org/wiki/Sql.

Mentioned in the quote above, Chris Date is a highly regarded author on relational database technology (http://en.wikipedia.org/wiki/Christopher_J._Date).
Chris Date is one of the most articulate of the SQL critics. As well as the way SQL handles missing information, “The Third Manifesto”, which Chris Date wrote with Hugh Darwen, also criticises the way that object-oriented features have been added to SQL.

Introducing the Parts database

Throughout his writings, Chris Date uses a collection of simple tables to illustrate his ideas. We will make use of these tables throughout this module, and collectively refer to them as the Parts database.

Chris Date has been using his example tables for a long time. Tables in the Parts database hold data used by a manufacturer of punched card sorters. Some students may need to refer to the Wikipedia article on “computer programming in the punch card era” for an explanation.

Our Parts database consists of five tables:

- P – part types used by the manufacturer
- S – suppliers of parts used by the manufacturer
- SP – shipments of parts from suppliers
- J – projects to manufacture different products
- SPJ – parts used to manufacture products

Note: The literature is filled with suppliers and parts databases (http://en.wikipedia.org/wiki/Suppliers_and_Parts_database).

Columns, keys, and sample data for the Parts database follow...
Columns…

P table – describing part types used by the manufacturer:

- PNO – unique part type number – px, say
- PNAME – name of part type px
- COLOR – colour of part type px
- WEIGHT – weight in grams of a single part of type px
- CITY – city where parts of type px are held

Note: Chris Date’s US spelling of COLOR will be used in this course.

S table – describing part suppliers:

- SNO – unique supplier number – sx, say
- SNAME – name of supplier sx
- STATUS – numeric status indicator of supplier sx
- CITY – city where supplier sx is located

SP table – describing shipments of parts from suppliers:

- SNO – supplier number – sx, say
- PNO – part type number – px, say
- QTY – number of px parts being shipped by supplier sx

J table – describing projects to manufacture different products:

- JNO – unique project number – jx, say
- JNAME – name of project jx
- CITY – city in which project jx is conducted

SPJ table – describing parts used to manufacture products:

- SNO – supplier number – sx, say
- PNO – part type number – px, say
- JNO – project number – jx, say
- QTY – number of px parts from supplier sx used on project jx

Primary and foreign keys…

P (PNO, PNAME, COLOR, WEIGHT, CITY)
S (SNO, SNAME, STATUS, CITY)
SP (SNO, PNO, QTY)
    SNO references S, PNO references P
J (JNO, JNAME, CITY)
SPJ (SNO, PNO, JNO, QTY)
    SNO references S, PNO references P, JNO references J

Sample data…

P table – describing part types used by the manufacturer:

    PNO  PNAME  COLOR  WEIGHT  CITY
    P1   Nut    Red    12      London
    P2   Bolt   Green  17      Paris
    P3   Screw  Blue   17      Rome
    P4   Screw  Red    14      London
    P5   Cam    Blue   12      Paris
    P6   Cog    Red    19      London

S table – describing part suppliers:

    SNO  SNAME  STATUS  CITY
    S1   Smith  20      London
    S2   Jones  10      Paris
    S3   Blake  30      Paris
    S4   Clark  20      London
    S5   Adams  30      Athens

SP table – describing shipments of parts from suppliers:

    SNO  PNO  QTY
    S1   P1   300
    S1   P2   200
    S1   P3   400
    S1   P4   200
    S1   P5   100
    S1   P6   100
    S2   P1   300
    S2   P2   400
    S3   P2   200
    S4   P2   200
    S4   P4   300
    S4   P5   400

Notes:

- each row describes the number of parts of a given type currently
  being shipped by a given supplier
- a given part type can be supplied by more than one supplier

J table – describing projects to manufacture different products:

    JNO  JNAME   CITY
    J1   Sorter  Paris
    J2   Sorter  Rome
    J3   Sorter  Athens
    J4   Sorter  Athens
    J5   Sorter  London
    J6   Sorter  Oslo
    J7   Sorter  London

SPJ table – describing parts used to manufacture products:

    SNO  PNO  JNO  QTY
    S1   P1   J1   200
    S1   P1   J4   700
    S2   P3   J1   400
    S2   P3   J2   200
    S2   P3   J3   200
    S2   P3   J4   500
    S2   P3   J5   600
    S2   P3   J6   400
    S2   P3   J7   800
    S2   P5   J2   100
    S3   P3   J1   200
    S3   P4   J2   500
    S4   P6   J3   300
    S4   P6   J7   300
    S5   P1   J4   100
    S5   P2   J2   200
    S5   P2   J4   100
    S5   P3   J4   200
    S5   P4   J4   800
    S5   P5   J4   400
    S5   P5   J5   500
    S5   P5   J7   100
    S5   P6   J2   200
    S5   P6   J4   500

Note: Each row describes the number of parts of a given type from a given supplier used on a given project.

Important: The course web site will provide a Microsoft Access and SQL Server implementation of the Parts database.

Understanding SQL queries – the basics

Query evaluation

Understanding the anatomy of an SQL query will help you avoid producing erroneous results. It will also help you debug those produced by others. The result of a query is determined by the five clauses:

    <SELECT clause>
    <FROM clause>
    [<WHERE clause>]       (optional)
    [<GROUP BY clause>]    (optional)
    [<HAVING clause>]      (optional; requires GROUP BY)

Note: Queries can also include an ORDER BY clause and/or a DISTINCT modifier. ORDER BY only affects the order in which result rows are presented; DISTINCT removes duplicate rows from the result. Neither changes the steps described below.

To avoid errors, you must understand how a query result is produced – at least conceptually. A DBMS is not obliged to evaluate a query as suggested below.
However, the result must be the same as that produced by the following steps:

- Step 1: evaluate the table specified in the FROM clause
- Step 2 (optional): filter rows as specified in the WHERE clause
- Step 3 (optional): form groups as specified in the GROUP BY clause
- Step 4 (optional): filter groups as specified in the HAVING clause
- Step 5 (grouped query): produce one output row for each group surviving the HAVING clause (if specified)
- Step 5 (ungrouped query): produce one output row for each row surviving the WHERE clause (if specified)

Given the description above, you might think that the SELECT clause should be placed at the end of the query. Two reasons for designers placing SELECT at the start are:

- to produce a more “structured English query language”
- relational calculus is a “results first” language (more later)

Important points:

- the FROM clause of every query evaluates to a single table
- queries are either grouped or ungrouped

Grouped queries

As mentioned above, queries are either grouped or ungrouped. A simple ungrouped query is:

    SELECT * FROM S;

A simple grouped query:

    SELECT CITY, COUNT(SNO) FROM S GROUP BY CITY;

For a grouped query, a single output row is produced for each group that survives the HAVING clause (if specified). For the query above, a group is formed for each CITY value in the table. The output row includes the CITY value for the group, plus a count of rows in the group.

Exercise 1
----------------------------------------
The query below produces an error. Why? Try to answer this question before continuing.

    SELECT CITY, SNAME, COUNT(*) FROM S GROUP BY CITY;
----------------------------------------

The query above produces the following error message from SQL Server 2005. Does this make sense?

    Column 'S.SNAME' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
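One way to see why the message makes sense is to look at the rows that fall into each CITY group before the groups are collapsed. The sketch below is only an illustration against the sample S table. It is an ordinary ungrouped query, and the ORDER BY is there purely to place rows of the same group next to each other.

```sql
-- List the supplier rows group by group (one group per CITY value).
-- This is an ungrouped query: every row of S appears once in the result.
SELECT CITY, SNO, SNAME
FROM S
ORDER BY CITY, SNO;

-- With the sample data the groups are:
--   Athens : S5 Adams
--   London : S1 Smith, S4 Clark
--   Paris  : S2 Jones, S3 Blake
```

The London and Paris groups each hold two different SNAME values, so the grouped query in Exercise 1 has no single SNAME value to place in the one output row allowed per group.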
If we modify the query as shown below, the query does not produce an error. Here, SNAME is “contained in an aggregate function”.

    SELECT CITY, COUNT(SNAME), COUNT(*) FROM S GROUP BY CITY;

The query processor is happy now since, for each group, it is in a position to produce a single output value for each expression in the SELECT clause. Previously, it had a dilemma – a group might have more than one SNAME value. How would it decide which SNAME to display?

Explicitly grouped queries

A query that includes a GROUP BY clause is an explicitly grouped query. The rule for a column that appears in the SELECT clause of an explicitly grouped query is that it must either:

(1) be “contained” in an aggregate function in the SELECT clause, or
(2) be “contained” in the GROUP BY clause of the query

A more formal description of (1) above would be to say that the column must appear as an argument to an aggregate function. The bottom line here is that the SELECT clause must produce a single result row for each group.

For a column that appears in the GROUP BY clause, all rows in a group will have the same value for that column. Consequently, that value may appear in the result row for the group – it does not need to appear as the argument to an aggregate function in the SELECT clause.

For a column that does not appear in the GROUP BY clause, the rows in a group may hold different values for that column. Consequently, if that column appears in the SELECT clause, it must appear as an argument to an aggregate function (COUNT, say) – to produce a single output value for the set of input column values.

Implicitly grouped queries

As well as explicitly grouped queries, SQL also supports implicitly grouped queries. Consider the example below. This query does not include a GROUP BY clause. However, it produces a single output row.

    SELECT COUNT(*) FROM S;

The query above is a grouped query. Rows in the S table are treated as a single group to produce one result row for the query.
Indeed, the above query is equivalent to the one below:

    SELECT COUNT(*) FROM S GROUP BY ();

The query below is also a grouped query.

    SELECT SNAME, COUNT(*) FROM S;

SQL Server produces the following error message when presented with the above query:

    Column 'S.SNAME' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

A motivating information request

In the following sections we will use a single information request to motivate our exploration of SQL queries:

    For each and every supplier described in the Parts database, find the sum of the number of parts supplied in the past (already used on projects) and the number of parts they have been requested to supply in the future (currently being shipped).

To answer this request, data must be obtained from the SP and SPJ tables. The SPJ table describes parts that have been used in the past. The SP table describes parts that are currently being shipped.

A summary of suppliers’ parts “used in the past” is obtained from the query below, producing the following result:

    SELECT SNO,SUM(QTY) FROM SPJ GROUP BY SNO;

    SNO  SUM(QTY)
    S1   900
    S2   3200
    S3   700
    S4   600
    S5   3100

A summary of suppliers’ parts “currently being shipped” is obtained from the query below, producing the following result:

    SELECT SNO,SUM(QTY) FROM SP GROUP BY SNO;

    SNO  SUM(QTY)
    S1   1300
    S2   700
    S3   200
    S4   900

We would like to combine this data to produce the following result.

    SNO  TOTAL
    S1   2200
    S2   3900
    S3   900
    S4   1500
    S5   3100

Joining tables

Cartesian or implicit joins

To answer our information request, data must be drawn from the SP and SPJ tables. A novice SQL user may think that the query below might answer the request. It does not. Try it!

    SELECT SP.SNO, SUM(SP.QTY)+SUM(SPJ.QTY) AS TOTAL
    FROM SP, SPJ
    WHERE SP.SNO = SPJ.SNO
    GROUP BY SP.SNO;

Exercise 2
----------------------------------------
Why does the above query produce the wrong result?
Spend a few minutes trying to answer this question before continuing.
----------------------------------------

We can explore the table resulting from the FROM clause using the query below. This query produces 288 rows. Does that make sense?

    SELECT * FROM SP, SPJ;

The FROM clause specifies a Cartesian join of tables SP and SPJ – 12 rows in SP, 24 rows in SPJ, 288 rows in the result – that does make sense!

We can explore the filter in the WHERE clause using the query below. This query produces 36 rows. Does that make sense?

    SELECT * FROM SP, SPJ WHERE SP.SNO = SPJ.SNO;

Every SP row with one or more related SPJ rows (same SNO value) will appear joined to those rows in the result. That explains why we get 36 rows in the result. But why does that lead to such large TOTAL values?

Let’s put a couple of classic problem solving techniques to work here:

- break complex problems into smaller, simpler problems
- investigate specific cases (similar to debugging a program)

A good case to consider here is S3. The correct TOTAL value for S3 is 900, but we get 1100 from our novice query. Let’s investigate this case using the query below, which produces the following result. Can you see where the 1100 comes from?

    SELECT * FROM SP, SPJ WHERE SP.SNO = SPJ.SNO AND SP.SNO = 'S3';

    SP.SNO  SP.PNO  SP.QTY  SPJ.SNO  SPJ.PNO  SPJ.JNO  SPJ.QTY
    S3      P2      200     S3       P3       J1       200
    S3      P2      200     S3       P4       J2       500

The 1100 comes from the addition of the four QTY values in this result. The single SP.QTY value of 200 appears in both rows and so is counted twice (200 + 200 = 400), and this is added to the two SPJ.QTY values (200 + 500 = 700).

A good understanding of table joins will help you to avoid such errors.

Understanding explicit joins

A Cartesian join is sometimes referred to as an implicit join – the FROM clause does not include the word JOIN. Explicit joins do include the word JOIN. The following three queries produce the same result.
    SELECT * FROM S, SP WHERE S.SNO = SP.SNO;
    SELECT * FROM S JOIN SP ON S.SNO = SP.SNO;
    SELECT * FROM S INNER JOIN SP ON S.SNO = SP.SNO;

The result produced by these queries is known as the inner join of S and SP on SNO. Using the sample data, the result has 12 rows – one for each row in the SP table. Each SP row is joined to the one related row in S. The second query above illustrates that the default explicit join is the inner join.

As well as “inner joins”, SQL has “outer joins”. The outer join preserves rows that have no related row in the joined table. In the sample data, there is no SP row with an SNO value of S5. As such, S5 does not appear in the inner join of S and SP.

If we wish to obtain the number of part types currently being shipped by each and every supplier, we cannot use the query below – S5 does not appear in the result.

    SELECT S.SNO, SNAME, COUNT(PNO) AS NbrPartTypes
    FROM S INNER JOIN SP ON S.SNO = SP.SNO
    GROUP BY S.SNO, SNAME;

However, if we change the inner join to an outer join, we do get the required result.

    SELECT S.SNO, SNAME, COUNT(PNO) AS NbrPartTypes
    FROM S LEFT OUTER JOIN SP ON S.SNO = SP.SNO
    GROUP BY S.SNO, SNAME;

    SNO  SNAME  NbrPartTypes
    S1   Smith  6
    S2   Jones  2
    S3   Blake  1
    S4   Clark  3
    S5   Adams  0

Let’s explore the outer join…

The query below displays the left outer join of S and SP on SNO. The result has 13 rows – 12 rows from the inner join, plus the S5 row from S joined to a row of nulls.

    SELECT * FROM S LEFT OUTER JOIN SP ON S.SNO = SP.SNO;

    S.SNO  SNAME  STATUS  CITY    SP.SNO  PNO   QTY
    S1     Smith  20      London  S1      P1    300
    S1     Smith  20      London  S1      P2    200
    S1     Smith  20      London  S1      P3    400
    S1     Smith  20      London  S1      P4    200
    S1     Smith  20      London  S1      P5    100
    S1     Smith  20      London  S1      P6    100
    S2     Jones  10      Paris   S2      P1    300
    S2     Jones  10      Paris   S2      P2    400
    S3     Blake  30      Paris   S3      P2    200
    S4     Clark  20      London  S4      P2    200
    S4     Clark  20      London  S4      P4    300
    S4     Clark  20      London  S4      P5    400
    S5     Adams  30      Athens  NULL    NULL  NULL

The outer join comes in three flavours – LEFT, RIGHT and FULL.
The LEFT join preserves rows from the table on the left, the RIGHT join preserves rows from the table on the right, and the FULL join preserves rows from both tables.

The keyword OUTER is optional. The following queries are equivalent.

    SELECT * FROM S LEFT OUTER JOIN SP ON S.SNO = SP.SNO;
    SELECT * FROM S LEFT JOIN SP ON S.SNO = SP.SNO;

SQL null – first encounter!

As mentioned previously, SQL has been criticised for the way it handles missing information. In SQL, missing information is represented by a null. We will explore why SQL nulls have attracted so much attention later. By way of an introduction to the topic, however, see if you can predict the NbrParts value for S5 produced by the query below.

Exercise 3
----------------------------------------
What NbrParts value would you anticipate for S5 from the query below?

    SELECT S.SNO, SNAME, SUM(QTY) AS NbrParts
    FROM S LEFT JOIN SP ON S.SNO = SP.SNO
    GROUP BY S.SNO, SNAME;

Now, check your answer.
----------------------------------------

Answering complex information requests using views

Views can help to solve complex information requests by breaking the request down into a collection of smaller, simpler requests. The method of layering built-in functions described in the textbook (p 244) illustrates the idea.

Divide and conquer

SQL views can be used to break our information request (repeated below) into three simpler information requests.

    For each and every supplier described in the Parts database, find the sum of the number of parts supplied in the past (already used on projects) and the number of parts they have been requested to supply in the future (currently being shipped).

If we create two views summarising data from SP and SPJ (shown below), perhaps we can join these views to produce the required result.
    CREATE VIEW V1(SNO,ORDERED) AS
        SELECT SNO, SUM(QTY) FROM SP GROUP BY SNO;

    CREATE VIEW V2(SNO,USED) AS
        SELECT SNO, SUM(QTY) FROM SPJ GROUP BY SNO;

Queries below will return the contents of V1 and V2.

    SELECT * FROM V1;

    SNO  ORDERED
    S1   1300
    S2   700
    S3   200
    S4   900

    SELECT * FROM V2;

    SNO  USED
    S1   900
    S2   3200
    S3   700
    S4   600
    S5   3100

Exercise 4
----------------------------------------
Create V1 and V2 as suggested above. Now, formulate a query to answer our information request using these two views.

Note: In Microsoft Access you create queries, not views.
----------------------------------------

Joining V1 and V2 using the query below produces the following result.

    SELECT V1.SNO, ORDERED+USED AS TOTAL
    FROM V1, V2
    WHERE V1.SNO = V2.SNO;

    SNO  TOTAL
    S1   2200
    S2   3900
    S3   900
    S4   1500

Exercise 5
----------------------------------------
Explain why S5 is missing from the result.

Is the result different using INNER JOIN rather than a Cartesian join?

Is the result different using an OUTER JOIN?

Can an OUTER JOIN be used in V1 to include S5 in the result?

If an outer join is used in V1, what result is produced by the query above?
----------------------------------------

SQL null – second encounter!

We can use outer joins in the definition of V1 and V2. One problem with this approach is that nulls may appear in the result. We will return to this issue later. Another way to include all SNO values in V1 and V2 is shown below.

Exercise 6
----------------------------------------
Redefine V1 and V2 as shown below.

Note: A DBMS limited to SQL-86 will not support UNION in the definition of a view.
    CREATE VIEW V1(SNO,ORDERED) AS
        SELECT SNO,SUM(QTY) FROM SP GROUP BY SNO
        UNION
        SELECT SNO,0 FROM S WHERE SNO NOT IN (SELECT SNO FROM SP);

    CREATE VIEW V2(SNO,USED) AS
        SELECT SNO,SUM(QTY) FROM SPJ GROUP BY SNO
        UNION
        SELECT SNO,0 FROM S WHERE SNO NOT IN (SELECT SNO FROM SPJ);

With V1 and V2 defined as above, check that we can join V1 and V2 to produce the required result.
----------------------------------------

Answering complex information requests without views

We have found a solution to our information request using views. Unfortunately, creating views to answer every non-trivial information request will result in a large number of views in a database. After a while, keeping track of which views are still used by an application becomes difficult.

Temporary tables

Most modern DBMS support temporary tables. A temporary table is a table that is automatically discarded at the end of a session.

Note: The term session describes the dialog that occurs over a database connection. A connection must be established between a client program (like SQL Server Management Studio) and a database server before SQL statements can be submitted from the client to the server.

For our information request, we could create temporary tables T1 and T2, execute the queries we formulated for V1 and V2 to populate T1 and T2, and then execute our final query against T1 and T2 (instead of V1 and V2). Using SQL Server, tables created with a first character of # are temporary tables.

The solution proposed above will be demonstrated in the lecture using SQL Server.

Temporary tables provide a feasible solution then, if not the most efficient. But we can do better! As well as temporary tables, mature DBMS support a feature that might be described as temporary views. The formal name for this feature is common table expressions.

Common table expressions

Introduced in SQL:1999, common table expressions (CTEs) might be described as temporary views.
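As a minimal sketch of the syntax (the CTE name SHIPPED and the threshold of 500 are chosen here purely for illustration), a CTE is declared with WITH, given a name and a column list, and then used like a view in the query that follows it. It exists only for the duration of that one statement.

```sql
-- SHIPPED is a hypothetical CTE name; it behaves like a view that
-- lasts only for this single statement.
WITH SHIPPED(SNO, ORDERED) AS (
    SELECT SNO, SUM(QTY)
    FROM SP
    GROUP BY SNO
)
SELECT SNO, ORDERED
FROM SHIPPED
WHERE ORDERED > 500;

-- With the sample data the result rows are:
--   S1 1300, S2 700, S4 900
```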
The following query uses CTEs to answer our information request.

    WITH
    V1(SNO,ORDERED) AS (
        SELECT SNO,SUM(QTY) FROM SP GROUP BY SNO
        UNION
        SELECT SNO,0 FROM S WHERE SNO NOT IN (SELECT SNO FROM SP)),
    V2(SNO,USED) AS (
        SELECT SNO,SUM(QTY) FROM SPJ GROUP BY SNO
        UNION
        SELECT SNO,0 FROM S WHERE SNO NOT IN (SELECT SNO FROM SPJ))
    SELECT V1.SNO, ORDERED+USED AS TOTAL
    FROM V1,V2
    WHERE V1.SNO = V2.SNO;

CTEs were introduced to support recursive queries (a topic for later), but can also be used to simplify complex queries, although the query above is not that simple. Can we do better? Yes we can!

Using a scalar-valued subquery in SELECT

One powerful feature of SQL is the use of a scalar-valued subquery in the SELECT clause of a query. The query below produces the following result.

    SELECT SNO,
           (SELECT SUM(QTY) FROM SP WHERE SNO=S.SNO) +
           (SELECT SUM(QTY) FROM SPJ WHERE SNO=S.SNO) AS TOTAL
    FROM S;

    SNO  TOTAL
    S1   2200
    S2   3900
    S3   900
    S4   1500
    S5   NULL

Note: A subquery is a query enclosed in round brackets, with no ORDER BY clause.

Exercise 7
----------------------------------------
This query does not fully answer our information request. Why?

Hint: How does SQL represent missing information?
----------------------------------------

The query below illustrates the power of combining scalar-valued subqueries with CTEs. Each row in the query result describes a part type received from a supplier – the number of parts received and the percentage of all parts of that type received from all suppliers. The first six rows of the result are shown following the query.
    WITH V(SNO,PNO,USED) AS (
        SELECT SNO,PNO,SUM(QTY) FROM SPJ GROUP BY SNO, PNO )
    SELECT SNO,PNO,USED,
           (SELECT 100*V1.USED/SUM(USED) FROM V WHERE PNO = V1.PNO) AS PERCENT
    FROM V AS V1;

    SNO  PNO  USED  PERCENT
    S1   P1   900   90
    S2   P3   3100  88
    S2   P5   100   9
    S3   P3   200   5
    S3   P4   500   38
    S4   P6   600   46
    :    :    :     :

Using a table-valued sub-query in FROM

Another powerful feature of SQL is the use of a table-valued subquery in the FROM clause of a query. The query below correctly answers our information request using the sample data (see following exercise).

    SELECT SNO, SUM(QTY) AS TOTAL
    FROM ( SELECT SNO,QTY FROM SP
           UNION ALL
           SELECT SNO,QTY FROM SPJ ) AS V
    GROUP BY SNO;

Exercise 8
----------------------------------------
Why does the above query include the word ALL after UNION?

The above query may not always fully answer our information request. The start of the request is repeated below. Can you see how the query above might possibly produce an incomplete result?

    For each and every supplier described in the Parts database…

Extend the table-valued subquery in the FROM clause above to produce a query that will always fully answer our information request.
----------------------------------------

Missing Information - the SQL null

In SQL, missing information is represented by an object called a null. The SQL null has a controversial history. The user of SQL who does not understand nulls is at risk of producing incorrect query results.

It is important to note that nulls were not invented by the computer scientists who designed SQL – they are an inherent part of the relational model. Ted Codd proposed twelve rules to test the credentials of early RDBMS (http://en.wikipedia.org/wiki/Codd%27s_12_rules). The extract below comes from the Wikipedia article on the topic in March 08.

    Rule 3: Systematic treatment of null values: The DBMS must allow each field to remain null (or empty).
Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number," in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Where did nulls come from?

The keys of a relational database are very important. C.J. Date has said that foreign keys and candidate keys are “the glue that holds a relational database together”. In a relational database, every value of a foreign key must match a value of the referenced candidate key. However, some foreign keys do not always have a value. For example, we may wish to record details of an employee in an E table who has not yet been assigned to a department (described in a D table).

    D (DNO, DName, …);
    E (ENO, EName, DNO, …) DNO references D;

The concept of a null was introduced into the relational model to represent the thing you have when you don’t have a value for a foreign key. It was subsequently used to represent any missing information.

Missing information, and how to handle it, has been a hot research topic for many years. It remains a controversial topic. Indeed, Ted Codd (the creator of the relational model), and Chris Date (one of its most respected advocates), have widely differing views about how missing information should be handled. Codd continues to support the null. Date believes that nulls should be avoided until we have a better solution.

The facts are:
- SQL does have a null
- it is not going away
- it is impossible to avoid
- it introduces complexity
- failure to understand it will lead to errors

SQL nulls – the good news!

Out of respect for Ted Codd, we should start by considering the main positive aspect of nulls. That is, that they provide a “representation of missing information and inapplicable information that is systematic”.
To list suppliers with missing information, the method is the same regardless of data type:

    SELECT * FROM S WHERE STATUS IS NULL;
    SELECT * FROM S WHERE SNAME IS NULL;

Note: We will briefly explore the distinction between missing data and inapplicable data later. Perhaps we can be thankful that SQL has only one type of null; Ted Codd has proposed two (http://en.wikipedia.org/wiki/Relational_Model/Tasmania).

SQL nulls – why all the fuss?

Consider the following query, formulated to obtain a list of suppliers currently shipping more parts than supplier S3.

    SELECT * FROM S
    WHERE SNO IN
        (SELECT SNO
         FROM SP
         GROUP BY SNO
         HAVING SUM(QTY) >
            (SELECT SUM(QTY) FROM SP WHERE SNO = 'S3'));

Exercise 9
----------------------------------------------------------------------
Check that this query produces the correct result. Will it always produce the correct result?

Since S5 is currently shipping no parts, what result might you expect if we replace S3 with S5? What result do you get when making this replacement?
----------------------------------------------------------------------

Chris Date on nulls:

    “the SQL approach of using a null to represent missing information is not a satisfactory solution to that problem. Indeed, it is my opinion that the SQL null introduces far more problems than it solves…
    …it is all too easy (in the presence of nulls) to formulate a query that looks correct, but in fact is not – even if the user is quite familiar with the way nulls behave”

The query above looks correct. Indeed, it produces the correct result for S3. However, when we replace S3 with S5, no result is obtained. But supplier S5 is shipping no parts, so we would expect all suppliers shipping parts to appear in the result.

The problem here is that the SQL null takes us from the familiar world of 2-valued logic into the world of 3-valued logic. In this world, the familiar values of True and False are joined by a third value – Unknown!
Unknown results from any comparison involving null. So, the comparison 3 > null evaluates to Unknown. The truth tables are:

    AND | T  F  U        OR | T  F  U        NOT |
    ----+---------      ----+---------      ----+---
     T  | T  F  U        T  | T  T  T        T  | F
     F  | F  F  F        F  | T  F  U        F  | T
     U  | U  F  U        U  | T  U  U        U  | U

Aside: You sometimes see the phrase “null value”. However, a null is a “no value here” marker. Try to avoid using this phrase.

The effect of Unknown during query processing

Why do we get no result from the query below?

    SELECT * FROM S
    WHERE SNO IN
        (SELECT SNO
         FROM SP
         GROUP BY SNO
         HAVING SUM(QTY) >
            (SELECT SUM(QTY) FROM SP WHERE SNO = 'S5'));

Since there are no rows in SP for supplier S5, the inner subquery (repeated below) does not produce a result. Computer scientists designing SQL had to decide how to handle this situation.

    (SELECT SUM(QTY) FROM SP WHERE SNO = 'S5')

Since a null had been introduced for foreign keys, the decision was made to treat a subquery that produces no result in the same way. The result of the inner subquery is null.

What then is the result of the outer subquery (repeated below)?

    (SELECT SNO
     FROM SP
     GROUP BY SNO
     HAVING SUM(QTY) >
        (SELECT SUM(QTY) FROM SP WHERE SNO = 'S5' ))

The outer subquery is evaluated once for each group of SP rows. For each group, the sum of QTY values is compared to the result of the inner subquery – a null. As mentioned above, a comparison involving null evaluates to Unknown. Consequently, the HAVING clause does not evaluate to True for any group. So, the outer subquery produces an empty set of SNO values.

Exercise 10
----------------------------------------------------------------------
Using no new SQL features introduced in this module, how can the query at the start of this section be modified to produce the correct result?
----------------------------------------------------------------------

Aside: Some authors suggest that, instead of Unknown, the result of a comparison involving null is null.
Either way, the SQL query processor is looking for WHERE and HAVING expressions that evaluate to True.

Nulls produced by Aggregate Functions

Computer scientists who designed SQL also had to decide how aggregate functions (SUM, COUNT, AVG, MIN, MAX) should handle nulls. The following decisions were made:
- remove nulls from the set to which an aggregate function is applied
- SUM, AVG, MIN and MAX return null when applied to an empty set
- COUNT returns zero when applied to an empty set

Note: Chris Date argues that the SUM of an empty set should return 0 since 0 is the identity value under addition; that is, 0 + x evaluates to x.

To illustrate, consider the following “E” table holding employee data.

    ENO  EName      DNO   Bonus
    E1   J. Smith   NULL  NULL
    E2   D. Brown   NULL  NULL
    E3   A. Sharma  NULL  NULL
    E4   B. Lee     D1    500
    E5   S. Green   D1    NULL
    E6   Q. Han     D2    500
    E7   M. Patel   D2    400
    E8   D. Jones   D2    300
    E9   G. Bush    D2    300

Notes:
- S. Green will earn a bonus, but the amount is currently unknown.
- Executives are not assigned to a department and do not earn a bonus.

The query below produces the following result.

    SELECT DNO, COUNT(ENO), COUNT(Bonus), SUM(Bonus)
    FROM E
    GROUP BY DNO;

    DNO   COUNT(ENO)  COUNT(Bonus)  SUM(Bonus)
    NULL  3           0             NULL
    D1    2           1             500
    D2    4           4             1500

Note: Nulls are considered equal when grouping rows (the rows with a null DNO are placed in the same group), but not when compared for equality (null = null evaluates to Unknown).

Exercise 11
----------------------------------------------------------------------
Based on this data, how many bonuses will be paid by department D1? Explain why the value 1 is shown as the count of bonuses for D1.
----------------------------------------------------------------------

Is one type of null enough?
The process of removing nulls from the set to which an aggregate function is applied is consistent with the use of a null to represent an inapplicable value, as distinct from an applicable value that is missing:
- a bonus is not applicable to executives
- a bonus is applicable to S. Green, but the value of that bonus is currently unknown
- a COUNT(Bonus) value of 0 for executive bonuses accurately reflects reality
- a COUNT(Bonus) value of 1 for department D1 does not accurately reflect reality – a second bonus will be paid; the value is missing

Noting the distinction between applicable and inapplicable values, Ted Codd proposed two null types – inapplicable and applicable but missing.

Chris Date (not a fan of nulls) would ask how we might handle the case of an employee where it has not been decided if a bonus will be paid. Clearly, it would be wrong to record an “inapplicable” null – a bonus may be applicable. Likewise, it would be wrong to record an “applicable but missing” null – a bonus may not be applicable. In this case, the applicability of a bonus value is the missing information.

Chris Date uses such examples to ask if two types of null are sufficient – perhaps we need three: inapplicable, applicable but missing, unknown if applicable. Chris Date usually follows the question above with another: when will the madness end?

From the above coverage, you will see that missing information is a non-trivial topic. Interested students can learn more about the topic here: http://en.wikipedia.org/wiki/Null_%28SQL%29. As mentioned previously, Chris Date has concluded that “the SQL null introduces far more problems than it solves”.

Handling Nulls in SQL-92

SQL-92 introduced a COALESCE function that can be used to convert a null to a value (zero, perhaps). The COALESCE function takes two or more arguments and returns the value of the first argument that is not null.

You discovered that the query below produces the following result.
    SELECT SNO,
        (SELECT SUM(QTY) FROM SP WHERE SNO = S.SNO)+
        (SELECT SUM(QTY) FROM SPJ WHERE SNO = S.SNO) AS TOTAL
    FROM S;

    SNO  TOTAL
    S1   2200
    S2   3900
    S3   900
    S4   1500
    S5   null

The reason a value does not appear for S5 is, once again, the dreaded SQL null. Problems with nulls extend to the arithmetic operators +, -, * and /. Any arithmetic operation involving null evaluates to null. So null * 10 evaluates to null, 0 + null evaluates to null, etc.

We can use COALESCE to handle the null problem above. If we replace the second SELECT clause item with the following expression, the query will always display a value for each and every supplier.

    (SELECT COALESCE(SUM(QTY),0) FROM SP WHERE SNO = S.SNO)+
    (SELECT COALESCE(SUM(QTY),0) FROM SPJ WHERE SNO = S.SNO)

Note: With each use of COALESCE above, two arguments are provided – SUM(QTY), and the literal value 0. If the aggregate function returns null, the value of the second argument (0) is returned. This has the effect of converting nulls to zeros.

Exercise 12
----------------------------------------------------------------------
Using COALESCE as described above, check that the resulting query returns a TOTAL value for each and every supplier.

Note: Microsoft Access does not support COALESCE. However, it provides functions to achieve the same effect. Replace the first subquery below with the second:

    (SELECT COALESCE(SUM(QTY),0) FROM SP WHERE SNO=S.SNO)

    (SELECT IIF(IsNull(SUM(QTY)),0,SUM(QTY)) FROM SP WHERE SNO=S.SNO)

Use on-line help to explore the IIF and IsNull functions.

Advanced: How does the fact that 0 + null evaluates to null affect Chris Date’s suggestion that the SUM of an empty set should evaluate to 0?
----------------------------------------------------------------------

Recommendation

Since the SQL null creates such problems, Chris Date suggests that we try to minimise the number of nulls we must handle. He suggests that, wherever possible, nulls should be avoided in base tables.
It was partly due to Chris Date that SQL-92 introduced support for default values. When a default value is specified for a column, the default value is inserted into the column if the INSERT operation does not provide a value for that column. See the WORK table on page 277 of the textbook for an example.

Exercise 13
----------------------------------------------------------------------
Advanced: How useful are default values for foreign keys?
----------------------------------------------------------------------

Even if no nulls are admitted to base tables, the SQL user will still encounter nulls during query processing. Nulls arise when:
- rows are preserved in an outer join
- SUM, AVG, MIN and MAX are applied to an empty set, and
- a subquery does not produce a result

There is a good Wikipedia article on the topic: http://en.wikipedia.org/wiki/Null_%28SQL%29

Recommendation: Beware the SQL null.

Recursive queries

Earlier in this module you were introduced to Common Table Expressions (CTEs). CTEs bring support for recursive queries to SQL. Recursion is a powerful technique. The concept finds expression in many areas of computer science.

Introduction

To demonstrate the use of recursive queries we will use a table called E describing employees of some enterprise. Each row in E holds an employee number (ENO), employee name (EName) and the employee number of the employee’s boss (BossENO).

    E (ENO, EName, BossENO) BossENO references E;

    ENO  EName      BossENO
    E1   J. Smith   NULL
    E2   D. Brown   E1
    E3   A. Sharma  E1
    E4   B. Lee     E2
    E5   S. Green   E2
    E6   Q. Han     E3
    E7   M. Patel   E3
    E8   D. Jones   E7
    E9   G. Bush    E7

The E table captures a hierarchical “boss of” relationship:

    J. Smith is boss of
        D. Brown is boss of
            B. Lee
            S. Green
        A. Sharma is boss of
            Q. Han
            M. Patel is boss of
                D. Jones
                G. Bush

The query below lists all “subordinates” to J. Smith (ENO = E1):

    WITH Sub(ENO) AS (
        SELECT ENO FROM E WHERE BossENO = 'E1'
        UNION ALL
        SELECT E.ENO FROM E JOIN Sub S ON E.BossENO = S.ENO )
    SELECT ENO FROM Sub;

Identifying a recursive query

Previously we described CTEs as temporary views. Here we build a temporary view called Sub. Sub has a single column – ENO. Sub will hold the ENO value of each employee subordinate to J. Smith (ENO = E1). Notice that the last line below is a simple query returning ENO values from the temporary view Sub.

    WITH Sub(ENO) AS (
        SELECT ENO FROM E WHERE BossENO = 'E1'
        UNION ALL
        SELECT E.ENO FROM E JOIN Sub S ON E.BossENO = S.ENO )
    SELECT ENO FROM Sub;

The recursive nature of this query is seen in the definition of CTE Sub – extracted below:

    WITH Sub(ENO) AS (
        SELECT ENO FROM E WHERE BossENO = 'E1'
        UNION ALL
        SELECT E.ENO FROM E JOIN Sub S ON E.BossENO = S.ENO )

Notice that the second query in the UNION expression mentions Sub. So, Sub is defined in terms of itself. This is the self-referencing signature of recursion.

Note: Recursive queries include a recursive CTE.

Processing a recursive query

A recursive CTE has two queries joined by UNION ALL. The first query specifies anchor rows in the CTE result. The second query defines chain rows – added recursively to the result. The second query is evaluated repeatedly until either no more chain rows are added to the result, or a recursion limit is reached.
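The anchor-then-chain evaluation described above can be tried out with a short script. What follows is a minimal sketch, not part of the course materials: it assumes Python's built-in sqlite3 module and SQLite's WITH RECURSIVE spelling (other products, such as SQL Server, use plain WITH), and the names subordinates and evaluation_rounds are hypothetical helpers introduced only for this illustration. The first function runs the module's query with the boss supplied as a parameter; the second mimics the evaluation by hand, tracking only the rows added in each round, which gives the same final result for this hierarchy.

```python
import sqlite3

# Rows copied from the E table shown in this module.
EMPLOYEES = [
    ("E1", "J. Smith", None),
    ("E2", "D. Brown", "E1"),
    ("E3", "A. Sharma", "E1"),
    ("E4", "B. Lee", "E2"),
    ("E5", "S. Green", "E2"),
    ("E6", "Q. Han", "E3"),
    ("E7", "M. Patel", "E3"),
    ("E8", "D. Jones", "E7"),
    ("E9", "G. Bush", "E7"),
]

# The module's query, with RECURSIVE added (SQLite spelling, an assumption)
# and the boss supplied as a parameter instead of the literal 'E1'.
RECURSIVE_SQL = """
WITH RECURSIVE Sub(ENO) AS (
    SELECT ENO FROM E WHERE BossENO = ?                   -- anchor rows
    UNION ALL
    SELECT E.ENO FROM E JOIN Sub S ON E.BossENO = S.ENO   -- chain rows
)
SELECT ENO FROM Sub ORDER BY ENO;
"""


def subordinates(boss_eno):
    """Run the recursive CTE against an in-memory copy of table E."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE E (ENO TEXT PRIMARY KEY, EName TEXT, BossENO TEXT)"
    )
    conn.executemany("INSERT INTO E VALUES (?, ?, ?)", EMPLOYEES)
    rows = conn.execute(RECURSIVE_SQL, (boss_eno,)).fetchall()
    conn.close()
    return [eno for (eno,) in rows]


def evaluation_rounds(boss_eno):
    """Mimic the evaluation by hand: list the rows added in each round."""
    added = [eno for eno, _, boss in EMPLOYEES if boss == boss_eno]  # anchor
    rounds = []
    while added:  # stop as soon as a round adds no new rows
        rounds.append(added)
        added = [eno for eno, _, boss in EMPLOYEES if boss in added]  # chain
    return rounds


if __name__ == "__main__":
    print(subordinates("E1"))
    print(evaluation_rounds("E1"))
```

Calling evaluation_rounds('E1') returns one list per round, so the point at which no new rows are added (and evaluation stops) is easy to see.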
In our example, the anchor query (below) produces the following result (shown horizontally):

    SELECT ENO FROM E WHERE BossENO = 'E1'

    ENO|E2,E3

The first evaluation of the chain query (below), applied against the value of Sub above, produces the following result (again, shown horizontally):

    SELECT E.ENO FROM E JOIN Sub S ON E.BossENO = S.ENO

    Sub: ENO|E2,E3,E4,E5,E6,E7

The second evaluation of the chain query, applied against the value of Sub above, produces the following result:

    Sub: ENO|E2,E3,E4,E5,E6,E7,E8,E9

The third evaluation of the chain query, applied against the value of Sub above, produces the same result – no new rows are added to the result.

Recursive evaluation of the chain query terminates when no new rows are added to the result. Having evaluated Sub, the query referencing Sub is run to produce the query output.

A user-defined function

Just for fun, the following T-SQL statement defines a function that returns a table of ENO values for employees subordinate to a given employee.

    CREATE FUNCTION SubsTo(@ENO char(6))
    RETURNS @Sub TABLE(ENO char(6))
    AS
    BEGIN
        WITH Sub(ENO) AS (
            SELECT ENO FROM E WHERE BossENO = @ENO
            UNION ALL
            SELECT E.ENO FROM E JOIN Sub S ON E.BossENO = S.ENO )
        INSERT INTO @Sub SELECT ENO FROM Sub;
        RETURN;
    END;

Having created this function, we can use it in the query below, producing the following result.

    SELECT * FROM DBO.SubsTo('E2');

    ENO|E4,E5

Note: We said that the FROM clause of an SQL query always evaluates to a single table. Here the table is returned by the SubsTo function.

Other useful things to know

Query optimisation

Unfortunately, the topic of query optimisation is not covered in the textbook. It is a topic of interest to the database application developer and database administrator. A thorough investigation of this topic falls outside the scope of the course. However, by way of introduction, we briefly consider the execution of a query that includes a Cartesian join.
As mentioned previously, the Cartesian join of two tables is a table consisting of all possible combinations of rows from the two tables. The following query produces the Cartesian join of S and SP:

    SELECT * FROM S,SP;

Normally, we join related tables, and we are only interested in joining related rows. The query below lists related rows from S and SP. For this query, it is conceptually correct to picture the DBMS applying the WHERE clause filter to the Cartesian join of S and SP. In practice, the disk I/O, CPU time, and memory required to process the query may be reduced through the use of an index.

    SELECT * FROM S,SP WHERE S.SNO = SP.SNO;

Let’s assume that tables S and SP are both large. If the percentage of S rows with a related SP row is small, an efficient processing plan might read SP rows by scanning the SP table and use an index on S.SNO to read S rows of interest.

If the query were extended as shown below, an SP.PNO index might be used to read the SP rows of interest, and then an index on S.SNO used to read S rows of interest. This could significantly reduce processing time.

    SELECT * FROM S,SP WHERE S.SNO = SP.SNO AND PNO = 'P1';

The query optimiser selects a processing plan by costing candidate plans expressed as relational algebra expressions. Older optimisers are rule-based estimators – costing a plan directly from the operations used. Modern optimisers are cost-based estimators – costing a plan using statistics on tables and indexes held in the system catalog. Cost-based estimation is more expensive, but more accurate. Interested students are referred to the Wikipedia article on the subject: http://en.wikipedia.org/wiki/Query_optimization.

Relational algebra and SQL

Students of this course have met relational algebra previously, and possibly also relational calculus. A basic understanding of these languages provides a solid platform for understanding SQL. Here, we explore the relationships between relational algebra and SQL.
One of the nice things about Chris Date’s writing on relational algebra is that he uses English words for operators instead of mathematical symbols. In COIT12167 Database Use & Design, a reading on relational algebra is provided, written by Chris Date. The query below is followed by an equivalent relational algebra expression using Chris Date’s syntax.

    SELECT SNAME FROM S,SP WHERE S.SNO = SP.SNO AND PNO = 'P1';

    ((S TIMES SP) WHERE S.SNO = SP.SNO AND PNO = 'P1')[SNAME]

To illustrate the role of relational algebra in query optimisation, some alternate execution plans are shown below.

    ((S TIMES (SP WHERE PNO = 'P1'))
        WHERE S.SNO = SP.SNO) [SNAME]

    (((S[SNO,SNAME]) TIMES (SP WHERE PNO = 'P1'))
        WHERE S.SNO = SP.SNO) [SNAME]

    (((S[SNO,SNAME]) TIMES ((SP WHERE PNO = 'P1')[SNO]))
        WHERE S.SNO = SP.SNO) [SNAME]

What use is relational algebra to the database application developer or database administrator?
- SQL provides direct implementation of three relational operators: UNION, INTERSECT, and EXCEPT (http://en.wikipedia.org/wiki/Union_(SQL))
- the student of relational algebra will know that UNION is a set operator; and, hence, will be in a position to appreciate the difference between UNION and UNION ALL
- further reading on the subject of query optimisation will assume familiarity with relational algebra

Note: Most textbooks that cover relational algebra use mathematical symbols to represent operators. The Wikipedia article does likewise: http://en.wikipedia.org/wiki/Relational_algebra.

Relational calculus and SQL

You will often see claims that SQL is based on relational calculus. It is said that SQL is a non-procedural or declarative language, in contrast with procedural or imperative languages like Java.
Wikipedia provides articles on declarative and imperative programming:
- declarative: http://en.wikipedia.org/wiki/Declarative_programming
- imperative: http://en.wikipedia.org/wiki/Imperative_programming

With the introduction of procedural extensions in SQL:1999 (SQL/PSM), the claim that SQL is non-procedural is no longer valid. However, developing stand-alone queries might still be described as declarative programming: an SQL query does not specify a procedure to extract data of interest – it simply declares the data of interest.

Aside: Given the 5-step processing model presented earlier, you might think that an SQL query does specify a process; and, so, is not declarative. The purpose of the model provided was to explain the semantics of an SQL query. The model represents one possible SQL query execution plan. As you know, query optimisation will consider many possible execution plans for any given query.

Some points of interest:
- it is not important how a DBMS implements an SQL query
- it is important that the result produced by a DBMS is consistent with the 5-step process described earlier; a DBMS running on a quantum computer may take a very different approach (http://en.wikipedia.org/wiki/Quantum_computer)
- it is the high level of abstraction in the relational model that enables query optimisation; this abstraction may also enable very different styles of query processing in the future – more instantaneous than procedural, perhaps
- bottom line: an SQL query is not intrinsically procedural

One of the nice things about Chris Date’s writings on relational calculus is that he uses English words for the existential and universal quantifiers – EXISTS and FORALL.

The relational calculus comes in two flavours – domain and tuple. SQL is based on the tuple calculus. The tuple calculus uses variables that range over tuples in a relation. (The domain calculus uses variables that range over values in a domain.)
The query below is followed by an equivalent relational tuple calculus expression using Chris Date’s operators.

    SELECT SNAME FROM S,SP WHERE S.SNO = SP.SNO AND PNO = 'P1';

    RANGE OF SX IS S
    RANGE OF SPX IS SP
    SX.SNAME WHERE EXISTS SPX (SPX.SNO = SX.SNO AND SPX.PNO = 'P1')

An English translation is:

    The SNAME value of any S tuple where there exists an SP tuple with the same SNO value and a PNO value of P1.

SQL provides an implementation of the existential quantifier – EXISTS. An SQL query that is equivalent to the expression above is:

    SELECT SNAME FROM S SX
    WHERE EXISTS (
        SELECT *
        FROM SP SPX
        WHERE SX.SNO = SPX.SNO
        AND SPX.PNO = 'P1');

Similarities between the above SQL query and relational calculus expression illustrate the claim that SQL is based on relational calculus.

Sadly, SQL does not provide an implementation of the universal quantifier FORALL. The expression below evaluates to names of suppliers shipping each and every part type.

    RANGE OF SX IS S
    RANGE OF SPX IS SP
    RANGE OF PX IS P
    SX.SNAME WHERE FORALL PX EXISTS SPX
        (SPX.SNO = SX.SNO AND SPX.PNO = PX.PNO)

An English translation is:

    The SNAME value of any S tuple - sx, say - where for every P tuple there is an SP tuple describing a shipment of that part type from supplier sx.

SQL can avoid implementing FORALL because it is not primitive: FORALL can be expressed in terms of EXISTS:

    RANGE OF SX IS S
    RANGE OF SPX IS SP
    RANGE OF PX IS P
    SX.SNAME WHERE NOT EXISTS PX NOT EXISTS SPX
        (SPX.SNO = SX.SNO AND SPX.PNO = PX.PNO)

In English:

    The SNAME value of any S tuple - sx, say - where there does not exist a P tuple that has no related SP tuple describing a shipment of that part type from supplier sx.

or…

    The name of any supplier where there is no part type that they are not shipping.

The lack of support for FORALL explains why some SQL queries use double NOT EXISTS (see following example).

Important: Since double NOT EXISTS queries are not the easiest to read, always look for simpler equivalent queries.
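To see the double NOT EXISTS pattern at work, the sketch below runs it beside a simpler counting formulation on a tiny data set. It is an illustration only, under stated assumptions: it uses Python's built-in sqlite3 module, the rows in S, P and SP are invented for this example (they are not the module's Parts database), and the function name suppliers_shipping_every_part is hypothetical. The counting form gives the same answer here only because (SNO, PNO) is declared as the key of SP and every PNO in SP also appears in P.

```python
import sqlite3

# Invented sample data (an assumption for this sketch): S1 ships every
# part type, S2 ships only P1, and S3 ships nothing.
SETUP_SQL = """
CREATE TABLE S  (SNO TEXT PRIMARY KEY, SNAME TEXT);
CREATE TABLE P  (PNO TEXT PRIMARY KEY);
CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER, PRIMARY KEY (SNO, PNO));
INSERT INTO S  VALUES ('S1','Smith'), ('S2','Jones'), ('S3','Blake');
INSERT INTO P  VALUES ('P1'), ('P2'), ('P3');
INSERT INTO SP VALUES ('S1','P1',300), ('S1','P2',200),
                      ('S1','P3',400), ('S2','P1',300);
"""

# FORALL expressed as NOT EXISTS ... NOT EXISTS: keep a supplier when
# there is no part type that the supplier is not shipping.
DOUBLE_NOT_EXISTS = """
SELECT SNAME FROM S SX
WHERE NOT EXISTS (
    SELECT * FROM P PX
    WHERE NOT EXISTS (
        SELECT * FROM SP SPX
        WHERE SPX.SNO = SX.SNO AND SPX.PNO = PX.PNO))
ORDER BY SNAME
"""

# Simpler counting form: a supplier's number of SP rows equals the
# number of part types (relies on the SP key noted in the lead-in).
COUNTING = """
SELECT SNAME FROM S
WHERE SNO IN (
    SELECT SNO FROM SP
    GROUP BY SNO
    HAVING COUNT(*) = (SELECT COUNT(*) FROM P))
ORDER BY SNAME
"""


def suppliers_shipping_every_part():
    """Return the supplier names produced by each formulation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    by_not_exists = [name for (name,) in conn.execute(DOUBLE_NOT_EXISTS)]
    by_counting = [name for (name,) in conn.execute(COUNTING)]
    conn.close()
    return by_not_exists, by_counting


if __name__ == "__main__":
    print(suppliers_shipping_every_part())
```

With this data both formulations name only Smith, the one supplier with a shipment for every part type; the supplier with no shipments is rejected by both.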
All of the queries below are equivalent.

    SELECT SNAME FROM S SX
    WHERE NOT EXISTS (
        SELECT * FROM P PX
        WHERE NOT EXISTS (
            SELECT *
            FROM SP SPX
            WHERE SPX.SNO = SX.SNO
            AND SPX.PNO = PX.PNO ));

is equivalent to:

    SELECT SNAME FROM S SX
    WHERE NOT EXISTS (
        SELECT * FROM P PX
        WHERE PNO NOT IN (
            SELECT PNO FROM SP SPX
            WHERE SPX.SNO = SX.SNO ));

is equivalent to:

    SELECT SNAME FROM S
    WHERE SNO IN (
        SELECT SNO FROM SP
        GROUP BY SNO
        HAVING COUNT(*) = ( SELECT COUNT(*) FROM P ));

is equivalent to:

    SELECT SNAME
    FROM S INNER JOIN SP ON S.SNO = SP.SNO
    GROUP BY SNAME
    HAVING COUNT(*) = ( SELECT COUNT(*) FROM P );

What use is relational calculus to you?
- familiarity with the calculus provides a solid platform for using SQL
- SQL provides direct support for the existential quantifier
- relational calculus is based on first order logic – a formal deductive system that attempts to capture the essence of human reasoning (http://en.wikipedia.org/wiki/First-order_predicate_calculus)
- first order logic finds application in many fascinating areas of computer science, including: artificial intelligence, deductive databases, logic programming, program proof, natural language processing (Wikipedia has articles on all of these topics)

Happy querying!

Working on Assignment 1

It is VERY important that you spend time working on Assignment 1 each and every week up to the due date (Friday, Week 5). No tutorial work has been set for this course. Instead, use the time to work on the assignments.

This week you should aim to finish your CREATE VIEW statements for Assignment 1. Then, having completed your script, perform a final check that the script runs correctly. It is vital that the marker can run your script. If not, the marker is unlikely to give full credit for your work.
Having checked that your script runs without errors, the final steps to making your submission are:
- check the documentation requirements for the assignment
- prepare your documentation
- download the marking sheet and check the assessment criteria
- prepare your zip file for submission
- make your submission
- record your submission number
- download your submission file and check it is not corrupted

Making an early start on Assignment 2

With the content of this module fresh in your mind, you might want to take a look at the information requests in Assignment 2. It may take a while to develop queries to answer these requests. The best way to develop these queries may be to return to the task a few times. Make a start now by just reading the requests.

Review Objectives

In preparation for the exam, review the learning objectives identified at the start of this module. The exam for this course is open book. You can take your own notes, an annotated Study Guide, and printed materials into the exam. Prepare for the exam now by making any notes that will help demonstrate you can satisfy the learning objectives identified at the start of this module.

Exercises

Exercises 1 to 13 are integrated into the body of the module.

14. In your own words, explain why the information request below is not correctly answered by the following query.

    For each and every supplier described in the Parts database, find the sum of the number of parts supplied in the past (already used on projects) and the number of parts they have been requested to supply in the future (currently being shipped).

    SELECT SP.SNO,SUM(SP.QTY)+SUM(SPJ.QTY) AS TOTAL
    FROM SP JOIN SPJ ON SP.SNO = SPJ.SNO
    GROUP BY SP.SNO;

15. With V1 and V2 defined as shown below, will the following query fully answer our information request?
    CREATE VIEW V1(SNO,ORDERED) AS
        SELECT SNO, SUM(QTY) FROM SP GROUP BY SNO;

    CREATE VIEW V2(SNO,USED) AS
        SELECT SNO, SUM(QTY) FROM SPJ GROUP BY SNO;

    SELECT V1.SNO,ORDERED+USED AS TOTAL
    FROM V1,V2
    WHERE V1.SNO = V2.SNO
    UNION
    SELECT SNO,ORDERED
    FROM V1
    WHERE SNO NOT IN ( SELECT SNO FROM V2 )
    UNION
    SELECT SNO,USED
    FROM V2
    WHERE SNO NOT IN ( SELECT SNO FROM V1 )
    UNION
    SELECT SNO,0
    FROM S
    WHERE SNO NOT IN ( SELECT SNO FROM V1 )
    AND SNO NOT IN ( SELECT SNO FROM V2 );

16. Consider the following information request and query.

    List names of suppliers who are currently shipping more part types than supplier S3.

    SELECT SNAME
    FROM S,SP
    WHERE S.SNO = SP.SNO
    GROUP BY SNAME
    HAVING COUNT(PNO) >
        ( SELECT COUNT(PNO) FROM SP WHERE SNO = 'S3' );

Will the above query answer the information request correctly?

17. Consider the following information request and query.

    Obtain names of suppliers that are currently shipping no parts.

    SELECT SNAME
    FROM S LEFT JOIN SP ON S.SNO = SP.SNO
    GROUP BY SNAME
    HAVING SUM(QTY) = 0;

(a) Why will this query not answer the information request correctly?
(b) Modify the query to produce the required result.

18. Consider the following information request and query.

    Obtain names of suppliers that are currently shipping no P2s.

    SELECT SNAME
    FROM S LEFT JOIN SP ON S.SNO = SP.SNO
    WHERE PNO = 'P2' OR PNO IS NULL
    GROUP BY SNAME
    HAVING SUM(QTY) IS NULL;

Why will this query not answer the information request correctly?

19. Consider the following database, information request and query:

    Account (AID, LastStatementDate, LastStatementBalance);
    Payment (PID, AcctID, Date, Paid) AID references Account;

    List Account IDs for accounts where the balance outstanding on the last account statement has not been paid in full.

    SELECT A.AID
    FROM Account A JOIN Payment P ON A.AID = P.AID
    WHERE P.Date > LastStatementDate
    GROUP BY A.AID, LastStatementBalance
    HAVING SUM(Paid) < LastStatementBalance;

This query will not always answer the information request correctly.
(a) Explain why this query will miss accounts that have made no payment since the last statement.
(b) To include the missing accounts, one might be tempted to use an outer join and modify WHERE to retain preserved rows (P.Date IS NULL). Will this fix the problem?
(c) Formulate a query to always answer the request correctly.

20. The following database describes information about courses (in the C table) and prerequisite courses (in the R table). Consider the following database, information request, query, and sample data:

    C (CID, CName);
    R (CID, PreReqCID) CID references C, PreReqCID references C;

    List the prerequisite courses for COIT13143.

    WITH Req(CID) AS (
        SELECT PreReqCID FROM R WHERE CID = 'COIT13143'
        UNION ALL
        SELECT PreReqCID FROM R JOIN Req ON R.CID = Req.CID )
    SELECT CID FROM Req;

    C
    CID        CName
    COIT11134  Java Programming
    COIT11222  Visual Programming
    COIT11226  Systems Analysis & Design
    COIT12167  Database Use & Design
    COIT13143  Database Application Development

    R
    CID        PreReqCID
    COIT11134  COIT11222
    COIT12167  COIT11134
    COIT12167  COIT11226
    COIT13143  COIT12167

(a) How do we know that the query above is a recursive query?
(b) Explain the workings of the above query. In particular, explain how contents of the CTE called Req are derived recursively.

Solutions to exercises in module

1. As explained in the following text, the query produces an error because SNAME “does not appear in an aggregate function or the GROUP BY clause”.

2. See solution to Exercise 14 below.

3. One might expect to see a NbrParts value of zero (0). Actually, SQL produces a null NbrParts for S5.

4. No solution required.

5. S5 is missing from the result because S5 does not appear in V1 and an inner join of V1 and V2 has been formed. No, the Cartesian join of V1 and V2 followed by the row filter of V1.SNO = V2.SNO is equivalent to the inner join. Yes, S5 appears in the result, but with a null TOTAL. An outer join in V1 will preserve S5, but with a null ORDERED value.
S5 appears with a null TOTAL.

6. It does.

7. For S5, the subquery (SELECT SUM(QTY) FROM SP WHERE SNO = S.SNO) evaluates to null, and null + 3100 evaluates to null.

8. The query includes ALL after UNION to preserve duplicate (SNO, QTY) rows that may be obtained from the two operand queries.
The result will not include a row with an SNO value that appears in S but does not appear in either SP or SPJ.

    SELECT SNO, SUM(QTY) AS TOTAL
    FROM ( SELECT SNO,QTY FROM SP
           UNION ALL
           SELECT SNO,QTY FROM SPJ
           UNION ALL
           SELECT SNO,0 FROM S ) AS V
    GROUP BY SNO;

9. It does produce the correct result.
It will not produce the correct result if S3 is shipping no parts.
One might expect it to list details of all suppliers shipping parts. The query produces an empty result.

10. The following query will always list all suppliers shipping more parts than S5:

    SELECT *
    FROM S
    WHERE SNO IN (
      SELECT SNO
      FROM SP
      GROUP BY SNO
      HAVING SUM(QTY) > ( SELECT SUM(QTY) FROM SP WHERE SNO = 'S5' )
          OR NOT EXISTS ( SELECT * FROM SP WHERE SNO = 'S5' ));

11. Two (2) bonuses will be paid by department D1. The value 1 is shown as the count of bonuses for D1 because the null for S. Green is removed from the set to which the COUNT function is applied.

12. Advanced: Chris Date argues that SUM of an empty set should evaluate to 0 since 0 is the identity under addition: 0 + x evaluates to x (which is still true if x is null). The fact that x + null evaluates to null does not weaken his argument.

13. Advanced: Default values are not particularly useful for foreign keys. If a default value is used, a row in the target table must hold this value for the target candidate key. There are few cases where this is appropriate. In most cases, this would require the introduction of a “dummy row” in the target table.
Then: queries using the referenced table must explicitly avoid the dummy row; or, a view would be needed to filter out the dummy row.
Also: queries using the referencing table must take “default foreign key values” into account.
Note: searching for default foreign key values is still type-independent as we can search for “IS DEFAULT”.

14. The query:

    SELECT SP.SNO, SUM(SP.QTY)+SUM(SPJ.QTY) AS TOTAL
    FROM SP JOIN SPJ ON SP.SNO = SPJ.SNO
    GROUP BY SP.SNO;

produces the following result:

    SNO   TOTAL
    S1     8000
    S2    12000
    S3     1100
    S4     3600

Consider the case of supplier S3. The correct value for S3 is 900.

    SELECT * FROM SP WHERE SNO = 'S3';

produces:

    SNO   PNO   QTY
    S3    P2    200

    SELECT * FROM SPJ WHERE SNO = 'S3';

produces:

    SNO   PNO   JNO   QTY
    S3    P2    J1    200
    S3    P4    J2    500

    SELECT *
    FROM SP JOIN SPJ ON SP.SNO = SPJ.SNO
    WHERE SP.SNO = 'S3';

produces:

    SP.SNO   SP.PNO   SP.QTY   SPJ.SNO   SPJ.PNO   SPJ.QTY
    S3       P2       200      S3        P2        200
    S3       P2       200      S3        P4        500

giving:

    SUM(SP.QTY) = 400
    SUM(SPJ.QTY) = 700
    SUM(SP.QTY)+SUM(SPJ.QTY) = 1100

In summary: Rows from SP and SPJ are joined over SNO. Each row is joined to one or more related rows in the other table. Each QTY value will appear in the join as many times as there are related rows in the other table. Summing replicated QTY values will result in an inflated TOTAL value for the supplier.

15. Yes.

16. Yes, it will. COUNT returns zero when it is applied to an empty set of values.

17. (a) Nulls are removed from the set of items to which an aggregate function is applied. SUM returns null when applied to an empty set.
(b) Two possibilities are:

    SELECT SNAME
    FROM S LEFT JOIN SP ON S.SNO = SP.SNO
    GROUP BY SNAME
    HAVING SUM(QTY) IS NULL;

    SELECT SNAME
    FROM S LEFT JOIN SP ON S.SNO = SP.SNO
    GROUP BY SNAME
    HAVING COALESCE(SUM(QTY),0) = 0;

18. The WHERE clause is applied to the outer join of S and SP. This outer join will preserve rows from S with no related row in SP. An S row describing a supplier not shipping any parts will be preserved by the outer join. The name of this supplier will appear in the result. An S row describing a supplier shipping some parts, but not P2s, will appear in the result of the outer join joined to the rows describing the parts they are shipping.
For this supplier, no row in the outer join has a PNO that is 'P2' or null. Consequently, no rows for this supplier will survive the WHERE clause, and the supplier is wrongly omitted from the result.

19. (a) It will not include details of accounts that have made no payments since the date of the last statement, because the inner join followed by the WHERE filter on P.Date retains only accounts with at least one such payment.
(b) No. See answer to Exercise 18.
(c) Three possibilities are:

    SELECT A.AID
    FROM Account A JOIN Payment P ON A.AID = P.AID
    WHERE P.Date > LastStatementDate
    GROUP BY A.AID, LastStatementBalance
    HAVING SUM(Paid) < LastStatementBalance
    UNION
    SELECT A.AID
    FROM Account A
    WHERE LastStatementBalance > 0
    AND NOT EXISTS (
      SELECT * FROM Payment P
      WHERE P.AID = A.AID AND P.Date > LastStatementDate );

    SELECT A.AID
    FROM Account A
    WHERE LastStatementBalance > (
      SELECT SUM(Paid) FROM Payment P
      WHERE P.AID = A.AID AND P.Date > LastStatementDate )
    OR (LastStatementBalance > 0
        AND NOT EXISTS (
          SELECT * FROM Payment P
          WHERE P.AID = A.AID AND P.Date > LastStatementDate ));

    SELECT A.AID
    FROM Account A
    WHERE LastStatementBalance > (
      SELECT COALESCE(SUM(Paid),0) FROM Payment P
      WHERE P.AID = A.AID AND P.Date > LastStatementDate );

20. (a) We know that the query is a recursive query because it includes a recursive common table expression (CTE). Notice how the definition of CTE Req includes a reference to itself in the query expression following UNION ALL.

    WITH Req(CID) AS (
      SELECT PreReqCID FROM R WHERE CID = 'COIT13143'
      UNION ALL
      SELECT PreReqCID FROM R JOIN Req ON R.CID = Req.CID )

(b) The query works by evaluating the CTE Req and then executing the following query against Req:

    SELECT CID FROM Req;

The evaluation of Req starts by executing the anchor query below, which produces the following result.
    SELECT PreReqCID FROM R WHERE CID = 'COIT13143'

    Req: CID | 'COIT12167'

The first evaluation of the chain query (below), applied against the value of Req above, produces the following result:

    SELECT PreReqCID FROM R JOIN Req ON R.CID = Req.CID

    Req: CID | 'COIT12167', 'COIT11134', 'COIT11226'

The second evaluation of the chain query, applied against the value of Req above, produces the following result:

    Req: CID | 'COIT12167', 'COIT11134', 'COIT11226', 'COIT11222'

The third evaluation of the chain query, applied against the value of Req above, produces the same value of Req. That is, no new rows are added to Req. At this point the recursive CTE is fully evaluated.

Having evaluated Req, the query referencing Req is run to produce the query output: the four CID values above.
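
The evaluation traced in Solution 20(b) can be checked by running the query against the sample data from Exercise 20. The following is a minimal sketch, not part of the module's own materials: it assumes Python's built-in sqlite3 module with an in-memory database, the column types (TEXT) are an assumption, and the keyword RECURSIVE is written after WITH as standard SQL specifies (some products omit it, as the exercise does).

```python
import sqlite3

# In-memory database holding the sample data from Exercise 20.
# Column types are assumed; the exercise gives only column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE C (CID TEXT PRIMARY KEY, CName TEXT);
    CREATE TABLE R (
        CID       TEXT REFERENCES C(CID),
        PreReqCID TEXT REFERENCES C(CID)
    );
    INSERT INTO C VALUES
        ('COIT11134', 'Java Programming'),
        ('COIT11222', 'Visual Programming'),
        ('COIT11226', 'Systems Analysis & Design'),
        ('COIT12167', 'Database Use & Design'),
        ('COIT13143', 'Database Application Development');
    INSERT INTO R VALUES
        ('COIT11134', 'COIT11222'),
        ('COIT12167', 'COIT11134'),
        ('COIT12167', 'COIT11226'),
        ('COIT13143', 'COIT12167');
""")

# Anchor query: direct prerequisites of COIT13143.
# Chain query: prerequisites of every course already in Req.
rows = conn.execute("""
    WITH RECURSIVE Req(CID) AS (
        SELECT PreReqCID FROM R WHERE CID = 'COIT13143'
        UNION ALL
        SELECT PreReqCID FROM R JOIN Req ON R.CID = Req.CID
    )
    SELECT CID FROM Req
""").fetchall()

# Sort for a stable display order; the query itself guarantees none.
prereqs = sorted(cid for (cid,) in rows)
print(prereqs)
```

The script lists the same four courses as the trace above. Because the sample prerequisite data contains no cycles, the chain query eventually adds no new rows and the evaluation stops; with cyclic data and UNION ALL it would not terminate.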