Outer Join

In the following document, C. J. Date explains some of the subtleties of outer joins.

Introduction

Outer join comes in several flavors: left, right, and full outer theta-join, and left, right, and full outer natural join. The main advantage of outer join over inner join is that it ‘preserves’ information that the inner join ‘loses’. Fig. 1 illustrates this advantage. Part (a) of the figure shows some sample values for the suppliers table (S) and the shipments table (SP); part (b) shows the regular (inner) natural join of S and SP over S#; and part (c) shows a corresponding outer natural join. As the figure indicates, the inner join ‘loses information’ for suppliers who supply no parts, whereas the outer join ‘preserves’ such information. The left and full outer joins are identical in this particular example and are shown in part (c) of the figure; the right outer natural join, by contrast, degenerates to the inner natural join and is shown in part (b).

Support for outer join

Date shows how the outer join of Fig. 1(c) would be expressed in DB2 (one product out of many that does not provide any direct support). The problems with that formulation are: the query expression is complicated; it is error prone; and the system optimizer will not recognize that the user is actually trying to construct an outer join, so performance will be poor. Date therefore argues that direct support for outer join is needed. To be regarded as ‘good’ outer join support, a product should provide at least all of the following:

1. An explicit “one-liner” outer join expression, e.g. support for: S OUTER JOIN SP
2. Left, right, and full versions of both outer natural joins and outer theta-joins
3. Outer joins of any number of tables
4. Generality (i.e., all possible outer joins are handled)
5. The ability for the user to control how the fact that information is missing is represented
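The kind of hand-built formulation that a product without direct support forces on the user can be sketched as follows. This is not the original DB2 listing; it is a reconstruction of the general pattern (inner join rows plus padded rows for unmatched suppliers), run on SQLite through Python's sqlite3 module. The sample rows are hypothetical stand-ins for Fig. 1(a), and SNO and PNO are assumed names standing in for S# and P#.

```python
import sqlite3

# Hypothetical sample data standing in for Fig. 1(a); SNO/PNO stand in for S#/P#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SCITY TEXT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER);
    INSERT INTO S  VALUES ('S2', 'Paris'), ('S5', 'Athens');
    INSERT INTO SP VALUES ('S2', 'P1', 300), ('S2', 'P2', 400);
""")

# Outer join built "by hand": the inner join rows, plus one padded row
# for every supplier that has no shipment at all.
emulated = conn.execute("""
    SELECT S.SNO, S.SCITY, SP.PNO, SP.QTY
    FROM   S, SP
    WHERE  S.SNO = SP.SNO
    UNION ALL
    SELECT S.SNO, S.SCITY, NULL, NULL
    FROM   S
    WHERE  NOT EXISTS (SELECT * FROM SP WHERE SP.SNO = S.SNO)
    ORDER  BY 1, 3
""").fetchall()

# The same result with direct outer join support.
direct = conn.execute("""
    SELECT S.SNO, S.SCITY, SP.PNO, SP.QTY
    FROM   S LEFT OUTER JOIN SP ON S.SNO = SP.SNO
    ORDER  BY S.SNO, SP.PNO
""").fetchall()

print(emulated == direct)
```

The contrast in length between the two queries is the point: the first spells out the join condition twice (once positively, once negated), which is where the error-proneness comes from.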
Why have so many products failed to provide support for outer join? This question is taken up in the following sections. For now, C. J. Date remarks that it is in fact quite difficult to extend SQL to do outer join “right”. To see why, we first need the semantics of the basic SQL “SELECT-FROM-WHERE” construct. Conceptually, this is what happens:

1. The Cartesian product of the tables listed in the FROM clause is computed.
2. The Cartesian product is restricted to that subset of the rows that satisfy the “restriction condition” in the WHERE clause.
3. Finally, that restricted subset is projected over the columns named in the SELECT clause.

This conceptual evaluation does not work well for outer joins, because outer join possesses a number of Nasty Properties that together conspire to undermine the assumptions underlying the original evaluation algorithm. As a result, outer joins cannot be expressed by any simple extension to the SELECT-FROM-WHERE construct. The Nasty Properties are:

1. Outer equijoin is not a restriction of Cartesian product.
2. Restriction does not distribute over outer equijoin.
3. “A <= B” is not the same as “A < B or A = B”.
4. The comparison operators are not transitive.
5. Outer natural join is not a projection of outer equijoin.

WHY WE CANNOT “MERELY” EXTEND THE WHERE CLAUSE

Nasty Property (1): outer equijoin is not a restriction of the Cartesian product. Fig. 2 shows (a) the Cartesian product of tables S and SP from Fig. 1 and (b) the left (equivalently, the full) outer equijoin of those same two tables over S#. The outer equijoin contains a row, for the supplier with no shipments, that does not appear in the Cartesian product at all, so no restriction of the product can produce it.

Date then shows how the outer join of Fig. 1(c) is expressed in SYBASE, which uses a special comparison operator “*=” in the WHERE clause. It follows from Nasty Property (1) that we cannot coherently expect to be able to express outer equijoin by merely inventing some new kind of comparison operator (such as SYBASE’s “*=”) and then allowing that operator in the WHERE clause.
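The three conceptual steps, and the way Nasty Property (1) defeats them, can be checked with a small sketch. It assumes hypothetical sample rows in place of Fig. 1(a), uses SNO and PNO as stand-ins for S# and P#, and runs on SQLite; the restriction and projection steps are done in plain Python so that each step stays visible.

```python
import sqlite3

# Hypothetical sample data standing in for Fig. 1(a); SNO/PNO stand in for S#/P#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SCITY TEXT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER);
    INSERT INTO S  VALUES ('S2', 'Paris'), ('S5', 'Athens');
    INSERT INTO SP VALUES ('S2', 'P1', 300), ('S2', 'P2', 400);
""")

columns = "S.SNO, S.SCITY, SP.SNO, SP.PNO, SP.QTY"

# Step 1: Cartesian product of the tables in the FROM clause.
product = conn.execute(f"SELECT {columns} FROM S, SP").fetchall()

# Step 2: restriction by the WHERE condition S.SNO = SP.SNO.
restricted = [row for row in product if row[0] == row[2]]

# Step 3: projection over the columns named in the SELECT clause.
projected = [(sno, scity, pno, qty) for sno, scity, _sp_sno, pno, qty in restricted]

# The left outer equijoin, for comparison.
outer = conn.execute(
    f"SELECT {columns} FROM S LEFT OUTER JOIN SP ON S.SNO = SP.SNO"
).fetchall()

# Rows of the outer equijoin that are not rows of the Cartesian product.
missing_from_product = set(outer) - set(product)
print(missing_from_product)
```

Because the padded row for the unmatched supplier is not in `product`, no choice of condition in step 2 can ever select it.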
The SYBASE formulation fails on this score. The operator “*=” means ‘include all rows from the first-named table in the join specification whether or not there is a match on the [joining] column in the [second] table’. Even though the query produces the desired result, the approach does not handle all cases (it supports only the left and right versions of outer equijoin, not the full version), and moreover it is hard to explain the semantics of “*=” to the user.

Nasty Property (2): restriction does not distribute over outer equijoin. That is, the expressions

( R outer-equijoin S ) WHERE restriction-on-R

and

( R WHERE restriction-on-R ) outer-equijoin S

are not equivalent, in general. By way of example, consider a SYBASE query whose WHERE clause contains both the join condition ‘S.S# *= SP.S#’ and the restriction condition ‘SP.QTY < 1000’. Such a query has two interpretations, depending on whether the restriction condition is applied before or after the join condition: performing the restriction first gives a result containing rows for both suppliers S2 and S5, whereas performing the join first gives a result containing supplier S2 only.

Chamberlin’s formulation of outer join

One possible approach to extending SQL to support outer equijoins would be to allow the FROM clause to refer to ‘augmented tables’, and then to use the WHERE clause to express the restriction of the result of that FROM clause (the Cartesian product of the augmented tables) that is needed to construct the desired outer join. Here an ‘augmented table’ is a table that contains all of the rows of the original table together with an additional all-null row. The augmented version of table T is written ‘T+’. In Chamberlin’s formulation of the outer join of Fig. 1(c), XYZ denotes the augmented version of table SP. This approach certainly works and is not ad hoc. However, it is very error prone: it is not at all easy to get the restrictions in the WHERE clause right, even in simple cases.
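The two readings of the query discussed under Nasty Property (2) can be reproduced as follows. SQLite has no “*=” operator, so this sketch writes each reading explicitly with LEFT OUTER JOIN; the sample rows are hypothetical (chosen so that every shipment has QTY below 1000), and SNO and PNO stand in for S# and P#.

```python
import sqlite3

# Hypothetical sample data; SNO/PNO stand in for S#/P#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SCITY TEXT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER);
    INSERT INTO S  VALUES ('S2', 'Paris'), ('S5', 'Athens');
    INSERT INTO SP VALUES ('S2', 'P1', 300), ('S2', 'P2', 400);
""")

# Reading 1: join first, then restrict the joined rows.
# The padded row for S5 has a null QTY, so "QTY < 1000" is not true for it.
join_first = {row[0] for row in conn.execute("""
    SELECT S.SNO
    FROM   S LEFT OUTER JOIN SP ON S.SNO = SP.SNO
    WHERE  SP.QTY < 1000
""")}

# Reading 2: restrict SP first, then outer join S with what is left.
restrict_first = {row[0] for row in conn.execute("""
    SELECT S.SNO
    FROM   S LEFT OUTER JOIN (SELECT * FROM SP WHERE QTY < 1000) AS SPR
           ON S.SNO = SPR.SNO
""")}

print(sorted(join_first), sorted(restrict_first))
```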
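Chamberlin's augmented-table idea can be approximated in standard SQL by building the augmented table as a view. This is only a sketch under assumptions: the ‘T+’ notation itself is not available in SQLite, so the view SP_PLUS is a hypothetical stand-in for SP+, the sample rows are invented, SNO and PNO stand in for S# and P#, and SP.SNO is assumed never to be null in the base table.

```python
import sqlite3

# Hypothetical sample data; SNO/PNO stand in for S#/P#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SCITY TEXT);
    CREATE TABLE SP (SNO TEXT NOT NULL, PNO TEXT, QTY INTEGER);
    INSERT INTO S  VALUES ('S2', 'Paris'), ('S5', 'Athens');
    INSERT INTO SP VALUES ('S2', 'P1', 300), ('S2', 'P2', 400);

    -- Stand-in for the augmented table SP+: all rows of SP plus one all-null row.
    CREATE VIEW SP_PLUS AS
        SELECT SNO, PNO, QTY FROM SP
        UNION ALL
        SELECT NULL, NULL, NULL;
""")

# The WHERE clause must keep the all-null row only for suppliers that
# have no real match; this is the part that is easy to get wrong.
augmented = conn.execute("""
    SELECT S.SNO, S.SCITY, XYZ.PNO, XYZ.QTY
    FROM   S, SP_PLUS AS XYZ
    WHERE  S.SNO = XYZ.SNO
       OR  (XYZ.SNO IS NULL
            AND NOT EXISTS (SELECT * FROM SP WHERE SP.SNO = S.SNO))
    ORDER  BY S.SNO, XYZ.PNO
""").fetchall()

direct = conn.execute("""
    SELECT S.SNO, S.SCITY, SP.PNO, SP.QTY
    FROM   S LEFT OUTER JOIN SP ON S.SNO = SP.SNO
    ORDER  BY S.SNO, SP.PNO
""").fetchall()

print(augmented == direct)
```

Dropping the NOT EXISTS test would pair every supplier with the all-null row, which illustrates how much work the restriction has to do even in this two-table case.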
Moreover, the technique is difficult to apply.

WHY WE CANNOT MERELY EXTEND THE FROM CLAUSE

Because the restriction conditions needed in Chamberlin’s approach become quite complex and are difficult to get right, we might instead try to extend the FROM clause so that tables are augmented with all-null rows as appropriate and the necessary complex restrictions are applied automatically. INFORMIX attempted such an approach, writing ‘outer T’ in the FROM clause. An immediate criticism of that syntax is that the result includes rows for which the WHERE condition does not evaluate to true. Another is that the reference to ‘SP’ in the WHERE clause must presumably be taken to denote the augmented table SP+, not the table SP. Such syntactic trickery does not seem to be a very sound basis on which to construct a formal language. The technique does not handle the general case, nor is it easy to state which cases it does handle (like SYBASE, INFORMIX supports left and right outer equijoin but not full outer equijoin). And there is no clear explanation of ‘outer T’: it is difficult to explain the semantics of that expression for arbitrary T.

Nasty Property (3): “A <= B” is not the same as “A < B or A = B”. Consider two queries that join S and SP over S#, the first using the outer join condition “A <= B” and the second using “A < B or A = B”. Given the sample data of Fig. 1(a), the first of these queries produces a result that happens to be identical to the outer equijoin shown in Fig. 2(b). The second, however, produces a result that is the union of the outer equijoin with the left outer ‘less than’ join of the two tables. The two results are not the same. To be specific, the ‘union’ result includes an additional row in which the supplier (S) S# is S2 and the shipment (SP) S# is null.

Nasty Property (4): the comparison operators are not transitive. This property is a consequence of the fact that we are dealing with three-valued logic.
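Returning to Nasty Property (3), the difference between the two queries can be reproduced with ordinary left outer joins. The sketch below assumes hypothetical sample rows in place of Fig. 1(a), takes the comparison to be S.SNO against SP.SNO (an assumption about its direction), and uses SNO and PNO as stand-ins for S# and P#.

```python
import sqlite3

# Hypothetical sample data; SNO/PNO stand in for S#/P#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SCITY TEXT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER);
    INSERT INTO S  VALUES ('S2', 'Paris'), ('S5', 'Athens');
    INSERT INTO SP VALUES ('S2', 'P1', 300), ('S2', 'P2', 400);
""")


def left_outer_join(op: str) -> set:
    """Left outer theta-join of S with SP using the comparison operator op."""
    return set(conn.execute(f"""
        SELECT S.SNO, S.SCITY, SP.SNO, SP.PNO, SP.QTY
        FROM   S LEFT OUTER JOIN SP ON S.SNO {op} SP.SNO
    """))


le_join = left_outer_join("<=")                            # outer "A <= B" join
union = left_outer_join("=") | left_outer_join("<")        # "A = B" join UNION "A < B" join

# Rows present only in the union formulation.
extra_rows = union - le_join
print(extra_rows)
```

The extra row arises because supplier S2 has no match under “<”, so the ‘less than’ join preserves it with nulls even though it does have matches under “=”.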
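The three-valued logic behind Nasty Property (4) can be sketched in a few lines. This is an illustrative encoding, not anything prescribed by the source: Python's None is assumed to play the role of both the null value and the truth value unknown, and implication is read as (NOT p) OR q.

```python
from typing import Optional

# None plays the role of the third truth value, "unknown" (an encoding chosen for this sketch).
Truth = Optional[bool]


def eq3(a: Optional[int], b: Optional[int]) -> Truth:
    """Equality comparison: unknown as soon as either operand is null."""
    if a is None or b is None:
        return None
    return a == b


def not3(p: Truth) -> Truth:
    return None if p is None else (not p)


def and3(p: Truth, q: Truth) -> Truth:
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True


def or3(p: Truth, q: Truth) -> Truth:
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False


def implies3(p: Truth, q: Truth) -> Truth:
    """p implies q, read as (NOT p) OR q."""
    return or3(not3(p), q)


A, B, C = 1, None, 2
lhs = and3(eq3(A, B), eq3(B, C))    # (A = B AND B = C)
rhs = eq3(A, C)                     # (A = C)
implication = implies3(lhs, rhs)    # does the left side imply the right side?
print(lhs, rhs, implication)
```

With these values the implication does not come out true, which is what it means for the transitivity rule to fail as a tautology.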
For example, the statement “( A = B AND B = C ) implies ( A = C )” is not valid in three-valued logic. For instance, if A is 1, B is null, and C is 2, then the left-hand side evaluates to unknown, but the right-hand side evaluates to false.

PRESERVE CLAUSE

By now it should be clear that no simple extension to the FROM or WHERE clause can handle the outer join problem satisfactorily. C. J. Date therefore proposed a new PRESERVE clause, whose purpose was to preserve rows from the table named in that clause that did not otherwise contribute to the result of evaluating the preceding FROM and WHERE clauses. PRESERVE was implemented in Computer Associates’ CA-DB product, and Date gives the CA-DB version of the left outer natural join of S with SP over S#. The PRESERVE clause is adequate to handle left, right, and full outer theta-joins of exactly two tables. However, the PRESERVE clause per se still does not provide a particularly elegant solution to the outer natural join problem, because of Nasty Property (5).

Nasty Property (5): outer natural join is not a projection of outer equijoin. Any language that is based on taking projections is therefore awkward to extend to support outer natural join. This is unfortunate. Fig. 3 is a revised version of Fig. 1: part (a) gives a slightly revised set of sample values for the tables; part (b) shows the outer equijoin over supplier numbers; and part (c) shows the corresponding outer natural join. In one possible formulation of a query for the outer natural join of Fig. 3(c), the FROM, WHERE, and PRESERVE clauses together construct the outer equijoin of Fig. 3(b), and the SELECT clause then derives the outer natural join using the COALESCE function. COALESCE returns the value of its first non-null argument, or null if both of its arguments are null. Such a query is not very succinct and is cumbersome to express.
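The step from outer equijoin to outer natural join can be sketched as follows. PRESERVE is not available in SQLite, so the full outer equijoin is emulated here with a left outer join plus the unmatched SP rows; the sample rows are hypothetical stand-ins for Fig. 3(a) (with one shipment for a supplier S3 that is absent from S), and SNO and PNO stand in for S# and P#.

```python
import sqlite3

# Hypothetical sample data standing in for Fig. 3(a); SNO/PNO stand in for S#/P#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SCITY TEXT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, QTY INTEGER);
    INSERT INTO S  VALUES ('S2', 'Paris'), ('S5', 'Athens');
    INSERT INTO SP VALUES ('S2', 'P1', 300), ('S3', 'P2', 400);

    -- Full outer equijoin, emulated: left outer join plus unmatched SP rows.
    CREATE VIEW OUTER_EQUIJOIN AS
        SELECT S.SNO AS S_SNO, S.SCITY AS SCITY,
               SP.SNO AS SP_SNO, SP.PNO AS PNO, SP.QTY AS QTY
        FROM   S LEFT OUTER JOIN SP ON S.SNO = SP.SNO
        UNION ALL
        SELECT NULL, NULL, SP.SNO, SP.PNO, SP.QTY
        FROM   SP
        WHERE  NOT EXISTS (SELECT * FROM S WHERE S.SNO = SP.SNO);
""")

equijoin = set(conn.execute("SELECT * FROM OUTER_EQUIJOIN"))

# Outer natural join: the single supplier-number column must be
# computed from BOTH equijoin columns, not picked from one of them.
natural = set(conn.execute("""
    SELECT COALESCE(S_SNO, SP_SNO), SCITY, PNO, QTY
    FROM   OUTER_EQUIJOIN
"""))

natural_snos = {row[0] for row in natural}
from_s_column = {row[0] for row in equijoin}     # projecting on S_SNO
from_sp_column = {row[2] for row in equijoin}    # projecting on SP_SNO
print(natural_snos, from_s_column, from_sp_column)
```

Neither projected column alone yields the supplier numbers of the natural join, which is why a pure projection cannot derive it.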
So C. J. Date’s solution to the above problems is to extend SQL to provide direct ‘one-liner’ support. SQL2, the follow-on to the existing SQL, is planned to provide such support, including a direct formulation of the outer natural join of Fig. 3(c).

IS OUTER JOIN REALLY WHAT WE NEED?

In Date’s opinion, outer join, as the operation is usually understood, is too simplistic, because of its ‘nulls interpretation’ problem. There are numerous different reasons for generating nulls in the result of an outer join, so generating just a single ‘value unknown’ null in every case is not appropriate. Date explains this with an example involving three tables, EMP, DEPT, and PGMR, where PGMR represents a subtype table (i.e., if employee ‘x’ is a programmer, then ‘x’ appears in both the EMP and PGMR tables). Let us examine some outer join cases using these tables.

1) The left outer natural join of EMP with PGMR over EMP# will generate a null LANG value for any employee who is not a programmer. This null should be of the ‘property does not apply’ variety.
2) The left outer natural join of DEPT with EMP over DEPT# will generate null EMP# and SALARY values for any department that has no employees. These nulls should be of the ‘value does not exist’ variety.
3) The left outer natural join of EMP with DEPT over DEPT# will generate null DEPT# and BUDGET values for any employee whose department is unknown. These are ‘value unknown’ nulls.

ARE NULLS REALLY WHAT WE NEED?

Here C. J. Date says that ‘nulls’, meaning the nulls of three- or higher-valued logic, are not what we want. What is needed is a DBMS that provides a systematic “DBA-defined default values” mechanism, and a version of outer join that generates such values instead of nulls. Date also states that outer join is not a good model of certain real-world situations. In Case 2 above, the table produced by the outer join (the left outer join of DEPT with EMP over DEPT#) has DEPT#, BUDGET, EMP#, and SALARY as its columns.
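Case 2 can be reproduced with a one-line outer natural join of the kind SQL2 was expected to offer. The sketch uses SQLite's NATURAL LEFT OUTER JOIN as a stand-in; the table contents are hypothetical, and DEPTNO and EMPNO are assumed names standing in for DEPT# and EMP#.

```python
import sqlite3

# Hypothetical sample data; DEPTNO/EMPNO stand in for DEPT#/EMP#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPT (DEPTNO TEXT PRIMARY KEY, BUDGET INTEGER);
    CREATE TABLE EMP  (EMPNO TEXT PRIMARY KEY, DEPTNO TEXT, SALARY INTEGER);
    INSERT INTO DEPT VALUES ('D1', 100000), ('D2', 200000);
    INSERT INTO EMP  VALUES ('E1', 'D1', 50000);
""")

# Left outer natural join of DEPT with EMP over the common column DEPTNO.
cursor = conn.execute("""
    SELECT *
    FROM   DEPT NATURAL LEFT OUTER JOIN EMP
    ORDER  BY DEPTNO
""")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

The nulls in the row for the employee-less department carry no indication of which of the three varieties they are; SQL offers only the one kind.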
The criterion for membership in this table for some candidate row (d, b, e, s) has to be something like this: “There exists some department with department number d and budget b, and EITHER e is the employee number of some employee who works in that department and s is that employee’s salary, OR that department has no employees at all and e is null and s is null.” This criterion is not only difficult to state, it is quite difficult to understand as well. Any misunderstanding is likely to result in incorrect queries and wrong answers out of the database. For such reasons, some researchers are exploring alternative approaches to the outer join problem.

OUTER JOIN WITH NO NULLS AND FEWER TEARS

C. J. Date exposes some of the pitfalls of the outer join operator in relational query languages, discusses some of the problems that arise in products in which such an operator has already been implemented, and hints at a preference for an implementation that does not involve nulls. Hugh Darwen accordingly suggests an operator, LEFT JOIN, that:

1. Can painlessly be included in a relational algebra
2. Is not a primitive operator in an algebra that includes the Cartesian product, rename, restriction, projection, difference, and union operators
3. Always delivers a relation in 1NF
4. Never generates nulls
5. Is constrained to a certain well-defined subset of the many varieties of outer join that can be distinguished

LEFT JOIN

The operator LEFT JOIN is intended for inclusion in a language based on the relational algebra, in which results are all null-free. This language is based entirely on traditional two-valued logic. The syntax is as follows:

LEFT JOIN ( left, right, fill )

Explanation:

1. The operands left, right, and fill are relation-valued expressions.
2. Some candidate key of right is a subset of the common columns of left and right.
3. The columns of fill must be precisely those columns of right that are not columns of left.
4. The degrees of left, right, and fill are not otherwise constrained, and may even be zero.
5. The cardinality of fill must be exactly one.

Let R = LEFT JOIN (A, B, C). Then R is the left outer natural join of A with B, defined as follows. Each row of R consists of a row of A extended with values, for the non-common columns of B, from either (a) the matching row of B, if such a matching row exists, or (b) the (unique) row of C otherwise. Every row in A has a corresponding row in R. Rows in B that have no matching row in A do not contribute to R. The word LEFT implies that only the data of what would be the left operand in an infix notation is preserved. Darwen also gives a pseudo-SQL version of this definition.

The heading (i.e., set of column names) of R is the union of the headings of A and B. Any candidate key of A is a candidate key of R. R is updatable to the extent that A is updatable, except that the values in columns deriving from B may not be changed.

LEFT JOIN directly supports all requirements for many-to-one left outer natural joins. One-to-many right outer natural joins are trivially supported too. One-to-one full outer natural joins are indirectly supported, because “FULL JOIN (A, B, Ca, Cb)” can be expressed as UNION ( LEFT JOIN (A, B, Cb), LEFT JOIN (B, A, Ca) ). One-to-many left, many-to-one right, and many-to-many outer natural joins are not supported.

OUTER JOINS NOT SUPPORTED BY THE LEFT JOIN

Consider the left outer natural join of DEPT with EMP, which is not supported by LEFT JOIN because it is one-to-many. As we have already seen, a result in which an unmatched DEPT row is combined with a row of nulls or defaults to represent an employee-less department is heavily suspect. Let SUSPECT be the left outer natural join of DEPT with EMP. To a user glancing at the contents of SUSPECT, it appears to be all about employees working in departments.
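Returning to the definition of LEFT JOIN (left, right, fill) given earlier, the operator can be sketched in a few lines. This is an illustrative reading of the definition, not Darwen's own formulation: relations are assumed to be non-empty lists of dict rows (so that headings can be inferred from the rows), the key constraint is approximated by checking that the common-column values are unique in right, and the EMP/PGMR rows and the fill value are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def left_join(left: List[Row], right: List[Row], fill: List[Row]) -> List[Row]:
    """LEFT JOIN (left, right, fill) over relations held as lists of dict rows."""
    if len(fill) != 1:
        raise ValueError("fill must contain exactly one row")
    filler = fill[0]

    # Headings are inferred from the rows (assumes non-empty operands).
    left_cols = set().union(*(row.keys() for row in left))
    right_cols = set().union(*(row.keys() for row in right))
    common = sorted(left_cols & right_cols)

    if set(filler) != right_cols - left_cols:
        raise ValueError("fill columns must be the columns of right not in left")

    # Index right by its common columns; a duplicate means the common
    # columns do not contain a key of right.
    index: Dict[tuple, Row] = {}
    for row in right:
        key = tuple(row[c] for c in common)
        if key in index:
            raise ValueError("common columns must include a candidate key of right")
        index[key] = row

    result = []
    for row in left:
        match = index.get(tuple(row[c] for c in common))
        source = match if match is not None else filler
        result.append({**row, **{c: source[c] for c in filler}})
    return result


# Hypothetical data: EMPNO stands in for EMP#, and the fill value is invented.
EMP = [{"EMPNO": "E1", "DEPTNO": "D1"}, {"EMPNO": "E2", "DEPTNO": "D1"}]
PGMR = [{"EMPNO": "E1", "LANG": "COBOL"}]
R = left_join(EMP, PGMR, [{"LANG": "(not a programmer)"}])
print(R)
```

Every row of the left operand yields exactly one result row, and the fill row takes the place that nulls would occupy in a conventional outer join.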
Suppose the user now wants to know how many employees work in each department and runs a counting query against SUSPECT. Unfortunately, such a query does not distinguish the empty departments from the singletons, so it does not give the right answer. A correct way to express the required query involves the left outer natural join of DEPT with a summary of EMP. It is expressed in two steps, the first using SQL to build a summary table T1, the second using the LEFT JOIN operator. Here DEPT# is a candidate key for T1, so the constraints imposed on LEFT JOIN are satisfied. LEFT JOIN really does deliver a thoroughly respectable relation as its result: nulls do not appear, and the meaning of each column of the result does not vary from row to row.
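The miscount and the two-step correction can be sketched as follows. The original queries are not reproduced in this text, so everything here is an assumption: the counting query is taken to be a COUNT(*) per department over SUSPECT, the summary column is called N, the fill row is taken to be N = 0, the data is hypothetical, and DEPTNO and EMPNO stand in for DEPT# and EMP#. Step 2 is done in plain Python because SQLite has no LEFT JOIN operator with a fill operand.

```python
import sqlite3

# Hypothetical sample data: D1 has two employees, D2 one, D3 none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPT (DEPTNO TEXT PRIMARY KEY, BUDGET INTEGER);
    CREATE TABLE EMP  (EMPNO TEXT PRIMARY KEY, DEPTNO TEXT, SALARY INTEGER);
    INSERT INTO DEPT VALUES ('D1', 100000), ('D2', 200000), ('D3', 300000);
    INSERT INTO EMP  VALUES ('E1', 'D1', 50000), ('E2', 'D1', 60000),
                            ('E3', 'D2', 55000);
""")

# The tempting query: count the rows of SUSPECT for each department.
suspect_counts = dict(conn.execute("""
    SELECT DEPTNO, COUNT(*)
    FROM  (SELECT DEPT.DEPTNO AS DEPTNO, EMP.EMPNO AS EMPNO
           FROM   DEPT LEFT OUTER JOIN EMP ON EMP.DEPTNO = DEPT.DEPTNO) AS SUSPECT
    GROUP  BY DEPTNO
"""))

# Step 1 (SQL): T1 summarizes EMP; DEPTNO is a key of T1.
t1 = dict(conn.execute("SELECT DEPTNO, COUNT(*) AS N FROM EMP GROUP BY DEPTNO"))

# Step 2: LEFT JOIN (DEPT, T1, fill) with the single fill row N = 0.
FILL_N = 0
dept_rows = conn.execute("SELECT DEPTNO, BUDGET FROM DEPT ORDER BY DEPTNO").fetchall()
correct = [(deptno, budget, t1.get(deptno, FILL_N)) for deptno, budget in dept_rows]

print(suspect_counts)
print(correct)
```

In the first result the empty department and the singleton department both show a count of one; in the second the empty department shows the fill value instead.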