Support for outer join

Outer Join
In the following document, C. J. Date explains some of the subtleties of outer joins.
Introduction
Outer join comes in several flavors: left, right, and full outer theta-join, and left, right, and full outer natural join. The main advantage of outer join over inner join is that it ‘preserves’ information that the inner join ‘loses’. Fig. 1 illustrates this advantage.
In the above figure, part (a) shows some sample values for the suppliers table (S) and the shipments table (SP); part (b) shows the regular (inner) natural join of S and SP over S#; and part (c) shows a corresponding outer natural join. As the figure indicates, the inner join ‘loses information’ for suppliers who supply no parts, whereas the outer join ‘preserves’ such information. The left and full outer joins are identical in this particular example and are shown in part (c) of the figure; the right outer natural join, by contrast, degenerates to the inner natural join and is shown in part (b) of fig. 1.
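The difference between the inner and the left outer natural join can be sketched in a few lines of Python. The sample rows below are assumptions made for illustration (the figure itself is not reproduced here), and None stands in for a null.

```python
# Assumed sample data, not taken from the figure.
S = [("S2", "Jones"), ("S5", "Adams")]        # (S#, SNAME)
SP = [("S2", "P1", 300), ("S2", "P2", 400)]   # (S#, P#, QTY)

def inner_natural_join(s, sp):
    """Join over S#; suppliers with no shipments disappear."""
    return [(sno, name, pno, qty)
            for (sno, name) in s
            for (sp_sno, pno, qty) in sp
            if sno == sp_sno]

def left_outer_natural_join(s, sp):
    """Inner join plus one None-padded row per unmatched supplier."""
    result = inner_natural_join(s, sp)
    matched = {row[0] for row in result}
    result += [(sno, name, None, None) for (sno, name) in s if sno not in matched]
    return result

inner = inner_natural_join(S, SP)
outer = left_outer_natural_join(S, SP)
```

With this data, supplier S5 is absent from `inner` but present in `outer`, padded with None values.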
Support for outer join
Here is the way the outer join of fig. 1(c) would be expressed in DB2 (one product out of many that do not provide any direct support):
The problems with the above syntax are that the query expression is complicated, it is error prone, and the system optimizer will not recognize that the user is actually trying to construct an outer join, so performance will be poor. Date therefore argues that direct support for outer join is needed. He regards outer join support as ‘good’ only if it has at least all of the following properties:

- Explicit “one-liner” outer join expression, e.g. S OUTER JOIN SP
- Support for:
  - Left, right, and full versions for both outer natural joins and outer theta-joins
  - Outer joins of any number of tables
- Generality (i.e. all possible outer joins are handled)
- Ability for the user to control the representation of the fact that information is missing
Why have so many products failed to provide such support? This is discussed in the following sections. For now, C. J. Date remarks that it is in fact quite difficult to extend SQL to do outer join “right”. To see why, we first need the semantics of the basic SQL “SELECT-FROM-WHERE” construct. Conceptually, this is what happens:
1. The Cartesian product of the tables listed in the FROM clause is computed.
2. The Cartesian product is restricted to that subset of the rows that satisfy the “restriction
condition” in the WHERE clause.
3. Finally, that restricted subset is projected over the columns named in the SELECT clause.
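The three conceptual steps above can be sketched in Python as follows. This is only a model of the conceptual evaluation, not of how any real optimizer works; the function name `select_from_where`, the qualified column names, and the sample rows are assumptions made for illustration.

```python
from itertools import product

def select_from_where(tables, where, select):
    """Conceptual SELECT-FROM-WHERE: product, then restrict, then project."""
    # Step 1: Cartesian product of the tables in the FROM clause.
    rows = []
    for combo in product(*tables):
        merged = {}
        for row in combo:
            merged.update(row)
        rows.append(merged)
    # Step 2: keep only the rows that satisfy the WHERE condition.
    restricted = [r for r in rows if where(r)]
    # Step 3: project over the columns named in the SELECT clause.
    return [tuple(r[c] for c in select) for r in restricted]

# Assumed sample data with qualified column names.
S = [{"S.S#": "S2", "S.CITY": "Paris"}, {"S.S#": "S5", "S.CITY": "Athens"}]
SP = [{"SP.S#": "S2", "SP.P#": "P1"}, {"SP.S#": "S2", "SP.P#": "P2"}]

result = select_from_where(
    [S, SP],
    where=lambda r: r["S.S#"] == r["SP.S#"],
    select=["S.S#", "SP.P#"],
)
```

Every row of `result` originates from some row of the Cartesian product, which is exactly the assumption that outer join violates.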
The foregoing conceptual evaluation does not work well for outer joins, because outer join possesses a number of Nasty Properties that together undermine the assumptions underlying the original evaluation algorithm. As a result, it is impossible to express outer joins in terms of any simple extension to the SELECT-FROM-WHERE construct. The Nasty Properties are listed here:
1. Outer equijoin is not a restriction of Cartesian product.
2. Restriction does not distribute over outer equijoin.
3. “A<=B” is not the same as “A<B OR A=B”.
4. The comparison operators are not transitive.
5. Outer natural join is not a projection of outer equijoin.
WHY WE CANNOT “MERELY” EXTEND THE WHERE CLAUSE
This section explains Nasty Property (1): outer equijoin is not a restriction of the Cartesian product. Fig. 2 shows (a) the Cartesian product of tables S and SP from fig. 1 and (b) the left (equivalently, the full) outer equijoin of those same two tables over S#. The outer equijoin contains rows, such as the row for a supplier with no shipments, that do not appear in the Cartesian product at all. Here is the way the outer join of fig. 1(c) is expressed in SYBASE:
It follows from Nasty Property (1) that we cannot coherently expect to express outer equijoin merely by inventing some new kind of comparison operator (e.g. SYBASE’s “*=”) and then allowing that operator in the WHERE clause. The above SYBASE example fails on this score. The operator “*=” means ‘include all rows from the first-named table in the join specification whether or not there is a match on the [joining] column in the [second] table’.
Even though the above query produces the desired result, it does not handle all cases (it supports only the left and right versions of outer equijoin, not the full version), and moreover it is hard to explain the semantics of the “*=” operator to the user.
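Nasty Property (1) can be checked mechanically. The sketch below, using assumed sample rows and None for null, builds the Cartesian product and the left outer equijoin as sets of tuples and looks for outer equijoin rows that no restriction of the product could ever yield.

```python
from itertools import product

# Assumed sample data: only the columns needed for the argument.
S = [("S2",), ("S5",)]                 # (S.S#,)
SP = [("S2", "P1"), ("S2", "P2")]      # (SP.S#, P#)

cartesian = {s + sp for s, sp in product(S, SP)}

# The inner equijoin IS a restriction of the product.
equijoin = {row for row in cartesian if row[0] == row[1]}

# The left outer equijoin adds a None-padded row per unmatched supplier.
unmatched = {s + (None, None) for s in S if all(s[0] != sp[0] for sp in SP)}
outer_equijoin = equijoin | unmatched

# Rows of the outer equijoin that are not in the product.
not_in_product = outer_equijoin - cartesian
```

Here `not_in_product` holds the padded row for S5, so no WHERE condition applied to the product, whatever comparison operator it uses, can produce the outer equijoin.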
Nasty Property (2): restriction does not distribute over outer equijoin. This means that the expressions
( R outer-equijoin S ) WHERE restriction-on-R
and
( R WHERE restriction-on-R ) outer-equijoin S
are not equivalent in general. By way of example, consider the following SYBASE query:
The above query has two interpretations, depending on whether the restriction condition ‘SP.QTY < 1000’ is applied before or after the join condition ‘S.S# *= SP.S#’. Performing the restriction first gives a result containing rows for both suppliers S2 and S5; performing the join first gives a result containing supplier S2 only.
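The two interpretations can be reproduced with a small Python sketch. The sample rows are assumptions, None stands for null, and the helper `qty_below` models the SQL rule that a comparison involving null does not evaluate to true.

```python
# Assumed sample data.
S = ["S2", "S5"]                              # S#
SP = [("S2", "P1", 300), ("S2", "P2", 400)]   # (S#, P#, QTY)

def left_outer_join(s, sp):
    """Left outer equijoin over S#, padding unmatched suppliers with None."""
    out = []
    for sno in s:
        matches = [row for row in sp if row[0] == sno]
        if matches:
            out += [(sno, pno, qty) for (_, pno, qty) in matches]
        else:
            out.append((sno, None, None))
    return out

def qty_below(qty, limit):
    """SQL-style test: a comparison with null is not true."""
    return qty is not None and qty < limit

# Interpretation 1: restrict SP first, then join.
restrict_first = left_outer_join(S, [r for r in SP if qty_below(r[2], 1000)])

# Interpretation 2: join first, then restrict the result.
join_first = [r for r in left_outer_join(S, SP) if qty_below(r[2], 1000)]
```

With this data, `restrict_first` keeps suppliers S2 and S5, while `join_first` loses S5 because its QTY is null after the join.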
Chamberlin’s formulation of outer join
One possible approach to extending SQL to support outer equijoins would be to allow the FROM clause to refer to ‘augmented tables’, and then to use the WHERE clause to express the restriction of the result of that FROM clause (the Cartesian product of the augmented tables) that is needed to construct the desired outer join. Here ‘augmented table’ means a table that contains all of the rows of the original table together with an additional all-null row. The augmented version of table T is written ‘T+’. This would be Chamberlin’s formulation of the outer join of fig. 1(c):
Here XYZ is the augmented version of table SP. This approach certainly works and is not ad hoc. But it does suffer from some problems: it is very error prone, and it is not at all easy to get the restrictions in the WHERE clause right, even in simple cases. Moreover, the technique is difficult to apply.
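The augmented-table idea can be sketched in Python as below. The restriction used in `keep` is one plausible way to write the condition, not a reproduction of the original query; the sample rows are assumptions, and the sketch further assumes that SP itself contains no row with a null S#.

```python
from itertools import product

# Assumed sample data.
S = [("S2", "Jones"), ("S5", "Adams")]        # (S#, SNAME)
SP = [("S2", "P1", 300), ("S2", "P2", 400)]   # (S#, P#, QTY)

# The augmented table SP+ (called XYZ in the text): SP plus one all-null row.
ALL_NULL = (None, None, None)
XYZ = SP + [ALL_NULL]

def keep(s_row, x_row):
    """Restriction applied to the product of S and the augmented table."""
    if x_row[0] is not None:
        # Ordinary matching row.
        return s_row[0] == x_row[0]
    # All-null row: keep it only for suppliers that have no shipment at all.
    return all(sp[0] != s_row[0] for sp in SP)

result = [s + x for s, x in product(S, XYZ) if keep(s, x)]
```

Even in this tiny case the restriction needs a nested existence test, which hints at why the approach is error prone.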
WHY WE CANNOT MERELY EXTEND THE FROM CLAUSE:
Because the necessary restriction conditions become quite complex, it is difficult to get the WHERE clause right in Chamberlin’s approach. We might therefore try to extend the FROM clause so that it augments tables with all-null rows appropriately and applies the necessary complex restrictions automatically. INFORMIX attempted such an approach; here is an example:
An immediate criticism of the above syntax is that the result includes rows for which the WHERE condition does not evaluate to true. Another is that the reference to ‘SP’ in the WHERE clause must presumably be taken to denote the augmented table SP+, not the table SP. Such syntactic trickery does not seem to be a very sound basis on which to construct a formal language. This technique will not handle the general case, nor is it easy to state which cases it will handle (like SYBASE, INFORMIX supports left and right outer equijoin but not full outer equijoin). Finally, it is difficult to explain the semantics of the expression ‘outer T’ for arbitrary T.
Nasty Property (3): “A<=B” is not the same as “A<B OR A=B”. Consider the following two queries:
Given the sample data of fig. 1(a), the first of these queries produces a result that happens to be identical to the outer equijoin shown in fig. 2(b). The second, however, produces a result that is the union of the outer equijoin with the left outer ‘less than’ join of the two tables. It is shown below:
The two results are not the same. To be specific, the ‘union’ result includes an additional row in which the supplier (S) S# is S2 and the shipment (SP) S# is null.
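The discrepancy can be reproduced with a generic left outer theta-join in Python. The sample rows are assumptions, supplier numbers are compared as plain strings, and None stands for null.

```python
import operator

# Assumed sample data.
S = ["S2", "S5"]                     # S.S#
SP = [("S2", "P1"), ("S2", "P2")]    # (SP.S#, P#)

def left_outer_theta_join(s, sp, theta):
    """Left outer join of s with sp under comparison theta on the S# columns."""
    out = set()
    for sno in s:
        matches = [(sno, sp_sno, pno) for (sp_sno, pno) in sp if theta(sno, sp_sno)]
        # A supplier with no match under theta is preserved with a None-padded row.
        out.update(matches or [(sno, None, None)])
    return out

# One outer join on "<=" ...
le_join = left_outer_theta_join(S, SP, operator.le)

# ... versus the union of an outer "<" join and an outer "=" join.
lt_or_eq = (left_outer_theta_join(S, SP, operator.lt)
            | left_outer_theta_join(S, SP, operator.eq))

extra = lt_or_eq - le_join
```

S2 has no match under “<”, so the outer “<” join preserves it with a null shipment, and that padded row survives the union.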
Nasty Property (4): the comparison operators are not transitive. This property is a consequence of the fact that we are dealing with three-valued logic. For example, the statement
( A = B AND B = C ) implies ( A = C )
is not valid in three-valued logic. For instance, if A is 1, B is null, and C is 2, then the left-hand side evaluates to unknown, but the right-hand side evaluates to false.
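The example can be traced with a minimal model of three-valued logic in Python, where None plays the role of both null and the truth value unknown. The helper names `eq3` and `and3` are assumptions made for this sketch.

```python
def eq3(a, b):
    """Three-valued equality: unknown (None) if either operand is null."""
    if a is None or b is None:
        return None
    return a == b

def and3(p, q):
    """Three-valued AND: false dominates, then unknown, then true."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

A, B, C = 1, None, 2
lhs = and3(eq3(A, B), eq3(B, C))   # A = B AND B = C
rhs = eq3(A, C)                    # A = C
```

Here `lhs` is unknown while `rhs` is false, so the left-hand side gives no license to conclude anything about A = C.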
PRESERVE CLAUSE
By now it should be clear that no simple extension to the FROM or WHERE clause can handle the outer join problem satisfactorily. So C. J. Date proposed a new PRESERVE clause, whose purpose was to preserve rows from the table named in that clause that did not otherwise contribute to the result of evaluating the preceding FROM and WHERE clauses. PRESERVE was implemented in Computer Associates’ CA-DB product. Here is the CA-DB version of the left outer natural join of S with SP over S#:
The PRESERVE clause was adequate to handle left, right, and full outer theta-joins of exactly two tables. However, the PRESERVE clause per se still does not provide a particularly elegant solution to the outer natural join problem, because of Nasty Property (5).
Nasty Property (5): outer natural join is not a projection of outer equijoin. Unfortunately, this makes it awkward to extend any language that is based on taking projections to support outer natural join.
Fig. 3 is a revised version of fig. 1. Part (a) of the figure gives a slightly revised set of sample values for the tables; part (b) shows the outer equijoin over supplier numbers; and part (c) shows the corresponding outer natural join.
Here is one possible way of formulating a query for the outer natural join of fig. 3(c):
In the above query, the FROM, WHERE, and PRESERVE clauses together construct the outer equijoin of fig. 3(b); the SELECT clause then derives the outer natural join. The COALESCE function returns the value of its first non-null argument, or null if both of its arguments are null. The above query is hardly succinct, however, and is cumbersome to express.
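The role of COALESCE can be sketched in Python. The outer equijoin rows below are hypothetical (fig. 3 is not reproduced here), chosen so that each of the two S# columns is null in some row, and None stands for null.

```python
def coalesce(*args):
    """Return the first argument that is not None, or None if all are None."""
    for arg in args:
        if arg is not None:
            return arg
    return None

# Hypothetical full outer equijoin rows: (S.S#, SNAME, SP.S#, P#).
outer_equijoin = [
    ("S2", "Jones", "S2", "P1"),
    ("S5", "Adams", None, None),   # supplier with no shipment
    (None, None, "S7", "P3"),      # shipment with no matching supplier
]

# Outer natural join: merge the two S# columns into one.
outer_natural_join = [
    (coalesce(s_sno, sp_sno), name, pno)
    for (s_sno, name, sp_sno, pno) in outer_equijoin
]
```

Projecting away either S# column alone would leave a null supplier number in one of the rows, which is why a plain projection cannot produce the outer natural join.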
So C. J. Date’s solution to the above problems is to extend SQL to provide direct ‘one-liner’ support. SQL2, the follow-on to the existing SQL standard, is planned to provide such support. Here is the SQL2 formulation of the outer natural join of fig. 3(c):
IS OUTER JOIN REALLY WHAT WE NEED?
In Date’s opinion, outer join, as the operation is usually understood, is too simplistic, because of its ‘nulls
interpretation’ problem. There are numerous different reasons for generating nulls in the result of an
outer join. So, generating just a single ‘value unknown’ null in every case is not appropriate. This is
explained through an example:
Here, PGMR represents a subtype table (i.e. if employee ‘x’ is a programmer, then ‘x’ will appear in both the EMP and PGMR tables). Let us examine some outer join cases using the above tables.
1) The left outer natural join of EMP with PGMR over EMP# will generate a null LANG value for any employee who is not a programmer. This null should be of the ‘property does not apply’ variety.
2) The left outer natural join of DEPT with EMP over DEPT# will generate null EMP# and SALARY values for any department that has no employees. These nulls should be of the ‘value does not exist’ variety.
3) The left outer natural join of EMP with DEPT over DEPT# will generate null DEPT# and BUDGET values for any employee whose department is unknown. These are ‘value unknown’ nulls.
ARE NULLS REALLY WHAT WE NEED?
Here C. J. Date says that ‘nulls’, meaning the nulls of three-valued or higher-valued logic, are not what we want. We need a DBMS that provides a systematic “DBA-defined default values” mechanism, and a version of outer join that generates such values instead of nulls.
Date also states that outer join is not a good model of certain real-world situations. In Case 2 above, the table produced by the outer join (the left outer join of DEPT with EMP over DEPT#) has DEPT#, BUDGET, EMP#, and SALARY as its columns. The criterion for membership in this table for some candidate row (d,b,e,s) has to be something like this:
“There exists some department with department number d and budget b, and EITHER e is the employee
number of some employee who works in that department and s is that employee’s salary, OR that
department has no employees at all and e is null and s is null.”
The above criterion is not only difficult to state, it is quite difficult to understand as well. Any misunderstanding is likely to result in incorrect queries and wrong answers out of the database. For such reasons, some researchers are exploring alternative approaches to the outer join problem.
OUTER JOIN WITH NO NULLS AND FEWER TEARS
C. J. Date exposes some of the pitfalls of the outer join operator in relational query languages, discusses some of the problems that arise in products in which such an operator has already been implemented, and hints at a preference for an implementation that does not involve nulls. Hugh Darwen accordingly suggests an operator, LEFT JOIN, that:

- Can be painlessly included in a relational algebra.
- Is not a primitive operator in an algebra that includes the Cartesian product, rename, restriction, projection, difference, and union operators.
- Always delivers a relation in 1NF.
- Never generates nulls.
- Is constrained to a certain well-defined subset of the many varieties of outer join that can be distinguished.
LEFT JOIN
The operator LEFT JOIN is intended for inclusion in a language based on the relational algebra, in which results are all null-free. This language is based entirely on traditional two-valued logic. The syntax is as follows:
LEFT JOIN ( left, right, fill )
Explanation:
1. The operands left, right, and fill are relation-valued expressions
2. Some candidate key of right is a subset of the common columns of left and right
3. The columns of fill must be precisely those columns of right that are not columns of left
4. The degrees of left, right, and fill are not otherwise constrained, and may even be zero
5. The cardinality of fill must be exactly one.
Let R = LEFT JOIN (A,B,C). Then R is the left outer natural join of A with B, defined as follows. Each row of R consists of a row of A extended with values for the non-common columns of B, taken from either (a) the matching row of B, if such a matching row exists, or (b) the (unique) row of C otherwise. Every row in A has a corresponding row in R. Rows in B that have no matching row in A do not contribute to R. The word LEFT implies that only the data of what would be the left operand in an infix notation is preserved. The pseudo-SQL version is provided below:
The heading (i.e., set of column names) of R is the union of the headings of A and B. Any candidate key of A is a candidate key of R. R is updateable to the extent that A is updateable, except that the values in columns deriving from B may not be changed.
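The definition of LEFT JOIN (left, right, fill) can be sketched in Python as follows. Rows are modelled as dicts keyed by column name; the function name `left_join` and the sample rows are assumptions. The sketch checks the cardinality of fill but simply assumes the candidate key condition on right, so that at most one row of right matches each row of left.

```python
def left_join(left, right, fill):
    """Darwen-style LEFT JOIN: every row of left appears once, never with nulls."""
    if len(fill) != 1:
        raise ValueError("fill must contain exactly one row")
    fill_row = fill[0]
    result = []
    for a in left:
        # A row of right matches if it agrees with a on all common columns.
        match = next(
            (b for b in right if all(a[c] == b[c] for c in a.keys() & b.keys())),
            None,
        )
        if match is not None:
            extra = {c: v for c, v in match.items() if c not in a}
        else:
            extra = fill_row   # default values instead of nulls
        result.append({**a, **extra})
    return result

# Hypothetical rows for the EMP and PGMR tables.
EMP = [{"EMP#": "E1", "DEPT#": "D1"}, {"EMP#": "E2", "DEPT#": "D1"}]
PGMR = [{"EMP#": "E1", "LANG": "COBOL"}]
FILL = [{"LANG": "none"}]

R = left_join(EMP, PGMR, FILL)
```

In this example EMP# is a candidate key of PGMR and is also the only common column, so the constraint on right holds, and the non-programmer E2 receives the fill value rather than a null.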
LEFT JOIN directly supports all requirements for many-to-one left outer natural joins. One-to-many right outer natural joins are trivially supported too. One-to-one full outer natural joins are indirectly supported, because “FULL JOIN (A,B,Ca,Cb)” can be expressed as UNION ( LEFT JOIN (A,B,Cb), LEFT JOIN (B,A,Ca) ).
One-to-many left, many-to-one right, and many-to-many outer natural joins are not supported.
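The FULL JOIN identity can be sketched in Python under the same dict-based model of rows. The helper names, the column names K, X, and Y, and the sample rows are all assumptions; the join is one-to-one because K is a key of both tables.

```python
def left_join(left, right, fill):
    """LEFT JOIN (left, right, fill) over rows modelled as dicts."""
    (fill_row,) = fill   # fill must contain exactly one row
    result = []
    for a in left:
        match = next(
            (b for b in right if all(a[c] == b[c] for c in a.keys() & b.keys())),
            None,
        )
        extra = fill_row if match is None else {c: v for c, v in match.items() if c not in a}
        result.append({**a, **extra})
    return result

def full_join(a, b, fill_a, fill_b):
    """FULL JOIN (A,B,Ca,Cb) as UNION ( LEFT JOIN (A,B,Cb), LEFT JOIN (B,A,Ca) )."""
    rows = left_join(a, b, fill_b) + left_join(b, a, fill_a)
    # UNION removes duplicates; rows are normalised to sorted item tuples.
    unique = {tuple(sorted(r.items())) for r in rows}
    return [dict(t) for t in sorted(unique)]

# Hypothetical one-to-one data: K is a key of both A and B.
A = [{"K": "1", "X": "a"}, {"K": "2", "X": "b"}]
B = [{"K": "2", "Y": "q"}, {"K": "3", "Y": "r"}]
Ca = [{"X": "?"}]   # defaults for A's non-common columns
Cb = [{"Y": "?"}]   # defaults for B's non-common columns

F = full_join(A, B, Ca, Cb)
```

The row for K = 2 is produced by both LEFT JOINs and appears once after the union; the rows for K = 1 and K = 3 carry default values instead of nulls.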
OUTER JOINS NOT SUPPORTED BY THE LEFT JOIN
Consider the left outer natural join of DEPT with EMP, which LEFT JOIN does not support because it is one-to-many. As already seen, a result in which an unmatched DEPT row is combined with a row of nulls or defaults to represent an employee-less department is heavily suspect. Let SUSPECT be the left outer natural join of DEPT with EMP. To a user glancing at its contents, SUSPECT appears to be all about employees working in departments. Suppose the user wants to know how many employees work in each department and writes the following query:
Unfortunately, the above query does not distinguish empty departments from departments with a single employee, so it does not produce the right output.
A correct way to express the required query involves the left outer natural join of DEPT with a summary of EMP. It is shown here in two steps, the first using SQL and the second using the LEFT JOIN operator:
Here DEPT# is a candidate key for T1, so the constraints we impose on LEFT JOIN are satisfied.
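Both the flawed count over SUSPECT and the two-step correction can be sketched in Python. The sample rows, the column name EMPS, and the helper `left_join` are assumptions; T1 plays the role of the summary of EMP, and None stands for null.

```python
from collections import Counter

# Hypothetical sample data: D3 has no employees.
DEPT = [{"DEPT#": "D1"}, {"DEPT#": "D2"}, {"DEPT#": "D3"}]
EMP = [{"EMP#": "E1", "DEPT#": "D1"},
       {"EMP#": "E2", "DEPT#": "D1"},
       {"EMP#": "E3", "DEPT#": "D2"}]

# Flawed approach: count the rows of SUSPECT per department.
SUSPECT = []
for d in DEPT:
    emps = [e for e in EMP if e["DEPT#"] == d["DEPT#"]]
    if emps:
        SUSPECT += [{**d, "EMP#": e["EMP#"]} for e in emps]
    else:
        SUSPECT.append({**d, "EMP#": None})   # padded row for an empty department
naive = Counter(row["DEPT#"] for row in SUSPECT)

# Step 1: T1 summarises EMP, one row per department that has employees.
T1 = [{"DEPT#": dept, "EMPS": n}
      for dept, n in Counter(e["DEPT#"] for e in EMP).items()]

# Step 2: LEFT JOIN (DEPT, T1, fill) with a fill row that supplies a count of zero.
def left_join(left, right, fill):
    (fill_row,) = fill   # fill must contain exactly one row
    result = []
    for a in left:
        match = next(
            (b for b in right if all(a[c] == b[c] for c in a.keys() & b.keys())),
            None,
        )
        extra = fill_row if match is None else {c: v for c, v in match.items() if c not in a}
        result.append({**a, **extra})
    return result

correct = left_join(DEPT, T1, [{"EMPS": 0}])
```

The naive count reports 1 for both D2 (one employee) and D3 (no employees), whereas the LEFT JOIN with T1 reports 0 for D3.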
LEFT JOIN really does deliver a thoroughly respectable relation as its result: nulls do not appear, and the meaning of each column of the result does not vary from row to row.