System Implementation Techniques

System Architectures for DBMSs
1. Centralized DBMS Architecture
Architectures for DBMSs followed trends similar to those of general computer systems
architectures. Earlier architectures used mainframe computers to provide the main
processing for all functions of the system, including user application programs and user
interface programs, as well as all the DBMS functionality. Hence, all processing was
performed remotely, and only display information and controls were sent from the
computer to the display terminals, which were connected to the central computer via
various types of communications networks.
As the prices of hardware declined, most users replaced their terminals with personal
computers (PCs) and workstations. At first, database systems used these computers in the
same way as they had used display terminals, so that the DBMS itself was still a
centralized DBMS where all the DBMS functionality, application program execution,
and user interface processing were carried out in one machine.
2. Client-Server Architecture
The client-server architecture was developed to deal with computing environments
where a large number of PCs, workstations, file servers, printers, database servers, Web
servers, and other equipment are connected together via a network.
For example, it is possible to connect a number of PCs or small workstations as clients to
a file server that maintains the files of the client machines. Another machine could be
designated as a printer server by being connected to various printers; thereafter, all print
requests by the clients are forwarded to this machine. Web servers or E-mail servers
also fall into the specialized server category. In this way, the resources provided by
specialized servers can be accessed by many client machines. The client machines
provide the user with the appropriate interfaces to utilize these servers as well as with
local processing power to run local applications. This concept can be carried over to
software, with specialized software—such as a DBMS or a CAD package—being stored
on specific server machines and being made accessible to multiple clients.
3. Client-Server Architectures for DBMSs
The client-server architecture is increasingly being incorporated into commercial DBMS
packages. The server is called a query server or transaction server, because it provides
these two functionalities. In RDBMSs, the server is also often called an SQL server,
since most RDBMS servers are based on the SQL language and standard.
In such a client-server architecture, the user interface programs and application programs
can run at the client side. When DBMS access is required, the program establishes a
connection to the DBMS— which is on the server side—and once the connection is
created, the client program can communicate with the DBMS. A standard called Open
Database Connectivity (ODBC) provides an Application Programming Interface (API),
which allows client-side programs to call the DBMS, as long as both client and server
machines have the necessary software installed.
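As an illustrative sketch only, a Python client could use the pyodbc library to access a DBMS through ODBC; the DSN name, credentials, table, and column names below are hypothetical placeholders, not part of the standard:

import pyodbc

# Open an ODBC connection; the data source name (DSN), user id, and
# password are placeholders for whatever the administrator configured.
conn = pyodbc.connect("DSN=CompanyDB;UID=tom;PWD=secret")
cursor = conn.cursor()

# The client program sends SQL to the DBMS server over the connection;
# '?' is the ODBC parameter marker.
cursor.execute("SELECT LNAME, FNAME FROM EMPLOYEE WHERE DNO = ?", 5)
for row in cursor.fetchall():
    print(row.LNAME, row.FNAME)

conn.close()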
A second approach to client-server architecture was taken by some object-oriented DBMSs.
Because many of those systems were developed in the era of client-server architecture,
the approach taken was to divide the software modules of the DBMS between client and
server in a more integrated way.
Catalogs for Relational DBMSs
The information stored in a catalog of an RDBMS includes the relation names, attribute
names, and attribute domains (data types), as well as descriptions of constraints (primary
keys, secondary keys, foreign keys, NULL/NOT NULL, and other types of constraints),
views, and storage structures and indexes. Security and authorization information is also
kept in the catalog.
In relational DBMSs it is common practice to store the catalog itself as relations and to
use the DBMS software for querying, updating, and maintaining the catalog. To include
information on secondary key attributes of a relation, we can simply extend the preceding
catalog if we assume that an attribute can be a member of one key only. In this case we
can replace the MEMBER_OF_PK attribute of REL_AND_ATTR_CATALOG with an
attribute KEY_NUMBER; the value of KEY_NUMBER is 0 if the attribute is not a
member of any key, 1 if it is a member of the primary key, and i > 1 for the secondary
key, where the secondary keys of a relation are numbered 2, 3, ..., n. However, if an
attribute can be a member of more than one key, which is the general case, the above
representation is not sufficient.
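As a small illustration (the rows and attribute names are invented), the KEY_NUMBER scheme could be pictured as follows in Python, which also shows why it breaks down when an attribute belongs to more than one key:

# Hypothetical REL_AND_ATTR_CATALOG rows: (relation, attribute, key_number)
# key_number: 0 = not in any key, 1 = primary key, 2, 3, ... = secondary keys
catalog = [
    ("EMPLOYEE", "SSN",    1),   # member of the primary key
    ("EMPLOYEE", "EMP_ID", 2),   # member of secondary key number 2
    ("EMPLOYEE", "LNAME",  0),   # not a member of any key
]
# If SSN also belonged to secondary key 2, the single KEY_NUMBER value
# could not record both memberships, so this representation fails in
# the general case.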
System Catalog Information in ORACLE
The various commercial database products adopt different conventions and terminology
with regard to their system catalog. However, in general, the catalogs contain similar
metadata describing conceptual, internal, and external schemas. In ORACLE, the
collection of metadata is called the data dictionary. The metadata is information about
schema objects, such as tables, indexes, views, triggers, and more. Access to the data
dictionary is allowed through numerous views, which are divided into three categories:
USER, ALL, and DBA. These terms are used as prefixes for the various views. The
views that have a prefix of USER contain schema information for objects owned by a
given user. Those with a prefix of ALL contain schema information for objects owned by
a user as well as objects that the user has been granted access to, and those with a prefix
of DBA are for the database administrator and contain information about all database
objects.
The system catalog contains information about all three levels of database schemas: (i)
external (view definitions), (ii) conceptual (base tables), and (iii) internal (storage and
index descriptions).
The catalog (metadata) can be retrieved through SQL statements, just as the actual user
data can.
We start with the conceptual schema information. To find the objects owned by a
particular user, ‘TOM’, we can write the following query:
SELECT *
FROM ALL_CATALOG
WHERE OWNER = ‘TOM’;
The result of this query indicates that three base tables are owned by TOM: ACCOUNT,
CUSTOMERS, and ORDERS, plus a view CUSTORDER. The meaning of each
column in the result should be clear from its name. To examine the columns of the
ORDERS table, the USER_TAB_COLUMNS catalog table can be queried:
SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, NUM_DISTINCT,
LOW_VALUE, HIGH_VALUE
FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'ORDERS';
Because the USER_TAB_COLUMNS table of the catalog has the prefix USER_, this query
must be submitted by the owner of the ORDERS table. The last three columns specified in the
SELECT-clause of the SQL query play an important role in the query optimization
process. The NUM_DISTINCT column specifies the number of distinct values for a
given column and the LOW_VALUE and HIGH_VALUE specify the lowest and highest
value, respectively, for the given column. These values, called database statistics, are
not automatically updated when tuples are inserted/deleted/modified. Rather, the statistics
are updated, either by exact computation or by estimation, whenever the ANALYZE SQL
statement in ORACLE is executed as follows:
ANALYZE TABLE ORDERS
COMPUTE STATISTICS;
This SQL statement would update all statistics for the ORDERS relation and its
associated indexes. To access information about the internal schema, the
USER_TABLES and USER_INDEXES catalog tables can be queried. For example, to
find storage information about the ORDERS table, the following query can be
submitted:
SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS, BLOCKS, EMPTY_BLOCKS,
AVG_ROW_LENGTH
FROM USER_TABLES
WHERE TABLE_NAME = ‘ORDERS’;
The result of this query contains a subset of the available storage information in the
catalog. The information includes—from left to right—the minimum percentage of free
space in a block, the size of the initial storage extent in bytes, the number of rows in the
table, the number of used data blocks allocated to the table, the number of free data
blocks allocated to the table, and the average length of a row in the table in bytes.
The information from USER_TABLES also plays a useful role in query processing and
optimization. For example, suppose the entire ORDERS table is stored in a single block of
disk storage. Then processing a query that involves only the ORDERS relation can be
done efficiently by accessing a single disk block. To retrieve information about the
indexes for a specific table, the following query can be run:
SELECT INDEX_NAME, UNIQUENESS, BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS,
AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY
FROM USER_INDEXES
WHERE TABLE_NAME = 'ORDERS';
The storage information about the indexes is just as important to the query optimizer as
the storage information about the relations. For example, the number of index blocks that
have to be accessed when searching for a specific key can be computed as the sum of
BLEVEL and AVG_LEAF_BLOCKS_PER_KEY.
For information about the external schema, the USER_VIEWS table can be queried as
follows:
SELECT *
FROM USER_VIEWS;
Using the view name, CUSTORDER, the associated column information can be
extracted from the USER_TAB_COLUMNS table.
SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH
FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = ‘CUSTORDER’;
Other Catalog Information Accessed by DBMS Software Modules
The DBMS modules use and access the catalog very frequently; that is why it is important
to implement access to the catalog as efficiently as possible. The DBMS software modules
that use and access the catalog include the following:
1. DDL (and SDL) compilers: These DBMS modules process and check the specification
of a database schema in the data definition language (DDL) and store that description in
the catalog. Schema constructs and constraints at all levels—conceptual, internal, and
external—are extracted from the DDL and SDL (storage definition language)
specifications and entered into the catalog, as is any mapping information among levels,
if necessary. Hence, these software modules actually populate (load) the catalog’s
minidatabase (or metadatabase) with data, the data being the descriptions of database
schemas.
2. Query and DML parser and verifier: These modules parse queries, DML retrieval
statements, and database update statements; they also check the catalog to verify whether
all the schema names referenced in these statements are valid. For example, in a
relational system, a query parser would check that all the relation names specified in the
query exist in the catalog and that the attributes specified belong to the appropriate
relations and have the appropriate type.
3. Query and DML compilers: These compilers convert high-level queries and DML
commands into low-level file access commands. The mapping between the conceptual
schema and the internal schema file structures is accessed from the catalog during this
process. For example, the catalog must include a description of each file and its fields and
the correspondences between fields and conceptual-level attributes.
4. Query and DML optimizer: The query optimizer accesses the catalog for access path,
implementation information, and data statistics to determine the best way to execute a
query or DML command. For example, the optimizer accesses the catalog to check which
fields of a relation have hash access or indexes, before deciding how to execute a
selection or join condition on the relation.
5. Authorization and security checking: The DBA has privileged commands to update
the authorization and security portion of the catalog. All access by a user to a relation is
checked by the DBMS for proper authorization by accessing the catalog.
6. External-to-conceptual mapping of queries and DML commands: Queries and DML
commands specified with reference to an external view or schema must be transformed to
refer to the conceptual schema before they can be processed by the DBMS. This is
accomplished by accessing the catalog description of the view in order to perform the
transformation.
Data Dictionary and Data Repository Systems
The terms data dictionary and data repository are used to indicate a more general
software utility than a catalog. A catalog is closely coupled with the DBMS software; it
provides the information stored in it to users and the DBA, but it is mainly accessed by
the various software modules of the DBMS itself, such as DDL and DML compilers, the
query optimizer, the transaction processor, report generators, and the constraint enforcer.
On the other hand, the software package for a stand-alone data dictionary or data
repository may interact with the software modules of the DBMS, but it is mainly used by
the designers, users, and administrators of a computer system for information resource
management. These systems are used to maintain information on system hardware and
software configurations, documentation, applications, and users, as well as other
information relevant to system administration.
If a data dictionary system is used only by designers, users, and administrators, not by the
DBMS software, it is called a passive data dictionary; otherwise, it is called an active
data dictionary or data directory.
QUERY PROCESSING AND OPTIMIZATION
A query expressed in a high-level query language such as SQL must first be scanned,
parsed, and validated. The scanner identifies the language tokens, such as SQL
keywords, attribute names, and relation names, in the text of the query, whereas the
parser checks the query syntax to determine whether it is formulated according to the
syntax rules (rules of grammar) of the query language. The query must also be validated,
by checking that all attribute and relation names are valid and semantically meaningful
names in the schema of the particular database being queried. An internal representation
of the query is then created, usually as a tree data structure called a query tree. It is also
possible to represent the query using a graph data structure called a query graph. The
DBMS must then devise an execution strategy for retrieving the result of the query from
the database files. A query typically has many possible execution strategies, and the
process of choosing a suitable one for processing a query is known as query
optimization.
The query optimizer module has the task of producing an execution plan, and the code
generator generates the code to execute that plan. The runtime database processor has
the task of running the query code, whether in compiled or interpreted mode, to produce
the query result. If a runtime error results, an error message is generated by the runtime
database processor.
Finding the optimal strategy is usually too time-consuming except for the simplest of
queries and may require information on how the files are implemented and even on the
contents of the files—information that may not be fully available in the DBMS catalog.
Hence, planning of an execution strategy may be a more accurate description than query
optimization.
There are two main techniques for implementing query optimization. The first technique
is based on heuristic rules for ordering the operations in a query execution strategy. A
heuristic is a rule that works well in most cases but is not guaranteed to work well in
every possible case. The rules typically reorder the operations in a query tree. The second
technique involves systematically estimating the cost of different execution strategies
and choosing the execution plan with the lowest cost estimate. The two techniques are
usually combined in a query optimizer.
Translating SQL Queries into Relational Algebra
SQL is the query language that is used in most commercial RDBMSs. An SQL query is
first translated into an equivalent extended relational algebra expression—represented as
a query tree data structure—that is then optimized. Typically, SQL queries are
decomposed into query blocks, which form the basic units that can be translated into the
algebraic operators and optimized.
A query block contains a single SELECT-FROM-WHERE expression, as well as
GROUP BY and HAVING clauses if these are part of the block. Hence, nested queries
within a query are identified as separate query blocks. Because SQL includes aggregate
operators—such as MAX, MIN, SUM, and COUNT—these operators must also be
included in the extended algebra.
Consider the following SQL query on the EMPLOYEE relation:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > (SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO=5);
This query includes a nested subquery and hence would be decomposed into two blocks.
The inner block is:
(SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO=5)
and the outer block is:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > c
where c represents the result returned from the inner block.
The inner block could be translated into the extended relational algebra expression
ℱMAX SALARY (σDNO=5 (EMPLOYEE))
and the outer block into the expression
πLNAME, FNAME (σSALARY>c (EMPLOYEE))
The query optimizer would then choose an execution plan for each block.
Basic Algorithms for Executing Query Operations
An RDBMS must include algorithms for implementing the different types of relational
operations that can appear in a query execution strategy. For each such operation or
combination of operations, one or more algorithms would typically be available to
execute the operation.
External Sorting
Sorting is one of the primary algorithms used in query processing. For example,
whenever an SQL query specifies an ORDER BY-clause, the query result must be sorted.
Sorting is also a key component in sort-merge algorithms used for JOIN and other
operations (such as UNION and INTERSECTION), and in duplicate elimination
algorithms for the PROJECT operation (when an SQL query specifies the DISTINCT
option in the SELECT clause).
External sorting refers to sorting algorithms that are suitable for large files of records
stored on disk. The typical external sorting algorithm uses a sort-merge strategy, which
starts by sorting small subfiles—called runs—of the main file and then merges the sorted
runs, creating larger sorted subfiles that are merged in turn.
The sort-merge algorithm, like other database algorithms, requires buffer space in main
memory, where the actual sorting and merging of the runs is performed.
In the sorting phase, runs (portions or pieces) of the file that can fit in the available
buffer space are read into main memory, sorted using an internal sorting algorithm, and
written back to disk as temporary sorted subfiles (or runs).
The size of a run and the number of initial runs (nR) are dictated by the number of file
blocks (b) and the available buffer space (nB). For example, if nB = 5 blocks and the size
of the file is b = 1024 blocks, then nR = ⌈b/nB⌉ = 205 initial runs, each of size 5 blocks
(except the last run, which will have only 4 blocks). Hence, after the sort phase, 205
sorted runs are stored as temporary subfiles on disk.
In the merging phase, the sorted runs are merged during one or more passes. The degree
of merging (dM) is the number of runs that can be merged together in each pass. In each
pass, one buffer block is needed to hold one block from each of the runs being merged, and
one block is needed to contain one block of the merge result. The minimum dM of 2 gives
the worst-case performance of the algorithm, which is
(2 * b) + (2 * (b * (log2 b)))
The first term represents the number of block accesses for the sort phase, since each file
block is accessed twice—once for reading into memory and once for writing the records
back to disk after sorting. The second term represents the block accesses for the merge
phase: with the worst-case dM of 2, up to log2 b merge passes may be needed, and each
pass reads and writes every block of the file once.
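As a minimal Python sketch of the two phases, assuming the input is a simple iterable and each run fits in the buffer (lists stand in for disk blocks, and heapq performs the k-way merge):

import heapq
import itertools

def external_sort(records, buffer_size):
    # Sort phase: read a buffer-sized run, sort it internally, and
    # "write it back" as a temporary sorted subfile (a list here).
    it = iter(records)
    runs = []
    while True:
        run = list(itertools.islice(it, buffer_size))
        if not run:
            break
        runs.append(sorted(run))
    # Merge phase: k-way merge of all sorted runs (a real DBMS would
    # merge dM runs per pass, limited by the available buffer blocks).
    return list(heapq.merge(*runs))

print(external_sort([5, 1, 9, 3, 7, 2, 8], buffer_size=3))  # [1, 2, 3, 5, 7, 9]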
Implementing the SELECT Operation
There are many options for executing a SELECT operation; some depend on the file
having specific access paths and may apply only to certain types of selection conditions.
(OP1): σSSN='123456789' (EMPLOYEE)
(OP2): σDNUMBER>5 (DEPARTMENT)
(OP3): σDNO=5 (EMPLOYEE)
(OP4): σDNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
(OP5): σESSN='123456789' AND PNO=10 (WORKS_ON)
Search Methods for Simple Selection
A number of search algorithms are possible for selecting records from a file. These are
also known as file scans, because they scan the records of a file to search for and retrieve
records that satisfy a selection condition. If the search algorithm involves the use of an
index, the index search is called an index scan. The following search methods (S1
through S6) are examples of some of the search algorithms that can be used to implement
a select operation:
• S1. Linear search: Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
• S2. Binary search: If the selection condition involves an equality comparison on a key
attribute on which the file is ordered, binary search—which is more efficient than linear
search—can be used. An example is OP1 if SSN is the ordering attribute for the
EMPLOYEE file.
• S3. Using a primary index (or hash key): If the selection condition involves an equality
comparison on a key attribute with a primary index (or hash key)—for example, SSN =
‘123456789’ in OP1—use the primary index (or hash key) to retrieve the record.
• S4. Using a primary index to retrieve multiple records: If the comparison condition is >, >=,
<, or <= on a key field with a primary index—for example, DNUMBER > 5 in OP2—use
the index to find the record satisfying the corresponding equality condition (DNUMBER
= 5), then retrieve all subsequent records in the (ordered) file. For the condition
DNUMBER < 5, retrieve all the preceding records.
• S5. Using a clustering index to retrieve multiple records: If the selection condition
involves an equality comparison on a non-key attribute with a clustering index—for
example, DNO = 5 in OP3—use the index to retrieve all the records satisfying the
condition.
• S6. Using a secondary (B+-tree) index on an equality comparison: This search method can
be used to retrieve a single record if the indexing field is a key (has unique values) or to
retrieve multiple records if the indexing field is not a key. This can also be used for
comparisons involving >, >=, <, or <=.
Method S1 applies to any file, but all the other methods depend on having the appropriate
access path on the attribute used in the selection condition. Methods S4 and S6 can be
used to retrieve records in a certain range—for example, 30000<=SALARY<=35000.
Queries involving such conditions are called range queries.
Search Methods for Complex Selection
If a condition of a SELECT operation is a conjunctive condition—that is, if it is made
up of several simple conditions connected with the AND logical connective such as OP4
above—the DBMS can use the following additional methods to implement the operation:
• S7. Conjunctive selection using an individual index: If an attribute involved in any
single simple condition in the conjunctive condition has an access path that permits the
use of one of the Methods S2 to S6, use that condition to retrieve the records and then
check whether each retrieved record satisfies the remaining simple conditions in the
conjunctive condition.
• S8. Conjunctive selection using a composite index: If two or more attributes are
involved in equality conditions in the conjunctive condition and a composite index (or
hash structure) exists on the combined fields—for example, if an index has been created
on the composite key (ESSN, PNO) of the WORKS_ON file for OP5—we can use the
index directly.
• S9. Conjunctive selection by intersection of record pointers: If secondary indexes are
available on more than one of the fields involved in simple conditions in the conjunctive
condition, and if the indexes include record pointers, then each index can be used to
retrieve the set of record pointers that satisfy the individual condition. The intersection
of these sets of record pointers gives the record pointers that satisfy the conjunctive
condition, which are then used to retrieve those records directly. If only some of the
conditions have secondary indexes, each retrieved record is further tested to determine
whether it satisfies the remaining conditions, as sketched below.
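A sketch of method S9 in Python, with hypothetical secondary indexes represented as dictionaries mapping an attribute value to a set of record pointers (rids):

# Hypothetical secondary indexes: attribute value -> set of record ids
dno_index = {4: {12}, 5: {10, 11, 17}}
salary_index = {30000: {11}, 40000: {12, 17}}

# Retrieve the rid set satisfying each simple condition ...
rids_dno = dno_index.get(5, set())
rids_salary = salary_index.get(40000, set())

# ... and intersect them to get the rids satisfying the conjunction.
matching_rids = rids_dno & rids_salary
print(matching_rids)  # {17}; only these records are fetched from disk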
Whenever a single condition specifies the selection—such as OP1, OP2, or OP3—we can
only check whether an access path exists on the attribute involved in that condition. If an
access path exists, the method corresponding to that access path is used; otherwise, the
linear search approach of method S1 can be used. Query optimization for a SELECT
operation is needed mostly for conjunctive select conditions whenever more than one of
the attributes involved in the conditions have an access path. The optimizer should
choose the access path that retrieves the fewest records in the most efficient way by
estimating the different costs and choosing the method with the least estimated cost.
When the optimizer is choosing between multiple simple conditions in a conjunctive
select condition, it typically considers the selectivity of each condition. The selectivity (s)
is defined as the ratio of the number of records (tuples) that satisfy the condition to the
total number of records (tuples) in the file (relation), and thus is a number between zero
and 1—zero selectivity means no records satisfy the condition and 1 means all the
records satisfy the condition. Although exact selectivities of all conditions may not be
available, estimates of selectivities are often kept in the DBMS catalog and are used by
the optimizer.
For example, for an equality condition on a key attribute of relation r(R), s = 1/|r(R)|,
where |r(R)| is the number of tuples in relation r(R). For an equality condition on an
attribute with i distinct values, s is estimated by (|r(R)|/i)/|r(R)| or 1/i, assuming that the
records are evenly distributed among the distinct values. Under this assumption, |r(R)|/i
records will satisfy an equality condition on this attribute. In general, the number of
records satisfying a selection condition with selectivity s is estimated to be |r(R)| * s.
The smaller this estimate is, the higher the
desirability of using that condition first to retrieve records.
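These estimates reduce to simple arithmetic over catalog statistics; a sketch in Python with made-up numbers (10,000 tuples, 50 distinct DNO values):

num_tuples = 10000        # |r(R)|: number of tuples in the relation
distinct_dno = 50         # i: NUM_DISTINCT for the DNO column (made up)

s_key = 1 / num_tuples    # selectivity of an equality condition on a key
s_dno = 1 / distinct_dno  # selectivity of DNO = 5, assuming uniformity

# Estimated number of qualifying records, |r(R)| * s, for each condition:
print(num_tuples * s_key)  # 1.0   -> most desirable to apply first
print(num_tuples * s_dno)  # 200.0

Compared to a conjunctive selection condition, a disjunctive condition is much harder to
process and optimize.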
For example, consider a disjunctive variant of OP4:
(OP4′): σDNO=5 OR SALARY>30000 OR SEX='F' (EMPLOYEE)
With such a condition, little optimization can be done, because the records satisfying the
disjunctive condition are the union of the records satisfying the individual conditions.
Hence, if any one of the conditions does not have an access path, we are compelled to use
the brute force linear search approach. Only if an access path exists on every condition
can we optimize the selection by retrieving the records satisfying each condition—or
their record ids—and then applying the union operation to eliminate duplicates.
Implementing the JOIN Operation
The JOIN operation is one of the most time-consuming operations in query processing.
Many of the join operations encountered in queries are of the EQUIJOIN and NATURAL
JOIN varieties. The term join refers to an EQUIJOIN (or NATURAL JOIN).
There are many possible ways to implement a two-way join, which is a join on two files.
Joins involving more than two files are called multiway joins. The number of possible
ways to execute multiway joins grows very rapidly.
The algorithms we consider are for join operations of the form
R ⨝A=B S
where A and B are domain-compatible attributes of R and S, respectively.
We discuss the most common techniques for performing such a join, using the following
example operations:
(OP6): EMPLOYEE ⨝DNO=DNUMBER DEPARTMENT
(OP7): DEPARTMENT ⨝MGRSSN=SSN EMPLOYEE
Methods for Implementing Joins
• J1. Nested-loop join: For each record t in R (outer loop), retrieve every record s from S
(inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
• J2. Single-loop join (using an access structure to retrieve the matching records): If an
index (or hash key) exists for one of the two join attributes—say, B of S—retrieve each
record t in R, one at a time (single loop), and then use the access structure to retrieve
directly all matching records s from S that satisfy s[B] = t[A].
• J3. Sort–merge join: If the records of R and S are physically sorted (ordered) by value
of the join attributes A and B, respectively, we can implement the join in the most
efficient way possible. Both files are scanned concurrently in order of the join attributes,
matching the records that have the same values for A and B. If the files are not sorted,
they may be sorted first by using external sorting. In this method, pairs of file blocks are
copied into memory buffers in order and the records of each file are scanned only once
each for matching with the other file—unless both A and B are nonkey attributes, in
which case the method needs to be modified slightly.
A variation of the sort-merge join can be used
when secondary indexes exist on both join attributes. The indexes provide the ability to
access the records in order of the join attributes, but the records themselves are physically
scattered all over the file blocks, so this method may be quite inefficient, as every record
access may involve accessing a different disk block.
• J4. Hash-join: The records of files R and S are both hashed to the same hash file, using
the same hashing function on the join attributes A of R and B of S as hash keys. First, a
single pass through the file with fewer records (say, R) hashes its records to the hash file
buckets; this is called the partitioning phase, since the records of R are partitioned into
the hash buckets. In the second phase, called the probing phase, a single pass through
the other file (S) then hashes each of its records to probe the appropriate bucket, and that
record is combined with all matching records from R in that bucket. This simplified
description of hash-join assumes that the smaller of the two files fits entirely into memory
buckets after the first phase.
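A compact Python sketch of methods J1 and J4 over in-memory lists of dicts (the relation contents are invented; a real implementation operates on disk blocks and buffers):

from collections import defaultdict

R = [{"SSN": 1, "DNO": 5}, {"SSN": 2, "DNO": 4}]                             # EMPLOYEE
S = [{"DNUMBER": 5, "DNAME": "Research"}, {"DNUMBER": 4, "DNAME": "Admin"}]  # DEPARTMENT

# J1. Nested-loop join: test every (t, s) pair against t[A] = s[B].
nested = [{**t, **s} for t in R for s in S if t["DNO"] == s["DNUMBER"]]

# J4. Hash join. Partitioning phase: hash the smaller file (R) on the
# join attribute. Probing phase: hash each record of S into the same
# buckets and combine it with the matching R records found there.
buckets = defaultdict(list)
for t in R:
    buckets[t["DNO"]].append(t)
hashed = [{**t, **s} for s in S for t in buckets[s["DNUMBER"]]]

print(nested)  # both methods produce the same equijoin tuples
print(hashed)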
Implementing Aggregate Operations
The aggregate operators (MIN, MAX, COUNT, AVERAGE, SUM), when applied to an
entire table, can be computed by a table scan or by using an appropriate index, if
available. For example, consider the following SQL query:
SELECT MAX(SALARY)
FROM EMPLOYEE;
If an (ascending) index on SALARY exists for the EMPLOYEE relation, then the
optimizer can decide on using the index to search for the largest value by following the
rightmost pointer in each index node from the root to the rightmost leaf. That node would
include the largest SALARY value as its last entry.
In most cases, this would be more efficient than a full table scan of EMPLOYEE, since
no actual records need to be retrieved. The MIN aggregate can be handled in a similar
manner, except that the leftmost pointer is followed from the root to leftmost leaf. That
node would include the smallest SALARY value as its first entry. The index could also
be used for the COUNT, AVERAGE, and SUM aggregates, but only if it is a dense
index—that is, if there is an index entry for every record in the main file. In this case, the
associated computation would be applied to the values in the index. For a nondense
index, the actual number of records associated with each index entry must be used for a
correct computation (except for COUNT DISTINCT, where the number of distinct values
can be counted from the index itself).
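The rightmost-pointer descent for MAX can be sketched over a highly simplified B+-tree-like structure in Python (real index nodes also carry separator keys and block pointers; the layout here is an assumption for illustration):

class Node:
    """Simplified index node: internal nodes keep child pointers,
    leaf nodes keep sorted key values."""
    def __init__(self, children=None, keys=None):
        self.children = children  # None for a leaf
        self.keys = keys

def index_max(root):
    node = root
    while node.children:          # follow the rightmost pointer down
        node = node.children[-1]
    return node.keys[-1]          # last entry of the rightmost leaf

leaves = [Node(keys=[30000, 40000]), Node(keys=[55000, 80000])]
print(index_max(Node(children=leaves)))  # 80000, without fetching records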
When a GROUP BY clause is used in a query, the aggregate operator must be applied
separately to each group of tuples. Hence, the table must first be partitioned into subsets
of tuples, where each partition (group) has the same value for the grouping attributes. In
this case, the computation is more complex. Consider the following query:
SELECT DNO, AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO;
The usual technique for such queries is to first use either sorting or hashing on the
grouping attributes to partition the file into the appropriate groups. Then the algorithm
computes the aggregate function for the tuples in each group, which have the same
grouping attribute(s) value. In the example query, the set of tuples for each department
number would be grouped together in a partition and the average computed for each
group.
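A sketch of the hashing variant for the example query, with invented EMPLOYEE tuples; each dictionary bucket is one partition (group):

from collections import defaultdict

employees = [("Smith", 5, 30000), ("Wong", 5, 40000), ("Wallace", 4, 43000)]

# Partition the tuples into groups on the grouping attribute DNO ...
groups = defaultdict(list)
for lname, dno, salary in employees:
    groups[dno].append(salary)

# ... then apply the aggregate function to each group separately.
for dno, salaries in groups.items():
    print(dno, sum(salaries) / len(salaries))  # DNO, AVG(SALARY)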
Implementing Outer Join
The outer join operation was introduced, with its three variations: left outer join, right
outer join, and full outer join. The following is an example of a left outer join operation
in SQL2:
SELECT LNAME, FNAME, DNAME
FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO=DNUMBER);
The result of this query is a table of employee names and their associated departments. It
is similar to a regular (inner) join result, with the exception that if an EMPLOYEE tuple
(a tuple in the left relation) does not have an associated department, the employee’s name
will still appear in the resulting table, but the department name would be null for such
tuples in the query result. Outer join can be computed by modifying one of the join
algorithms, such as nested-loop join or single-loop join.
For example, to compute a left outer join, the left relation is used as the outer loop or
single loop, because every tuple in the left relation must appear in the result. If there are
matching tuples in the other relation, the joined tuples are produced and saved in the
result. However, if no matching tuple is found, the tuple is still included in the result but
is padded with null value(s).
The sort-merge and hash-join algorithms can also be extended to compute outer joins.
Alternatively, outer join can be computed by executing a combination of relational
algebra operators.
For example, the left outer join operation shown above is equivalent to the following
sequence of relational operations:
1. Compute the (inner) JOIN of the EMPLOYEE and DEPARTMENT tables.
TEMP1 ← πLNAME, FNAME, DNAME (EMPLOYEE ⨝DNO=DNUMBER DEPARTMENT)
2. Find the EMPLOYEE tuples that do not appear in the (inner) JOIN result.
TEMP2 ← πLNAME, FNAME (EMPLOYEE) − πLNAME, FNAME (TEMP1)
3. Pad each tuple in TEMP2 with a null DNAME field.
TEMP2 ← TEMP2 × {null}
4. Apply the UNION operation to TEMP1, TEMP2 to produce the LEFT OUTER JOIN
result.
RESULT ← TEMP1 ∪ TEMP2
The cost of the outer join as computed above would be the sum of the costs of the
associated steps (inner join, projections, and union). However, note that Step 3 can be
done as the temporary relation is being constructed in Step 2; that is, we can simply pad
each resulting tuple with a null. In addition, in Step 4, we know that the two operands of
the union are disjoint (no common tuples), so there is no need for duplicate elimination.
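The four steps can be sketched directly in Python over invented tuples; as noted, the null padding of Step 3 is done while Step 2 finds the dangling tuples (None plays the role of null, and LNAME is assumed unique for simplicity):

EMPLOYEE = [("Smith", 5), ("Borg", None)]       # (LNAME, DNO)
DEPARTMENT = [(5, "Research")]                  # (DNUMBER, DNAME)

# Step 1: inner join of EMPLOYEE and DEPARTMENT on DNO = DNUMBER.
temp1 = [(lname, dname) for lname, dno in EMPLOYEE
         for dnumber, dname in DEPARTMENT if dno == dnumber]

# Steps 2 and 3: EMPLOYEE tuples missing from the join, padded with null.
matched = {lname for lname, _ in temp1}
temp2 = [(lname, None) for lname, _ in EMPLOYEE if lname not in matched]

# Step 4: union; the operands are disjoint, so no duplicate elimination.
result = temp1 + temp2
print(result)  # [('Smith', 'Research'), ('Borg', None)]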
Using Heuristics in Query Optimization
One of the main heuristic rules is to apply SELECT and PROJECT operations before
applying the JOIN or other binary operations. This is because the size of the file resulting
from a binary operation— such as JOIN—is usually a multiplicative function of the sizes
of the input files. The SELECT and PROJECT operations reduce the size of a file and
hence should be applied before a join or other binary operation.
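A toy Python illustration of the rule, with invented relation contents: applying the selection before the join shrinks the join input while producing the same result.

EMPLOYEE = [("Smith", 5), ("Wong", 4), ("Zelaya", 5)]   # (LNAME, DNO)
DEPARTMENT = [(5, "Research"), (4, "Admin")]            # (DNUMBER, DNAME)

# Join first, select later: the join materializes every matching pair.
joined = [(e, d) for e in EMPLOYEE for d in DEPARTMENT if e[1] == d[0]]
late = [(e, d) for (e, d) in joined if e[1] == 5]

# Select first, join later: the selection reduces the outer input first.
selected = [e for e in EMPLOYEE if e[1] == 5]
early = [(e, d) for e in selected for d in DEPARTMENT if e[1] == d[0]]

print(late == early)  # True: same result, fewer intermediate tuples early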
Notation for Query Trees and Query Graphs
A query tree is a tree data structure that corresponds to a relational algebra expression. It
represents the input relations of the query as leaf nodes of the tree, and represents the
relational algebra operations as internal nodes. An execution of the query tree consists of
executing an internal node operation whenever its operands are available and then
replacing that internal node by the relation that results from executing the operation.
The execution terminates when the root node is executed and produces the result relation
for the query.
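This bottom-up execution can be sketched as a recursive walk in Python: a leaf returns its stored relation, and an internal node applies its operation once the results of its operands are available (the node encoding is invented for the example):

def execute(node):
    """node is ("REL", relation) for a leaf, or (op_fn, children) for an
    internal node, where op_fn implements a relational algebra operation."""
    tag, payload = node
    if tag == "REL":
        return payload                        # leaf: the input relation
    operands = [execute(child) for child in payload]
    return tag(*operands)                     # replace node by its result

# Example tree for σSALARY>30000 (EMPLOYEE):
select_high = lambda rel: [t for t in rel if t[1] > 30000]
tree = (select_high, [("REL", [("Smith", 30000), ("Wong", 40000)])])
print(execute(tree))  # [('Wong', 40000)]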
For example, consider a query specified on the relational schema that corresponds to the
following relational algebra expression:
πPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION='Stafford' (PROJECT))
⨝DNUM=DNUMBER (DEPARTMENT)) ⨝MGRSSN=SSN (EMPLOYEE))
This corresponds to the following SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION='Stafford';