ICS 434 - Advanced Database Systems

Query Processing and Optimization
Elmasri/Navathe (3rd Edition)
Chapter 18



In databases that provide low-level access routines such as IMS or flat file
databases, the programmer must write code to perform the queries.
With higher level database query languages such as SQL and QUEL, a special
component of the DBMS called the Query Processor takes care of arranging the
underlying access routines to satisfy a given query.
Thus queries can be specified in terms of the required results rather than in terms
of how to achieve those results.
A query is processed in four general steps:
1. Scanning and Parsing
2. Query Optimization, or planning the execution strategy
3. Query Code Generation (interpreted or compiled)
4. Execution in the runtime database processor
1. Scanning and Parsing







• When a query is first submitted (via an application program), it must be
  scanned and parsed to determine if the query uses valid syntax.
• Scanning is the process of converting the query text into a tokenized
  representation.
• The tokenized representation is more compact and is suitable for processing
  by the parser. This representation may be in a tree form.
• The Parser checks the tokenized representation for correct syntax.
• In this stage, checks are made to determine whether the columns and tables
  identified in the query exist in the database and whether the query has
  been formed correctly with the appropriate keywords and structure.
• If the query passes the parsing checks, it is passed on to the Query
  Optimizer.
2. Query Optimization or Planning the Execution Strategy


• For any given query, there may be a number of different ways to execute it.
• Each operation in the query (SELECT, JOIN, etc.) can be implemented using
  one or more different access routines. For example, an access routine that
  employs an index to retrieve some rows would be more efficient than one
  that performs a full table scan.
• The goal of the query optimizer is to find a reasonably efficient strategy
  for executing the query (not quite what the name implies) using the access
  routines.
• Optimization typically takes one of two forms: Heuristic Optimization or
  Cost Based Optimization.
• In Heuristic Optimization, the query execution plan is refined based on
  heuristic rules for reordering the individual operations.
• With Cost Based Optimization, the overall cost of executing the query is
  systematically reduced by estimating the costs of several different
  execution plans.
3. Query Code Generator (interpreted or compiled)



• Once the query optimizer has determined the execution plan (the specific
  ordering of access routines), the code generator writes out the actual
  access routines to be executed.
• In an interactive session, the query code is interpreted and passed
  directly to the runtime database processor for execution.
• It is also possible to compile the access routines and store them for later
  execution.
4. Execution in the runtime database processor




• At this point, the query has been scanned, parsed, planned and (possibly)
  compiled.
• The runtime database processor then executes the access routines against
  the database.
• The results are returned to the application that made the query in the
  first place. Any runtime errors are also returned.
Query Optimization


• The goal of the query optimizer is to find a reasonably efficient strategy
  for executing the query (not quite what the name implies) using the access
  routines.
Example
SELECT DISTINCT Emp.name, Dept.phone
FROM   Emp, Dept
WHERE  Emp.dept = 'Sales' AND
       Emp.dept = Dept.name

• 1-2 selections, 1 join, 1-2 projections
• Several possible query execution plans (QEPs)

QEP 1:
  Step 1: R1 = Emp × Dept
  Step 2: R2 = σ(Emp.dept = 'Sales' AND Emp.dept = Dept.name)(R1)
  Step 3: π(name, phone)(R2)

or

QEP 2:
  Step 1: R1 = σ(dept = 'Sales')(Emp)
  Step 2: R2 = R1 ⋈(R1.dept = Dept.name) Dept
  Step 3: π(name, phone)(R2)

• Among all possible equivalent QEPs, the one that can be evaluated with the
  minimum cost is the so-called optimal plan.
• cost = I/O cost + CPU cost
  (fetches of data from disk + joins and comparisons in main memory)
• Usually, I/O cost >> CPU cost
• To reduce I/O cost, indexing structures are used as catalogues to the data:
  - Primary key retrieval: B+ trees, Hashing
  - Secondary key retrieval: Inverted files
• The selectivity (s) is the ratio of records satisfying the condition:
  - C1 = V               s = 1 / COLCARD
  - C1 IN (V1, V2, ...)  s = (# of values in list) / COLCARD
  - C1 > V               s = (HIGHKEY - V) / (HIGHKEY - LOWKEY)
  - C1 < V               s = (V - LOWKEY) / (HIGHKEY - LOWKEY)
• Number of I/Os needed = s * (# of rows or pages (blocks))
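The selectivity formulas above can be sketched directly in code. This is a minimal illustration, not part of any real DBMS; the function names and the example statistics (125 distinct values, 2000 blocks) are chosen to match the numbers used later in these notes, and the uniform-distribution assumption is the same one the formulas make.

```python
# Rough selectivity estimates for simple predicates, following the
# formulas above. COLCARD, HIGHKEY and LOWKEY would come from catalog
# statistics; a uniform distribution of values is assumed throughout.

def sel_equals(colcard):
    """C1 = V : one value out of COLCARD distinct values."""
    return 1.0 / colcard

def sel_in(num_values, colcard):
    """C1 IN (V1, V2, ...) : one slice per listed value."""
    return num_values / colcard

def sel_greater(v, highkey, lowkey):
    """C1 > V : fraction of the value range above V."""
    return (highkey - v) / (highkey - lowkey)

def sel_less(v, highkey, lowkey):
    """C1 < V : fraction of the value range below V."""
    return (v - lowkey) / (highkey - lowkey)

def io_estimate(selectivity, num_blocks):
    """Number of I/Os = s * number of pages (blocks)."""
    return selectivity * num_blocks

# e.g. an equality match on a column with 125 distinct values,
# in a table stored on 2000 blocks:
s = sel_equals(125)              # 1/125 = 0.008
print(io_estimate(s, 2000))      # roughly 16 I/Os
```

Real optimizers refine these estimates with histograms, but the uniform-distribution version is what the formulas above (and the worked examples later in these notes) use.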
Query Optimization Strategies
We divide the query optimization into two types: Heuristic (sometimes called Rule
based) and Systematic (Cost based).
Heuristic Query Optimization




• A query can be represented as a tree data structure. Operations are at the
  interior nodes and data items (tables, columns) are at the leaves.
• The query is evaluated in a depth-first pattern.
• In Oracle this is called Rule Based optimization.
Consider this query from Elmasri/Navathe:
SELECT PNUMBER, DNUM, LNAME
FROM   PROJECT, DEPARTMENT, EMPLOYEE
WHERE  DNUM = DNUMBER AND MGRSSN = SSN AND
       PLOCATION = 'Stafford';

In relational algebra:

  π(PNUMBER, DNUM, LNAME)( ((σ(PLOCATION = 'Stafford')(PROJECT))
      ⋈(DNUM = DNUMBER) DEPARTMENT) ⋈(MGRSSN = SSN) EMPLOYEE )

on the following schema:
EMPLOYEE TABLE:
FNAME     MI LNAME     SSN        BDATE      ADDRESS                    S  SALARY  SUPERSSN   DNO
--------  -- --------  ---------  ---------  -------------------------  -  ------  ---------  ---
JOHN      B  SMITH     123456789  09-JAN-55  731 FONDREN, HOUSTON, TX   M  30000   333445555  5
FRANKLIN  T  WONG      333445555  08-DEC-45  638 VOSS, HOUSTON, TX      M  40000   888665555  5
ALICIA    J  ZELAYA    999887777  19-JUL-58  3321 CASTLE, SPRING, TX    F  25000   987654321  4
JENNIFER  S  WALLACE   987654321  20-JUN-31  291 BERRY, BELLAIRE, TX    F  43000   888665555  4
RAMESH    K  NARAYAN   666884444  15-SEP-52  975 FIRE OAK, HUMBLE, TX   M  38000   333445555  5
JOYCE     A  ENGLISH   453453453  31-JUL-62  5631 RICE, HOUSTON, TX     F  25000   333445555  5
AHMAD     V  JABBAR    987987987  29-MAR-59  980 DALLAS, HOUSTON, TX    M  25000   987654321  4
JAMES     E  BORG      888665555  10-NOV-27  450 STONE, HOUSTON, TX     M  55000   null       1

DEPARTMENT TABLE:
DNAME           DNUMBER  MGRSSN     MGRSTARTD
--------------  -------  ---------  ---------
HEADQUARTERS    1        888665555  19-JUN-71
ADMINISTRATION  4        987654321  01-JAN-85
RESEARCH        5        333445555  22-MAY-78

PROJECT TABLE:
PNAME            PNUMBER  PLOCATION  DNUM
---------------  -------  ---------  ----
ProductX         1        Bellaire   5
ProductY         2        Sugarland  5
ProductZ         3        Houston    5
Computerization  10       Stafford   4
Reorganization   20       Houston    1
NewBenefits      30       Stafford   4

WORKS_ON TABLE:
ESSN       PNO  HOURS
---------  ---  -----
123456789  1    32.5
123456789  2    7.5
666884444  3    40.0
453453453  1    20.0
453453453  2    20.0
333445555  2    10.0
333445555  3    10.0
333445555  10   10.0
333445555  20   10.0
999887777  30   30.0
999887777  10   10.0
987987987  10   35.0
987987987  30   5.0
987654321  30   20.0
987654321  20   15.0
888665555  20   null

Which of the following query trees is more efficient?
The left hand tree is evaluated in steps as follows:
The right hand tree is evaluated in steps as follows:



• Note the two cross product operations. These require lots of space and time
  (nested loops) to build.
• After the two cross products, we have a temporary table with 144 records
  (6 projects * 3 departments * 8 employees).
• An overall rule for heuristic query optimization is to perform as many
  select and project operations as possible before doing any joins.
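The intermediate-result arithmetic behind that rule is worth making explicit. A small sketch, using only the cardinalities from the sample data above (6 projects, 3 departments, 8 employees, 2 projects located in Stafford):

```python
# Why selection push-down matters: the left-hand tree materializes the
# full cross product before selecting, while the right-hand tree
# filters first. Cardinalities come from the sample COMPANY data above.

n_projects, n_departments, n_employees = 6, 3, 8

# Cross products first: every row combination is materialized.
cross_product_rows = n_projects * n_departments * n_employees
print(cross_product_rows)          # 144 temporary records

# Selections first: only the Stafford projects (2 of 6) survive before
# any join work happens, so intermediate results stay small.
stafford_projects = 2              # PLOCATION = 'Stafford' in the data
rows_after_joins = stafford_projects * 1 * 1   # one dept, one manager each
print(rows_after_joins)            # 2 records
```

The 144-versus-2 gap only grows with realistic table sizes, which is why "select and project early" is the dominant heuristic.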
Systematic (Cost based) Query Optimization
Note: The following notes are based upon materials presented in the Connolly/Begg 3rd edition. The
notations differ between textbooks.








• Just looking at the syntax of the query may not give the whole picture; we
  need to look at the data as well.
• There are several cost components to consider:
  1. Access cost to secondary storage (hard disk)
  2. Storage cost for intermediate result sets
  3. Computation costs: CPU, memory transfers, etc. for performing in-memory
     operations
  4. Communication costs to ship data around a network, e.g., in a
     distributed or client/server database
• Of these, access cost is the most crucial in a centralized DBMS. The more
  work we can do with data in cache or in memory, the better.
• Access routines are algorithms that are used to access and aggregate data
  in a database.
• An RDBMS may have a collection of general purpose access routines that can
  be combined to implement a query execution plan.
• We are interested in access routines for selection, projection, join, and
  set operations such as union, intersection, set difference, cartesian
  product, etc.
• As with heuristic optimization, there can be many different plans that lead
  to the same result. In general, if a query contains n operations, there
  will be up to n! possible plans.
• However, not all plans will make sense. We should consider:
  - Perform all simple selections first
  - Perform joins next
  - Perform projection last
Overview of the Cost Based optimization process:
1. Enumerate all of the legitimate plans (call these P1...Pn), where each
   plan contains a set of operations O1...Ok.
2. Select a plan.
3. For each operation Oi in the plan, enumerate the access routines.
4. For each possible access routine for Oi, estimate the cost. Select the
   access routine with the lowest cost.
5. Repeat steps 3 and 4 until an efficient access routine has been selected
   for each operation. Sum up the costs of each access routine to determine
   a total cost for the plan.
6. Repeat steps 2 through 5 for each plan and choose the plan with the
   lowest total cost.
Example outline: Assume 4 operations (one projection, two selections and a
join): P1, S1, S2 and J1.
In general, perform the selections first, then the join, and finally the
projection.
1. Enumerate the plans. Note there are two possible orderings of the
   selections, so the two plans become:
   Plan A: S1 S2 J1 P1
   Plan B: S2 S1 J1 P1
2. Choose a plan (let us start with Plan A).
3. For each operation, enumerate the access routines:
   Operation S1 has possible access routines: linear search and binary search
   Operation S2 has possible access routines: linear search and indexed search
   Operation J1 has possible access routines: nested loop join and indexed join
4. Choose the least cost access routine for each operation:
   Operation S1: least cost access routine is binary search at a cost of 10 blocks
   Operation S2: least cost access routine is linear search at a cost of 20 blocks
   Operation J1: least cost access routine is indexed join at a cost of 40 blocks
5. The sum of the costs of the access routines is: 10 + 20 + 40 = 70.
   Thus the total cost for Plan A will be: 70
6. In repeating steps 2 through 5 for Plan B, we come up with:
   Operation S2: least cost access routine is binary search at a cost of 20 blocks
   Operation S1: least cost access routine is indexed search at a cost of 5 blocks
   Operation J1: least cost access routine is indexed join at a cost of 30 blocks
   The sum of the costs of the access routines is: 20 + 5 + 30 = 55.
   Thus the total cost for Plan B will be: 55

Final result: Plan B would be the best plan; pass Plan B along to the query
code generator.
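The plan-selection loop above can be sketched in a few lines. The chosen-routine costs (10, 20, 40 for Plan A; 20, 5, 30 for Plan B) are the ones from the example; the costs of the rejected alternatives (100, 25, 90) are invented here purely so there is something for `min()` to reject.

```python
# Steps 2-6 of the cost-based process: for each plan, pick the cheapest
# access routine per operation, total the costs, and keep the best plan.
# Per-routine block counts for the winning routines match the example;
# the losing alternatives' costs are made up for illustration.

plans = {
    "Plan A": {
        "S1": {"linear search": 100, "binary search": 10},
        "S2": {"linear search": 20, "indexed search": 25},
        "J1": {"nested loop join": 90, "indexed join": 40},
    },
    "Plan B": {
        "S2": {"linear search": 100, "binary search": 20},
        "S1": {"linear search": 100, "indexed search": 5},
        "J1": {"nested loop join": 90, "indexed join": 30},
    },
}

def plan_cost(operations):
    # cheapest access routine for each operation, summed over the plan
    return sum(min(routines.values()) for routines in operations.values())

costs = {name: plan_cost(ops) for name, ops in plans.items()}
best = min(costs, key=costs.get)
print(costs)   # Plan A totals 70, Plan B totals 55
print(best)    # Plan B is handed to the query code generator
```

A real optimizer estimates each routine's cost from catalog statistics rather than taking them as given, but the control flow is the same.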
Columns to Index




• Primary key and foreign key columns
• Columns that have aggregates computed frequently
• Columns that are searched or joined over less than 5 to 10 percent of the
  rows
• Columns frequently used in ORDER BY, GROUP BY, and DISTINCT clauses
Clustering Index



• The clustering index holds the most potential for performance gains.
• A clustered index allows for merge join.
• Columns that benefit from clustering:
  1. Primary key columns
  2. Columns with low index cardinality (# of distinct values) or a skewed
     distribution
  3. Columns frequently processed in sequence (ORDER BY, GROUP BY, DISTINCT)
  4. Columns frequently searched or joined over a range of values, such as
     BETWEEN, <, >, LIKE

When Not to Index
Avoid indexing columns:
  1. That are frequently updated
  2. That have low cardinality or a skewed distribution (unless a clustering
     index is used)
  3. That are longer than 40 bytes
  4. That cause lock contention problems
Avoid indexing tables:
  1. That are fewer than 7 pages, unless used frequently in joins or in
     referential integrity checking
Code Transformation


• Code transformations include adding predicates, transforming code, and
  flattening nested selects.
• Examples of adding predicates:
  1. WHERE TA.C1 = TB.C5
     AND   TA.C1 > 10
     AND   TB.C5 > 10     -- added by the optimizer
  2. WHERE TA.C1 = TB.C5
     AND   TB.C5 = TC.C10
     AND   TC.C10 = TA.C1 -- added by the optimizer
• Example of transforming code:
  1. WHERE TA.C1 = 10 OR TA.C1 = 30
     transformed to:
     WHERE TA.C1 IN (10, 30)
Using Hints in Oracle SQL


• In Oracle, you can embed hints in SQL statements to guide the optimizer
  towards making more efficient choices.
• The general syntax for a SELECT statement is:

  SELECT /*+ hint */ colA, colB, colC, ...
  FROM tab1, tab2, ...

• The /* and */ normally delimit a comment. Placing a + sign (plus sign)
  after the opening comment delimiter causes the comment to be treated as a
  hint.
• Different values for hint include:
  o ALL_ROWS - Optimize the query for best throughput (lowest resource
    utilization).
  o FIRST_ROWS - Optimize for fastest response time.
  o CHOOSE - The optimizer chooses either Rule based or Cost based. If
    statistics are available (via the ANALYZE TABLE command), Cost based is
    chosen; otherwise, Rule based is chosen.
  o RULE - Force the use of the Rule based optimizer.
• For example, the following query forces the use of the cost based
  optimizer, working to minimize response time:

  SELECT /*+ FIRST_ROWS */ fname, lname, salary, dno
  FROM   employee
  WHERE  salary > 39000;

• The following query forces the use of the Rule based optimizer:

  SELECT /*+ RULE */ fname, lname, salary, dno
  FROM   employee
  WHERE  salary > 39000;

• Other types of hints include:
  o FULL(table) - Force a full table scan for table, i.e., ignore an index
    if one exists.
  o INDEX(table index) - Force the use of a given index when accessing the
    given table.
  o ORDERED - Force the tables to be joined in the order in which they
    appear in the FROM clause.
  o STAR - Force a star query plan (we come back to this when we talk about
    data warehousing).
  o USE_NL(table1, table2) - Use a nested loop to perform the join of
    table1 to table2.
  o USE_MERGE(table1, table2) - Use a sort-merge join operation to join
    table1 with table2.
  o DRIVING_SITE - Force the query execution to be performed at a different
    site in a distributed database.
  o PUSH_SUBQ - Evaluate a non-correlated subquery as early as possible in
    the query plan.
• For more details, see Chapter 8, "Optimizer Modes and Hints", in the book
  Oracle8(TM) Server Tuning, Release 8.0, Part Number A54638-01.
• For a general explanation of how Oracle handles query optimization, see
  Chapter 19, "The Optimizer", in the book Oracle8(TM) Server Concepts,
  Release 8.0, Part Number A54643-01.
Using Hints in MS SQL Server
Note: This material from the MS SQL Server 7.0 Books Online
In MS SQL Server, query hints can be added using an OPTION clause at the end
of the statement:

SELECT select_list
FROM table_source
WHERE search_condition
GROUP BY group_by_expression
HAVING search_condition
ORDER BY order_expression
OPTION (query_hint)
Some of the query hints available are:

{HASH | ORDER} GROUP
  Specifies that aggregations described in the GROUP BY or COMPUTE clause of
  the query should use hashing or ordering.
{MERGE | HASH | CONCAT} UNION
  Specifies that all UNION operations are performed by merging, hashing, or
  concatenating UNION sets. If more than one UNION hint is specified, the
  query optimizer selects the least expensive strategy from those hints
  specified.
{LOOP | MERGE | HASH} JOIN
  Specifies that all join operations are performed by loop join, merge join,
  or hash join in the whole query. If more than one join hint is specified,
  the query optimizer selects the least expensive join strategy from the
  allowed ones. If, in the same query, a join hint is also specified for a
  specific pair of tables, it takes precedence in the joining of the two
  tables.
FAST number_rows
  Specifies that the query is optimized for fast retrieval of the first
  number_rows (a nonnegative integer). After the first number_rows are
  returned, the query continues execution and produces its full result set.
FORCE ORDER
  Specifies that the join order indicated by the query syntax is preserved
  during query optimization.
Additional Material for Reading Purposes (Not included in the exam):
Complete Heuristics Example

Consider this query from Elmasri/Navathe:
SELECT PNUMBER, DNUM, LNAME
FROM   PROJECT, DEPARTMENT, EMPLOYEE
WHERE  DNUM = DNUMBER AND MGRSSN = SSN AND
       PLOCATION = 'Stafford';

In relational algebra:

  π(PNUMBER, DNUM, LNAME)( ((σ(PLOCATION = 'Stafford')(PROJECT))
      ⋈(DNUM = DNUMBER) DEPARTMENT) ⋈(MGRSSN = SSN) EMPLOYEE )

on the following schema:
(The same EMPLOYEE, DEPARTMENT, PROJECT and WORKS_ON sample data as shown
earlier.)
Which of the following query trees is more efficient?
The left hand tree is evaluated in steps as follows:
The right hand tree is evaluated in steps as follows:





• Note the two cross product operations. These require lots of space and time
  (nested loops) to build.
• After the two cross products, we have a temporary table with 144 records
  (6 projects * 3 departments * 8 employees).
• An overall rule for heuristic query optimization is to perform as many
  select and project operations as possible before doing any joins.
There are a number of transformation rules that can be used to transform a query:
1. Cascading selections. A list of conjunctive conditions can be broken up
into separate individual conditions.
2. Commutativity of the selection operation.
3. Cascading projections. All but the last projection can be ignored.
4. Commuting selection and projection. If a selection condition only involves
attributes contained in a projection clause, the two can be commuted.
5. Commutativity of Join and Cross Product.
6. Commuting selection with Join.
7. Commuting projection with Join.
8. Commutativity of set operations. Union and Intersection are commutative.
9. Associativity of Union, Intersection, Join and Cross Product.
10. Commuting selection with set operations.
11. Commuting projection with set operations.
12. Logical transformation of selection conditions. For example, using
DeMorgan's law, etc.
13. Combine Selection and Cartesian product to form Joins.
These transformations can be used in various combinations to optimize queries.
Some general steps follow:
1. Using rule 1, break up conjunctive selection conditions and chain them
together.
2. Using the commutativity rules, move the selection operations as far down
the tree as possible.
3. Using the associativity rules, rearrange the leaf nodes so that the most
restrictive selection conditions are executed first. For example, an equality
condition is likely more restrictive than an inequality condition (range
query).
4. Combine cartesian product operations with associated selection conditions
to form a single Join operation.
5. Using the commutativity of Projection rules, move the projection
operations down the tree to reduce the sizes of intermediate result sets.
6. Finally, identify subtrees that can be executed using a single efficient
access method.
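Step 2 above (pushing selections down the tree, transformation rule 6) can be sketched concretely. This is a toy model, not a real optimizer: the tuple-based tree encoding, the node names, and the single rewrite rule are all inventions for this sketch.

```python
# Minimal sketch of selection push-down: rewrite
#   SELECT(cond, CROSS(L, R))  ->  CROSS(SELECT(cond, L), R)
# when cond only mentions attributes of L (symmetrically for R).
# Trees are tuples: ("table", name, attrs), ("cross", left, right),
# or ("select", cond_text, cond_attrs, child).

def table_attrs(tree):
    """All attributes reachable below a node."""
    if tree[0] == "table":
        return tree[2]
    if tree[0] == "cross":
        return table_attrs(tree[1]) | table_attrs(tree[2])
    return table_attrs(tree[3])          # "select" node

def push_selection(tree):
    kind = tree[0]
    if kind == "select":
        _, cond, attrs, child = tree
        if child[0] == "cross":
            _, left, right = child
            if attrs <= table_attrs(left):     # cond only touches left side
                return ("cross", push_selection(("select", cond, attrs, left)), right)
            if attrs <= table_attrs(right):    # cond only touches right side
                return ("cross", left, push_selection(("select", cond, attrs, right)))
        if child[0] == "table":
            return tree                        # already at a leaf
        return ("select", cond, attrs, push_selection(child))
    if kind == "cross":
        return ("cross", push_selection(tree[1]), push_selection(tree[2]))
    return tree

# sigma_{PLOCATION='Stafford'}(PROJECT x DEPARTMENT)
q = ("select", "PLOCATION = 'Stafford'", {"PLOCATION"},
     ("cross",
      ("table", "PROJECT", {"PNUMBER", "PLOCATION", "DNUM"}),
      ("table", "DEPARTMENT", {"DNUMBER", "MGRSSN"})))

rewritten = push_selection(q)
print(rewritten[0])      # 'cross' -- the selection moved below the product
```

After the rewrite, the selection sits directly over PROJECT, so the cross product is built from the already-filtered rows, which is exactly the effect step 2 is after.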
Example of Heuristic Query Optimization
1. Original Query Tree
2. Use Rule 1 to Break up Cascading Selections
3. Commute Selection with Cross Product
4. Combine Cross Product and Selection to form Joins
Complete Systematic (Cost based) Query Optimization Example
Metadata stored in the system catalog:
1. Blocking factor - the number of data records (tuples) per block, given as
   bFactor(R); the number of tuples in R is given as nTuples(R).
2. Indexes on tables - primary, secondary, etc. The number of index levels
   for an index I on attribute A is given as nLevelsA(I).
3. The number of blocks that make up the first index level (leaf level),
   given as nLfBlocksA(I).
4. The number of distinct values of an indexing attribute (such as a key),
   given as nDistinctA(R); used for estimation of selectivity.
5. The selection cardinality SCA(R) of an attribute A: the average number of
   records that will satisfy an equality selection condition on that
   attribute. For non-key attributes, we can estimate:
   SCA(R) = nTuples(R) / nDistinctA(R)
   (assuming a uniform distribution of the values of A)



• Selectivity - the ratio of the number of records returned by a selection
  to the total number of records.
• Many DBMSs maintain estimates of selectivities in the data dictionary.
• In Oracle, for example, you can force the collection of statistics on
  tables and indexes with the ANALYZE TABLE command.
• Since access cost is dominant, we concentrate on this cost.
Basic variables and notations used (Connolly/Begg):

Number of records (tuples) in a table (relation):
  nTuples(R)

Blocking factor (records per block):
  Spanned:   bFactor(R) = BlockSize / RecordSize(R)
  Unspanned: bFactor(R) = BlockSize / RecordSize(R)
  (remember to apply a floor function to this result for unspanned records)

Number of blocks required to store the table:
  nBlocks(R) = nTuples(R) / bFactor(R)
  (remember to apply a ceiling function to this result)

Number of levels of an index:
  nLevelsA(I)

Number of blocks required for the first index level:
  nLfBlocksA(I)

Number of distinct values of an attribute:
  nDistinctA(R)

Selection cardinality (average number of records satisfying an equality
condition, assuming a uniform distribution of distinct values):
  For a non-key: SCA(R) = nTuples(R) / nDistinctA(R)
  For a key:     SCA(R) = 1

Selection cardinality for non-equality conditions (assuming a uniform
distribution of distinct values):
  For an inequality (A > c):
    SCA(R) = nTuples(R) * ( (maxA(R) - c) / (maxA(R) - minA(R)) )
  For an inequality (A < c), according to the textbook:
    SCA(R) = nTuples(R) * ( (c - maxA(R)) / (maxA(R) - minA(R)) )
  Perhaps this is better:
    SCA(R) = nTuples(R) * ( (c - minA(R)) / (maxA(R) - minA(R)) )
  For conjunctive conditions (A AND B):
    SC(A AND B)(R) = SCA(R) * SCB(R)
  For disjunctive conditions (A OR B):
    SC(A OR B)(R) = SCA(R) + SCB(R) - SCA(R) * SCB(R)
Cost functions for SELECT operations (Connolly/Begg):

Non-equality condition, or equality on a non-key: full table scan
  nBlocks(R)
Equality condition on a key in unordered data: linear search (on average)
  nBlocks(R) / 2
Equality condition on a key in ordered data: binary search
  log2( nBlocks(R) )
Equality condition on a key in ordered data using a primary index:
  nLevelsA(I) + 1
Equality condition on a non-key, or non-equality condition, retrieving
multiple records using a multi-level index (assume 1/2 of the records match
the condition):
  nLevelsA(I) + nBlocks(R) / 2
Equality condition on a non-key using a clustering index:
  nLevelsA(I) + SCA(R) / bFactor(R)
Equality condition on a non-key using a secondary index (in the worst case,
each matching record may reside on a different block):
  nLevelsA(I) + SCA(R)
Single Table Query Examples

What we know about the EMPLOYEE table:
  EMPLOYEE has 10,000 records with a blocking factor of 5:
    nTuples(E) = 10000, bFactor(E) = 5, nBlocks(E) = 2000
  Assume a secondary index on the key attribute SSN:
    nLevelsSSN(E.SSN) = 4, SCSSN(E) = 1
  Assume a secondary index on the non-key attribute DNO:
    nLevelsDNO(E.DNO) = 2, nLfBlocksDNO(E.DNO) = 4
  There are 125 distinct values of DNO:
    nDistinctDNO(E) = 125
  The selection cardinality of DNO is:
    SCDNO(E) = nTuples(E) / nDistinctDNO(E) = 80

Query 1: SELECT * FROM employee WHERE SSN = 123456789
1. Cost of a full table scan:
   nBlocks(E) = 2000 blocks
2. Average cost (no index) of a linear search:
   nBlocks(E) / 2 = 1000 blocks
3. Cost when using the secondary index on SSN:
   nLevelsSSN(E.SSN) + 1 = 5 blocks

Query 2: SELECT * FROM employee WHERE DNO = 5
1. Cost of a full table scan:
   nBlocks(E) = 2000 blocks
2. Rough estimate using the secondary index:
   nLevelsDNO(E.DNO) + SCDNO(E) = 82 blocks

Query 3: SELECT * FROM employee WHERE DNO > 5
1. Cost of a full table scan:
   nBlocks(E) = 2000 blocks
2. Rough estimate using the secondary index:
   nLevelsDNO(E.DNO) + SCDNO(E) = 9,680 blocks
   where SCDNO(E) is estimated using the formula for an inequality (A > c):
   SCDNO(E) = nTuples(E) * ( (maxDNO(E) - c) / (maxDNO(E) - minDNO(E)) )
            = 10,000 * ( (125 - 5) / (125 - 1) )
            = 9677.419, or roughly 9,678 records

Query 4: SELECT * FROM employee WHERE SALARY > 30000 AND DNO > 4
This needs to be done in two steps:
   TEMP <- σ(dno > 4)(E), then σ(salary > 30000)(TEMP)
   or, equivalently, σ(salary > 30000)( σ(dno > 4)(E) )
1. Cost of two full table scans:
   nBlocks(E) + nBlocks(TEMP) <= 4000 blocks
2. Rough estimate using the secondary index, followed by a full table scan
   of TEMP:
   ( nLevelsDNO(E.DNO) + SCDNO(E) ) + nBlocks(TEMP) <= 2082 blocks
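The arithmetic in Queries 1-3 can be checked mechanically. A short sketch using the cost functions from the table above, with the EMPLOYEE statistics given in this example (variable names are mine, not textbook notation):

```python
import math

# Connolly/Begg-style SELECT cost formulas applied to the EMPLOYEE
# examples above (10,000 tuples, blocking factor 5, 2,000 blocks).

nTuples, bFactor = 10_000, 5
nBlocks = math.ceil(nTuples / bFactor)          # 2000 blocks

# Query 1: equality on key SSN (4-level secondary index, SC = 1)
full_scan   = nBlocks                           # 2000 blocks
linear_avg  = nBlocks // 2                      # 1000 blocks on average
via_ssn_idx = 4 + 1                             # nLevels + 1 = 5 blocks

# Query 2: equality on non-key DNO (2-level index, 125 distinct values)
SC_dno      = nTuples // 125                    # 80 matching records
via_dno_idx = 2 + SC_dno                        # 82 blocks

# Query 3: DNO > 5, estimated with the inequality formula for A > c
SC_range    = nTuples * (125 - 5) / (125 - 1)   # about 9677.4 records
via_dno_rng = 2 + math.ceil(SC_range)           # 9680 blocks, as above

print(full_scan, linear_avg, via_ssn_idx, via_dno_idx, via_dno_rng)
```

Note how the index helps enormously for the selective key lookup (5 vs 2000 blocks) but is far worse than a full scan for the unselective range predicate (9680 vs 2000 blocks).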
Cost function estimates for Join Operations

• Basic join: relation R with relation S where R.A = S.B:
  T = R ⋈(R.A = S.B) S

Number of blocks in each relation:
  nBlocks(R), nBlocks(S)
Number of tuples in each relation:
  nTuples(R), nTuples(S)
Blocking factor of the joined relation T (block size divided by the sum of
the record sizes of R and S):
  bFactor(T)
Size of the cartesian product of R and S:
  nTuples(R) * nTuples(S) tuples, in nBlocks(R) * nBlocks(S) blocks

Join cardinalities (Connolly/Begg) for R ⋈(R.A = S.B) S resulting in T:
  If A is a key of R, then nTuples(T) <= nTuples(S)
  If B is a key of S, then nTuples(T) <= nTuples(R)
  Otherwise, estimate as:
    nTuples(T) = SCA(R) * nTuples(S)
    or
    nTuples(T) = SCB(S) * nTuples(R)

• There are two parts to a join cost: the cost to read the tuples from each
  relation, plus the cost to store the result.

Join cost functions (Connolly/Begg):

Nested loop join:
  nBlocks(R) + ( nBlocks(R) * nBlocks(S) ) + [ nTuples(T) / bFactor(T) ]
Indexed join (assuming an index on S.B):
  nBlocks(R) + [ nTuples(R) * (nLevelsB(I) + SCB(S)) ] +
  [ nTuples(T) / bFactor(T) ]
Sort-merge join (assuming the tables are already sorted on the join
attributes):
  nBlocks(R) + nBlocks(S) + [ nTuples(T) / bFactor(T) ]
Cost to sort relation R:
  nBlocks(R) * [ log2( nBlocks(R) ) ]
Cost to sort relation S:
  nBlocks(S) * [ log2( nBlocks(S) ) ]
Example Joins

What we know about the tables:
  EMPLOYEE has 10,000 records with a blocking factor of 5:
    nTuples(E) = 10000, bFactor(E) = 5, nBlocks(E) = 2000
  Assume a secondary index on the key attribute SSN:
    nLevelsSSN(E.SSN) = 4, SCSSN(E) = 1
  Assume a secondary index on the non-key attribute DNO:
    nLevelsDNO(E.DNO) = 2, nLfBlocksDNO(E.DNO) = 4
  There are 125 distinct values of DNO:
    nDistinctDNO(E) = 125
  The selection cardinality of DNO is:
    SCDNO(E) = nTuples(E) / nDistinctDNO(E) = 80
  DEPARTMENT has 125 records with a blocking factor of 10:
    nTuples(D) = 125, bFactor(D) = 10, nBlocks(D) = 13
  There is a single level primary index on DNUMBER:
    nLevelsDNUMBER(D.DNUMBER) = 1
  There is a secondary index on MGRSSN:
    nLevelsMGRSSN(D.MGRSSN) = 2, SCMGRSSN(D) = 1
  Join selection cardinality estimate (DNUMBER is a key of DEPARTMENT):
    nTuples(T) <= nTuples(E) = 10000
  Blocking factor for the resulting joined table:
    bFactor(ED) = 4

Join query:
  SELECT employee.*, department.*
  FROM   employee, department
  WHERE  employee.dno = department.dnumber

With nTuples(T) = 10,000 and bFactor(T) = bFactor(ED) = 4:

1. Nested loop with EMPLOYEE on the outside:
   nBlocks(E) + ( nBlocks(E) * nBlocks(D) ) + [ nTuples(T) / bFactor(T) ]
   = 2000 + (2000 * 13) + [ 10000 / 4 ]
   = 30,500
2. Nested loop with DEPARTMENT on the outside:
   nBlocks(D) + ( nBlocks(E) * nBlocks(D) ) + [ nTuples(T) / bFactor(T) ]
   = 13 + (2000 * 13) + [ 10000 / 4 ]
   = 28,513
3. Index structure on DNUMBER with EMPLOYEE on the outside:
   nBlocks(E) + ( nTuples(E) * (nLevelsDNUMBER(D.DNUMBER) + 1) ) +
   [ nTuples(T) / bFactor(T) ]
   = 2000 + (10000 * 2) + [ 10000 / 4 ]
   = 24,500
4. Index structure on DNO with DEPARTMENT on the outside:
   nBlocks(D) + ( nTuples(D) * (nLevelsDNO(E.DNO) + SCDNO(E)) ) +
   [ nTuples(T) / bFactor(T) ]
   = 13 + (125 * (2 + 80)) + [ 10000 / 4 ]
   = 12,763
5. Sort-merge join:
   Sort: [ nBlocks(E) * log2(nBlocks(E)) ] + [ nBlocks(D) * log2(nBlocks(D)) ]
       = 2000 * 11 + 13 * 4
       = 22,052
   Join: nBlocks(E) + nBlocks(D) + [ nTuples(T) / bFactor(T) ]
       = 2000 + 13 + [ 10000 / 4 ] = 4,513
   Total cost: 22,052 + 4,513 = 26,565
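These five estimates can be reproduced in a few lines. A sketch of the join cost formulas, plugged with the EMPLOYEE/DEPARTMENT statistics from this example (the helper names are mine, not textbook notation):

```python
import math

# The join strategies above, computed with the Connolly/Begg-style cost
# formulas. E = EMPLOYEE (2000 blocks), D = DEPARTMENT (13 blocks).

nBlocks_E, nBlocks_D = 2000, 13
nTuples_E, nTuples_D = 10_000, 125
nTuples_T, bFactor_T = 10_000, 4          # estimated join result
write_T = nTuples_T // bFactor_T          # 2500 blocks to store T

def nested_loop(outer_blocks, inner_blocks):
    return outer_blocks + outer_blocks * inner_blocks + write_T

def indexed_join(outer_blocks, outer_tuples, probe_cost):
    # probe_cost = index levels + selection cardinality on the inner side
    return outer_blocks + outer_tuples * probe_cost + write_T

nl_e_outer  = nested_loop(nBlocks_E, nBlocks_D)           # 30,500
nl_d_outer  = nested_loop(nBlocks_D, nBlocks_E)           # 28,513
idx_dnumber = indexed_join(nBlocks_E, nTuples_E, 1 + 1)   # 24,500
idx_dno     = indexed_join(nBlocks_D, nTuples_D, 2 + 80)  # 12,763

sort_cost  = (nBlocks_E * math.ceil(math.log2(nBlocks_E))
              + nBlocks_D * math.ceil(math.log2(nBlocks_D)))   # 22,052
merge_cost = nBlocks_E + nBlocks_D + write_T                   # 4,513
sort_merge = sort_cost + merge_cost                            # 26,565

print(min(nl_e_outer, nl_d_outer, idx_dnumber, idx_dno, sort_merge))
# 12763 -- the index on DNO with DEPARTMENT on the outside wins
```

The pattern is typical: putting the small table on the outside and probing an index on the big table beats every block-at-a-time alternative.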
• Another example:
  SELECT department.*, employee.*
  FROM   department, employee
  WHERE  department.mgrssn = employee.ssn
  Assume an employee may manage at most one department, so the join
  cardinality nTuples(T) = 125.
  1. Nested loop with DEPARTMENT on the outside:
     nBlocks(D) + ( nBlocks(D) * nBlocks(E) ) + [ nTuples(T) / bFactor(T) ]
     = 13 + (13 * 2000) + [ 125 / 4 ]
     = 26,045
Query Optimization Examples

• Use the schema from page 205 of the Elmasri/Navathe 3rd edition textbook.
• Consider the following query to retrieve the employee names and project
  names of all employees in Department 5 who work more than 10 hours per
  week on any project.
• Note: This schema is different from the one you have for your homework
  assignment.

SELECT EMPLOYEE.FNAME, EMPLOYEE.LNAME, PROJECT.PNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE EMPLOYEE.SSN = WORKS_ON.ESSN
AND WORKS_ON.PNO = PROJECT.PNUMBER
AND WORKS_ON.HOURS > 10
AND PROJECT.DNUM = 5
ORDER BY EMPLOYEE.LNAME ;

• This query requires the following steps:
  σ1: Select only those records where PROJECT.DNUM = 5
  σ2: Select only those records where WORKS_ON.HOURS > 10
  ⋈1: Join WORKS_ON and PROJECT where WORKS_ON.PNO = PROJECT.PNUMBER
  ⋈2: Join EMPLOYEE and WORKS_ON where EMPLOYEE.SSN = WORKS_ON.ESSN
  Sort: Sort the results from the prior step on LNAME
  Project: Project the FNAME, LNAME and PNAME columns
• Note that these steps can be done in any order, except for the last 2
  steps (the ORDER BY and the projection).
• Given that the sort (ORDER BY) and projection must come last, there are a
  total of 4! orders of operations.
• We can reduce this complexity further by noting that σ1 and σ2 are
  independent. This reduces the solution space to 3!, or 6, orders of
  operations.
• Our heuristics tell us to try σ1 and σ2 first, followed by ⋈1 and ⋈2.
• Assume the block size is 200 bytes and we are using unspanned records.
• Also assume EMPLOYEE.SSN has a 2 level index and PROJECT.DNUM has a
  single level clustering index.
• What we know about the tables in question:

EMPLOYEE: nTuples(E) = 8, record size 100 bytes, bFactor(E) = 2,
  nBlocks(E) = 4; 2-level index on SSN (nLevelsssn = 2), SCssn(E) = 1
PROJECT: nTuples(P) = 6, record size 55 bytes, bFactor(P) = 3,
  nBlocks(P) = 2; 1-level clustering index on DNUM (nLevelsdnum = 1),
  SCdnum(P) = nTuples(P) / nDistinctdnum(P) = 6/3 = 2
WORKS_ON: nTuples(W) = 16, record size 35 bytes, bFactor(W) = 5,
  nBlocks(W) = 4; no indexes,
  SChours(W) = nTuples(W) / nDistincthours(W) = 16/9 = 2
Ordering #1:
2, Sort, Projection
1.
1,
2,
1,
SChours(W) =
nTuples(W) /
nDistincthours(P)
= 16/9 = 2
1: Select only those records where PROJECT.DNUM = 5
o Method a: Equality condition on a non-key: Full table scan
Cost: nBlocks(P) = 2 blocks
o Method b: Using clustering index
Algorithm: Multiple records using a clustering index: nLevels_dnum + ( SC_dnum / bFactor(P) )
We need to know the selection cardinality of dnum. We can estimate:
SC_dnum(P) = nTuples(P) / nDistinct_dnum(P)
From the data we can see that there are 3 distinct values of dnum, so
SC_dnum(P) = ( 6 / 3 ) = 2
Remember, this is only an estimate of SC_dnum.
Cost: nLevels_dnum + ( SC_dnum / bFactor(P) ) = 1 + (2 / 3) = 1.67, which rounds up to 2 blocks
So in this step it does not matter whether we use the index or not; the cost will be the
same: Cost is 2 blocks (Step 1)
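The two access methods for σ1 can be compared with a small helper (an illustrative sketch; the function names are mine, not from the notes):

```python
import math

def full_scan_cost(n_blocks):
    # Equality condition on a non-key: read every block of the table
    return n_blocks

def clustering_index_cost(n_levels, sel_card, b_factor):
    # Walk the index (nLevels block reads), then read the contiguous
    # data blocks holding the matching records
    return n_levels + math.ceil(sel_card / b_factor)

sc_dnum = 6 // 3  # nTuples(P) / nDistinct_dnum(P) = 2
print(full_scan_cost(2))                     # method a: 2 blocks
print(clustering_index_cost(1, sc_dnum, 3))  # method b: 1 + ceil(2/3) = 2 blocks
```

Both methods come out to 2 blocks, which is why the choice does not matter in this step.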
The result of this step is a temporary table we will call P1 with the following
characteristics (note the record size remains the same, only the number of records
has been reduced):
Table Name   # of Records       Record Size   Blocking Factor   # of Blocks
P1           nRecords(P1) = 3   55 bytes      bFactor(P1) = 3   nBlocks(P1) = 1
Temporary table P1 has no indexes associated with it.
2. σ2: Select only those records where WORKS_ON.HOURS > 10
Since there are no indexes on the HOURS column, and since the condition is a
non-equality (range) condition, we are left with only a full table scan.
o Method a: Full table scan
Algorithm: full table scan. Cost: nBlocks(W) = 4 blocks
Thus the cost for this step is: Cost = 4 blocks (Step 2)
The result of this step is a temporary table we will call W1 with the following
characteristics (note the record size remains the same, only the number of records
has been reduced):
Table Name   # of Records       Record Size   Blocking Factor   # of Blocks
W1           nRecords(W1) = 8   35 bytes      bFactor(W1) = 5   nBlocks(W1) = 2
Temporary table W1 has no indexes associated with it.
At this point, the two temporary tables W1 and P1 would look like the following
(note the DBMS does not actually know this at the planning stage; it is shown here
for illustration only):
P1 TEMP TABLE:
PNAME     PNUMBER  PLOCATION  DNUM
--------  -------  ---------  ----
ProductX  1        Bellaire   5
ProductY  2        Sugarland  5
ProductZ  3        Houston    5

W1 TEMP TABLE:
ESSN       PNO  HOURS
---------  ---  -----
123456789  1    32.5
666884444  3    40.0
453453453  1    20.0
453453453  2    20.0
999887777  30   30.0
987987987  10   35.0
987654321  30   20.0
987654321  20   15.0

3. ⋈1: Join the WORKS_ON and PROJECT where WORKS_ON.PNO =
PROJECT.PNUMBER
Since we have already filtered out those PROJECT records with DNUM=5, we
can use the P1 table in place of the full PROJECT table.
Since we have already filtered out those WORKS_ON records with HOURS > 10,
we can use the W1 table in place of the full WORKS_ON table.
We need to know what the blocking factor of the resulting table will be.
Since we are joining these two tables, the resulting temporary table (we will call
W1P1) will have all of the columns from both tables, so we need to add up the
sizes of their respective records:
P1 record size is 55 bytes + W1 record size is 35 bytes, so the W1P1 record size will
be 90 bytes.
Thus given a block size of 200 bytes, bFactor(W1P1) = 2
We also need to know join selection cardinality for the new joined table.
Since PROJECT.PNUMBER is the key of the PROJECT table, we can make use
of the following estimation:
If PNUMBER is a key of P1 then nTuples(W1P1) <= nTuples(W1)
We therefore estimate nTuples(W1P1) = nTuples(W1) = 8.
This is an upper bound, not an exact figure.
o Method a: Nested loop (W1 on the outside)
Algorithm: nBlocks(W1) + ( nBlocks(W1) * nBlocks(P1) ) + [
nTuples(W1P1) / bFactor(W1P1) ]
Cost: 2 + (2 * 1) + [ 8 / 2 ] = 4 + 4 = 8 blocks (method a)
o Method b: Nested loop (P1 on the outside)
Algorithm: nBlocks(P1) + ( nBlocks(W1) * nBlocks(P1) ) + [
nTuples(W1P1) / bFactor(W1P1) ]
Cost: 1 + (2 * 1) + [ 8 / 2 ] = 3 + 4 = 7 blocks (method b)
In this step, it matters which table we put on the outside loop.
Putting P1 on the outside loop (method b) gives the best performance: Cost is 7
blocks (Step 3)
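The nested-loop cost formula above can be captured in a short Python function (a sketch; the function name is illustrative) to confirm that the choice of outer table matters:

```python
import math

def nested_loop_cost(outer_blocks, inner_blocks, result_tuples, result_bfr):
    # Read the outer table once, read the inner table once per outer block,
    # then write out the joined result
    return outer_blocks + outer_blocks * inner_blocks + math.ceil(result_tuples / result_bfr)

print(nested_loop_cost(2, 1, 8, 2))  # method a, W1 outside: 8 blocks
print(nested_loop_cost(1, 2, 8, 2))  # method b, P1 outside: 7 blocks
```

The inner-scan term (nBlocks(W1) * nBlocks(P1)) is the same either way; the saving comes from reading the smaller table only once as the outer relation.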
The result of this step is a temporary table we will call W1P1 with the following
characteristics:
Table Name   # of Records (upper bound)     Record Size   Blocking Factor     # of Blocks
W1P1         nTuples(W1P1) = 8 (estimate)   90 bytes      bFactor(W1P1) = 2   nBlocks(W1P1) = 4
This W1P1 table would actually look like:
ESSN       PNO  HOURS  PNAME     PNUMBER  PLOCATION  DNUM
---------  ---  -----  --------  -------  ---------  ----
123456789  1    32.5   ProductX  1        Bellaire   5
666884444  3    40.0   ProductZ  3        Houston    5
453453453  1    20.0   ProductX  1        Bellaire   5
453453453  2    20.0   ProductY  2        Sugarland  5

4. ⋈2: Join EMPLOYEE and WORKS_ON where EMPLOYEE.SSN =
WORKS_ON.ESSN
Once again, since we have already joined WORKS_ON with PROJECTS
(actually a reduced version of WORKS_ON and PROJECTS tables), we can
make use of the W1P1 temporary table.
We need to know what the blocking factor of the resulting table will be.
Since we are joining these two tables, the resulting temporary table (E1W1P1)
will have all of the columns from both tables, so we need to add up the sizes of
their respective records:
W1P1 record size is 90 bytes + EMPLOYEE record size is 100 bytes, so the E1W1P1
record size will be 190 bytes.
Thus given a block size of 200 bytes, bFactor(E1W1P1) = 1
We also need to know join selection cardinality for the resulting table.
Since EMPLOYEE.SSN is the key of the EMPLOYEE table, we can make use of
the following estimation:
If SSN is a key of EMPLOYEE then nTuples(E1W1P1) <= nTuples(W1P1)
We therefore estimate nTuples(E1W1P1) = nTuples(W1P1) = 8.
This is an upper bound, not an exact figure.
o Method a: Nested loop (EMPLOYEE on the outside)
Algorithm: nBlocks(E) + ( nBlocks(E) * nBlocks(W1P1) ) + [
nTuples(E1W1P1) / bFactor(E1W1P1) ]
Cost: 4 + (4 * 4) + [ 8 / 1 ] = 4 + 16 + 8 = 28 blocks (method a)
o Method b: Nested loop (W1P1 on the outside)
Algorithm: nBlocks(W1P1) + ( nBlocks(E) * nBlocks(W1P1) ) + [
nTuples(E1W1P1) / bFactor(E1W1P1) ]
Cost: 4 + (4 * 4) + [ 8 / 1 ] = 4 + 16 + 8 = 28 blocks (method b)
o Method c: Single loop making use of the index on SSN (W1P1 on the outside)
Algorithm: nBlocks(W1P1) + [ nTuples(W1P1) * ( nLevels_ssn(E) +
SC_ssn(E) ) ] + [ nTuples(E1W1P1) / bFactor(E1W1P1) ]
Cost: 4 + [ 8 * (2 + 1) ] + [ 8 / 1 ] = 4 + 24 + 8 = 36 blocks (method c)
Therefore, the lowest cost method is to use nested loop (either way) (method a or
b) which gives: Cost = 28 blocks (Step 4)
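The same sketch extends to the indexed single-loop variant (illustrative helper functions, not from the notes); it confirms that probing the 2-level SSN index once per W1P1 tuple costs more here than the plain nested loop:

```python
import math

def nested_loop_cost(outer_blocks, inner_blocks, result_tuples, result_bfr):
    # Read outer once, inner once per outer block, then write the result
    return outer_blocks + outer_blocks * inner_blocks + math.ceil(result_tuples / result_bfr)

def indexed_loop_cost(outer_blocks, outer_tuples, n_levels, sel_card, result_tuples, result_bfr):
    # For each outer tuple: traverse the index (nLevels reads) and fetch
    # the matching record(s) (SC reads); then write out the result
    return outer_blocks + outer_tuples * (n_levels + sel_card) + math.ceil(result_tuples / result_bfr)

print(nested_loop_cost(4, 4, 8, 1))         # methods a/b: 28 blocks
print(indexed_loop_cost(4, 8, 2, 1, 8, 1))  # method c: 4 + 24 + 8 = 36 blocks
```

With so few blocks per table, the repeated per-tuple index probes outweigh the full inner scans, which is why the optimizer prefers the nested loop in this case.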
As before, we have a new temporary table called E1W1P1 with the following
characteristics:
Table Name   # of Records               Record Size   Blocking Factor       # of Blocks
E1W1P1       nTuples(E1W1P1) = 8        190 bytes     bFactor(E1W1P1) = 1   nBlocks(E1W1P1) = 8
             (estimate, upper bound)
The E1W1P1 table would look like this (all of the columns in EMPLOYEE would
also be in there):
ESSN       FNAME   MINIT  LNAME    SSN        ...  PNO  HOURS  PNAME     PNUMBER  PLOCATION  DNUM
---------  ------  -----  -------  ---------  ---  ---  -----  --------  -------  ---------  ----
123456789  John    B      Smith    123456789  ...  1    32.5   ProductX  1        Bellaire   5
666884444  Ramesh  K      Narayan  666884444  ...  3    40.0   ProductZ  3        Houston    5
453453453  Joyce   A      English  453453453  ...  1    20.0   ProductX  1        Bellaire   5
453453453  Joyce   A      English  453453453  ...  2    20.0   ProductY  2        Sugarland  5
5. Sort the final temp table E1W1P1 on the LNAME column.
The cost to sort a relation is: nBlocks(R) * ceil( log2( nBlocks(R) ) )
Cost: nBlocks(E1W1P1) * ceil( log2( nBlocks(E1W1P1) ) ) = 8 * 3 = 24 blocks
Cost is 24 blocks (Step 5)
6. Project the FNAME, LNAME, PNAME columns from the final temp table
We may consider doing this as part of the previous join operation in which case
the cost of this step is 0.
In summary, we have completed the 6 main steps for this query and each step produced
the following costs:

Step  Operation                                                                 Cost
1     σ1: Select only those records where PROJECT.DNUM = 5                      2 blocks
2     σ2: Select only those records where WORKS_ON.HOURS > 10                   4 blocks
3     ⋈1: Join the WORKS_ON and PROJECT where WORKS_ON.PNO = PROJECT.PNUMBER   7 blocks
4     ⋈2: Join EMPLOYEE and WORKS_ON where EMPLOYEE.SSN = WORKS_ON.ESSN       28 blocks
5     Sort the final temp table on the LNAME column                            24 blocks
6     Project the FNAME, LNAME and PNAME columns                               0 blocks
      Total estimated cost for the query (using steps in this order)           65 blocks
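Summing the per-step costs is trivial, but writing it out (a sketch, included only as a check of the figures above) confirms the total:

```python
# Estimated per-step I/O costs in blocks, from the worked example
step_costs = {
    "sigma1: PROJECT.DNUM = 5": 2,
    "sigma2: WORKS_ON.HOURS > 10": 4,
    "join1: PNO = PNUMBER": 7,
    "join2: SSN = ESSN": 28,
    "sort on LNAME": 24,
    "project FNAME, LNAME, PNAME": 0,
}
print(sum(step_costs.values()))  # 65 blocks
```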
This represents the total cost, in blocks, given the specific order of operations:
σ1, σ2, ⋈1, ⋈2, Sort, Projection
Thus the above steps should then be repeated given other orders of operations. In fact, all
of the arrangements (ordering of the steps) should be tried until the ordering of steps that
produces the least total cost is found.
In practice, it is generally most efficient if all of the simple selections are done first,
followed by the joins in later steps.
Now look at the optimization done using the heuristic (rule-based) approach. The
accompanying figures (not reproduced here) show:
Original Query Tree
Cascading the Selections
Commuting Selection and Cartesian Product
Combining Selection and Cartesian Product to form Joins