cse4701chap19 - University of Connecticut

Chapter 19 6e - 17 & 18 5: System Catalog
and Query Optimization
CSE
4701
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818



A portion of these slides is used with the permission of Dr. Ling Liu,
Associate Professor, College of Computing, Georgia Tech.
Other slides have been adapted from the AWL web site for the textbook.
Remaining slides represent new material.
Chapter 19-1
Overview of Material

CSE
4701


Key Background Topics:
 What are Typical Database Processing Actions?
 Disk Drives and Disk Storage
 Database Processing/Architectures
 Motivating Query Optimization
 Query Processing
Chapter 17 - System Catalog
 What is it?
 How is it Used?
Chapter 18 - Query Optimization in RDBMS
 High-level Query Optimization (Algebraic)
 Low-level Query Optimization (Cost-based)
Chapter 19-2
Typical Database Processing
CSE
4701
Parsed and
Optimized
User Trans.
Pre-Processing
- Parser/Lexical
- Optimizer/Views
Concurrency Control
Lock Request
Response
User Transaction
Errors
Post-Processing
- Collection of Results
- Aggregation Operations
- Security Checks
Low-Level Processing
- Enqueue Trans.
- Request Locks
- Issue I/Os
- Process Returned Data
- Integrity Checks
- Security Checks
- Logging for Recovery
- Release Locks
- Dequeue Trans.
High-Level Processing
- Enqueue Trans.
- Request Locks
- Release Locks
-Dequeue Trans.
Response to User
I/O
Request
Errors
Results
Lock Request
Results
Disk I/O
Recovery
Chapter 19-3
What are the Processing Issues for DBs?

CSE
4701



Database Applications of Today and Tomorrow
Require High Volumes of Information!
Increase of Information Still Requires High
Performance!
Throughput and Response Time
Where's the Bottleneck in DBS?
 CPU ??
 Main Memory Size/Speed ??
 Virtual Memory Limitations ??
 Communications Bus ??
 I/O Channel ??
Chapter 19-4
90-10 Rule for Database Processing

CSE
4701



Load (Transaction per second) vs.
Performance (Response Time of Transactions)
Processing of Large Amounts of Raw Data
 Addressed in Secondary Storage
 Staged to Main Memory
Identifying Relevant Data
 Large Amounts of Raw Data Discarded
 Focus on Data Most Likely to Contain Answers
 Possible Loss of CPU and Main Memory Cycles
This is Double Jeopardy!
 Load of DBS Must be Reduced
 Performance of DBS Degrades
Chapter 19-5
90-10 Rule for Conventional DBS

CSE
4701
Only 10% of Relevant
Data has Answers
Application
Programs

Operating
System

Database
Functions
Only 10% of Raw Data is
Relevant
On-Line
I/O
Disk I/O
Note: Naive Approach to Database Searching Often Occurs
(Little or No Indexing in Practice!)
Chapter 19-6
Randomly Accessed Storage Devices

CSE
4701





Popular Media (Hard Drives, CDs, DVDs, etc.)
Access to Information in Any Order
Sequential Access Not Typically Supported or
Needed, Since “Files” Not Stored Sequentially
Recall, Disk Defragmentation on PC Platform
Block-Oriented Utilization of Device
 Block Access to Optimize Transfer
 Block Size is Device/Controller Dependent
 Linear/Non-Linear Byte Orders with Blocks
Key Concepts …
 Platter
 Track
 Sector
 Cylinder
 Read/Write Heads
Chapter 19-7
Rotating Storage
CSE
4701
Track
R/W
Heads
Platters
Cylinder
Top View of a Surface
Note: Parallel Read/Write Drives
Activate All Heads Simultaneously
Chapter 19-8
Disk Drive Components
CSE
4701
Chapter 19-9
Disk Characteristics and Access

CSE
4701


Transfer Time: Time to Copy Bits From Disk Surface
to Primary Memory
Disk Latency Time:
 Rotational Delay Waiting for Proper Sector to
Rotate Under R/W Head
 Rotate to Next Sector to Process Next Request
Disk Seek Time:
 Delay While R/W Head Moves to the Destination
Track/Cylinder
 Move Head In/Out to Seek Next Track/Cylinder

Access = Seek (In/Out) + Latency (Around) + Transfer (Bytes)

For DBMS - Key is Moving Data To/From Disk
ASAP w.r.t. Performance and Response Time
Improve on 90-10 via Processing/Optimization
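The access formula above can be turned into a small estimate. The sketch below is illustrative only: the drive parameters (9 ms seek, 7200 RPM, 4 KB block, 100 MB/s transfer) are assumed values, not figures from these slides, and average latency is taken as half a revolution.

```python
def access_time_ms(seek_ms, rpm, block_bytes, transfer_mb_per_s):
    """Estimate one random block access as seek + rotational latency + transfer."""
    # Average rotational latency: half of one full revolution, in milliseconds.
    latency_ms = 0.5 * (60_000.0 / rpm)
    # Time to copy the block at the sustained transfer rate, in milliseconds.
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + latency_ms + transfer_ms

# Assumed (hypothetical) drive parameters.
t = access_time_ms(seek_ms=9.0, rpm=7200, block_bytes=4096, transfer_mb_per_s=100.0)
print(f"{t:.2f} ms per random block access")
```

With numbers like these, seek and latency dominate the transfer term, which is why reducing the number of block accesses matters more than the size of any one transfer.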

Chapter 19-10
Historical DB Architecture - Mainframe
CSE
4701
Chapter 19-11
Client/Server DBS Architecture
CSE
4701
Chapter 19-12
Mixed Architecture
CSE
4701
Chapter 19-13
Three and Four Tier Architectures
CSE
4701
From: http://java.sun.com/javaone/javaone98/sessions/T400/index.html
Chapter 19-14
What is MBDS?

CSE
4701


MBDS is Multi-Process, Multi-Computer, Parallel
Database System
MBDS Composed of …
 Host for Issuing User Requests
 Controller to Interact with Host (and User)
 One or More Backend Database Processors
Goals of MBDS
 Suppose Request Takes 4 Minutes with One
Backend
 Improve Response Time by Increasing Backends
 Two Backends - Request 2+ Minutes
 Four Backends - Request 1+ Minutes
Chapter 19-15
What is MBDS Architecture?
CSE
4701
Database Blocks are Distributed Across All Backends
Backend (BE) DB
Processors are Replicated
Database Controller
Sends Same Query
in Parallel to all BEs
Host
User
Database
Controller
Backend
Database
Processor
Backend
Database
Processor
BEs work in Parallel on
Each Query and Communicate for Join
Results are Sent to and Collected by
the DB Controller - then to the User
Backend
Database
Processor
Chapter 19-16
Approach Distributes Data Across Backends

CSE
4701



Suppose System has 10
Backends
Consider a Number of Tables
 Inventory
 Customers
 Employees
 …
What Happens if Place
 One Table/Backend?
What Happens if you Distribute
…
 Table Across 10 Backends?
Backend
Database
Processor 2
Backend
Database
Processor 1
Backend
Database
Processor 10
Chapter 19-17
What are MBDS Processes?
CSE
4701
Database
Controller
Request
Preparation
Post
Processing
Put Msg.
Get Msg.
Get Msg.
Put Msg.
Directory
Management
Record
Processing
Concurrency
Control
Disk I/O
Backend
Database
Processor
Chapter 19-18
What are MBDS Messages?
CSE
4701
No.  Type                                  SRC    DST
1    New Request                           Host   ReqP
2    Results of Request                    PoPr   Host
3    Number of Reqs in Transaction         ReqP   PoPr
4    Aggregate Operators (Sum, etc.)       ReqP   PoPr
6    Parsed Request to Backends            ReqP   DM
12   Backend Aggregate Operator Results    RecP   PoPr
15   Ids for Accessing Database Indexes    DM     DMs
16   Request and Disk Addresses            DM     RecP
21   Ids for Accessing Database Records    DM     CC
22   Locks Obtained: Okay to Execute       CC     RecP
23   Request ID of Finished Request        RecP   CC
Chapter 19-19
Sample Processing of Retrieve Request
CSE
4701
A1
F15 From
Other
Backend
Request
Preparation
D6
Put Msg.
B3
C4
K12
Post
Processing
K12
Get Msg.
E15 To Backend(s)
Get Msg.
Put Msg.
D6,F15
E15
Directory
Management
G21
K12
H22
Record
Processing
I16
Concurrency
Control
J23
Disk I/O
Chapter 19-20
What are Synchronization Issues in MBDS?

CSE
4701
Coordination of Synchronous Behavior …
 Within Controller and Backend to Allow
 Multiple Active Requests within Each Process
 Requests at Different Stages in Different Processes
 Between Controller and Backends to Allow
 A Request to be Processed by All Backends
 A Request to be Processed by One Backend
 Among Multiple Backends to Allow a Backend
 to Synchronize its Work on one Request with Other Backends
 to Forward Results to Another Backend
Chapter 19-21
Introduction to Query Processing

CSE
4701

Query optimization:
 The process of choosing a suitable execution
strategy for processing a query.
Two internal representations of a query:
 Query Tree
 Query Graph
Chapter 19-22
Introduction to Query Processing
CSE
4701
Chapter 19-23
How Does this Relate to Compilers?
Source Program
CSE
4701
1
2
3
Symbol-table
Manager
Lexical
Analyzer
Syntax Analyzer
Semantic Analyzer
Error Handler
4
5
6
Intermediate
Code Generator
Code Optimizer
Code Generator
Target Program
1, 2, 3 : Analysis - Our Focus
4, 5, 6 : Synthesis
Chapter 19-24
Translating SQL Queries into Relational Algebra

CSE
4701



Query block:
 The basic unit that can be translated into the
algebraic operators and optimized.
A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and
HAVING clauses if these are part of the block.
Nested queries within a query are identified as
separate query blocks.
Aggregate operators in SQL must be included in the
extended algebra.
Chapter 19-25
Translating SQL Queries into Relational Algebra
CSE
4701
SELECT LNAME, FNAME
FROM   EMPLOYEE
WHERE  SALARY > ( SELECT MAX (SALARY)
                  FROM   EMPLOYEE
                  WHERE  DNO = 5 );

Outer Query Block:
SELECT LNAME, FNAME
FROM   EMPLOYEE
WHERE  SALARY > C
πLNAME, FNAME (σSALARY>C(EMPLOYEE))

Inner Query Block:
SELECT MAX (SALARY)
FROM   EMPLOYEE
WHERE  DNO = 5
ℱMAX SALARY (σDNO=5 (EMPLOYEE))
Chapter 19-26
Why is Query Optimization Needed?

CSE
4701
Data Volume in any Type of Join or Cartesian Product
has the Potential to be Very Large!

Consider R(A, B) = {r1, r2 , ..., rn}
Consider S(C, D) = {s1, s2 , ..., sm}
R x S = {r1 s1, r1 s2, r1 s3, r1 s4, …
r2 s1, r2 s2, r2 s3, r2 s4, … }
which contains n x m tuples!
What is the Issue?




If n is 10,000 and m is 20,000 then
 Cartesian Product has 200,000,000 Tuples
 Join must Perform 200,000,000 Comparisons
Chapter 19-27
Aside – What is an External Sort?

CSE
4701


Traditional – All Algorithm/Programming Classes
Focus on Internal Searches/Sorts
 Internal – All Data Loaded into Main Memory
 Data Searched/Sorted in Main Memory
 Results in Main Memory
What Happens When Data Source to be Searched
Exceeds Main Memory (or Virtual Memory)?
External Search
 Stages Blocks of Data from Disk into Memory
 Sorts/Searches with Blocks in Memory
 Writes Intermediate results to Disk
 Need to Reread Results from Disk for Final Search
Results, Merging for Sort, etc.
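The external approach above can be sketched as a two-phase sort: write sorted runs that each fit in memory, then reread and merge them. This is a minimal illustration, not the algorithm of any particular DBMS. The function name `external_sort`, the `run_size` parameter, and the one-integer-per-line run files are assumptions made for the example.

```python
import heapq
import os
import tempfile

def external_sort(values, run_size):
    """Sort integers using sorted runs on disk plus a k-way merge."""
    run_paths = []

    def flush(run):
        # Sort one memory-sized run and write it to its own temporary file.
        run.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        run_paths.append(path)

    run = []
    for v in values:
        run.append(v)
        if len(run) == run_size:   # "memory" is full: stage the run to disk
            flush(run)
            run = []
    if run:
        flush(run)

    # Merge phase: reread every run from disk and merge them lazily.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

print(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))
```

Each value is written once and read once after the initial scan, which is the "intermediate results to disk, then reread" pattern described above.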
Chapter 19-28
Why is Query Optimization Needed?

CSE
4701


n/m - Number of Tuples of R/S Respectively
bR / bS - Number of Tuples/Block of Memory
Assume that K Blocks Fit into Primary Memory
Memory Layout: 1 Block of R + (K-1) Blocks of S
1. n / bR and m / bS: Number of Blocks for R and S
2. (m / bS) / (K-1): Number of Times that the K-1 Memory Chunk is Filled by S
3. (n / bR) [(m / bS) / (K-1)]: Which is Filled for Each Block of R
Total Block Reads (Must also Read Blocks of R):
(n / bR) + (n / bR) [(m / bS) / (K-1)]
Chapter 19-29
Why is Query Optimization Needed?
If n = m = 10,000 and bR = bS = 5, and K= 100
(10,000/5)+(10,000/5)[(10,000/5)/99] = 42,400 Blocks to Read
At 20 Blocks/Second - 35 Minutes!
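The arithmetic on this slide can be checked with a few lines of code. The sketch simply evaluates the slide's estimate (n / bR) + (n / bR)[(m / bS)/(K-1)] as written; the function name is hypothetical, and no rounding up to whole blocks is applied because the slide does not round either.

```python
def estimated_block_reads(n, m, b_r, b_s, k):
    """Evaluate the slide's estimate for joining R (n tuples) with S (m tuples)."""
    blocks_r = n / b_r                       # blocks occupied by R
    blocks_s = m / b_s                       # blocks occupied by S
    fills_per_r_block = blocks_s / (k - 1)   # times the K-1 chunk is filled by S
    return blocks_r + blocks_r * fills_per_r_block

reads = estimated_block_reads(n=10_000, m=10_000, b_r=5, b_s=5, k=100)
minutes = reads / 20 / 60                    # at 20 blocks per second
print(round(reads), "block reads, about", round(minutes, 1), "minutes")
```

The exact value is slightly above the 42,400 quoted on the slide, which is a rounded figure.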
Chapter 19-30
Observation

CSE
4701


Cartesian Product Yields Unwanted Data
SELECT R.A
FROM R, S
WHERE R.B = S.C and S.D = 99
In Relational Algebra:
π A (σ B=C and D=99 (R x S))
= π A (σ B=C (R x σ D=99 (S)))
= π A (R ⋈ B=C (σ D=99 (S)))

Has Performance Improved? How?
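One way to answer the question is to count how many tuple pairs each plan examines. The sketch below uses small made-up relations R(A, B) and S(C, D); the data and function names are assumptions for illustration. Both plans return the same answer, but applying the selection D = 99 to S first halves the pairs compared in this example.

```python
from itertools import product

# Hypothetical sample data: R holds (A, B) tuples, S holds (C, D) tuples.
R = [(a, b) for a in range(50) for b in (1, 2, 3)]
S = [(c, d) for c in (1, 2, 3, 4) for d in (98, 99)]

def product_then_select(R, S):
    """Form every pair of R x S, then test B = C and D = 99."""
    pairs, out = 0, []
    for (a, b), (c, d) in product(R, S):
        pairs += 1
        if b == c and d == 99:
            out.append(a)
    return sorted(out), pairs

def select_then_join(R, S):
    """Apply D = 99 to S first, then pair R only with the survivors."""
    s_filtered = [(c, d) for (c, d) in S if d == 99]
    pairs, out = 0, []
    for (a, b) in R:
        for (c, d) in s_filtered:
            pairs += 1
            if b == c:
                out.append(a)
    return sorted(out), pairs

result_1, pairs_1 = product_then_select(R, S)
result_2, pairs_2 = select_then_join(R, S)
print(result_1 == result_2, pairs_1, pairs_2)
```

The projection here keeps duplicates (bag semantics), which does not affect the comparison of the two plans.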
Chapter 19-31
Evaluation

CSE
4701


Cartesian Product for SELECT - 40,000 Blocks
SELECT R.A
FROM R, S
WHERE R.B = S.C and S.D = 99
Relational Algebra with Equijoin:
π A (R ⋈ B=C (σ D=99 (S)))
The σ D=99 (S) Limits the Size of S Dramatically
As a Result, the Equijoin of R and σ D=99 (S) Would
Likely Reduce the Total Blocks Required to 4,000

Thus, a “Smart” Query Execution Strategy Can
Dramatically Reduce the Amount of I/Os
Chapter 19-32
Query Optimization Goal

CSE
4701


Limit Costly Join Operation by Reducing Data to be
Scanned or that Participates in the Join
Query Optimization is Strategy to Achieve Goal
While Improving Selection and Projection can Help,
the Main Objective is Join
 In Worst Case - Cartesian Product
 Can Improve by Introducing Indices on the Join
Attributes (R.B and S.C) to Limit “Product”
 Can Further Improve by Sorting on the Join
Attributes (R.B and S.C)
 This Reduces Block Accesses by Limiting the Number
of Blocks that Must be Examined in a Join
 If B’s Values Range from 0 to 100 and C from 50 to
150, only need to Compare from 50 to 100
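The effect of sorting on the join attributes can be sketched with a sort-merge join. This is a minimal in-memory illustration under assumed data, not a block-level implementation: once both inputs are ordered, the scan skips keys that cannot match, so only the overlapping range (50 to 100 in the example) produces comparisons that lead to output.

```python
def sort_merge_join(r_rows, s_rows, r_key, s_key):
    """Equijoin two lists of tuples by sorting both on the join key and merging."""
    r_sorted = sorted(r_rows, key=r_key)
    s_sorted = sorted(s_rows, key=s_key)
    i = j = 0
    out = []
    while i < len(r_sorted) and j < len(s_sorted):
        kr, ks = r_key(r_sorted[i]), s_key(s_sorted[j])
        if kr < ks:
            i += 1            # R key too small to match anything left in S
        elif kr > ks:
            j += 1            # S key too small to match anything left in R
        else:
            # Collect the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(r_sorted) and r_key(r_sorted[i_end]) == kr:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_key(s_sorted[j_end]) == kr:
                j_end += 1
            for r in r_sorted[i:i_end]:
                for s in s_sorted[j:j_end]:
                    out.append(r + s)
            i, j = i_end, j_end
    return out

# Hypothetical data: R.B ranges over 0..100 and S.C over 50..150.
R = [(f"r{b}", b) for b in range(0, 101)]
S = [(c, f"s{c}") for c in range(50, 151)]
joined = sort_merge_join(R, S, r_key=lambda r: r[1], s_key=lambda s: s[0])
print(len(joined), joined[0], joined[-1])
```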
Chapter 19-33
Query Processing

CSE
4701
Internal Data Structure
 Memory Hierarchy
 Main Memory + Secondary Memory
 Information Must be Staged from Secondary to Primary
Memory for Database Operation

Sequential Search
 Brute force Approach

Direct Access (Indexed Search)
 Hash, Inverted Index file, Binary Search Tree, B-tree,
B+-tree
 Improves Selection by Focusing on Subset of Tuples
that are Involved in the Answer and Equijoin by Not
Having to Compare All Blocks in Two Relations
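The difference between sequential search and direct access can be illustrated by counting the tuples each approach examines. The sketch uses an in-memory hash index built with a dictionary; the table contents and function names are assumptions for the example, and a real DBMS index would live on disk in blocks.

```python
from collections import defaultdict

def sequential_search(rows, attr, value):
    """Brute force: examine every row."""
    examined, hits = 0, []
    for row in rows:
        examined += 1
        if row[attr] == value:
            hits.append(row)
    return hits, examined

def build_hash_index(rows, attr):
    """Map each attribute value to the positions of the rows that hold it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[attr]].append(pos)
    return index

def indexed_search(rows, index, value):
    """Direct access: touch only the rows the index points to."""
    positions = index.get(value, [])
    return [rows[p] for p in positions], len(positions)

# Hypothetical employee table with 1,000 rows spread over 10 departments.
rows = [{"ENO": i, "DNO": i % 10} for i in range(1000)]
dno_index = build_hash_index(rows, "DNO")
seq_hits, seq_examined = sequential_search(rows, "DNO", 5)
idx_hits, idx_examined = indexed_search(rows, dno_index, 5)
print(seq_examined, idx_examined, seq_hits == idx_hits)
```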
Chapter 19-34
Algorithms for Database Query Operators

CSE
4701

Largely Fall into Three Classes
 Sorting-Based Methods
 Hash-Based Methods
 Index-Based Methods
Such Algorithms are Divided into Three Degrees of
Difficulty and Cost (Limiting Factor is Size of Data)
 One Pass Algorithms
 Where Data is Only Read Once From Disk

Two-pass Algorithms
 Data is Read from Disk, Processed in Some Way,
Written Back to Disk, Read Again for Processing, etc.

Multi-pass Algorithms
 Where 3 or More Passes Are Required, i.e., Recursive
Generalization of the Two-pass Algorithms
Chapter 19-35
Database Join and Sort are External

CSE
4701


Suppose that your DBS has 1,000 1K Blocks of
Memory Available for Performing Operations (e.g.,
Select, Project, Join, Union, Aggregation, etc.)
Suppose Sort R by R.B
 R Contains 5000 Blocks
 In order to Perform a Sort/Merge - You Must Use
External Algorithm since all 5000 Blocks Cannot Fit
Into Memory at the Same Time
Suppose Join R (500 Blocks) and S (800 Blocks)
 Again - their Total Exceeds Memory - Hence you
Must Take an Approach that Compares One Block
of R with All Blocks of S, etc.
Chapter 19-36
Database Join and Sort are External

CSE
4701





What’s True about Today’s DBMS Like Oracle?
Oracle Recommends 2 Gigabytes of Primary Memory
That 2 Gigabytes Must be Shared by:
 Operating System
 Other Applications Running on “Same” Server
(Web Server, etc.)
 Database Management Software
Even if there was 1.5 Gigabytes Available, Modern
DBs can Exceed that size Very Easily
Moreover,
 Cartesian Product Could Exceed Available Mem.
 Join Could Require External Approach Since All
Tables Involved in Join Can’t fit in 1.5 Gigabytes
External Sorting/Block Oriented Processing is Norm
Chapter 19-37
Algorithms for DB Query Operators

CSE
4701
Relational Algebra Operators can be Classified into
Three Groups
 Tuple-at-a-time Unary Operators
 Selection and Projection
 No Need to Bring Entire Relation into Memory at One
Time

Full-Relation Unary Operators
 Duplicate Elimination and Grouping
 Requires Seeing All or Most of the Tuples in Memory
at Once

Full-Relation Binary Operators
 Set and Bag Versions of Union, Intersection, and
Difference, Joins, and Cartesian Products
 Requires Seeing the Tuples of Both Relations in
Memory
Chapter 19-38
Query Access
CSE
4701
[Figure: DBMS query access path. Labels: Application Programs, Application Interfaces, Query, DML Preprocessor, Object Code of Aps, Query Processor, DDL Preprocessor, Database Schema, Database Manager, File Manager, Data Files, System Catalog, Disk Storage]
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = "CAD/CAM")
Chapter 19-39
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO= WORKS.ENO)
AND
(WORKS.PNO = PROJ.PNO)
AND
(PROJ.PNAME = “CAD/CAM”)
Database Access
CSE
4701
User Program A
DBMS
System Buffer
7
9
Language
User Work Area
(UWA)
10
2
1
8
(DBMS)
3
6
5
Database
Operating
System
external schema
used by user
program A
Schema
4
Physical/Internal
Data Schema
Chapter 19-40
Database Access
1. User program A sends the DBMS an invoke command to retrieve a record (or set of records).
2. DBMS analyzes the external schema of user program A and finds the database description of the record.
3. DBMS checks the schema to get the data types and location information of the record.
4. DBMS checks the physical schema to find out which device the record is on and what access methods can be used.
5. According to step 4, DBMS sends the OS a read command to execute the search.
6. OS issues the page invoke command to the corresponding device, and then puts the fetched page into the system buffer.
7. DBMS uses the schema and the external schema to infer the logical structure of the retrieved record.
8. DBMS places the relevant data in the UWA.
9. DBMS provides the status information at the program invocation exit.
Chapter 19-41
The System Catalog

CSE
4701
Store the Meta Information that Describes Each
Database, Including a Description of
 Conceptual Database Schema (Logical Data
Model)
 Relations, Attributes, Keys, Indexes, Views
Internal Schema
 External Schema
Store Information Needed by Specific DBMS
Modules
 Query Optimization Module
 Security and Authorization


Chapter 19-42
Metadata - What is it?

CSE
4701

System metadata:
 Where data came from
 How data were changed
 How data are stored
 How data are mapped
 Who owns data
 Who can access data
 Data usage history
 Data usage statistics
System metadata are critical
in a DBMS


Application metadata:
 What data are available
 Where data are located
 What the data mean
 How to access the data
 Predefined reports
 Predefined queries
 How current the data
are
Application metadata are
critical in a database system
Chapter 19-43
Metadata v.s. Data
CSE
4701

Meta schema
 describes all schemata that can be defined in the data model
Data Dictionary Schema
 contains copy of metaschema; schema for format definitions; schema for data about application data
Data Dictionary Data
 schema for application data; metadata about application data
Data
 raw formatted application data
Example entries shown in the figure:
 relations(rel-name, att-name, dom-name)
 access-rights(user, relation, operation)
 (u1, supplier, insert), (u2, supplier, delete)
 supplier(s#, sname, location)
 (s1, smith, london), (s2, jones, boston)
Chapter 19-44
Example of Catalog Information
CSE
4701
Chapter 19-45
Relational DBMS Catalog

CSE
4701

All Metadata Stored as Relations
Example of Metadata Tables are:
Chapter 19-46
EER Diagram for Relational Catalog
CSE
4701
Chapter 19-47
Metadata in Oracle

CSE
4701
Complex Data Dictionary
 All Schema Objects (Tables,Views, Indices, …)
 User, All, and DBA Views
SELECT *
FROM ALL_CATALOG
WHERE OWNER = 'SMITH';
SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH,
NUM_DISTINCT, LOW_VALUE, HIGH_VALUE
FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'ORDERS';
Chapter 19-48
Metadata in Oracle
CSE
4701
SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS,
BLOCKS, EMPTY_BLOCKS, AVG_ROW_LENGTH
FROM USER_TABLES
WHERE TABLE_NAME = 'ORDERS';
SELECT INDEX_NAME, UNIQUENESS, BLEVEL, LEAF_BLOCKS,
DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY,
AVG_DATA_BLOCKS_PER_KEY
FROM USER_INDEXES
WHERE TABLE_NAME = 'ORDERS';
Chapter 19-49
Uses of System Catalog

CSE
4701

Example Query:
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = "CAD/CAM")
DDL Compilers:
 Correct Definition of Relations and Attributes
DML (Query) Compiler:
 DML Parser
 Guided by the Description of DML Syntax and the
Schema Information in the Catalog, Generates a Query
Tree after Parsing
 Optimizer
 Generates Access Paths that are Relatively Optimal for
Executing a Query/DML Command, by Accessing the
Database Structure Information (Schemas), and
Mapping High-level SQL Queries Into Low-level File
Access Commands
Chapter 19-50
Revisit Typical Database Processing
CSE
4701
[Figure: same processing diagram as the earlier "Typical Database Processing" slide, with Pre-Processing, High-Level Processing, Low-Level Processing, Post-Processing, Concurrency Control, Disk I/O, and Recovery]
Chapter 19-51
Typical Database Processing

CSE
4701
Pre-Processing
 Actions Taken Upon Receipt of a Query from User
 SQL Query via Query Tool or JDBC Call
 “Compilation” of DB Query
 Check Syntax, Optimize, Develop Run-Time
Strategy (Similar to PL Compilation)
 Query is Translated to DB Transaction
 A Transaction Contains Multiple DB Operations
 Transaction has Explicit Order of Operations

Database Transaction Must Succeed or Fail
 There is no Intermediate State – All or Nothing
 Completely Executed and Committed or
Aborts at any Point and Undone

New State or Previous State of DB
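The all-or-nothing behavior can be demonstrated with Python's built-in `sqlite3` module. The sketch is an illustration under assumptions: the `account` table, its CHECK constraint, and the `transfer` helper are invented for the example. Using the connection as a context manager commits the transaction on success and rolls it back if an exception escapes the block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was undone as well.
        return False

ok = transfer(conn, 1, 2, 30)        # succeeds: new state of the DB
failed = transfer(conn, 1, 2, 500)   # aborts: previous state is restored
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits account 2 before the debit fails, yet the final balances show no trace of that credit.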
Chapter 19-52
Typical Database Processing

CSE
4701
High-Level Processing
 Enqueue Transaction from Pre-Processing
 Transaction Must Wait for “Earlier” Transactions
 Remember - Shared DB State!

Request Locks from Concurrency Control
 All Locks Before Proceeding vs. Locks as Needed
 Avoid Deadlock and Livelock

Release Locks
 As Use of Data Completes to Increase Availability
 What Happens if Failure of Later Step in Transaction

Dequeue Transaction
 Completes Transaction Processing
 Return “Result” to Post-Processing
Chapter 19-53
What are Deadlock and Livelock?

CSE
4701
Deadlock
 Query 1 Gets Access to Table A Needs Table B
 Query 2 Gets Access to Table B Needs Table A
 Query 1 Won’t Release A until it Gets B
 Query 2 Won’t Release B until it Gets A
This is Deadlock!
Livelock
 Query 1 Gets A, Seeks B Can’t so Releases A
 Query 2 Gets B, Seeks A, Can’t so Releases B
 Process Keeps Repeating
 Can Lead to Starvation
Analogy – Two People Trying to Pass in Narrow Hall
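A common way to detect the deadlock case is to look for a cycle in a wait-for graph, where an edge from Q1 to Q2 means Q1 is waiting for a resource that Q2 holds. The sketch below is one possible detector using depth-first search; the dictionary representation and the function name are assumptions, and the slides do not say which technique a given DBMS uses.

```python
def find_deadlock(waits_for):
    """Return one cycle in the wait-for graph as a list, or None if there is none."""
    nodes = set(waits_for)
    for targets in waits_for.values():
        nodes.update(targets)
    state = {t: "unvisited" for t in nodes}
    path = []

    def visit(t):
        state[t] = "in_progress"
        path.append(t)
        for u in waits_for.get(t, ()):
            if state[u] == "in_progress":
                # u is already on the current path, so the path loops back to it.
                return path[path.index(u):]
            if state[u] == "unvisited":
                cycle = visit(u)
                if cycle is not None:
                    return cycle
        path.pop()
        state[t] = "done"
        return None

    for t in sorted(nodes):
        if state[t] == "unvisited":
            cycle = visit(t)
            if cycle is not None:
                return cycle
    return None

# Query 1 holds A and waits for B (held by Query 2), and vice versa.
print(find_deadlock({"Q1": ["Q2"], "Q2": ["Q1"]}))
# Query 1 waits for Query 2, which is not waiting for anyone.
print(find_deadlock({"Q1": ["Q2"], "Q2": []}))
```

Once a cycle is found, a system can break it by aborting one of the transactions on it.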



Chapter 19-54
Typical Database Processing

CSE
4701
Low-Level Processing
 Enqueue Transaction - Do Actual DB Operations
 Request Locks - Lower Granularity Level
 Issue I/Os - Based on Operations to Access
“Correct” and “Relevant” DB Records
 Process Returned Data - Aggregation, Sorting
 Integrity Checks: Do I/D/U Satisfy Constraints?
 Security Checks: Is DB R/I/D/U Allowed?
 Logging for Recovery - Commit the Transaction
 Release Locks - Available to Others
 Dequeue Transaction - Return Results to High-Level Processing
 Note: The Multiple Operations of Each DB
Transaction All Must be Successful
Chapter 19-55
Typical Database Processing

CSE
4701
Post Processing
 Collection of Results
 May be Passed Portions of Results as they Complete
 For Example, Sorted Blocks of Data that are then
Merged in a Final Step

Aggregation Operations
 May be Passed Aggregate Intermediate Results
 Sum for Different Departments to be Totaled

Security Checks
 Last Step Filtering to Ensure Only Allowed Data is
Returned
 May Execute Query but Only see Aggregate Result

Send Results to User
Chapter 19-56
Typical Database Processing

CSE
4701
Concurrency Control
 Control Access to Information
 Data and Metadata
 Prevent Simultaneous Updates
 Ensure Database Always Correct and Consistent
 Serial Schedule vs. Serializable Transaction
 Two Types
 Pessimistic - Locking-Based - Assume Collisions Will
Occur - e.g., Peoplesoft Course Registration
 Optimistic - Time-Based - Fix Problems After the Fact -
e.g., ATM Machines

CC Manages Locks at Different Granularity Levels
(DB, Table, Attribute, View, Tuple, Metadata, etc.)
Chapter 19-57
Typical Database Processing

CSE
4701
Disk I/O
 Performs the Actual Disk I/O for Read/Writes
 Block Oriented Activity
 Maintain Queue of All I/O Requests
 Ordering is Critical
 Related to Concurrency Control and Consistency




Single DB Transactions can have Multiple DB
Operations with Multiple Disk I/Os
Disk I/Os for Different Operations at Different
Times
High and Low Level Processing will Determine
What Operations Needed When
Disk I/O - Relatively “Dumb”
Chapter 19-58
Typical Database Processing

CSE
4701
Recovery
 Tightly Tied to DB Transaction Concept
 Transactions Must be:
 Atomic - Happens or Doesn’t
 Durable - Once Committed, Results Survive Failure
 Consistent - Follows Protocol/Correct DB State

When Failure Occurs, Can we:
 Recover to a Correct “Earlier” State
 Reconcile all “Active” Transactions that were
Executing at Failure Time


Involves Logging of Database Actions
Objective: High Availability and Reliability
Chapter 19-59
Query Optimization

CSE
4701

Not Really Optimizing, but Planning to Avoid Bad
Execution Strategies
Models
 Heuristics-Based
 Apply Transformation Rules According to a General
Strategy
 Focus on Relational Algebra that Underlies Each Query
 Improve the “Order” of Relational Operations

Cost-Based
 Minimize a Cost Function
I/O Cost + CPU Cost
 Subject to a Set of Constraints
Chapter 19-60
Query Processing Methodology
CSE
4701
High-level Calculus-based Query
EXTERNAL
SCHEMA
Query
Preprocessing
Algebraic Query (a tree structure)
LOGICAL
SCHEMA
Query
Optimization
INTERNAL
SCHEMA
Execution Schedule (file access plan)
Chapter 19-61
Query Preprocessing

CSE
4701




Input: Calculus Query on Base Relations
Normalization
 Manipulate Query Quantifiers and Qualification
Analysis
 Detect and Reject Incorrect Queries
 Possible for Only a Subset of Relational Calculus
Simplification
 Eliminate Redundant Predicates
Restructuring
 Calculus Query  Algebraic Query
 More Than One Translation is Possible
 Use Transformation Rules
Chapter 19-62
Normalization

CSE
4701

Lexical and Syntactic Analysis (Similar to Compilers)
 Check Validity
 Check for Attributes and Relations
 Type Checking on the Qualification
Put into Normal Form
 Conjunctive Normal Form
 (p11 ∨ p12 ∨ … ∨ p1n) ∧ … ∧ (pm1 ∨ pm2 ∨ … ∨ pmn)
 Disjunctive Normal Form
 (p11 ∧ p12 ∧ … ∧ p1n) ∨ … ∨ (pm1 ∧ pm2 ∧ … ∧ pmn)
 OR's Mapped into Union
 AND's Mapped into Join or Selection
Chapter 19-63
Refute Incorrect Queries

Example:
E(ENAME, ENO), P(JNO, JNAME), W(ENO, JNO, DUR)
SELECT ENAME, JNAME
FROM E, P, W
WHERE DUR > 27 AND DUR < 25
Incorrect
 Disjoint Components are Useless
 Multiple Relations, Missing Joins, may not be
incorrect, but may indicate Cartesian product
Contradictory
 Qualification can not be Satisfied by any Tuple
 DUR > 27 AND DUR < 25
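A contradictory qualification of this kind can be detected mechanically. The sketch below handles only the simplest case, a conjunction of strict `>` and `<` comparisons on one numeric attribute treated as real-valued. The tuple representation and the function name are assumptions for illustration, and a real analyzer covers many more predicate forms.

```python
def is_satisfiable(conjuncts):
    """Check a conjunction of (op, constant) range predicates on one attribute.

    Only the strict operators '>' and '<' are supported in this sketch.
    """
    lower = float("-inf")   # tightest lower bound seen so far
    upper = float("inf")    # tightest upper bound seen so far
    for op, constant in conjuncts:
        if op == ">":
            lower = max(lower, constant)
        elif op == "<":
            upper = min(upper, constant)
        else:
            raise ValueError(f"unsupported operator: {op}")
    # Some value can satisfy every conjunct only if the bounds leave a gap.
    return lower < upper

print(is_satisfiable([(">", 27), ("<", 25)]))   # DUR > 27 AND DUR < 25
print(is_satisfiable([(">", 25), ("<", 27)]))   # DUR > 25 AND DUR < 27
```

When the check fails, the query can be rejected or answered with an empty result without touching the database.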
Chapter 19-64
Simplification

CSE
4701

Why Simplify?
 The Simpler the Query, the Less Work there is and
the Better the Performance
How? Use transformation rules
 Elimination of Redundancy
 Idempotency Rules
p1 ∧ ¬(p1) = false
¬(p1 ∧ p2) = ¬(p1) ∨ ¬(p2)
p1 ∨ false = p1
– …
 Application of Transitivity
 Use of Integrity Rules

Example
 x > a and x > b
 DUR > 27 AND DUR > 25
Chapter 19-65
Restructuring
Convert Relational Calculus to Relational Algebra
 Make use of Query Trees
 Example
Find the names of employees other than J. Doe who worked
on the CAD/CAM project for either 1 or 2 years.

SELECT ENAME
FROM E, W, P
WHERE E.ENO = W.ENO
AND W.JNO = P.JNO
AND E.ENAME <> "J. Doe"
AND P.JNAME = "CAD/CAM"
AND (W.DUR = 12 OR W.DUR = 24)

Query Tree (root at top):
Project: π ENAME
  Select: σ (DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME≠"J. DOE"
    Join: ⋈ JNO
      P
      Join: ⋈ ENO
        W
        E
Chapter 19-66
Query Optimization Objectives

CSE
4701




Improving Performance
Arriving at a Query Plan of Execution
Analyzing the Relational Algebra Query
 Replace Costly Operations
 Do Selections and Projections Early
Optimization Heuristics for the Relational Algebra
 Performing Selection and Projection Before Join
 Combining Several Selections Over a Single
Relation Into One Selection
 Find Common Subexpressions
 Algebraic Rewriting/transformation Rules
General Transformation Rules for Relational Algebra
(Equivalence-preserving Algebraic Rewriting Rules)
Chapter 19-67
Query Optimization: An Example
CSE
4701

Why is it important?
SELECT ENAME
FROM E, W
WHERE E.ENO = W.ENO
AND W.RESP = "Manager"

Strategy 1
 π ENAME (σ RESP="Manager" ∧ E.ENO=W.ENO (E × W))

Strategy 2
 π ENAME (E ⋈ ENO (σ RESP="Manager" (W)))
Chapter 19-68
Cost of Alternatives
Assume :
 card(E) = 4,000; card(W)=10,000
 10% of tuples in W satisfy RESP="Manager"
(selection generates 1,000 tuples)
 Execution time Proportional to the Sum of the
Cardinalities of the Temporary Relations
 Searching is Done by Sequential Scanning

CSE
4701
Strategy 1
Cartesian prod.  = 40,000,000
Search over all  = 40,000,000
Total            = 80,000,000
Strategy 2
Selection over W =     10,000
Join (4000*1000) =  4,000,000
Total            =  4,010,000
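The totals can be reproduced directly from the stated assumptions. The sketch below uses the slide's cost model, in which cost is the sum of the cardinalities processed and every search is a sequential scan; the function and variable names are hypothetical.

```python
def compare_strategies(card_e, card_w, manager_percent):
    """Return (strategy_1_cost, strategy_2_cost) under the slide's cost model."""
    # Strategy 1: build the full Cartesian product, then scan all of it.
    product_size = card_e * card_w
    strategy_1 = product_size + product_size
    # Strategy 2: scan W to select managers, then join E with the survivors.
    selected_w = card_w * manager_percent // 100
    strategy_2 = card_w + card_e * selected_w
    return strategy_1, strategy_2

s1, s2 = compare_strategies(card_e=4_000, card_w=10_000, manager_percent=10)
print(f"{s1:,} vs {s2:,} (about {s1 / s2:.0f}x cheaper)")
```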
Chapter 19-69
General Query Optimization Strategy

CSE
4701


Perform Selections Early
 Yields Smaller Intermediate Results
 Direct Impact on Subsequent Join/Cartesian Prod.
Combine Selections with a Prior Cartesian Product
into a Theta or Equi Join
 Join is a Cheaper Operation
Combine (Cascade) Selections and Projections
AB(B (R))  AB(R)
σ p1 (σ p2 (R)) ≡ σ p1 ^ p2 (R)
This Results in One Pass Instead of Two over Table
Chapter 19-70
General Query Optimization Strategy

CSE
4701


Identify Common Subexpressions
 Compute Once and Store
 use Stored Version for Subsequent Times
 Often Useful When Views are Employed
Preprocess Data via Sorts and Indexes
 Speeds up Searches and Joins by Limiting Scope
Evaluate and Assess Different Options
 For Cartesian Product, Use Smaller Relation for
Comparison
 Use System Catalog (Meta-data) to Effect Order in
Query Execution Plan
Chapter 19-71
Relational Algebra Transformations
1. Cascade of Selection
 σ p1 ^ p2 ^ … ^ pn (R) ≡ σ p1 (σ p2 (...(σ pn (R))...))
2. Commutativity of Selection
 σ p1 (σ p2 (R)) ≡ σ p2 (σ p1 (R))
 σ p1 or p2 (R) ≡ σ p1 (R) ∪ σ p2 (R)
3. Cascade of Projection
 π A1 (π A2 (...(π An (R))...)) ≡ π A1 (R) if A1 ⊆ A2 ⊆ ... ⊆ An
4. Commuting Selection with Projection (p Uses Only Attributes in A1, ..., An)
 π A1,A2,...,An (σ p (R)) ≡ σ p (π A1,A2,...,An (R))
Chapter 19-72
Relational Algebra Transformations
5. Commutativity of Theta Join and Cartesian Product
 R ⋈ A S ≡ S ⋈ A R
 R × S ≡ S × R
6. Commuting Selection with Theta Join (Cartesian)
 σ p(A) (R × S) ≡ (σ p(A) (R)) × S
(A defined on R only)
 σ p(A) ^ p(B) (R × S) ≡ (σ p(A) (R)) × (σ p(B) (S))
(A defined on R, B defined on S)
Also Holds for Theta Join
7. Commuting Projection with Theta Join (Cartesian)
 π C (R × S) ≡ π A (R) × π B (S) where A ∪ B = C
 A are Attributes in C for R and B are Attributes in
C for S
Chapter 19-73
Relational Algebra Transformations
8. Commutativity of Set Operations
 R ∪ S ≡ S ∪ R
 R ∩ S ≡ S ∩ R
9. Associativity of Set Operations
 (R ∪ S) ∪ T ≡ R ∪ (S ∪ T)
 (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
 (R × S) × T ≡ R × (S × T)
 (R ∩ S) ∩ T ≡ R ∩ (S ∩ T)
10. Commuting Select with Set Operations
 σ p(Ai) (R ∪ T) ≡ σ p(Ai) (R) ∪ σ p(Ai) (T)
where Ai is defined on both R and T
 σ p(Ai) (R ∩ T) ≡ σ p(Ai) (R) ∩ σ p(Ai) (T)
where Ai is defined on both R and T
Chapter 19-74
Relational Algebra Transformations
CSE
4701
11. Commuting Projection with Union
 π C (R ⋈ q(Aj,Bk) S) ≡ π A (R) ⋈ q(Aj,Bk) π B (S)
 π C (R ∪ S) ≡ π A' (R) ∪ π B' (S)
where R[A] and S[B]
C = A' ∪ B' where A' ⊆ A, B' ⊆ B
12. Converting Selection/Cartesian Into Theta Join
 σ C (R × S) ≡ R ⋈ C S
Chapter 19-75
Using Heuristics in Query Optimization

CSE
4701
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute
groups of operations based on the access paths
available on the files involved in the query.

The main heuristic is to apply first the
operations that reduce size of intermediate
results

E.g., Apply SELECT and PROJECT operations
before applying the JOIN or other operations.
Chapter 19-76
Using Heuristics in Query Optimization (2)

CSE
4701
Query tree:



A tree data structure that corresponds to a relational algebra
expression. It represents the input relations of the query as
leaf nodes of the tree, and represents the relational algebra
operations as internal nodes.
An execution of the query tree consists of executing an
internal node operation whenever its operands are
available and then replacing that internal node by the
relation that results from executing the operation.
Query graph:

A graph data structure that corresponds to a relational
calculus expression. It does not indicate an order on which
operations to perform first. There is only a single graph
corresponding to each query.
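A minimal sketch of the query tree idea, assuming relations are lists of Python dicts. The node classes (Relation, Select, Project, Join) and the execute function are illustrative names, not part of the slides. Leaves hold the input relations, internal nodes hold the operations, and an internal node is evaluated once its operands are available.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Relation:            # leaf node: an input relation
    name: str
    rows: list


@dataclass
class Select:              # internal node: selection
    predicate: Callable
    child: object


@dataclass
class Project:             # internal node: projection (duplicates kept)
    attrs: list
    child: object


@dataclass
class Join:                # internal node: theta join
    predicate: Callable
    left: object
    right: object


def execute(node):
    """Evaluate bottom-up; each internal node is replaced by its result."""
    if isinstance(node, Relation):
        return node.rows
    if isinstance(node, Select):
        return [r for r in execute(node.child) if node.predicate(r)]
    if isinstance(node, Project):
        return [{a: r[a] for a in node.attrs} for r in execute(node.child)]
    if isinstance(node, Join):
        left, right = execute(node.left), execute(node.right)
        return [{**l, **r} for l in left for r in right if node.predicate(l, r)]
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical sample data for two small relations E and W.
E = Relation("E", [{"ENO": 1, "ENAME": "J. Doe"}, {"ENO": 2, "ENAME": "A. Lee"}])
W = Relation("W", [{"ENO": 1, "JNO": 10, "DUR": 12}, {"ENO": 2, "JNO": 10, "DUR": 6}])

tree = Project(
    ["ENAME"],
    Join(lambda w, e: w["ENO"] == e["ENO"],
         Select(lambda w: w["DUR"] in (12, 24), W),
         E),
)
```

Calling execute(tree) walks the tree from the leaves upward, so the selection on W runs before the join that consumes its result.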
Chapter 19-77
Using Heuristics in Query Optimization

CSE
4701
Heuristic Optimization of Query Trees:




The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
Remember: There is Not Just One Solution to Each Query on the Exam
The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
Example:
Q: SELECT LNAME
   FROM   EMPLOYEE, WORKS_ON, PROJECT
   WHERE  PNAME = 'AQUARIUS' AND
          PNUMBER = PNO AND ESSN = SSN
          AND BDATE > '1957-12-31';
Chapter 19-78
Heuristics Algebraic Optimization Concepts

CSE
4701


Using Cascade of Selections Rule, Break up Any
Selections With Conjunctive Conditions Into a
Cascade of Selections
 Allows More Freedom in Moving Selections Down
Different Branches of the Tree
Using Commutativity of Selections with Other
Operations Rules, Move Each Selection Down the
Query Tree as far as Possible
If Possible, Combine a Cartesian Product With a
Selection Into a Join
Chapter 19-79
Heuristics Algebraic Optimization Concepts

CSE
4701


Using Associativity of Binary Operations, Rearrange
the Leaf Nodes So That the Most Restrictive
Selections Are Executed First
 The Fewer Tuples the Resulting Relation Contains,
the More Restrictive the Selection
 Reducing the Size of Intermediate Results
Improves Performance
Using Cascade of Projections and Commutativity of
Projections with Other Operations, Move Projections
Down the Query Tree as Far as Possible
Identify Subtrees that Represent Groups of Operations
that can be Executed by a Single Algorithm
Chapter 19-80
Summary of All Rules
CSE
4701
1. Cascade of Selection
2. Commutativity of Selection
3. Cascade of Projection
4. Commuting Selection with Projection (A’s not in p)
5. Commutativity of Theta Join and Cartesian Product
6. Commuting Selection with Theta Join (Cartesian)
7. Commuting Projection with Theta Join (Cartesian)
8. Commutativity of Set Operations
9. Associativity of Set Operations
10. Commuting Select with Set Operations
11. Commuting Projection with Union
12. Converting Selection/Cartesian Into Theta Join
Chapter 19-81
Heuristic Algebraic Optimization Algorithm

CSE
4701





Use Rule 1 to Break up Selects with Conjunctions into
a Cascade to Move them Down the Query Tree
Use Rules 2, 4, 6, and 10 to Commute Select with
Project, Join, Cart. Prod., Union, and Intersection
Use Rule 5 (Commute) and 9 (Associative) to
Rearrange the Leaf Nodes of Query Tree to:
 Most Restrictive Select Executed First
 Avoid Cartesian Product in Leaf Nodes
Use Rule 12 to Convert a Select/Cart Prod to Join
Use Rules 3, 4, 7, and 11 to Cascade and Commute
Project - Pushing Down Tree as Far as Possible
Identify Subtrees that Can Execute as Independent
Algorithms (Set of Operations)
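The first two steps of the algorithm (break up conjunctive selects, then move each select down) can be sketched as follows. This is an illustration under assumptions, not the textbook algorithm: trees are nested tuples, each predicate is a (label, attribute set) pair, and the names attrs_of, push_down and optimize are hypothetical. The sample schema is the Sailors/Boats/Reserves schema used later in these slides.

```python
def attrs_of(tree, schema):
    """Set of attributes produced by a subtree."""
    op = tree[0]
    if op == "rel":
        return set(schema[tree[1]])
    if op == "select":
        return attrs_of(tree[2], schema)
    if op == "product":
        return attrs_of(tree[1], schema) | attrs_of(tree[2], schema)
    raise ValueError(f"unknown operator: {op}")


def push_down(pred, tree, schema):
    """Place one select as low as the attributes it needs allow."""
    _, needed = pred
    op = tree[0]
    if op == "select":
        # Rule 2: selections commute, so slide below the existing select.
        return ("select", tree[1], push_down(pred, tree[2], schema))
    if op == "product":
        # Rule 6: move into the one branch that supplies all attributes.
        for i in (1, 2):
            if needed <= attrs_of(tree[i], schema):
                parts = list(tree)
                parts[i] = push_down(pred, tree[i], schema)
                return tuple(parts)
    return ("select", pred, tree)   # cannot go lower: apply it here


def optimize(conjuncts, tree, schema):
    """Rule 1: treat the conjunction as a cascade, then push each select."""
    for pred in conjuncts:
        tree = push_down(pred, tree, schema)
    return tree


schema = {
    "Sailors": ["S.sid", "S.sname", "S.rating", "S.age"],
    "Boats": ["B.bid", "B.bname", "B.color"],
    "Reserves": ["R.sid", "R.bid", "R.day", "R.rname"],
}
join_b = ("B.bid=R.bid", frozenset({"B.bid", "R.bid"}))
join_s = ("S.sid=R.sid", frozenset({"S.sid", "R.sid"}))
young = ("S.age<30", frozenset({"S.age"}))
rated = ("S.rating>=11", frozenset({"S.rating"}))
red = ('B.color="Red"', frozenset({"B.color"}))

canonical = ("product", ("rel", "Boats"),
             ("product", ("rel", "Reserves"), ("rel", "Sailors")))
plan = optimize([join_b, join_s, young, rated, red], canonical, schema)
```

In the resulting plan each single-relation select sits directly above its relation, and each join condition sits directly above the Cartesian product it can later be merged with (Rule 12).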
Chapter 19-82
Heuristic Optimization: Example
CSE
4701
Canonical query tree at the end of the query preprocessing phase
E(ENAME, ENO)
P(JNO,JNAME)
W(ENO,PNO,DUR)
π ENAME
  σ (DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME="J. DOE"
    ⋈ JNO
      P
      ⋈ ENO
        W
        E
Chapter 19-83
Heuristic Optimization– Example
Use cascading of selections rule to decompose selections
π ENAME
  σ DUR=12 OR DUR=24
    σ JNAME="CAD/CAM"
      σ ENAME="J. DOE"
        ⋈ JNO
          P
          ⋈ ENO
            W
            E
Chapter 19-84
Heuristic Optimization– Example
Push selection down using commutativity of selection over join
π ENAME
  σ DUR=12 OR DUR=24
    σ JNAME="CAD/CAM"
      ⋈ JNO
        P
        ⋈ ENO
          W
          σ ENAME="J. Doe"
            E
Chapter 19-85
Heuristic Optimization–Example
Push selection down using commutativity of selection over join
π ENAME
  σ DUR=12 OR DUR=24
    ⋈ JNO
      σ JNAME="CAD/CAM"
        P
      ⋈ ENO
        W
        σ ENAME="J. Doe"
          E
Chapter 19-86
Heuristic Optimization–Example
Push selection down
π ENAME
  ⋈ JNO
    σ JNAME="CAD/CAM"
      P
    ⋈ ENO
      σ DUR=12 OR DUR=24
        W
      σ ENAME="J. Doe"
        E
Chapter 19-87
Heuristic Optimization–Example
Do early projection
π ENAME
  ⋈ JNO
    π JNO
      σ JNAME="CAD/CAM"
        P
    π JNO,ENAME
      ⋈ ENO
        π JNO,ENO
          σ DUR=12 OR DUR=24
            W
        π ENO,ENAME
          σ ENAME="J. Doe"
            E
Chapter 19-88
Heuristic Optimization–Example
Identify subtrees that can be implemented in one algorithm
π ENAME
  ⋈ JNO
    π JNO
      σ JNAME="CAD/CAM"
        P
    π JNO,ENAME
      ⋈ ENO
        π JNO,ENO
          σ DUR=12 OR DUR=24
            W
        π ENO,ENAME
          σ ENAME="J. Doe"
            E
Chapter 19-89
Heuristic Optimization: A Second Example
CSE
4701
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Let XLOANS = π_S(σ_F(Loans × Borrowers × Books))
where:
S ={Title, Author, Pname, LC_No, Name,
Addr, City, Card_No, Date}
and
F = {Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No}
Chapter 19-90
Heuristic Optimization: A Second Example

CSE
4701
Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
XLOANS
X
Books
X
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-91
Heuristic Optimization: A Second Example
 Title
CSE
4701
 Date  1/1/88

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
X
Books
X
Loans
Query= TITLE(Date  1/1/88 (XLOANS))
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-92
Heuristic Optimization: A Second Example
 Title
Try to Cascade
CSE
4701
Date  1/1/88
 Date  1/1/88

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
X
Books
X
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-93
Heuristic Optimization: A Second Example
 Title
CSE
4701

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Date  1/1/88
Commute Select
and Project
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
X
Books
X
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-94
Heuristic Optimization: A Second Example
 Title
CSE
4701

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
 Date  1/1/88
Commute Select
and Select
X
Books
X
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-95
Heuristic Optimization: A Second Example
 Title
CSE
4701

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
X
Books
X
 Date  1/1/88
Loans
Borrower
Commute Select and
Cartesian Product
Two Levels Down
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-96
Heuristic Optimization: A Second Example
 Title
Try to Cascade
CSE
4701

Borrower.Card_No = Loans.Card_No
Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Borrower.Card_No = Loans.Card_No ^
Books.LC_No = Loans.LC_No
X
Books
X
 Date  1/1/88
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-97
Heuristic Optimization: A Second Example
 Title
CSE
4701

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Books.LC_No = Loans.LC_No
X
Books
 Borrower.Card_No = Loans.Card_No
Commute Select and
Cartesian Product
One Level Down
X
What’s Next?
 Date  1/1/88
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-98
Heuristic Optimization: A Second Example
 Title
CSE
4701

Title, Author, Pname,
LC_No, Name, Addr,
City, Card_No, Date
 Books.LC_No = Loans.LC_No
X
Books
Combine
Projections
Borrower.Card_No = Loans.Card_No
X
 Date  1/1/88
Loans
Borrower
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Chapter 19-99
Heuristic Optimization: A Second Example
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
 Title
CSE
4701
 Books.LC_No = Loans.LC_No
X
Books
Borrower.Card_No = Loans.Card_No
X
 Date  1/1/88
Loans
Borrower
What is Still a Problem?
We are Not Projecting so All Attributes are
Still Collected Until the Final Project!
Chapter 19-100
Heuristic Optimization: A Second Example
 Title
CSE
4701
 Books.LC_No = Loans.LC_No
 Loans.LC_No
 Books.LC_No, Title
X
Books
 Borrower.Card_No = Loans.Card_No
 Loans.LC_No,
X
 Borr.Card_No
Loans.Card_No
 Date  1/1/88
Loans
Borrower
Add Strategic Projections
to Send Only the Minimum
Up the Tree as Needed
for Join/Result Set
Chapter 19-101
Heuristic Optimization: A Second Example
CSE
4701
 Title
What is the Final Step?
Combine Select and
Cartesian Product
 Books.LC_No = Loans.LC_No
Result: Equijoins!
 Loans.LC_No
X

 Loans.LC_No,
 Books.LC_No, Title
Books
Borrower.Card_No = Loans.Card_No
X
 Borr.Card_No
Loans.Card_No
 Date  1/1/88
Borrower
Loans
Chapter 19-102
Heuristic Optimization: A Second Example
CSE
4701
FINAL TREE with
Equijoins!
 Title
LC_No
 Loans.LC_No
 Books.LC_No, Title
Books
Card_No
 Loans.LC_No,
 Borr.Card_No
Loans.Card_No
 Date  1/1/88
Borrower
Loans
Chapter 19-103
Heuristic Optimization: A Third Example

CSE
4701
Heuristic Optimization of Query Trees:



The same query could correspond to many different
relational algebra expressions — and hence many
different query trees.
The task of heuristic optimization of query trees is to find
a final query tree that is efficient to execute.
Example:
Q: SELECT LNAME
   FROM   EMPLOYEE, WORKS_ON, PROJECT
   WHERE  PNAME = 'AQUARIUS' AND
          PNUMBER = PNO AND ESSN = SSN
          AND BDATE > '1957-12-31';
Chapter 19-104
Heuristic Optimization: A Third Example
CSE
4701
What’s one Approach?
Chapter 19-105
Heuristic Optimization: A Third Example
CSE
4701
Moving Selects Down
Is this Optimal?
Chapter 19-106
Heuristic Optimization: A Third Example
CSE
4701
No! Prior Version
Retrieved All Employees
Without First Apply
Pname Select
Chapter 19-107
Heuristic Optimization: A Third Example
CSE
4701
Replace CART PRODUCT
Plus SELECT with JOIN!
What’s left to do?
Chapter 19-108
Heuristic Optimization: A Third Example
CSE
4701
Chapter 19-109
Heuristic Optimization: A Fourth Example
CSE
4701
Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
Query: Find all Sailors who are younger than 30, have a
rating of at least 11, and have Reserved red Boats.
SELECT S.sid, S.sname, S.age
FROM Sailors S, Boats B, Reserves R
WHERE B.bid = R.bid AND S.sid = R.sid AND S.rating >= 11
AND B.color = 'Red' AND S.age < 30;
π S.sid, S.sname, S.age (σ B.bid=R.bid ^ S.sid=R.sid ^ S.age<30 ^
B.color="Red" ^ S.rating≥11 (B × S × R))
Chapter 19-110
Heuristic Optimization: A Fourth Example

Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
π S.sid, S.sname, S.age
  σ B.bid=R.bid ^ S.sid=R.sid ^ S.age<30 ^ S.rating>=11 ^ B.color="Red"
    ×
      Boats
      ×
        Reserves
        Sailors
Step 1 - Break up Selects
Chapter 19-111
Heuristic Optimization: A Fourth Example
Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
π S.sid, S.sname, S.age
  σ B.bid=R.bid ^ S.sid=R.sid
    σ S.age<30 ^ S.rating>=11
      σ B.color="Red"
        ×
          Boats
          ×
            Reserves
            Sailors
Step 2 – Move that Boats Select
Chapter 19-112
Heuristic Optimization: A Fourth Example
Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
π S.sid, S.sname, S.age
  σ B.bid=R.bid ^ S.sid=R.sid
    σ S.age<30 ^ S.rating>=11
      ×
        σ B.color="Red"
          Boats
        ×
          Sailors
          Reserves
Step 3 – Move that Sailor Select
Chapter 19-113
Heuristic Optimization: A Fourth Example

Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
π S.sid, S.sname, S.age
  σ B.bid=R.bid ^ S.sid=R.sid
    ×
      σ B.color="Red"
        Boats
      ×
        Reserves
        σ S.age<30 ^ S.rating>=11
          Sailors
Step 4 – Introduce Projections
Chapter 19-114
Heuristic Optimization: A Fourth Example

Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
π S.sid, S.sname, S.age
  σ B.bid=R.bid ^ S.sid=R.sid
    ×
      π B.bid
        σ B.color="Red"
          Boats
      ×
        π R.sid, R.bid
          Reserves
        π S.sid, S.sname, S.age
          σ S.age<30 ^ S.rating>=11
            Sailors
Step 5 – What’s Next Step?
Chapter 19-115
Heuristic Optimization: A Fourth Example
Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
Step 6 - Move Down S.sid=R.sid
π S.sid, S.sname, S.age
  σ B.bid=R.bid
    ×
      π B.bid
        σ B.color="Red"
          Boats
      σ S.sid=R.sid
        ×
          π R.sid, R.bid
            Reserves
          π S.sid, S.sname, S.age
            σ S.age<30 ^ S.rating>=11
              Sailors
Step 7 – What’s Next Step?
Chapter 19-116
Heuristic Optimization: A Fourth Example
Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
Step 7 – Combined for Equi Join
π S.sid, S.sname, S.age
  σ B.bid=R.bid
    ×
      π B.bid
        σ B.color="Red"
          Boats
      ⋈ S.sid=R.sid
        π R.sid, R.bid
          Reserves
        π S.sid, S.sname, S.age
          σ S.age<30 ^ S.rating>=11
            Sailors
Step 8 – What’s Final Step?
Chapter 19-117
Heuristic Optimization: A Fourth Example
Sailors (sid, sname, rating, age)
Boats (bid, bname, color)
Reserves (sid, bid, day, rname)
Step 8 – Introduce Final EquiJoin
π S.sid, S.sname, S.age
  ⋈ B.bid=R.bid
    π B.bid
      σ B.color="Red"
        Boats
    ⋈ S.sid=R.sid
      π R.sid, R.bid
        Reserves
      π S.sid, S.sname, S.age
        σ S.age<30 ^ S.rating>=11
          Sailors
Chapter 19-118
Converting Relational Algebra to Query Tree
Movies1997 = π Lname,Fname,State (σ Person.PersonID = AllActors.PersonID ^
Movies1997.ShowID=MovieRoles.ShowID ^ Year=1997
(Person x Movies x MovieRoles))
π Lname,Fname,State
  σ Person.PersonID = AllActors.PersonID ^ Movies1997.ShowID=MovieRoles.ShowID ^ Year=1997
    ×
      ×
        Person
        Movies
      MovieRoles
Chapter 19-119
Converting Relational Algebra to Query Tree
FriendsActors = π Lname,Fname,RLName,RFName (σ
ShowName=Friends ^ TVRoles.ShowID = Friends.ShowID ^
EpisodeID>10 ^ EpisodeId<26 ^ Person.PersonID =
RoleNames.PersonID (TVShows x TVRoles x Roles x Person))
 ShowID

ShowName=Friends ^ TVRoles.ShowID = Friends.ShowID ^
EpisodeID>10 ^ EpisodeId<26 ^ Person.PersonID =
RoleNames.PersonID
X
X
TVShows
TVRoles
Roles
X
Person
Chapter 19-120
Heuristics Query Optimization: Summary

CSE
4701
First Apply Operations that Reduce the Size of
Intermediate Results
 Move Selections and Projections Down the Tree as
far as Possible
 Early Selections Reduce the Number of Tuples
 Early Projections Reduce the Number of Attributes

The Most Restrictive Selection and Join Operations Should be
Executed Before Other Similar Operations.
 This is Accomplished by Reordering the Leaf Nodes of
the Tree Among Themselves and Adjusting the Rest of
the Tree Appropriately
Chapter 19-121
Cost-Based Optimization
CSE
4701


Reduce Defined Cost of Executing Queries
What is Involved in the Cost of Executing a Query?
 Access Cost to Secondary Storage
 Search for Data Block (Index)
 Read/Write Index and Data Blocks

Storage Cost
 Index and Data Blocks
 Intermediate Files

Computation Cost
 Query Planning - Optimization Effort
 Record Search, Sort, Merge
 Actual Transaction/Query Operations

Communications Cost
 Transfer of Results to the User
Chapter 19-122
Complexity of Relational Operations

CSE
4701


Assuming
 Relations of Cardinality n
 Sequential Scan of Data in each Relation
Complexity of Each Operation is Indicated
Avoid Cartesian Product at All Costs!

Operation                                Complexity
Select                                   O(n)
Project (w/o duplicate elimination)      O(n)
Project (with duplicate elimination)     O(n log n)
Group                                    O(n log n)
Join                                     O(n log n)
Division                                 O(n log n)
Set Operators                            O(n log n)
Cartesian Product                        O(n²)
Chapter 19-123
Cost-Based Optimization

CSE
4701




To Understand Cost-Based Operations, we Must
Focus on Implementation Strategy of:
 Select
 Project
 Join
For Select and Project - There is a Fixed Cost that we
Must Live With
For Join
 Implementation Strategy
 Different Join Strategies
Objective:
 Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra Heuristic
Optimization Can Complement One Another
Chapter 19-124
Implementation of SELECT

CSE
4701



Principles
 Equality Eliminates Many Tuples
 Index Focuses and Limits Search Scope
Sequential Scan
 Brute Force
 Search All Records to Find Matching Ones
Binary Search
 Equality Comparison on a Key Attribute on Which the File is Ordered
Primary Index or Hash Key for Single Record
 Equality Comparison on a Key Attribute With
Primary Index or Hash Key
 Go Directly to Record; No Need to Scan Entire
Table
 Cost to Maintain Index/Hash
Chapter 19-125
Implementation of SELECT

CSE
4701




Primary Index for Multiple Records
 Use the Primary Index to Find the Record that Satisfies the Equality Condition
 Go Forward (> or ≥) or Backward (< or ≤)
According to the Comparison Operator
Clustering Index for Multiple Records
 Equality Comparison on a Non-key Attribute With
a Clustering Index (e.g., Sort-Merge Algorithm)
Secondary Index
 Equality or Range Queries
Primary Indexes Play a Role Similar to Searching
Sorted Array
We’ll Discuss Indexing Techniques at a Later Time
Chapter 19-126
Recall B+ Tree – Find Leaf and Go L or R
CSE
4701
Chapter 19-127
Recall B+ Tree – Find Leaf and Go L or R
CSE
4701
Chapter 19-128
Implementation of SELECT

CSE
4701


Conjunctive Selection (C1 ∧ C2 ∧ … ∧ CN)
 If One of the Conjuncts has a Good Access Path,
Use it and Check the Other Conjuncts for Each of
These Records
 Pick the one that is based on “concrete” value
Composite Index
 If an Index has Been Established Jointly for a
Number of Attributes in the Conjunct
 Equality Condition
Intersection of Pointers
 If Secondary Indexes Exist on All or Most of the
Attributes in the Conjunct and the Indexes Include
Record Pointers
 Retrieve Each Attribute Using These Indexes and
Then Take Their Intersection
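The intersection-of-pointers strategy can be sketched as below. This is a toy illustration under assumptions: "record pointers" are simply list positions, each secondary index is a dict from attribute value to a set of positions, and build_index and conjunctive_select are hypothetical names. The sample rows are made up and follow the Deposit relation of the bank example in these slides.

```python
from collections import defaultdict


def build_index(records, attr):
    """Secondary index: attribute value -> set of record pointers."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        index[rec[attr]].add(rid)
    return index


def conjunctive_select(records, indexes, equalities, residual=lambda r: True):
    """Intersect pointer sets for the indexed equality conjuncts, fetch
    only the surviving records, then test the remaining conjuncts."""
    pointer_sets = [indexes[a].get(v, set()) for a, v in equalities.items()]
    if pointer_sets:
        rids = set.intersection(*pointer_sets)
    else:
        rids = set(range(len(records)))
    return [records[rid] for rid in sorted(rids) if residual(records[rid])]


# Hypothetical sample rows.
deposit = [
    {"bank": "BofA", "acct": 1, "customer": "Bill", "balance": 2500},
    {"bank": "BofA", "acct": 2, "customer": "Ann", "balance": 4000},
    {"bank": "Chase", "acct": 3, "customer": "Bill", "balance": 9000},
    {"bank": "BofA", "acct": 4, "customer": "Bill", "balance": 500},
]
indexes = {a: build_index(deposit, a) for a in ("bank", "customer")}

matches = conjunctive_select(
    deposit, indexes,
    {"bank": "BofA", "customer": "Bill"},
    residual=lambda r: r["balance"] > 1000,
)
```

Only the records whose pointers appear in every index result are fetched; the range condition on balance has no index here, so it is checked on the fetched records.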
Chapter 19-129
Implementing PROJECT

CSE
4701

If <Attribute List> Includes Key
 Simple Since the Cardinality of the Result is the
Same as the Cardinality of the Original Relation
 No Need to Remove Duplicates - Key Attribute
If <Attribute List> Does Not Include Key
 Duplicates Allowed
 Duplicate Elimination
 Sort After Projection and then Eliminate Consecutively
Appearing Duplicates
 See Textbook for Algorithms
 Use Hashing: Hash Each Record Into a Bucket and
Check Against Records Already in That Bucket

Size Estimation: card(π_A(R)) = card(R) When Duplicates are Kept or A Includes a Key
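The two duplicate-elimination strategies named above can be sketched as follows, assuming rows are Python dicts. The function names and the fixed bucket count of 8 are illustrative choices, not from the slides.

```python
def project_sort(rows, attrs):
    """Project, sort, then drop duplicates that appear consecutively."""
    projected = sorted(tuple(r[a] for a in attrs) for r in rows)
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result


def project_hash(rows, attrs, n_buckets=8):
    """Project, hash each tuple into a bucket, and keep it only if an
    equal tuple is not already in that bucket."""
    buckets = {}
    result = []
    for r in rows:
        t = tuple(r[a] for a in attrs)
        bucket = buckets.setdefault(hash(t) % n_buckets, [])
        if t not in bucket:
            bucket.append(t)
            result.append(t)
    return result


# Hypothetical sample rows for BORROWERS(Name, Addr, City, Card_No).
borrowers = [
    {"Name": "Ann", "City": "Storrs", "Card_No": 1},
    {"Name": "Bill", "City": "Hartford", "Card_No": 2},
    {"Name": "Cy", "City": "Storrs", "Card_No": 3},
]
```

Projecting on City (no key) needs duplicate elimination; projecting on Card_No (a key) returns as many tuples as the relation has.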
Chapter 19-130
Implementing JOIN

CSE
4701
Nested Loop
 Simple Iteration and Block-Oriented Iteration
For Each Block in R do
Retrieve Every Record from S and Test Join Condition
 An Index for S may Speed up the Inner Loop
 Smaller Relation should be Outer Loop
 Calculation of I/O
Let bo (bi) be the Number of Blocks taken up by the Outer
(Inner) Relation
Let nB (>1) be the Buffer Size (in blocks) Devoted to
Arguments
Let bR be the Size of the Resulting Relation (in blocks)
Total no. of Block Accesses = bo + ⌈bo/(nB−1)⌉ × bi + bR
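A minimal sketch of block-oriented nested-loop join that also counts block reads. It assumes each relation is a list of blocks (each block a list of dicts), that nB − 1 buffer blocks hold the outer relation while one block holds the inner relation, and that writing the result is not counted. The names to_blocks and block_nested_loop_join and the sample data are hypothetical.

```python
def to_blocks(rows, per_block):
    """Split a list of tuples (dicts) into fixed-size blocks."""
    return [rows[i:i + per_block] for i in range(0, len(rows), per_block)]


def block_nested_loop_join(outer, inner, key, n_buffer=2):
    """Equijoin on `key`; returns (result rows, number of block reads)."""
    if n_buffer < 2:
        raise ValueError("need at least 2 buffer blocks")
    chunk = n_buffer - 1            # outer blocks held in memory at once
    block_reads = 0
    result = []
    for start in range(0, len(outer), chunk):
        outer_chunk = outer[start:start + chunk]
        block_reads += len(outer_chunk)
        for inner_block in inner:   # full scan of inner per outer chunk
            block_reads += 1
            for outer_block in outer_chunk:
                for o in outer_block:
                    for i in inner_block:
                        if o[key] == i[key]:
                            result.append({**o, **i})
    return result, block_reads


customer = [{"customer": f"c{n}", "city": "Storrs"} for n in range(6)]
deposit = [{"customer": f"c{n % 6}", "acct": n} for n in range(8)]
customer_blocks = to_blocks(customer, 2)   # 3 blocks
deposit_blocks = to_blocks(deposit, 2)     # 4 blocks
```

With two buffer blocks, putting the 3-block relation on the outer loop costs 3 + 3 × 4 = 15 reads, while the reverse costs 4 + 4 × 3 = 16, which is why the smaller relation is preferred as the outer loop.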
Chapter 19-131
Implementing JOIN

CSE
4701

Sort-Merge Join
 Physically Sort Relations R and S
 Scan R and S in the Sorted Order and Merge
 See Algorithm in Textbook
If Files are Not Physically Sorted, but Logically Ordered on the
Join Attributes (e.g., Through an Index), a Variation May be used
 Quite Inefficient Since Records are Scattered Over
the Disk
 Total number of block access =
bo + bi + bo log2 bo + bi log2 bi + bR
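The sort and merge phases can be sketched in memory as follows. This is a simplified illustration, not the textbook algorithm: both inputs are lists of dicts sorted with Python's sorted, and the function name and sample rows are hypothetical. Runs of equal join values on both sides are paired so that duplicates are handled.

```python
def sort_merge_join(r, s, key_r, key_s):
    """Equijoin by sorting both inputs and merging them in key order."""
    r = sorted(r, key=lambda t: t[key_r])
    s = sorted(s, key=lambda t: t[key_s])
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        a, b = r[i][key_r], s[j][key_s]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the run of s-tuples that share this join value.
            j_end = j
            while j_end < len(s) and s[j_end][key_s] == a:
                j_end += 1
            # Pair every r-tuple with that value against the whole run.
            while i < len(r) and r[i][key_r] == a:
                for k in range(j, j_end):
                    result.append({**r[i], **s[k]})
                i += 1
            j = j_end
    return result


# Hypothetical sample rows.
deposit = [
    {"customer": "Bill", "acct": 1},
    {"customer": "Ann", "acct": 2},
    {"customer": "Bill", "acct": 3},
]
customer = [
    {"customer": "Ann", "city": "Storrs"},
    {"customer": "Bill", "city": "Hartford"},
    {"customer": "Cy", "city": "Mystic"},
]
joined = sort_merge_join(deposit, customer, "customer", "customer")
```

Each input is scanned once during the merge, which corresponds to the bo + bi term of the cost; the two log terms account for sorting.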
Chapter 19-132
Implementing JOIN

CSE
4701

Hash Join
 Hash R and S Using the Same Hash Function
 If Hash File Can Be Memory-Resident, it is
Efficient and Easy to Implement
If Buffer Space is Insufficient, then Part of the Hash
File has to be on Disk
 Various Optimizations for this Case
 Hybrid Hash Join is Described in the Book
 Again - Biggest Problem is Overhead Associated with
Maintaining Hash Index Over Time
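A minimal sketch of the memory-resident case, assuming the hash table for the build relation fits in memory and using a Python dict as the hash file. The function name and sample rows are hypothetical; the hybrid hash join mentioned above is not shown.

```python
from collections import defaultdict


def hash_join(build, probe, key_build, key_probe):
    """Equijoin: hash the build relation, then probe it once per tuple."""
    table = defaultdict(list)
    for b in build:                       # build phase
        table[b[key_build]].append(b)
    result = []
    for p in probe:                       # probe phase
        for b in table.get(p[key_probe], []):
            result.append({**b, **p})
    return result


# Hypothetical sample rows.
customer = [
    {"customer": "Ann", "city": "Storrs"},
    {"customer": "Bill", "city": "Hartford"},
]
deposit = [
    {"customer": "Bill", "acct": 1},
    {"customer": "Ann", "acct": 2},
    {"customer": "Bill", "acct": 3},
    {"customer": "Cy", "acct": 4},
]
joined = hash_join(customer, deposit, "customer", "customer")
```

The smaller relation (Customer here) is used as the build side so that the table is more likely to stay in memory; each relation is read once.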
Chapter 19-133
Access Using Indices: Estimation of Costs
CSE
4701
Example: Given a bank database consisting of the following three
relation schemas:
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Consider the SQL query for the bank database:
Select account-number
From Deposit
Where bank-name = “BofA” and
customer-name = “Bill” and
balance > 1000;
Chapter 19-134
Heuristic Optimization

CSE
4701


Use Cascading of Selections Rule to Decompose,
Three Logical Query Plan Alternatives Are Obtained
Objective - Choose the “Best” Alternative in Terms of
Execution Time (Block Reads)
What should be the Focus in Select Order?
Alternative 1: π Account-Number (σ bank-name="BofA" (σ customer-name="Bill" (σ balance>1000 (Deposit))))
Alternative 2: π Account-Number (σ balance>1000 (σ customer-name="Bill" (σ bank-name="BofA" (Deposit))))
Alternative 3: π Account-Number (σ bank-name="BofA" (σ balance>1000 (σ customer-name="Bill" (Deposit))))
Chapter 19-135
Access Using Indices: Estimation of Costs
CSE
4701
Assumptions:
100 Different Banks (bank-name)
1000 Customers (on average) per bank
Balance could range from 0 to 10,000 dollars
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Select account-number
From Deposit
Where bank-name = “BofA” and
customer-name = “Bill” and
balance > 1000;
Chapter 19-136
Estimation of Cost of Access - Version 1
CSE
4701
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)

Account-Number
 bank-name = “BofA”
customer-name = “Bill”


balance > 1000;

Deposit

Recall Assumptions
 100 Banks
 1000 Customers/Bank
 0 to 10,000 dollars/account that are
Distributed Evenly Across Accts.
Tuples in Deposit? 100,000
What Does balance > 1000 do?
 Retrieve 90% of Accounts
 All Banks, All Customers
What Does customer-name = “bill” do?
 All Customers Named Bill
Regardless of the Bank
Is this a Good Strategy?
Chapter 19-137
Estimation of Cost of Access - Version 2
CSE
4701
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Account-Number

balance > 1000;
customer-name = “Bill”

 bank -name = “BofA”

Deposit


Recall Assumptions
 100 Banks
 1000 Customers/Bank
 0 to 10,000 dollars/account that are
Distributed Evenly Across Accts.
 Tuples in Deposit 100,000
What Does bank-name = “BofA” do?
 Retrieves 1000 Tuples for BofA on
Average
What Does customer-name = “bill” do?
 The Customer “Bill”
What Does balance > 1000 do?
Is this a Good Strategy?
Chapter 19-138
Estimation of Cost of Access - Version 3
CSE
4701
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)

Account-Number
 bank -name = “BofA”
balance > 1000;

customer-name = “Bill”

Deposit


Recall Assumptions
 100 Banks
 1000 Customers/Bank
 0 to 10,000 dollars/account that are
Distributed Evenly Across Accts.
 Tuples in Deposit 100,000
What Does customer-name = “bill” do?
 Retrieves 100 Tuples
 One per Bank
What Does balance > 1000 do?
 Do they Have Enough Money?
What Does bank-name= “BofA” do?
Is this a Good Strategy?
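The three versions can be compared by estimating how many tuples survive each select. The sketch below uses the stated assumptions (100,000 Deposit tuples, 100 banks, balances spread evenly so 90% exceed 1000) plus two assumptions of its own: a selectivity of 1/1000 for customer-name (one "Bill" per bank, as on the Version 3 slide) and independence between the three conditions. The names sel and intermediate_sizes are hypothetical.

```python
from fractions import Fraction

N_DEPOSIT = 100_000

# Estimated fraction of tuples that satisfies each condition.
sel = {
    "bank-name": Fraction(1, 100),        # 100 banks
    "customer-name": Fraction(1, 1000),   # assumed: one Bill per bank
    "balance": Fraction(9, 10),           # 90% of accounts exceed 1000
}


def intermediate_sizes(order, n=N_DEPOSIT):
    """Expected tuple count after each select, applied in the given order."""
    sizes = []
    for attr in order:
        n = n * sel[attr]
        sizes.append(n)
    return sizes


version1 = intermediate_sizes(("balance", "customer-name", "bank-name"))
version2 = intermediate_sizes(("bank-name", "customer-name", "balance"))
version3 = intermediate_sizes(("customer-name", "balance", "bank-name"))
```

Version 1 passes 90,000 tuples to the second select, Version 2 passes 1,000, and Version 3 passes 100, so under these estimates the most restrictive condition (customer-name) should be applied first. All three orders end with the same expected result size.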
Chapter 19-139
Join Strategies

CSE
4701
Several Factors Influence the Selection of an Optimal Join Strategy
 The Physical Order of Tuples in a Relation
 The Presence of Indices and the Type of Index
(Clustering or Nonclustering)
 The Cost of Computing a Temporary Index for the
Sole Purpose of Processing One Query
Example: Consider the natural join
Deposit ⋈ Customer
• nDeposit = 10,000.
• nCustomer = 200.
• 20 tuples fit in one block for both relations
• buffersize = 2 blocks
Chapter 19-140
Join Strategies: Block-Oriented Iteration
CSE
4701
Block-oriented Iteration:
• Process the relations on a per-block basis rather than on a per-tuple
basis
• Using this approach, a major saving in block accesses results
Example: Consider the natural join Deposit ⋈ Customer
nDeposit = 10,000.
nCustomer = 200.
20 tuples fit in one block for both relations
Case 1: outer loop: Deposit, inner loop: Customer
• reading Customer once for every block of Deposit tuples
requires (200/20) * (10,000/20) = 10 * 500 = 5000 block reads
• reading the Deposit relation requires 10,000/20 = 500 block reads
• the total cost in terms of block accesses is 5500
-> 5000 block accesses to Customer and
-> 500 block accesses to Deposit
Chapter 19-141
Join Strategies: Block-oriented Iteration
CSE
4701
Case 2: outer loop: Customer, inner loop: Deposit
• Reading Deposit once for every block of Customer
tuples requires (10,000/20) * (200/20) = 5000 block reads
• Reading the Customer relation
requires 200/20 = 10 block reads
• The total cost in terms of block accesses is 5010
==>5000 accesses to Deposit blocks and
==>10 accesses to Customer blocks
Case 3: If the Customer relation is small enough to fit in main
memory, our strategy requires only
==>500 blocks to read the Deposit relation and
==>10 blocks to read the Customer relation.
The total comes to 510 blocks
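The three cases can be checked with the block-access formula from the nested-loop slide, bo + bo/(nB−1) × bi, leaving out the result size bR. In this sketch the division is rounded up, which is an assumption; the function name is hypothetical; and Case 3 is modeled by giving the join 11 buffer blocks (10 for Customer plus 1 for Deposit).

```python
import math


def nested_loop_block_accesses(b_outer, b_inner, n_buffer, b_result=0):
    """Block accesses for block-oriented nested-loop join."""
    if n_buffer < 2:
        raise ValueError("need at least 2 buffer blocks")
    passes = math.ceil(b_outer / (n_buffer - 1))  # scans of the inner relation
    return b_outer + passes * b_inner + b_result


b_deposit = 10_000 // 20    # 500 blocks
b_customer = 200 // 20      # 10 blocks

case1 = nested_loop_block_accesses(b_deposit, b_customer, n_buffer=2)
case2 = nested_loop_block_accesses(b_customer, b_deposit, n_buffer=2)
case3 = nested_loop_block_accesses(b_customer, b_deposit, n_buffer=11)
```

The results reproduce the totals on the slides: 5500 for Case 1, 5010 for Case 2, and 510 for Case 3.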
Chapter 19-142
Query Execution Cost: Summary

CSE
4701



Access Cost to Secondary Storage
 Search for Data Block (Index)
 Read/write Index and Data Blocks
Storage Cost
 Index and Data Blocks
 Intermediate Files
Computation Cost
 Query Planning
 Record Search, Sort, Merge
 Actual Transaction/query Operations
Communications Cost
 Data Transfer Across a Network
Chapter 19-143
Access Plan

CSE
4701


Access Plan is a Concrete Query Processing Plan
which Presents a Detailed Strategy for Processing a
Query
The Main Cost Factors to Be Considered Include
 The Relational Operations to be Performed
 Indices to be Used
 The Order in Which Tuples are to be Accessed
 The Order in Which Operations are to be
Performed
Typical Focus is on Join and Optimizing its Execution,
Particularly when Multiple Tables are Involved
Chapter 19-144
Statistics

CSE
4701

The Following are Kept in the System Catalog for
Optimization Purposes
 File Parameters: Block Size
 Number of Tuples in Each Relation
 Size of Tuples
 Key Fields, Indices
 Number of Levels in an Index
 Highest Key, Lowest Key
 Number of Distinct Values (Maybe)
 Others: Frequency of Operations, Join Keys, Etc.
All DBMSs Keep the First Four, Many Keep All
Chapter 19-145
Join Ordering

CSE
4701

Given R ⋈ S ⋈ T ⋈ W Determine the Best Ordering
Alternative
 ((R ⋈ S) ⋈ T) ⋈ W
 (R ⋈ (S ⋈ T)) ⋈ W
 R ⋈ (S ⋈ (T ⋈ W))
 ((R ⋈ T) ⋈ S) ⋈ W
 ((R ⋈ W) ⋈ S) ⋈ T
 …
 (R ⋈ S) ⋈ (T ⋈ W)
Ordering is Critical to Arrive at “Best” Strategy for
Execution, Particularly as
 Number of Relations Increase
 Size of Relation (Tuples/Blocks) Increase
Chapter 19-146
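To see why join ordering becomes expensive as the number of relations grows, the sketch below enumerates every ordering of the kind listed on the Join Ordering slide. It assumes that swapping the two inputs of a join counts as a different ordering (it can change which relation is the outer loop); the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapes(leaves):
    """All binary parenthesizations of an ordered tuple of relations."""
    if len(leaves) == 1:
        return [leaves[0]]
    out = []
    for k in range(1, len(leaves)):
        for left in shapes(leaves[:k]):
            for right in shapes(leaves[k:]):
                out.append((left, right))
    return out


def join_orders(relations):
    """Every join tree over the relations, as nested pairs."""
    return [tree for perm in permutations(relations) for tree in shapes(perm)]


orders = join_orders("RSTW")
```

For four relations this yields 120 orderings, which equals (2(n−1))!/(n−1)! for n = 4; the same count is 1,680 for five relations and 30,240 for six, which is the combinatorial growth that exhaustive search has to face.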
Query Optimization Search Strategies

CSE
4701

Exhaustive Search
 “Optimal”
 Combinatorial Complexity in the Number of
Relations
Heuristics
 Not Optimal
 Group Common Sub-expressions
 Perform Selection, Projection First
 Replace a Join by a Series of Semi-joins
 Reorder Operations to Reduce Intermediate
Relation Size
 Optimize Individual Operations
Chapter 19-147
Query Optimization Timing Issues

CSE
4701


Static
 Compilation ==> Optimize Prior to the Execution
 Difficult to Estimate the Size of the Intermediate
Results ==> Error Propagation
 Can Amortize Over Many Executions
Dynamic
 Run Time Optimization
 Exact Information on the Intermediate Relation
Sizes
 Have to Reoptimize for Multiple Executions
Hybrid
 Compile Using a Static Algorithm
 If the Error in Estimate Sizes > Threshold,
Reoptimize at Run Time
Chapter 19-148
Concluding Remarks

CSE
4701



Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any
Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by
Making a Heuristic Guess of a Good Strategy for Each Query
 The Optimizer Considers Every Possible Strategy,
but Abandons a Strategy as Soon as it Determines that its Cost
is Greater than that of the Pre-chosen Strategy
 Thus Only a Few Competing Strategies Require
Full Analysis of the Cost
 The Overhead of Query Optimization is Reduced
Remember - Trade off in Optimization Time
 For PL - Optimization is Pre-Execution (Compile)
 For DB - Optimization is Part of Execution (Run)
Chapter 19-149