Compiler Concepts for Database Systems

CSE
4100
Prof. Steven A. Demurjian
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Way, Unit 2155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
CH10.1
Overview




Motivation and Background
Database System Architecture
 Exploring its Capabilities
 Focusing on Compiler-Related Concepts
Compile Time Issues in Database Systems
 The SQL Query Language
 Optimization Issues in Database Systems
 Typing
Runtime Issues in Database Systems
 Transaction Processing
 Execution for Complex Joins
CH10.2
Database System Architecture


What are the Various Components?
How do they Relate to Compilers?
CH10.3
How Does it Compare to Java Environment?
CH10.4
Database Concepts - Summary

Schema vs. Data
 Database: Structured Collection of Data Describing
Objects of the Universe of Discourse Being Modeled
 A Database Consists of Schema and Data
 Schema: Describes the Intension (Type) of Objects
 Data: Describes the Extension (Instances) of Objects

What is Schema w.r.t. Compilers? What is Data?
CH10.5
What is a DBMS?



A Database Management System (DBMS) is the
Generalized Tool that Facilitates the Management of and
Access to the Database
Main Functions:
 Defining a Database: Specifying Data Types,
Structures, and Constraints
 Constructing a Database: the Process of Storing the
Data Itself on Some Storage Medium
 Manipulating a Database: Function for Querying
Specific Data in the Database and Updating the
Database
What are the Analogies of Each of the Main Functions
w.r.t. Programming Languages and Compilers?
CH10.6
What is a DBMS?

Additional Functions:
 Interaction with File Manager
 So that Details Related to Data Storage and Access are
Removed From Application Programs

Integrity Enforcement
 Guarantee Correctness, Validity, Consistency

Security Enforcement
 Prevent Data From Illegal Uses

Concurrency Control
 Control the Interference Between Concurrent Programs
Recovery from Failure
 Query Processing and Optimization
Again – What are Relevant Compiler Concepts?


CH10.7
DBMS Architecture

DBMS Languages
 Data Definition Language (DDL)
 Data Manipulation Language (DML)
 From Embedded Queries or DB Commands Within a
Program
 “Stand-alone” Query Language


Host Language:
 DML Specification (e.g., SQL) is Embedded in a
“Host” Programming Language (e.g., Java, C++)
DBMS Interfaces
 Menu-Based Interface
 Graphical Interface
 Forms-Based Interface
 Interface for DBA (DB Administrator)
CH10.8
DBMS Architecture


Main DBMS Modules
 DDL Compiler
 DML Compiler
 Ad-hoc (Interactive) Query Compiler
 Run-time Database Processor
 Stored Data Manager
 Concurrency/Back-Up/Recovery Subsystem
DBMS Utility Modules
 Loading Routines
 Backup Utility
 System Catalog/data Dictionary
CH10.9
Components of a DBMS
CH10.10
ANSI/SPARC - Three Schema Architecture




External Data Schema (Users’ view)
Conceptual Data Schema (Logical Schema)
Internal Data Schema (Physical Schema)
What are the Programming Language Analogies?
CH10.11
Conceptual Schema



Describes the Meaning of Data in the Universe of
Discourse
 Emphasizes General, Conceptually Relevant,
and Often Time-Invariant Structural Aspects of the
Universe of Discourse
Excludes the Physical Organization and Access
Aspects of the Data
This could be a UML Design that Realizes a Set of
Classes (no data) or Java Class Declarations (APIs)
CH10.12
Conceptual Schema

Another Example – A Programming Language Level
Definition
CH10.13
External Schema




Describes Parts of the Information in the Conceptual
Schema in a form Convenient to a Particular User
Group’s View
Derived from the Conceptual Schema
What is the View of the Outside World in OO?
Akin to Public Interface
CH10.14
External Schema

Another Example
CH10.15
Internal Schema

Describes How the Information Described in the
Conceptual Schema is Physically Represented in a
Database to Provide the Overall Best Performance
CH10.16
Internal Schema

Another Example

This Corresponds to Data Typing and Layout in a
Compiler's Runtime Environment!
CH10.17
Unified Example of Three Schemas
CH10.18
Database Access Process

What Does This Access Process Resemble?

Akin to Runtime Execution Environment!
A More Complex Activation Process!

CH10.19
Metadata vs. Data

Recall Introspection and Reflection in Java where you
Can “Look” into the Class Definitions Themselves!
CH10.20
Data Independence




Ability of Application Programs to Remain
Unaffected by Changes in Irrelevant Parts of the
Conceptual Data Representation, Data Storage
Structure, and Data Access Methods
Invisibility (Transparency) of the Details of Entire
Database Organization, Storage Structure and Access
Strategy to the Users
Recall Software Engineering Concepts:
 Abstraction: the Details of an Application's
Components Can Be Hidden, Providing a Broad
Perspective on the Design
 Representation Independence: Changes Can Be
Made to the Implementation that have No Impact
on the Interface and Its Users
Realized in Today’s Modern PLs!
CH10.21
What are System Components?

How are these Similar to Compiler/PL Concepts?
CH10.22
Relational Model




Relational Model of Data Based on the Concept of a
Relation
Relation - a Mathematical Concept Based on Sets
Strength of the Relational Approach to Data
Management Comes From the Formal Foundation
Provided by the Theory of Relations
RELATION: A Table of Values
 A Relation May Be Thought of as a Set of Rows
 A Relation May Alternately be Thought of as a Set
of Columns
 Each Row of the Relation May Be Given an
Identifier
 Each Column Typically is Called by its Column
Name or Column Header or Attribute Name
CH10.23
Relational Tables - Rows/Columns/Tuples
CH10.24
Relational Database Definition
CREATE TABLE Student:
Name(CHAR(30)), SSN(CHAR(9)), Gpa(FLOAT(2))
CREATE TABLE Faculty:
Name(CHAR(30)), SSN(CHAR(9)), Ophone(CHAR(7))
CREATE TABLE Courses:
Course#(CHAR(6)), Title(CHAR(20)), Descrip(CHAR(100)),
PCourse#(CHAR(6))
CREATE TABLE Formats:
Section#(INTEGER(3)), Quarter(CHAR(10)), Campus(CHAR(15))
CREATE TABLE TakeorTeach:
SSN(CHAR(9)), Course#(CHAR(6)), Section#(INTEGER(3))
CREATE TABLE COfferings:
Course#(CHAR(6)), Section#(INTEGER(3))
Student(Name*, SSN, Gpa)
Faculty(Name*, SSN, Ophone)
Courses(Course#*, Title, Descrip, PCourse#*)
Formats(Section#*, Quarter, Campus)
TakeorTeach(SSN, Course#, Section#)
COfferings(Course#, Section#)
CH10.25
Relational Views

Two Views Derived From Prior Tables
 Student Transcript View
 Course Prerequisite View
CH10.26
SQL: Tuple Relational Calculus-Based
SQL is a Partial Example of a Tuple Relational
Language
 Simple Queries are all Declarative
 More Complex Queries are both Declarative and
Procedural (e.g., joins, nested queries)
 Find the names of employees working on the CAD/CAM
project
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = 'CAD/CAM')
 SQL Defines a Programming Language and Associated
Semantics for Usage and Processing

CH10.27
SQL Components




Data Definition Language (DDL)
 For External and Conceptual Schemas
 Views - DDL for External Schemas
Data Manipulation Language (DML)
 Interactive DML Against External and Conceptual
Schemas
 Embedded DML in Host PLs (EQL, JDBC, etc.)
Note: Separation of Definition (DDL) from Usage
(DML) – Is there Something Similar in PLs?
Others
 Integrity (Allowable Values/Referential)
 Transaction Control (Long-Duration and Batch)
 Authorization (Who can Do What When)
CH10.28
SQL DDL and DML

Data Definition Language (DDL) - Declarations
 Defining the Relational Schema - Relations,
Attributes, Domains - The Meta-Data
CREATE TABLE Student:
Name(CHAR(30)),SSN(CHAR(9)),GPA(FLOAT(2))
CREATE TABLE Courses:
Course#(CHAR(6)), Title(CHAR(20)),
Descrip(CHAR(100)), PCourse#(CHAR(6))

Data Manipulation Language (DML) - Code
 Defining the Queries Against the Schema
SELECT Name, SSN
From Student
Where GPA > 3.00
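
The DDL/DML split can be tried out directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in DBMS (SQLite is not the system used in these slides, its column types are simplified, and the two sample rows are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: declare the schema (the meta-data); no rows exist yet.
conn.execute("CREATE TABLE Student (Name CHAR(30), SSN CHAR(9), GPA FLOAT)")

# DML: populate and query the declared schema (rows are hypothetical).
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [("Ann", "111111111", 3.6), ("Bob", "222222222", 2.9)],
)
honor_roll = conn.execute(
    "SELECT Name, SSN FROM Student WHERE GPA > 3.00"
).fetchall()
```

The declaration is processed once, while the query can be rerun against whatever rows the table holds at that moment, much like a type declaration versus the code that uses it.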
CH10.29
Data Definition Language - DDL








A Pre-Defined set of Primitive Types
 Numeric
 Character-string
 Bit-string
 Additional Types
Defining Domains
Defining Schema
Defining Tables
Defining Views
Note: Each DBMS May have their Own DBMS
Specific Data Types - Is this Good or Bad?
What is this Similar to re. Different C++ Compilers?
These are Akin to PL Data Types!
CH10.30
DDL - Primitive Types

Numeric

INTEGER (or INT), SMALLINT

REAL, DOUBLE PRECISION

FLOAT(N) Floating Point with at Least N Digits
DECIMAL(P,D) (DEC(P,D) or NUMERIC(P,D))
have P Total Digits with D to Right of Decimal
Note that INTs and REALs are Machine Dependent
(Based on Hardware/OS Platform)
Again – this is Similar to PLs/Compilers and Code
Generation – Data Layout



CH10.31
DDL - Primitive Types


Character-String
 CHAR(N) or CHARACTER(N) - Fixed
 VARCHAR(N), CHAR VARYING(N), or
CHARACTER VARYING(N)
Variable with at Most N Characters
Bit-Strings
 BIT(N) Fixed
 VARBIT(N) or BIT VARYING(N)
Variable with at Most N Bits
CH10.32
DDL - Primitive Types









These Specialized Primitive Types are Used to:
 Simplify Modeling Process
 Include “Popular” Types
 Reduce Composite Attributes/Programming
DATE: YYYY-MM-DD
TIME: HH:MM:SS
TIME(I): HH:MM:SS.F...F - I Fractional-Second Digits
TIME WITH TIME ZONE: HH:MM:SS±HH:MM
TIMESTAMP:
YYYY-MM-DD HH:MM:SS.F...F{±HH:MM}
PLs also have Specialized Types!
Problem: Different Database Systems Sometimes
Implement these Types very Differently
This Impacts Portability!
CH10.33
What is a SQL Schema?



A Schema in SQL is the Major Meta-Data Construct
Supports the Definition of:
 Relation - Table with Name
 Attributes - Columns and their Types
 Identification - Primary Key
 Constraints - Referential Integrity (FK)
Two Part Definition
 CREATE Schema - Named Database or
Conceptually Related Tables
 CREATE Table - Individual Tables of the Schema
CH10.34
DDL-Create/Drop a Schema

Creating a Schema:
CREATE SCHEMA MY_COMPANY AUTHORIZATION
Demurjian;
Schema MY_COMPANY has Been Created and is
Owned by the User “Demurjian”
 Tables can now be Created and Added to Schema
Dropping a Schema:


DROP SCHEMA MY_COMPANY RESTRICT;
DROP SCHEMA MY_COMPANY CASCADE;

Restrict:
 Drop Operation Fails If Schema is Not Empty

Cascade:
 Drop Operation Removes Everything in the Schema
CH10.35
DDL - Create Tables
CREATE TABLE EMPLOYEE
( FNAME     VARCHAR(15)    NOT NULL ,
  MINIT     CHAR ,
  LNAME     VARCHAR(15)    NOT NULL ,
  SSN       CHAR(9)        NOT NULL ,
  BDATE     DATE ,
  ADDRESS   VARCHAR(30) ,
  SEX       CHAR ,
  SALARY    DECIMAL(10,2) ,
  SUPERSSN  CHAR(9) ,
  DNO       INT            NOT NULL ,
  PRIMARY KEY (SSN) ,
  FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN) ,
  FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) ) ;
CH10.36
DDL - Create Tables (continued)
CREATE TABLE DEPARTMENT
( DNAME         VARCHAR(15)  NOT NULL ,
  DNUMBER       INT          NOT NULL ,
  MGRSSN        CHAR(9)      NOT NULL ,
  MGRSTARTDATE  DATE ,
  PRIMARY KEY (DNUMBER) ,
  UNIQUE (DNAME) ,
  FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN) ) ;
CREATE TABLE DEPT_LOCATIONS
( DNUMBER    INT          NOT NULL ,
  DLOCATION  VARCHAR(15)  NOT NULL ,
  PRIMARY KEY (DNUMBER, DLOCATION) ,
  FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT(DNUMBER) ) ;
CH10.37
DDL - Create Tables (continued)
CREATE TABLE PROJECT
( PNAME      VARCHAR(15)  NOT NULL ,
  PNUMBER    INT          NOT NULL ,
  PLOCATION  VARCHAR(15) ,
  DNUM       INT          NOT NULL ,
  PRIMARY KEY (PNUMBER) ,
  UNIQUE (PNAME) ,
  FOREIGN KEY (DNUM) REFERENCES DEPARTMENT(DNUMBER) ) ;
CREATE TABLE WORKS_ON
( ESSN   CHAR(9)       NOT NULL ,
  PNO    INT           NOT NULL ,
  HOURS  DECIMAL(3,1)  NOT NULL ,
  PRIMARY KEY (ESSN, PNO) ,
  FOREIGN KEY (ESSN) REFERENCES EMPLOYEE(SSN) ,
  FOREIGN KEY (PNO) REFERENCES PROJECT(PNUMBER) ) ;
CH10.38
DDL - Create Tables with Constraints
CREATE TABLE EMPLOYEE
( ... ,
  DNO INT NOT NULL DEFAULT 1 ,
  CONSTRAINT EMPPK
    PRIMARY KEY (SSN) ,
  CONSTRAINT EMPSUPERFK
    FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN)
    ON DELETE SET NULL ON UPDATE CASCADE ,
  CONSTRAINT EMPDEPTFK
    FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER)
    ON DELETE SET DEFAULT ON UPDATE CASCADE ) ;
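
The effect of a referential action such as ON DELETE SET NULL can be observed on a cut-down table. This is a sketch only, assuming SQLite through Python's sqlite3 module (where foreign key enforcement must be switched on explicitly); the two-column EMPLOYEE table and its rows are simplifications made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("""
    CREATE TABLE EMPLOYEE (
        SSN      CHAR(9) PRIMARY KEY,
        SUPERSSN CHAR(9) REFERENCES EMPLOYEE(SSN)
                 ON DELETE SET NULL ON UPDATE CASCADE
    )""")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [("888665555", None), ("333445555", "888665555")],  # hypothetical rows
)

# Deleting the supervisor nulls out SUPERSSN in the rows that referenced it.
conn.execute("DELETE FROM EMPLOYEE WHERE SSN = '888665555'")
remaining = conn.execute("SELECT SSN, SUPERSSN FROM EMPLOYEE").fetchall()

# A row pointing at a supervisor that does not exist is rejected outright.
try:
    conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', '000000000')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The constraint is declared once in the DDL and then checked by the DBMS on every later insert, update, and delete.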
CH10.39
DDL - Create Tables with Constraints
CREATE TABLE DEPARTMENT
( ... ,
  MGRSSN CHAR(9) NOT NULL DEFAULT '888665555' ,
  ... ,
  CONSTRAINT DEPTPK
    PRIMARY KEY (DNUMBER) ,
  CONSTRAINT DEPTSK
    UNIQUE (DNAME) ,
  CONSTRAINT DEPTMGRFK
    FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN)
    ON DELETE SET DEFAULT ON UPDATE CASCADE ) ;



Is there an Equivalent to Keys and Constraints in PLs?
What Does Java Have Internally?
Constraints Facilitate Type Checking at Data Level!
CH10.40
Data Manipulation Language - DML




SQL has the SELECT Statement for Retrieving Info.
from a Database (Not Relational Algebra Select)
SQL vs. Formal Relational Model
 SQL Allows a Table (Relation) to have Two or
More Identical Tuples in All Their Attribute Values
 Hence, an SQL Table is a Multi-set (Sometimes
Called a Bag) of Tuples; it is Not a Set of Tuples
SQL Relations Can Be Constrained to Sets by
 PRIMARY KEY or UNIQUE Attributes
 Using the DISTINCT Option in a Query
Implied Processing and Procedural Semantics
 SQL Queries have Specific Semantics
 These Semantics Dictate Processing
 Includes Code Generation, Optimization, etc.
CH10.41
Interactive DML - Main Components

Select-from-where Statement Contains:
 Select Clause - Chosen Attributes/Columns
 From Clause - Involved Tables
 Where Clause - Constrain Tuple Values
 Tuple Variables - Distinguish Among Same Names
in Different Tables
 String Matching - Detailed Matching Including
 Exact
 Starts With
 Near

Ordering of Rows - Sorting Tuple Results
CH10.42
Recall Prior Schema
CH10.43
…and Corresponding DB Tables
Which Represent Tuples/Instances of Each Relation
CH10.44
…and Corresponding DB Tables
CH10.45
Simple SQL Queries

Query 0: Retrieve the Birthdate and Address of the
Employee whose Name is 'John B. Smith'.
SELECT BDATE, ADDRESS
FROM EMPLOYEE
WHERE FNAME='John' AND MINIT='B'
AND LNAME='Smith'


Which Row(s) are Selected?
Note: While All of these Next Queries are from
Chapter 8, Some are From “Earlier” Edition
CH10.46
Simple SQL Queries

Query 1: Retrieve Name and Address of all Employees
who work for the 'Research' Department
SELECT FNAME, MINIT, LNAME, ADDRESS, DNAME
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO

What Action is Being Performed? Join! Cartesian
Product!
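
Read naively, Query 1 is a Cartesian product followed by a filter. The sketch below spells that out in Python; the sample rows and the function name `research_employees` are hypothetical, and a real DBMS would normally pick a cheaper plan than this one.

```python
from itertools import product

# Hypothetical sample rows: (LNAME, DNO) and (DNAME, DNUMBER).
EMPLOYEE = [("Smith", 5), ("Wong", 5), ("Zelaya", 4)]
DEPARTMENT = [("Research", 5), ("Administration", 4)]

def research_employees(employees, departments):
    """Naive plan: form every (employee, department) pair, then filter."""
    result = []
    for (lname, dno), (dname, dnumber) in product(employees, departments):
        if dname == "Research" and dnumber == dno:  # the WHERE clause
            result.append((lname, dname))
    return result

# Number of pairs the naive plan has to look at.
pairs_examined = len(EMPLOYEE) * len(DEPARTMENT)
```

The pair count grows as the product of the two table sizes, which is why the later slides concentrate on making joins cheaper.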
CH10.47
Simple SQL Queries - Result
Theta Join on DNO=DNUMBER
CH10.48
Simple SQL Queries

Query 2: For Every Project in 'Stafford', list the Project
Number, the Controlling Dept. Number, and the Dept.
Manager's Last Name, Address, and Birthdate
SELECT PNUMBER, DNUM, LNAME, BDATE,ADDRESS
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRSSN=SSN AND
PLOCATION='Stafford'

In Q2, there are Two Join Conditions:
 The Join Condition DNUM=DNUMBER Relates a
Project to its Controlling Department
 The Join Condition MGRSSN=SSN Relates the
Controlling Department to the Employee who
Manages that Department
CH10.49
Query Results
SELECT PNUMBER, DNUM, LNAME, BDATE,ADDRESS
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRSSN=SSN AND
PLOCATION='Stafford'
CH10.50
Qualification of Attributes




In SQL, the Same Name for Two (or More) Attributes
is Allowed if Attributes are in Different Relations
In Those Cases, Query Must Qualify by Prefixing the
Relation Name to the Attribute Name
 EMPLOYEE.LNAME, DEPARTMENT.DNAME
Aliases: When Queries Must Refer to the Same
Relation Twice
 Alias is Akin to a Variable – Reference in PL!
 In These Situations, it is Considered that there are
Two Different Copies of the Same Relation
Let’s See Examples of Both Concepts
CH10.51
Attribute Qualification

Query 8: For Each Employee, Retrieve the Employee's
Name, and Name of his or her Immediate Supervisor
SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE E, EMPLOYEE S
WHERE E.SUPERSSN=S.SSN


E and S are aliases for the EMPLOYEE relation
 E Represents Employees in the Role of Supervisees
 S Represents Employees in the Role of Supervisor
Another Form of Query 8 is:
SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.SUPERSSN=S.SSN
CH10.52
Query Results
SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.SUPERSSN=S.SSN
CH10.53
Nested Queries




SQL SELECT Nested Query is Specified within
WHERE-clause of another Query (the Outer Query)
Query 1A: Retrieve the Name and Address of all
Employees who Work for the 'Research' Department
SELECT FNAME, LNAME, ADDRESS
FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research' )
Note: This Reformulates Earlier Query 1
The End Result is Essentially:
 Outer and Inner For/While Loops!
CH10.54
How Does Nested Query Work?





The Nested Query Selects the Number of the 'Research' Dept.
The Outer Query Selects an EMPLOYEE Tuple If Its
DNO Value Is in the Result of the Nested Query
IN represents Set Inclusion of Result Set
We Can Have Several Levels of Nested Queries
SELECT FNAME, LNAME, ADDRESS
FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research' )
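
The inner-then-outer evaluation can be written as two plain loops. This is only a sketch of the semantics, with hypothetical sample rows and a made-up function name `query_1a`; it says nothing about how a particular DBMS actually executes the query.

```python
# Hypothetical rows: EMPLOYEE as (LNAME, DNO), DEPARTMENT as (DNAME, DNUMBER).
EMPLOYEE = [("Smith", 5), ("Wallace", 4), ("Narayan", 5)]
DEPARTMENT = [("Research", 5), ("Administration", 4)]

def query_1a(employees, departments):
    # Inner query: collect the DNUMBERs of 'Research' departments.
    research_numbers = set()
    for dname, dnumber in departments:
        if dname == "Research":
            research_numbers.add(dnumber)
    # Outer query: keep each employee whose DNO is IN that result set.
    return [lname for lname, dno in employees if dno in research_numbers]
```

Because the inner query does not refer to the outer tuple, its result can be computed once and reused for every employee.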
CH10.55
NULLS in SQL Queries



SQL Allows Queries that Check if a value is NULL
(Missing or Undefined or not Applicable)
SQL uses IS or IS NOT to compare NULLs since it
Considers each NULL value Distinct from other NULL
Values, so Equality Comparison is not Appropriate
Query 18: Retrieve the names of all employees who do
not have supervisors.
SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE SUPERSSN IS NULL
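
The difference between `= NULL` and `IS NULL` is easy to check. A minimal sketch, assuming SQLite via Python's sqlite3 module and two invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SUPERSSN TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [("Borg", None), ("Wong", "888665555")],  # None is stored as NULL
)

# '=' against NULL evaluates to unknown, so no row qualifies.
with_equals = conn.execute(
    "SELECT LNAME FROM EMPLOYEE WHERE SUPERSSN = NULL"
).fetchall()

# IS NULL is the test that actually finds the missing values.
with_is_null = conn.execute(
    "SELECT LNAME FROM EMPLOYEE WHERE SUPERSSN IS NULL"
).fetchall()
```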

Why Would Such a Capability be Useful?
 Downloading/Crossloading a Database
 Promoting an Attribute to PK/FK
CH10.56
Aggregate Functions in SQL Queries

Query 19: Find Maximum Salary, Minimum Salary,
and Average Salary among all Employees
SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY)
FROM EMPLOYEE
Query 20: Find maximum and Minimum Salaries
among 'Research' Department Employees
SELECT MAX(SALARY), MIN(SALARY)
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO

What Does Query 22 Do?
SELECT COUNT(*)
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO
CH10.57
Grouping in SQL Queries





Query 24: For Each Department, Retrieve the DNO,
Number of Employees, and Their Average Salary
SELECT DNO, COUNT (*), AVG (SALARY)
FROM EMPLOYEE
GROUP BY DNO
EMPLOYEE tuples are Divided into Groups; each
group has the Same Value for Grouping Attribute DNO
COUNT and AVG functions are applied to each Group
of Tuples Separately
SELECT-clause Includes only the Grouping Attribute
and the Functions to be Applied on each Tuple Group
Are there PL Equivalents to these Data Oriented
Actions? Yes – in Specific APIs but Not PL Itself!
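
As one illustration of the library-level equivalent, Query 24 can be mimicked with a dictionary of groups. The rows and the function name `count_and_avg_by_dno` are hypothetical, and this is a sketch of the semantics only.

```python
from collections import defaultdict

# Hypothetical (DNO, SALARY) pairs.
EMPLOYEE = [(5, 30000), (5, 40000), (4, 25000), (4, 43000), (1, 55000)]

def count_and_avg_by_dno(rows):
    groups = defaultdict(list)
    for dno, salary in rows:          # GROUP BY DNO
        groups[dno].append(salary)
    # COUNT(*) and AVG(SALARY) are computed once per group.
    return {dno: (len(s), sum(s) / len(s)) for dno, s in groups.items()}
```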
CH10.58
Results of Query 24:

SELECT DNO, COUNT (*), AVG (SALARY)
FROM EMPLOYEE
GROUP BY DNO
CH10.59
INSERT SQL Queries


Add one or more Tuples to a Relation, with Attribute
values Listed in the Order Specified in the CREATE TABLE Statement
Update 1:
INSERT INTO EMPLOYEE
VALUES ('Richard','K','Marini', '653298653',
'30-DEC-52', '98 Oak Forest,Katy,TX', 'M',
37000,'987654321', 4 )

Another Form of Update 1:
INSERT INTO EMPLOYEE (FNAME, LNAME, SSN)
VALUES ('Richard','Marini','653298653')




All PK and FK Values must be Provided
Nulls are Allowed
DDL Constraints are Enforced
Another form of “Type Checking” at Instance Level
 This is Akin to Dynamic Type Checking!
CH10.60
DELETE SQL Queries

Sample Deletes Include
DELETE FROM EMPLOYEE
WHERE LNAME='Brown'
DELETE FROM EMPLOYEE
WHERE SSN='123456789'
DELETE FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research')
DELETE FROM EMPLOYEE


No. of Tuples Deleted Dependent on WHERE Clause
Referential Integrity (Type Checking!) is Enforced
During DELETE
CH10.61
UPDATE SQL Queries



Give all Employees in the 'Research' Dept. a 10% raise
UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research')
Modified SALARY Value Depends on the Original
SALARY Value in each Tuple
SALARY = SALARY *1.1 - Use PL Interpretation
CH10.62
Query Processing and Optimization

What are the Processing Issues for DBs?
 Database Applications of Today and Tomorrow
Require High Volumes of Information!
 Increase of Information Still Requires High
Performance!
 Throughput and Response Time
 Where's the Bottleneck in DBS?






CPU ??
Main Memory Size/Speed ??
Virtual Memory Limitations ??
Communications Bus ??
I/O Channel ??
How Does this Relate to Compilers/PLs?
CH10.63
90-10 Rule for Database Processing




Load (Transaction per second) vs.
Performance (Response Time of Transactions)
Processing of Large Amounts of Raw Data
 Addressed in Secondary Storage
 Staged to Main Memory
Identifying Relevant Data
 Large Amounts of Raw Data Discarded
 Focus on Data Most Likely to Contain Answers
 Possible Loss of CPU and Main Memory Cycles
This is Double Jeopardy!
 Load of DBS Must be Reduced
 Performance of DBS Degrades
CH10.64
90-10 Rule for Conventional DBS

[Figure: Application Programs, Operating System, and Database Functions over On-Line I/O and Disk I/O. Only 10% of Raw Data is Relevant; Only 10% of Relevant Data has Answers.]
Note: Naive Approach to Database Searching Often Occurs
(Little or No Indexing in Practice!)
CH10.65
Query Optimization Goal


Limit Costly Join Operation by Reducing Data to be
Scanned or that Participates in the Join
While Improving Selection and Projection can Help,
the Main Objective is Join
 In Worst Case - Cartesian Product
 Can Improve by Introducing Indices on the Join
Attributes (R.B and S.C) to Limit “Product”
 Can Further Improve by Sorting on the Join
Attributes (R.B and S.C)
 This Reduces Block Accesses by Limiting the Number
of Blocks that Must be Examined in a Join
 If B’s Values Range from 0 to 100 and C from 50 to 150,
only need to Compare from 50 to 100

Focus is on Reducing Costly Ops – Same as PL
Optimization to Replace * with +
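
Once both join attributes are sorted, matching values can be found in a single forward sweep. The following is a sketch of a merge-style equijoin over two sorted lists of keys; the function name `merge_join` is made up for the example, and real systems work block by block rather than on in-memory lists.

```python
def merge_join(r, s):
    """Equijoin of two key lists assumed already sorted (duplicates allowed)."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:
            i += 1                      # r[i] is too small to match anything left in s
        elif r[i] > s[j]:
            j += 1                      # s[j] is too small to match anything left in r
        else:
            key = r[i]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(r) and r[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end] == key:
                j_end += 1
            # Every key in one run pairs with every key in the other.
            out.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out

# B ranges over 0..100 and C over 50..150, as in the slide's example.
matches = merge_join(list(range(0, 101)), list(range(50, 151)))
```

Only the overlapping range 50 to 100 produces output, and neither list is ever rescanned.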
CH10.66
Query Processing

Internal Data Structure
 Memory Hierarchy
 Main Memory + Secondary Memory
 Information Must be Staged from Secondary to Primary
Memory for Database Operation

Sequential Search
 Brute force Approach

Direct Access (Indexed Search)
 Hash, Inverted Index file, Binary Search Tree, B-tree,
B+-tree
 Improves Selection by Focusing on Subset of Tuples
that are Involved in the Answer and Equijoin by Not
Having to Compare All Blocks in Two Relations
CH10.67
Algorithms for Database Query Operators


Largely Fall into Three Classes: Sorting-Based
Methods, Hash-Based Methods, Index-Based Methods
Such Algorithms are Divided into Three Degrees of
Difficulty and Cost (Limiting Factor is Size of Data)
 One Pass Algorithms
 Where Data is Only Read Once From Disk

Two-pass Algorithms
 Data is Read from Disk, Processed in Some Way,
Written Back to Disk, Read Again for Processing, etc.

Multi-pass Algorithms
 Where 3 or More Passes Are Required, i.e., Recursive
Generalization of the Two-pass Algorithms

Akin to Multiple Pass Compilers at Data Level
CH10.68
Database Join and Sort are External



Suppose that your DBS has 1,000 1K Blocks of
Memory Available for Performing Operations (e.g.,
Select, Project, Join, Union, Aggregation, etc.)
Suppose Sort R by R.B
 R Contains 5000 Blocks
 In order to Perform a Sort/Merge - You Must Use
an External Algorithm since all 5000 Blocks Cannot Fit
Into Memory at the Same Time
Suppose Join R (500 Blocks) and S (800 Blocks)
 Again - their Total Exceeds Memory - Hence you
Must Take an Approach that Compares One Block
of R with All Blocks of S, etc. (Slides 22,23)
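
The external sort of R can be sketched as two phases: sort memory-sized runs, then merge the runs. In the sketch below, Python lists stand in for disk blocks and files, and the name `external_sort` is hypothetical; a real DBMS would read and write the runs block by block.

```python
import heapq

def external_sort(blocks, memory_blocks):
    """Two-phase sort: sort memory-sized runs, then merge the runs.

    `blocks` is a list of blocks (each a list of keys); at most
    `memory_blocks` of them are held "in memory" at once.
    """
    runs = []
    for start in range(0, len(blocks), memory_blocks):
        chunk = blocks[start:start + memory_blocks]      # fits in memory
        runs.append(sorted(key for block in chunk for key in block))
    return runs, list(heapq.merge(*runs))                # k-way merge of sorted runs

# 5000 one-key blocks with room for 1000 blocks, as in the slide's example.
runs, ordered = external_sort([[k] for k in range(5000, 0, -1)], 1000)
```

With 5000 blocks and 1000 blocks of memory, the first phase produces five sorted runs, which the second phase merges in one more pass.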
CH10.69
Database Join and Sort are External






What’s True about Today’s DBMS Like Oracle?
Oracle Recommends 2 Gigabytes of Primary Memory
That 2 Gigabytes Must be Shared by:
 Operating System
 Other Applications Running on “Same” Server
(Web Server, etc.)
 Database Management Software
Even if there was 1.5 Gigabytes Available, Modern
DBs can Exceed that size Very Easily
Moreover,
 Cartesian Product Could Exceed Available Mem.
 Join Could Require External Approach Since All
Tables Involved in Join Can’t fit in 1.5 Gigabytes
External Sorting/Block Oriented Processing is Norm
CH10.70
The System Catalog

Store the Meta Information that Describes Each
Database, Including a Description of
 Conceptual Database Schema (Logical Data
Model)
 Relations, Attributes, Keys, Indexes, Views
Internal Schema
 External Schema
Store Information Needed by Specific DBMS Modules
 Query Optimization Module
 Security and Authorization


CH10.71
Example of Catalog Information
CH10.72
Relational DBMS Catalog


All Metadata Stored as Relations
Example of Metadata Tables are:
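
One concrete instance, offered as an assumption rather than the catalog shown in the lecture: SQLite keeps its catalog in a relation named `sqlite_master`, which can be queried like any other table. The PROJ table below is a cut-down, hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROJ (PNO INTEGER PRIMARY KEY, PNAME VARCHAR(15))")

# The catalog is queried with an ordinary SELECT, like any other relation.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(PROJ)")]
```

This is the database counterpart of reflection: the program asks the system to describe its own declarations.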
CH10.73
Uses of System Catalog


DDL Compilers:
 Correct Definition of
Relations and Attributes
DML (Query) Compiler:
 DML Parser
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = 'CAD/CAM')
 Guided by the Description of DML Syntax and the
Schema Information in the Catalog, Generates a Query
Tree after Parser

Optimizer
 Generates an Access Path that is Relatively Optimal for
Executing a Query/DML Command, by Accessing the
Database Structure Information (Schemas), and
Mapping High-level SQL Queries Into Low-level File
Access Commands
CH10.74
Revisit Typical Database Processing
[Figure: Typical Database Processing. Components: Pre-Processing (Parser/Lexical, Optimizer/Views), High-Level Processing, Low-Level Processing, Concurrency Control, Disk I/O, Recovery, and Post-Processing. Labeled Flows: User Transaction, Parsed and Optimized User Trans., Lock Request/Response, I/O Request, Results, Errors, Response to User.]
CH10.75
Typical Database Processing

Pre-Processing
 Actions Taken Upon Receipt of a Query from User
 SQL Query via Query Tool or JDBC Call
 “Compilation” of DB Query
 Check Syntax, Semantics, Optimize, Develop Run-Time Strategy (Similar to PL Compilation)
 Query is Translated to DB Transaction
 A Transaction Contains Multiple DB Operations
 Transaction has Explicit Order of Operations

Database Transaction Must Succeed or Fail
 There is no Intermediate State
 Completely Executed and Committed or
Aborted at any Point and Undone

New State or Previous State of DB
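
The all-or-nothing behavior can be demonstrated with a two-step transfer. This is a sketch assuming SQLite through Python's sqlite3 module; the ACCOUNT table, its CHECK constraint, and the `transfer` function are all hypothetical and not part of the lecture's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ACCOUNT (ID INTEGER PRIMARY KEY, "
    "BALANCE INTEGER CHECK (BALANCE >= 0))"
)
conn.executemany("INSERT INTO ACCOUNT VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

def transfer(amount):
    """Two updates that must both happen or both be undone."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE ACCOUNT SET BALANCE = BALANCE + ? WHERE ID = 2", (amount,))
            conn.execute(
                "UPDATE ACCOUNT SET BALANCE = BALANCE - ? WHERE ID = 1", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False  # the first update was rolled back as well

def balances():
    return conn.execute("SELECT BALANCE FROM ACCOUNT ORDER BY ID").fetchall()
```

When the second update would overdraw account 1, the credit already applied to account 2 is undone too, so the database is left in its previous state.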
CH10.76
Typical Database Processing

High-Level Processing
 Enqueue Transaction from Pre-Processing
 Transaction Must Wait for “Earlier” Transactions
 Remember - Shared DB State!

Request Locks from Concurrency Control
 All Locks Before Proceeding vs. Locks as Needed
 Avoid Deadlock and Livelock

Release Locks
 As Use of Data Completes to Increase Availability
 What Happens if Failure of Later Step in Transaction

Dequeue Transaction
 Completes Transaction Processing
 Return “Result” to Post-Processing
CH10.77
Typical Database Processing

Low-Level Processing
 Enqueue Transaction - Do Actual DB Operations
 Request Locks - Lower Granularity Level
 Issue I/Os - Based on Operations to Access
“Correct” and “Relevant” DB Records
 Process Returned Data - Aggregation, Sorting
 Integrity Checks: Do I/D/U Satisfy Constraints?
 Security Checks: Is DB R/I/D/U Allowed?
 Logging for Recovery - Commit the Transaction
 Release Locks - Available to Others
 Dequeue Transaction - Return Results to High-Level Processing
 Note: The Multiple Operations of Each DB
Transaction All Must be Successful
CH10.78
Typical Database Processing

Post Processing
 Collection of Results
 May be Passed Portions of Results as they Complete
 For Example, Sorted Blocks of Data that are then
Merged in a Final Step

Aggregation Operations
 May be Passed Aggregate Intermediate Results
 Sum for Different Departments to be Totaled

Security Checks
 Last Step Filtering to Ensure Only Allowed Data is
Returned
 May Execute Query but Only see Aggregate Result

Send Results to User
CH10.79
Typical Database Processing

Concurrency Control
 Control Access to Information
 Data and Metadata
 Prevent Simultaneous Updates
 Ensure Database Always Correct and Consistent
 Serial Schedule vs. Serializable Transaction
 Two Types
 Pessimistic - Locking-Based - Assume Collisions Will
Occur - e.g., Peoplesoft Course Registration
 Optimistic - Time-Based - Fix Problems After the Fact - e.g., ATM Machines

CC Manages Locks at Different Granularity Levels
(Table, Attribute, View, Tuple, Metadata, etc.)
CH10.80
Typical Database Processing

Disk I/O
 Performs the Actual Disk I/O for Read/Writes
 Block Oriented Activity
 Maintain Queue of All I/O Requests
 Ordering is Critical
 Related to Concurrency Control and Consistency




Single DB Transactions can have Multiple DB
Operations
Disk I/Os for Different Operations at Different
Times
High and Low Level Processing will Determine
What Operations Needed When
Disk I/O - Relatively “Dumb”
CH10.81
Typical Database Processing

Recovery
 Tightly Tied to DB Transaction Concept
 Transactions Must be:
 Atomic - Happens or Doesn’t
 Durable - Once Committed, Results Survive Failure
 Consistent - Follows Protocol/Correct DB State

When Failure Occurs, Can we:
 Recover to a Correct “Earlier” State
 Reconcile all “Active” Transactions that were Executing
at Failure Time


Involves Logging of Database Actions
Objective: High Availability and Reliability
CH10.82
Query Optimization


Not Really Optimizing, but Planning to Avoid Bad
Execution Strategies
Models
 Heuristics-Based
 Apply Transformation Rules According to a General
Strategy
 Focus on Relational Algebra that Underlies Each Query
 Improve the “Order” of Relational Operations

Cost-Based
 Minimize a Cost Function
I/O Cost + CPU Cost
 Subject to a Set of Constraints
CH10.83
Query Processing Methodology
[Figure: Query Processing Flow. High-level Calculus-based Query -> Query Preprocessing -> Algebraic Query (a Tree Structure) -> Query Optimization -> Execution Schedule (File Access Plan). The External, Logical, and Internal Schemas are Consulted Along the Way.]
CH10.84
Refute Incorrect Queries

Example:
E(ENAME, ENO), P(JNO, JNAME), W(ENO, JNO, DUR)
SELECT ENAME, JNAME
FROM E, P, W
WHERE DUR > 27 AND DUR < 25


Incorrect
 Disjoint Components are Useless
 Multiple Relations, Missing Joins, may not be
incorrect, but may indicate Cartesian product
Contradictory
 Qualification can not be Satisfied by any Tuple
 DUR > 27 AND DUR < 25
CH10.85
Simplification


Why Simplify?
 The Simpler the Query, the Less Work there is and
the Better the Performance
How? Use transformation rules
 Elimination of Redundancy
 Idempotency Rules
 Application of Transitivity
 Use of Integrity Rules

Example
 x > a and x > b
 DUR > 27 AND DUR > 25
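
Both redundancy elimination and the detection of contradictory qualifications can be sketched for the simple case of range tests on one attribute. The representation of predicates as `(operator, constant)` pairs and the function name `simplify_bounds` are assumptions made for this example, and values are treated as real numbers.

```python
def simplify_bounds(predicates):
    """Collapse a conjunction of ('>', c) / ('<', c) tests on one attribute.

    Returns (lower, upper), with None for a missing bound, or None when the
    conjunction is contradictory (no value can satisfy it).
    """
    lower = upper = None
    for op, const in predicates:
        if op == ">":
            lower = const if lower is None else max(lower, const)
        elif op == "<":
            upper = const if upper is None else min(upper, const)
        else:
            raise ValueError(f"unsupported operator: {op}")
    if lower is not None and upper is not None and lower >= upper:
        return None  # e.g. DUR > 27 AND DUR < 25
    return lower, upper
```

For instance, `simplify_bounds([(">", 27), (">", 25)])` keeps only the tighter test DUR > 27, while `simplify_bounds([(">", 27), ("<", 25)])` reports a contradiction, so the query can be refuted without touching any data.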
CH10.86
Restructuring
Convert Relational Calculus to
Relational Algebra
 Make use of Query Trees
 Example
Find the names of employees
other than J. Doe who worked
on the CAD/CAM project for
either 1 or 2 years.

SELECT ENAME
FROM E, W, P
WHERE E.ENO=W.ENO
AND W.JNO=P.JNO
AND E.ENAME<>'J. Doe'
AND P.JNAME='CAD/CAM'
AND (W.DUR=12 OR W.DUR=24)

Query Tree (Root First):
Project: ENAME
Select: (DUR=12 OR DUR=24) AND JNAME='CAD/CAM' AND ENAME<>'J. DOE'
Join on JNO: P with (Join on ENO: W with E)
CH10.87
Query Optimization Objectives





Improving Performance
Arriving at a Query Plan of Execution
Analyzing the Relational Algebra Query
 Replace Costly Operations
 Do Selections and Projections Early
Optimization Heuristics for the Relational Algebra
 Performing Selection and Projection Before Join
 Combining Several Selections Over a Single
Relation Into One Selection
 Find Common Subexpressions
 Algebraic Rewriting/transformation Rules
General Transformation Rules for Relational Algebra
(Equivalence-preserving Algebraic Rewriting Rules)
CH10.88
Query Optimization: An Example

Why is it important?
SELECT ENAME
FROM E, W
WHERE E.ENO = W.ENO
AND W.RESP = "Manager"

Strategy 1
 $\Pi_{ENAME}(\sigma_{RESP=\text{"Manager"} \land E.ENO=W.ENO}(E \times W))$

Strategy 2
 $\Pi_{ENAME}(E \bowtie_{ENO} (\sigma_{RESP=\text{"Manager"}}(W)))$
CH10.89
Cost of Alternatives
Assume :
 card(E) = 4,000; card(W)=10,000
 10% of tuples in W satisfy RESP="Manager"
(selection generates 1,000 tuples)
 Execution time Proportional to the Sum of the
Cardinalities of the Temporary Relations
 Searching is Done by Sequential Scanning

Strategy 1
 Cartesian prod. (4,000 * 10,000) = 40,000,000
 Search over all = 40,000,000
 Total = 80,000,000
Strategy 2
 Selection over W = 10,000
 Join (4,000 * 1,000) = 4,000,000
 Total = 4,010,000
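The arithmetic above can be reproduced directly from the stated assumptions. This is a sketch of the slide's simplified cost model (cost equals the sum of the cardinalities of the temporary relations), not of how any real optimizer estimates cost. The constant and function names are illustrative.

```python
CARD_E = 4_000            # tuples in E
CARD_W = 10_000           # tuples in W
MANAGER_PERCENT = 10      # share of W tuples with RESP = "Manager"

def strategy_1_cost():
    cartesian = CARD_E * CARD_W      # temporary relation E x W
    search = cartesian               # sequential scan over that relation
    return cartesian + search

def strategy_2_cost():
    selection = CARD_W                            # one sequential scan of W
    managers = CARD_W * MANAGER_PERCENT // 100    # 1,000 surviving tuples
    join = CARD_E * managers                      # each E tuple against each manager tuple
    return selection + join

cost_1, cost_2 = strategy_1_cost(), strategy_2_cost()
print(f"Strategy 1: {cost_1:,}")   # Strategy 1: 80,000,000
print(f"Strategy 2: {cost_2:,}")   # Strategy 2: 4,010,000
```

Under this model, applying the selection before the join makes Strategy 2 roughly 20 times cheaper.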
CH10.90
General Query Optimization Strategy



Perform Selections Early
 Yields Smaller Intermediate Results
 Direct Impact on Subsequent Join/Cartesian Prod.
Combine Selections with a Prior Cartesian Product into
a Theta or Equi Join
 Join is a Cheaper Operation
Combine (Cascade) Selections and Projections
 $\Pi_{A \cap B}(\Pi_{B}(R)) \equiv \Pi_{A \cap B}(R)$
 $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_1 \land p_2}(R)$
This Results in One Pass Instead of Two over Table
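The one-pass claim for cascaded selections can be illustrated with a small in-memory relation. The rows, the two predicates, and the function names below are made up for illustration. A list of dictionaries stands in for a table.

```python
W = [
    {"ENO": 1, "DUR": 12, "RESP": "Manager"},
    {"ENO": 2, "DUR": 24, "RESP": "Analyst"},
    {"ENO": 3, "DUR": 36, "RESP": "Manager"},
]

def p1(row):
    return row["RESP"] == "Manager"

def p2(row):
    return row["DUR"] > 20

def two_passes(rows):
    """Apply p2, materialize the result, then scan it again for p1."""
    inner = [row for row in rows if p2(row)]
    return [row for row in inner if p1(row)]

def one_pass(rows):
    """Test p1 AND p2 together during a single scan."""
    return [row for row in rows if p1(row) and p2(row)]

print(two_passes(W) == one_pass(W))
print(one_pass(W))
```

Both functions return the same tuples. The combined form avoids building and rescanning the intermediate list.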
CH10.91
General Query Optimization Strategy



Identify Common Subexpressions
 Compute Once and Store
 use Stored Version for Subsequent Times
 Often Useful When Views are Employed
Preprocess Data via Sorts and Indexes
 Speeds up Searches and Joins by Limiting Scope
Evaluate and Assess Different Options
 For Cartesian Product, Use Smaller Relation for
Comparison
 Use System Catalog (Meta-data) to Effect Order in
Query Execution Plan
CH10.92
Relational Algebra Transformations

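Cascade of Selection
 $\sigma_{p_1 \land p_2 \land \dots \land p_n}(R) \equiv \sigma_{p_1}(\sigma_{p_2}(\dots(\sigma_{p_n}(R))\dots))$
Commutativity of Selection
 $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_2}(\sigma_{p_1}(R))$
 $\sigma_{p_1 \lor p_2}(R) \equiv \sigma_{p_1}(R) \cup \sigma_{p_2}(R)$
Cascade of Projection
 $\Pi_{A_1}(\Pi_{A_2}(\dots(\Pi_{A_n}(R))\dots)) \equiv \Pi_{A_1}(R)$ if $A_1 \subseteq A_2 \subseteq \dots \subseteq A_n$
Commuting Selection with Projection
 $\Pi_{A_1,A_2,\dots,A_n}(\sigma_{p}(R)) \equiv \sigma_{p}(\Pi_{A_1,A_2,\dots,A_n}(R))$ when p Uses Only Attributes Among $A_1, \dots, A_n$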
CH10.93
Relational Algebra Transformations


Commutativity of Theta Join and Cartesian Product
 $R \bowtie_{A} S \equiv S \bowtie_{A} R$
 $R \times S \equiv S \times R$
Commuting Selection with Theta Join (Cartesian)
 $\sigma_{p(A)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times S$
(A defined on R only)
 $\sigma_{p(A) \land p(B)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times (\sigma_{p(B)}(S))$
(A defined on R, B defined on S)
Also Holds for Theta Join as Well
Commuting Projection with Theta Join (Cartesian)
 $\Pi_{C}(R \times S) \equiv \Pi_{A}(R) \times \Pi_{B}(S)$ where $A \cup B = C$
 A are Attributes in C for R and B are Attributes in C for S
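The rules for commuting selection and projection with the Cartesian product can be checked on toy data. This is a sketch with made-up relations R and S. The helpers `cartesian`, `select`, and `project` are hypothetical stand-ins for the algebra operators, and `project` keeps duplicates.

```python
from itertools import product

R = [{"A": 1, "X": "r1"}, {"A": 5, "X": "r2"}, {"A": 9, "X": "r3"}]
S = [{"B": 2, "Y": "s1"}, {"B": 7, "Y": "s2"}]

def cartesian(left, right):
    return [{**l, **r} for l, r in product(left, right)]

def select(rows, pred):
    return [row for row in rows if pred(row)]

def project(rows, attrs):
    return [{a: row[a] for a in attrs} for row in rows]

def p_a(row):             # predicate defined on R only
    return row["A"] > 3

def p_b(row):             # predicate defined on S only
    return row["B"] < 5

# Selection applied after the product versus pushed onto each operand.
late = select(cartesian(R, S), lambda row: p_a(row) and p_b(row))
early = cartesian(select(R, p_a), select(S, p_b))
print(late == early)

# Projection on C = {A, B} applied after the product versus on each operand.
proj_late = project(cartesian(R, S), ["A", "B"])
proj_early = cartesian(project(R, ["A"]), project(S, ["B"]))
print(proj_late == proj_early)
```

Both comparisons hold here. The early form of the selection builds a product of 2 tuples instead of 6 before filtering.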


CH10.94
Relational Algebra Transformations



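Commutativity of Set Operations
 $R \cup S \equiv S \cup R$
 $R \cap S \equiv S \cap R$
Associativity of Set Operations
 $(R \cup S) \cup T \equiv R \cup (S \cup T)$
 $(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)$
 $(R \times S) \times T \equiv R \times (S \times T)$
 $(R \cap S) \cap T \equiv R \cap (S \cap T)$
Commuting Select with Set Operations
 $\sigma_{p(A_i)}(R \cup T) \equiv \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(T)$
where Ai is defined on both R and T
 $\sigma_{p(A_i)}(R \cap T) \equiv \sigma_{p(A_i)}(R) \cap \sigma_{p(A_i)}(T)$
where Ai is defined on both R and T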
CH10.95
Relational Algebra Transformations
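11. Commuting Projection with Union
 $\Pi_{C}(R \bowtie_{q(A_j,B_k)} S) \equiv \Pi_{A'}(R) \bowtie_{q(A_j,B_k)} \Pi_{B'}(S)$
 $\Pi_{C}(R \times S) \equiv \Pi_{A'}(R) \times \Pi_{B'}(S)$
where R[A] and S[B], $C = A' \cup B'$, $A' \subseteq A$, $B' \subseteq B$
12. Converting Selection/Cartesian Into Theta Join
 $\sigma_{C}(R \times S) \equiv R \bowtie_{C} S$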

CH10.96
Heuristic Optimization: Example
Canonical query tree at the end of query preprocessing phase
E(ENAME, ENO)
P(JNO, JNAME)
W(ENO, JNO, DUR)
$\Pi_{ENAME}(\sigma_{(DUR=12 \lor DUR=24) \land JNAME=\text{"CAD/CAM"} \land ENAME \neq \text{"J. Doe"}}(P \bowtie_{JNO} (W \bowtie_{ENO} E)))$
CH10.97
Heuristic Optimization: Example
Use cascading of selections rule to decompose selections
$\Pi_{ENAME}(\sigma_{DUR=12 \lor DUR=24}(\sigma_{JNAME=\text{"CAD/CAM"}}(\sigma_{ENAME \neq \text{"J. Doe"}}(P \bowtie_{JNO} (W \bowtie_{ENO} E)))))$
CH10.98
Heuristic Optimization: Example
Push selection down using commutativity of selection over join
$\Pi_{ENAME}(\sigma_{DUR=12 \lor DUR=24}(\sigma_{JNAME=\text{"CAD/CAM"}}(P \bowtie_{JNO} (W \bowtie_{ENO} \sigma_{ENAME \neq \text{"J. Doe"}}(E)))))$
CH10.99
Heuristic Optimization: Example
Push selection down using commutativity of selection over join
$\Pi_{ENAME}(\sigma_{DUR=12 \lor DUR=24}(\sigma_{JNAME=\text{"CAD/CAM"}}(P) \bowtie_{JNO} (W \bowtie_{ENO} \sigma_{ENAME \neq \text{"J. Doe"}}(E))))$
CH10.100
Heuristic Optimization: Example
Push selection down
$\Pi_{ENAME}(\sigma_{JNAME=\text{"CAD/CAM"}}(P) \bowtie_{JNO} (\sigma_{DUR=12 \lor DUR=24}(W) \bowtie_{ENO} \sigma_{ENAME \neq \text{"J. Doe"}}(E)))$
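The push-down steps in this example can be sketched as a small tree rewrite. This is a minimal illustration, not the algorithm of any particular system. Plan nodes are plain tuples, the functions `attrs`, `sink`, `push_down`, and `show` are hypothetical, and the schema W(ENO, JNO, DUR) is assumed so that W can join with P on JNO.

```python
def attrs(node):
    """Set of attribute names available in the output of a plan node."""
    kind = node[0]
    if kind == "rel":                        # ("rel", name, attributes)
        return set(node[2])
    if kind == "select":                     # ("select", label, needed, child)
        return attrs(node[3])
    if kind == "project":                    # ("project", attributes, child)
        return set(node[1])
    return attrs(node[2]) | attrs(node[3])   # ("join", attribute, left, right)

def sink(label, needed, node):
    """Place one selection as deep in the tree as its attributes allow."""
    kind = node[0]
    if kind == "join":
        _, on, left, right = node
        if needed <= attrs(left):
            return ("join", on, sink(label, needed, left), right)
        if needed <= attrs(right):
            return ("join", on, left, sink(label, needed, right))
    elif kind == "select":                   # selections commute, so pass through
        return ("select", node[1], node[2], sink(label, needed, node[3]))
    return ("select", label, needed, node)

def push_down(node):
    """Rewrite bottom-up, sinking every selection below the joins where possible."""
    kind = node[0]
    if kind == "rel":
        return node
    if kind == "project":
        return ("project", node[1], push_down(node[2]))
    if kind == "join":
        return ("join", node[1], push_down(node[2]), push_down(node[3]))
    _, label, needed, child = node
    return sink(label, needed, push_down(child))

def show(node):
    """Render a plan as a one-line expression."""
    kind = node[0]
    if kind == "rel":
        return node[1]
    if kind == "project":
        return f"PROJECT[{','.join(node[1])}]({show(node[2])})"
    if kind == "join":
        return f"({show(node[2])} JOIN[{node[1]}] {show(node[3])})"
    return f"SELECT[{node[1]}]({show(node[3])})"

E = ("rel", "E", ("ENAME", "ENO"))
P = ("rel", "P", ("JNO", "JNAME"))
W = ("rel", "W", ("ENO", "JNO", "DUR"))
joins = ("join", "JNO", P, ("join", "ENO", W, E))
tree = ("project", ("ENAME",),
        ("select", "DUR=12 OR DUR=24", {"DUR"},
         ("select", 'JNAME="CAD/CAM"', {"JNAME"},
          ("select", 'ENAME<>"J. Doe"', {"ENAME"}, joins))))

optimized = push_down(tree)
print(show(optimized))
```

Each selection ends up directly above the one relation that supplies its attribute, which matches the tree reached after the last push-down step.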
CH10.101
Heuristic Optimization: Example
Do early projection
$\Pi_{ENAME}(\Pi_{JNO}(\sigma_{JNAME=\text{"CAD/CAM"}}(P)) \bowtie_{JNO} \Pi_{JNO,ENAME}(\Pi_{JNO,ENO}(\sigma_{DUR=12 \lor DUR=24}(W)) \bowtie_{ENO} \Pi_{ENO,ENAME}(\sigma_{ENAME \neq \text{"J. Doe"}}(E))))$
CH10.102
Heuristic Optimization: Example
Identify subtrees that can be implemented in one algorithm
$\Pi_{ENAME}(\Pi_{JNO}(\sigma_{JNAME=\text{"CAD/CAM"}}(P)) \bowtie_{JNO} \Pi_{JNO,ENAME}(\Pi_{JNO,ENO}(\sigma_{DUR=12 \lor DUR=24}(W)) \bowtie_{ENO} \Pi_{ENO,ENAME}(\sigma_{ENAME \neq \text{"J. Doe"}}(E))))$
CH10.103
Heuristic Optimization: A Second Example
What is the Final Step?
Combine Select and Cartesian Product
Result: Equijoins!
[Figure: query tree over Books, Loans, and Borrower with root projection on Title; two Cartesian products (X); selections Books.LC_No = Loans.LC_No, Borrower.Card_No = Loans.Card_No, and a selection on Loans comparing Date with 1/1/88; projections on Books.LC_No and Title, on Loans.LC_No, on Loans.Card_No, and on Borr.Card_No]
CH10.104
Cost-Based Optimization


Reduce Defined Cost of Executing Queries
What is Involved in the Cost of Executing a Query?
 Access Cost to Secondary Storage
 Search for Data Block (Index)
 Read/Write Index and Data Blocks

Storage Cost
 Index and Data Blocks
 Intermediate Files

Computation Cost
 Query Planning - Optimization Effort
 Record Search, Sort, Merge
 Actual Transaction/Query Operations

Communications Cost
 Transfer of Results to the User
CH10.105
Complexity of Relational Operations



Assuming
 Relations of Cardinality n
 Sequential Scan of Data in each Relation
Complexity of Each Operation is Indicated
Avoid Cartesian Product at All Costs!

Operation                                        Complexity
Select, Project (w/o duplicate elimination)      O(n)
Project (with duplicate elimination), Group      O(n log n)
Join, Division, Set Operators                    O(n log n)
Cartesian Product                                O(n^2)
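One way to read the O(n log n) entry for Join is as the cost of a sort-based join: sort both inputs on the join attribute, then merge them in a single pass. The following is a sketch assuming an equality join on one attribute and in-memory lists of dictionaries. The function name `sort_merge_join` and the sample rows are illustrative, and the bound holds only when few tuples share the same key, since matching runs are paired exhaustively.

```python
def sort_merge_join(left, right, key):
    """Equi-join two lists of dicts on `key` by sorting both and merging."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right tuples that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out

E = [{"ENO": 1, "ENAME": "A"}, {"ENO": 2, "ENAME": "B"}]
W = [{"ENO": 2, "DUR": 12}, {"ENO": 1, "DUR": 24}, {"ENO": 2, "DUR": 36}]
joined = sort_merge_join(E, W, "ENO")
print(joined)
```

By contrast, forming the Cartesian product first and filtering afterwards touches every pair of tuples, which is the O(n^2) case the slide warns against.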
CH10.106
Cost-Based Optimization





To Understand Cost-Based Operations, we Must Focus
on Implementation Strategy of:
 Select
 Project
 Join
For Select and Project - There is a Fixed Cost that we
Must Live With
For Join
 Implementation Strategy
 Different Join Strategies
Objective:
 Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra Heuristic
Optimization Can Complement One Another
CH10.107
Optimization Summary




Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any
Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by
Making a Heuristic Guess of Strategy for Each Query
 The Optimizer Considers Every Possible Strategy,
but Abandons a Strategy as Soon as it Determines its Cost is
Greater than that of the Pre-chosen Strategy
 Thus Only a Few Competing Strategies Require
Full Analysis of the Cost
 The Overhead of Query Optimization is Reduced
Remember - Trade off in Optimization Time
 For PL - Optimization is Pre-Execution (Compile)
 For DB - Optimization is Part of Execution (Run)
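The pruning idea in this summary (cost a heuristic guess first, then abandon any competitor once its partial cost reaches that bound) can be sketched as follows. This is an illustration under assumptions: each strategy is represented as a list of per-step costs, the name `choose_plan` is hypothetical, and the bound is tightened whenever a cheaper strategy is found, which goes slightly beyond what the slide states.

```python
def choose_plan(plans, initial_guess):
    """Return (best name, best cost, names that were costed in full)."""
    best_name = initial_guess
    best_cost = sum(plans[initial_guess])     # the pre-chosen strategy sets the bound
    fully_costed = [initial_guess]
    for name, steps in plans.items():
        if name == initial_guess:
            continue
        running = 0
        for step in steps:
            running += step
            if running >= best_cost:          # already no better: stop costing this plan
                break
        else:
            best_name, best_cost = name, running
            fully_costed.append(name)
    return best_name, best_cost, fully_costed

plans = {
    "strategy_1": [40_000_000, 40_000_000],   # Cartesian product, then search
    "strategy_2": [10_000, 4_000_000],        # selection over W, then join
}
result = choose_plan(plans, "strategy_2")
print(result)
```

With the heuristic guess set to strategy_2, strategy_1 is dropped after its first step, so only one strategy needs a full cost analysis.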
CH10.108