Compiler Concepts for Database Systems
CSE 4100
Prof. Steven A. Demurjian
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Way, Unit 2155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
CH10.1

Overview
Motivation and Background
Database System Architecture
Exploring its Capabilities
Focusing on Compiler-Related Concepts
Compile Time Issues in Database Systems
The SQL Query Language
Optimization Issues in Database Systems
Typing
Runtime Issues in Database Systems
Transaction Processing
Execution for Complex Joins
CH10.2

Database System Architecture
What are the Various Components?
How do they Relate to Compilers?
CH10.3

How Does it Compare to Java Environment?
CH10.4

Database Concepts - Summary
Schema vs. Data
Database: Structured Collection of Data Describing Objects of the Universe of Discourse being Modeled
A Database Consists of Schema and Data
Schema: Describes the Intension (Type) of Objects
Data: Describes the Extension (Instances) of Objects
What is Schema w.r.t. Compilers? What is Data?
CH10.5

What is a DBMS?
A Database Management System (DBMS) is the Generalized Tool that Facilitates the Management of and Access to the Database
Main Functions:
Defining a Database: Specifying Data Types, Structures, and Constraints
Constructing a Database: the Process of Storing the Data Itself on Some Storage Medium
Manipulating a Database: Functions for Querying Specific Data in the Database and Updating the Database
What are the Analogies of Each of the Main Functions w.r.t. Programming Languages and Compilers?
CH10.6

What is a DBMS?
CSE 4100 Additional Functions: Interaction with File Manager So that Details Related to Data Storage and Access are Removed From Application Programs Integrity Enforcement Guarantee Correctness, Validity, Consistency Security Enforcement Prevent Data From Illegal Uses Concurrency Control Control the Interference Between Concurrent Programs Recovery from Failure Query Processing and Optimization Again – What are Relevant Compiler Concepts? CH10.7 DBMS Architecture CSE 4100 DBMS Languages Data Definition Language (DDL) Data Manipulation Language (DML) From Embedded Queries or DB Commands Within a Program “Stand-alone” Query Language Host Language: DML Specification (e.g., SQL) is Embedded in a “Host” Programming Language (e.g., Java, C++) DBMS Interfaces Menu-Based Interface Graphical Interface Forms-Based Interface Interface for DBA (DB Administrator) CH10.8 DBMS Architecture CSE 4100 Main DBMS Modules DDL Compiler DML Compiler Ad-hoc (Interactive) Query Compiler Run-time Database Processor Stored Data Manager Concurrency/Back-Up/Recovery Subsystem DBMS Utility Modules Loading Routines Backup Utility System Catalog/data Dictionary CH10.9 Components of a DBMS CSE 4100 CH10.10 ANSI/SPARC - Three Schema Architecture CSE 4100 External Data Schema (Users’ view) Conceptual Data Schema (Logical Schema) Internal Data Schema (Physical Schema) What are the Programming Language Analogies? 
CH10.11 Conceptual Schema CSE 4100 Describes the Meaning of Data in the Universe of Discourse Emphasizes on General, Conceptually Relevant, and Often Time Invariant Structural Aspects of the Universe of Discourse Excludes the Physical Organization and Access Aspects of the Data This could be a UML Design that Realizes a Set of Classes (no data) or Java Class Declarations (APIs) CH10.12 Conceptual Schema CSE 4100 Another Example – A Programming Language Level Definition CH10.13 External Schema CSE 4100 Describes Parts of the Information in the Conceptual Schema in a form Convenient to a Particular User Group’s View Derived from the Conceptual Schema What is the View of the Outside World in OO? Akin to Public Interface CH10.14 External Schema CSE 4100 Another Example CH10.15 Internal Schema CSE 4100 Describes How the Information Described in the Conceptual Schema is Physically Represented in a Database to Provide the Overall Best Performance CH10.16 Internal Schema CSE 4100 Another Example This Corresponds to Data Typing and Layout in Compilers from Runtime Environment! CH10.17 Unified Example of Three Schemas CSE 4100 CH10.18 Database Access Process CSE 4100 What Does This Access Process Resemble? Akin to Runtime Execution Environment! A More Complex Activation Process! CH10.19 Metadata vs. Data CSE 4100 Recall Introspection and Reflection in Java where you Can “Look” into the Class Definitions Themselves! 
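The metadata vs. data distinction, and the comparison to Java reflection, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in DBMS (an assumption; the slides do not name a system here), and the `Student` row is made-up sample data. `PRAGMA table_info` is SQLite-specific; other systems expose their catalog differently.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (Name CHAR(30), SSN CHAR(9), Gpa FLOAT)")

# Metadata: ask the DBMS to describe the table, much as reflection
# lets a program inspect a class definition.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Student)")]

# Data: the instances stored under that schema (hypothetical row).
conn.execute("INSERT INTO Student VALUES ('Ann', '111223333', 3.5)")
rows = conn.execute("SELECT Name, Gpa FROM Student").fetchall()

print(columns)  # schema (intension)
print(rows)     # instances (extension)
```

The schema query returns column names and declared types without touching any stored tuples, which is the sense in which the catalog plays the role of a symbol table.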
CH10.20

Data Independence
The Ability of Application Programs to Remain Unaffected by Changes in Irrelevant Parts of the Conceptual Data Representation, Data Storage Structure, and Data Access Methods
Invisibility (Transparency) of the Details of the Entire Database Organization, Storage Structure, and Access Strategy to the Users
Recall Software Engineering Concepts:
Abstraction: the Details of an Application's Components Can Be Hidden, Providing a Broad Perspective on the Design
Representation Independence: Changes Can Be Made to the Implementation that have No Impact on the Interface and Its Users
Realized in Today's Modern PLs!
CH10.21

What are System Components?
How are these Similar to Compiler/PL Concepts?
CH10.22

Relational Model
Relational Model of Data Based on the Concept of a Relation
Relation - a Mathematical Concept Based on Sets
Strength of the Relational Approach to Data Management Comes From the Formal Foundation Provided by the Theory of Relations
RELATION: A Table of Values
A Relation May Be Thought of as a Set of Rows
A Relation May Alternately be Thought of as a Set of Columns
Each Row of the Relation May Be Given an Identifier
Each Column Typically is Called by its Column Name or Column Header or Attribute Name
CH10.23

Relational Tables - Rows/Columns/Tuples
CH10.24

Relational Database Definition
  CREATE TABLE Student: Name(CHAR(30)), SSN(CHAR(9)), Gpa(FLOAT(2))
  CREATE TABLE Faculty: Name(CHAR(30)), SSN(CHAR(9)), Ophone(CHAR(7))
  CREATE TABLE Courses: Course#(CHAR(6)), Title(CHAR(20)), Descrip(CHAR(100)), PCourse#(CHAR(6))
  CREATE TABLE Formats: Section#(INTEGER(3)), Quarter(CHAR(10)), Campus(CHAR(15))
  CREATE TABLE TakeorTeach: SSN(CHAR(9)), Course#(CHAR(6)), Section#(INTEGER(3))
  CREATE TABLE COfferings: Course#(CHAR(6)), Section#(INTEGER(3))
Resulting Relations:
  Student(Name*, SSN, Gpa)
  Faculty(Name*, SSN, Ophone)
  Courses(Course#*, Title, Descrip, PCourse#*)
  Formats(Section#*, Quarter, Campus)
  TakeorTeach(SSN, Course#, Section#)
  COfferings(Course#, Section#)
CH10.25

Relational Views
Two Views Derived From Prior Tables
Student Transcript View
Course Prerequisite View
CH10.26

SQL: Tuple Relational Calculus-Based
SQL is a Partial Example of a Tuple Relational Language
Simple Queries are all Declarative
More Complex Queries are both Declarative and Procedural (e.g., joins, nested queries)
Find the names of employees working on the CAD/CAM project
  SELECT EMP.ENAME
  FROM EMP, WORKS, PROJ
  WHERE (EMP.ENO = WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = 'CAD/CAM')
SQL Defines a Programming Language and Associated Semantics for Usage and Processing
CH10.27

SQL Components
Data Definition Language (DDL)
For External and Conceptual Schemas
Views - DDL for External Schemas
Data Manipulation Language (DML)
Interactive DML Against External and Conceptual Schemas
Embedded DML in Host PLs (EQL, JDBC, etc.)
Note: Separation of Definition (DDL) from Usage (DML) - Is there Something Similar in PLs?
Others
Integrity (Allowable Values/Referential)
Transaction Control (Long-Duration and Batch)
Authorization (Who can Do What When)
CH10.28

SQL DDL and DML
Data Definition Language (DDL) - Declarations Defining the Relational Schema - Relations, Attributes, Domains - The Meta-Data
  CREATE TABLE Student: Name(CHAR(30)), SSN(CHAR(9)), GPA(FLOAT(2))
  CREATE TABLE Courses: Course#(CHAR(6)), Title(CHAR(20)), Descrip(CHAR(100)), PCourse#(CHAR(6))
Data Manipulation Language (DML) - Code Defining the Queries Against the Schema
  SELECT Name, SSN FROM Student WHERE GPA > 3.00
CH10.29

Data Definition Language - DDL
A Pre-Defined Set of Primitive Types
Numeric
Character-string
Bit-string
Additional Types
Defining Domains
Defining Schema
Defining Tables
Defining Views
Note: Each DBMS May have its Own DBMS-Specific Data Types - Is this Good or Bad? What is this Similar to re. Different C++ Compilers?
These are Akin to PL Data Types!
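The DDL/DML split from the "SQL DDL and DML" slide can be sketched in runnable form. This is a minimal illustration using Python's `sqlite3` module (an assumed stand-in for the DBMS); the three student rows and the `Age` column are hypothetical, and the `CREATE TABLE` uses standard SQL syntax rather than the slide's shorthand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: declarations that define the schema (the meta-data).
conn.execute("CREATE TABLE Student (Name CHAR(30), SSN CHAR(9), GPA FLOAT)")

# DML: statements that use the declared schema (hypothetical rows).
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [("Ann", "111223333", 3.6), ("Bob", "222334444", 2.9), ("Cy", "333445555", 3.2)],
)
honor_roll = conn.execute(
    "SELECT Name, SSN FROM Student WHERE GPA > 3.00 ORDER BY Name"
).fetchall()

# A query that references an undeclared attribute is rejected when the
# statement is prepared, before any tuple is read: the database analog
# of a compiler reporting an undeclared identifier.
try:
    conn.execute("SELECT Age FROM Student")
    undeclared_rejected = False
except sqlite3.OperationalError:
    undeclared_rejected = True

print(honor_roll, undeclared_rejected)
```

The declaration/use separation is what lets the query compiler check names against the catalog, much as a PL compiler checks uses against declarations.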
CH10.30

DDL - Primitive Types
Numeric
INTEGER (or INT), SMALLINT
REAL, DOUBLE PRECISION
FLOAT(N): Floating Point with at Least N Digits
DECIMAL(P,D) (DEC(P,D) or NUMERIC(P,D)): P Total Digits with D to the Right of the Decimal Point
Note that INTs and REALs are Machine Dependent (Based on Hardware/OS Platform)
Again - this is Similar to PLs/Compilers and Code Generation - Data Layout
CH10.31

DDL - Primitive Types
Character-String
CHAR(N) or CHARACTER(N) - Fixed Length
VARCHAR(N), CHAR VARYING(N), or CHARACTER VARYING(N) - Variable with at Most N Characters
Bit-Strings
BIT(N) - Fixed Length
VARBIT(N) or BIT VARYING(N) - Variable with at Most N Bits
CH10.32

DDL - Primitive Types
These Specialized Primitive Types are Used to:
Simplify the Modeling Process
Include "Popular" Types
Reduce Composite Attributes/Programming
DATE: YYYY-MM-DD
TIME: HH-MM-SS
TIME(I): HH-MM-SS-F....F - I Fractional Seconds Digits
TIME WITH TIME ZONE: HH-MM-SS-HH-MM
TIMESTAMP: YYYY-MM-DD-HH-MM-SS-F...F{-HH-MM}
PLs also have Specialized Types!
Problem: Different Database Systems Sometimes Implement these Types very Differently
This Impacts Portability!
CH10.33

What is a SQL Schema?
A Schema in SQL is the Major Meta-Data Construct
Supports the Definition of:
Relation - Table with Name
Attributes - Columns and their Types
Identification - Primary Key
Constraints - Referential Integrity (FK)
Two Part Definition
CREATE Schema - Named Database or Conceptually Related Tables
CREATE Table - Individual Tables of the Schema
CH10.34

DDL - Create/Drop a Schema
Creating a Schema:
  CREATE SCHEMA MY_COMPANY AUTHORIZATION Demurjian;
Schema MY_COMPANY has Been Created and is Owned by the User "Demurjian"
Tables can now be Created and Added to the Schema
Dropping a Schema:
  DROP SCHEMA MY_COMPANY RESTRICT;
  DROP SCHEMA MY_COMPANY CASCADE;
Restrict: Drop Operation Fails If Schema is Not Empty
Cascade: Drop Operation Removes Everything in the Schema
CH10.35

DDL - Create Tables
  CREATE TABLE EMPLOYEE
    ( FNAME VARCHAR(15) NOT NULL ,
      MINIT CHAR ,
      LNAME VARCHAR(15) NOT NULL ,
      SSN CHAR(9) NOT NULL ,
      BDATE DATE ,
      ADDRESS VARCHAR(30) ,
      SEX CHAR ,
      SALARY DECIMAL(10,2) ,
      SUPERSSN CHAR(9) ,
      DNO INT NOT NULL ,
      PRIMARY KEY (SSN) ,
      FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN) ,
      FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) ) ;
CH10.36

DDL - Create Tables (continued)
  CREATE TABLE DEPARTMENT
    ( DNAME VARCHAR(15) NOT NULL ,
      DNUMBER INT NOT NULL ,
      MGRSSN CHAR(9) NOT NULL ,
      MGRSTARTDATE DATE ,
      PRIMARY KEY (DNUMBER) ,
      UNIQUE (DNAME) ,
      FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN) ) ;
  CREATE TABLE DEPT_LOCATIONS
    ( DNUMBER INT NOT NULL ,
      DLOCATION VARCHAR(15) NOT NULL ,
      PRIMARY KEY (DNUMBER, DLOCATION) ,
      FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT(DNUMBER) ) ;
CH10.37

DDL - Create Tables (continued)
  CREATE TABLE PROJECT
    ( PNAME VARCHAR(15) NOT NULL ,
      PNUMBER INT NOT NULL ,
      PLOCATION VARCHAR(15) ,
      DNUM INT NOT NULL ,
      PRIMARY KEY (PNUMBER) ,
      UNIQUE (PNAME) ,
      FOREIGN KEY (DNUM) REFERENCES DEPARTMENT(DNUMBER) ) ;
  CREATE TABLE WORKS_ON
    ( ESSN CHAR(9) NOT NULL ,
      PNO INT NOT NULL ,
      HOURS DECIMAL(3,1) NOT NULL ,
      PRIMARY KEY (ESSN, PNO) ,
      FOREIGN KEY (ESSN) REFERENCES EMPLOYEE(SSN) ,
      FOREIGN KEY (PNO) REFERENCES PROJECT(PNUMBER) ) ;
CH10.38

DDL - Create Tables with Constraints
  CREATE TABLE EMPLOYEE
    ( ... ,
      DNO INT NOT NULL DEFAULT 1 ,
      CONSTRAINT EMPPK PRIMARY KEY (SSN) ,
      CONSTRAINT EMPSUPERFK FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN)
        ON DELETE SET NULL ON UPDATE CASCADE ,
      CONSTRAINT EMPDEPTFK FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER)
        ON DELETE SET DEFAULT ON UPDATE CASCADE ) ;
CH10.39

DDL - Create Tables with Constraints
  CREATE TABLE DEPARTMENT
    ( ... ,
      MGRSSN CHAR(9) NOT NULL DEFAULT '888665555' ,
      ... ,
      CONSTRAINT DEPTPK PRIMARY KEY (DNUMBER) ,
      CONSTRAINT DEPTSK UNIQUE (DNAME) ,
      CONSTRAINT DEPTMGRFK FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN)
        ON DELETE SET DEFAULT ON UPDATE CASCADE ) ;
Is there an Equivalent to Keys and Constraints in PLs?
What Does Java Have Internally?
Constraints Facilitate Type Checking at Data Level!
CH10.40

Data Manipulation Language - DML
SQL has the SELECT Statement for Retrieving Info. from a Database (Not the Relational Algebra Select)
SQL vs. Formal Relational Model
SQL Allows a Table (Relation) to have Two or More Tuples that are Identical in All Their Attribute Values
Hence, an SQL Table is a Multi-set (Sometimes Called a Bag) of Tuples; it is Not a Set of Tuples
SQL Relations Can Be Constrained to Sets by PRIMARY KEY or UNIQUE Attributes, or by Using the DISTINCT Option in a Query
Implied Processing and Procedural Semantics
SQL Queries have Specific Semantics
These Semantics Dictate Processing
Includes Code Generation, Optimization, etc.
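The multi-set (bag) behavior of SQL tables, and the two ways of constraining a result to a set, can be sketched as follows. This is an illustration under assumptions: it uses Python's `sqlite3` module, a cut-down two-column version of `WORKS_ON`, a hypothetical second table name `WORKS_ON_KEYED`, and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No key declared: the table is a bag, so identical tuples may coexist.
conn.execute("CREATE TABLE WORKS_ON (ESSN CHAR(9), PNO INT)")
conn.executemany(
    "INSERT INTO WORKS_ON VALUES (?, ?)",
    [("123456789", 1), ("123456789", 1), ("333445555", 2)],
)
bag = conn.execute("SELECT ESSN, PNO FROM WORKS_ON ORDER BY ESSN").fetchall()

# DISTINCT removes duplicates at query time, giving set semantics.
as_set = conn.execute(
    "SELECT DISTINCT ESSN, PNO FROM WORKS_ON ORDER BY ESSN"
).fetchall()

# A PRIMARY KEY constrains the stored relation itself to a set.
conn.execute(
    "CREATE TABLE WORKS_ON_KEYED (ESSN CHAR(9), PNO INT, PRIMARY KEY (ESSN, PNO))"
)
conn.execute("INSERT INTO WORKS_ON_KEYED VALUES ('123456789', 1)")
try:
    conn.execute("INSERT INTO WORKS_ON_KEYED VALUES ('123456789', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(len(bag), len(as_set), duplicate_rejected)
```

DISTINCT costs work on every query, while the key constraint is checked once per insert; which is preferable depends on the workload.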
CH10.41

Interactive DML - Main Components
Select-from-where Statement Contains:
Select Clause - Chosen Attributes/Columns
From Clause - Involved Tables
Where Clause - Constrain Tuple Values
Tuple Variables - Distinguish Among Same Names in Different Tables
String Matching - Detailed Matching Including Exact, Starts With, Near
Ordering of Rows - Sorting Tuple Results
CH10.42

Recall Prior Schema
CH10.43

…and Corresponding DB Tables
Which Represent Tuples/Instances of Each Relation
CH10.44

…and Corresponding DB Tables
CH10.45

Simple SQL Queries
Query 0: Retrieve the Birthdate and Address of the Employee whose Name is 'John B. Smith'.
  SELECT BDATE, ADDRESS
  FROM EMPLOYEE
  WHERE FNAME='John' AND MINIT='B' AND LNAME='Smith'
Which Row(s) are Selected?
Note: While All of these Next Queries are from Chapter 8, Some are From an "Earlier" Edition
CH10.46

Simple SQL Queries
Query 1: Retrieve Name and Address of all Employees who work for the 'Research' Department
  SELECT FNAME, MINIT, LNAME, ADDRESS, DNAME
  FROM EMPLOYEE, DEPARTMENT
  WHERE DNAME='Research' AND DNUMBER=DNO
What Action is Being Performed? Join! Cartesian Product!
CH10.47

Simple SQL Queries - Result
Theta Join on DNO=DNUMBER
CH10.48

Simple SQL Queries
Query 2: For Every Project in 'Stafford', list the Project Number, the Controlling Dept. Number, and the Dept.
Manager's Last Name, Address, and Birthdate
  SELECT PNUMBER, DNUM, LNAME, BDATE, ADDRESS
  FROM PROJECT, DEPARTMENT, EMPLOYEE
  WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford'
In Q2, there are Two Join Conditions:
The Join Condition DNUM=DNUMBER Relates a Project to its Controlling Department
The Join Condition MGRSSN=SSN Relates the Controlling Department to the Employee who Manages that Department
CH10.49

Query Results
  SELECT PNUMBER, DNUM, LNAME, BDATE, ADDRESS
  FROM PROJECT, DEPARTMENT, EMPLOYEE
  WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford'
CH10.50

Qualification of Attributes
In SQL, the Same Name for Two (or More) Attributes is Allowed if the Attributes are in Different Relations
In Those Cases, a Query Must Qualify the Attribute by Prefixing the Relation Name to the Attribute Name
  EMPLOYEE.LNAME, DEPARTMENT.DNAME
Aliases: Needed When Queries Must Refer to the Same Relation Twice
An Alias is Akin to a Variable - a Reference in a PL!
In These Situations, it is Considered that there are Two Different Copies of the Same Relation
Let's See Examples of Both Concepts
CH10.51

Attribute Qualification
Query 8: For Each Employee, Retrieve the Employee's Name, and the Name of his or her Immediate Supervisor
  SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
  FROM EMPLOYEE E, EMPLOYEE S
  WHERE E.SUPERSSN=S.SSN
E and S are Aliases for the EMPLOYEE Relation
E Represents Employees in the Role of Supervisees
S Represents Employees in the Role of Supervisors
Another Form of Query 8 is:
  SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
  FROM EMPLOYEE AS E, EMPLOYEE AS S
  WHERE E.SUPERSSN=S.SSN
CH10.52

Query Results
  SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
  FROM EMPLOYEE AS E, EMPLOYEE AS S
  WHERE E.SUPERSSN=S.SSN
CH10.53

Nested Queries
A Nested Query is an SQL SELECT Query Specified within the WHERE-clause of another Query (the Outer Query)
Query 1A: Retrieve the Name and Address of all Employees who Work for the 'Research' Department
  SELECT FNAME,
  LNAME, ADDRESS
  FROM EMPLOYEE
  WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research')
Note: This Reformulates Earlier Query 1
The End Result is Essentially: Outer and Inner For/While Loops!
CH10.54

How Does Nested Query Work?
The Nested Query Selects the Number of the 'Research' Dept.
The Outer Query Selects an EMPLOYEE Tuple If Its DNO Value Is in the Result of the Nested Query
IN Represents Set Inclusion in the Result Set
We Can Have Several Levels of Nested Queries
  SELECT FNAME, LNAME, ADDRESS
  FROM EMPLOYEE
  WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research')
CH10.55

NULLS in SQL Queries
SQL Allows Queries that Check if a Value is NULL (Missing or Undefined or not Applicable)
SQL uses IS or IS NOT to Compare NULLs since it Considers each NULL Value Distinct from other NULL Values, so Equality Comparison is not Appropriate
Query 18: Retrieve the names of all employees who do not have supervisors.
  SELECT FNAME, LNAME
  FROM EMPLOYEE
  WHERE SUPERSSN IS NULL
Why Would Such a Capability be Useful?
Downloading/Crossloading a Database
Promoting an Attribute to PK/FK
CH10.56

Aggregate Functions in SQL Queries
Query 19: Find Maximum Salary, Minimum Salary, and Average Salary among all Employees
  SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY)
  FROM EMPLOYEE
Query 20: Find Maximum and Minimum Salaries among 'Research' Department Employees
  SELECT MAX(SALARY), MIN(SALARY)
  FROM EMPLOYEE, DEPARTMENT
  WHERE DNAME='Research' AND DNUMBER=DNO
What Does Query 22 Do?
  SELECT COUNT(*)
  FROM EMPLOYEE, DEPARTMENT
  WHERE DNAME='Research' AND DNUMBER=DNO
CH10.57

Grouping in SQL Queries
Query 24: For Each Department, Retrieve the DNO, Number of Employees, and Their Average Salary
  SELECT DNO, COUNT(*), AVG(SALARY)
  FROM EMPLOYEE
  GROUP BY DNO
EMPLOYEE Tuples are Divided into Groups; each Group has the Same Value for the Grouping Attribute DNO
COUNT and AVG Functions are Applied to each Group of Tuples Separately
SELECT-clause Includes only the Grouping Attribute and the Functions to be Applied on each Tuple Group
Are there PL Equivalents to these Data Oriented Actions? Yes - in Specific APIs but Not in the PL Itself!
CH10.58

Results of Query 24:
  SELECT DNO, COUNT(*), AVG(SALARY)
  FROM EMPLOYEE
  GROUP BY DNO
CH10.59

INSERT SQL Queries
Add one or more Tuples to a Relation, with Attribute Values Listed in the Order Specified in the CREATE
Update 1:
  INSERT INTO EMPLOYEE
  VALUES ('Richard', 'K', 'Marini', '653298653', '30-DEC-52', '98 Oak Forest,Katy,TX', 'M', 37000, '987654321', 4)
Another Form of Update 1 (Named Attributes, One Value per Listed Attribute):
  INSERT INTO EMPLOYEE (FNAME, LNAME, SSN)
  VALUES ('Richard', 'Marini', '653298653')
All PK and FK Values must be Provided
Nulls are Allowed
DDL Constraints are Enforced
Another Form of "Type Checking" at Instance Level
This is Akin to Dynamic Type Checking!
CH10.60

DELETE SQL Queries
Sample Deletes Include
  DELETE FROM EMPLOYEE WHERE LNAME='Brown'
  DELETE FROM EMPLOYEE WHERE SSN='123456789'
  DELETE FROM EMPLOYEE WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research')
  DELETE FROM EMPLOYEE
No. of Tuples Deleted Depends on the WHERE Clause
Referential Integrity (Type Checking!) is Enforced During DELETE
CH10.61

UPDATE SQL Queries
Give all Employees in the 'Research' Dept.
a 10% raise UPDATE EMPLOYEE SET SALARY = SALARY *1.1 WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research') Modified SALARY Value Depends on the Original SALARY Value in each Tuple SALARY = SALARY *1.1 - Use PL Interpretation CH10.62 Query Processing and Optimization CSE 4100 What are the Processing Issues for DBs? Database Applications of Today and Tomorrow Require High Volumes of Information! Increase of Information Still Requires High Performance! Throughput and Response Time Where's the Bottleneck in DBS? CPU ?? Main Memory Size/Speed ?? Virtual Memory Limitations ?? Communications Bus ?? I/O Channel ?? How Does this Relate to Compilers/PLs? CH10.63 90-10 Rule for Database Processing CSE 4100 Load (Transaction per second) vs. Performance (Response Time of Transactions) Processing of Large Amounts of Raw Data Addressed in Secondary Storage Staged to Main Memory Identifying Relevant Data Large Amounts of Raw Data Discarded Focus on Data Most Likely to Contain Answers Possible Loss of CPU and Main Memory Cycles This is Double Jeopardy! Load of DBS Must be Reduced Performance of DBS Degrades CH10.64 90-10 Rule for Conventional DBS CSE 4100 Only 10% of Relevant Data has Answers Application Programs Operating System Database Functions Only 10% of Raw Data is Relevant On-Line I/O Disk I/O Note: Naive Approach to Database Searching Often Occurs (Little or No Indexing in Practice!) 
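The 90-10 observation above (most raw data staged from disk is discarded) can be made concrete with a small counting sketch. This is plain Python rather than a DBMS: the 1,000-row table, the `DNO` values, and the function names are all hypothetical, and a dictionary stands in for a hash index.

```python
from collections import defaultdict

# Hypothetical table: 1,000 rows, of which 10% have DNO == 5.
rows = [{"SSN": i, "DNO": i % 10} for i in range(1000)]

def sequential_scan(table, dno):
    """Naive search: every row is examined, most are discarded."""
    examined = 0
    hits = []
    for row in table:
        examined += 1
        if row["DNO"] == dno:
            hits.append(row)
    return hits, examined

# Build a hash index on DNO once, ahead of query time.
index = defaultdict(list)
for row in rows:
    index[row["DNO"]].append(row)

def index_lookup(idx, dno):
    """Indexed search: only the matching rows are touched."""
    hits = idx.get(dno, [])
    return hits, len(hits)

scan_hits, scan_examined = sequential_scan(rows, 5)
idx_hits, idx_examined = index_lookup(index, 5)
print(scan_examined, idx_examined)
```

Both searches return the same rows; the difference is how much irrelevant data had to be read to find them, which is the cost the slides attribute to the naive approach.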
CH10.65 Query Optimization Goal CSE 4100 Limit Costly Join Operation by Reducing Data to be Scanned or that Participates in the Join While Improving Selection and Projection can Help, the Main Objective is Join In Worst Case - Cartesian Product Can Improve by Introducing Indices on the Join Attributes (R.B and S.C) to Limit “Product” Can Further Improve by Sorting on the Join Attributes (R.B and S.C) This Reduces Block Accesses by Limiting the Number of Blocks that Must be Examined in a Join If B’s Values Range from 0 to 100 and C from 50 to 150, only need to Compare from 50 to 100 Focus is on Reducing Costly Ops – Same as PL Optimization to Replace * with + CH10.66 Query Processing CSE 4100 Internal Data Structure Memory Hierarchy Main Memory + Secondary Memory Information Must be Staged from Secondary to Primary Memory for Database Operation Sequential Search Brute force Approach Direct Access (Indexed Search) Hash, Inverted Index file, Binary Search Tree, B-tree, B+-tree Improves Selection by Focusing on Subset of Tuples that are Involved in the Answer and Equijoin by Not Having to Compare All Blocks in Two Relations CH10.67 Algorithms for Database Query Operators CSE 4100 Largely Fall into Three Classes: Sorting-Based Methods, Hash-Based Methods, Index-Based Methods Such Algorithms are Divided into Three Degrees of Difficulty and Cost (Limiting Factor is Size of Data) One Pass Algorithms Where Data is Only Read Once From Disk Two-pass Algorithms Data is Read from Disk, Processed in Some Way, Written Back to Disk, Read Again for Processing, etc. Multi-pass Algorithms Where 3 or More Passes Are Required, i.e., Recursive Generalization of the Two-pass Algorithms Akin to Multiple Pass Compilers at Data Level CH10.68 Database Join and Sort are External CSE 4100 Suppose that your DBS has 1,000 1K Blocks of Memory Available for Performing Operations (e.g., Select, Project, Join, Union, Aggregation, etc.) 
Suppose Sort R by R.B
R Contains 5000 Blocks
In Order to Perform a Sort/Merge - You Must Use an External Algorithm since all 5000 Blocks Cannot Fit Into Memory at the Same Time
Suppose Join R (500 Blocks) and S (800 Blocks)
Again - their Total Exceeds Memory - Hence you Must Take an Approach that Compares One Block of R with All Blocks of S, etc. (Slides 22,23)
CH10.69

Database Join and Sort are External
What's True about Today's DBMS Like Oracle?
Oracle Recommends 2 Gigabytes of Primary Memory
That 2 Gigabytes Must be Shared by:
Operating System
Other Applications Running on the "Same" Server (Web Server, etc.)
Database Management Software
Even if there were 1.5 Gigabytes Available, Modern DBs can Exceed that Size Very Easily
Moreover, a Cartesian Product Could Exceed Available Memory
Join Could Require an External Approach Since All Tables Involved in the Join Can't Fit in 1.5 Gigabytes
External Sorting/Block Oriented Processing is the Norm
CH10.70

The System Catalog
Store the Meta Information that Describes Each Database, Including a Description of:
Conceptual Database Schema (Logical Data Model) - Relations, Attributes, Keys, Indexes, Views
Internal Schema
External Schema
Store Information Needed by Specific DBMS Modules
Query Optimization Module
Security and Authorization
CH10.71

Example of Catalog Information
CH10.72

Relational DBMS Catalog
All Metadata Stored as Relations
Examples of Metadata Tables are:
CH10.73

Uses of System Catalog
DDL Compilers: Correct Definition of Relations and Attributes
DML (Query) Compiler:
DML Parser
  SELECT EMP.ENAME
  FROM EMP, WORKS, PROJ
  WHERE (EMP.ENO = WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = 'CAD/CAM')
Guided by the Description of DML Syntax and the Schema Information in the Catalog, Generates a Query Tree after Parsing
Optimizer
Generates an Access Path that is Relatively Optimal for Executing a Query/DML Command, by Accessing the Database Structure Information (Schemas), and Mapping
High-level SQL Queries Into Low-level File Access Commands CH10.74 Revisit Typical Database Processing CSE 4100 Parsed and Optimized User Trans. Pre-Processing - Parser/Lexical - Optimizer/Views Concurrency Control Lock Request Response User Transaction Errors Post-Processing - Collection of Results - Aggregation Operations - Security Checks Low-Level Processing - Enqueue Trans. - Request Locks - Issue I/Os - Process Returned Data - Integrity Checks - Security Checks - Logging for Recovery - Release Locks - Dequeue Trans. High-Level Processing - Enqueue Trans. - Request Locks - Release Locks -Dequeue Trans. Response to User I/O Request Errors Results Lock Request Results Disk I/O Recovery CH10.75 Typical Database Processing CSE 4100 Pre-Processing Actions Taken Upon Receipt of a Query from User SQL Query via Query Tool or JDBC Call “Compilation” of DB Query Check Syntax, Semantics, Optimize, Develop RunTime Strategy (Similar to PL Compilation) Query is Translated to DB Transaction A Transaction Contains Multiple DB Operations Transaction has Explicit Order of Operations Database Transaction Must Succeed or Fail There is no Intermediate State Completely Executed and Committed or Aborts at any Point and Undone New State or Previous State of DB CH10.76 Typical Database Processing CSE 4100 High-Level Processing Enqueue Transaction from Pre-Processing Transaction Must Wait for “Earlier” Transactions Remember - Shared DB State! Request Locks from Concurrency Control All Locks Before Proceeding vs. 
Locks as Needed Avoid Deadlock and Livelock Release Locks As Use of Data Completes to Increase Availability What Happens if Failure of Later Step in Transaction Dequeue Transaction Completes Transaction Processing Return “Result” to Post-Processing CH10.77 Typical Database Processing CSE 4100 Low-Level Processing Enqueue Transaction - Do Actual DB Operations Request Locks - Lower Granularity Level Issue I/Os - Based on Operations to Access “Correct” and “Relevant” DB Records Process Returned Data - Aggregation, Sorting Integrity Checks: Do I/D/U Satisfy Constraints? Security Checks: Is DB R/I/D/U Allowed? Logging for Recovery - Commit the Transaction Release Locks - Available to Others Dequeue Transaction - Return Results to HighLevel Processing Note: The Multiple Operations of Each DB Transaction All Must be Successful CH10.78 Typical Database Processing CSE 4100 Post Processing Collection of Results May be Passed Portions of Results as they Complete For Example, Sorted Blocks of Data that are then Merged in a Final Step Aggregation Operations May be Passed Aggregate Intermediate Results Sum for Different Departments to be Totaled Security Checks Last Step Filtering to Insure Only Allowed Data is Returned May Execute Query but Only see Aggregate Result Send Results to User CH10.79 Typical Database Processing CSE 4100 Concurrency Control Control Access to Information Data and Metadata Prevent Simultaneous Updates Ensure Database Always Correct and Consistent Serial Schedule vs. Serializable Transaction Two Types Pessimistic - Locking-Based - Assume Collisions Will Occur - e.g., Peoplesoft Course Registration Optimistic - Time-Based - Fix Problems After the Fact e.g., ATM Machines Example CC Manages Locks at Different Granularity Levels (Table, Attribute, View, Tuple, Metadata, etc.) 
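The all-or-nothing rule noted under low-level processing (every operation of a transaction must succeed, otherwise the database returns to its previous state) can be sketched with Python's `sqlite3` module. This is an illustration under assumptions: the two-column `EMPLOYEE` table and its values are cut down from the slides' schema, and SQLite's rollback stands in for the logging and recovery machinery described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (SSN CHAR(9) PRIMARY KEY, SALARY DECIMAL(10,2) NOT NULL)"
)
conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 30000)")
conn.commit()

# One transaction with two operations. The second violates the primary
# key, so the first must be undone as well.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE EMPLOYEE SET SALARY = SALARY * 1.1 WHERE SSN = '123456789'"
        )
        conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 25000)")
except sqlite3.IntegrityError:
    pass  # the transaction aborted; the database is in its previous state

salary = conn.execute(
    "SELECT SALARY FROM EMPLOYEE WHERE SSN = '123456789'"
).fetchone()[0]
print(salary)
```

The raise is not visible afterwards even though the UPDATE itself succeeded, because the integrity check on the later INSERT aborted the whole transaction.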
CH10.80 Typical Database Processing CSE 4100 Disk I/O Performs the Actual Disk I/O for Read/Writes Block Oriented Activity Maintain Queue of All I/O Requests Ordering is Critical Related to Concurrency Control and Consistency Single DB Transactions can have Multiple DB Operations Disk I/Os for Different Operations at Different Times High and Low Level Processing will Determine What Operations Needed When Disk I/O - Relatively “Dumb” CH10.81 Typical Database Processing CSE 4100 Recovery Tightly Tied to DB Transaction Concept Transactions Must be: Atomic - Happens or Doesn’t Durable - Once Committed, Results Survive Failure Consistent - Follows Protocol/Correct DB State When Failure Occurs, Can we: Recover to a Correct “Earlier” State Reconcile all “Active” Transactions that were Executing at Failure Time Involves Logging of Database Actions Objective: High Availability and Reliability CH10.82 Query Optimization CSE 4100 Not Really Optimizing, but Planning to Avoid Bad Execution Strategies Models Heuristics-Based Apply Transformation Rules According to a General Strategy Focus on Relational Algebra that Underlies Each Query Improve the “Order” of Relational Operations Cost-Based Minimize a Cost Function I/O Cost + CPU Cost Subject to a Set of Constraints CH10.83 Query Processing Methodology CSE 4100 High-level Calculus-based Query EXTERNAL SCHEMA Query Preprocessing Algebraic Query (a tree structure) LOGICAL SCHEMA Query Optimization INTERNAL SCHEMA Execution Schedule (file access plan) CH10.84 Refute Incorrect Queries CSE 4100 Example: E(ENAME, ENO), P(JNO,JNAME), W(ENO,PNO,DUR) SELECT ENAME, PNAME FROM E, P, W WHERE DUR > 27 AND DUR < 25 Incorrect Disjoint Components are Useless Multiple Relations, Missing Joins, may not be incorrect, but may indicate Cartesian product Contradictory Qualification can not be Satisfied by any Tuple DUR > 27 AND DUR < 25 CH10.85 Simplification CSE 4100 Why Simplify? 
  - The Simpler the Query, the Less Work there is and the Better the Performance
- How? Use Transformation Rules
  - Elimination of Redundancy
    - Idempotency Rules
  - Application of Transitivity
  - Use of Integrity Rules
- Example
  - x > a AND x > b
  - DUR > 27 AND DUR > 25 Simplifies to DUR > 27
CH10.86

Restructuring
- Convert Relational Calculus to Relational Algebra
- Make Use of Query Trees
- Example: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years.

    SELECT ENAME
    FROM   E, W, P
    WHERE  E.ENO = W.ENO
    AND    W.JNO = P.JNO
    AND    E.ENAME <> "J. Doe"
    AND    P.JNAME = "CAD/CAM"
    AND    (W.DUR = 12 OR W.DUR = 24)

- Query Tree (Root to Leaves): Project on ENAME; Select on (DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME≠"J. Doe"; Join on JNO; Join on ENO; Leaves P, W, E
CH10.87

Query Optimization Objectives
- Improving Performance
  - Arriving at a Query Plan of Execution
  - Analyzing the Relational Algebra Query
  - Replace Costly Operations
  - Do Selections and Projections Early
- Optimization Heuristics for the Relational Algebra
  - Performing Selection and Projection Before Join
  - Combining Several Selections Over a Single Relation Into One Selection
  - Find Common Subexpressions
- Algebraic Rewriting/Transformation Rules
  - General Transformation Rules for Relational Algebra (Equivalence-preserving Algebraic Rewriting Rules)
CH10.88

Query Optimization: An Example
- Why is it Important?

    SELECT ENAME
    FROM   E, W
    WHERE  E.ENO = W.ENO
    AND    W.RESP = "Manager"

- Strategy 1: $\Pi_{\text{ENAME}}(\sigma_{\text{RESP}=\text{"Manager"} \wedge \text{E.ENO}=\text{W.ENO}}(E \times W))$
- Strategy 2: $\Pi_{\text{ENAME}}(E \bowtie_{\text{ENO}} (\sigma_{\text{RESP}=\text{"Manager"}}(W)))$
CH10.89

Cost of Alternatives
- Assume:
  - card(E) = 4,000; card(W) = 10,000
  - 10% of Tuples in W Satisfy RESP="Manager" (Selection Generates 1,000 Tuples)
  - Execution Time is Proportional to the Sum of the Cardinalities of the Temporary Relations
  - Searching is Done by Sequential Scanning
- Strategy 1
    Cartesian Product (4,000 * 10,000)  = 40,000,000
    Search over All                     = 40,000,000
    Total                               = 80,000,000
- Strategy 2
    Selection over W                    =     10,000
    Join (4,000 * 1,000)                =  4,000,000
    Total                               =  4,010,000
CH10.90

General Query Optimization Strategy
- Perform Selections Early
  - Yields Smaller Intermediate Results
  - Direct Impact on Subsequent Join/Cartesian Product
- Combine Selections with a Prior Cartesian Product into a Theta or Equi Join
  - Join is a Cheaper Operation
- Combine (Cascade) Selections and Projections
  - $\Pi_{A \cap B}(\Pi_{B}(R)) \equiv \Pi_{A \cap B}(R)$
  - $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_1 \wedge p_2}(R)$
  - This Results in One Pass Instead of Two over the Table
CH10.91

General Query Optimization Strategy
- Identify Common Subexpressions
  - Compute Once and Store
  - Use the Stored Version for Subsequent Times
  - Often Useful When Views are Employed
- Preprocess Data via Sorts and Indexes
  - Speeds up Searches and Joins by Limiting Scope
- Evaluate and Assess Different Options
  - For Cartesian Product, Use the Smaller Relation for Comparison
  - Use the System Catalog (Meta-data) to Effect Order in the Query Execution Plan
CH10.92

Relational Algebra Transformations
- Cascade of Selection
  - $\sigma_{p_1 \wedge p_2 \wedge \dots \wedge p_n}(R) \equiv \sigma_{p_1}(\sigma_{p_2}(\dots(\sigma_{p_n}(R))\dots))$
- Commutativity of Selection
  - $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_2}(\sigma_{p_1}(R))$
  - $\sigma_{p_1 \vee p_2}(R) \equiv \sigma_{p_1}(R) \cup \sigma_{p_2}(R)$
- Cascade of Projection
  - $\Pi_{A_1}(\Pi_{A_2}(\dots(\Pi_{A_n}(R))\dots)) \equiv \Pi_{A_1}(R)$ if $A_1 \subseteq A_2 \subseteq \dots \subseteq A_n$ (each $A_i$ is an attribute list)
- Commuting Selection with Projection
  - $\Pi_{A_1,A_2,\dots,A_n}(\sigma_{p}(R)) \equiv \sigma_{p}(\Pi_{A_1,A_2,\dots,A_n}(R))$ when $p$ references only $A_1,\dots,A_n$
CH10.93

Relational Algebra Transformations
- Commutativity of Theta Join and Cartesian Product
  - $R \bowtie_{A} S \equiv S \bowtie_{A} R$
  - $R \times S \equiv S \times R$
- Commuting Selection with Theta Join (Cartesian)
  - $\sigma_{p(A)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times S$ (A defined on R only)
  - $\sigma_{p(A) \wedge p(B)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times (\sigma_{p(B)}(S))$ (A defined on R, B defined on S)
  - Also Holds for Theta Join
- Commuting Projection with Theta Join (Cartesian)
  - $\Pi_{C}(R \times S) \equiv \Pi_{A}(R) \times \Pi_{B}(S)$ where $A \cup B = C$
  - A are the Attributes in C for R and B are the Attributes in C for S
CH10.94

Relational Algebra Transformations
- Commutativity of Set Operations
  - $R \cup S \equiv S \cup R$
  - $R \cap S \equiv S \cap R$
- Associativity of Set Operations
  - $(R \cup S) \cup T \equiv R \cup (S \cup T)$
  - $(R \cap S) \cap T \equiv R \cap (S \cap T)$
  - $(R \times S) \times T \equiv R \times (S \times T)$
  - $(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)$
- Commuting Select with Set Operations
  - $\sigma_{p(A_i)}(R \cup T) \equiv \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(T)$ where $A_i$ is defined on both R and T
  - $\sigma_{p(A_i)}(R \cap T) \equiv \sigma_{p(A_i)}(R) \cap \sigma_{p(A_i)}(T)$ where $A_i$ is defined on both R and T
CH10.95

Relational Algebra Transformations
- 11. Commuting Projection with Union
  - $\Pi_{C}(R \bowtie_{q(A_j,B_k)} S) \equiv \Pi_{A'}(R) \bowtie_{q(A_j,B_k)} \Pi_{B'}(S)$ (with $A_j \in A'$ and $B_k \in B'$)
  - $\Pi_{C}(R \cup S) \equiv \Pi_{A'}(R) \cup \Pi_{B'}(S)$
  - where R[A] and S[B]; $C = A' \cup B'$ where $A' \subseteq A$, $B' \subseteq B$
- 12. Converting Selection/Cartesian Into Theta Join
  - $\sigma_{C}(R \times S) \equiv R \bowtie_{C} S$
CH10.96

Heuristic Optimization: Example
- Schema: E(ENAME, ENO), P(JNO, JNAME), W(ENO, JNO, DUR)
- Canonical Query Tree at the End of the Query Preprocessing Phase
  - $\Pi_{\text{ENAME}}(\sigma_{(\text{DUR}=12 \vee \text{DUR}=24) \wedge \text{JNAME}=\text{"CAD/CAM"} \wedge \text{ENAME} \neq \text{"J. Doe"}}(P \bowtie_{\text{JNO}} (W \bowtie_{\text{ENO}} E)))$
CH10.97

Heuristic Optimization: Example
- Use the Cascading of Selections Rule to Decompose Selections
  - $\Pi_{\text{ENAME}}(\sigma_{\text{DUR}=12 \vee \text{DUR}=24}(\sigma_{\text{JNAME}=\text{"CAD/CAM"}}(\sigma_{\text{ENAME} \neq \text{"J. Doe"}}(P \bowtie_{\text{JNO}} (W \bowtie_{\text{ENO}} E)))))$
CH10.98

Heuristic Optimization: Example
- Push Selection Down Using Commutativity of Selection over Join
  - $\Pi_{\text{ENAME}}(\sigma_{\text{DUR}=12 \vee \text{DUR}=24}(\sigma_{\text{JNAME}=\text{"CAD/CAM"}}(P \bowtie_{\text{JNO}} (W \bowtie_{\text{ENO}} \sigma_{\text{ENAME} \neq \text{"J. Doe"}}(E)))))$
CH10.99

Heuristic Optimization: Example
- Push Selection Down Using Commutativity of Selection over Join
  - $\Pi_{\text{ENAME}}(\sigma_{\text{DUR}=12 \vee \text{DUR}=24}(\sigma_{\text{JNAME}=\text{"CAD/CAM"}}(P) \bowtie_{\text{JNO}} (W \bowtie_{\text{ENO}} \sigma_{\text{ENAME} \neq \text{"J. Doe"}}(E))))$
CH10.100

Heuristic Optimization: Example
- Push Selection Down
  - $\Pi_{\text{ENAME}}(\sigma_{\text{JNAME}=\text{"CAD/CAM"}}(P) \bowtie_{\text{JNO}} (\sigma_{\text{DUR}=12 \vee \text{DUR}=24}(W) \bowtie_{\text{ENO}} \sigma_{\text{ENAME} \neq \text{"J. Doe"}}(E)))$
CH10.101

Heuristic Optimization: Example
- Do Early Projection
  - $\Pi_{\text{ENAME}}(\Pi_{\text{JNO}}(\sigma_{\text{JNAME}=\text{"CAD/CAM"}}(P)) \bowtie_{\text{JNO}} \Pi_{\text{JNO},\text{ENAME}}(\Pi_{\text{JNO},\text{ENO}}(\sigma_{\text{DUR}=12 \vee \text{DUR}=24}(W)) \bowtie_{\text{ENO}} \Pi_{\text{ENO},\text{ENAME}}(\sigma_{\text{ENAME} \neq \text{"J. Doe"}}(E))))$
CH10.102

Heuristic Optimization: Example
- Identify Subtrees that can be Implemented in One Algorithm
  - Query Tree as in the Early-Projection Step
CH10.103

Heuristic Optimization: A Second Example
- What is the Final Step?
  - Combine Select and Cartesian Product
  - Result: Equijoins!
- Query Tree (Node Labels): Projection on Title at the Root; Selections Books.LC_No = Loans.LC_No and Borrower.Card_No = Loans.Card_No, Each over a Cartesian Product (X); Projections on Books.LC_No, Title; on Loans.LC_No; on Borr.Card_No; and on Loans.LC_No, Loans.Card_No; a Selection Comparing Date with 1/1/88; Leaves Books, Borrower, Loans
CH10.104

Cost-Based Optimization
- Reduce the Defined Cost of Executing Queries
- What is Involved in the Cost of Executing a Query?
  - Access Cost to Secondary Storage
    - Search for Data Block (Index)
    - Read/Write Index and Data Blocks
  - Storage Cost
    - Index and Data Blocks
    - Intermediate Files
  - Computation Cost
    - Query Planning: Optimization Effort
    - Record Search, Sort, Merge
    - Actual Transaction/Query Operations
  - Communications Cost
    - Transfer of Results to the User
CH10.105

Complexity of Relational Operations
- Assuming Relations of Cardinality n
- Sequential Scan of Data in Each Relation
- Complexity of Each Operation is Indicated
- Avoid Cartesian Product at All Costs!

    Operation                                        Complexity
    -----------------------------------------------  ----------
    Select; Project (w/o duplicate elimination)      O(n)
    Project (with duplicate elimination); Group      O(n log n)
    Join; Division; Set Operators                    O(n log n)
    Cartesian Product                                O(n^2)

CH10.106

Cost-Based Optimization
- To Understand Cost-Based Operations, we Must Focus on the Implementation Strategy of:
  - Select
  - Project
  - Join
- For Select and Project: There is a Fixed Cost that we Must Live With
- For Join
  - Implementation Strategy
  - Different Join Strategies
  - Objective: Minimize the Number of Blocks Involved
- Note that Cost-Based and Relational Algebra Heuristic Optimization Can Complement One Another
CH10.107

Optimization Summary
- Most Systems Implement Only a Few Strategies
- The Number of Strategies that are Considered by Any Query Optimizer is Limited
- Some Systems Reduce the Number of Strategies by Making a Heuristic Guess of Strategy for Each Query
  - The Optimizer Considers Every Possible Strategy, but Terminates as Soon as it Determines the Cost is Greater than the Pre-chosen Strategy
  - Thus Only a Few Competing Strategies Require Full Analysis of the Cost
  - The Overhead of Query Optimization is Reduced
- Remember: Trade-off in Optimization Time
  - For PL: Optimization is Pre-Execution (Compile)
  - For DB: Optimization is Part of Execution (Run)
CH10.108
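
The arithmetic on the Cost of Alternatives slide (Strategy 1 versus Strategy 2 for the "Manager" query) can be reproduced with a short Python sketch. It assumes the slide's simplified cost model, in which cost is the sum of the cardinalities of the temporary relations and all searching is sequential. The function names and the two toy relations are hypothetical, introduced only for illustration; this is not how any particular DBMS estimates cost.

```python
from itertools import product

# Simplified cost model from the slides (an assumption, not a real DBMS model):
# cost = sum of the cardinalities of the temporary relations, sequential scans only.

def strategy1_cost(card_e, card_w):
    """Cartesian product first, then one selection pass over the product."""
    cartesian = card_e * card_w      # build E x W
    search = cartesian               # scan every tuple of E x W
    return cartesian + search

def strategy2_cost(card_e, card_w, selected_w):
    """Select on W first, then join E with the selected tuples."""
    selection = card_w               # one sequential scan of W
    join = card_e * selected_w       # compare E with each selected W tuple
    return selection + join

# Toy relations with made-up rows, used only to check that both
# strategies return the same answer.
E = [{"ENO": 1, "ENAME": "A. Lee"}, {"ENO": 2, "ENAME": "B. Kim"}]
W = [{"ENO": 1, "RESP": "Manager"}, {"ENO": 2, "RESP": "Analyst"}]

def run_strategy1(e_rel, w_rel):
    """Evaluate the query as product, then select, then project."""
    pairs = product(e_rel, w_rel)
    return sorted(e["ENAME"] for e, w in pairs
                  if e["ENO"] == w["ENO"] and w["RESP"] == "Manager")

def run_strategy2(e_rel, w_rel):
    """Evaluate the query as select on W, then join, then project."""
    managers = [w for w in w_rel if w["RESP"] == "Manager"]
    return sorted(e["ENAME"] for e in e_rel for w in managers
                  if e["ENO"] == w["ENO"])

if __name__ == "__main__":
    print(strategy1_cost(4_000, 10_000))          # 80000000
    print(strategy2_cost(4_000, 10_000, 1_000))   # 4010000
    print(run_strategy1(E, W) == run_strategy2(E, W))  # True
```

Under these assumptions the two plans produce the same result, while the estimated cost differs by roughly a factor of 20, which is the point of performing selections before joins.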