Chapter 19 6e - 17 & 18 5: System Catalog and Query Optimization CSE 4701 Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department The University of Connecticut 191 Auditorium Road, Box U-155 Storrs, CT 06269-3155 steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818 A portion of these slides are being used with the permission of Dr. Ling Lui, Associate Professor, College of Computing, Georgia Tech. Other slides have been adapted from the AWL web site for the textbook. Remaining slides represent new material. Chapter 19-1 Overview of Material CSE 4701 Key Background Topics: What are Typical Database Processing Actions? Disk Drives and Disk Storage Database Processing/Architectures Motivating Query Optimization Query Processing Chapter 17 - System Catalog What is it? How is it Used? Chapter 18 - Query Optimization in RDBMS High-level Query Optimization (Algebraic) Low-level Query Optimization (Cost-based) Chapter 19-2 Typical Database Processing CSE 4701 Parsed and Optimized User Trans. Pre-Processing - Parser/Lexical - Optimizer/Views Concurrency Control Lock Request Response User Transaction Errors Post-Processing - Collection of Results - Aggregation Operations - Security Checks Low-Level Processing - Enqueue Trans. - Request Locks - Issue I/Os - Process Returned Data - Integrity Checks - Security Checks - Logging for Recovery - Release Locks - Dequeue Trans. High-Level Processing - Enqueue Trans. - Request Locks - Release Locks -Dequeue Trans. Response to User I/O Request Errors Results Lock Request Results Disk I/O Recovery Chapter 19-3 What are the Processing Issues for DBs? CSE 4701 Database Applications of Today and Tomorrow Require High Volumes of Information! Increase of Information Still Requires High Performance! Throughput and Response Time Where's the Bottleneck in DBS? CPU ?? Main Memory Size/Speed ?? Virtual Memory Limitations ?? Communications Bus ?? I/O Channel ?? 
Chapter 19-4 90-10 Rule for Database Processing CSE 4701 Load (Transaction per second) vs. Performance (Response Time of Transactions) Processing of Large Amounts of Raw Data Addressed in Secondary Storage Staged to Main Memory Identifying Relevant Data Large Amounts of Raw Data Discarded Focus on Data Most Likely to Contain Answers Possible Loss of CPU and Main Memory Cycles This is Double Jeopardy! Load of DBS Must be Reduced Performance of DBS Degrades Chapter 19-5 90-10 Rule for Conventional DBS CSE 4701 Only 10% of Relevant Data has Answers Application Programs Operating System Database Functions Only 10% of Raw Data is Relevant On-Line I/O Disk I/O Note: Naive Approach to Database Searching Often Occurs (Little or No Indexing in Practice!) Chapter 19-6 Randomly Accessed Storage Devices CSE 4701 Popular Media (Hard Drives, CDs, DVDs, etc.) Access to Information in Any Order Sequential Access Not Typically Supported or Needed, Since “Files” Not Stored Sequentially Recall, Disk Defragmentation on PC Platform Block-Oriented Utilization of Device Block Access to Optimize Transfer Block Size is Device/Controller Dependent Linear/Non-Linear Byte Orders with Blocks Key Concepts … Platter Track Sector Cylinder Read/Write Heads Chapter 19-7 Rotating Storage CSE 4701 Track R/W Heads Platters Cylinder Top View of a Surface Note: Parallel Read/Write Drives Activate All Heads Simultaneously Chapter 19-8 Disk Drive Components CSE 4701 Chapter 19-9 Disk Characteristics and Access CSE 4701 Transfer Time: Time to Copy Bits From Disk Surface to Primary Memory Disk Latency Time: Rotational Delay Waiting for Proper Sector to Rotate Under R/W Head Rotate to Next Sector to Process Next Request Disk Seek Time: Delay While R/W Head Moves to the Destination Track/Cylinder Move Head In/Out to Seek Next Track/Cylinder Access = Seek (In/Out) + Latency (Around) + Transfer (Bytes) For DBMS - Key is Moving Data To/From Disk ASAP w.r.t. 
Performance and Response Time Improve on 90-10 via Processing/Optimization Chapter 19-10 Historical DB Architecture - Mainframe CSE 4701 Chapter 19-11 Client/Server DBS Architecture CSE 4701 Chapter 19-12 Mixed Architecture CSE 4701 Chapter 19-13 Three and Four Tier Architectures CSE 4701 From: http://java.sun.com/javaone/javaone98/sessions/T400/index.html Chapter 19-14 What is MBDS? CSE 4701 MBDS is Multi-Process, Multi-Computer, Parallel Database System MBDS Composed of … Host for Issuing User Requests Controller to Interact with Host (and User) One or More Backend Database Processors Goals of MBDS Suppose Request Takes 4 Minutes with One Backend Improve Response Time by Increasing Backends Two Backends - Request 2+ Minutes Four Backends - Request 1+ Minutes Chapter 19-15 What is MBDS Architecture? CSE 4701 Database Blocks are Distributed Across All Backends Backend (BE) DB Processors are Replicated Database Controller Sends Same Query in Parallel to all BEs Host User Database Controller Backend Database Processor Backend Database Processor BEs work in Parallel on Each Query and Communicate for Join Results are Sent to and Collected by the DB Controller - then to the User Backend Database Processor Chapter 19-16 Approach Distributes Data Across Backends CSE 4701 Suppose System has 10 Backends Consider a Number of Tables Inventory Customers Employees … What Happens if Place One Table/Backend? What Happens if you Distribute … Table Across 10 Backends? Backend Database Processor 2 Backend Database Processor 1 Backend Database Processor 10 Chapter 19-17 What are MBDS Processes? CSE 4701 Database Controller Request Preparation Post Processing Put Msg. Get Msg. Get Msg. Put Msg. Directory Management Record Processing Concurrency Control Disk I/O Backend Database Processor Chapter 19-18 What are MBDS Messages? CSE 4701 No. 1 2 3 4 6 12 15 16 21 22 23 Type New Request Results of Request Number of Reqs in Transaction Aggregate Operators (Sum, etc.) 
Parsed Request to Backends Backend Aggregate Operator Results Ids for Accessing Database Indexes Request and Disk Addresses Ids for Accessing Database Records Locks Obtained: Okay to Execute Request ID of Finished Request SRC Host PoPr ReqP ReqP ReqP RecP DM DM DM CC RecP DST ReqP Host PoPr PoPr DM PoPr DMs RecP CC RecP CC Chapter 19-19 Sample Processing of Retrieve Request CSE 4701 A1 F15 From Other Backend Request Preparation D6 Put Msg. B3 C4 K12 Post Processing K12 Get Msg. E15 To Backend(s) Get Msg. Put Msg. D6,F15 E15 Directory Management G21 K12 H22 Record Processing I16 Concurrency Control J23 Disk I/O Chapter 19-20 What are Synchronization Issues in MBDS? CSE 4701 Coordination of Synchronous Behavior … Within Controller and Backend to Allow Multiple Active Requests within Each Process Requests at Different Stages in Different Processes Between Controller and Backends to Allow A Request to be Processed by All Backends A Request to be Processed by One Backend Among Multiple Backends to Allow a Backend to Synchronize its Work on one Request with Other Backends to Forward Results to Another Backend Chapter 19-21 Introduction to Query Processing CSE 4701 Query optimization: The process of choosing a suitable execution strategy for processing a query. Two internal representations of a query: Query Tree Query Graph Chapter 19-22 Introduction to Query Processing CSE 4701 Chapter 19-23 How Does this Relate to Compilers? Source Program CSE 4701 1 2 3 Symbol-table Manager Lexical Analyzer Syntax Analyzer Semantic Analyzer Error Handler 4 5 6 Intermediate Code Generator Code Optimizer Code Generator Target Program 1, 2, 3 : Analysis - Our Focus 4, 5, 6 : Synthesis Chapter 19-24 Translating SQL Queries into Relational Algebra CSE 4701 Query block: The basic unit that can be translated into the algebraic operators and optimized. A query block contains a single SELECT-FROMWHERE expression, as well as GROUP BY and HAVING clause if these are part of the block. 
Nested queries within a query are identified as separate query blocks. Aggregate operators in SQL must be included in the extended algebra.

Chapter 19-25 Translating SQL Queries into Relational Algebra
Original query:
  SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > (SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5);
Outer block, with C standing for the result of the inner block:
  SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C
  $\pi_{LNAME, FNAME}(\sigma_{SALARY > C}(EMPLOYEE))$
Inner block:
  SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5
  $\mathcal{F}_{MAX\ SALARY}(\sigma_{DNO=5}(EMPLOYEE))$

Chapter 19-26 Why is Query Optimization Needed?
- Data Volume in any Type of Join or Cartesian Product has the Potential to be Very Large!
- Consider R(A, B) = {r1, r2, ..., rn}
- Consider S(C, D) = {s1, s2, ..., sm}
- R x S = {r1 s1, r1 s2, r1 s3, r1 s4, …, r2 s1, r2 s2, r2 s3, r2 s4, …}, which contains n x m tuples!
- What is the Issue?
  - If n is 10,000 and m is 20,000, then the Cartesian Product has 200,000,000 Tuples
  - Join must Perform 200,000,000 Comparisons

Chapter 19-27 Aside – What is an External Sort?
- Traditional – All Algorithm/Programming Classes Focus on Internal Searches/Sorts
- Internal – All Data Loaded into Main Memory
  - Data Searched/Sorted in Main Memory
  - Results in Main Memory
- What Happens When the Data Source to be Searched Exceeds Main Memory (or Virtual Memory)?
- External Search
  - Stages Blocks of Data from Disk into Memory
  - Sorts/Searches with Blocks in Memory
  - Writes Intermediate Results to Disk
  - Needs to Reread Results from Disk for Final Search Results, Merging for Sort, etc.

Chapter 19-28 Why is Query Optimization Needed?
- n / m: Number of Tuples of R / S Respectively
- $b_R$ / $b_S$: Number of Tuples per Block for R / S
- Assume that K Blocks Fit into Primary Memory: 1 Block of R and K-1 Blocks of S
- $n / b_R$ and $m / b_S$: Number of Blocks for R and S
- $(m / b_S)/(K-1)$: Number of Times that the K-1 Block Memory Chunk is Filled by S
- $(n / b_R)\,[(m / b_S)/(K-1)]$: The Chunk is Filled that Many Times for Each Block of R
- Must also Read the Blocks of R, so Total Block Reads:
  $$\frac{n}{b_R} + \frac{n}{b_R}\left[\frac{m / b_S}{K-1}\right]$$

Chapter 19-29 Why is Query Optimization Needed?
- Same Setup: 1 Block of R and K-1 Blocks of S in Memory
- If n = m = 10,000, $b_R = b_S = 5$, and K = 100:
  $$\frac{10{,}000}{5} + \frac{10{,}000}{5}\left[\frac{10{,}000/5}{99}\right] \approx 42{,}400 \text{ Blocks to Read}$$
- At 20 Blocks/Second, about 35 Minutes!

Chapter 19-30 Observation
- Cartesian Product Yields Unwanted Data
  SELECT R.A FROM R, S WHERE R.B = S.C AND S.D = 99
- In Relational Algebra:
  $$\pi_A(\sigma_{B=C \land D=99}(R \times S)) = \pi_A(\sigma_{B=C}(R \times \sigma_{D=99}(S))) = \pi_A(R \bowtie_{B=C} \sigma_{D=99}(S))$$
- Has Performance Improved? How?
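The arithmetic in the block-read estimate above (n = m = 10,000, 5 tuples per block, K = 100) can be checked with a short Python sketch. The function name `estimated_reads` and its parameter names are illustrative choices, not from the slides, and the formula is evaluated exactly as the slides state it, counting each fill of the (K-1)-block chunk as one unit.

```python
def estimated_reads(n, m, b_r, b_s, k):
    """Evaluate the slides' estimate (n/b_R) + (n/b_R) * [(m/b_S)/(K-1)].

    n, m     : number of tuples in R and S
    b_r, b_s : tuples per block for R and S
    k        : number of blocks that fit in primary memory
    """
    blocks_r = n / b_r                        # blocks occupied by R
    blocks_s = m / b_s                        # blocks occupied by S
    fills_per_r_block = blocks_s / (k - 1)    # times the (K-1)-block chunk is filled by S
    return blocks_r + blocks_r * fills_per_r_block


# Values taken from the slide; 20 blocks/second is the slide's assumed I/O rate.
total = estimated_reads(n=10_000, m=10_000, b_r=5, b_s=5, k=100)
minutes = total / 20 / 60

print(f"estimated block reads: {total:,.0f}")
print(f"time at 20 blocks/second: {minutes:.1f} minutes")
```

Varying `k` or the tuple counts in this sketch gives a feel for how strongly the estimate depends on the size of the relations that actually reach the join.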
Chapter 19-31 Evaluation
- Cartesian Product for SELECT - 40,000 Blocks
  SELECT R.A FROM R, S WHERE R.B = S.C AND S.D = 99
- Relational Algebra with Equijoin: $\pi_A(R \bowtie_{B=C} \sigma_{D=99}(S))$
- The $\sigma_{D=99}(S)$ Limits the Size of S Dramatically
- As a Result, the Equijoin of R and $\sigma_{D=99}(S)$ Would Likely Reduce the Total Blocks Required to 4,000
- Thus, a "Smart" Query Execution Strategy Can Dramatically Reduce the Amount of I/Os

Chapter 19-32 Query Optimization Goal
- Limit Costly Join Operation by Reducing Data to be Scanned or that Participates in the Join
- Query Optimization is the Strategy to Achieve this Goal
- While Improving Selection and Projection can Help, the Main Objective is Join
  - In the Worst Case, a Join is a Cartesian Product
  - Can Improve by Introducing Indices on the Join Attributes (R.B and S.C) to Limit "Product"
  - Can Further Improve by Sorting on the Join Attributes (R.B and S.C)
  - This Reduces Block Accesses by Limiting the Number of Blocks that Must be Examined in a Join
  - If B's Values Range from 0 to 100 and C's from 50 to 150, only need to Compare from 50 to 100

Chapter 19-33 Query Processing
- Internal Data Structure
- Memory Hierarchy
  - Main Memory + Secondary Memory
  - Information Must be Staged from Secondary to Primary Memory for Database Operation
- Sequential Search
  - Brute Force Approach
- Direct Access (Indexed Search)
  - Hash, Inverted Index File, Binary Search Tree, B-tree, B+-tree
  - Improves Selection by Focusing on the Subset of Tuples that are Involved in the Answer, and Equijoin by Not Having to Compare All Blocks in Two Relations

Chapter 19-34 Algorithms for Database Query Operators
- Largely Fall into Three Classes
  - Sorting-Based Methods
  - Hash-Based Methods
  - Index-Based Methods
- Such Algorithms are Divided into Three Degrees of Difficulty and Cost (Limiting Factor is Size of Data)
  - One-pass Algorithms: Data is Only Read Once From Disk
  - Two-pass Algorithms: Data is Read from Disk, Processed in Some Way, Written Back to Disk, Read Again for Processing, etc.
  - Multi-pass Algorithms: 3 or More Passes Are Required, i.e., Recursive Generalization of the Two-pass Algorithms

Chapter 19-35 Database Join and Sort are External
- Suppose that your DBS has 1,000 1K Blocks of Memory Available for Performing Operations (e.g., Select, Project, Join, Union, Aggregation, etc.)
- Suppose Sort R by R.B
  - R Contains 5000 Blocks
  - In order to Perform a Sort/Merge, You Must Use an External Algorithm since all 5000 Blocks Cannot Fit Into Memory at the Same Time
- Suppose Join R (500 Blocks) and S (800 Blocks)
  - Again, their Total (1,300 Blocks) Exceeds Memory
  - Hence you Must Take an Approach that Compares One Block of R with All Blocks of S, etc.

Chapter 19-36 Database Join and Sort are External
- What's True about Today's DBMS Like Oracle?
- Oracle Recommends 2 Gigabytes of Primary Memory
- That 2 Gigabytes Must be Shared by:
  - Operating System
  - Other Applications Running on "Same" Server (Web Server, etc.)
  - Database Management Software
- Even if there were 1.5 Gigabytes Available, Modern DBs can Exceed that Size Very Easily
- Moreover, Cartesian Product Could Exceed Available Mem.
Join Could Require External Approach Since All Tables Involved in Join Can’t fit in 1.5 Gigabytes External Sorting/Block Oriented Processing is Norm Chapter 19-37 Algorithms for DB Query Operators CSE 4701 Relational Algebra Operators can be Classified into Three Groups Tuple-at-a-time Unary Operators Selection and Projection No Need to Bring Entire Relation into Memory at One Time Full-Relation Unary Operators Duplicate Elimination and Grouping Requires Seeing All or Most of the Tuples in Memory at Once Full-Relation Binary Operators Set and Bag Versions of Union, Intersection, and Difference, Joins, and Cartesian Products Requires Seeing the Tuples of Both Relations in Memory Chapter 19-38 Query Access CSE 4701 Application Programs Application Interfaces Dbms DML Preprocessor Object Code of Aps Database Schema Query Query Processor Database Manager DDL Preprocessor File Manager Data Files SELECT EMP.ENAME FROM EMP, WORKS, PROJ Disk Storage System Catalog WHERE (EMP.ENO= WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = “CAD/CAM”) Chapter 19-39 SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE (EMP.ENO= WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = “CAD/CAM”) Database Access CSE 4701 User Program A DBMS System Buffer 7 9 Language User Work Area (UWA) 10 2 1 8 (DBMS) 3 6 5 Database Operating System external schema used by user program A Schema 4 Physical/Internal Data Schema Chapter 19-40 Database Access 1. CSE 4701 2. 3. 4. 5. 6. 7. 8. 9. User program A sends to DBMS an invoke command to retrieve a (set of) record DBMS analyzes the external schema of the user program A and finds the database description of the record. DBMS checks with the schema to get the data types and location information of record DBMS checks with the physical schema to find out which device the record is in and what access methods can be used. According to 4, DBMS sends OS a read command to execute the search. 
OS issues the page invoke command to the correspond device, and then puts the page fetched into the system buffer. DBMS uses the schema and the external schema to infer the logical structure of the retrieving record. DBMS places the relevant data to the UWA, and provides the status information at the program invocation exit Chapter 19-41 The System Catalog CSE 4701 Store the Meta Information that Describes Each Database, Including a Description of Conceptual Database Schema (Logical Data Model) Relations, Attributes, Keys, Indexes, Views Internal Schema External Schema Store Information Needed by Specific DBMS Modules Query Optimization Module Security and Authorization Chapter 19-42 Metadata - What is it? CSE 4701 System metadata: Where data came from How data were changed How data are stored How data are mapped Who owns data Who can access data Data usage history Data usage statistics System metadata are critical in a DBMS Application metadata: What data are available Where data are located What the data mean How to access the data Predefined reports Predefined queries How current the data are Application metadata are critical in a database system Chapter 19-43 Metadata v.s. 
Data
- Meta Schema: Describes All Schemata that can be Defined in the Data Model
  - relations(rel-name, att-name, dom-name)
- Data Dictionary Schema: Contains Copy of Metaschema; Schema for Format Definitions; Schema for Data about Application Data
  - relations(rel-name, att-name, dom-name)
  - access-rights(user, relation, operation)
- Data Dictionary Data: Schema for Application Data; Metadata about Application Data
  - supplier(s#, sname, location)
  - (u1, supplier, insert), (u2, supplier, delete)
- Data: Raw Formatted Application Data
  - (s1, smith, london), (s2, jones, boston)

Chapter 19-44 Example of Catalog Information

Chapter 19-45 Relational DBMS Catalog
- All Metadata Stored as Relations
- Example of Metadata Tables are:

Chapter 19-46 EER Diagram for Relational Catalog

Chapter 19-47 Metadata in Oracle
- Complex Data Dictionary
- All Schema Objects (Tables, Views, Indices, …)
- User, All, and DBA Views
  SELECT * FROM ALL_CATALOG WHERE OWNER = 'SMITH';
  SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, NUM_DISTINCT, LOW_VALUE, HIGH_VALUE FROM USER_TAB_COLUMNS WHERE TABLE_NAME = 'ORDERS';

Chapter 19-48 Metadata in Oracle
  SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS, BLOCKS, EMPTY_BLOCKS, AVG_ROW_LEN FROM USER_TABLES WHERE TABLE_NAME = 'ORDERS';
  SELECT INDEX_NAME, UNIQUENESS, BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY FROM USER_INDEXES WHERE TABLE_NAME = 'ORDERS';

Chapter 19-49 Uses of System Catalog
- Example Query: SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE (EMP.ENO = WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = 'CAD/CAM')
- DDL Compilers: Correct Definition of Relations and Attributes
- DML (Query) Compiler:
  - DML Parser: Guided by the Description of DML Syntax and the Schema Information in the Catalog, Generates a Query Tree after Parsing
  - Optimizer: Generates Access Paths that are Relatively Optimal for Executing a Query/DML Command, by Accessing the Database Structure Information (Schemas), and Mapping High-level SQL Queries Into
Low-level File Access Commands Chapter 19-50 Revisit Typical Database Processing CSE 4701 Parsed and Optimized User Trans. Pre-Processing - Parser/Lexical - Optimizer/Views Concurrency Control Lock Request Response User Transaction Errors Post-Processing - Collection of Results - Aggregation Operations - Security Checks Low-Level Processing - Enqueue Trans. - Request Locks - Issue I/Os - Process Returned Data - Integrity Checks - Security Checks - Logging for Recovery - Release Locks - Dequeue Trans. High-Level Processing - Enqueue Trans. - Request Locks - Release Locks -Dequeue Trans. Response to User I/O Request Errors Results Lock Request Results Disk I/O Recovery Chapter 19-51 Typical Database Processing CSE 4701 Pre-Processing Actions Taken Upon Receipt of a Query from User SQL Query via Query Tool or JDBC Call “Compilation” of DB Query Check Syntax, Optimize, Develop Run-Time Strategy (Similar to PL Compilation) Query is Translated to DB Transaction A Transaction Contains Multiple DB Operations Transaction has Explicit Order of Operations Database Transaction Must Succeed or Fail There is no Intermediate State – All or Nothing Completely Executed and Committed or Aborts at any Point and Undone New State or Previous State of DB Chapter 19-52 Typical Database Processing CSE 4701 High-Level Processing Enqueue Transaction from Pre-Processing Transaction Must Wait for “Earlier” Transactions Remember - Shared DB State! Request Locks from Concurrency Control All Locks Before Proceeding vs. Locks as Needed Avoid Deadlock and Livelock Release Locks As Use of Data Completes to Increase Availability What Happens if Failure of Later Step in Transaction Dequeue Transaction Completes Transaction Processing Return “Result” to Post-Processing Chapter 19-53 What are Deadlock and Livelock? 
- Deadlock
  - Query 1 Gets Access to Table A, Needs Table B
  - Query 2 Gets Access to Table B, Needs Table A
  - Query 1 Won't Release A until it Gets B
  - Query 2 Won't Release B until it Gets A
  - This is Deadlock!
- Livelock
  - Query 1 Gets A, Seeks B, Can't, so Releases A
  - Query 2 Gets B, Seeks A, Can't, so Releases B
  - Process Keeps Repeating
  - Can Lead to Starvation
  - Analogy – Two People Trying to Pass in a Narrow Hall

Chapter 19-54 Typical Database Processing
- Low-Level Processing
  - Enqueue Transaction - Do Actual DB Operations
  - Request Locks - Lower Granularity Level
  - Issue I/Os - Based on Operations to Access "Correct" and "Relevant" DB Records
  - Process Returned Data - Aggregation, Sorting
  - Integrity Checks: Do I/D/U Satisfy Constraints?
  - Security Checks: Is DB R/I/D/U Allowed?
  - Logging for Recovery - Commit the Transaction
  - Release Locks - Available to Others
  - Dequeue Transaction - Return Results to High-Level Processing
- Note: The Multiple Operations of Each DB Transaction Must All be Successful

Chapter 19-55 Typical Database Processing
- Post Processing
  - Collection of Results
    - May be Passed Portions of Results as they Complete
    - For Example, Sorted Blocks of Data that are then Merged in a Final Step
  - Aggregation Operations
    - May be Passed Aggregate Intermediate Results
    - Sum for Different Departments to be Totaled
  - Security Checks
    - Last Step Filtering to Ensure Only Allowed Data is Returned
    - May Execute Query but Only See Aggregate Result
  - Send Results to User

Chapter 19-56 Typical Database Processing
- Concurrency Control
  - Control Access to Information: Data and Metadata
  - Prevent Simultaneous Updates
  - Ensure Database is Always Correct and Consistent
  - Serial Schedule vs. Serializable Transaction
- Two Types
  - Pessimistic - Locking-Based - Assume Collisions Will Occur, e.g., Peoplesoft Course Registration
  - Optimistic - Time-Based - Fix Problems After the Fact, e.g., ATMs
- Example: CC Manages Locks at Different Granularity Levels (DB, Table, Attribute, View, Tuple, Metadata, etc.)
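The deadlock scenario described above (Query 1 holds A and needs B, Query 2 holds B and needs A) can be detected mechanically. The Python sketch below uses a wait-for graph with cycle detection, which is one common textbook technique and is not described in these slides; the function `has_deadlock`, the `holds`/`wants` dictionaries, and the assumption of exclusive table-level locks are all illustrative.

```python
def has_deadlock(holds, wants):
    """Return True if the wait-for graph built from holds/wants contains a cycle.

    holds: dict mapping query name -> set of resources it currently holds
    wants: dict mapping query name -> set of resources it is waiting for
    Assumes exclusive locks, so each resource has at most one holder.
    """
    # Map each resource to the query that holds it.
    owner = {res: q for q, held in holds.items() for res in held}

    # Edge q1 -> q2 means q1 is waiting for a resource that q2 holds.
    waits_for = {
        q: {owner[res] for res in wants.get(q, set())
            if res in owner and owner[res] != q}
        for q in holds
    }

    visiting, done = set(), set()

    def visit(q):
        if q in done:
            return False
        if q in visiting:
            return True  # reached a query already on the current path: cycle
        visiting.add(q)
        found = any(visit(nxt) for nxt in waits_for.get(q, ()))
        visiting.discard(q)
        done.add(q)
        return found

    return any(visit(q) for q in holds)


# The slide's scenario: each query holds one table and needs the other.
deadlocked = has_deadlock(
    holds={"Query 1": {"A"}, "Query 2": {"B"}},
    wants={"Query 1": {"B"}, "Query 2": {"A"}},
)

# Only Query 1 is waiting, so Query 2 can finish and release B.
not_deadlocked = has_deadlock(
    holds={"Query 1": {"A"}, "Query 2": {"B"}},
    wants={"Query 1": {"B"}},
)

print(deadlocked, not_deadlocked)
```

A real concurrency control component would also have to choose a victim transaction to abort once a cycle is found; that step is outside this sketch.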
Chapter 19-57 Typical Database Processing CSE 4701 Disk I/O Performs the Actual Disk I/O for Read/Writes Block Oriented Activity Maintain Queue of All I/O Requests Ordering is Critical Related to Concurrency Control and Consistency Single DB Transactions can have Multiple DB Operations with Multiple Disk I/Os Disk I/Os for Different Operations at Different Times High and Low Level Processing will Determine What Operations Needed When Disk I/O - Relatively “Dumb” Chapter 19-58 Typical Database Processing CSE 4701 Recovery Tightly Tied to DB Transaction Concept Transactions Must be: Atomic - Happens or Doesn’t Durable - Once Committed, Results Survive Failure Consistent - Follows Protocol/Correct DB State When Failure Occurs, Can we: Recover to a Correct “Earlier” State Reconcile all “Active” Transactions that were Executing at Failure Time Involves Logging of Database Actions Objective: High Availability and Reliability Chapter 19-59 Query Optimization CSE 4701 Not Really Optimizing, but Planning to Avoid Bad Execution Strategies Models Heuristics-Based Apply Transformation Rules According to a General Strategy Focus on Relational Algebra that Underlies Each Query Improve the “Order” of Relational Operations Cost-Based Minimize a Cost Function I/O Cost + CPU Cost Subject to a Set of Constraints Chapter 19-60 Query Processing Methodology CSE 4701 High-level Calculus-based Query EXTERNAL SCHEMA Query Preprocessing Algebraic Query (a tree structure) LOGICAL SCHEMA Query Optimization INTERNAL SCHEMA Execution Schedule (file access plan) Chapter 19-61 Query Preprocessing CSE 4701 Input: Calculus Query on Base Relations Normalization Manipulate Query Quantifiers and Qualification Analysis Detect and Reject Incorrect Queries Possible for Only a Subset of Relational Calculus Simplification Eliminate Redundant Predicates Restructuring Calculus Query Algebraic Query More Than One Translation is Possible Use Transformation Rules Chapter 19-62 Normalization CSE 4701 Lexical and 
Syntactic Analysis (Similar to Compilers)
  - Check Validity
  - Check for Attributes and Relations
  - Type Checking on the Qualification
- Put into Normal Form
  - Conjunctive Normal Form: $(p_{11} \lor p_{12} \lor \dots \lor p_{1n}) \land \dots \land (p_{m1} \lor p_{m2} \lor \dots \lor p_{mn})$
  - Disjunctive Normal Form: $(p_{11} \land p_{12} \land \dots \land p_{1n}) \lor \dots \lor (p_{m1} \land p_{m2} \land \dots \land p_{mn})$
  - OR's Mapped into Union
  - AND's Mapped into Join or Selection

Chapter 19-63 Refute Incorrect Queries
- Example: E(ENAME, ENO), P(JNO, JNAME), W(ENO, JNO, DUR)
  SELECT ENAME, JNAME FROM E, P, W WHERE DUR > 27 AND DUR < 25
- Incorrect
  - Disjoint Components are Useless: Multiple Relations with Missing Joins may not be Incorrect, but may Indicate a Cartesian Product
  - Contradictory Qualification Cannot be Satisfied by any Tuple: DUR > 27 AND DUR < 25

Chapter 19-64 Simplification
- Why Simplify? The Simpler the Query, the Less Work there is and the Better the Performance
- How? Use Transformation Rules
  - Elimination of Redundancy via Idempotency Rules
    - $p_1 \land \lnot(p_1) = \text{false}$
    - $\lnot(p_1 \land p_2) = \lnot(p_1) \lor \lnot(p_2)$
    - $p_1 \lor \text{false} = p_1$
    - …
  - Application of Transitivity
  - Use of Integrity Rules
- Example: x > a AND x > b, e.g., DUR > 27 AND DUR > 25 Simplifies to DUR > 27

Chapter 19-65 Restructuring
- Convert Relational Calculus to Relational Algebra
- Make Use of Query Trees
- Example: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years.
- Top of the Query Tree: Project $\pi_{ENAME}$ over Select $\sigma_{(DUR=12 \lor DUR=24) \land JNAME=\text{"CAD/CAM"} \land ENAME \neq \text{"J. Doe"}}$
SELECT ENAME FROM E, W, P WHERE E.ENO=W.ENO AND W.JNO=P.JNO AND E.ENAME<>"J.
Doe" AND P.JNAME="CAD/CAM" AND (W.DUR=12 OR W.DUR=24)
- Query Tree, Top to Bottom: Project, then Select, then Join on JNO of P with the Join on ENO of W and E

Chapter 19-66 Query Optimization Objectives
- Improving Performance
- Arriving at a Query Plan of Execution
- Analyzing the Relational Algebra Query
  - Replace Costly Operations
  - Do Selections and Projections Early
- Optimization Heuristics for the Relational Algebra
  - Performing Selection and Projection Before Join
  - Combining Several Selections Over a Single Relation Into One Selection
  - Find Common Subexpressions
- Algebraic Rewriting/Transformation Rules
  - General Transformation Rules for Relational Algebra (Equivalence-preserving Algebraic Rewriting Rules)

Chapter 19-67 Query Optimization: An Example
- Why is it important?
  SELECT ENAME FROM E, W WHERE E.ENO = W.ENO AND W.RESP = "Manager"
- Strategy 1: $\pi_{ENAME}(\sigma_{RESP=\text{"Manager"} \land E.ENO=W.ENO}(E \times W))$
- Strategy 2: $\pi_{ENAME}(E \bowtie_{ENO} \sigma_{RESP=\text{"Manager"}}(W))$

Chapter 19-68 Cost of Alternatives
- Assume:
  - card(E) = 4,000; card(W) = 10,000
  - 10% of Tuples in W Satisfy RESP="Manager" (Selection Generates 1,000 Tuples)
  - Execution Time Proportional to the Sum of the Cardinalities of the Temporary Relations
  - Searching is Done by Sequential Scanning
- Strategy 1
  - Cartesian Product (4,000 x 10,000) = 40,000,000
  - Search over All = 40,000,000
  - Total = 80,000,000
- Strategy 2
  - Selection over W = 10,000
  - Join (4,000 x 1,000) = 4,000,000
  - Total = 4,010,000

Chapter 19-69 General Query Optimization Strategy
- Perform Selections Early
  - Yields Smaller Intermediate Results
  - Direct Impact on Subsequent Join/Cartesian Prod.
- Combine Selections with a Prior Cartesian Product into a Theta or Equi Join
  - Join is a Cheaper Operation
- Combine (Cascade) Selections and Projections
  - $\pi_{A}(\pi_{A,B}(R)) \equiv \pi_{A}(R)$
  - $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_1 \land p_2}(R)$
  - This Results in One Pass Instead of Two over the Table

Chapter 19-70 General Query Optimization Strategy
- Identify Common Subexpressions
  - Compute Once and Store
  - Use Stored Version for Subsequent Times
  - Often Useful When Views are Employed
- Preprocess Data via Sorts and Indexes
  - Speeds up Searches and Joins by Limiting Scope
- Evaluate and Assess Different Options
  - For Cartesian Product, Use Smaller Relation for Comparison
  - Use System Catalog (Meta-data) to Influence the Order in the Query Execution Plan

Chapter 19-71 Relational Algebra Transformations
1. Cascade of Selection
   $\sigma_{p_1 \land p_2 \land \dots \land p_n}(R) \equiv \sigma_{p_1}(\sigma_{p_2}(\dots(\sigma_{p_n}(R))\dots))$
2. Commutativity of Selection
   $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_2}(\sigma_{p_1}(R))$
   $\sigma_{p_1 \lor p_2}(R) \equiv \sigma_{p_1}(R) \cup \sigma_{p_2}(R)$
3. Cascade of Projection
   $\pi_{A_1}(\pi_{A_2}(\dots(\pi_{A_n}(R))\dots)) \equiv \pi_{A_1}(R)$ if $A_1 \subseteq A_2 \subseteq \dots \subseteq A_n$
4. Commuting Selection with Projection (p Involves Only the A's)
   $\pi_{A_1, A_2, \dots, A_n}(\sigma_p(R)) \equiv \sigma_p(\pi_{A_1, A_2, \dots, A_n}(R))$

Chapter 19-72 Relational Algebra Transformations
5. Commutativity of Theta Join and Cartesian Product
   $R \bowtie_A S \equiv S \bowtie_A R$
   $R \times S \equiv S \times R$
6. Commuting Selection with Theta Join (Cartesian)
   $\sigma_{p(A)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times S$, where A is defined on R only
   $\sigma_{p(A) \land p(B)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times (\sigma_{p(B)}(S))$, where A is defined on R and B is defined on S
   Also Holds for Theta Join
7. Commuting Projection with Theta Join (Cartesian)
   $\pi_C(R \times S) \equiv \pi_A(R) \times \pi_B(S)$ where $A \cup B = C$; A are the Attributes in C for R and B are the Attributes in C for S

Chapter 19-73 Relational Algebra Transformations
8. Commutativity of Set Operations
   $R \cup S \equiv S \cup R$
   $R \cap S \equiv S \cap R$
9. Associativity of Set Operations
   $(R \cup S) \cup T \equiv R \cup (S \cup T)$
   $(R \cap S) \cap T \equiv R \cap (S \cap T)$
   $(R \times S) \times T \equiv R \times (S \times T)$
   $(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)$
10. Commuting Select with Set Operations
   $\sigma_{p(A_i)}(R \cup T) \equiv \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(T)$ where $A_i$ is defined on both R and T
   $\sigma_{p(A_i)}(R \cap T) \equiv \sigma_{p(A_i)}(R) \cap \sigma_{p(A_i)}(T)$ where $A_i$ is defined on both R and T

Chapter 19-74 Relational Algebra Transformations
11.
Commuting Projection with Union
   $\pi_C(R \bowtie_{q(A_j, B_k)} S) \equiv \pi_{A'}(R) \bowtie_{q(A_j, B_k)} \pi_{B'}(S)$
   $\pi_C(R \cup S) \equiv \pi_{A'}(R) \cup \pi_{B'}(S)$
   where R[A] and S[B], $C = A' \cup B'$, $A' \subseteq A$, $B' \subseteq B$
12. Converting Selection/Cartesian Into Theta Join
   $\sigma_C(R \times S) \equiv R \bowtie_C S$

Chapter 19-75 Using Heuristics in Query Optimization
- Process for heuristic optimization:
  1. The parser of a high-level query generates an initial internal representation.
  2. Heuristic rules are applied to optimize the internal representation.
  3. A query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query.
- The main heuristic is to apply first the operations that reduce the size of intermediate results.
  - E.g., apply SELECT and PROJECT operations before applying the JOIN or other operations.

Chapter 19-76 Using Heuristics in Query Optimization (2)
- Query tree: A tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.
- An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.
- Query graph: A graph data structure that corresponds to a relational calculus expression. It does not indicate the order in which operations are performed. There is only a single graph corresponding to each query.

Chapter 19-77 Using Heuristics in Query Optimization
- Heuristic Optimization of Query Trees: The same query could correspond to many different relational algebra expressions, and hence many different query trees.
- Remember – Not One Soln to Each Query on Exam
- The task of heuristic optimization of query trees is to find a final query tree that is efficient to execute.
- Example: Q: SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN AND BDATE > '1957-12-31';

Chapter 19-78 Heuristics Algebraic Optimization Concepts
- Using the Cascade of Selections Rule, Break up Any Selections With Conjunctive Conditions Into a Cascade of Selections
  - Allows More Freedom in Moving Selections Down Different Branches of the Tree
- Using the Commutativity of Selections with Other Operations Rules, Move Each Selection Down the Query Tree as Far as Possible
- If Possible, Combine a Cartesian Product With a Selection Into a Join

Chapter 19-79 Heuristics Algebraic Optimization Concepts
- Using Associativity of Binary Operations, Rearrange the Leaf Nodes So That the Most Restrictive Selections Are Executed First
  - The Fewer Tuples the Resulting Relation Contains, the More Restrictive the Selection
  - Reducing the Size of Intermediate Results Improves Performance
- Using Cascade of Projections and Commutativity of Projections with Other Operations, Move Projections Down the Query Tree as Far as Possible
- Identify Subtrees that Represent Groups of Operations that can be Executed by a Single Algorithm

Chapter 19-80 Summary of All Rules
1. Cascade of Selection
2. Commutativity of Selection
3. Cascade of Projection
4. Commuting Selection with Projection (p Involves Only the A's)
5. Commutativity of Theta Join and Cartesian Product
6. Commuting Selection with Theta Join (Cartesian)
7. Commuting Projection with Theta Join (Cartesian)
8. Commutativity of Set Operations
9. Associativity of Set Operations
10. Commuting Select with Set Operations
11. Commuting Projection with Union
12. Converting Selection/Cartesian Into Theta Join

Chapter 19-81 Heuristic Algebraic Optimization Algorithm
- Use Rule 1 to Break up Selects with Conjunctions into a Cascade to Move them Down the Query Tree
- Use Rules 2, 4, 6, and 10 to Commute Select with Project, Join, Cart.
Prod., Union, and Intersection
Use Rule 5 (Commute) and 9 (Associative) to Rearrange the Leaf Nodes of Query Tree to:
    Most Restrictive Select Executed First
    Avoid Cartesian Product in Leaf Nodes
Use Rule 12 to Convert a Select/Cart Prod to Join
Use Rules 3, 4, 7, and 11 to Cascade and Commute Project - Pushing Down Tree as Far as Possible
Identify Subtrees that Can Execute as Independent Algorithms (Set of Operations)
Chapter 19-82

Heuristic Optimization: Example
Relations: E(ENAME, ENO)  P(JNO, JNAME)  W(ENO, PNO, DUR)
Canonical query tree at the end of query preprocessing phase:
    π_{ENAME}(σ_{(DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME="J. DOE"}(P ⋈_{JNO} (W ⋈_{ENO} E)))
Chapter 19-83

Heuristic Optimization: Example
Use cascading of selections rule to decompose selections:
    π_{ENAME}(σ_{DUR=12 OR DUR=24}(σ_{JNAME="CAD/CAM"}(σ_{ENAME="J. DOE"}(P ⋈_{JNO} (W ⋈_{ENO} E)))))
Chapter 19-84

Heuristic Optimization: Example
Push selection down using commutativity of selection over join:
    π_{ENAME}(σ_{DUR=12 OR DUR=24}(σ_{JNAME="CAD/CAM"}(P ⋈_{JNO} (W ⋈_{ENO} σ_{ENAME="J. Doe"}(E)))))
Chapter 19-85

Heuristic Optimization: Example
Push selection down using commutativity of selection over join:
    π_{ENAME}(σ_{DUR=12 OR DUR=24}(σ_{JNAME="CAD/CAM"}(P) ⋈_{JNO} (W ⋈_{ENO} σ_{ENAME="J. Doe"}(E))))
Chapter 19-86

Heuristic Optimization: Example
Push selection down:
    π_{ENAME}(σ_{JNAME="CAD/CAM"}(P) ⋈_{JNO} (σ_{DUR=12 OR DUR=24}(W) ⋈_{ENO} σ_{ENAME="J. Doe"}(E)))
Chapter 19-87

Heuristic Optimization: Example
Do early projection:
    π_{ENAME}(
        π_{JNO}(σ_{JNAME="CAD/CAM"}(P))
        ⋈_{JNO}
        π_{JNO,ENAME}(π_{JNO,ENO}(σ_{DUR=12 OR DUR=24}(W)) ⋈_{ENO} π_{ENO,ENAME}(σ_{ENAME="J. Doe"}(E))))
Chapter 19-88

Heuristic Optimization: Example
Identify subtrees that can be implemented in one algorithm
    [Figure: the early-projection tree from the previous slide, with each single-algorithm subtree outlined]
Chapter 19-89

Heuristic Optimization: A Second Example
    BOOKS(Title, Author, Pname, LC_No)
    PUBLISHERS(Pname, Paddr, Pcity)
    BORROWERS(Name, Addr, City, Card_No)
    LOANS(Card_No, LC_No, Date)
Let XLOANS = π_S(σ_F(Loans × Borrowers × Books)) where:
    S = {Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date}
    F = {Borrower.Card_No = Loans.Card_No ∧ Books.LC_No = Loans.LC_No}
Chapter 19-90

Heuristic Optimization: A Second Example
Query tree for XLOANS:
    π_S(σ_F(Books × (Loans × Borrower)))
Chapter 19-91

Heuristic Optimization: A Second Example
Query = π_{Title}(σ_{Date 1/1/88}(XLOANS))
    π_{Title}(σ_{Date 1/1/88}(π_S(σ_F(Books × (Loans × Borrower)))))
Chapter 19-92

Heuristic Optimization: A Second Example
Try to Cascade
    π_{Title}(σ_{Date 1/1/88}(π_S(σ_F(Books × (Loans × Borrower)))))
Chapter 19-93

Heuristic Optimization: A Second Example
Commute Select and Project
    π_{Title}(π_S(σ_{Date 1/1/88}(σ_F(Books × (Loans × Borrower)))))
Chapter 19-94

Heuristic Optimization: A Second Example
Commute Select and Select
    π_{Title}(π_S(σ_F(σ_{Date 1/1/88}(Books × (Loans × Borrower)))))
Chapter 19-95

Heuristic Optimization: A Second Example
Commute Select and Cartesian Product Two Levels Down
    π_{Title}(π_S(σ_F(Books × (σ_{Date 1/1/88}(Loans) × Borrower))))
Chapter 19-96

Heuristic Optimization: A Second Example
Try to Cascade: split F into its two conjuncts
    π_{Title}(π_S(σ_{Books.LC_No = Loans.LC_No}(σ_{Borrower.Card_No = Loans.Card_No}(Books × (σ_{Date 1/1/88}(Loans) × Borrower)))))
Chapter 19-97

Heuristic Optimization: A Second Example
Commute Select and Cartesian Product One Level Down
    π_{Title}(π_S(σ_{Books.LC_No = Loans.LC_No}(Books × σ_{Borrower.Card_No = Loans.Card_No}(σ_{Date 1/1/88}(Loans) × Borrower))))
What's Next?
Chapter 19-98

Heuristic Optimization: A Second Example
Combine Projections
    π_{Title}(π_S(σ_{Books.LC_No = Loans.LC_No}(Books × σ_{Borrower.Card_No = Loans.Card_No}(σ_{Date 1/1/88}(Loans) × Borrower))))
Chapter 19-99

Heuristic Optimization: A Second Example
    π_{Title}(σ_{Books.LC_No = Loans.LC_No}(Books × σ_{Borrower.Card_No = Loans.Card_No}(σ_{Date 1/1/88}(Loans) × Borrower)))
What is Still a Problem?
    We are Not Projecting, so All Attributes are Still Collected Until the Final Project!
Chapter 19-100

Heuristic Optimization: A Second Example
Add Strategic Projections to Send Only the Minimum Up the Tree as Needed for Join/Result Set
    π_{Title}(σ_{Books.LC_No = Loans.LC_No}(
        π_{Books.LC_No, Title}(Books)
        × π_{Loans.LC_No}(σ_{Borrower.Card_No = Loans.Card_No}(
            π_{Loans.LC_No, Loans.Card_No}(σ_{Date 1/1/88}(Loans)) × π_{Borr.Card_No}(Borrower)))))
Chapter 19-101

Heuristic Optimization: A Second Example
What is the Final Step?
    Combine Select and Cartesian Product
    Result: Equijoins!
Chapter 19-102

Heuristic Optimization: A Second Example
FINAL TREE with Equijoins!
    π_{Title}(
        π_{Books.LC_No, Title}(Books)
        ⋈_{LC_No}
        π_{Loans.LC_No}(π_{Borr.Card_No}(Borrower) ⋈_{Card_No} π_{Loans.LC_No, Loans.Card_No}(σ_{Date 1/1/88}(Loans))))
Chapter 19-103

Heuristic Optimization: A Third Example
Heuristic Optimization of Query Trees:
The same query could correspond to many different relational algebra expressions, and hence many different query trees.
The task of heuristic optimization of query trees is to find a final query tree that is efficient to execute.
Example:
    Q: SELECT LNAME
       FROM EMPLOYEE, WORKS_ON, PROJECT
       WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN AND BDATE > '1957-12-31';
Chapter 19-104

Heuristic Optimization: A Third Example
What's one Approach?
Chapter 19-105

Heuristic Optimization: A Third Example
Moving Selects Down
Is this Optimal?
Chapter 19-106

Heuristic Optimization: A Third Example
No! Prior Version Retrieved All Employees Without First Applying the PNAME Select
Chapter 19-107

Heuristic Optimization: A Third Example
Replace CART PRODUCT Plus SELECT with JOIN!
What's left to do?
Chapter 19-108

Heuristic Optimization: A Third Example
Chapter 19-109

Heuristic Optimization: A Fourth Example
    Sailors (sid, sname, rating, age)
    Boats (bid, bname, color)
    Reserves (sid, bid, day, rname)
Query: Find all Sailors who are younger than 30, have a rating of at least 11, and have Reserved red Boats.
    SELECT S.sid, S.sname, S.age
    FROM Sailors S, Boats B, Reserves R
    WHERE B.bid = R.bid AND S.sid = R.sid AND S.rating >= 11 AND B.color = 'Red' AND S.age < 30;

    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid ∧ S.sid=R.sid ∧ S.age<30 ∧ B.color="Red" ∧ S.rating≥11}(B × S × R))
Chapter 19-110

Heuristic Optimization: A Fourth Example
Initial tree:
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid ∧ S.sid=R.sid ∧ S.age<30 ∧ S.rating≥11 ∧ B.color="Red"}(Boats × (Reserves × Sailors)))
Step 1 - Break up Selects
Chapter 19-111

Heuristic Optimization: A Fourth Example
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid ∧ S.sid=R.sid}(σ_{S.age<30 ∧ S.rating≥11}(σ_{B.color="Red"}(Boats × (Reserves × Sailors)))))
Step 2 - Move that Boats Select
Chapter 19-112

Heuristic Optimization: A Fourth Example
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid ∧ S.sid=R.sid}(σ_{S.age<30 ∧ S.rating≥11}(σ_{B.color="Red"}(Boats) × (Reserves × Sailors))))
Step 3 - Move that Sailor Select
Chapter 19-113

Heuristic Optimization: A Fourth Example
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid ∧ S.sid=R.sid}(σ_{B.color="Red"}(Boats) × (Reserves × σ_{S.age<30 ∧ S.rating≥11}(Sailors))))
Step 4 - Introduce Projections
Chapter 19-114

Heuristic Optimization: A Fourth Example
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid ∧ S.sid=R.sid}(
        π_{B.bid}(σ_{B.color="Red"}(Boats))
        × (π_{R.sid, R.bid}(Reserves) × π_{S.sid, S.sname, S.age}(σ_{S.age<30 ∧ S.rating≥11}(Sailors)))))
Step 5 - What's Next Step?
Chapter 19-115

Heuristic Optimization: A Fourth Example
Step 6 - Move Down S.sid=R.sid
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid}(
        π_{B.bid}(σ_{B.color="Red"}(Boats))
        × σ_{S.sid=R.sid}(π_{R.sid, R.bid}(Reserves) × π_{S.sid, S.sname, S.age}(σ_{S.age<30 ∧ S.rating≥11}(Sailors)))))
Step 7 - What's Next Step?
Chapter 19-116

Heuristic Optimization: A Fourth Example
Step 7 - Combined for Equi Join
    π_{S.sid, S.sname, S.age}(σ_{B.bid=R.bid}(
        π_{B.bid}(σ_{B.color="Red"}(Boats))
        × (π_{R.sid, R.bid}(Reserves) ⋈_{S.sid=R.sid} π_{S.sid, S.sname, S.age}(σ_{S.age<30 ∧ S.rating≥11}(Sailors)))))
Step 8 - What's Final Step?
Chapter 19-117

Heuristic Optimization: A Fourth Example
Step 8 - Introduce Final EquiJoin
    π_{S.sid, S.sname, S.age}(
        π_{B.bid}(σ_{B.color="Red"}(Boats))
        ⋈_{B.bid=R.bid}
        (π_{R.sid, R.bid}(Reserves) ⋈_{S.sid=R.sid} π_{S.sid, S.sname, S.age}(σ_{S.age<30 ∧ S.rating≥11}(Sailors))))
Chapter 19-118

Converting Relational Algebra to Query Tree
    Movies1997 = π_{Lname,Fname,State}(σ_{Person.PersonID = AllActors.PersonID ∧ Movies1997.ShowID = MovieRoles.ShowID ∧ Year=1997}(Person × Movies × MovieRoles))
    [Figure: query tree with the projection at the root, the selection below it, and Cartesian products over Person, Movies, MovieRoles]
Chapter 19-119

Converting Relational Algebra to Query Tree
    FriendsActors = π_{Lname,Fname,RLName,RFName}(σ_{ShowName=Friends ∧ TVRoles.ShowID = Friends.ShowID ∧ EpisodeID>10 ∧ EpisodeID<26 ∧ Person.PersonID = RoleNames.PersonID}(TVShows × TVRoles × Roles × Person))
    [Figure: query tree with the projection at the root, the selection below it, and Cartesian products over TVShows, TVRoles, Roles, Person]
Chapter 19-120

Heuristics Query Optimization: Summary
First Apply Operations that Reduce the Size of Intermediate Results
    Move Selections and Projections Down the Tree as Far as Possible
    Early Selections Reduce the Number of Tuples
    Early Projections Reduce the Number of Attributes
Selection and Join Should be Executed Before Other Similar Operations.
    This is Accomplished by Reordering the Leaf Nodes of the Tree Among Themselves and Adjusting the Rest of the Tree Appropriately
Chapter 19-121

Cost-Based Optimization
Reduce Defined Cost of Executing Queries
What is Involved in the Cost of Executing a Query?
    Access Cost to Secondary Storage
        Search for Data Block (Index)
        Read/Write Index and Data Blocks
    Storage Cost
        Index and Data Blocks
        Intermediate Files
    Computation Cost
        Query Planning - Optimization Effort
        Record Search, Sort, Merge
        Actual Transaction/Query Operations
    Communications Cost
        Transfer of Results to the User
Chapter 19-122

Complexity of Relational Operations
Assuming Relations of Cardinality n
Sequential Scan of Data in each Relation
Complexity of Each Operation is Indicated
Avoid Cartesian Product at All Costs!
    Operation                                       Complexity
    Select, Project (w/o duplicate elimination)     O(n)
    Project (with duplicate elimination), Group     O(n log n)
    Join, Division, Set Operators                   O(n log n)
    Cartesian Product                               O(n²)
Chapter 19-123

Cost-Based Optimization
To Understand Cost-Based Optimization, we Must Focus on the Implementation Strategy of:
    Select
    Project
    Join
For Select and Project - There is a Fixed Cost that we Must Live With
For Join Implementation Strategy
    Different Join Strategies
    Objective: Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra Heuristic Optimization Can Complement One Another
Chapter 19-124

Implementation of SELECT
Principles
    Equality Eliminates Many Tuples
    Index Focuses and Limits Search Scope
Sequential Scan
    Brute Force: Search All Records to Find Matching Ones
Binary Search
    Equality Comparison on a Key Attribute
Primary Index or Hash Key for Single Record
    Equality Comparison on a Key Attribute With Primary Index or Hash Key
    Go Directly to Record; No Need to Scan Entire Table
    Cost to Maintain Index/Hash
Chapter 19-125

Implementation of SELECT
Primary Index for Multiple Records
    Use Primary Key to Find the Equality Attribute
    Go Forward (> or ≥) or Backward (< or ≤) According to the Comparison Operator
Clustering Index for Multiple Records
    Equality Comparison on a Non-key Attribute With a Clustering Index (e.g., Sort-Merge Algorithm)
Secondary Index
    Equality or Range Queries
Primary Indexes Play a Role Similar to Searching a Sorted Array
    We'll Discuss Indexing Techniques at a Later Time
Chapter 19-126

Recall B+ Tree - Find Leaf and Go L or R
Chapter 19-127

Recall B+ Tree - Find Leaf and Go L or R
Chapter 19-128

Implementation of SELECT
Conjunctive Selection (C1 ∧ C2 ∧ … ∧ CN)
    If One of the Conjuncts has a Good Access Path, Use it and Check the Other Conjuncts for Each of These Records
    Pick the One that is Based on a "Concrete" Value
Composite Index
    If an Index has Been Established Jointly for a Number of Attributes in the Conjunct Equality Condition
Intersection of Pointers
    If Secondary Indexes Exist on All or Most of the Attributes in the Conjunct and the Indexes Include Record Pointers
    Retrieve Each Attribute Using These Indexes and Then Take Their Intersection
Chapter 19-129

Implementing PROJECT
If <Attribute List> Includes Key
    Simple, Since the Cardinality of the Result is the Same as the Cardinality of the Original Relation
    No Need to Remove Duplicates - Key Attribute
If <Attribute List> Does Not Include Key
    Duplicates Allowed
    Duplicate Elimination
        Sort After Projection and then Eliminate Consecutively Appearing Duplicates
        See Textbook for Algorithms
        Use Hashing: Hash Each Record Into a Bucket and Check Against Records Already in That Bucket
Size Estimation: card(π_A(R)) = card(R)
Chapter 19-130

Implementing JOIN
Nested Loop
    Simple Iteration and Block-Oriented Iteration
    For Each Block in R do
        Retrieve Every Record from S and Test Join Condition
    An Index for S may Speed up the Inner Loop
    Smaller Relation should be Outer Loop
Calculation of I/O
    Let b_o (b_i) be the Number of Blocks Taken up by the Outer (Inner) Relation
    Let n_B (> 1) be the Buffer Size (in Blocks) Devoted to Arguments
    Let b_R be the Size of the Resulting Relation (in Blocks)
    Total no. of Block Accesses = b_o + ⌈b_o / (n_B − 1)⌉ × b_i + b_R
Chapter 19-131

Implementing JOIN
Sort-Merge Join
    Physically Sort Relations R and S
    Scan R and S in the Sorted Order and Merge
    See Algorithm in Textbook
    If Files are Not Physically Sorted, but Sorted on the Join Attributes, a Variation May be Used
        Quite Inefficient Since Records are Scattered Over the Disk
    Total Number of Block Accesses = b_o + b_i + b_o log2 b_o + b_i log2 b_i + b_R
Chapter 19-132

Implementing JOIN
Hash Join
    Hash R and S Using the Same Hash Function
    If Hash File Can Be Memory-Resident, it is Efficient and Easy to Implement
    If Buffer Space is Insufficient, then Part of the Hash File has to be on Disk
        Various Optimizations for this Case
        Hybrid Hash Join is Described in the Book
    Again - Biggest Problem is Overhead Associated with Maintaining Hash Index Over Time
Chapter 19-133

Access Using Indices: Estimation of Costs
Example: Given a bank database consisting of the following three relation schemas:
    Branch(bank-name, assets, bank-city)
    Deposit(bank-name, account-number, customer-name, balance)
    Customer(customer-name, street, zipcode, customer-city)
Consider the SQL query for the bank database:
    Select account-number
    From Deposit
    Where bank-name = "BofA" and customer-name = "Bill" and balance > 1000;
Chapter 19-134

Heuristic Optimization
Use Cascading of Selections Rule to Decompose; Three Logical Query Plan Alternatives Are Obtained
Objective - Choose the "Best" Alternative in Terms of Execution Time (Block Reads)
What should be the Focus in Select Order?
    Plan 1: π_{account-number}(σ_{bank-name="BofA"}(σ_{customer-name="Bill"}(σ_{balance>1000}(Deposit))))
    Plan 2: π_{account-number}(σ_{balance>1000}(σ_{customer-name="Bill"}(σ_{bank-name="BofA"}(Deposit))))
    Plan 3: π_{account-number}(σ_{bank-name="BofA"}(σ_{balance>1000}(σ_{customer-name="Bill"}(Deposit))))
Chapter 19-135

Access Using Indices: Estimation of Costs
Assumptions:
    100 Different Banks (bank-name)
    1000 Customers (on average) per bank
    Balance could range from 0 to 10,000 dollars
    Branch(bank-name, assets, bank-city)
    Deposit(bank-name, account-number, customer-name, balance)
    Customer(customer-name, street, zipcode, customer-city)
    Select account-number
    From Deposit
    Where bank-name = "BofA" and customer-name = "Bill" and balance > 1000;
Chapter 19-136

Estimation of Cost of Access - Version 1
    π_{account-number}(σ_{bank-name="BofA"}(σ_{customer-name="Bill"}(σ_{balance>1000}(Deposit))))
Recall Assumptions
    100 Banks
    1000 Customers/Bank
    0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit? 100,000
What Does balance > 1000 do?
    Retrieve 90% of Accounts
    All Banks, All Customers
What Does customer-name = "Bill" do?
    All Customers Named Bill Regardless of the Bank
Is this a Good Strategy?
Chapter 19-137

Estimation of Cost of Access - Version 2
    π_{account-number}(σ_{balance>1000}(σ_{customer-name="Bill"}(σ_{bank-name="BofA"}(Deposit))))
Recall Assumptions
    100 Banks
    1000 Customers/Bank
    0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit: 100,000
What Does bank-name = "BofA" do?
    Retrieves 1000 Tuples for BofA on Average
What Does customer-name = "Bill" do?
    The Customer "Bill"
What Does balance > 1000 do?
Is this a Good Strategy?
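The three selection orderings can be compared numerically with a small sketch. It is a rough estimate, not the course's costing method: the selectivities below are derived from the stated assumptions (100 banks, 1000 customers per bank, balances spread evenly over 0 to 10,000, one "Bill" per bank), the conditions are treated as independent, and the names `SELECTIVITY`, `intermediate_sizes`, and `VERSION_1` to `VERSION_3` are hypothetical.

```python
from fractions import Fraction

TUPLES_IN_DEPOSIT = 100 * 1000  # 100 banks x 1000 customers per bank

# Assumed fraction of tuples that survive each condition (exact rationals).
SELECTIVITY = {
    "bank-name = BofA": Fraction(1, 100),       # one bank out of 100
    "customer-name = Bill": Fraction(1, 1000),  # one Bill per bank, 100 tuples overall
    "balance > 1000": Fraction(9, 10),          # 90% of evenly spread balances
}

# Order in which each version applies its selections, bottom of the tree first.
VERSION_1 = ["balance > 1000", "customer-name = Bill", "bank-name = BofA"]
VERSION_2 = ["bank-name = BofA", "customer-name = Bill", "balance > 1000"]
VERSION_3 = ["customer-name = Bill", "balance > 1000", "bank-name = BofA"]


def intermediate_sizes(order, n=TUPLES_IN_DEPOSIT):
    """Expected number of tuples remaining after each selection in `order`."""
    sizes = []
    for condition in order:
        n = n * SELECTIVITY[condition]
        sizes.append(n)
    return sizes


for name, order in [("Version 1", VERSION_1),
                    ("Version 2", VERSION_2),
                    ("Version 3", VERSION_3)]:
    sizes = [float(s) for s in intermediate_sizes(order)]
    print(name, sizes)
```

Every ordering ends with the same expected result size; what differs is how many tuples the later selections still have to examine, which is the quantity the slides ask about.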
Chapter 19-138

Estimation of Cost of Access - Version 3
    π_{account-number}(σ_{bank-name="BofA"}(σ_{balance>1000}(σ_{customer-name="Bill"}(Deposit))))
Recall Assumptions
    100 Banks
    1000 Customers/Bank
    0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit: 100,000
What Does customer-name = "Bill" do?
    Retrieves 100 Tuples, One per Bank
What Does balance > 1000 do?
    Do they Have Enough Money?
What Does bank-name = "BofA" do?
Is this a Good Strategy?
Chapter 19-139

Join Strategies
Several Factors Influence the Selection of an Optimal Join Strategy
    The Physical Order of Tuples in a Relation
    The Presence of Indices and the Type of Index (Clustering or Nonclustering)
    The Cost of Computing a Temporary Index for the Sole Purpose of Processing One Query
Example: Consider the natural join Deposit ⋈ Customer
    • nDeposit = 10,000
    • nCustomer = 200
    • 20 tuples fit in one block for both relations
    • buffersize = 2 blocks
Chapter 19-140

Join Strategies: Block-Oriented Iteration
Block-oriented Iteration:
    • Process the relations on a per-block basis rather than on a per-tuple basis
    • Using this approach, a major saving in block accesses results
Example: Consider the natural join Deposit ⋈ Customer
    nDeposit = 10,000
    nCustomer = 200
    20 tuples fit in one block for both relations
Case 1: outer loop: Deposit, inner loop: Customer
    • Reading Customer once for every block of Deposit tuples requires (200/20) * (10,000/20) = 10 * 500 = 5000 block reads
    • Reading the Deposit relation requires 10,000/20 = 500 block reads
    • The total cost in terms of block accesses is 5500
        -> 5000 block accesses to Customer and
        -> 500 block accesses to Deposit
Chapter 19-141

Join Strategies: Block-Oriented Iteration
Case 2: outer loop: Customer, inner loop: Deposit
    • Reading Deposit once for every block of Customer tuples requires (10,000/20) * (200/20) = 5000 block reads
    • Reading the Customer relation requires 200/20 = 10 block reads
    • The total cost in terms of block accesses is 5010
        ==> 5000 accesses to Deposit blocks and
        ==> 10 accesses to Customer blocks
Case 3: If the Customer relation is small enough to fit in main memory, our strategy requires only
        ==> 500 blocks to read the Deposit relation and
        ==> 10 blocks to read the Customer relation.
    The total comes to 510 blocks
Chapter 19-142

Query Execution Cost: Summary
    Access Cost to Secondary Storage
        Search for Data Block (Index)
        Read/Write Index and Data Blocks
    Storage Cost
        Index and Data Blocks
        Intermediate Files
    Computation Cost
        Query Planning
        Record Search, Sort, Merge
        Actual Transaction/Query Operations
    Communications Cost
        Data Transfer Across a Network
Chapter 19-143

Access Plan
An Access Plan is a Concrete Query Processing Plan which Presents a Detailed Strategy for Processing a Query
The Main Cost Factors to Be Considered Include
    The Relational Operations to be Performed
    Indices to be Used
    The Order in Which Tuples are to be Accessed
    The Order in Which Operations are to be Performed
Typical Focus is on Join and Optimizing its Execution, Particularly when Multiple Tables are Involved
Chapter 19-144

Statistics
The Following are Kept in the System Catalog for Optimization Purposes
    File Parameters: Block Size
    Number of Tuples in Each Relation
    Size of Tuples
    Key Fields, Indices
    Number of Levels in an Index
    Highest Key, Lowest Key
    Number of Distinct Values
    (Maybe) Others: Frequency of Operations, Join Keys, Etc.
All DBMSs Keep the First Four, Many Keep All
Chapter 19-145

Join Ordering
Given R ⋈ S ⋈ T ⋈ W, Determine the Best Ordering Alternative
    ((R ⋈ S) ⋈ T) ⋈ W
    (R ⋈ (S ⋈ T)) ⋈ W
    R ⋈ (S ⋈ (T ⋈ W))
    ((R ⋈ T) ⋈ S) ⋈ W
    ((R ⋈ W) ⋈ S) ⋈ T
    …
    (R ⋈ S) ⋈ (T ⋈ W)
Ordering is Critical to Arrive at "Best" Strategy for Execution, Particularly as
    Number of Relations Increases
    Size of Relations (Tuples/Blocks) Increases
Chapter 19-146

Query Optimization Search Strategies
Exhaustive Search
    "Optimal"
    Combinatorial Complexity in the Number of Relations
Heuristics
    Not Optimal
    Group Common Sub-expressions
    Perform Selection, Projection First
    Replace a Join by a Series of Semi-joins
    Reorder Operations to Reduce Intermediate Relation Size
    Optimize Individual Operations
Chapter 19-147

Query Optimization Timing Issues
Static
    Compilation ==> Optimize Prior to the Execution
    Difficult to Estimate the Size of the Intermediate Results ==> Error Propagation
    Can Amortize Over Many Executions
Dynamic
    Run Time Optimization
    Exact Information on the Intermediate Relation Sizes
    Have to Reoptimize for Multiple Executions
Hybrid
    Compile Using a Static Algorithm
    If the Error in Estimated Sizes > Threshold, Reoptimize at Run Time
Chapter 19-148

Concluding Remarks
Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by Making a Heuristic Guess of Strategy for Each Query
    The Optimizer Considers Every Possible Strategy, but Terminates as Soon as it Determines the Cost is Greater than the Pre-chosen Strategy
    Thus Only a Few Competing Strategies Require Full Analysis of the Cost
    The Overhead of Query Optimization is Reduced
Remember - Trade-off in Optimization Time
    For PL - Optimization is Pre-Execution (Compile)
    For DB - Optimization is Part of Execution (Run)
Chapter 19-149
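As a check on the block-oriented iteration arithmetic from the Join Strategies slides, the three cases can be reproduced with a short sketch. It assumes the nested-loop cost has the form b_o + ⌈b_o / (n_B − 1)⌉ × b_i and, like the worked example, leaves out the cost of writing the result. The function names `blocks` and `nested_loop_block_accesses`, and the choice of n_B = 11 to model "Customer fits in main memory", are assumptions made for illustration.

```python
from math import ceil


def blocks(n_tuples, tuples_per_block):
    """Number of blocks needed to hold n_tuples."""
    return ceil(n_tuples / tuples_per_block)


def nested_loop_block_accesses(b_outer, b_inner, n_buffers):
    """Block reads for a block nested-loop join.

    The outer relation is read once; the inner relation is re-read once
    for every chunk of (n_buffers - 1) outer blocks held in memory.
    Result-writing cost is not included.
    """
    outer_chunks = ceil(b_outer / (n_buffers - 1))
    return b_outer + outer_chunks * b_inner


b_deposit = blocks(10_000, 20)   # 500 blocks
b_customer = blocks(200, 20)     # 10 blocks

# Case 1: Deposit outer, Customer inner, 2 buffer blocks.
case_1 = nested_loop_block_accesses(b_deposit, b_customer, n_buffers=2)
# Case 2: Customer outer, Deposit inner, 2 buffer blocks.
case_2 = nested_loop_block_accesses(b_customer, b_deposit, n_buffers=2)
# Case 3: all 10 Customer blocks stay in memory (10 blocks + 1 for Deposit).
case_3 = nested_loop_block_accesses(b_customer, b_deposit, n_buffers=11)

print(case_1, case_2, case_3)
```

Under these assumptions the three calls give 5500, 5010, and 510 block accesses, matching Cases 1 to 3, and they show why the smaller relation belongs in the outer loop and why extra buffer space pays off so sharply.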