Query Processing
Chapter 23 – Textbook 2
Lisa A Wharwood

Query Processing
• Query processing covers the activities involved in parsing, validating, optimizing, and executing a query.
• The aims of query processing are
  • to transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language
  • to execute the strategy to retrieve the required data
• An important aspect of query processing is query optimization: the activity of choosing an efficient execution strategy for processing a query.
  • The aim of query optimization is to choose the strategy that minimizes resource usage.

Techniques
• Generally, the aim is to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query (Selinger et al., 1979).
• However, resource usage may also be viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations (Valduriez and Gardarin, 1984).
• There are two main techniques
  • The first uses heuristic rules that order the operations in a query
  • The second compares different strategies based on their relative cost and selects the one that minimizes resource usage

Note!
• Both methods of query optimization depend on database statistics to evaluate properly the different options that are available.
• The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen.

Transformations
• The query
    SELECT *
    FROM Staff s, Branch b
    WHERE s.branchNo = b.branchNo AND
          (s.position = 'Manager' AND b.city = 'London');
• can be transformed into three equivalent relational algebra expressions:
  • σ_{(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)}(Staff × Branch)
  • or σ_{(position='Manager') ∧ (city='London')}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
  • or (σ_{position='Manager'}(Staff)) ⋈_{Staff.branchNo=Branch.branchNo} (σ_{city='London'}(Branch))

Four Phases of QP
• Query processing can be divided into four main phases:
  1. decomposition (consisting of parsing and validation)
  2. optimization
  3. code generation
  4. execution

Types of QP
Dynamic query optimization
• Decomposition and optimization are carried out every time the query is run.
• All information required to select an optimum strategy is up to date.
• Performance is affected, as the query has to be parsed, validated, and optimized before it can be executed.
• It may be necessary to reduce the number of execution strategies to be analyzed to achieve an acceptable overhead, which may have the effect of selecting a less than optimum strategy.
Static query optimization
• The query is parsed, validated, and optimized once.
• Runtime overhead is removed.
• More time is available to evaluate a larger number of execution strategies, thereby increasing the chances of finding a more optimum strategy.
• The execution strategy chosen as being optimal when the query is compiled may no longer be optimal when the query is run.
• Hybrid approaches can deal with this disadvantage.

First Phase - Query decomposition
Query decomposition is the first phase of query processing.

Stages of Query Decomposition
• The typical stages of query decomposition are
  1. analysis,
  2. normalization,
  3. semantic analysis,
  4. simplification, and
  5. query restructuring

Analysis
• The query is lexically and syntactically analyzed using the techniques of programming language compilers.
• Example: the following query
    SELECT staffNumber
    FROM Staff
    WHERE position > 10;
  would be rejected on two grounds:
  • In the select list, the attribute staffNumber is not defined for the Staff relation (should be staffNo).
  • In the WHERE clause, the comparison '>10' is incompatible with the data type of position, which is a variable character string.
• On completion of this stage, the high-level query has been transformed into some internal representation that is more suitable for processing.

The Tree
• The internal representation is typically a query tree, which is constructed as follows:
  • A leaf node is created for each base relation in the query.
  • A non-leaf node is created for each intermediate relation produced by a relational algebra operation.
  • The root of the tree represents the result of the query.
  • The sequence of operations is directed from the leaves to the root.

Normalization
• The normalization stage of query processing converts the query into a normalized form that can be more easily manipulated.
• The predicate (in SQL, the WHERE condition), which may be arbitrarily complex, can be converted into one of two forms by applying a few transformation rules (Jarke and Koch, 1984):
  • Conjunctive normal form
  • Disjunctive normal form

Conjunctive Normal Form
• A sequence of conjuncts that are connected with the ∧ (AND) operator.
• Each conjunct contains one or more terms connected by the ∨ (OR) operator.
• A conjunctive selection contains only those tuples that satisfy all conjuncts.
• For example: (position = 'Manager' ∨ salary > 20000) ∧ branchNo = 'B003'

Disjunctive Normal Form
• A sequence of disjuncts that are connected with the ∨ (OR) operator.
• Each disjunct contains one or more terms connected by the ∧ (AND) operator.
• A disjunctive selection contains those tuples formed by the union of all tuples that satisfy the disjuncts.
• For example: (position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')

Semantic Analysis
• Must reject normalized queries that are incorrectly formulated or contradictory.
  • A query is incorrectly formulated if components do not contribute to the generation of the result, which may happen if some join specifications are missing.
  • A query is contradictory if its predicate cannot be satisfied by any tuple.

Simplification
• The objectives are to detect redundant qualifications, eliminate common sub-expressions, and transform the query to a semantically equivalent but more easily and efficiently computed form.
• Access restrictions, view definitions, and integrity constraints are considered at this stage.
• If the user does not have the appropriate access to all the components of the query, the query must be rejected.

Query restructuring
• The final stage of query decomposition.
• The query is restructured to provide a more efficient implementation.

Heuristical Approach to Query Optimization
See the transformation rules on page 640 (page 691 in the PDF) of the text.

Canonical Tree
[Figure: canonical relational algebra tree]

Rules
a) canonical relational algebra tree;
b) relational algebra tree formed by pushing Selections down;
c) relational algebra tree formed by changing Selection/Cartesian products to Equijoins;
d) relational algebra tree formed using associativity of Equijoins;
e) relational algebra tree formed by pushing Projections down;
f) final reduced relational algebra tree formed by pushing resulting Selections down the tree.

    SELECT sname, coursename
    FROM Student, Course
    WHERE s-id > 0 AND year = 2007
      AND Student.course-id = Course.course-id
      AND duration = 4 AND intake-no = 60;

Query Processing & Optimization
• Task:
  • Find an efficient physical query plan (aka execution plan) for an SQL query
• Goal:
  • Minimize the evaluation time for the query, i.e., compute the query result as fast as possible
• Cost Factors:
  • Disk accesses, read/write operations [I/O, page transfer] (CPU time is typically ignored)

How does this happen?
• The Query Parser checks the validity of the query and then translates it into an internal form, usually a relational calculus expression or something equivalent.
• The Query Optimizer examines all algebraic expressions that are equivalent to the given query and chooses the one that is estimated to be the cheapest.
• The Code Generator or the Interpreter transforms the access plan generated by the optimizer into calls to the query processor.
• The Query Processor actually executes the query.
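As a rough illustration of rule (b) on the Rules slide (pushing Selections down), the sketch below evaluates the Staff/Branch query from the Transformations slide in two ways and counts how many tuple pairs each plan has to examine. The rows are invented toy data and the function names are hypothetical; only the attribute names and predicates come from the slides.

```python
from itertools import product

# Hypothetical toy rows, invented for illustration only. The attribute
# names (position, city, branchNo) are the ones used in the slides.
STAFF = [
    {"staffNo": "S1", "position": "Manager", "branchNo": "B1"},
    {"staffNo": "S2", "position": "Assistant", "branchNo": "B1"},
    {"staffNo": "S3", "position": "Manager", "branchNo": "B2"},
    {"staffNo": "S4", "position": "Supervisor", "branchNo": "B2"},
]
BRANCH = [
    {"branchNo": "B1", "city": "London"},
    {"branchNo": "B2", "city": "Glasgow"},
]


def canonical_plan(staff, branch):
    """Cartesian product first, then one Selection over every pair."""
    pairs = list(product(staff, branch))
    rows = [
        (s["staffNo"], b["branchNo"])
        for s, b in pairs
        if s["position"] == "Manager"
        and b["city"] == "London"
        and s["branchNo"] == b["branchNo"]
    ]
    return rows, len(pairs)


def pushed_down_plan(staff, branch):
    """Selections applied to each relation before the relations are combined."""
    managers = [s for s in staff if s["position"] == "Manager"]
    london = [b for b in branch if b["city"] == "London"]
    pairs = list(product(managers, london))
    rows = [
        (s["staffNo"], b["branchNo"])
        for s, b in pairs
        if s["branchNo"] == b["branchNo"]
    ]
    return rows, len(pairs)


if __name__ == "__main__":
    for plan in (canonical_plan, pushed_down_plan):
        rows, examined = plan(STAFF, BRANCH)
        print(plan.__name__, rows, "pairs examined:", examined)
```

Both plans return the same answer, but on this toy data the canonical plan examines 8 pairs and the pushed-down plan examines 2. The gap grows with the size of the relations, which is the motivation for applying Selections as early as possible.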
Dynamic versus static optimization
• The first three (3) phases of query processing can be referred to as compile time. For these phases there are two options:
• Dynamic query optimization: decomposition and optimization are carried out every time the query is run.
  • The information used will usually be up to date; however, the performance of the query is affected.
• Static query optimization: the query is parsed, validated, and optimized once, when the query is first submitted.
  • Removing the runtime overhead boosts performance, but the chosen execution strategy may no longer be optimal when the query is eventually run.

Query Decomposition
• The first phase of query processing.
• The typical stages of query decomposition are:
  1. Analysis
     • The query is lexically and syntactically analysed using the techniques of programming language compilers.
  2. Normalization
     • The normalization stage of query processing converts the query into a normalized form that can be more easily manipulated.
  3. Semantic analysis
     • This is the process of verifying that all queries are formulated properly and are not contradictory.
  4. Simplification
     • Identifies redundant qualifications
     • Eliminates common sub-expressions
     • Transforms the query to a semantically equivalent form that is more efficient
  5. Query restructuring
     • In the final stage of query decomposition, the query is restructured to provide a more efficient implementation.

Techniques
• There are two main techniques that are employed during query optimization.
• A technique based on heuristic rules (the Heuristical Approach)
  • Transformation rules are used to convert one relational algebra expression into an equivalent form that is known to be more efficient.
  • These rules are used to restructure a (canonical) relational algebra tree generated during query decomposition.
• A technique involving systematic evaluation
  • Cost-based query optimization
  • A query optimizer does not depend solely on heuristic rules; it also estimates and compares the costs of executing a query using different execution strategies and algorithms, and it then chooses the strategy with the lowest cost estimate.

Basic Steps in Query Processing: Optimization
• A relational algebra expression may have many equivalent expressions
  • E.g., σ_{salary<75000}(π_{salary}(instructor)) is equivalent to π_{salary}(σ_{salary<75000}(instructor))
• Each relational algebra operation can be evaluated using one of several different algorithms
  • Correspondingly, a relational algebra expression can be evaluated in many ways.
• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
  • E.g., can use an index on salary to find instructors with salary < 75000,
  • or can perform a complete relation scan and discard instructors with salary ≥ 75000

Query Optimization
• Amongst all equivalent evaluation plans, choose the one with the lowest cost.
  • Cost is estimated using statistical information from the database catalog
  • e.g. number of tuples in each relation, size of tuples, etc.
• Cost is generally measured as total elapsed time for answering the query
  • Many factors contribute to time cost
  • disk accesses, CPU, or even network communication
• Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account
  • Number of seeks * average-seek-cost
  • Number of blocks read * average-block-read-cost
  • Number of blocks written * average-block-write-cost
  • Cost to write a block is greater than cost to read a block, because data is read back after being written to ensure that the write was successful

Measures of Query Cost (Cont.)
• For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures
  • tT: time to transfer one block
  • tS: time for one seek
  • Cost for b block transfers plus S seeks: b * tT + S * tS
• We ignore CPU costs for simplicity
  • Real systems do take CPU cost into account
• We do not include the cost of writing output to disk in our cost formulae

Measures of Query Cost (Cont.)
• Several algorithms can reduce disk I/O by using extra buffer space
  • The amount of real memory available for buffers depends on other concurrent queries and OS processes, and is known only during execution
  • We often use worst case estimates, assuming only the minimum amount of memory needed for the operation is available
• Required data may be buffer resident already, avoiding disk I/O
  • But this is hard to take into account for cost estimation

Join Operation
• Several different algorithms to implement joins
  • Nested-loop join
  • Block nested-loop join
  • Indexed nested-loop join
  • Merge-join
  • Hash-join
• Choice based on cost estimate
• Examples use the following information
  • Number of records: student 5,000; takes 10,000
  • Number of blocks: student 100; takes 400

Nested-Loop Join (Cont.)
• Notation: r is the outer relation and s the inner relation; nr is the number of records in r, and br and bs are the numbers of blocks in r and s.
• In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is nr * bs + br block transfers, plus nr + br seeks
• If the smaller relation fits entirely in memory, use that as the inner relation.
  • Reduces cost to br + bs block transfers and 2 seeks
• Assuming worst case memory availability, the cost estimate is
  • with student as the outer relation: 5000 * 400 + 100 = 2,000,100 block transfers, 5000 + 100 = 5,100 seeks
  • with takes as the outer relation: 10000 * 100 + 400 = 1,000,400 block transfers and 10,400 seeks
• If the smaller relation (student) fits entirely in memory, the cost estimate will be 100 + 400 = 500 block transfers.
• The block nested-loop algorithm is preferable.

Block Nested-Loop Join (Cont.)
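The worst-case join estimates in these slides are plain arithmetic, so they can be checked with a short sketch. The function names are hypothetical, and the formulas are the worst-case ones quoted in the slides (n_r * b_s + b_r block transfers for nested-loop, b_r * b_s + b_r for block nested-loop); this is an illustration, not how any particular DBMS computes cost.

```python
# Statistics for the running example, taken from the slides.
STUDENT = {"records": 5_000, "blocks": 100}
TAKES = {"records": 10_000, "blocks": 400}


def nested_loop_cost(outer, inner):
    """Worst-case nested-loop join: one scan of inner per outer record."""
    transfers = outer["records"] * inner["blocks"] + outer["blocks"]
    seeks = outer["records"] + outer["blocks"]
    return transfers, seeks


def block_nested_loop_cost(outer, inner):
    """Worst-case block nested-loop join: one scan of inner per outer block."""
    transfers = outer["blocks"] * inner["blocks"] + outer["blocks"]
    seeks = 2 * outer["blocks"]
    return transfers, seeks


if __name__ == "__main__":
    for name, outer, inner in (
        ("student outer", STUDENT, TAKES),
        ("takes outer", TAKES, STUDENT),
    ):
        print(name, "nested-loop:", nested_loop_cost(outer, inner))
        print(name, "block nested-loop:", block_nested_loop_cost(outer, inner))
```

With student as the outer relation, the worst-case estimate drops from 2,000,100 block transfers for nested-loop to 40,100 for block nested-loop, which is why the block variant is preferred.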
• Worst case estimate: br * bs + br block transfers + 2 * br seeks
  • Each block in the inner relation s is read once for each block in the outer relation
• Best case: br + bs block transfers + 2 seeks.
• Improvements to nested-loop and block nested-loop algorithms:
  • In block nested-loop, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
    • Cost = ⌈br / (M-2)⌉ * bs + br block transfers + 2 * ⌈br / (M-2)⌉ seeks
  • If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match
  • Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement)
  • Use an index on the inner relation if available

Evaluation of Expressions
• So far: we have seen algorithms for individual operations
• Alternatives for evaluating an entire expression tree
  • Materialization: generate the results of an expression whose inputs are relations or are already computed, and materialize (store) them on disk. Repeat.
  • Pipelining: pass on tuples to parent operations even as an operation is being executed

Query trees
• Tree that represents a relational algebra expression.
• Leaves = base tables.
• Internal nodes = relational algebra operators applied to the node's children.
• The tree is executed from leaves to root.
• Example: List the last name of the employees born after 1957 who work on a project named "Aquarius".
    SELECT E.LNAME
    FROM EMPLOYEE E, WORKS_ON W, PROJECT P
    WHERE P.PNAME = 'Aquarius' AND P.PNUMBER = W.PNO
      AND W.ESSN = E.SSN AND E.BDATE > '1957-12-31'

Canonical query tree
• For a query of the form
    SELECT attributes
    FROM A, B, C
    WHERE condition
• construct the canonical query tree as follows
  • Cartesian product of the FROM-tables
  • Select with the WHERE-condition
  • Project to the SELECT-attributes
• Result: π_{attributes}(σ_{condition}(A × B × C))

Equivalent query trees
[Figure: equivalent query trees]

Query processing
[Figure: users send queries and updates to the database management system and receive answers; the DBMS processes the queries and updates and accesses the stored data in the physical database, which models the real world]

Query processing
StarsIn( movieTitle, movieYear, starName )
MovieStar( name, address, gender, birthdate )
    SELECT movieTitle
    FROM StarsIn
    WHERE starName IN ( SELECT name
                        FROM MovieStar
                        WHERE birthdate LIKE '%1960' );
[Figure: canonical query tree (usually very inefficient)]

Query optimizer
• Compare the estimated costs of different execution plans and choose the cheapest.
• The cost estimate decomposes into the following components.
  • Access cost to secondary storage.
    • Depends on the access method and file organization. Leading term for large databases.
  • Storage cost.
    • Storing intermediate results on disk.
  • Computation cost.
    • In-memory searching, sorting, computation. Leading term for small databases.
  • Memory usage cost.
    • Memory buffers needed in the server.
  • Communication cost.
    • Remote connection cost, network transfer cost. Leading term for distributed databases.
• The costs above are estimated via the information in the DBMS catalog (e.g. #records, record size, #blocks, primary and secondary access methods, #distinct values, selectivity, etc.).

References
Textbook 1: Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, sixth or seventh edition. ISBN-13: 978-0133970777 (seventh edition); ISBN-10: 0133970779 (seventh edition).
Textbook 2: Thomas M. Connolly and Carolyn E. Begg, Database Systems: A Practical Approach to Design, Implementation, and Management, sixth edition, Global Edition.