Query Optimization Techniques

Overview

As companies began to rely more heavily on computerized business data, it became increasingly clear that the traditional file-based methods of storing and retrieving data were both inflexible and cumbersome to maintain. Because application code for accessing the data contained hard-coded pointers to the underlying data structures, a new report could take months to produce. Even minor changes were complicated and expensive to implement. In many cases, there was simply no method available for producing useful analysis of the data. These real business needs drove the relational database revolution.

The true power of a relational database resides in its ability to break the link between data access and the underlying data itself. Using a high-level access language such as SQL (Structured Query Language), users can access all of their corporate data dynamically without any knowledge of how the underlying data is actually stored. To maintain both system performance and throughput, the relational database system must accept a diverse variety of user input queries and convert them to a format that efficiently accesses the stored data. This is the task of the query optimizer.

This article presents the steps involved in the query transformation process, discusses the various methods of query optimization currently being used, and describes the query optimization techniques commonly used.

Query Transformation

Whenever a data manipulation language (DML) such as SQL is used to submit a query to a relational database management system (RDBMS), distinct process steps are invoked to transform the original query. Each of these steps must occur before the query can be processed by the RDBMS and a result set returned.
This technical article deals solely with queries sent to an RDBMS for the purpose of returning results; however, these steps are also used to handle DML statements that modify data and data definition language (DDL) statements that maintain objects within the RDBMS. Although many texts on the subject of query processing disagree about how each process is differentiated, they do agree that certain distinct process steps must occur.

The Parsing Process

The parsing process has two functions:

- It checks the incoming query for correct syntax.
- It breaks down the syntax into component parts that can be understood by the RDBMS.

These component parts are stored in an internal structure such as a graph or, more typically, a query tree. (This technical article focuses on a query tree structure.) A query tree is an internal representation of the component parts of the query that can be easily manipulated by the RDBMS. After this tree has been produced, the parsing process is complete.

The Standardization Process

Unlike a strictly hierarchical system, one of the great strengths of an RDBMS is its ability to accept high-level dynamic queries from users who have no knowledge of the underlying data structures. As a result, as individual queries become more complex, the system must be able to accept and resolve a large variety of combinational statements submitted for the purpose of retrieving the same data result set. The purpose of the standardization process is to transform these queries into a useful format for optimization.

The standardization process applies a set of tree manipulation rules to the query tree produced by the parsing process. Because these rules are independent of the underlying data values, they are correct for all operations. During this process, the RDBMS rearranges the query tree into a more standardized, canonical format. In many cases, it completely removes redundant syntax clauses.
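Such a query tree can be sketched as a small data structure. The following Python sketch is illustrative only; the node and operator names are invented for this example, not any particular RDBMS's internals:

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    op: str                        # e.g. "PROJECT", "SELECT", "SCAN"
    arg: str = ""                  # operator argument: columns, predicate, or table
    children: list = field(default_factory=list)

# Query tree for: SELECT name FROM employees WHERE dept = 'sales'
tree = QueryNode("PROJECT", "name", [
    QueryNode("SELECT", "dept = 'sales'", [
        QueryNode("SCAN", "employees"),
    ]),
])

def show(node: QueryNode, depth: int = 0) -> None:
    """Print the tree with indentation reflecting its structure."""
    print("  " * depth + node.op + (" " + node.arg if node.arg else ""))
    for child in node.children:
        show(child, depth + 1)

show(tree)
```

Because each node is just an operator with children, the tree is easy for later processing stages to rearrange without touching the original SQL text.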
This standardization of the query tree produces a structure that can be used by the RDBMS query optimizer.

The Query Optimizer

The goal of the query optimizer is to produce an efficient execution plan for processing the query represented by a standardized, canonical query tree. Although an optimizer can theoretically find the "optimal" execution plan for any query tree, in practice an optimizer produces an acceptably efficient execution plan. This is because the number of possible table join combinations increases combinatorially as a query becomes more complex. Without using pruning techniques or other heuristic methods to limit the number of data combinations evaluated, the time it takes the query optimizer to arrive at the best query execution plan for a complex query can easily be longer than the time required to use the least efficient plan.

Various RDBMS implementations have used differing optimization techniques to obtain efficient execution plans. This section discusses some of these techniques.

Heuristic Optimization

Heuristic optimization is a rules-based method of producing an efficient query execution plan. Because the query output of the standardization process is represented as a canonical query tree, each node of the tree maps directly to a relational algebraic expression. The function of a heuristic query optimizer is to apply relational algebraic rules of equivalence to this expression tree and transform it into a more efficient representation. Using relational algebraic equivalence rules ensures that no necessary information is lost during the transformation of the tree. These are the major steps involved in heuristic optimization:

- Break conjunctive selects into cascading selects.
- Move selects down the query tree to reduce the number of returned tuples. ("Tuple" rhymes with "couple." In a database table (relation), a tuple is a set of related values, one for each attribute (column). A tuple is stored as a row in a relational database management system.
It is the analog of a record in a nonrelational file.)
- Move projects down the query tree to eliminate the return of unnecessary attributes.
- Combine any Cartesian product operation followed by a select operation into a single join operation.

When these steps have been accomplished, the efficiency of a query can be further improved by rearranging the remaining select and join operations so that they are accomplished with the least amount of system overhead. Heuristic optimizers, however, do not attempt this further analysis of the query.

Syntactical Optimization

Syntactical optimization relies on the user's understanding of both the underlying database schema and the distribution of the data stored within the tables. All tables are joined in the original order specified by the user query. The optimizer attempts to improve the efficiency of these joins by identifying indexes that are useful for data retrieval. This type of optimization can be extremely efficient when accessing data in a relatively static environment. Using syntactical optimization, indexes can be created and tuned to improve the efficiency of a fixed set of queries.

Problems occur with syntactical optimization whenever the underlying data is fairly dynamic. Query access performance can degrade over time, and it is up to the user to find a more efficient method of accessing the data. Another problem is that applications using embedded SQL to query dynamically changing data often need to be recompiled to improve their data access performance. Cost-based optimization was developed to resolve these problems.

Cost-Based Optimization

To perform cost-based optimization, an optimizer needs specific information about the stored data. This information is extremely system-dependent and can include file size, file structure types, available primary and secondary indexes, and attribute selectivity (the percentage of tuples expected to be retrieved for a given equality selection).
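As a toy illustration of the kind of comparison a cost-based optimizer makes, the following Python sketch uses a deliberately simplified cost model; the formulas and numbers are assumptions chosen for illustration, not any real system's costing:

```python
def estimate_cost(table_pages, table_rows, selectivity, index_depth=3):
    """Compare estimated page I/O for a full table scan vs. an index lookup.

    Simplified assumptions: a scan reads every data page once; an index
    access costs the index traversal (index_depth) plus roughly one page
    read per qualifying row.
    """
    scan_cost = table_pages
    index_cost = index_depth + selectivity * table_rows
    # Return the access method with the lower estimated page I/O.
    return min(("scan", scan_cost), ("index", index_cost), key=lambda c: c[1])

# Highly selective predicate: 10,000 rows on 500 pages, 0.1% selectivity
# -> about 13 page reads through the index vs. 500 for a scan.
print(estimate_cost(500, 10_000, 0.001))   # ('index', 13.0)

# Weakly selective predicate: half the rows qualify, so the scan wins.
print(estimate_cost(500, 10_000, 0.5))     # ('scan', 500)
```

The point of the sketch is only that the decision flips with selectivity: the same index is a win for a narrow predicate and a loss for a broad one.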
Because the goal of any optimization process is to retrieve the required information as efficiently as possible, a cost-based optimizer uses its knowledge of the underlying data and storage structures to assign an estimated cost, in terms of the number of tuples returned and, more importantly, the physical disk I/O required, to each relational operation. By evaluating various orderings of the relational operations required to produce the result set, a cost-based optimizer then arrives at an execution plan based on the combination of operational orderings and data access methods that has the lowest estimated cost in terms of system overhead.

As mentioned earlier, the realistic goal of a cost-based optimizer is not to produce the "optimal" execution plan for retrieving the required data, but to provide a reasonable execution plan. For complex queries, the cost estimate is based on the evaluation of a subset of all possible orderings and on statistical information that estimates the selectivity of each relational operation. These cost estimates can be only as accurate as the available statistical data. Due to the overhead of keeping this information current for data that can be altered dynamically, most relational database management systems maintain this information in system tables or catalogs that must be updated manually. The database system administrator must keep this information current so that a cost-based optimizer can accurately estimate the cost of various operations.

Semantic Optimization

Although not yet an implemented optimization technique, semantic optimization is currently the focus of considerable research. Semantic optimization operates on the premise that the optimizer has a basic understanding of the actual database schema. When a query is submitted, the optimizer uses its knowledge of system constraints to simplify or to ignore a particular query if it is guaranteed to return an empty result set.
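The idea can be sketched as follows; this is a hypothetical fragment in which the constraint table and helper function are invented purely for illustration:

```python
# Invented constraint catalog: the schema declares CHECK (salary > 0).
constraints = {"employees.salary": (">", 0)}

def contradicts_constraint(column, op, value):
    """Return True if a query predicate can never match, given the
    declared constraint on the column (a tiny subset of the reasoning
    a semantic optimizer would perform)."""
    if column not in constraints:
        return False
    c_op, c_val = constraints[column]
    # CHECK (col > c_val) contradicts a predicate col < value
    # whenever value <= c_val: no stored row can satisfy both.
    return c_op == ">" and op == "<" and value <= c_val

# WHERE salary < 0 contradicts CHECK (salary > 0): the optimizer can
# return an empty result set without reading any data.
print(contradicts_constraint("employees.salary", "<", 0))    # True
print(contradicts_constraint("employees.salary", "<", 100))  # False
```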
This technique holds great promise for providing even more improvements to query processing efficiency in future relational database systems.

The SQL Server Query Optimizer

SQL Server uses a cost-based query optimizer. This optimization is accomplished in three phases:

- Query analysis
- Index selection
- Join selection

Query Analysis

In the query analysis phase, the SQL Server optimizer looks at each clause represented by the canonical query tree and determines whether it can be optimized. SQL Server attempts to optimize clauses that limit a scan; for example, search or join clauses. However, not all valid SQL syntax can be broken into optimizable clauses; for example, clauses containing the SQL relational operator <> (not equal) cannot be optimized. Because <> is an exclusive rather than an inclusive operator, the selectivity of the clause cannot be determined before scanning the entire underlying table. When a relational query contains non-optimizable clauses, the execution plan accesses these portions of the query using table scans. If the query tree contains any optimizable SQL syntax, the optimizer performs index selection for each of these clauses.

Index Selection

For each optimizable clause, the optimizer checks the database system tables to see whether there is an associated index useful for accessing the data. An index is considered useful only if a prefix of the columns contained in the index exactly matches the columns in the clause of the query. This must be an exact match, because an index is built based on the column order presented at creation time. For a clustered index, the underlying data is also sorted based on this index column order. Attempting to use only a secondary column of an index to access data would be similar to attempting to use a phone book to look up all the entries with a particular first name: the ordering would be of little use because you would still have to check every row to find all of the qualifying entries. If a useful index exists for a clause, the optimizer then attempts to determine the clause's selectivity.
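The prefix rule can be sketched like this; the helper below is illustrative only, not SQL Server's actual algorithm:

```python
def index_is_useful(index_columns, clause_columns):
    """An index helps only if some prefix of its column list exactly
    matches the set of columns referenced by the clause."""
    clause = set(clause_columns)
    return any(set(index_columns[:n]) == clause
               for n in range(1, len(index_columns) + 1))

# Index created on (last_name, first_name), like a phone book:
print(index_is_useful(["last_name", "first_name"], ["last_name"]))   # True
# Using only the secondary column is like searching a phone book by
# first name alone:
print(index_is_useful(["last_name", "first_name"], ["first_name"]))  # False
# Both columns together match the full two-column prefix:
print(index_is_useful(["last_name", "first_name"],
                      ["first_name", "last_name"]))                  # True
```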
In the earlier discussion of cost-based optimization, it was stated that a cost-based optimizer produces cost estimates for a clause based on statistical information. This statistical information is used to estimate a clause's selectivity (the percentage of tuples in a table that are returned for the clause). This statistical information is updated at only the following two times:

- During the initial creation of the index (if there is existing data in the table)
- Whenever the UPDATE STATISTICS command is executed for either the index or the associated table

To provide SQL Server with accurate statistics that reflect the actual tuple distribution of a populated table, the database system administrator must keep the statistical information for the table indexes reasonably current. If no statistical information is available for the index, a heuristic based on the relational operator of the clause is used to produce an estimate of selectivity.

Information about the selectivity of the clause and the type of available index is used to calculate a cost estimate for the clause. SQL Server estimates the amount of physical disk I/O that occurs if the index is used to retrieve the result set from the table. If this estimate is lower than the physical I/O cost of scanning the entire table, an access plan that employs the index is created.

Join Selection

When index selection is complete and all clauses have an associated processing cost based on their access plan, the optimizer performs join selection. Join selection is used to find an efficient order for combining the clause access plans. To accomplish this, the optimizer compares various orderings of the clauses and then selects the join plan with the lowest estimated processing cost in terms of physical disk I/O.
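The comparison of clause orderings can be sketched with a toy cost model. All of the costs below, and the assumption that each clause halves the surviving rows, are invented for illustration; a real optimizer uses its physical I/O estimates and prunes the search rather than enumerating every permutation:

```python
from itertools import permutations

# Invented per-clause access-plan costs from the index selection phase.
clause_costs = {"A": 40, "B": 5, "C": 20}

def plan_cost(order):
    """Toy model: each later clause is charged in proportion to the rows
    surviving the earlier clauses (each clause assumed to halve them)."""
    total, factor = 0.0, 1.0
    for clause in order:
        total += clause_costs[clause] * factor
        factor *= 0.5
    return total

# Compare every ordering and keep the cheapest.
best = min(permutations(clause_costs), key=plan_cost)
print(best, plan_cost(best))   # ('B', 'C', 'A') 25.0
```

Even in this toy model, the cheapest plan starts with the most selective, lowest-cost clause so that later, more expensive clauses operate on fewer rows.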
Because the number of clause combinations can grow combinatorially as the complexity of a query increases, the SQL Server query optimizer uses tree pruning techniques to minimize the overhead associated with these comparisons. When this join selection phase is complete, the SQL Server query optimizer provides a cost-based query execution plan that takes advantage of available indexes when they are useful and accesses the underlying data in an order that minimizes system overhead and improves performance.

Improving a Logical Query Plan

Improving a logical query plan is essential because it reduces the amount of data that must be accessed (for example, by removing unwanted data early). Assume that there are 100 rows, of which the final selection produces an output of only 10 rows with two columns. With a proper logical query plan, the number of rows at each intermediate stage can be reduced so that data retrieval is minimized at every stage. This is even more important when the query contains one or more join operations. The rules that tend to improve logical query plans are algebraic laws. The following are the most commonly used:

- Selections can be pushed down the expression tree as far as they can go. If the selection is part of an AND condition, the condition can be split and each piece pushed down the tree separately. This strategy is probably one of the most effective improvement techniques.
- Similarly, projections can be pushed down the tree; a projection can be considered a column-wise analog of a selection. The pushing of projection operators should be done with care, as should the pushing of selection operators.
- Duplicate elimination (the DISTINCT operator) can sometimes be introduced, but only in certain cases. For example, when searching for the MAX value in a column, a DISTINCT operator can be used to eliminate duplicates first, making the search smaller.
  (Whereas if the operator were AVG or SUM, the results with and without DISTINCT would be very different.)
- A selection applied to a Cartesian product can be combined into a single join, which in turn simplifies the whole expression.

Let us consider an example that covers the above-mentioned points: a query that returns the titles of movies starring anyone born in 1960, drawing on a StarsIn relation (attributes title and starName) and a MovieStar relation (attributes name and birthdate). One form of the logical query plan for this query is:

    π title
      ⋈ starName = name
        StarsIn
        σ birthdate LIKE '%1960'
          MovieStar

An improvement over this query plan, with the projections pushed down toward the leaves, is given below:

    π title
      ⋈ starName = name
        π starName, title
          StarsIn
        π name
          σ birthdate LIKE '%1960'
            MovieStar
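The payoff of the improved plan can be seen with a small data-level sketch in Python; the relation contents are invented for illustration:

```python
# Invented sample tuples for StarsIn(title, starName) and
# MovieStar(name, birthdate).
stars_in   = [("Movie A", "Star1"), ("Movie B", "Star2"), ("Movie C", "Star1")]
movie_star = [("Star1", "1/1/1960"), ("Star2", "5/5/1972"),
              ("Star3", "7/7/1960")]

# Improved plan, evaluated bottom-up:
# sigma(birthdate LIKE '%1960') pushed down to MovieStar, then pi(name):
names = {name for name, birthdate in movie_star if birthdate.endswith("1960")}

# pi(starName, title) on StarsIn, join on starName = name, then pi(title):
titles = [title for title, star in stars_in if star in names]
print(titles)   # ['Movie A', 'Movie C']
```

Only a small set of qualifying names reaches the join, instead of every MovieStar tuple with all of its attributes, which is exactly the reduction in intermediate data that pushing selections and projections down is meant to achieve.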