chakravarthy

Query Optimization Techniques:
Overview
As companies began to rely more heavily on computerized business data, it became
increasingly clear that the traditional file-based methods of storing and retrieving
data were both inflexible and cumbersome to maintain. Because application code for
accessing the data contained hard-coded pointers to the underlying data structures,
a new report could take months to produce. Even minor changes were complicated
and expensive to implement. In many cases, there was simply no method available
for producing useful analysis of the data. These real business needs drove the
relational database revolution.
The true power of a relational database resides in its ability to break the link
between data access and the underlying data itself. Using a high-level access
language such as SQL (structured query language), users can access all of their
corporate data dynamically without any knowledge of how the underlying data is
actually stored. To maintain both system performance and throughput, the
relational database system must accept a diverse variety of user input queries and
convert them to a format that efficiently accesses the stored data. This is the task of
the query optimizer.
This article presents the steps involved in the query transformation process,
discusses the various methods of query optimization currently being used, and
describes the query optimization techniques commonly used.
Query Transformation
Whenever a data manipulation language (DML) such as SQL is used to submit a
query to a relational database management system (RDBMS), distinct process steps
are invoked to transform the original query. Each of these steps must occur before
the query can be processed by the RDBMS and a result set returned. This technical
article deals solely with queries sent to an RDBMS for the purpose of returning results;
however, these steps are also used to handle DML statements that modify data and
data definition language (DDL) statements that maintain objects within the
RDBMS.
Although many texts on the subject of query processing disagree about how each
process is differentiated, they do agree that certain distinct process steps must
occur.
The Parsing Process
The parsing process has two functions:
It checks the incoming query for correct syntax.
It breaks down the syntax into component parts that can be understood by the
RDBMS.
These component parts are stored in an internal structure such as a graph or, more
typically, a query tree. (This technical article focuses on a query tree structure.) A
query tree is an internal representation of the component parts of the query that can
be easily manipulated by the RDBMS. After this tree has been produced, the
parsing process is complete.
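As a rough illustration of the kind of structure the parser hands on, the sketch below models a query tree as nested Python nodes. The `Node` class, the operator names (`project`, `select`, `table`), and the `employees` table are assumptions made for this example; they are not the internal format of any particular RDBMS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One operator in a hypothetical query tree."""
    op: str                      # e.g. "project", "select", "table"
    arg: str = ""                # column list, predicate, or table name
    children: List["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> str:
    """Return an indented view of the tree, one operator per line."""
    line = "  " * depth + f"{node.op}({node.arg})"
    return "\n".join([line] + [render(child, depth + 1) for child in node.children])


# Hypothetical tree for: SELECT name FROM employees WHERE dept = 'sales'
tree = Node("project", "name", [
    Node("select", "dept = 'sales'", [
        Node("table", "employees"),
    ]),
])
print(render(tree))
```

Because every operator is an ordinary node with child nodes, later stages can rearrange the query simply by moving, merging, or replacing nodes.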
The Standardization Process
Unlike a strictly hierarchical system, one of the great strengths of an RDBMS is its
ability to accept high-level dynamic queries from users who have no knowledge of
the underlying data structures. As a result, as individual queries become more
complex, the system must be able to accept and resolve a large variety of
combinational statements submitted for the purpose of retrieving the same data
result set.
The purpose of the standardization process is to transform these queries into a
useful format for optimization. The standardization process applies a set of tree
manipulation rules to the query tree produced by the parsing process. Because these
rules are independent of the underlying data values, they are correct for all
operations. During this process, the RDBMS rearranges the query tree into a more
standardized, canonical format. In many cases, it completely removes redundant
syntax clauses. This standardization of the query tree produces a structure that can
be used by the RDBMS query optimizer.
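One simple rule of this kind can be sketched as follows: flatten nested AND conditions, drop duplicate conjuncts, and put the rest in a fixed order. The tuple encoding `("and", left, right)` and the function names are assumptions chosen for this illustration; a real standardization step applies many more rules.

```python
def flatten_and(pred):
    """Collect the leaf conditions of nested ("and", left, right) tuples."""
    if isinstance(pred, tuple) and pred[0] == "and":
        return flatten_and(pred[1]) + flatten_and(pred[2])
    return [pred]


def canonicalize(pred):
    """Return the conjuncts as a sorted tuple with duplicates removed."""
    return tuple(sorted(set(flatten_and(pred))))


# Two differently written conditions that ask for the same rows.
q1 = ("and", "a = 1", ("and", "b > 2", "a = 1"))
q2 = ("and", ("and", "b > 2", "a = 1"), "b > 2")

print(canonicalize(q1))
print(canonicalize(q1) == canonicalize(q2))
```

The rule never looks at stored data values, which is why it is safe to apply to every query.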
The Query Optimizer
The goal of the query optimizer is to produce an efficient execution plan for
processing the query represented by a standardized, canonical query tree. Although
an optimizer can theoretically find the "optimal" execution plan for any query tree,
in practice it settles for an acceptably efficient one. This is because
the number of possible table join combinations increases combinatorially as a query
becomes more complex. Without pruning techniques or other heuristic
methods to limit the number of combinations evaluated, the time it takes the
query optimizer to arrive at the best execution plan for a complex query can
easily exceed the time required to run the query with the least efficient plan.
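To give a feel for this growth, the short sketch below counts join orders for n tables. It assumes two standard counting results: n! left-deep orders (each join adds one table to the running result), and (2(n-1))!/(n-1)! join trees of any shape when the left and right inputs of a join are distinguished. The function names are illustrative.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders for n tables: n!."""
    return factorial(n)


def bushy_orders(n: int) -> int:
    """Number of join trees of any shape for n tables: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_orders(n), bushy_orders(n))
```

Even at ten tables there are millions of left-deep orders alone, so exhaustive enumeration quickly becomes impractical.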
Various RDBMS implementations have used differing optimization techniques to
obtain efficient execution plans. This section discusses some of these techniques.
Heuristic Optimization
Heuristic optimization is a rules-based method of producing an efficient query
execution plan. Because the query output of the standardization process is
represented as a canonical query tree, each node of the tree maps directly to a
relational algebraic expression. The function of a heuristic query optimizer is to
apply relational algebraic rules of equivalence to this expression tree and transform
it into a more efficient representation. Using relational algebraic equivalence rules
ensures that no necessary information is lost during the transformation of the tree.
These are the major steps involved in heuristic optimization:
Break conjunctive selects into cascading selects.
Move selects down the query tree to reduce the number of returned "tuples."
("Tuple" rhymes with "couple." A tuple is a set of related values in a database
table (relation), one for each attribute (column). It is stored as a row in a relational
database management system and is the analog of a record in a nonrelational file.)
Move projects down the query tree to eliminate the return of unnecessary
attributes.
Combine any Cartesian product operation followed by a select operation
into a single join operation.
When these steps have been accomplished, the efficiency
of a query can be further improved by rearranging the remaining select and join
operations so that they are accomplished with the least amount of system overhead.
Heuristic optimizers, however, do not attempt this further analysis of the query.
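The steps listed above can be sketched on a toy expression tree. This is a minimal illustration under stated assumptions, not a production optimizer: trees are plain tuples, each predicate carries the set of columns it mentions, a conjunctive select is already stored as a list of predicates (the "cascading" form), and projection push-down is left out for brevity. The table and column names are made up for the example.

```python
def columns(node):
    """Return the set of column names a node produces."""
    kind = node[0]
    if kind == "table":                      # ("table", name, [columns])
        return set(node[2])
    if kind == "select":                     # ("select", [predicates], child)
        return columns(node[2])
    if kind == "product":                    # ("product", left, right)
        return columns(node[1]) | columns(node[2])
    return columns(node[2]) | columns(node[3])   # ("join", [predicates], left, right)


def wrap(preds, child):
    """Place a select above child only when there is something to filter."""
    return ("select", preds, child) if preds else child


def optimize(node):
    """Push selects below a product and turn what is left into a join."""
    if node[0] == "select" and node[2][0] == "product":
        preds, (_, left, right) = node[1], node[2]
        # A predicate can move down only if one side supplies all its columns.
        to_left = [p for p in preds if p[1] <= columns(left)]
        to_right = [p for p in preds if p[1] <= columns(right)]
        rest = [p for p in preds if p not in to_left and p not in to_right]
        left = wrap(to_left, optimize(left))
        right = wrap(to_right, optimize(right))
        # A product followed by the remaining select becomes a single join.
        return ("join", rest, left, right) if rest else ("product", left, right)
    return node


orders = ("table", "orders", ["order_id", "cust_id", "total"])
customers = ("table", "customers", ["id", "city"])
p_join = ("cust_id = id", frozenset({"cust_id", "id"}))
p_city = ("city = 'Oslo'", frozenset({"city"}))

plan = optimize(("select", [p_join, p_city], ("product", orders, customers)))
print(plan)
```

In the result, the `city` filter sits directly above `customers`, and the cross-table condition has become the join predicate, so far fewer tuples reach the join.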
Syntactical Optimization
Syntactical optimization relies on the user's understanding
of both the underlying database schema and the distribution of the data stored
within the tables. All tables are joined in the original order specified by the user
query. The optimizer attempts to improve the efficiency of these joins by identifying
indexes that are useful for data retrieval. This type of optimization can be extremely
efficient when accessing data in a relatively static environment. Using syntactical
optimization, indexes can be created and tuned to improve the efficiency of a fixed
set of queries. Problems occur with syntactical optimization whenever the
underlying data is fairly dynamic. Query access schemas can be degraded over time,
and it is up to the user to find a more efficient method of accessing the data. Another
problem is that applications using embedded SQL to query dynamically changing
data often need to be recompiled to improve their data access performance.
Cost-based optimization was developed to resolve these problems.
Cost-Based Optimization
To perform cost-based optimization, an optimizer needs specific information about
the stored data. This information is extremely system-dependent and can include
information such as file size, file structure types, available primary and secondary
indexes, and attribute selectivity (the percentage of tuples expected to be retrieved
for a given equality selection). Because the goal of any optimization process is to
retrieve the required information as efficiently as possible, a cost-based optimizer
uses its knowledge of the underlying data and storage structures to assign an
estimated cost in terms of numbers of tuples returned and, more importantly,
physical disk I/O for each relational operation. By evaluating various orderings of
the relational operations required to produce the result set, a cost-based optimizer
then arrives at an execution plan based on a combination of operational orderings
and data access methods that have the lowest estimated cost in terms of system
overhead.
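The idea of costing alternative orderings can be illustrated with a deliberately simple model. The sketch below assumes each filter has a known selectivity and measures cost only as the number of tuples examined; it ignores physical disk I/O, which the text notes is the more important factor in practice. The filter names and selectivity figures are hypothetical.

```python
from itertools import permutations


def plan_cost(table_rows, selectivities):
    """Tuples examined when the filters run in the given order."""
    cost, rows = 0.0, float(table_rows)
    for s in selectivities:
        cost += rows          # every surviving tuple is examined by this filter
        rows *= s             # only a fraction s survives to the next filter
    return cost


def cheapest_order(table_rows, filters):
    """filters maps a filter name to its selectivity; return (order, cost)."""
    candidates = (
        (order, plan_cost(table_rows, [filters[name] for name in order]))
        for order in permutations(filters)
    )
    return min(candidates, key=lambda pair: pair[1])


filters = {"state = 'WA'": 0.02, "status = 'open'": 0.5}
order, cost = cheapest_order(10_000, filters)
print(order, cost)
```

Under this model the more selective filter is applied first, because it shrinks the intermediate result the most.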
As mentioned earlier, the realistic goal of a cost-based optimizer is not to produce
the "optimal" execution plan for retrieving the required data, but to provide a
reasonable execution plan. For complex queries, the cost estimate is based on the
evaluation of a subset of all possible orderings and on statistical information that
estimates the selectivity of each relational operation. These cost estimates can be
only as accurate as the available statistical data. Due to the overhead of keeping this
information current for data that can be altered dynamically, most relational
database management systems maintain this information in system tables or
catalogs that must be updated manually. The database system administrator must
keep this information current so that a cost-based optimizer can accurately estimate
the cost of various operations.
Semantic Optimization
Although not yet an implemented optimization technique, semantic optimization is
currently the focus of considerable research. Semantic optimization operates on the
premise that the optimizer has a basic understanding of the actual database schema.
When a query is submitted, the optimizer uses its knowledge of system constraints
to simplify or to ignore a particular query if it is guaranteed to return an empty
result set. This technique holds great promise for providing even more
improvements to query processing efficiency in future relational database systems.
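A minimal sketch of the idea, assuming the only constraint knowledge available is an inclusive value range on a single column (for example, from a CHECK constraint). The function name and the range encoding are assumptions for illustration; real semantic optimization would reason over much richer constraints.

```python
def is_provably_empty(constraint, predicate):
    """Return True if no value can satisfy both ranges.

    Both arguments are (low, high) inclusive ranges on the same column.
    """
    c_low, c_high = constraint
    p_low, p_high = predicate
    return p_high < c_low or p_low > c_high


# Hypothetical constraint: age must be between 18 and 65.
age_constraint = (18, 65)

print(is_provably_empty(age_constraint, (0, 17)))             # WHERE age <= 17
print(is_provably_empty(age_constraint, (60, float("inf"))))  # WHERE age >= 60
```

When the check succeeds, the query can be answered with an empty result set without touching the table at all.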
In SQL Server, cost-based optimization is accomplished in three phases:
Query analysis
Index selection
Join selection
Query Analysis
In the query analysis phase, the SQL Server optimizer looks at each clause
represented by the canonical query tree and determines whether it can be
optimized. SQL Server attempts to optimize clauses that limit a scan; for example,
search or join clauses. However, not all valid SQL syntax can be broken into
optimizable clauses; a clause containing the SQL relational operator <> (not
equal) is one example. Because <> is an exclusive rather than an inclusive operator, the selectivity
of the clause cannot be determined before scanning the entire underlying table.
When a relational query contains non-optimizable clauses, the execution plan
accesses these portions of the query using table scans. If the query tree contains any
optimizable SQL syntax, the optimizer performs index selection for each of these
clauses.
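The split described above can be sketched as a simple classification pass. The clause encoding `(column, operator, value)` is an assumption for this example, and only `<>` is treated as non-optimizable because that is the one case the text names; an actual optimizer applies a longer list of rules.

```python
NON_OPTIMIZABLE_OPERATORS = {"<>"}   # the one case named in the text


def split_clauses(clauses):
    """Separate clauses that can limit a scan from those that cannot."""
    optimizable = [c for c in clauses if c[1] not in NON_OPTIMIZABLE_OPERATORS]
    scan_only = [c for c in clauses if c[1] in NON_OPTIMIZABLE_OPERATORS]
    return optimizable, scan_only


# Hypothetical WHERE clause: price = 10 AND status <> 'closed' AND qty > 5
clauses = [("price", "=", 10), ("status", "<>", "closed"), ("qty", ">", 5)]
optimizable, scan_only = split_clauses(clauses)
print(optimizable)
print(scan_only)
```

Only the clauses in the first list go on to index selection; the others are evaluated during a table scan.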
Index Selection
For each optimizable clause, the optimizer checks the database system tables to see
if there is an associated index useful for accessing the data. An index is considered
useful only if a prefix of the columns contained in the index exactly matches the
columns in the clause of the query. This must be an exact match, because an index is
built based on the column order presented at creation time. For a clustered index,
the underlying data is also sorted based on this index column order. Attempting to
use only a secondary column of an index to access data would be similar to
attempting to use a phone book to look up all the entries with a particular first
name: the ordering would be of little use because you would still have to check every
row to find all of the qualifying entries. If a useful index exists for a clause, the
optimizer then attempts to determine the clause's selectivity.
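The prefix rule can be expressed in a few lines. This is a sketch of the rule as stated in the text, with a hypothetical function name and a phone-book style index on `(last_name, first_name)` as the example.

```python
def is_useful_index(index_columns, clause_columns):
    """True if the clause columns are exactly a leading prefix of the index.

    index_columns is ordered as given at index creation time; the order of
    clause_columns does not matter.
    """
    n = len(clause_columns)
    if n == 0 or n > len(index_columns):
        return False
    return set(index_columns[:n]) == set(clause_columns)


phone_book = ["last_name", "first_name"]
print(is_useful_index(phone_book, ["last_name"]))                # leading column
print(is_useful_index(phone_book, ["first_name", "last_name"]))  # full key
print(is_useful_index(phone_book, ["first_name"]))               # secondary column only
```

The last call returns `False`, which is the phone-book case: knowing only the first name gives the ordering nothing to work with.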
In the earlier discussion on cost-based optimization, it was stated that a cost-based
optimizer produces cost estimates for a clause based on statistical information. This
statistical information is used to estimate a clause's selectivity (the percentage of
tuples in a table that are returned for the clause).
This statistical information is updated only at the following two times:
During the initial creation of the index (if there is existing data in the table)
Whenever the UPDATE STATISTICS command is executed for either the index or
the associated table
To provide SQL Server with accurate statistics that reflect the actual tuple
distribution of a populated table, the database system administrator must keep the
statistical information for the table indexes reasonably current. If no statistical
information is available for the index, a heuristic based on the relational operator of
the clause is used to produce an estimate of selectivity.
Information about the selectivity of the clause and the type of available index is used
to calculate a cost estimate for the clause. SQL Server estimates the amount of
physical disk I/O that occurs if the index is used to retrieve the result set from the
table. If this estimate is lower than the physical I/O cost of scanning the entire table,
an access plan that employs the index is created.
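The comparison can be sketched with a crude I/O model. The formula below is an assumption made for illustration, not SQL Server's actual costing: using the index is taken to cost one page read per index level plus, pessimistically, one page read per matching row, while a table scan reads every page once.

```python
from math import ceil


def choose_access(table_pages, table_rows, selectivity, index_depth):
    """Return ("index", cost) or ("scan", cost), whichever is cheaper."""
    matching_rows = ceil(table_rows * selectivity)
    index_cost = index_depth + matching_rows   # pessimistic: one page per row
    scan_cost = table_pages                    # read every page once
    if index_cost < scan_cost:
        return ("index", index_cost)
    return ("scan", scan_cost)


# Hypothetical table: 1,000 pages, 100,000 rows, three-level index.
print(choose_access(1000, 100_000, 0.0001, 3))   # highly selective clause
print(choose_access(1000, 100_000, 0.2, 3))      # clause matching a fifth of the table
```

With these figures, the highly selective clause is served through the index, while the clause that matches a fifth of the table is cheaper to answer with a scan.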
Join Selection
When index selection is complete and all clauses have an associated processing cost
based on their access plan, the optimizer performs join selection. Join selection is
used to find an efficient order for combining the clause access plans. To accomplish
this, the optimizer compares various orderings of the clauses and then selects the
join plan with the lowest estimated processing costs in terms of physical disk I/O.
Because the number of clause combinations can grow combinatorially as the
complexity of a query increases, the SQL Server query optimizer uses tree pruning
techniques to minimize the overhead associated with these comparisons. When this
join selection phase is complete, the SQL Server query optimizer provides a
cost-based query execution plan that takes advantage of available indexes when they are
useful and accesses the underlying data in an order that minimizes system overhead
and improves performance.
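One common form of tree pruning is branch and bound: abandon a partial join order as soon as its accumulated cost reaches that of the best complete order found so far. The sketch below assumes left-deep orders only, a cost equal to the sum of intermediate result sizes, and independent pairwise join selectivities. These assumptions, and all names and figures, are illustrative; the text does not say which pruning method SQL Server uses.

```python
def best_join_order(rows, selectivity):
    """rows: {table: row count}; selectivity: {frozenset({a, b}): fraction}."""
    best = {"cost": float("inf"), "order": None}

    def extend(order, size, cost):
        if cost >= best["cost"]:
            return                          # prune: cannot beat the best plan so far
        if len(order) == len(rows):
            best["cost"], best["order"] = cost, order
            return
        for table in rows:
            if table in order:
                continue
            sel = 1.0
            for joined in order:            # 1.0 means "no join condition"
                sel *= selectivity.get(frozenset((table, joined)), 1.0)
            new_size = size * rows[table] * sel
            extend(order + (table,), new_size, cost + new_size)

    for first in rows:
        extend((first,), rows[first], 0.0)
    return best["order"], best["cost"]


rows = {"A": 1000, "B": 10, "C": 100}
selectivity = {frozenset(("A", "B")): 0.001, frozenset(("B", "C")): 0.1}
order, cost = best_join_order(rows, selectivity)
print(order, cost)
```

Orders that begin by joining A with C, which share no join condition here, produce a 100,000-row intermediate result and are cut off immediately once a cheaper complete plan is known.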
Improving a logical query plan
Improving the logical query plan matters because it reduces the amount of data
accessed, for example by discarding unneeded rows and columns early. Suppose a table
holds 100 rows and the final result contains only 10 rows with two columns. A good
logical query plan reduces the number of rows at the intermediate stages so that as
little data as possible is retrieved at every stage. This matters even more when the
query contains one or more join operations.
Logical query plans are improved by applying rules known as algebraic laws. The
following are the most commonly used:
Selections can be pushed down the expression tree as far as they can go. If the
selection condition is a conjunction (AND) of several conditions, we can split it and
push each piece down the tree separately. This strategy is probably one of the most
effective techniques.
Similarly, projections can be pushed down the tree. A projection can be thought of
as a column-wise selection. Both kinds of push-down must be done with care: a
projection must keep every attribute that operators higher in the tree still need, and
a selection can move only to a subtree that supplies every attribute its condition
mentions.
Duplicate elimination (the DISTINCT operator) can sometimes be introduced, but only
in certain cases. If we are looking for the MAX value in a column, DISTINCT can be
applied first to eliminate duplicates and make the search smaller without changing
the result. If the operator were AVG or SUM, however, the results with and without
DISTINCT would generally differ.
A selection applied to a product can sometimes be combined with it into a single
join, which simplifies the whole expression.
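The point about duplicate elimination is easy to check on a small column of values. The numbers below are made up for the illustration, and a Python set stands in for the DISTINCT operator.

```python
values = [10, 20, 20, 30, 30, 30]
distinct_values = set(values)            # plays the role of DISTINCT

# MAX is unaffected by removing duplicates.
print(max(values), max(distinct_values))

# SUM and AVG change, because each duplicate contributes to the total.
print(sum(values), sum(distinct_values))
print(sum(values) / len(values), sum(distinct_values) / len(distinct_values))
```

The same reasoning applies to MIN; for SUM, AVG, and COUNT, duplicates must be kept unless the query itself asks for DISTINCT.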
Let us consider an example that illustrates the points mentioned above. In each plan,
the inputs of an operator are indented beneath it.
π title
  ⋈ Starname = name
    StarsIn
    σ birthdate LIKE '%1960'
      MovieStar
The plan above is one possible logical query plan for a query that returns the titles
of movies featuring stars whose birthdate ends in 1960. An improved plan, which
pushes projections below the join, is given below.
π title
  ⋈ Starname = name
    π Starname, title
      StarsIn
    π name
      σ birthdate LIKE '%1960'
        MovieStar
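To confirm that the two plans are equivalent, both can be evaluated over small in-memory tables. The rows below are invented for the illustration, and the extra `year` and `address` columns are assumptions added so that the early projections have something to discard; the helper functions are a sketch, not an RDBMS execution engine.

```python
stars_in = [
    {"title": "Movie A", "year": 1990, "Starname": "Star X"},
    {"title": "Movie B", "year": 1995, "Starname": "Star Y"},
    {"title": "Movie C", "year": 2001, "Starname": "Star X"},
]
movie_star = [
    {"name": "Star X", "address": "1 Main St", "birthdate": "04/07/1960"},
    {"name": "Star Y", "address": "2 Oak Ave", "birthdate": "12/01/1975"},
]


def select_1960(rows):
    """sigma: birthdate LIKE '%1960'."""
    return [r for r in rows if r["birthdate"].endswith("1960")]


def project(rows, cols):
    """pi: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]


def join(left, right):
    """Join on Starname = name."""
    return [{**l, **r} for l in left for r in right if l["Starname"] == r["name"]]


# First plan: join the full rows, project at the very end.
plan1 = project(join(stars_in, select_1960(movie_star)), ["title"])

# Improved plan: project each input down to the needed columns before the join.
narrow_left = project(stars_in, ["Starname", "title"])
narrow_right = project(select_1960(movie_star), ["name"])
plan2 = project(join(narrow_left, narrow_right), ["title"])

print(plan1)
print(plan1 == plan2)
```

Both plans return the same titles, but in the improved plan each row entering the join carries three columns in total instead of six.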