
Literature Review
Query Translator Tool
CSSE433 Advanced Database Systems
Dr. Sriram Mohan
Chandler Kent and Derek Hammer
Table of Contents
Introduction
Top-Level Description of Problem
    “Query Evaluation Techniques for Large Databases”
    “Logic-Based Approach to Semantic Query Optimization”
    “Practical Selectivity Estimation through Adaptive Sampling”
Translating SQL Queries into Relational Algebra
    “A Rule-Based View of Query Optimization”
Relational Algebra Transformation Rules
    Heuristic Optimizing Algorithm
Using Selectivity and Cost Estimates in Query Optimization
    “Randomized Algorithms for Optimizing Large Join Queries”
        Iterative Improvement Algorithm
    “Fundamental Techniques for Order Optimization”
Semantic Query Optimization
Conclusion
    The Current Form of our Project
References
Glossary
Introduction
This milestone report describes the papers we have been researching for our project over the past
two weeks. Each of these papers influences our project either directly or indirectly: some serve only
to help us understand the project in context, while others provide specifics that we will implement
directly. Our first point of reference was Fundamentals of Database Systems by Elmasri and
Navathe, which we used in an introductory course. One chapter in particular was of interest to us and
gave us a basic overview for the project. The literature review follows the format of Chapter 15,
“Algorithms for Query Processing and Optimization”: we follow along with the chapter,
supplementing it with external sources where they fit.
Top-Level Description of Problem
In relational database management systems (RDBMS), clients and client programs submit queries
in an SQL-like form: that is, the queries describe the desired results rather than how to compute
them (declarative rather than procedural). Much of our research helped in describing and
understanding the problem: “Query Evaluation Techniques for Large Databases,”
“Logic-Based Approach to Semantic Query Optimization,” “Practical Selectivity Estimation through
Adaptive Sampling,” and several others that will be mentioned later in the review. None of the papers
listed here had a direct influence on our project (as in, they do not provide algorithms we will
implement), but each provided good background or context information.
“Query Evaluation Techniques for Large Databases”
In this paper, Graefe surveys the techniques used for query processing. The majority of
this paper introduces concepts that we have already learned in either CSSE333 or CSSE433.
However, the paper introduces all of the concepts in relation to each other, providing examples
and more in-depth analysis of the work in the field. Although very long, this survey is a good place
to get our bearings on how our project would fit into an RDBMS.
“Logic-Based Approach to Semantic Query Optimization”
Semantic query optimization is the optimization of queries using semantics (integrity
constraints) to rewrite queries into simpler, more efficient queries before ever translating them
into relational algebra. The methods in this paper essentially analyze the integrity constraints of a
database and the queried tables, using that information to select a specific algorithm that could
reduce the query. We will most likely not be able to use this paper or its contents
because we do not plan to focus on semantic query optimization or on complex queries in
general.
“Practical Selectivity Estimation through Adaptive Sampling”
In this paper, the three authors discuss estimating the size of selections and
joins by sampling the data sets. Their proposed method, adaptive sampling, improves on
basic sampling algorithms, producing far more accurate estimates across databases and
queries of all sizes. Although this would be useful in many database implementations, it will
not be useful for us to study: we are attempting to create a program that demonstrates a
database’s usage of heuristics.
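For context only, the idea of estimating selection sizes from a sample can be sketched as follows. This is a minimal sketch of plain uniform sampling, not the adaptive sampling method from the paper; the function name, the table contents, and the predicate are all hypothetical.

```python
import random

def estimate_selectivity(rows, predicate, sample_size, rng):
    """Estimate the fraction of rows satisfying predicate from a uniform sample."""
    if not rows:
        return 0.0
    k = min(sample_size, len(rows))
    sample = rng.sample(rows, k)          # uniform sample without replacement
    hits = sum(1 for row in sample if predicate(row))
    return hits / k

# Hypothetical table: 10,000 rows with a single integer "age" column.
rng = random.Random(42)
rows = [{"age": rng.randint(18, 80)} for _ in range(10_000)]

# Estimate how many rows a selection "age < 30" would return.
est = estimate_selectivity(rows, lambda r: r["age"] < 30, 500, rng)
estimated_rows = round(est * len(rows))
```

An optimizer could use such an estimate in place of stored statistics, at the price of reading the sampled rows.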
Translating SQL Queries into Relational Algebra
The first stage of our project is to translate user-submitted queries into relational algebra. The
intent of the project is to simulate the power of a DBMS by allowing the user to insert his or her
own queries into our system. We must then translate those queries into relational algebra that is
represented by query trees. The book provides relatively little information on the subject of generalizing
query transformations, and we could find only one paper that dealt specifically with the transformation
of SQL into relational algebra.
“A Rule-Based View of Query Optimization”
In this paper, Freytag presents a clear, concise method for
optimization of queries. The paper’s focus is the transformation of
user-submitted queries into algebra-based Query Evaluation Plans
(QEPs) that can be used to retrieve the desired data from the
database. In Figure 1, Freytag demonstrates the basic necessities of a
LDBP, the data-retrieving portion of an RDBMS. Currently, our project
focuses on the region from User-Submitted Query to QEP and, even
then, will only “skim” the surface of the complexity that RDBMSs must
handle.
The author goes on to describe a rule-based path toward
translating queries from SQL-like statements (SELECT (…) (…) (…)) to
algebra-like statements (PROJECT (x) (LJOIN y z)). He gives a few
examples and describes the rules as in Figure 2.
Figure 1: Overview of steps from SQL to executable code. The pertinent steps to this project stop at QEP generation.
Figure 2: The target language of query optimization. This is what we will convert the SQL into and how we will represent our
query trees internally.
Using these rules, translating standard SQL queries into query
trees in our system should be straightforward.
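To make the translation step more concrete, here is a minimal sketch of how an already-parsed SELECT-FROM-WHERE query might be turned into an initial query tree. The class names, the translate and show functions, and the table and column names are assumptions made for illustration; the printed prefix notation only loosely imitates the algebra-like statements in Freytag's paper and is not his rule set.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Relation:
    name: str

@dataclass
class Product:
    left: "Node"
    right: "Node"

@dataclass
class Select:
    condition: str
    child: "Node"

@dataclass
class Project:
    attributes: List[str]
    child: "Node"

Node = Union[Relation, Product, Select, Project]

def translate(select_list: List[str], from_list: List[str],
              where: Optional[str] = None) -> Project:
    """Build the unoptimized tree: PROJECT over SELECT over a chain of PRODUCTs."""
    if not from_list:
        raise ValueError("FROM clause needs at least one relation")
    tree: Node = Relation(from_list[0])
    for name in from_list[1:]:
        tree = Product(tree, Relation(name))   # left-deep Cartesian products
    if where:
        tree = Select(where, tree)             # WHERE becomes one select node
    return Project(list(select_list), tree)

def show(node: Node) -> str:
    """Render a tree in a prefix, parenthesized form."""
    if isinstance(node, Relation):
        return node.name
    if isinstance(node, Product):
        return f"(PRODUCT {show(node.left)} {show(node.right)})"
    if isinstance(node, Select):
        return f"(SELECT ({node.condition}) {show(node.child)})"
    return f"(PROJECT ({', '.join(node.attributes)}) {show(node.child)})"

# SELECT name FROM student, enrolled WHERE student.id = enrolled.sid
tree = translate(["name"], ["student", "enrolled"], "student.id = enrolled.sid")
print(show(tree))
# (PROJECT (name) (SELECT (student.id = enrolled.sid) (PRODUCT student enrolled)))
```

The tree produced this way is deliberately naive; it is the input that the heuristic rules would later rearrange.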
Relational Algebra Transformation Rules
The book describes twelve rules that will assist in optimizing the query trees that are produced
in the translation process. It also lays the groundwork for our algorithm, which uses these basic
heuristic rules to optimize the trees before they are translated to QEPs. We will not describe the
rules in this document. However, we will describe the algorithm that the book uses to optimize queries:
Heuristic Optimizing Algorithm
Step 1: Break up any select operations with conjunctive conditions into a cascade of
select operations.
Step 2: Move each select operation as far down the query tree as is permitted by the
attributes involved in the select condition.
Step 3: Rearrange the leaf nodes of the tree using the following criteria. First, position
the leaf node relations with the most restrictive select operations so they are
executed first in the query tree representation. Second, make sure that the
ordering of leaf nodes does not cause Cartesian product operations; it may be
desirable to change the order of leaf nodes to avoid Cartesian products.
Step 4: Combine Cartesian product operations with a subsequent select operation in
the tree into a join operation if the condition represents a join condition.
Step 5: Break down and move lists of projection attributes down the tree as far as
possible by creating new project operations as needed. Only those attributes
needed in the query result and in subsequent operations in the query tree
should be kept after each project operation.
Step 6: Identify subtrees that represent groups of operations that can be executed by a
single algorithm.
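As an illustration of Steps 1, 2, and 4 above, the following sketch splits a conjunctive select, pushes each condition down the tree, and turns a Cartesian product into a join when a condition spans both of its inputs. It is a simplified sketch under stated assumptions: projections and Steps 3, 5, and 6 are omitted, each condition is tagged by hand with the relations it mentions, and all class, function, table, and column names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union

@dataclass(frozen=True)
class Cond:
    text: str
    relations: FrozenSet[str]   # relations whose attributes the condition uses

@dataclass
class Relation:
    name: str

@dataclass
class Select:
    conds: List[Cond]           # conjunction of conditions
    child: "Node"

@dataclass
class Product:
    left: "Node"
    right: "Node"

@dataclass
class Join:
    cond: Cond
    left: "Node"
    right: "Node"

Node = Union[Relation, Select, Product, Join]

def relations_of(node):
    """Set of base relation names below a node."""
    if isinstance(node, Relation):
        return frozenset([node.name])
    if isinstance(node, Select):
        return relations_of(node.child)
    return relations_of(node.left) | relations_of(node.right)

def push(cond, node):
    """Steps 2 and 4: sink one condition as far down as its attributes allow."""
    if isinstance(node, Select):
        return Select(node.conds, push(cond, node.child))   # selects commute
    if isinstance(node, (Product, Join)):
        if cond.relations <= relations_of(node.left):
            left, right = push(cond, node.left), node.right
        elif cond.relations <= relations_of(node.right):
            left, right = node.left, push(cond, node.right)
        elif isinstance(node, Product):
            return Join(cond, node.left, node.right)        # product + select = join
        else:
            return Select([cond], node)
        return Join(node.cond, left, right) if isinstance(node, Join) else Product(left, right)
    return Select([cond], node)

def optimize(node):
    if isinstance(node, Select):
        child = optimize(node.child)
        for cond in node.conds:      # Step 1: handle each conjunct separately
            child = push(cond, child)
        return child
    if isinstance(node, Product):
        return Product(optimize(node.left), optimize(node.right))
    return node

def show(node):
    if isinstance(node, Relation):
        return node.name
    if isinstance(node, Select):
        return f"SELECT[{' AND '.join(c.text for c in node.conds)}]({show(node.child)})"
    if isinstance(node, Join):
        return f"JOIN[{node.cond.text}]({show(node.left)}, {show(node.right)})"
    return f"PRODUCT({show(node.left)}, {show(node.right)})"

query = Select(
    [Cond("employee.dno = department.dnumber", frozenset({"employee", "department"})),
     Cond("department.dname = 'Research'", frozenset({"department"}))],
    Product(Relation("employee"), Relation("department")),
)
print(show(optimize(query)))
# JOIN[employee.dno = department.dnumber](employee, SELECT[department.dname = 'Research'](department))
```

The single-relation condition ends up directly above its relation, and the condition that mentions both relations replaces the Cartesian product with a join.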
Using Selectivity and Cost Estimates in Query Optimization
Once the optimized query tree is produced, we need to generate a QEP (which we will represent as
a query tree to facilitate learning). The “optimal” QEP will be generated by a time-limited run of an
algorithm described in the following paper.
“Randomized Algorithms for Optimizing Large Join Queries”
In this paper, the two authors describe and analyze the use of randomization in
algorithms that search for efficient QEPs. In doing so, they
characterize what a good optimizing algorithm looks like, following the form:
Iterative Improvement Algorithm
procedure IterativeImprovement () {
    minS = none;    // no local minimum found yet
    while not (stopping_condition) do {
        S = random_state;
        while not (local_minimum(S)) do {
            S’ = random_state in neighbors(S);
            if cost(S’) < cost(S) then S = S’;
        }
        if minS is none or cost(S) < cost(minS) then minS = S;
    }
    return minS;
}
In conjunction with the cost formulas provided by Fundamentals of Database Systems,
this algorithm will allow us to produce a good QEP (not necessarily the optimal one) to return
to the user.
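As a rough illustration of the Iterative Improvement procedure above, the following Python sketch runs it on a toy join-ordering problem. The cardinalities in CARD, the uniform selectivity SEL, the cost function, and the swap-based neighbors function are all assumptions made for this example; they are not taken from the paper or from the book's cost formulas. The exact local-minimum test is only practical because the toy search space is tiny.

```python
import random

CARD = {"A": 1000, "B": 10, "C": 100, "D": 5000}   # hypothetical relation sizes
SEL = 0.01                                         # hypothetical join selectivity

def cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    size = CARD[order[0]]
    total = 0.0
    for name in order[1:]:
        size = size * CARD[name] * SEL
        total += size
    return total

def random_state(rng):
    """A random join order over all relations."""
    order = list(CARD)
    rng.shuffle(order)
    return tuple(order)

def neighbors(order):
    """All orders reachable by swapping two positions."""
    result = []
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            swapped = list(order)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            result.append(tuple(swapped))
    return result

def is_local_minimum(state):
    return all(cost(n) >= cost(state) for n in neighbors(state))

def iterative_improvement(restarts, rng):
    min_state = None
    for _ in range(restarts):                  # stopping condition: fixed restarts
        state = random_state(rng)
        while not is_local_minimum(state):
            candidate = rng.choice(neighbors(state))
            if cost(candidate) < cost(state):  # accept only downhill moves
                state = candidate
        if min_state is None or cost(state) < cost(min_state):
            min_state = state                  # keep the best local minimum so far
    return min_state

best = iterative_improvement(restarts=5, rng=random.Random(0))
print(best, cost(best))
```

In a time-limited setting, the fixed number of restarts would be replaced by a check against a deadline.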
“Fundamental Techniques for Order Optimization”
In this paper, the authors discuss the techniques for determining the optimal QEP of a
particular query. They use several methods, described in prior papers, to optimize the QEP.
However, they introduce a novel method to optimize QEPs: order optimization. Order
optimization can substantially improve the execution times of queries. They use four
different properties to determine their optimizations: the order property, the predicate
property, the key property and the FD (functional dependency) property. They determine how
each of these propagates through the query tree and its overall effect on the cost of the
execution. By selecting an order that reduces the cost, these techniques can have a significant effect.
We can use these techniques in our own implementation by examining the four
properties and accounting for them in our own algorithm. We will implement a very similar form
of this optimization algorithm and perhaps extend it with the works of other authors,
including the authors of the book and the prior algorithm discussed.
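One way the order, predicate, and FD properties can interact is sketched below: columns that a predicate binds to a constant, or that are functionally determined by earlier columns, cannot affect an ordering and can be dropped before two orderings are compared. This is our own simplified sketch of the idea, not the algorithm as given in the paper; it ignores ascending versus descending order and column equivalences, and the function and column names are hypothetical.

```python
def reduce_order(order, constants, fds):
    """Drop columns that cannot affect the ordering.

    order:     list of column names, most significant first
    constants: set of columns bound to a constant by a predicate
    fds:       list of (determinant frozenset, dependent column) pairs
    """
    reduced = []
    for col in order:
        if col in constants or col in reduced:
            continue
        known = set(reduced) | set(constants)
        determined = any(dep == col and lhs <= known for lhs, dep in fds)
        if not determined:
            reduced.append(col)
    return reduced

def satisfies(provided, required, constants, fds):
    """True if rows ordered on `provided` are also ordered on `required`."""
    p = reduce_order(provided, constants, fds)
    r = reduce_order(required, constants, fds)
    return r == p[:len(r)]

# Hypothetical query: WHERE dept = 'CS', with emp_id a key that determines name.
constants = {"dept"}
fds = [(frozenset({"emp_id"}), "name")]

print(reduce_order(["dept", "emp_id", "name"], constants, fds))          # ['emp_id']
print(satisfies(["emp_id"], ["dept", "emp_id", "name"], constants, fds)) # True
```

When satisfies returns True, a plan whose input is already ordered on the provided columns would not need an extra sort for the required order.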
Semantic Query Optimization
Several DBMSs perform semantic query optimization on their user-defined inputs. However, we
feel that the scope of this project should not include this complex and difficult part of query
optimization. In addition, we assume that simple queries will be used, since they give good examples
to students at an introductory level.
Conclusion
In conclusion, we have reviewed many papers that present the state of the art in the area of
query processing, converting, and optimizing. We found Fundamentals of Database Systems to be the
most useful source, both because it combines many of the sources into a single source and because it
provides a great overview and groundwork for our project.
The Current Form of our Project
After reviewing these resources, our project has begun to take shape conceptually. The
goal of this project is still to provide an educational tool for students to better understand query
processing, converting, and optimization. Our current vision for implementing this tool is that
we will follow the basic outline of Figure 1. A user will input SQL which we will then validate and
parse, converting the SQL into a form as presented in Figure 2. Then we will optimize the query
using the Heuristic Optimizing Algorithm from Fundamentals of Database Systems and then
generate a QEP using cost-based estimation techniques also described in Fundamentals of
Database Systems. The steps of the Heuristic Optimizing Algorithm as well as the different cost
calculations will be displayed to the user in an easy-to-follow tree structure, also as in
Fundamentals of Database Systems.
References
Chakravarthy, Upen S., John Grant, and Jack Minker. "Logic-Based Approach to Semantic Query
Optimization." ACM 15.2 (1990): 162-207. ACM. Logan Library, Terre Haute.
Elmasri, Ramez, and Shamkant B. Navathe. Fundamentals of Database Systems. 5th ed. Boston:
Addison Wesley, 2006.
Freytag, Johann C. "A Rule-Based View of Query Optimization." ACM (1987). ACM. Logan
Library, Terre Haute.
Graefe, Goetz. "Query Evaluation Techniques for Large Databases." ACM Computing Surveys
25.2 (1993). ACM. Logan Library, Terre Haute.
Ioannidis, Yannis E., and Younkyung Cha Kang. "Randomized Algorithms for Optimizing Large
Join Queries." ACM (1990). ACM. Logan Library, Terre Haute.
Klug, Anthony. "Equivalence of Relational Algebra." Journal of the Association for Computing
Machinery 29.3 (1982): 699-717. ACM. Logan Library, Terre Haute.
Lipton, Richard J., Jeffrey F. Naughton, and Donovan A. Schneider. "Practical Selectivity
Estimation Through Adaptive Sampling." ACM (1990). ACM. Logan Library, Terre Haute.
Negri, M., G. Pelagatti, and L. Sbattella. "Formal Semantics of SQL Queries." ACM Transactions
on Database Systems 17.3 (1991): 513-534. ACM. Logan Library, Terre Haute.
Simmen, David, Eugene Shekita, and Timothy Malkemus. "Fundamental Techniques for Order
Optimization." ACM (1996). ACM. Logan Library, Terre Haute.
Glossary
Integrity Constraints: constraints (rules) that must remain true for a database to preserve integrity;
integrity constraints are specified at database creation time and enforced by the database
management system
Heuristics: relating to general strategies or methods for solving problems; especially, of a method that is
not certain to arrive at an optimal solution
Query Evaluation Plan (QEP): a specific plan that a DBMS generates to execute a command
Query Optimization: the steps a DBMS performs to find an optimal QEP to execute a given SQL query
Relational Algebra: an offshoot of first-order logic (and of algebra of sets), deals with a set of relations
closed under operators
Relational Database Management System (RDBMS): a database management system (DBMS) that is
based on the relational model as introduced by E. F. Codd
SQL (Query): a database computer language designed for the retrieval and management of data in
relational database management systems (RDBMS), database schema creation and modification,
and database object access control management