Literature Review
Query Translator Tool
CSSE433 Advanced Database Systems
Dr. Sriram Mohan
Chandler Kent and Derek Hammer

Table of Contents

Introduction
Top-Level Description of Problem
    “Query Evaluation Techniques for Large Databases”
    “Logic-Based Approach to Semantic Query Optimization”
    “Practical Selectivity Estimation through Adaptive Sampling”
Translating SQL Queries into Relational Algebra
    “A Rule-Based View of Query Optimization”
Relational Algebra Transformation Rules
    Heuristic Optimizing Algorithm
Using Selectivity and Cost Estimates in Query Optimization
    “Randomized Algorithms for Optimizing Large Join Queries”
        Iterative Improvement Algorithm
    “Fundamental Techniques for Order Optimization”
Semantic Query Optimization
Conclusion
    The Current Form of our Project
References
Glossary

Introduction

This milestone report describes the papers that we have been researching for our project over the past two weeks. Each of these papers will influence our project either directly or indirectly: some of the reference materials merely help us understand the project in context, while others contain specifics that will be implemented directly in our project.

Our first point of reference was Fundamentals of Database Systems by Elmasri and Navathe, which we used in an introductory course. One particular chapter was of interest to us and gave us a basic overview for the project. The literature review is presented in a format similar to that of Chapter 15, “Algorithms for Query Processing and Optimization,” in that we follow along with the chapter, supplementing it with external sources where they fit.

Top-Level Description of Problem

In relational database management systems (RDBMS), clients and client programs submit SQL-like queries: that is, queries that describe the desired results rather than how to compute them (declarative rather than procedural).
Much of our research helped in describing and understanding the problem: “Query Evaluation Techniques for Large Databases,” “Logic-Based Approach to Semantic Query Optimization,” “Practical Selectivity Estimation through Adaptive Sampling,” and several others that will be mentioned later in the review. None of the papers listed here had a direct influence on our project (that is, they do not provide algorithms we will implement), but each provided useful background or context.

“Query Evaluation Techniques for Large Databases”

In this paper, Graefe surveys the techniques used for query processing. The majority of the paper introduces concepts that we have already learned in either CSSE333 or CSSE433. However, the paper presents all of these concepts in relation to each other, providing examples and a more in-depth analysis of the work in the field. Although very long, this survey is a good place to get our bearings on how our project would fit into an RDBMS.

“Logic-Based Approach to Semantic Query Optimization”

Semantic query optimization is the optimization of queries using semantics (integrity constraints) to transform queries into simpler, more efficient queries before ever translating them into relational algebra. The methods described essentially analyze the integrity constraints of a database and the queried tables, and use that information to select a specific algorithm that may reduce the query. We will most likely not be able to use this paper or its contents because we do not plan to focus on semantic query optimization or on complex queries in general.

“Practical Selectivity Estimation through Adaptive Sampling”

In this paper, the three authors discuss how to estimate the size of selections and joins by sampling the data sets. Their proposed method of adaptive sampling improves on basic sampling algorithms, producing far more accurate estimates for databases and queries of all sizes.
Although this would be useful in many database implementations, it will not be useful for us to study. We are attempting to create a program that demonstrates a database’s usage of heuristics.

Translating SQL Queries into Relational Algebra

The first stage of our project is to translate user-submitted queries into relational algebra. The intent of the project is to simulate the power of a DBMS by allowing the user to submit his or her own queries to our system. We must then translate those queries into relational algebra that is represented by query trees. The book provides relatively little information on generalizing query transformations, and we could find only one paper that dealt specifically with the transformation of SQL into relational algebra.

“A Rule-Based View of Query Optimization”

In this paper, Freytag presents a clear, concise method for optimization of queries. The paper’s focus is the transformation of user-submitted queries into algebra-based Query Evaluation Plans (QEPs) that can be used to retrieve the desired data from the database. In Figure 1, Freytag demonstrates the basic necessities of an LDBP, the data-retrieving portion of an RDBMS. Currently, our project focuses on the region from User-Submitted Query to QEP and, even then, will only skim the surface of the complexity that RDBMSs must handle.

The author goes on to describe a rule-based path toward translating queries from SQL-like statements (SELECT (…) (…) (…)) to algebra-like statements (PROJECT (x) (LJOIN y z)). He gives a few examples and describes the rules as in Figure 2.

Figure 1: Overview of steps from SQL to executable code. The pertinent steps to this project stop at QEP generation.

Figure 2: The target language of query optimization. This is what we will convert the SQL into and how we will represent our query trees internally.

Using these rules, it should be straightforward to translate standard SQL queries into query trees in our database.
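
To make the translation step concrete, here is a minimal sketch of how a very restricted SELECT ... FROM ... WHERE query could be turned into an initial query tree. It is only an illustration under stated assumptions: the Node class, the sql_to_tree function, the bracketed prefix notation, and the example table and column names are all hypothetical, and the output only loosely resembles the PROJECT/SELECT style of the target language rather than reproducing Freytag's rules.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a (hypothetical) query tree."""
    op: str                         # "PROJECT", "SELECT", "PRODUCT" or "RELATION"
    arg: str = ""                   # attribute list, condition, or relation name
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def show(self) -> str:
        """Render the tree in a prefix, algebra-like notation."""
        if self.op == "RELATION":
            return self.arg
        kids = " ".join(c.show() for c in (self.left, self.right) if c is not None)
        if self.arg:
            return f"({self.op} [{self.arg}] {kids})"
        return f"({self.op} {kids})"


# Assumption: only a flat SELECT ... FROM ... [WHERE ...] is accepted.
QUERY = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<tables>.+?)"
    r"(?:\s+WHERE\s+(?P<cond>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def sql_to_tree(sql: str) -> Node:
    """Build the canonical (unoptimized) tree: PROJECT over SELECT over PRODUCTs."""
    m = QUERY.match(sql)
    if m is None:
        raise ValueError("only SELECT ... FROM ... [WHERE ...] is supported")
    tables = [t.strip() for t in m.group("tables").split(",")]
    tree = Node("RELATION", tables[0])
    for name in tables[1:]:
        # Every extra table in FROM becomes a Cartesian product at this stage.
        tree = Node("PRODUCT", left=tree, right=Node("RELATION", name))
    if m.group("cond"):
        tree = Node("SELECT", m.group("cond").strip(), left=tree)
    cols = ", ".join(c.strip() for c in m.group("cols").split(","))
    return Node("PROJECT", cols, left=tree)


if __name__ == "__main__":
    query = "SELECT Lname FROM EMPLOYEE, DEPARTMENT WHERE Dno = Dnumber"
    print(sql_to_tree(query).show())
    # (PROJECT [Lname] (SELECT [Dno = Dnumber] (PRODUCT EMPLOYEE DEPARTMENT)))
```

The sketch deliberately leaves the WHERE condition as one unparsed string and joins every table with a Cartesian product, so the resulting tree is only a starting point for later optimization.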

Relational Algebra Transformation Rules

The book describes twelve rules that will assist in optimizing the query trees that are produced in the translation process. It also lays the groundwork for our algorithm, which uses these basic heuristic rules to optimize the trees before they are translated to QEPs. We will not describe the rules in this document. However, we will describe the algorithm that the book uses to optimize queries:

Heuristic Optimizing Algorithm

Step 1: Break up any select operations with conjunctive conditions into a cascade of select operations.

Step 2: Move each select operation as far down the query tree as is permitted by the attributes involved in the select condition.

Step 3: Rearrange the leaf nodes of the tree using the following criteria. First, position the leaf node relations with the most restrictive select operations so they are executed first in the query tree representation. Second, make sure that the ordering of leaf nodes does not cause Cartesian product operations; it may be desirable to change the order of leaf nodes to avoid Cartesian products.

Step 4: Combine Cartesian product operations with a subsequent select operation in the tree into a join operation if the condition represents a join condition.

Step 5: Break down and move lists of projection attributes down the tree as far as possible by creating new project operations as needed. Only those attributes needed in the query result and in subsequent operations in the query tree should be kept after each project operation.

Step 6: Identify subtrees that represent groups of operations that can be executed by a single algorithm.

Using Selectivity and Cost Estimates in Query Optimization

Once the heuristically optimized query tree is produced, we need to generate a QEP (which we will represent as a query tree to facilitate learning). The “optimal” QEP will be generated by a time-limited run of an algorithm described in the following paper.
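
As a rough illustration of what a time-limited, cost-driven search might look like, the sketch below searches over left-deep join orders using a generic iterative-improvement pattern with random restarts. Everything specific in it is an assumption made for the example: the relation names, the cardinalities in CARD, the join selectivities in SEL, the cost model (sum of estimated intermediate result sizes), and the choice of "swap two relations" as the neighbor move. It is not the cost formulas from the book, nor a reproduction of any paper's implementation.

```python
import itertools
import random
import time

# Hypothetical catalog statistics: relation name -> number of tuples.
CARD = {"EMPLOYEE": 10_000, "DEPARTMENT": 50, "PROJECT": 500, "WORKS_ON": 40_000}

# Hypothetical join selectivities. Pairs not listed share no join condition
# and therefore behave like a Cartesian product (selectivity 1.0).
SEL = {
    frozenset({"EMPLOYEE", "DEPARTMENT"}): 1 / 50,
    frozenset({"EMPLOYEE", "WORKS_ON"}): 1 / 10_000,
    frozenset({"PROJECT", "WORKS_ON"}): 1 / 500,
    frozenset({"DEPARTMENT", "PROJECT"}): 1 / 50,
}


def cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    total = 0.0
    size = CARD[order[0]]
    joined = [order[0]]
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            size *= SEL.get(frozenset({prev, rel}), 1.0)
        joined.append(rel)
        total += size
    return total


def neighbors(order):
    """All join orders reachable by swapping two relations."""
    result = []
    for i, j in itertools.combinations(range(len(order)), 2):
        swapped = list(order)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        result.append(tuple(swapped))
    return result


def is_local_minimum(order):
    """True when no single swap yields a strictly cheaper order."""
    return all(cost(n) >= cost(order) for n in neighbors(order))


def iterative_improvement(relations, time_budget_s, rng):
    """Random restarts plus downhill moves until the time budget runs out."""
    best = None
    deadline = time.monotonic() + time_budget_s
    while True:
        state = tuple(rng.sample(relations, len(relations)))  # random start
        while not is_local_minimum(state):
            candidate = rng.choice(neighbors(state))
            if cost(candidate) < cost(state):
                state = candidate                              # accept downhill move
        if best is None or cost(state) < cost(best):
            best = state                                       # remember best local minimum
        if time.monotonic() >= deadline:                       # at least one restart always runs
            break
    return best


if __name__ == "__main__":
    plan = iterative_improvement(list(CARD), 0.05, random.Random(0))
    print(plan, cost(plan))
```

The result is a local minimum of the toy cost function, so it is a good plan rather than a guaranteed optimal one; a longer time budget simply allows more restarts.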

“Randomized Algorithms for Optimizing Large Join Queries”

In this paper, the two authors describe and analyze the use of randomization in algorithms that search for the most efficient QEP. In doing so, they provide a good analysis of what a good optimizing algorithm would look like, following the form:

Iterative Improvement Algorithm

    procedure IterativeImprovement() {
        minS = S_inf;    // placeholder state with infinite cost
        while not (stopping_condition) do {
            S = random_state;
            while not (local_minimum(S)) do {
                S' = random_state in neighbors(S);
                if cost(S') < cost(S) then S = S';
            }
            if cost(S) < cost(minS) then minS = S;
        }
        return minS;
    }

In conjunction with the cost formulas provided by Fundamentals of Database Systems, this algorithm will allow us to produce a good QEP (not necessarily the optimal one) to return to the user.

“Fundamental Techniques for Order Optimization”

In this paper, the authors discuss techniques for determining the optimal QEP of a particular query. They use several methods, described in prior papers, to optimize the QEP. However, they also introduce a novel method: order optimization. Order optimization can drastically improve query execution times. The authors use four different properties to determine their optimizations: the order property, the predicate property, the key property, and the FD property. They determine how each of these propagates through the query tree and what overall effect it has on the cost of execution. By selecting an order that reduces the cost, these techniques can have a significant effect. We can use these techniques in our own implementation by examining the four properties and accounting for them in our own algorithm. We will implement a very similar form of this optimization algorithm and perhaps extend it with the work of other authors, including the authors of the book and the prior algorithm discussed.

Semantic Query Optimization

Several DBMSs perform semantic query optimization on user-submitted queries.
However, we feel that the scope of this project should not include this complex and difficult part of query optimization. In addition, we assume that only simple queries will be used, in order to give good examples to students at an introductory level.

Conclusion

In conclusion, we have reviewed many papers that present the state of the art in the area of query processing, converting, and optimizing. We found Fundamentals of Database Systems to be the most useful source, both because it combines many of the sources into a single source and because it provides a great overview and groundwork for our project.

The Current Form of our Project

After reviewing these resources, our project has begun to take shape conceptually. The goal of this project is still to provide an educational tool for students to better understand query processing, converting, and optimization. Our current vision for implementing this tool is that we will follow the basic outline of Figure 1. A user will input SQL, which we will then validate and parse, converting the SQL into the form presented in Figure 2. Then we will optimize the query using the Heuristic Optimizing Algorithm from Fundamentals of Database Systems and generate a QEP using cost-based estimation techniques also described in Fundamentals of Database Systems. The steps of the Heuristic Optimizing Algorithm, as well as the different cost calculations, will be displayed to the user in an easy-to-follow tree structure, also as in Fundamentals of Database Systems.

References

Chakravarthy, Upen S., John Grant, and Jack Minker. "Logic-Based Approach to Semantic Query Optimization." ACM Transactions on Database Systems 15.2 (1990): 162-207. ACM. Logan Library, Terre Haute.

Elmasri, Ramez, and Shamkant B. Navathe. Fundamentals of Database Systems. 5th ed. Boston: Addison Wesley, 2006.

Freytag, Johann C. "A Rule-Based View of Query Optimization." ACM (1987). ACM. Logan Library, Terre Haute.

Graefe, Goetz. "Query Evaluation Techniques for Large Databases." ACM Computing Surveys 25.2 (1993). ACM. Logan Library, Terre Haute.

Ioannidis, Yannis E., and Younkyung Cha Kang. "Randomized Algorithms for Optimizing Large Join Queries." ACM (1990). ACM. Logan Library, Terre Haute.

Klug, Anthony. "Equivalence of Relational Algebra." Journal of the Association for Computing Machinery 29.3 (1982): 699-717. ACM. Logan Library, Terre Haute.

Lipton, Richard J., Jeffrey F. Naughton, and Donovan A. Schneider. "Practical Selectivity Estimation Through Adaptive Sampling." ACM (1990). ACM. Logan Library, Terre Haute.

Negri, M., G. Pelagatti, and L. Sbattella. "Formal Semantics of SQL Queries." ACM Transactions on Database Systems 17.3 (1991): 513-534. ACM. Logan Library, Terre Haute.

Simmen, David, Eugene Shekita, and Timothy Malkemus. "Fundamental Techniques for Order Optimization." ACM (1996). ACM. Logan Library, Terre Haute.

Glossary

Integrity Constraints: constraints (rules) that must remain true for a database to preserve integrity; integrity constraints are specified at database creation time and enforced by the database management system

Heuristics: relating to general strategies or methods for solving problems; especially, of a method that is not certain to arrive at an optimal solution

Query Evaluation Plan (QEP): a specific plan that a DBMS generates to execute a command

Query Optimization: the steps a DBMS performs to find an optimal QEP to execute a given SQL query

Relational Algebra: an offshoot of first-order logic (and of the algebra of sets) that deals with a set of relations closed under operators

Relational Database Management System (RDBMS): a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd

SQL (Query): a database computer language designed for the retrieval and management of data in relational database management systems (RDBMS), database schema creation and modification, and database object access control management