OVERVIEW OF QUERY PROCESSING Structure 6.0 Objectives 6.1 Introduction 6.2 Query Processing Problem 6.3 Objectives of Query Processing 6.4 Characterization of Query Processors 6.5 Layers of Query Processing 6.5.1 Query Decomposition 6.5.2 Data Localization 6.5.2 Global Query Optimization 6.5.3 Local Query Optimization 6.6 Summary 6.0 Objectives: In this unit we learn about an overview of query processing in Distributed Data Base Management Systems (DDBMSs). This is explained with the help of Relational Calculus and Relational Algebra because of their generality and wide use in DDBMSs. In this we discuss Various problems of query processing About an ideal Query Processor The concept of layering in query processing Some related examples of query processing 6.1 Introduction: The increasing success of relational database technology in data processing is suitable, in part, to the availability of nonprocedural languages, which can significantly improve application development and end-user productivity. By hiding the low-level details about the physical organization of the data, relational database languages allow the expression of complex queries in a concise and simple fashion. In particular, to construct the answer to the query, the user does not exactly specify the procedure to follow. This procedure is actually devised by a DBMS module, called as Query Processor. This relieves the user from query optimization, a time consuming task that is handled properly by the query processor. This issue has considerably important both in Centralized and Distributed processing systems. However, the query processing problem is much more difficult in distributed environments than in the conventional systems. In exact, the relations involved in distributed queries may be fragmented and/or replicated, there by inducing communication overhead costs. So, in this unit let us discuss the different issues of query processing, about an ideal query processor for distributed environment and finally, a layered software approach for distributed query processing. 6.2 Query Processing Problem: The main duty of a relational query processor is to transform a high-level query (in relational calculus), into an equivalent lower level query (in relational algebra). The distributed database is of major importance for query processing since the definition of fragments is based on the objective of increasing reference locality, and sometimes-parallel execution for the most important queries. The role of a distributed query processor is to map a high level query on a distributed database (a set of global relations) into a sequence of database operations (of relational algebra) on relational fragments. Several important functions characterize this mapping: The calculus query must be decomposed into a sequence of relational operations called an algebraic query The data accessed by the query must be localized so that the operations on relations are translated to bear on local data (fragments) The algebraic query on fragments must be extended with communication operations and optimized with respect to a cost function to be minimized. This cost function refers to computing resources such as disk I/Os, CPUs, and communication networks. The low-level query actually implements the execution strategy for the query. The transformation must achieve both correctness and efficiency. The well-defined mapping with the above said functional characteristics makes the correctness issue easy. But producing an efficient execution strategy is more complex. A relational calculus query may have many equivalent and correct transformations into relational algebra. Since each equivalent execution strategy can lead to different consumptions of computer resources, the main problem is to select the execution strategy that minimizes the resource consumption. Example: We consider the following subset of engineering database scheme given in fig.6.0: E (ENO, ENAME, TITLE) G (ENO, JNO, RESP, DUR) and the simple user query: “ Find the names of employees who are managing a project”. E G ENO ENAME TITLE ENO JNO RESP DUR E1 A Elect. Eng. E1 J1 Manager 12 E2 B Syst. Arial, E2 J1 Analyst 24 E3 C Mech. Eng. E2 J2 Analyst 6 E4 D Programmer E3 J3 Consultant 10 E5 E Syst. Anal. E3 J4 Engineer 48 E6 F Elect. Eng. E4 J2 Programmer 18 E7 G Mech. Eng. E5 J2 Manager 24 E8 H Syst. Anal. E6 J4 Manager 48 E7 J3 Engineer 36 E8 J3 Manager 40 J S JNO JNAME BUDGET LOC TITLE SAL J1 Instrumentation 150000 Montreal Elect. Eng. 40000 J2 Database Develop. 135000 New York Syst. Anal. 34000 J3 CAD/CAM 250000 New York Mech. Eng. 27000 J4 Maintenance 310000 Paris Programmer 24000 Figure 6.0 Example Database The equivalent relational calculus using SQL syntax is: SELECT ENAME FROM E, G WHERE E.ENO = G.ENO AND RESP = “ Manager” Two equivalent relational algebra queries that are correct transformations of the above query are: PJ ENAME(SL RESP = “Manager” AND E.ENO=G.GNO (E CP G)) and PJ ENAME (E JN ENO (SL RESP = “Manager” (G))) NOTE: The following observations are made from the above example: It can be observed that the second query avoids the Cartesian product (CP) of E and G, consumes much less computing resource than the first and thus should be retained. That is, we have to avoid performing Cartesian product operation on a full table. In a centralized environment, the role of the query processor is to choose the best relational algebra query for a given query among all equivalent ones. In a distributed environment, relational algebra is not enough to express execution strategies. It must be supported with operations for exchanging data between sites. The distributed query processor has to select the best sites to process the data and the way in which the data should be transformed with the choice of ordering the relations. Example: This example illustrates the importance of site selection and communication for a chosen relational algebra query against a fragmented database. We consider the following query: PJ ENAME (E JN ENO (SL RESP = “Manager” (G))) This query is written considering the relations of the previous example. We assume that the relations E and G are horizontally fragmented as follows: E 1 = SL ENO “ E3” (E) E 2 = SL ENO > “ E3” (E) G1 = SL ENO “ E3” (G) G2 = SL ENO > “ E3” (G) Fragments G1, G2, E1 and E2 are stored at the sites 1,2,3, and 4, respectively, and the result is expected at the site 5 as shown in the fig 6.1. For simplicity, we have ignored the project operation here. In the figure two equivalent strategies for the above query are shown. Some of the observations of the Strategies: An arrow from site i to site j labeled with R indicates that relation R is transferred from Strategy site i to site j. A exploits the fact that relations E and G are fragmented in the same way in order to perform the select and join operations in parallel. Strategy B centralizes all the operations and the data at the result site before processing the query. Resource consumption of these two strategies: Assumptions made: 1. Tuple access denoted as tupacc is 1 unit. 2. A tuple transfer, denoted as tuptrans, is 10 units. 3. Relations E and G have 400 and 1000 tuples respectively. 4. There are 20 managers in relation G. 5. The data is uniformly distributed among sites. 6. E and G relations are locally clustered an attributes RESP and ENO, respectively. 7. There is direct access to tuples of G (respectively, E) based on the value of attribute RESP (respectively, ENO) The Cost Analysis: The cost of strategy A can be derived as follows: 1. Produce G' by selecting G requires 20 * tupacc = 20 2. Transfer G' to the sites of E requires 20 * tuptrans = 200 3. Produce E' by joining G' and E requires (10*10)* tupacc*2 4. Transfer E' to result site requires 20* tuptrans The total cost = 200 = 200 620 The cost of strategy B can be derived as follows: 1. Transfer E to site 5 requires 400 * tuptrans = 4000 2. Transfer G to site 5 requires 1000 * tuptrans = 10000 3. Produce G' by selecting G requires 1000 * tupacc = 1000 4. Join E and G' requires 400 * 20 * tupacc = 8000 The total cost 23000 The strategy A is better by a factor of 37, which is quite significant. Also it provides the better distribution of work among the sites. The difference would be still better if we assume slower communication and/or higher degree of fragmentation. Result = E1 UN E2 E'1 E'2 Site 3 Site 4 E2 = E2 JN ENO G2 E1 = E1 JN ENO G1 G'1 G'2 Site 5 Site 2 Site 1G1 = SLRESP = ‘Manager’ G1 G2 = SLRESP = ‘Manager’ G2 (a) Strategy A Result = (E1 UN E2 JN ENO PJ RESP ='Manager’ (G1 UN G2) G1 G2 Site 1 Site 2 E1 Site 3 E2 Site 4 (b) Strategy B Fig.6.1 Equivalent Distributed Execution Strategies 6.3 Objectives of Query Processing: The main objectives of query processing in a distributed environment is to form a high level query on a distributed database, which is seen as a single database by the users, into an efficient execution strategy expressed in a low level language on local databases. An important point of query processing is query optimization. Because many execution strategies are correct transformations of the same high-level query, the one that optimizes (minimizes) resource consumption should be retained. The good measures of resource consumption are: o The total cost that will be incurred in processing the query. It is the some of all times incurred in processing the operations of the query at various sites and intrinsic communication. o The resource time of the query. This is the time elapsed for executing the query. Since operations can be executed in parallel at different sites, the response time of a query may be significantly less than its cost. Obviously the total cost should be minimized. o In a distributed system, the total cost to be minimized includes CPU, I/O, and communication costs. These costs can be minimized by reducing the number of I/O operations through fast access methods to the data and efficient use of main memory. The communication cost is the time needed for exchanging the data between sites participating in the execution of the query. This cost is incurred in processing the messages and transmitting the data on the communication network. In distributed system, the communication cost factor is largely dominating the local processing cost, so that the other cost factors are ignored. o In centralized systems, only CPU and I/O cost have to be considered. 6.4 Characterization of Query Processors: It is very difficult to give the characteristics, which differentiates centralized and distributed query processors. Still some of them have been listed here. Out of them, the first four are common to both and the next four are particular to distributed query processors. o Languages: The input language to the query processor can be based on relational calculus or relational algebra. The former requires an additional phase to decompose a query expressed in relational calculus to relational algebra. In distributed context, the output language is generally some form of relational algebra augmented with communication primitives. That is it must perform perfect mapping between input languages with the output language. o Types of optimization: Conceptually, query optimization is to choose a best point of solution space that leads to the minimum cost. A popular approach called exhaustive search is used. This is a method where heuristic techniques are used. In both centralized and distributed systems a common heuristic is to minimize the size of intermediate relations. Performing unary operations first and ordering the binary operations by the increasing size of their intermediate relations can do this. o Optimization Timing: A query may be optimized at different times relative to the actual time of query execution. Optimization can be done statically before executing the query or dynamically as the query is executed. The main advantage of the later method is that the actual sizes of the intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice. The main drawback of the dynamic method is that the query optimization, which is an expensive one, must be repeated for each and every query. So, Hybrid optimization may be better in some situation. o Statistics: The effectiveness of the query optimization is based on statistics on the database. Dynamic query optimization requires statistics in order to choose the operation that has to be done first. Static query optimization requires statistics to estimate the size of intermediate relations. The accuracy of the statistics can be improved by periodical updating. o Decision sites: Most of the systems use centralized decision approach, in which a single site generates the strategy. However, the decision process could be distributed among various sites participating in the elaboration of the best strategy. The centralized approach is simpler but requires the knowledge of the complete distributed database where as the distributed approach requires only local information. Hybrid approach is better where the major decisions are taken at one particular site and other decisions are taken locally. o Exploitation of the Network Topology: the distributed query processor exploits the network topology. With wide area networks, the cost function to be minimized can be restricted to the data communication cost, which is a dominant factor. This issue reduces the work of distributed query optimization, that can be dealt as two separate problems: Selection of the global execution strategy, based on the inter-site communication and selection of each local execution strategy, based on a centralized query processing algorithms. With local area networks, communication costs are comparable to I/O costs. Therefore, it is reasonable to the distributed query processor to increase parallel execution at the cost of increasing communication. o Exploitation of Replicated fragments: For reliability purposes it is useful to have fragments replicated at different sites. Query processors have to exploit this information either statically or dynamically for processing the query efficiently. o Use of semi- joins: The semi-join operation reduces the size of the data that are exchanged between the sites so that the communication cost can be reduced. 6.5 Layers Of Query Processing: The problem of query processing can itself be decomposed into several subprograms, corresponding to various layers. In figure 6.2, a generic layering scheme for query processing is shown where each layer solves a well-defined sub-problem. The input is a query on distributed data expressed in relational calculus. This distributed query is posed on global (distributed) relations, meaning that data distribution is hidden. Four main layers are involved to map the distributed query into an optimized sequence of local operations, each acting on a local database. These layers perform the functions of query decomposition, data localization, global query optimization, and local query optimization. The first three layers are performed by a central site and use global information; the local sites do the fourth. CALCULUS QUERY ON DISTRIBUTED RELATIONS QUERY GLOBAL SCHEMA DECOMPOSITION ALGEBRAIC QUERY ON DISTRIBUTED RELATIONS CONTROL SITE DATA LOCALIZATION FRAGMENT SCHEMA FRAGMENT QUERY GLOBAL OPTIMIZATION STATISTICS ON FRAGMENTS OPTIMIZED FRAGMENT QUERY WITH COMMUNICATION OPERATIONS LOCAL SITES LOCAL OPTIMIZATION LOCAL SCHEMA OPTIMIZED LOCAL QUERIES Figure 6.3 Generic Layering Scheme for Distributed Query Processing 6.5.1 Query Decomposition: The first layer decomposes the distributed calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations. However, the information about data distribution is not used here but in the next layer. Thus the techniques used by this layer are those of a centralized DBMS. Query decomposition can be viewed as four successive steps: o The calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a query generally involves the manipulation of the query quantifiers and of the query qualification by applying logical operator priority. o The normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of relational calculus. Typically, they use some sort of graph that captures the semantics of the query. o The correct query (still expressed in relational calculus) is simplified. One way to simplify a query is to eliminate redundant predicates. o The calculus query is restructured as an algebraic query. The quality of an algebraic query is defined in terms of expected performance. The traditional way to do this transformation toward a "better" algebraic specification is to start with an initial algebraic query and transform it in order to find a "good" one. The initial algebraic query is derived immediately from the calculus query by translating the predicates and the target statement into relational operations as they appear in the query. This directly translated algebra query is then restructured through transformation rules. The algebraic query generated by this layer is good in the sense that the worse executions are avoided. 6.5.2 Data Localization: The input to the second layer is an algebraic query on distributed relations. The main role of the second layer is to localize the query’s data using data distribution information. Relations are fragmented and stored in disjoint subsets called fragments, each being stored at a different site. This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query. Fragmentation is defined through fragmentations rules that can be expressed as relational operations. A distributed relation can be reconstructed by applying the fragmentation rules, and then deriving a program, called a localization program, of relational algebra operations, which then act on fragments. Generating a fragments query is done in two steps. o The distributed query is mapped into a fragment query by substituting each distributed relation by its reconstruction program (also called materialization program. o The fragment query is simplified and restructured to produce another “good” query. Simplification and restructuring may be done according to the same rules used in the decomposition layer. As in the decomposition layer, the final fragment query is generally far from optimal because information regarding fragments is not utilized. 6.5.3 Global Query Optimization: The input to the third layer is a fragment query, that is, an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query, which is close to optimal. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites. The previous layers have already optimized the query for example, by eliminating redundant expressions. However, this optimization is independent of fragments characteristics such as cardinalities. In addition, communication operations are not yet specified. By permuting the ordering of operations within one fragment query, many equivalent queries may be found. Query optimization consists of finding the “best” ordering of operations in the fragments query, including communication operations, which minimize a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost and so on. An important aspect of query optimization is join ordering, since permutations of the joint within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operations is through the semi-join operator. The main value of the semi-join in a distributed system is to reduce the size of the join operands and then the communication cost. The output of the query optimization layer is an optimized algebraic query with communication operation included on fragments. 6.5.4 Local Query Optimization: The last layer us performed by all the sites having fragments involved in query. Each sub-query executing at one site, called a local query, is then optimized using the local schema of the site. At this time, the algorithms to perform the relational operations may be chosen. Local optimization uses the algorithms of centralized systems. Check Your Progress: Answer the following: a) What is a Query processor? b) State the Query processing problem. c) Explain the different characteristics of Query processor. d) Describe the layer architecture of query processing. e) Discuss Query optimization. 6.6 Summary: In this unit we have provided an overview of query processing in distributed DBMSs. The following points are discussed: We have introduced the function and objectives of query processing. The goals of the query processing are discussed. They are Given a calculus query on a distributed database, find a corresponding execution strategy that minimizes a system cost function, which includes I/O, CPU, and communication costs. An execution strategy is specified in terms of relational algebra operations and communication primitives applied to the local databases. We have described a characterization of query processors based on their implementation choices. This is useful for comparing alternative query processor designs and to understand the trade-offs between efficiency and complexity. We have proposed a generic layering scheme for describing distributed query processing. Here four main functions have been isolated: Query decomposition, Data localization, Query optimization, and Local query optimization.