Efficient Query Optimization for Distributed Join in Database Federation

A Master’s Thesis Proposal
by Di Wang
Advisor: Prof. Murali Mani
Reader: _____________
Department of Computer Science, WPI
100 Institute Road, Worcester, MA 01609
November, 2008

1 Introduction

1.1 Data Integration and Database Federation

In a modern enterprise, it is almost inevitable that different parts of the organization will use different systems to produce, store, and search their data. Yet it is only by integrating the information from these various systems that the enterprise can realize the full value of the data they contain. In the finance industry, mergers are commonplace. After a merger, the new company needs to access the customer information from both sets of data stores, to analyze its new portfolio using existing and new applications, and to use the combined resources of both institutions through a common interface. In addition, today’s companies must be able to combine data with their business partners, because business relationships and partnerships are continuously being created.

Besides the ubiquitous need for data integration in the business world, there is growing interest in the scientific community in allowing disparate groups of users to share resources consisting of both data collections and programs [2, 3]. The World Wide Web also has to deal with vast heterogeneous collections of structured data, which bring additional data integration requirements.

Database federation, also known as database mediation, is one approach to data integration. Its key performance advantage is the ability to efficiently combine data from multiple sources in a single query statement. The data sources are federated into a unified middleware, called the mediator. The user can submit a query that accesses data from multiple sources, joining and restricting, aggregating and analyzing the data at will, without knowing exactly what the sources are.

Two key components of database federation are the query rewriter and the cost-based optimizer. The query rewriter can aggressively rewrite a user’s query into a semantically equivalent form that can be executed more efficiently across multiple sources. The cost-based optimizer can search a large space of feasible execution plans and choose an optimal plan based on the cost metric. In this work, we focus on cost-based optimization for distributed joins in database federation.

1.2 Query Optimization in Database Federations

Database federation technology has been the subject of multiple research thrusts, including containment algorithms for conjunctive queries, schema mapping, capability-based optimization, and cost-based optimization. The query optimization work is closely related to early optimization techniques developed for distributed database systems, e.g. R* and Mariposa [9, 10]. More recently, many cost-based optimization works have focused on variants of the cost model and on novel cost-collecting techniques, in order to adapt the optimization to specific federated environments. However, two problems concerning cost-based optimization in federated database systems have not been sufficiently studied.

Problem 1. It is a fairly straightforward observation that run-time conditions can significantly affect the execution cost of a query plan. Yet most existing distributed database optimizers fail to take run-time conditions into account. When measuring the cost of a candidate plan, such optimizers in the mediator often treat the costs of the operators at each site as static. Hence a plan has a constant estimated cost, and consequently an input query always yields the same ‘optimal’ execution plan. Moreover, the optimizer in the mediator is not aware of the run-time system parameters (e.g. the number of available buffers) of each site.
Hence, when the system environment of a site changes substantially, the plan that was optimal at optimization time may perform badly at run time.

Several works have considered taking run-time conditions into account at optimization time. Parametric query optimization attempts to identify several execution plans, each of which is optimal for a subset of all possible values of the run-time parameters. Since the number of parameter values (or combinations of values) is large, the optimizer has to explore a huge set of alternative plans. This approach therefore depends heavily on an economical exploration algorithm. The overhead of producing multiple plans for a query can be acceptable in a centralized database system, as long as the query is likely to be executed many times, but the size of the plan space is prohibitive in distributed database systems. Because site selection must be considered in addition to algebraic transformations and the selection of physical methods, optimization is likely to become the most costly step.

The XPRS project proposes a two-phase algorithm for multi-user parallel databases. In the first phase, performed at compile time, only sequential query execution plans are considered. In the second phase, performed at run time, the algorithm finds the optimal parallelization of the best sequential plan chosen in the first phase, using the run-time parameters. This algorithm has been further extended to distributed environments, where the second phase requires an exhaustive search over all possible site schedules. This simple algorithm performs surprisingly well, as long as communication cost forms a small fraction of the total cost and the exhaustive search in the second phase is not very expensive. However, these assumptions do not hold in general. Transferring large files over a long-haul network can be very costly.
Moreover, if the number of data sources is large, the number of possible site assignments can be huge, even for a single static plan.

Our work proposes a cluster-and-conquer approach to handle the run-time condition problem in database federation optimization. First, we view all data sources in the database federation as a set of clusters of sites. This abstraction matches many real-world settings: 1) many national-scale or global-scale data federations are built on networks that consist of both broad LAN paths and narrow long-haul paths, so a group of sites connected via a LAN, or located within a certain geographic area, can be viewed as a cluster; 2) many highly integrated systems have to access data from a large number of databases that belong to multiple organizations, in which case the databases that belong to the same organization can be viewed as a cluster. Second, we design two layers of mediators that schedule the query plan cooperatively. The global mediator produces a high-level optimal plan over the clusters, using information about the physical database designs. The cluster mediator, which virtually resides in each cluster, then handles the sub-plan that only needs to access data sites in its cluster. It collects run-time parameters from the relevant data sites and then finds the intra-cluster optimal plan. This cluster-and-conquer approach reduces the plan search space, because it eliminates the many plans that join tables across distinct clusters unnecessarily. We explain and analyze this approach in Section 3, and will present experimental results on its efficiency in future work.

Problem 2. The design space of candidate query plans in database federation is greatly enlarged by the distributed environment and by finer-granularity scheduling algorithms.
Various algorithms have been developed to explore the plan space more efficiently. Unfortunately, the cost of running such complicated optimization algorithms increases correspondingly, which is known as the ‘cost of costing’. The trade-offs in optimization overhead of some well-known optimization algorithms have been studied in prior work. Instead of analyzing the performance of various optimization algorithms, we look at this issue from a different angle. The fact that a relatively simple query can incur a huge plan space (commonly exponential in a centralized database, and even larger in a distributed database) prohibits the complete enumeration of all possible execution plans. Also, when handling a relatively simple query, many advanced yet cumbersome optimization algorithms are launched unnecessarily. Thus we propose to acquire the properties of the incoming query and then to customize the optimization process based on this information, which can potentially decrease the cost of optimization. In some DBMSs, such as PostgreSQL, the user can set certain parameters by hand to configure the threshold for launching certain optimization processes. However, to the best of our knowledge, neither a systematic classification of distributed queries based on a group of properties, nor a general approach for using this information at optimization time in database federation, has been studied before.

Due to time limitations, I will mainly focus on the first problem in my Master’s thesis, studying and implementing the proposed cluster-and-conquer approach, and will continue to work on the second problem in my future Ph.D. study.

Figure 1. The architecture of a database federation

1.3 Contributions

In this Master’s thesis, we first argue the need to take run-time conditions into consideration in database federation optimization, and accordingly present the cluster-and-conquer approach to efficiently find an optimal execution plan for distributed joins.
Second, we will implement this approach in a primitive federated database system that we have built ourselves, and run experiments on a set of modified TPC-H benchmark queries. We will also present a preliminary analysis explaining the experimental results.

2 Architecture and Problem Definition

The architecture of our system is shown in Figure 1. Our database federation system provides the two-layer mediator and the query classification discussed in the previous section. We make the following assumptions:

1) The physical database design is known to the global mediator.
2) The run-time condition of a site is assumed to be known only to the sites in the same cluster.
3) The number of available buffers is fixed during the entire query execution.

The most significant parts of the system for the cluster-and-conquer approach are the optimizer and the executor inside the global mediator, and the set of cluster mediators.

The optimizer uses a System R style algorithm, extended to also search the space of bushy plans. The optimizer runs at compile time and treats all tables as being stored in a clustered fashion, i.e. operations on tables in the same cluster are arranged to execute first, and inter-cluster operations are executed afterwards. It employs a traditional cost model, discussed in detail in Section 2.1, which makes use of the physical design information of the data sources. The executor then schedules the optimal plan found by the optimizer in a distributed and parallelized way (discussed in Section 2.2), and assigns each sub-plan to the corresponding cluster.

The cluster mediator takes as input a sub-plan (also called a plan fragment) assigned by the executor. It requests the run-time parameters from the data sources in its cluster; in this work we consider only the size of available buffers and the load conditions.
Based on this run-time information as well as the static physical design, the cluster mediator can find an intra-cluster optimal plan. Every cluster mediator functions independently and potentially in parallel. The inter-cluster operations are then executed as decided in advance by the optimizer.

2.1 Cost Model and Optimization Goal

Generally speaking, the overall performance goal of a database federation is to obtain increased throughput and decreased response time in a multi-user environment. We consider both the total resources consumed and the response time. Given a join schedule over n sites, we define the cost as:

$$\mathrm{Cost} = \sum_{i=1}^{n} \mathrm{Res\_Con}_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{Trans\_Data}_{ij} + w \cdot \mathrm{Resp\_Time}$$

As the names suggest, Res_Con_i denotes the resources consumed at site i, Trans_Data_ij the cost of the data transferred between sites i and j, and Resp_Time the response time of the schedule; w is a system-specific weighting factor. Our optimization problem is to find the distributed join schedule with minimum cost.

2.2 Parallelism and Pipelining

Typically there are three forms of intra-query parallelism:

1) Partitioned parallelism: a single operator is executed on a set of sites by partitioning its input data set.
2) Pipelined parallelism: a sequence of operators is executed on a set of sites in a pipelined manner.
3) Independent parallelism: multiple operators with no pipelining between them are executed in parallel on a set of sites, independently of each other.

Partitioned parallelism is also called intra-operation parallelism, while the other two are forms of inter-operation parallelism. In this work, we consider only independent parallelism, a form of inter-operation parallelism, for the following reasons. First, partitioning the input data is often not feasible in a database federation, because moving data away from its original location may not be allowed. Second, in a bushy plan it is common to have two operations that do not depend on each other’s output, which makes them ideal candidates for concurrent execution.
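
To make the cost definition of Section 2.1 concrete, the following Python sketch evaluates it for a small hand-made schedule. It is only an illustration under stated assumptions: the function names `schedule_cost` and `response_time`, the tuple encoding of a plan tree, and all numbers are hypothetical; all three cost terms are assumed to be expressed in the same unit; and sites are indexed from 0 rather than 1. The response time is estimated under independent parallelism only, so sibling sub-plans are assumed to run concurrently and a node waits for its slowest child.

```python
from itertools import combinations


def response_time(node):
    """Estimate the response time of a plan tree.

    `node` is a hypothetical (own_time, children) tuple. Under independent
    parallelism the children run concurrently, so the node starts after the
    slowest child has finished.
    """
    own_time, children = node
    return own_time + max((response_time(c) for c in children), default=0.0)


def schedule_cost(res_con, trans_data, resp_time, w):
    """Cost = sum_i Res_Con_i + sum_{i<j} Trans_Data_ij + w * Resp_Time.

    res_con    : list with the resources consumed at each of the n sites
    trans_data : dict mapping a site pair (i, j), i < j, to its transfer cost
                 (missing pairs are assumed to transfer nothing)
    resp_time  : response time of the whole schedule
    w          : system-specific weighting factor
    """
    n = len(res_con)
    resources = sum(res_con)
    transfer = sum(trans_data.get((i, j), 0.0)
                   for i, j in combinations(range(n), 2))
    return resources + transfer + w * resp_time


# Hypothetical bushy plan: a top join (3 time units) over two independent
# sub-plans that take 2 and 5 time units respectively.
plan = (3.0, [(2.0, []), (5.0, [])])
resp = response_time(plan)  # 3 + max(2, 5) = 8

res_con = [4.0, 6.0, 1.0]                  # three sites
trans_data = {(0, 2): 2.0, (1, 2): 3.0}    # both inputs are shipped to site 2
cost = schedule_cost(res_con, trans_data, resp, w=0.5)
print(resp, cost)
```

With these made-up numbers the resources sum to 11, the transfers to 5, and the weighted response time to 0.5 · 8 = 4, giving a cost of 20. A larger w makes the optimizer favour schedules with more concurrency even if they consume more resources in total.
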
To simplify our study, we do not consider pipelined parallelism, the other form of inter-operation parallelism, in this work, but we will include it in future work.

3 Further Details of Proposed Work

As mentioned earlier, it is necessary to consider run-time conditions in the optimization of a database federation. However, to the best of our knowledge, there is so far no efficient and easy-to-implement approach that lets the optimizer take run-time conditions into account. Hence we propose the cluster-and-conquer approach, which was introduced informally earlier in this proposal. We now revisit this approach to argue that it produces reasonably good plans.

3.1 Data Structure and Query Optimization

We now introduce the following data structures to explain how this approach works. Figure 2(a) shows a clustered view of a data federation. The global mediator only needs to decide inter-cluster operations. Here we implicitly impose the join ordering constraint that intra-cluster operations are executed first, i.e. plan trees that join two leaves in distinct clusters are eliminated. This constraint decreases the plan space explored at compile time.

Since only a subset of plans will be fully explored at run-time optimization, one might expect this approach to produce much worse plans than an exhaustive algorithm. However, notice that the clustering is based on several essential properties of a database federation: there exist enterprise boundaries, which forbid moving data to other enterprises’ sites; and in a global database federation, data transfer over long-haul paths is costly, while data transfer within a LAN is cheap. Hence joining base tables across distinct clusters is either infeasible or prohibitively expensive. Moreover, our approach releases the global mediator from the cumbersome work of collecting or estimating the run-time parameters of all sites.
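
The effect of eliminating cross-cluster leaf joins on the size of the plan space can be illustrated by a rough count. The sketch below is an illustration only, not part of the proposed system: it counts logical join trees, ignores physical operators and site assignment, and assumes that every join order is admissible (no restriction from the join graph). The helper names `bushy_trees` and `clustered_trees` are hypothetical. It uses the standard count (2(n-1))!/(n-1)! of ordered bushy join trees over n relations, and assumes that each cluster is fully joined before any inter-cluster join takes place.

```python
from math import factorial, prod


def bushy_trees(n):
    """Number of ordered bushy join trees over n distinct relations."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


def clustered_trees(cluster_sizes):
    """Join trees left when intra-cluster joins must be executed first.

    Each cluster of k relations is joined internally in bushy_trees(k) ways,
    and the m cluster results are then combined in bushy_trees(m) ways.
    """
    intra = prod(bushy_trees(k) for k in cluster_sizes)
    inter = bushy_trees(len(cluster_sizes))
    return intra * inter


# Six relations, either unconstrained or split into two clusters of three.
print(bushy_trees(6))            # 30240
print(clustered_trees([3, 3]))   # 12 * 12 * 2 = 288
```

Under these simplifying assumptions, clustering six relations into two clusters of three shrinks the space of join trees by roughly two orders of magnitude, and the gap widens quickly as the number of relations grows.
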
We will run experiments to further verify that our approach performs well under the cost model defined previously.

Figure 2. Data structures: (a) Clustered view, (b) Global execution tree, (c) Operation tree

Figure 2(b) shows an execution plan tree produced by the global mediator. The inter-cluster join is determined, while intra-cluster joins are delegated to the cluster mediators. The executor then assigns the sub-plans to clusters: in our example, the left sub-tree in Figure 2(b) is assigned to Cluster 1 and the right sub-tree to Cluster 2.

Figure 2(c) is an operation tree produced by the cluster mediator of Cluster 1 for the left sub-tree in Figure 2(b). Each node represents a physical operator, annotated with the site where the operator is performed, so the operation tree also makes the flow of data transfer explicit. In principle, every operator except scan() can be executed at any site in a cluster, whereas scan() can only be executed at the original location of the relation. This fine-grained operator scheduling is desirable since it implies lower resource requirements and allows potentially better load balancing. In this example, every operator except scan() is accompanied by a site assignment.

As mentioned before, the cluster mediator is responsible for collecting the run-time parameters of the data sources in its cluster, which are unknown at compile time. In our work, we consider the following run-time parameters:

1) Available buffer size: the number of buffer pages allocated to an operator. This parameter determines the number of runs in a hash join or a sort-merge join.
2) CPU utilization: this parameter determines the possible speedup of operation execution.

The cluster mediator’s choice of operations and site scheduling is sensitive to the values of these parameters.

Having the cluster mediator handle the intra-cluster scheduling autonomously has three benefits.
First, communication within a cluster is fast, so the values of the run-time parameters collected by the cluster mediator are much fresher than those the global mediator could gather. Second, each cluster mediator can handle its own sub-plan concurrently, which implicitly exploits independent parallelism. Third, the complexity of centralized optimization of a whole query plan in a distributed environment is greatly decreased, since each cluster mediator conquers its own, less complex sub-plan.

3.2 Experimentation Design

We plan to perform experiments on a primitive database federation system that we have built ourselves. One goal of these experiments is to motivate the need for considering run-time parameters and to understand the trade-offs involved in the optimization process. Moreover, we will implement the proposed cluster-and-conquer approach and verify it by running test cases and analyzing the results with our cost model. We also want to implement a simplistic version of the exhaustive algorithm and the two-phase algorithm, in order to compare our approach with them.

For the query workload, we will use queries from the TPC-H benchmark. Since we want to concentrate only on join ordering and scheduling, we will modify those queries accordingly. The network will be simulated using an existing message cost model: a data set of size n bytes takes α + β · n to reach the other end, where α is the start-up cost and β is the transfer cost per byte. By setting these cost parameters we can simulate a local area network as well as a wide area network.

4 Evaluation

To evaluate the correctness and efficiency of the proposed approach, we will first validate the cluster-and-conquer approach theoretically, by analyzing its algorithm and comparing it with existing related work.
In addition, we will check the implementation of the approach, including whether the simulations are reasonable and whether the primitive database federation works correctly. Finally, we will revisit the design of the experiments and analyze their results.

5 Schedule

September – October 2008: clarify the research ideas and present the proposal.
November 2008: polish the prototype of the database federation and perform motivating experiments.
December 2008 – January 2009: implement the proposed approach.
January – February 2009: test the approach and carry out the experimental study.
February – April 2009: revise and complete the thesis.

References

[1] L. M. Haas, E. T. Lin, M. A. Roth. Data Integration through Database Federation. IBM Systems Journal, Vol. 41, No. 4, 2002.
[2] I. Manolescu, L. Bouganim, F. Fabret, E. Simon. Efficient Querying of Distributed Resources in Mediator Systems. CoopIS/DOA/ODBASE 2002, pp. 468-485, 2002.
[3] X. Wang, R. Burns, A. Terzis, A. Deshpande. Network-Aware Join Processing in Global Scale Database Federations. ICDE 2008.
[4] J. Blakeley, C. Cunningham, N. Ellis, B. Rathakrishnan, M. Wu. Distributed/Heterogeneous Query Processing in Microsoft SQL Server. ICDE 2005.
[5] J. Madhavan, S. R. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, A. Halevy. Web-scale Data Integration: You Can Only Afford to Pay As You Go. CIDR 2007.
[6] J. D. Ullman. Information Integration Using Logical Views. Proceedings of the 6th International Conference on Database Theory, 1997.
[7] Y. Papakonstantinou, A. Gupta, L. Haas. Capabilities-Based Query Rewriting in Mediator Systems. Distributed and Parallel Databases, Volume 6, Issue 1, 1998.
[8] M. T. Roth, F. Ozcan, L. Haas. Cost Models Do Matter: Providing Cost Information for Diverse Data Sources in a Federated System. VLDB 1999.
[9] L. F. Mackert, G. M. Lohman. R* Optimizer Validation and Performance Evaluation for Distributed Queries. VLDB 1986.
[10] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, A. Yu. Mariposa: A Wide-Area Distributed Database System. The VLDB Journal, 1996.
[11] Z. G. Ives, A. Y. Halevy, D. S. Weld. Adapting to Source Properties in Processing Data Integration Queries. SIGMOD 2004.
[12] A. Deshpande, J. M. Hellerstein. Decoupled Query Optimization for Federated Database Systems. ICDE 2002.
[13] PostgreSQL documentation on Server Configuration, http://www.postgresql.org/docs/8.3/static/runtime-config-query.html
[14] Y. E. Ioannidis, R. T. Ng, K. Shim, T. K. Sellis. Parametric Query Optimization. VLDB 1992.
[15] W. Hong, M. Stonebraker. Optimization of Parallel Query Execution Plans in XPRS. In Proc. of the 1st International PDIS Conference, 1991.
[16] M. N. Garofalakis, Y. E. Ioannidis. Parallel Query Scheduling and Optimization with Time- and Space-Shared Resources. VLDB 1997.