Efficient Query Optimization for Distributed Join
in Database Federation
A Master’s Thesis Proposal
by
Di Wang
Advisor: Prof. Murali Mani
Reader: _____________
Department of Computer Science, WPI
100 Institute Road
Worcester, MA 01609
November, 2008
1 Introduction
1.1 Data Integration and Database Federation
In a modern enterprise, it is almost inevitable
that different parts of the organization will use
different systems to produce, store, and search
their data. Yet it is only by integrating the
information from these various systems that the
enterprise can realize the full value of the data
they contain [1]. In the finance industry, for
example, mergers are commonplace. After a merger,
the new company needs to access the customer
information from both sets of data stores, to
analyze its new portfolio using existing and new
applications, and to use the combined resources of
both institutions through a common interface. In
addition, today’s companies must be able to combine
data with their business partners, because business
relationships and partnerships are continuously
being created [4].
Besides the ubiquitous need for data integration
in the business world, there is growing interest in
the scientific community in allowing disparate
groups of users to share resources consisting of
both data collections and programs [2, 3]. The
World Wide Web also has to deal with vast
heterogeneous collections of structured data,
which impose additional data integration
requirements [5].
Database federation, also known as database
mediation, is one approach to data integration. Its
key performance advantage is the ability to
efficiently combine data from multiple sources in a
single query statement. The data sources are
federated behind a unified middleware layer, called
the mediator. The user can submit a query that
accesses data from multiple sources, joining and
restricting, aggregating and analyzing the data at
will, without knowing exactly what the sources are
[1].
Two key components of database federation are the
query rewriter and the cost-based optimizer. The
query rewriter can aggressively rewrite a user’s
query into a semantically equivalent form that
can be executed more efficiently across multiple
sources. The cost-based optimizer can search a
large space of feasible execution plans and
choose an optimal plan based on the cost metric.
In this work, we focus on cost-based
optimization for distributed joins in database
federation.
1.2 Query Optimization in Database Federations
Database federation technology has been the
subject of multiple research thrusts, including
containment algorithms for conjunctive queries
[6], schema mapping, as well as capability-based
optimization [7] and cost-based optimization [8].
The query optimization work is closely related
to early optimization techniques developed for
distributed database systems, e.g. R* and
Mariposa [9, 10]. More recently, much of the work
on cost-based optimization has focused on variants
of the cost model and on novel cost-collection
techniques, in order to adapt the optimization to
specific federated environments.
However, two problems concerning cost-based
optimization in federated database systems have
not been sufficiently studied.
Problem 1. It is a fairly straightforward
observation that run-time conditions can
significantly affect the execution cost of a query
plan. Yet most existing distributed database
optimizers fail to take run-time conditions into
account. When measuring the cost of a candidate
plan, the optimizer in the mediator often treats
the costs of operators at each site as static.
Hence, a plan has a constant estimated cost, and a
given input query always yields the same ‘optimal’
execution plan. Moreover, the optimizer in the
mediator is not aware of the run-time system
parameters (e.g. the number of available buffers)
of each site. Hence, when the system environment
of a site changes substantially, the plan that was
optimal at optimization time may perform badly at
run time.
Several works have considered taking run-time
conditions into account at optimization time.
Parametric query optimization attempts to
identify several execution plans, each of which is
optimal for a subset of all possible values of the
run-time parameters [14]. Since the number of
parameter values (or combinations of values) is
large, the optimizer has to explore a huge set of
alternative plans. This approach hence depends
heavily on an economical exploration algorithm.
Even though the overhead of producing multiple
plans for a query can be acceptable in a
centralized database system as long as the query
is likely to be executed many times, the size of
the plan space is prohibitive in distributed
database systems. Because site selection must be
considered in addition to algebraic
transformations and the selection of physical
methods, optimization is likely to become the most
costly step.
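The run-time half of parametric query optimization can be pictured as a table lookup. The following is a minimal sketch, not the algorithm of [14]: it assumes a single run-time parameter (available buffer pages), and the thresholds and plan names in `PLAN_TABLE` are invented for illustration only.

```python
from bisect import bisect_right

# Hypothetical compile-time result: each plan is assumed optimal from its
# lower bound (in buffer pages) up to the next entry's lower bound.
PLAN_TABLE = [
    (0, "nested-loop join plan"),
    (100, "sort-merge join plan"),
    (1000, "hash join plan"),
]

def pick_plan(available_buffers: int) -> str:
    """Select the precomputed plan whose parameter range covers the value."""
    if available_buffers < 0:
        raise ValueError("buffer count must be non-negative")
    bounds = [lower for lower, _ in PLAN_TABLE]
    # Index of the last range whose lower bound is <= available_buffers.
    index = bisect_right(bounds, available_buffers) - 1
    return PLAN_TABLE[index][1]
```

The lookup itself is cheap; the expensive part, as the text notes, is filling such a table when several parameters and many sites are involved.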
The XPRS project proposes a two-phase
algorithm for multi-user parallel databases [15].
In the first phase, which runs at compile time,
only sequential query execution plans are
considered. In the second phase, which runs at run
time, the algorithm finds the optimal
parallelization of the best sequential plan chosen
in the first phase, using the run-time parameters.
This algorithm has been further extended to
distributed environments [12], where the second
phase requires an exhaustive search over all
possible site schedules. This simple algorithm
performs surprisingly well, as long as
communication cost forms a small fraction of the
total cost and the exhaustive search in the second
phase is not very expensive. However, these
assumptions do not hold in general. Transferring
large files over a long-haul network can be very
costly. Moreover, if the number of data sources is
large, the number of possible site assignments can
be huge, even for a single static plan.
Our work proposes a cluster-and-conquer
approach to handle the run-time condition
problem in database federation optimization.
First, we view the data sources in the database
federation as a set of clusters of sites. This
abstraction accords with many real-world facts:
1) many national-scale or global-scale data
federations are built on networks that consist of
both broad LAN paths and narrow long-haul paths.
Hence, a group of sites connected via a LAN, or a
group of sites located within a certain
geographic area, can be viewed as a cluster; 2)
many highly integrated systems have to access
data in a large number of databases that belong to
different organizations. In such a case, the set
of databases that belong to the same organization
can be viewed as a cluster. Second, we design two
layers of mediators that schedule the query plan
cooperatively. The global mediator produces a
high-level optimal plan over the clusters, using
information about the physical database designs.
The cluster mediator, which virtually resides in
each cluster, then handles the sub-plan that only
needs to access data sites in its cluster. The
cluster mediator collects run-time parameters from
the relevant data sites and then finds the
intra-cluster optimal plan.
This cluster-and-conquer optimization approach
reduces the plan search space, because it
eliminates the many plans that unnecessarily join
tables across distinct clusters. We explain and
analyze this approach in Section 3, and will
present experimental results showing its
efficiency in future work.
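A rough count gives a feel for the size of this reduction. The sketch below is a back-of-the-envelope illustration under assumptions of our own: it counts only join orders (ordered binary trees over distinct relations, cross products allowed, which number (2n-2)!/(n-1)! for n relations) and ignores site assignment and physical operator choice. The function names are hypothetical.

```python
from math import factorial, prod
from typing import List

def bushy_join_trees(n: int) -> int:
    """Ordered binary join trees over n distinct relations: (2n-2)!/(n-1)!."""
    if n < 1:
        raise ValueError("need at least one relation")
    return factorial(2 * n - 2) // factorial(n - 1)

def clustered_join_trees(cluster_sizes: List[int]) -> int:
    """Trees that finish all joins inside each cluster before joining clusters."""
    # Independent choice of a join tree inside every cluster ...
    inside = prod(bushy_join_trees(k) for k in cluster_sizes)
    # ... times the ways to combine the per-cluster results.
    across = bushy_join_trees(len(cluster_sizes))
    return inside * across

unrestricted = bushy_join_trees(6)          # six relations, no clustering
restricted = clustered_join_trees([3, 3])   # two clusters of three relations
```

For six relations split evenly over two clusters, the restricted space holds 288 join trees against 30240 unrestricted ones, and the gap widens as the number of relations grows.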
Problem 2. The space of candidate query plans in
a database federation is greatly enlarged by the
distributed environment and by finer-granularity
scheduling algorithms. Various algorithms have
been developed to explore the plan space more
efficiently. Unfortunately, the cost of running a
complicated optimization algorithm increases
correspondingly, which is known as the
‘cost of costing’. The trade-offs in optimization
overhead of some well-known optimization
algorithms are studied in [12].
Instead of analyzing the performance of various
optimization algorithms, we look at this issue
from a different angle. Even a relatively simple
query can incur a huge plan space (commonly
exponential in a centralized database, and even
larger in a distributed database), which prohibits
the complete enumeration of all possible execution
plans. Also, when handling a relatively simple
query, many advanced yet cumbersome optimization
algorithms are launched unnecessarily. Thus, we
propose to acquire the properties of the incoming
query and then to customize the optimization
process based on this information, which can
potentially decrease the cost of optimization.
In some DBMSs, such as PostgreSQL [13], the user
can set certain parameters by hand to configure
the thresholds at which particular optimization
processes are launched. However, to the best of
our knowledge, neither a systematic classification
of distributed queries based on a group of
properties nor a general approach to using this
information at optimization time in database
federation has been studied before.
Figure 1. The architecture of a database federation
Due to time limitations, I will mainly focus on
the first problem, studying and implementing the
proposed cluster-and-conquer approach in my
Master’s thesis. I will continue to work on the
second problem in my future Ph.D. study.
1.3 Contributions
In this Master’s thesis, we first argue the need
to take run-time conditions into consideration in
database federation optimization, and accordingly
present the cluster-and-conquer approach to
efficiently find an optimal execution plan for
distributed joins. Second, we will implement this
approach in a primitive federated database system
that we built ourselves, and run experiments on a
set of modified TPC-H benchmark queries. We will
also present a preliminary analysis explaining
the experimental results.
2 Architecture and Problem Definition
The architecture of our system is shown in
Figure 1. Our database federation system
provides the two-layer mediator and the query
classification discussed in the previous section.
We make the following assumptions:
• The physical database design is known to
the global mediator.
• The run-time conditions of a site are
guaranteed to be known only to the sites in the
same cluster.
• The available buffer size is fixed during the
entire query execution.
The most significant parts of the system for the
cluster-and-conquer approach are the optimizer
and the executor inside the global mediator, and
the set of cluster mediators. The optimizer uses a
System R style algorithm, extended to also search
the space of bushy plans. The optimizer runs at
compile time and treats all tables as being stored
in a clustered fashion, i.e. operations on tables
in the same cluster are scheduled first, and
inter-cluster operations are executed afterwards.
It employs a traditional cost model, discussed in
detail in Section 2.1, which makes use of the
physical design information of the data sources.
Subsequently the executor schedules the optimal
plan found by the optimizer in a distributed and
parallelized way (discussed in Section 2.2), and
assigns each sub-plan to the corresponding
cluster.
The cluster mediator takes as input a sub-plan
(also known as a plan fragment) assigned by the
executor. It requests the run-time parameters
from the data sources in its cluster; in this work
we consider only the size of available buffers and
the load conditions. Based on this run-time
information as well as the static physical design,
the cluster mediator can find an intra-cluster
optimal plan. Every cluster mediator functions
independently and potentially in parallel. The
inter-cluster operations are then executed as
pre-decided by the optimizer.
2.1 Cost Model and Optimization Goal
Generally speaking, the overall performance
goal of a database federation is to obtain
increased throughput and decreased response
time in a multiuser environment. We consider
both the total resources consumed and the
response time.
Given a join schedule over n sites, we define the
cost as:
$$
\mathit{Cost} = \sum_{i=1}^{n} \mathit{Res\_Con}_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathit{Trans\_Data}_{ij} + w \cdot \mathit{Resp\_Time}
$$
Here w is a system-specific weighting factor.
Our optimization problem is to find the
distributed join schedule plan with minimum
cost.
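As a worked illustration, the cost of one schedule can be evaluated directly from the definition. In this sketch we read Res_Con_i as the resources consumed at site i and Trans_Data_ij as the cost of data transferred between sites i and j; this reading, the function name and the sample numbers are assumptions for illustration, and all terms are taken to be in the same unit.

```python
from typing import List

def schedule_cost(res_con: List[float],
                  trans_data: List[List[float]],
                  resp_time: float,
                  w: float) -> float:
    """Cost = sum_i Res_Con_i + sum_{i<j} Trans_Data_ij + w * Resp_Time."""
    n = len(res_con)
    if len(trans_data) != n or any(len(row) != n for row in trans_data):
        raise ValueError("trans_data must be an n x n matrix")
    resources = sum(res_con)
    # Only the upper triangle (i < j) is counted, as in the double sum.
    transfer = sum(trans_data[i][j] for i in range(n) for j in range(i + 1, n))
    return resources + transfer + w * resp_time

# Three sites with made-up numbers: 60 + 11 + 0.5 * 40
example = schedule_cost(
    res_con=[10, 20, 30],
    trans_data=[[0, 5, 2], [0, 0, 4], [0, 0, 0]],
    resp_time=40,
    w=0.5,
)
```

A larger w favours schedules with short response time even when they consume more total resources, and a smaller w does the opposite.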
2.2 Parallelism and Pipelining
Typically there are three forms of intra-query
parallelism [16]:
• Partitioned parallelism: a single operator
is executed on a set of sites by partitioning
its input data set.
• Pipelined parallelism: a sequence of
operators is executed on a set of sites in a
pipelined manner.
• Independent parallelism: multiple operators
with no pipelining between them are executed in
parallel on a set of sites, independently of each
other.
Partitioned parallelism is also called
intra-operation parallelism, while the other two
are forms of inter-operation parallelism [15]. In
this work, we consider only independent
parallelism, for the following reasons.
First, partitioning the input data is often not
feasible in a database federation, because it may
not be permitted to move data away from their
original location. Second, in a bushy plan it is
common to have two operations that do not
depend on each other’s output, which makes them
ideal candidates for concurrent execution. To
simplify our study, we do not consider pipelined
parallelism, the other form of inter-operation
parallelism, in this work, but we will include it
in future work.
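The effect of independent parallelism on response time can be sketched with a small recursive estimate. The model below is a simplification we assume for illustration: an operator starts only after all of its inputs have finished (no pipelining), independent subtrees run fully concurrently on different sites, and data transfer time is ignored. `PlanNode` and `response_time` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    name: str
    own_time: float                      # time spent in this operator alone
    children: List["PlanNode"] = field(default_factory=list)

def response_time(node: PlanNode, parallel: bool = True) -> float:
    """Elapsed time until this node finishes."""
    child_times = [response_time(child, parallel) for child in node.children]
    if not child_times:
        return node.own_time
    # Independent subtrees overlap when run in parallel, add up otherwise.
    waiting = max(child_times) if parallel else sum(child_times)
    return node.own_time + waiting

# A bushy plan whose two lower joins do not depend on each other.
plan = PlanNode("join_top", 5, [PlanNode("join_left", 10),
                                PlanNode("join_right", 20)])
```

Here `response_time(plan)` is 25 time units, against 35 for `response_time(plan, parallel=False)`: the shorter of the two independent joins is hidden behind the longer one.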
3 Further Details of Proposed Work
As mentioned earlier, it is necessary to consider
run-time conditions in the optimization of a
database federation. However, to the best of our
knowledge, there is so far no efficient and
easy-to-implement approach that lets the
optimizer take run-time conditions into account.
Hence we propose the cluster-and-conquer approach,
which was introduced informally earlier in this
proposal. We now revisit this approach in order to
show that it produces reasonably good plans.
3.1 Data Structure and Query Optimization
We now introduce the following data structures
to help explain how this approach works.
Figure 2(a) shows a clustered view of a data
federation. The global mediator only needs to
decide inter-cluster operations. Here we
implicitly fix the join ordering so that
intra-cluster operations are executed first, i.e.
plan trees that join two leaves in distinct
clusters are eliminated. This restriction
decreases the plan space explored at compile time.
Since only a subset of plans is fully explored at
run-time optimization, one might expect this
approach to produce much worse plans than an
exhaustive algorithm. However, notice that the
clustering is based on several essential
properties of a database federation: there exist
enterprise boundaries, which forbid moving data to
other enterprises’ sites; and in a global database
federation, data transfer over long-haul paths is
very costly, while data transfer within a LAN is
cheap. Hence joining base tables across distinct
clusters is either infeasible or prohibitively
expensive. Moreover, our approach releases the
global mediator from the cumbersome work of
collecting or estimating all sites’ run-time
parameters. We will run experiments to verify that
our approach performs well under the cost model
defined previously.
Figure 2. Data structures: (a) clustered view; (b) global execution tree; (c) operation tree
Figure 2(b) shows an execution plan tree
produced by the global mediator. The inter-cluster
join is determined, while intra-cluster joins are
delegated to the cluster mediators. Subsequently
the executor assigns the sub-plans: in our
example, the left sub-tree in Figure 2(b) goes to
Cluster 1 and the right sub-tree to Cluster 2.
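The hand-off from the global mediator to the cluster mediators can be sketched as a simple grouping step. The sketch assumes a hypothetical catalog that maps each table to its cluster; `TABLE_CLUSTER`, `split_by_cluster` and the assignment of TPC-H table names to clusters are illustrative and not part of the proposed system. It only forms the per-cluster fragments and leaves the join order inside each fragment to the cluster mediator.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical catalog known to the global mediator: table -> cluster.
TABLE_CLUSTER: Dict[str, str] = {
    "customer": "cluster1",
    "orders": "cluster1",
    "lineitem": "cluster2",
    "part": "cluster2",
}

def split_by_cluster(tables: List[str]) -> Dict[str, List[str]]:
    """Group the tables referenced by a query into per-cluster fragments."""
    fragments: Dict[str, List[str]] = defaultdict(list)
    for table in tables:
        fragments[TABLE_CLUSTER[table]].append(table)
    return dict(fragments)

fragments = split_by_cluster(["customer", "lineitem", "orders"])
```

Each value in `fragments` corresponds to one sub-tree that the executor would ship to a cluster, and the joins between fragments are the inter-cluster operations fixed by the global mediator.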
Figure 2(c) is an operation tree [16] produced by
the cluster mediator of Cluster 1 for the left
sub-tree in Figure 2(b). Here each node represents
a physical operator, annotated with the site where
the operator is performed, so the operation tree
also makes the flow of data transfer explicit. In
principle every operator except scan(), which can
only be executed at the original location of the
relation, can be executed at any site in the
cluster. This fine-grained operator scheduling is
desirable since it implies lower resource
requirements and allows potentially better load
balancing. In this example, every operator other
than scan() carries an explicit site assignment.
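One possible in-memory shape for such an operation tree is sketched below. The class and function names, the operator kinds and the site labels are assumptions made for illustration; the sketch only shows how site annotations expose data transfers and how the scan() placement rule can be checked.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Operator:
    kind: str                            # e.g. "scan" or "hash_join"
    site: str                            # site where the operator runs
    relation: str = ""                   # set only for scans
    inputs: List["Operator"] = field(default_factory=list)

def transfers(op: Operator) -> List[Tuple[str, str]]:
    """List (from_site, to_site) for every edge that crosses sites."""
    result: List[Tuple[str, str]] = []
    for child in op.inputs:
        result.extend(transfers(child))
        if child.site != op.site:
            result.append((child.site, op.site))
    return result

def scans_are_local(op: Operator, home: Dict[str, str]) -> bool:
    """Check that every scan runs at the home site of its relation."""
    if op.kind == "scan" and home.get(op.relation) != op.site:
        return False
    return all(scans_are_local(child, home) for child in op.inputs)

home_sites = {"R": "site1", "S": "site2"}
tree = Operator("hash_join", "site1", inputs=[
    Operator("scan", "site1", "R"),
    Operator("scan", "site2", "S"),
])
```

In this example the only transfer is the output of the scan on S moving from site2 to site1; placing the join at site2 instead would move R, which is the kind of choice the cluster mediator weighs.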
As mentioned before, the cluster mediator is
responsible for collecting the run-time
parameters of the data sources in its cluster,
which are unknown at compile time. In our work, we
consider the following run-time parameters:
• Available buffer size: the number of buffer
pages allocated to an operator. This
parameter determines the number of runs
in a hash join and a sort-merge join.
• CPU utilization: this parameter determines
the possible speedup of operation
execution [15].
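To make the buffer-size dependence concrete, the sketch below uses the textbook external merge sort model, which we assume here as a stand-in for the sort phase of a sort-merge join: with B buffer pages, a relation of N pages is first cut into ceil(N / B) sorted runs, and each merge pass then combines up to B - 1 runs. The function name is hypothetical.

```python
from typing import Tuple

def sort_runs_and_passes(num_pages: int, buffer_pages: int) -> Tuple[int, int]:
    """Return (initial sorted runs, merge passes) for an external merge sort."""
    if buffer_pages < 3:
        raise ValueError("merging needs at least 3 buffer pages")
    if num_pages <= 0:
        return 0, 0
    # Pass 0: fill the buffer, sort it, write one run; repeat.
    initial_runs = (num_pages + buffer_pages - 1) // buffer_pages
    runs = initial_runs
    merge_passes = 0
    fan_in = buffer_pages - 1            # one page is reserved for output
    while runs > 1:
        runs = (runs + fan_in - 1) // fan_in
        merge_passes += 1
    return initial_runs, merge_passes
```

For a 108-page relation, 5 buffer pages give 22 initial runs and 3 merge passes, while 50 buffer pages give 3 runs and a single merge pass, which is why a plan chosen for one buffer size can be a poor choice for another.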
The cluster mediator’s choice of operations and
site scheduling is sensitive to the values of these
parameters. Having the cluster mediator handle
the intra-cluster scheduling autonomously has
three benefits. First, communication within a
cluster is fast, so the run-time parameter values
collected by the cluster mediator are much fresher
than those gathered by the global mediator would
be. Second, each cluster mediator can process its
own sub-plan concurrently, which implicitly
exploits independent parallelism. Third, the
complexity of centrally optimizing a whole query
plan in a distributed environment is greatly
decreased, since each cluster mediator handles
only its own, less complex sub-plan.
3.2 Experimentation Design
We plan to perform experiments on a primitive
database federation system that we built
ourselves. One goal of these experiments is to
motivate the need for considering run-time
parameters and to understand the trade-offs
involved in the optimization process. Moreover, we
will implement the proposed cluster-and-conquer
approach, and then verify it by running test cases
and analyzing the results with our cost model. We
also want to implement a simple version of the
exhaustive algorithm and the two-phase algorithm
[15], in order to compare our approach with them.
For the query workload, we will use queries from
the TPC-H benchmark. Since we want to concentrate
only on join ordering and scheduling, we will
modify those queries accordingly. The network
will be simulated using the message cost model
introduced in [12]: a data set of size n bytes
takes 𝛼 + 𝛽 ∗ 𝑛 to reach the other end, where 𝛼
is the start-up cost and 𝛽 is the transfer cost per
byte. By setting these cost parameters we can
simulate a local area network as well as a wide
area network.
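The message cost model translates directly into code. In the sketch below, the function name and the LAN and WAN parameter values are placeholders chosen for illustration, not measured values or settings taken from [12].

```python
def transfer_time(n_bytes: int, alpha: float, beta: float) -> float:
    """Time for n_bytes to reach the other end: alpha + beta * n."""
    if n_bytes < 0:
        raise ValueError("size must be non-negative")
    return alpha + beta * n_bytes

# Hypothetical settings (seconds, seconds per byte).
LAN = {"alpha": 0.0005, "beta": 1e-8}
WAN = {"alpha": 0.1, "beta": 1e-6}

size = 10_000_000                        # a 10 MB intermediate result
lan_time = transfer_time(size, **LAN)
wan_time = transfer_time(size, **WAN)
```

With these placeholder settings the same intermediate result is far more expensive to ship over the wide-area path, which is the asymmetry the clustering is meant to exploit.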
4 Evaluation
To evaluate the correctness and efficiency of the
proposed approach, we will first validate the
cluster-and-conquer approach theoretically by
analyzing its algorithm and comparing it with
existing related work. In addition, we will
check the implementation of the approach,
including whether the simulations are sound and
whether the primitive database federation works
correctly. Finally we will revisit the design of
the experiments and analyze their results.
5 Schedule
• September – October 2008: clarify the
research ideas and present the proposal.
• November 2008: polish the prototype of
the database federation and perform
motivation experiments.
• December 2008 – January 2009:
implement the proposed approach.
• January – February 2009: test the approach
and do experimental study.
• February – April 2009: revise and
complete the thesis.
References
[1] L. M. Haas, E. T. Lin, M. A. Roth. Data
Integration through Database Federation. IBM
Systems Journal, Vol. 41, No. 4, 2002.
[2] I. Manolescu, L. Bouganim, F. Fabret, E.
Simon. Efficient Querying of Distributed
Resources in Mediator Systems.
CoopIS/DOA/ODBASE 2002, pp. 468-485, 2002.
[3] X. Wang, R. Burns, A. Terzis, A. Deshpande.
Network-Aware Join Processing in Global Scale
Database Federations. ICDE 2008.
[4] J. Blakeley, C. Cunningham, N. Ellis, B.
Rathakrishnan, M. Wu. Distributed/Heterogeneous
Query Processing in Microsoft SQL Server.
ICDE 2005.
[5] J. Madhavan, S. R. Jeffery, S. Cohen, X.
Dong, D. Ko, C. Yu, A. Halevy. Web-scale Data
Integration: You can only afford to Pay As You
Go. CIDR 2007.
[6] J. D. Ullman. Information Integration Using
Logical Views. Proceedings of the 6th
International Conference on Database Theory,
1997.
[7] Y. Papakonstantinou, A. Gupta, L. Haas.
Capabilities-Based Query Rewriting in Mediator
Systems. Distributed and Parallel Databases,
Volume 6, Issue 1, 1998.
[8] M. T. Roth, F. Ozcan, L. Haas. Cost Models
Do Matter: Providing Cost Information for
Diverse Data Sources in a Federated System.
VLDB 1999.
[9] L. F. Mackert, G. M. Lohman. R* Optimizer
Validation and Performance Evaluation for
Distributed Queries. VLDB 1986.
[10] M. Stonebraker, P. M. Aoki, W. Litwin, A.
Pfeffer, A. Sah, J. Sidell, Carl Staelin, A. Yu.
Mariposa: a wide-area Distributed Database
System. The VLDB Journal, 1996.
[11] Z. G. Ives, A. Y. Halevy, D. S. Weld.
Adapting to Source Properties in Processing
Data Integration Queries. SIGMOD 2004.
[12] A. Deshpande, J. M. Hellerstein. Decoupled
Query Optimization for Federated Database
Systems. ICDE 2002.
[13] PostgreSQL documentation on Server
Configuration,
http://www.postgresql.org/docs/8.3/static/runtime-config-query.html
[14] Y. E. Ioannidis, R. T. Ng, K. Shim, T. K.
Sellis. Parametric Query Optimization. VLDB
1992.
[15] W. Hong, M. Stonebraker. Optimization of
Parallel Query Execution Plans in XPRS. In
Proc. of the 1st International PDIS Conference,
1991.
[16] M. N. Garofalakis, Y. E. Ioannidis. Parallel
Query Scheduling and Optimization with Time-
and Space-Shared Resources. VLDB 1997.