OVERVIEW OF QUERY PROCESSING

Structure
6.0 Objectives
6.1 Introduction
6.2 Query Processing Problem
6.3 Objectives of Query Processing
6.4 Characterization of Query Processors
6.5 Layers of Query Processing
6.5.1 Query Decomposition
6.5.2 Data Localization
6.5.3 Global Query Optimization
6.5.4 Local Query Optimization
6.6 Summary
6.0 Objectives: This unit gives an overview of query processing in Distributed
Database Management Systems (DDBMSs). The discussion uses relational calculus and
relational algebra because of their generality and wide use in DDBMSs. In this
unit we discuss:

Various problems of query processing

About an ideal Query Processor

The concept of layering in query processing

Some related examples of query processing
6.1 Introduction: The increasing success of relational database technology in data
processing is due, in part, to the availability of nonprocedural languages, which
can significantly improve application development and end-user productivity. By
hiding the low-level details about the physical organization of the data, relational
database languages allow complex queries to be expressed in a concise and simple
fashion. In particular, the user does not specify the exact procedure to follow to
construct the answer to a query. This procedure is devised by a DBMS module
called the query processor. This relieves the user of query optimization, a
time-consuming task that is best handled by the query processor.
Query processing is an important issue in both centralized and distributed
systems. However, the query processing problem is much more difficult in
distributed environments than in centralized ones. In particular, the relations
involved in a distributed query may be fragmented and/or replicated, thereby
inducing communication overhead costs.
This unit discusses the issues of query processing, the characteristics of an
ideal query processor for a distributed environment and, finally, a layered software
approach to distributed query processing.
6.2 Query Processing Problem:
The main function of a relational query processor is to transform a high-level query (in
relational calculus) into an equivalent lower-level query (in relational algebra).
Distributed database design is of major importance for query processing, since
fragments are defined with the objective of increasing reference locality, and
sometimes parallel execution, for the most important queries. The role of a distributed
query processor is to map a high-level query on a distributed database (a set of global
relations) into a sequence of database operations (of relational algebra) on relation
fragments. Several important functions characterize this mapping:

The calculus query must be decomposed into a sequence of relational
operations called an algebraic query

The data accessed by the query must be localized so that the operations on
relations are translated to bear on local data (fragments)

The algebraic query on fragments must be extended with communication
operations and optimized with respect to a cost function to be minimized. This
cost function refers to computing resources such as disk I/Os, CPUs, and
communication networks.
The low-level query actually implements the execution strategy for the query. The
transformation must achieve both correctness and efficiency. The well-defined
mapping described above makes correctness straightforward to achieve, but
producing an efficient execution strategy is more complex. A relational
calculus query may have many equivalent and correct transformations into relational
algebra. Since each equivalent execution strategy can consume a different amount
of computing resources, the main problem is to select the execution strategy that
minimizes resource consumption.
Example: We consider the following subset of the engineering database schema given in
Fig. 6.0: E (ENO, ENAME, TITLE) and G (ENO, JNO, RESP, DUR), and the simple
user query: "Find the names of employees who are managing a project".
E
ENO   ENAME   TITLE
E1    A       Elect. Eng.
E2    B       Syst. Anal.
E3    C       Mech. Eng.
E4    D       Programmer
E5    E       Syst. Anal.
E6    F       Elect. Eng.
E7    G       Mech. Eng.
E8    H       Syst. Anal.

G
ENO   JNO   RESP         DUR
E1    J1    Manager      12
E2    J1    Analyst      24
E2    J2    Analyst      6
E3    J3    Consultant   10
E3    J4    Engineer     48
E4    J2    Programmer   18
E5    J2    Manager      24
E6    J4    Manager      48
E7    J3    Engineer     36
E8    J3    Manager      40

J
JNO   JNAME               BUDGET   LOC
J1    Instrumentation     150000   Montreal
J2    Database Develop.   135000   New York
J3    CAD/CAM             250000   New York
J4    Maintenance         310000   Paris

S
TITLE         SAL
Elect. Eng.   40000
Syst. Anal.   34000
Mech. Eng.    27000
Programmer    24000
Figure 6.0 Example Database

The equivalent relational calculus query, using SQL syntax, is:
SELECT ENAME
FROM   E, G
WHERE  E.ENO = G.ENO
AND    RESP = "Manager"
Two equivalent relational algebra queries that are correct transformations of the
above query are:
PJ ENAME (SL RESP = "Manager" AND E.ENO = G.ENO (E CP G))
and
PJ ENAME (E JN ENO (SL RESP = "Manager" (G)))
NOTE: The following observations can be made from the above example:

The second query avoids the Cartesian product (CP) of E and G, consumes far
fewer computing resources than the first, and thus should be retained. In
general, a Cartesian product on full relations should be avoided.

In a centralized environment, the role of the query processor is to choose the
best relational algebra query for a given query among all equivalent ones.

In a distributed environment, relational algebra is not enough to express
execution strategies. It must be supplemented with operations for exchanging data
between sites. Besides choosing the ordering of relational algebra operations,
the distributed query processor has to select the best sites to process the data
and, possibly, the way in which the data should be transformed.
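The first observation can be checked on the example data with a short sketch. The Python below is only an illustration under stated assumptions: the function names plan_cartesian and plan_select_first are hypothetical, relations are plain lists of tuples, and the "size" reported is simply the number of tuples in the largest intermediate relation each plan builds before the projection.

```python
# Relations E and G from Figure 6.0, as lists of tuples.
E = [
    ("E1", "A", "Elect. Eng."), ("E2", "B", "Syst. Anal."),
    ("E3", "C", "Mech. Eng."), ("E4", "D", "Programmer"),
    ("E5", "E", "Syst. Anal."), ("E6", "F", "Elect. Eng."),
    ("E7", "G", "Mech. Eng."), ("E8", "H", "Syst. Anal."),
]
G = [
    ("E1", "J1", "Manager", 12), ("E2", "J1", "Analyst", 24),
    ("E2", "J2", "Analyst", 6), ("E3", "J3", "Consultant", 10),
    ("E3", "J4", "Engineer", 48), ("E4", "J2", "Programmer", 18),
    ("E5", "J2", "Manager", 24), ("E6", "J4", "Manager", 48),
    ("E7", "J3", "Engineer", 36), ("E8", "J3", "Manager", 40),
]


def plan_cartesian(e_rel, g_rel):
    """PJ ENAME (SL RESP = "Manager" AND E.ENO = G.ENO (E CP G))."""
    product = [(e, g) for e in e_rel for g in g_rel]            # E CP G
    selected = [(e, g) for e, g in product
                if g[2] == "Manager" and e[0] == g[0]]          # SL
    names = sorted({e[1] for e, _ in selected})                 # PJ ENAME
    return names, len(product)


def plan_select_first(e_rel, g_rel):
    """PJ ENAME (E JN ENO (SL RESP = "Manager" (G)))."""
    managers = [g for g in g_rel if g[2] == "Manager"]          # SL on G only
    joined = [(e, g) for g in managers for e in e_rel
              if e[0] == g[0]]                                  # JN on ENO
    names = sorted({e[1] for e, _ in joined})                   # PJ ENAME
    return names, max(len(managers), len(joined))


names_cp, size_cp = plan_cartesian(E, G)
names_sf, size_sf = plan_select_first(E, G)
print(names_cp, size_cp)
print(names_sf, size_sf)
```

Both plans return the same names, but the first one materializes every pair of E and G tuples, while the second never holds more tuples than there are managers.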
Example: This example illustrates the importance of site selection and
communication for a chosen relational algebra query against a fragmented database.
We consider the following query:
PJ ENAME (E JN ENO (SL RESP = "Manager" (G)))
This query is written considering the relations of the previous example. We assume
that the relations E and G are horizontally fragmented as follows:
E1 = SL ENO ≤ "E3" (E)
E2 = SL ENO > "E3" (E)
G1 = SL ENO ≤ "E3" (G)
G2 = SL ENO > "E3" (G)
Fragments G1, G2, E1 and E2 are stored at sites 1, 2, 3, and 4, respectively, and the
result is expected at site 5, as shown in Fig. 6.1. For simplicity, we have ignored
the projection operation here. The figure shows two equivalent strategies for the
above query.
Some observations on the strategies:
• An arrow from site i to site j labeled with R indicates that relation R is
transferred from site i to site j.
• Strategy A exploits the fact that relations E and G are fragmented in the same
way in order to perform the select and join operations in parallel.
• Strategy B centralizes all the operations and the data at the result site before
processing the query.
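The two strategies can be simulated on the example data to confirm that they are equivalent. This is a minimal sketch, not an implementation of a distributed system: sites are represented only by comments, data transfer is just passing lists between functions, and the helper names select_managers and join_eno are hypothetical.

```python
E = [
    ("E1", "A", "Elect. Eng."), ("E2", "B", "Syst. Anal."),
    ("E3", "C", "Mech. Eng."), ("E4", "D", "Programmer"),
    ("E5", "E", "Syst. Anal."), ("E6", "F", "Elect. Eng."),
    ("E7", "G", "Mech. Eng."), ("E8", "H", "Syst. Anal."),
]
G = [
    ("E1", "J1", "Manager", 12), ("E2", "J1", "Analyst", 24),
    ("E2", "J2", "Analyst", 6), ("E3", "J3", "Consultant", 10),
    ("E3", "J4", "Engineer", 48), ("E4", "J2", "Programmer", 18),
    ("E5", "J2", "Manager", 24), ("E6", "J4", "Manager", 48),
    ("E7", "J3", "Engineer", 36), ("E8", "J3", "Manager", 40),
]

# Horizontal fragmentation on ENO, as in the text.
E1 = [e for e in E if e[0] <= "E3"]   # stored at site 3
E2 = [e for e in E if e[0] > "E3"]    # stored at site 4
G1 = [g for g in G if g[0] <= "E3"]   # stored at site 1
G2 = [g for g in G if g[0] > "E3"]    # stored at site 2


def select_managers(g_frag):
    """SL RESP = "Manager" applied to a fragment of G."""
    return [g for g in g_frag if g[2] == "Manager"]


def join_eno(e_frag, g_frag):
    """Join on ENO; the duplicate ENO column of G is dropped."""
    return [e + g[1:] for e in e_frag for g in g_frag if e[0] == g[0]]


def strategy_a():
    g1_sel = select_managers(G1)      # at site 1, then sent to site 3
    g2_sel = select_managers(G2)      # at site 2, then sent to site 4
    e1_res = join_eno(E1, g1_sel)     # at site 3, then sent to site 5
    e2_res = join_eno(E2, g2_sel)     # at site 4, then sent to site 5
    return e1_res + e2_res            # union at site 5


def strategy_b():
    # All four fragments are shipped to site 5 and processed there.
    return join_eno(E1 + E2, select_managers(G1 + G2))


result_a = sorted(strategy_a())
result_b = sorted(strategy_b())
print(result_a == result_b, [t[0] for t in result_a])
```

The union in Strategy A is a plain list concatenation because the fragments are disjoint on ENO, so no duplicates can appear.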
Resource consumption of these two strategies:
Assumptions made:
1. A tuple access, denoted tupacc, costs 1 unit.
2. A tuple transfer, denoted tuptrans, costs 10 units.
3. Relations E and G have 400 and 1000 tuples, respectively.
4. There are 20 managers in relation G.
5. The data is uniformly distributed among sites.
6. Relations G and E are locally clustered on attributes RESP and ENO, respectively.
7. There is direct access to tuples of G (respectively, E) based
on the value of attribute RESP (respectively, ENO).
The Cost Analysis:
The cost of strategy A can be derived as follows:
1. Produce G' by selecting G requires 20 * tupacc               = 20
2. Transfer G' to the sites of E requires 20 * tuptrans         = 200
3. Produce E' by joining G' and E requires (10*10) * tupacc * 2 = 200
4. Transfer E' to result site requires 20 * tuptrans            = 200
The total cost                                                  = 620
The cost of strategy B can be derived as follows:
1. Transfer E to site 5 requires 400 * tuptrans      = 4000
2. Transfer G to site 5 requires 1000 * tuptrans     = 10000
3. Produce G' by selecting G requires 1000 * tupacc  = 1000
4. Join E and G' requires 400 * 20 * tupacc          = 8000
The total cost                                       = 23000
Strategy A is better by a factor of 37 (23000 / 620), which is quite significant. It
also distributes the work better among the sites. The difference would be even larger
if we assumed slower communication and/or a higher degree of fragmentation.
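The arithmetic of the cost analysis can be reproduced with a few lines of Python. The sketch below only restates the formulas and assumptions given in the text; the constant and function names are hypothetical.

```python
TUPACC = 1        # cost of one tuple access
TUPTRANS = 10     # cost of one tuple transfer
CARD_E = 400      # tuples in E
CARD_G = 1000     # tuples in G
MANAGERS = 20     # tuples of G with RESP = "Manager"
FRAGMENTS = 2     # E and G are each split into two fragments


def cost_strategy_a():
    select_g = MANAGERS * TUPACC                               # step 1
    ship_g = MANAGERS * TUPTRANS                               # step 2
    per_site = MANAGERS // FRAGMENTS                           # 10 managers per site
    join = (per_site * per_site) * TUPACC * FRAGMENTS          # step 3
    ship_e = MANAGERS * TUPTRANS                               # step 4
    return select_g + ship_g + join + ship_e


def cost_strategy_b():
    ship_e = CARD_E * TUPTRANS                                 # step 1
    ship_g = CARD_G * TUPTRANS                                 # step 2
    select_g = CARD_G * TUPACC                                 # step 3
    join = CARD_E * MANAGERS * TUPACC                          # step 4
    return ship_e + ship_g + select_g + join


cost_a = cost_strategy_a()
cost_b = cost_strategy_b()
ratio = cost_b / cost_a
print(cost_a, cost_b, round(ratio, 1))
```

Raising TUPTRANS (slower communication) increases the cost of Strategy B much faster than that of Strategy A, because Strategy B ships 1400 tuples while Strategy A ships only 40.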
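(a) Strategy A
Site 1: G'1 = SL RESP = "Manager" (G1); G'1 is sent to site 3
Site 2: G'2 = SL RESP = "Manager" (G2); G'2 is sent to site 4
Site 3: E'1 = E1 JN ENO G'1; E'1 is sent to site 5
Site 4: E'2 = E2 JN ENO G'2; E'2 is sent to site 5
Site 5: Result = E'1 UN E'2

(b) Strategy B
Site 1 sends G1, site 2 sends G2, site 3 sends E1, and site 4 sends E2 to site 5
Site 5: Result = (E1 UN E2) JN ENO SL RESP = "Manager" (G1 UN G2)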
Fig.6.1 Equivalent Distributed Execution Strategies
6.3 Objectives of Query Processing:
• The main objective of query processing in a distributed environment is to transform
a high-level query on a distributed database, which users see as a single database,
into an efficient execution strategy expressed in a low-level
language on local databases.
• An important aspect of query processing is query optimization. Because many
execution strategies are correct transformations of the same high-level query, the
one that optimizes (minimizes) resource consumption should be retained.
• Good measures of resource consumption are:
o The total cost incurred in processing the query. It is the
sum of all times incurred in processing the operations of the query at
various sites and in inter-site communication.
o The response time of the query. This is the time elapsed in executing
the query. Since operations can be executed in parallel at different
sites, the response time of a query may be significantly less than its
total cost.
• The total cost should be minimized.
o In a distributed system, the total cost to be minimized includes CPU,
I/O, and communication costs. The CPU and I/O costs can be reduced by
limiting the number of I/O operations through fast access methods to
the data and efficient use of main memory. The communication cost is
the time needed for exchanging data between the sites participating in
the execution of the query. This cost is incurred in processing the
messages and in transmitting the data on the communication network. In
distributed systems with slow networks, the communication cost often
dominates the local processing cost, so the other cost factors are
sometimes ignored.
o In centralized systems, only CPU and I/O costs have to be considered.
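The difference between total cost and response time can be illustrated with Strategy A of Section 6.2. The sketch below is built on assumptions that go beyond the text: the two branches (sites 1 and 3, sites 2 and 4) are taken to run fully in parallel, each handles 10 of the 20 manager tuples, transfers on different links do not slow each other down, and the final union at site 5 is treated as free. The names branches, total_cost and response_time are hypothetical.

```python
# Each branch is a sequence of (step, time) pairs that must run one after another.
# Times use tupacc = 1 and tuptrans = 10, with 10 manager tuples per branch.
branches = {
    "sites 1 -> 3 -> 5": [
        ("select G1", 10 * 1),
        ("ship G'1", 10 * 10),
        ("join E1 with G'1", 10 * 10 * 1),
        ("ship E'1", 10 * 10),
    ],
    "sites 2 -> 4 -> 5": [
        ("select G2", 10 * 1),
        ("ship G'2", 10 * 10),
        ("join E2 with G'2", 10 * 10 * 1),
        ("ship E'2", 10 * 10),
    ],
}


def total_cost(plan):
    """Sum of the times of all steps, regardless of where they run."""
    return sum(time for steps in plan.values() for _, time in steps)


def response_time(plan):
    """Elapsed time when branches run in parallel: the slowest branch."""
    return max(sum(time for _, time in steps) for steps in plan.values())


cost = total_cost(branches)
elapsed = response_time(branches)
print(cost, elapsed)
```

Under these assumptions the response time is half the total cost, which is why a strategy that increases parallelism can be attractive even when it does not reduce the total cost.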
6.4 Characterization of Query Processors: It is difficult to give characteristics that
cleanly differentiate centralized and distributed query processors. Some of them are
listed here. The first four apply to both centralized and distributed query processors;
the last four are particular to distributed query processors.
o Languages: The input language to the query processor can be based on
relational calculus or relational algebra. The former requires an additional
phase to decompose a query expressed in relational calculus into relational
algebra. In a distributed context, the output language is generally some form of
relational algebra augmented with communication primitives. That is, the query
processor must map the input language correctly onto the output language.
o Types of optimization: Conceptually, query optimization aims at choosing the
point in the solution space that leads to the minimum cost. One approach is
exhaustive search, which predicts the cost of every strategy and selects the
cheapest. Because the solution space can be large, heuristic techniques are
often used to restrict it. In both centralized and distributed systems, a common
heuristic is to minimize the size of intermediate relations. This can be done by
performing unary operations first and ordering the binary operations by the
increasing size of their intermediate relations.
o Optimization Timing: A query may be optimized at different times relative to
the actual time of query execution. Optimization can be done statically, before
executing the query, or dynamically, as the query is executed. The main
advantage of the latter method is that the actual sizes of the intermediate
relations are available to the query processor, thereby minimizing the
probability of a bad choice. The main drawback of the dynamic method is that
query optimization, an expensive task, must be repeated for each
execution of the query. Hybrid optimization, which combines the two, may
therefore be better in some situations.
o Statistics: The effectiveness of the query optimization is based on statistics on
the database. Dynamic query optimization requires statistics in order to
choose the operation that has to be done first. Static query optimization
requires statistics to estimate the size of intermediate relations. The accuracy
of the statistics can be improved by periodical updating.
o Decision sites: Most systems use a centralized decision approach, in
which a single site generates the strategy. However, the decision process
could be distributed among the various sites participating in the elaboration of the
best strategy. The centralized approach is simpler but requires knowledge
of the complete distributed database, whereas the distributed approach
requires only local information. A hybrid approach, where the major
decisions are taken at one particular site and other decisions are taken locally,
is also common.
o Exploitation of the Network Topology: The distributed query processor
exploits the network topology. With wide area networks, the cost function to
be minimized can be restricted to the data communication cost, which is the
dominant factor. This assumption simplifies distributed query
optimization, which can then be treated as two separate problems: selection of the
global execution strategy, based on inter-site communication, and selection
of each local execution strategy, based on centralized query processing
algorithms. With local area networks, communication costs are comparable to
I/O costs. Therefore, it is reasonable for the distributed query processor to
increase parallel execution at the cost of increased communication.
o Exploitation of Replicated Fragments: For reliability purposes it is useful to
have fragments replicated at different sites. Query processors have to exploit
this information either statically or dynamically for processing the query
efficiently.
o Use of Semi-joins: The semi-join operation reduces the size of the data that
are exchanged between the sites so that the communication cost can be
reduced.
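The effect of a semi-join on communication can be sketched on the example relations of Section 6.2. This is an illustration under assumptions not made in the text: E is taken to be stored at one site and the selected manager tuples of G at another, the result is wanted at the site of G, and communication is measured simply as the number of attribute values shipped. The helper names semijoin and join are hypothetical.

```python
E = [  # assumed to be stored at site 1
    ("E1", "A", "Elect. Eng."), ("E2", "B", "Syst. Anal."),
    ("E3", "C", "Mech. Eng."), ("E4", "D", "Programmer"),
    ("E5", "E", "Syst. Anal."), ("E6", "F", "Elect. Eng."),
    ("E7", "G", "Mech. Eng."), ("E8", "H", "Syst. Anal."),
]
G_MANAGERS = [  # SL RESP = "Manager" (G), assumed to be stored at site 2
    ("E1", "J1", "Manager", 12), ("E5", "J2", "Manager", 24),
    ("E6", "J4", "Manager", 48), ("E8", "J3", "Manager", 40),
]


def semijoin(relation, join_values):
    """Keep only the tuples whose ENO (first column) appears in join_values."""
    return [t for t in relation if t[0] in join_values]


def join(left, right):
    """Join on ENO (first column of both relations)."""
    return [l + r[1:] for l in left for r in right if l[0] == r[0]]


# Plan 1: ship all of E to site 2 and join there.
shipped_full = sum(len(t) for t in E)
result_full = join(E, G_MANAGERS)

# Plan 2: ship only the ENO values of the managers to site 1,
# reduce E there with a semi-join, and ship the reduced E back.
eno_values = {g[0] for g in G_MANAGERS}
reduced_e = semijoin(E, eno_values)
shipped_semi = len(eno_values) + sum(len(t) for t in reduced_e)
result_semi = join(reduced_e, G_MANAGERS)

print(shipped_full, shipped_semi, result_full == result_semi)
```

The semi-join pays for an extra message (the ENO values) in order to avoid shipping tuples of E that cannot take part in the join; it is worthwhile only when that reduction outweighs the extra transfer.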
6.5 Layers of Query Processing:
The problem of query processing can itself be decomposed into several subproblems,
corresponding to various layers. Figure 6.2 shows a generic layering scheme for query
processing in which each layer solves a well-defined subproblem. The input is
a query on distributed data expressed in relational calculus. This distributed query is
posed on global (distributed) relations, meaning that data distribution is hidden. Four
main layers are involved in mapping the distributed query into an optimized sequence of
local operations, each acting on a local database. These layers perform the functions
of query decomposition, data localization, global query optimization, and local query
optimization. The first three layers are performed by a central site and use global
information; the fourth is performed by the local sites.
CALCULUS QUERY ON DISTRIBUTED RELATIONS
        |
QUERY DECOMPOSITION   <--  GLOBAL SCHEMA             (control site)
        |
ALGEBRAIC QUERY ON DISTRIBUTED RELATIONS
        |
DATA LOCALIZATION     <--  FRAGMENT SCHEMA           (control site)
        |
FRAGMENT QUERY
        |
GLOBAL OPTIMIZATION   <--  STATISTICS ON FRAGMENTS   (control site)
        |
OPTIMIZED FRAGMENT QUERY WITH COMMUNICATION OPERATIONS
        |
LOCAL OPTIMIZATION    <--  LOCAL SCHEMA              (local sites)
        |
OPTIMIZED LOCAL QUERIES
Figure 6.2 Generic Layering Scheme for Distributed Query Processing
6.5.1 Query Decomposition: The first layer decomposes the distributed calculus
query into an algebraic query on global relations. The information needed for this
transformation is found in the global conceptual schema describing the global
relations. However, the information about data distribution is not used here but in the
next layer. Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps:
o The calculus query is rewritten in a normalized form that is suitable for
subsequent manipulation. Normalization of a query generally involves the
manipulation of the query quantifiers and of the query qualification by
applying logical operator priority.
o The normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect
queries exist only for a subset of relational calculus. Typically, they use some
sort of graph that captures the semantics of the query.
o The correct query (still expressed in relational calculus) is simplified. One way
to simplify a query is to eliminate redundant predicates.
o The calculus query is restructured as an algebraic query. The quality of an
algebraic query is defined in terms of expected performance. The traditional
way to do this transformation toward a "better" algebraic specification is to
start with an initial algebraic query and transform it in order to find a "good"
one. The initial algebraic query is derived immediately from the calculus
query by translating the predicates and the target statement into relational
operations as they appear in the query. This directly translated algebra query is
then restructured through transformation rules. The algebraic query generated
by this layer is good in the sense that the worst executions are avoided.
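The last step, restructuring through transformation rules, can be sketched on the query of Section 6.2. The Python below is a minimal illustration under assumptions: the tiny algebra tree (classes Rel, Select and Join) and the function push_selection are hypothetical, and only one rule is implemented, namely that a selection on attributes of a single operand can be moved below a join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    name: str
    attrs: frozenset


@dataclass(frozen=True)
class Select:
    attr: str       # attribute tested by the predicate attr = value
    value: str
    child: object


@dataclass(frozen=True)
class Join:
    attr: str       # join attribute
    left: object
    right: object


def attrs(node):
    """Attributes available in the output of an algebra node."""
    if isinstance(node, Rel):
        return node.attrs
    if isinstance(node, Select):
        return attrs(node.child)
    return attrs(node.left) | attrs(node.right)


def push_selection(node):
    """Rule: SL p (R JN S) = (SL p (R)) JN S when p uses only attributes of R."""
    if isinstance(node, Select) and isinstance(node.child, Join):
        j = node.child
        in_left = node.attr in attrs(j.left)
        in_right = node.attr in attrs(j.right)
        if in_left and not in_right:
            pushed = push_selection(Select(node.attr, node.value, j.left))
            return Join(j.attr, pushed, j.right)
        if in_right and not in_left:
            pushed = push_selection(Select(node.attr, node.value, j.right))
            return Join(j.attr, j.left, pushed)
    return node


E = Rel("E", frozenset({"ENO", "ENAME", "TITLE"}))
G = Rel("G", frozenset({"ENO", "JNO", "RESP", "DUR"}))

# Direct translation of the calculus query: select on top of the join.
initial = Select("RESP", "Manager", Join("ENO", E, G))
restructured = push_selection(initial)
print(restructured)
```

The restructured tree corresponds to E JN ENO (SL RESP = "Manager" (G)), the second of the two algebra queries shown earlier. A selection on ENO would be left in place by this sketch, since ENO appears in both operands.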
6.5.2 Data Localization: The input to the second layer is an algebraic query on
distributed relations. The main role of the second layer is to localize the query's data
using data distribution information. Relations are fragmented and stored in disjoint
subsets called fragments, each being stored at a different site. This layer determines
which fragments are involved in the query and transforms the distributed query into a
fragment query. Fragmentation is defined through fragmentation rules that can be
expressed as relational operations. A distributed relation can be reconstructed by
applying the fragmentation rules and deriving a program of relational algebra
operations, called a localization program, which acts on fragments.
Generating a fragment query is done in two steps.
o The distributed query is mapped into a fragment query by substituting each
distributed relation by its reconstruction program (also called a materialization
program).
o The fragment query is simplified and restructured to produce another "good"
query. Simplification and restructuring may be done according to the same
rules used in the decomposition layer. As in the decomposition layer, the final
fragment query is generally far from optimal because quantitative information
regarding fragments, such as their cardinalities, is not utilized.
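The two steps can be sketched for relation E of Section 6.2, which is fragmented into E1 (ENO ≤ "E3") and E2 (ENO > "E3"). The code below is an illustration under assumptions: fragmentation rules are represented as Python predicates on ENO, the localization program is written out as a string, and the function names localization_program and relevant_fragments are hypothetical.

```python
# Fragmentation rules for E: fragment name -> predicate on ENO.
FRAGMENTS_E = {
    "E1": lambda eno: eno <= "E3",
    "E2": lambda eno: eno > "E3",
}


def localization_program(fragments):
    """Step 1: replace the distributed relation by the union of its fragments."""
    return " UN ".join(sorted(fragments))


def relevant_fragments(fragments, eno_value):
    """Step 2 (simplification): for a selection ENO = eno_value, keep only the
    fragments whose fragmentation predicate can be satisfied by that value."""
    return [name for name in sorted(fragments) if fragments[name](eno_value)]


program = localization_program(FRAGMENTS_E)
needed_for_e5 = relevant_fragments(FRAGMENTS_E, "E5")
needed_for_e3 = relevant_fragments(FRAGMENTS_E, "E3")
print(program, needed_for_e5, needed_for_e3)
```

For a query such as SL ENO = "E5" (E), substituting the localization program gives SL ENO = "E5" (E1 UN E2), and the simplification step reduces it to SL ENO = "E5" (E2), so the site holding E1 does not need to be contacted.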
6.5.3 Global Query Optimization: The input to the third layer is a fragment query,
that is, an algebraic query on fragments. The goal of query optimization is to find an
execution strategy for the query that is close to optimal. An execution strategy for a
distributed query can be described with relational algebra operations and
communication primitives (send/receive operations) for transferring data between
sites. The previous layers have already optimized the query, for example by
eliminating redundant expressions. However, this optimization is independent of
fragment characteristics such as cardinalities. In addition, communication operations
are not yet specified. By permuting the ordering of operations within one fragment
query, many equivalent queries may be found.
Query optimization consists of finding the "best" ordering of operations in the
fragment query, including communication operations, that minimizes a cost function.
The cost function, often defined in terms of time units, refers to computing resources
such as disk space, disk I/Os, buffer space, CPU cost, communication cost and so on.
An important aspect of query optimization is join ordering, since permutations of the
joins within the query may lead to improvements of orders of magnitude. One basic
technique for optimizing a sequence of distributed join operations is through the
semi-join operator. The main value of the semi-join in a distributed system is to reduce
the size of the join operands and thus the communication cost. The output of the query
optimization layer is an optimized algebraic query on fragments with communication
operations included.
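The impact of join ordering can be illustrated with a very small cost model. Everything numeric below is an assumption made for the sketch, except the cardinalities of E and G taken from Section 6.2: the cardinality of J, the join selectivities, and the cost function (the sum of the sizes of the intermediate results of a left-to-right join sequence) are hypothetical, and communication is not modeled.

```python
from fractions import Fraction
from itertools import permutations

CARD = {"E": 400, "G": 1000, "J": 100}          # J's cardinality is assumed
SELECTIVITY = {                                  # assumed join selectivities
    frozenset({"E", "G"}): Fraction(1, 400),     # E JN ENO G
    frozenset({"G", "J"}): Fraction(1, 100),     # G JN JNO J
}                                                # E and J share no join attribute


def plan_cost(order):
    """Sum of intermediate result sizes when joining relations left to right."""
    joined = [order[0]]
    size = Fraction(CARD[order[0]])
    cost = Fraction(0)
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            # A missing entry means no join predicate: a Cartesian product.
            size *= SELECTIVITY.get(frozenset({prev, rel}), 1)
        joined.append(rel)
        cost += size
    return cost


costs = {order: plan_cost(order) for order in permutations(CARD)}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print(best, costs[best])
print(worst, costs[worst])
```

With these figures, the orders that start by combining E and J build a 40000-tuple Cartesian product first and cost about twenty times more than the others. Enumerating all permutations is feasible only for a handful of relations, which is one reason optimizers restrict the search.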
6.5.4 Local Query Optimization: The last layer is performed by all the sites having
fragments involved in the query. Each sub-query executing at one site, called a local
query, is then optimized using the local schema of the site. At this time, the
algorithms to perform the relational operations may be chosen. Local optimization
uses the algorithms of centralized systems.
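As an illustration of choosing an algorithm for a relational operation at a local site, the sketch below compares two standard centralized join algorithms on the data that site 4 handles in Strategy A of Section 6.2. The text does not say which algorithms a local site would use; nested-loop and hash join are chosen here only as familiar examples, and the function names and the work counters are hypothetical.

```python
E2 = [  # fragment of E stored at site 4 (ENO > "E3")
    ("E4", "D", "Programmer"), ("E5", "E", "Syst. Anal."),
    ("E6", "F", "Elect. Eng."), ("E7", "G", "Mech. Eng."),
    ("E8", "H", "Syst. Anal."),
]
G2_MANAGERS = [  # selected fragment G'2 received from site 2
    ("E5", "J2", "Manager", 24), ("E6", "J4", "Manager", 48),
    ("E8", "J3", "Manager", 40),
]


def nested_loop_join(left, right):
    """Compare every pair of tuples on ENO (first column)."""
    comparisons = 0
    out = []
    for l in left:
        for r in right:
            comparisons += 1
            if l[0] == r[0]:
                out.append(l + r[1:])
    return out, comparisons


def hash_join(left, right):
    """Build a hash table on ENO of the left input, then probe it once per right tuple."""
    table = {}
    for l in left:
        table.setdefault(l[0], []).append(l)
    probes = 0
    out = []
    for r in right:
        probes += 1
        for l in table.get(r[0], []):
            out.append(l + r[1:])
    return out, probes


nl_result, nl_work = nested_loop_join(E2, G2_MANAGERS)
hj_result, hj_work = hash_join(E2, G2_MANAGERS)
print(nl_work, hj_work, sorted(nl_result) == sorted(hj_result))
```

Both algorithms produce the same tuples; a local optimizer would pick between such alternatives using local information such as available indexes, memory, and relation sizes.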
Check Your Progress: Answer the following: a) What is a Query processor?
b) State the Query processing problem.
c) Explain the different characteristics of Query processor.
d) Describe the layer architecture of query processing.
e) Discuss Query optimization.
6.6 Summary: In this unit we have provided an overview of query processing in
distributed DBMSs. The following points are discussed:

We have introduced the function and objectives of query processing.

The goals of the query processing are discussed. They are

Given a calculus query on a distributed database, find a
corresponding execution strategy that minimizes a system cost
function, which includes I/O, CPU, and communication costs.

An execution strategy is specified in terms of relational algebra
operations and communication primitives applied to the local
databases.

We have described a characterization of query processors based on
their implementation choices. This is useful for comparing alternative
query processor designs and to understand the trade-offs between
efficiency and complexity.

We have presented a generic layering scheme for describing distributed
query processing. Four main functions have been isolated: query
decomposition, data localization, global query optimization, and local query
optimization.