DISTRIBUTED DATABASE DESIGN

advertisement
DISTRIBUTED DATABASE DESIGN
Structure
1.0 Objectives
5.1 Introduction
5.2 A Framework for Distributed Database Design
5.2.1 Objectives of the Design of Data Distribution
5.2.2 Top – Down and Bottom – Up Approach –A classical Design
Methodologies
5.3 The Design of Database Fragmentation
5.3.1 Horizontal Fragmentation
5.3.1.1 Primary Fragmentation
5.3.1.2 Derived Horizontal Fragmentation
5.3.2 Vertical Fragmentation
5.3.3 Mixed Fragmentation
5.4 The Allocation of Fragments
5.5 Summary
5.0 Objectives: In this unit you will come to know the different design aspects of
distributed databases. At the end of this unit you will be able to describe the topics
like

A framework for distributed database design
 The objectives of design of data distribution
 Top – Down and Bottom – Up design approaches

The design of database fragmentation
 Horizontal Fragmentation
 Vertical Fragmentation
 Mixed Fragmentation

The allocation of fragments
 General Criteria for Fragment allocation
5.1 Introduction: The concept of data distribution itself is difficult to design and
implement because of various technical and organizational issues. So we need to have
an efficient design methodology. From the technical aspect, the interconnection of
sites and appropriate distribution of the data and applications to the sites depending
upon the requirement of applications and for optimizing performances. From the
organizational point, the issue of decentralization is crucial and distributing an
application has a greater effect on the organization. In recent years, lot of research
work has taken place in this area and the major outcome of this are:
o Design criteria for effective data distribution
o Mathematical background of the design aids
In the section 5.2 you will learn a framework of the design including the design
approaches like Top – Down and Bottom – Up. The section 5.3 explains about the
design of Horizontal and Vertical Fragmentation. In the section 5.4 we will give
principles and concepts in Fragment allocation.
5.2 A Framework for Distributed Database Design: The design of a centralized
database concentrates on:
 Designing the conceptual schema that describes the complete database
 Designing the Physical database, which maps the conceptual schema to the
storage areas and determines the appropriate access methods.
The above two steps contributes in distributed database towards the design of Global
schema and the design of local databases. The added steps are:
 Designing the Fragmentation: - The actual procedure of dividing the existing
global relations into horizontal, vertical or mixed fragments
 Designing the allocation of fragments: -Allocation of different fragments
according to the site requirements
Before designing the Distributed database a thorough knowledge about the application
is a must. In this case we expect the following things from the designer.

Site of Origin: The site from which the application is issued.

The frequency of invoking the request at each site

The number, type and the statistical distribution of accesses made by
each application to each required data.
In the coming section let us try to know the actual need of design of data distribution.
5.2.1 Objectives of the Design of Data Distribution: In the design of data
distribution the following objectives should be considered.

Processing locality: Reducing the remote references in turn maximizing the
local references is the primary aim of the data distribution. This can be
achieved by having redundant fragment allocation meeting the site
requirements. Complete locality is an extended idea, which simplifies the
execution of application.

Availability and reliability of distributed data: Availability is achieved by
having multiple copies of the data for read only applications. Reliability is
achieved by
storing the multiple copies of the information, as it will be
helpful in case of system crashes.

Workload distribution: workload distribution is the major goal to have high
degree of parallelism.

Storage costs and Processing locality: Cost criteria and Availability of
storage areas should be intelligently handled for effective data distribution.
Using the all above criteria may increase the design complexity. So important aspects
are taken as objectives depending upon the requirement and others are treated as
constraints. In the next section let us design a simple approach for maximizing the
processing locality.
5.2.2 Top – Down and Bottom – Up Approach –A classical Design
Methodologies:
There are two classical approaches as far as distributed databases design is concerned.
They are:
1. Top – Down Approach: This may be quite useful when the system has to be
designed from the scratch. Here we follow the following steps:
 Design of Global Schema.
 Design of Fragmentation Schema.
 Design of Allocation Schema.
 Design of Local Schema (Design of “Physical Databases”).
2. Bottom - Up Approach: This can be used for an existing system. This
approach is based on the integration of existing schemata into a single, global
schema. But requires that the following aspects have to be fulfilled.
 The selection of a common database model for describing the Global
schema of the database.
 The translation of each local schema into the common data model.
 The Integration of common schemata into a common Global schema.
i.e the merging of common data definitions and the resolution of
conflicts among different representations given to the same data.
The Bottom – Up design require solving these three problems. Then of course
the design steps are just reverse of the previous method.
5.3 The Design of Database Fragmentation: Here we discuss the design of nonoverlapping fragments, which are the logical units of allocation. That is, it is
important to have an efficient design methodology so that we can overcome the
related problems of allocation. In the following, we explain the design of Horizontal,
Vertical and Mixed Fragmentations.
5.3.1 Horizontal Fragmentation: Here we discuss two important methods called
Primary and Derived. Determining the horizontal fragmentation involves knowing:
 The logical properties of the data such as fragmentation predicates.
 The statistical properties of the data such as the number of references of
applications to the fragments.
5.3.1.1 Primary Fragmentation: The correctness of Primary fragmentation requires
that each global relation be selected in one and only one fragment. Thus, determining
the primary horizontal fragmentation of a global relation requires determining a set of
disjoint and complete selection predicates (we shall define this later in this section).
The property we expect from each fragment is that the elements of them must be
referenced homogeneously by all applications.
Let G be the global relation for which we want to produce a
horizontal

primary fragmentation. Let us define some terminologies.
A Simple Predicate: is a predicate of the type:
Attribute = value

A Min-term Predicate y for a set P of simple predicates is the
conjunction of all predicates appearing in P, either taken in natural
form or negated. Thus:
y =  pi *
pi  p
Where (pi* = pi or pi * =NOT pi) and y  false

A
fragment is the set of all tuples for which a min-term predicate
holds.

A simple predicate is relevant respect to a set P of simple predicates if
there exists at least two min-term predicates of P whose expression
differs only in the predicate pi itself such that the corresponding
fragments are referenced in a different way by at least one application.
Let us try to understand the above terminologies by taking an example. Let us
consider the relations DEPT (DEPTNUM, NAME, AREA) and JOB(JOBID,JOB
NAME). Let us assume that only two departments are functioning i.e 1 & 2.Now
some examples for simple predicates are:
DEPTNUM =1 or DEPTNUM  2, DEPTNUM = 2 or DEPTNUM  1
JOB = “programmer” or JOB  “programmer”.
The corresponding min-term predicates are
DEPTNUM =1 AND JOB = “programmer”
DEPTNUM =1 AND JOB  “programmer”
DEPTNUM 1 AND JOB  “programmer”
DEPTNUM 1 AND JOB = “programmer”
Now let us concentrate on some more supporting terminologies. Let P = {p1,p2,….p
n}be
a set of simple predicates. For correct and efficient fragmentation P must be
complete and minimal.

We say that a set P of predicates is complete if and only if any two
tuples belonging to the same fragment are referenced with the same
probabilities by any application.

The set P is minimal if all its predicates are relevant.
Example: In the above example, P1 ={DEPTNUM = 1} is not complete since the
application is even interested in the employees who are “programmers”. So in this
case P2 = {DEPTNUM =1,JOB = “programmer”} is complete and minimal. The set
P3 = {DEPTNUM =1, JOB = “programmer”, SAL > 50} is complete but not minimal
since SAL >50 is not relevant.
By knowing the minimum characteristics that are to be considered now let us
generalize the method to be followed while producing fragments of the given global
relation.
I. Consider a predicate pi that partitions the tuples of the global relation G
into two parts, which are referenced differently at least by one
application. Let P = p1.
II. Consider a new simple predicate pi which partitions at least one
fragment of P into two parts, which are referenced in a different way by
at least one application. Eliminate non-relevant predicates from P.
Repeat this step until the set of min-term fragments of P is complete.
Example: Let us take two cities of Karnataka: Shimoga and Mysore. The example
application considered is the marketing of medical goods. The global schema for this
application includes the relations EMPL, DEPT, SUPPLIER and SUPPLY. These
relations look as follows:
EMPL (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
DEPT (DEPTNUM, NAME, AREA, MGRNUM)
SUPPLIER (SNUM, NAME, CITY)
SUPPLY (SNUM, PNUM, DEPTNUM, QUAN)
We design the fragmentation of SUPPLIER and DEPT with a Primary Fragmentation.
Now let us take a query. Find the names of suppliers with a given number SNUM.
As you have already come across a popular query language SQL can be used for
representing this query.
Select NAME
From SUPPLIER
Where SNUM = $Y
This query issued at any one of the sites. Let us assume that we have three
sites in our purview. Site 1 is in Shimoga, Site 2 is in Mysore and Site 3 is in between
Shimoga and Mysore. So, if the query is issued at Site 1 it references SUPPLIERS
whose CITY is “Shimoga” with almost 90% probability; if it is issued at Site 2 it
references SUPPLIERS of “Shimoga” and “Mysore” with equal probability; if it is
issued at Site3 it references SUPPLIERS whose CITY is “Mysore” with almost 90%
probability. This is because the obvious fact that department around one city tends to
use suppliers, which are close to them.
We can write the predicates for the above application,

P1: CITY = “SHIMOGA”

P2: CITY = “MYSORE”
Since the set {P1, P2} is complete and minimal, the search is terminated.
Let us now consider the global relation DEPT: DEPT (DEPTNUM, NAME,
AREA, MGRNUM). Some example predicates that are suitable for administrative
applications are considered.
P1: DEPTNUM < = 10
P2: (10 < DEPTNUM < = 20)
P3: DEPTNUM > 20
P4: AREA = “NORTH”
P5: AREA = “SOUTH”
If we assume that in the northern area the departments with DEPTNUM > 20
will never be there, then AREA = “NORTH” implies that DEPTNUM > 20 is false.
Thus the fragments are reduced to the following four:
Y1: DEPTNUM < = 10
Y2: (10 < DEPTNUM < = 20) AND (AREA = “NORTH” )
Y3: (10 < DEPTNUM < = 20) AND (AREA = “SOUTH” )
Y4: DEPTNUM > 20
If we now concentrate about the fragment allocation we can easily allocate
fragments corresponding to y1 and y4 at sites 1 and 3.But depending upon the
requirement fragments y2 and y3 will be allocated to either sites 1 or 3.
5.3.1.2 Derived Horizontal Fragmentation: This is not based on the properties of its
own attributes, but it is derived from the horizontal fragmentation of another relation.
It is used to make the join between the fragments. A distributed join is a join
between horizontally fragmented relations. That is when you want to join the two
relations G and H you have to compare their fragments. Join Graphs can efficiently
represent it. The fig 5.1 represents the different possible join graphs.
o Total: The join graph is total when it contains all possible edges
between fragments of G and H.
o Reduced: The join graph is reduced when some of the edges between
G and H are missing. Here we have two types:
 Partitioned: A reduced graph is partitioned if the graph is
composed of two or more sub graphs without edge between
them.
 Simple: A reduced graph is simple if it is partitioned and each
sub graph has just one edge.
Example: Consider the relation SUPPLY (SNUM, PNUM, DEPTNUM, QUAN). Let
us take the following case. Some application
o Requires the information about supplies of given suppliers; thus they
join between SUPPLY and SUPPLIER in the SNUM attribute.
o Requires the information about supplies at a given department; then
they perform join between SUPPLY and DEPT on the DEPTNUM
attribute.
Let us assume that the relation DEPT is horizontally fragmented on the attribute
DEPTNUM and that SUPPLIER is horizontally fragmented on the attribute SNUM.
The derived horizontal fragmentation can be obtained for relation SUPPLY by either
performing a Semi - join operation with SUPPLIER on SNUM or with DEPT on
DEPTNUM; both of them are correct.
5.3.2 Vertical Fragmentation: This requires grouping the attributes into sets, which
are referenced in the similar manner by applications. This method has been discussed
by considering two separate types of problems:

The Vertical Partitioning Problem: Here set must be disjoint. Of
course one attribute must be common. For example assume that a
relation S is vertically fragmented using this concept into S1 and
S2.This can be useful where an application can be executed using
either S1 or S2.Otherwise having the complete S at a particular site
may be a unnecessary burden.
Two possible design approaches:
1. The
split
approach:
The global relations are
progressively split into fragments
2. The
Grouping
approach:
The
attributes
are
progressively aggregated to constitute fragments.
Both are Heuristic approaches as each iteration steps look for best
choice. In both the cases formulas are used to indicate the best
possible splitting or grouping.
R1
R1
S1
R1
S1
R2
S2
S1
R2
R2
S2
R3
R3
S3
R4
S4
R3
S3
R4
S1
R4
S2
R5
Figure 5.1 The different possible join graphs

The Vertical Clustering Problem: Here sets can overlap. Here
depending upon the requirement you may have more than one common
attribute in the two different fragments of a global relation. It
introduces Replication within fragments, as some common attributes
are present in the fragments. It is suitable only for Read-Only
applications; because for applications, which involve frequent updating
of these common attributes needs to be referred to the sites where all
these attributes are present. Therefore, Vertical clustering is suggested
where overlapping attributes are not heavily updated.
Example: Consider the global relation
EMPL (EMPNUM, NAME, SAL, TAX,
MGRNUM, DEPTNUM). The following are made:

Administrative
applications,
requires
NAME,
SAL,
TAX
of
employees.

The department, requires NAME, MGRNUM and DEPTNUM
Here Vertical clustering is suggested as the attribute NAME is required in both the
fragments. So the fragments may be:
EMPL1 (EMPNUM, NAME, SAL, TAX)
EMPL2 (EMPNUM, NAME, MGRNUM, DEPTNUM)
5.3.3 Mixed Fragmentation: The simple way for performing this is:

Apply Horizontal fragmentation to Vertical fragments

Apply Vertical fragmentation to Horizontal fragments
Both these aspects are illustrated using the following diagrams 5.2 and 5.3.
A1
A2
A3
A4
A5
Fig: 5.2 Vertical fragmentations followed by horizontal fragmentation.
A1
A2
A3
A4
A5
Fig: 5.3 Vertical fragmentations followed by horizontal fragmentation
5.4 The Allocation of Fragments:
In this section we explain the different aspects to be considered when you go for
allocating a particular fragment to site. This section describes some general criteria
that can be used for allocating fragments. There are two types of allocation methods,
which can be followed. They are:
Non-redundant Allocation: It is simple. A method known as “Best-fit approach” can
be used; i.e a measure is associated with each possible allocation, and the site with the
bets measure is selected. It avoids placing a fragment at a given site where already a
fragment is present which is related to this fragment.

Redundant Allocation: It is complex design, since:
o The degree of replication is a variable of the problem.
o The modeling of read applications is complicated as the applications
may select any of the several alternatives.
The following two methods can be used for determining the redundant
allocation of fragments:

Determine the set of all sites where the benefit of allocating one
copy of the fragment is higher than the cost, and allocate a copy
of the fragment to each element of this site; this method selects
“all beneficial sites”.

Start from a non-replicated version. Then progressively
introduce replicated copies from the most beneficial; the
process is terminated when no additional replication is
beneficial.
Both the reliability and availability of the system increases if there are two or three
copies of the fragment, but further copies give a less than proportional increase.
5.5 Summary:
In this unit we have discussed the four phases of the design of Distributed databases:
Global schema, Fragmentation schema, Allocation schema and Local schema. Some
important aspects of design of fragmentation and allocation schemas are described in
detail. Also some of the practical examples are chosen for familiarizing the new
concepts.
Download