DISTRIBUTED DATABASE DESIGN

Structure

5.0 Objectives
5.1 Introduction
5.2 A Framework for Distributed Database Design
    5.2.1 Objectives of the Design of Data Distribution
    5.2.2 Top-Down and Bottom-Up Approaches (Classical Design Methodologies)
5.3 The Design of Database Fragmentation
    5.3.1 Horizontal Fragmentation
        5.3.1.1 Primary Fragmentation
        5.3.1.2 Derived Horizontal Fragmentation
    5.3.2 Vertical Fragmentation
    5.3.3 Mixed Fragmentation
5.4 The Allocation of Fragments
5.5 Summary

5.0 Objectives:

In this unit you will learn about the different design aspects of distributed databases. At the end of this unit you will be able to describe:

o A framework for distributed database design
o The objectives of the design of data distribution
o The Top-Down and Bottom-Up design approaches
o The design of database fragmentation
o Horizontal fragmentation
o Vertical fragmentation
o Mixed fragmentation
o The allocation of fragments
o General criteria for fragment allocation

5.1 Introduction:

Data distribution is difficult to design and implement because of various technical and organizational issues, so an efficient design methodology is needed. From the technical point of view, the issues are the interconnection of the sites and the appropriate distribution of data and applications to the sites, depending on the requirements of the applications and with the aim of optimizing performance. From the organizational point of view, the issue of decentralization is crucial, since distributing an application has a great effect on the organization.

In recent years a lot of research has taken place in this area, and its major outcomes are:

o Design criteria for effective data distribution
o Mathematical background of the design aids

In Section 5.2 you will learn a framework for the design, including the Top-Down and Bottom-Up design approaches. Section 5.3 explains the design of horizontal and vertical fragmentation.
Section 5.4 gives the principles and concepts of fragment allocation.

5.2 A Framework for Distributed Database Design:

The design of a centralized database concentrates on:

o Designing the conceptual schema, which describes the complete database.
o Designing the physical database, which maps the conceptual schema to the storage areas and determines the appropriate access methods.

In a distributed database, these two steps correspond to the design of the global schema and the design of the local physical databases. Two further steps are added:

o Designing the fragmentation: the actual procedure of dividing the global relations into horizontal, vertical or mixed fragments.
o Designing the allocation of fragments: the assignment of the different fragments to sites according to the site requirements.

A thorough knowledge of the applications is a must before designing a distributed database. The designer is expected to know the following:

o Site of origin: the site from which each application is issued.
o The frequency with which each application is invoked at each site.
o The number, type and statistical distribution of the accesses made by each application to each required data item.

In the next section we look at the actual objectives of the design of data distribution.

5.2.1 Objectives of the Design of Data Distribution:

In the design of data distribution the following objectives should be considered.

o Processing locality: The primary aim of data distribution is to maximize local references and thereby reduce remote references. This can be achieved by a redundant fragment allocation that meets the site requirements. Complete locality, where an application can be executed entirely at its site of origin, extends this idea and simplifies the execution of the application.
o Availability and reliability of distributed data: Availability is achieved by keeping multiple copies of the data for read-only applications. Reliability is achieved by storing multiple copies of the information, which helps in case of system crashes.
o Workload distribution: Distributing the workload over the sites is a major goal, because it allows a high degree of parallelism.
o Storage costs and availability: The cost criteria and the availability of storage at each site should be handled intelligently for effective data distribution.

Using all of the above criteria at once may increase the design complexity. Therefore the most important aspects are taken as objectives, depending on the requirements, and the others are treated as constraints. The sections that follow describe a simple design approach aimed at maximizing processing locality.

5.2.2 Top-Down and Bottom-Up Approaches (Classical Design Methodologies):

There are two classical approaches to the design of distributed databases:

1. Top-Down Approach: This is useful when the system has to be designed from scratch. The steps are:

   o Design of the global schema.
   o Design of the fragmentation schema.
   o Design of the allocation schema.
   o Design of the local schemata (design of the "physical databases").

2. Bottom-Up Approach: This can be used for an existing system. It is based on the integration of existing schemata into a single global schema, and requires the following:

   o The selection of a common database model for describing the global schema of the database.
   o The translation of each local schema into the common data model.
   o The integration of the translated local schemata into a common global schema, i.e. the merging of common data definitions and the resolution of conflicts among different representations given to the same data.

   Bottom-Up design requires solving these three problems. The design steps are then carried out in the reverse order of the Top-Down method.

5.3 The Design of Database Fragmentation:

Here we discuss the design of non-overlapping fragments, which are the logical units of allocation. An efficient design methodology for fragments is important because it helps to overcome the related problems of allocation.
In the following, we explain the design of horizontal, vertical and mixed fragmentation.

5.3.1 Horizontal Fragmentation:

Here we discuss two important methods, called primary and derived fragmentation. Determining the horizontal fragmentation involves knowing:

o The logical properties of the data, such as the fragmentation predicates.
o The statistical properties of the data, such as the number of references of applications to the fragments.

5.3.1.1 Primary Fragmentation:

The correctness of primary fragmentation requires that each tuple of the global relation be selected in one and only one fragment. Thus, determining the primary horizontal fragmentation of a global relation requires determining a set of disjoint and complete selection predicates (completeness is defined later in this section). The property we expect from each fragment is that its elements are referenced homogeneously by all applications.

Let G be the global relation for which we want to produce a primary horizontal fragmentation. We first define some terminology.

o A simple predicate is a predicate of the type: Attribute = value.
o A min-term predicate y for a set P of simple predicates is the conjunction of all predicates appearing in P, each taken either in natural form or negated. Thus:

      y = \bigwedge_{p_i \in P} p_i^*

  where each p_i^* is either p_i or NOT p_i, and y \neq false.
o A fragment is the set of all tuples for which a min-term predicate holds.
o A simple predicate p_i is relevant with respect to a set P of simple predicates if there exist at least two min-term predicates of P whose expressions differ only in the predicate p_i itself (natural in one, negated in the other), such that the corresponding fragments are referenced in a different way by at least one application.

Let us try to understand this terminology through an example. Consider the relations DEPT (DEPTNUM, NAME, AREA) and JOB (JOBID, JOBNAME).
Let us assume that only two departments are functioning, i.e. departments 1 and 2. Some examples of simple predicates, together with their negated forms, are:

   DEPTNUM = 1 or DEPTNUM ≠ 2
   DEPTNUM = 2 or DEPTNUM ≠ 1
   JOB = "programmer" or JOB ≠ "programmer"

The corresponding min-term predicates are:

   DEPTNUM = 1 AND JOB = "programmer"
   DEPTNUM = 1 AND JOB ≠ "programmer"
   DEPTNUM ≠ 1 AND JOB ≠ "programmer"
   DEPTNUM ≠ 1 AND JOB = "programmer"

Now let us look at some more supporting terminology. Let P = {p1, p2, ..., pn} be a set of simple predicates. For a correct and efficient fragmentation, P must be complete and minimal.

o A set P of predicates is complete if and only if any two tuples belonging to the same fragment are referenced with the same probability by any application.
o The set P is minimal if all its predicates are relevant.

Example: In the above example, P1 = {DEPTNUM = 1} is not complete, since the application is also interested in the employees who are programmers. In this case P2 = {DEPTNUM = 1, JOB = "programmer"} is complete and minimal. The set P3 = {DEPTNUM = 1, JOB = "programmer", SAL > 50} is complete but not minimal, since SAL > 50 is not relevant.

Knowing these required characteristics, we can now generalize the method to be followed when producing the fragments of a given global relation:

I. Consider a predicate p1 that partitions the tuples of the global relation G into two parts which are referenced differently by at least one application. Let P = {p1}.

II. Consider a new simple predicate pi which partitions at least one fragment of P into two parts which are referenced in a different way by at least one application. Add pi to P and eliminate the non-relevant predicates from P. Repeat this step until the set of min-term fragments of P is complete.

Example: Let us take two cities of Karnataka: Shimoga and Mysore. The example application is the marketing of medical goods. The global schema for this application includes the relations EMPL, DEPT, SUPPLIER and SUPPLY.
These relations look as follows:

   EMPL (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
   DEPT (DEPTNUM, NAME, AREA, MGRNUM)
   SUPPLIER (SNUM, NAME, CITY)
   SUPPLY (SNUM, PNUM, DEPTNUM, QUAN)

We design the fragmentation of SUPPLIER and DEPT with a primary fragmentation.

Consider the query "find the name of the supplier with a given number SNUM". It can be expressed in SQL, the query language you have already come across:

   SELECT NAME
   FROM SUPPLIER
   WHERE SNUM = $Y

This query can be issued at any one of the sites. Let us assume that there are three sites: Site 1 is in Shimoga, Site 3 is in Mysore and Site 2 is in between Shimoga and Mysore. If the query is issued at Site 1, it references suppliers whose CITY is "Shimoga" with almost 90% probability; if it is issued at Site 2, it references suppliers of "Shimoga" and "Mysore" with equal probability; if it is issued at Site 3, it references suppliers whose CITY is "Mysore" with almost 90% probability. This reflects the obvious fact that departments around one city tend to use suppliers that are close to them.

We can write the following predicates for this application:

   P1: CITY = "Shimoga"
   P2: CITY = "Mysore"

Since the set {P1, P2} is complete and minimal, the search terminates.

Let us now consider the global relation DEPT (DEPTNUM, NAME, AREA, MGRNUM). Some example predicates that are suitable for administrative applications are:

   P1: DEPTNUM <= 10
   P2: 10 < DEPTNUM <= 20
   P3: DEPTNUM > 20
   P4: AREA = "NORTH"
   P5: AREA = "SOUTH"

If we assume that departments with DEPTNUM > 20 never occur in the northern area, then AREA = "NORTH" implies that DEPTNUM > 20 is false.
Thus the fragments are reduced to the following four (keeping y1 as a single fragment relies on the analogous assumption that departments with DEPTNUM <= 10 never occur in the southern area):

   y1: DEPTNUM <= 10
   y2: (10 < DEPTNUM <= 20) AND (AREA = "NORTH")
   y3: (10 < DEPTNUM <= 20) AND (AREA = "SOUTH")
   y4: DEPTNUM > 20

Turning to fragment allocation, the fragments corresponding to y1 and y4 can easily be allocated at sites 1 and 3 respectively. The fragments y2 and y3 are allocated to either site 1 or site 3, depending on the requirements.

5.3.1.2 Derived Horizontal Fragmentation:

A derived horizontal fragmentation of a relation is not based on the properties of its own attributes; it is derived from the horizontal fragmentation of another relation. It is used to make joins between fragments easier.

A distributed join is a join between horizontally fragmented relations. When you want to join two relations G and H, you have to compare their fragments. Join graphs represent this efficiently. Figure 5.1 shows the different possible join graphs.

o Total: The join graph is total when it contains all possible edges between fragments of G and H.
o Reduced: The join graph is reduced when some of the edges between fragments of G and H are missing. There are two types:
   o Partitioned: A reduced graph is partitioned if it is composed of two or more subgraphs without edges between them.
   o Simple: A reduced graph is simple if it is partitioned and each subgraph has just one edge.

Example: Consider the relation SUPPLY (SNUM, PNUM, DEPTNUM, QUAN), and the following case:

o Some applications require information about the supplies of given suppliers; they join SUPPLY and SUPPLIER on the SNUM attribute.
o Other applications require information about the supplies at a given department; they join SUPPLY and DEPT on the DEPTNUM attribute.

Let us assume that the relation DEPT is horizontally fragmented on the attribute DEPTNUM and that SUPPLIER is horizontally fragmented on the attribute SNUM.
The derived horizontal fragmentation of the relation SUPPLY can be obtained by performing a semi-join either with the fragments of SUPPLIER on SNUM or with the fragments of DEPT on DEPTNUM; both choices are correct.

[Figure 5.1: The different possible join graphs]

5.3.2 Vertical Fragmentation:

Vertical fragmentation requires grouping the attributes into sets which are referenced in a similar manner by the applications. It has been studied as two separate types of problems.

The Vertical Partitioning Problem: Here the attribute sets must be disjoint, except for one common attribute (typically the key), which must appear in every fragment so that the relation can be reconstructed. For example, assume that a relation S is vertically fragmented in this way into S1 and S2. This is useful when an application can be executed using either S1 or S2 alone; otherwise, having the complete S at a particular site may be an unnecessary burden.

There are two possible design approaches:

1. The split approach: The global relations are progressively split into fragments.
2. The grouping approach: The attributes are progressively aggregated to constitute fragments.

Both are heuristic approaches, since each iteration step looks for the best choice at that step. In both cases formulas are used to indicate the best possible splitting or grouping.

The Vertical Clustering Problem: Here the attribute sets can overlap. Depending on the requirements, two different fragments of a global relation may have more than one attribute in common. This introduces replication within the fragments, since the common attributes are present in several of them. Vertical clustering is best suited to read-only applications, because an application that frequently updates the common attributes must refer to all the sites where these attributes are present. Therefore, vertical clustering is suggested where the overlapping attributes are not heavily updated.

Example: Consider the global relation EMPL (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM).
The following assumptions are made:

o Administrative applications require NAME, SAL and TAX of employees.
o Departmental applications require NAME, MGRNUM and DEPTNUM.

Here vertical clustering is suggested, as the attribute NAME is required in both fragments. The fragments may therefore be:

   EMPL1 (EMPNUM, NAME, SAL, TAX)
   EMPL2 (EMPNUM, NAME, MGRNUM, DEPTNUM)

5.3.3 Mixed Fragmentation:

The simple ways of performing mixed fragmentation are:

o Apply horizontal fragmentation to vertical fragments.
o Apply vertical fragmentation to horizontal fragments.

Both are illustrated in Figures 5.2 and 5.3.

[Figure 5.2: Vertical fragmentation followed by horizontal fragmentation, over attributes A1 to A5]

[Figure 5.3: Horizontal fragmentation followed by vertical fragmentation, over attributes A1 to A5]

5.4 The Allocation of Fragments:

In this section we explain the aspects to be considered when allocating a particular fragment to a site, and describe some general criteria that can be used for allocating fragments. There are two types of allocation:

Non-redundant Allocation: This is the simpler case. A method known as the "best-fit approach" can be used: a measure is associated with each possible allocation, and the site with the best measure is selected. This approach does not take into account the mutual effect of placing a fragment at a site where a related fragment is already present.

Redundant Allocation: This is a more complex design problem, since:

o The degree of replication is a variable of the problem.
o The modeling of read applications is complicated, as an application may select any of several alternative copies.

The following two methods can be used for determining the redundant allocation of fragments:

o Determine the set of all sites where the benefit of allocating one copy of the fragment is higher than the cost, and allocate a copy of the fragment to each element of this set; this method selects "all beneficial sites".
o Start from a non-replicated version.
  Then progressively introduce replicated copies, starting from the most beneficial; the process terminates when no additional replication is beneficial.

Both the reliability and the availability of the system increase if there are two or three copies of a fragment, but further copies give a less than proportional increase.

5.5 Summary:

In this unit we have discussed the four phases of the design of distributed databases: the global schema, the fragmentation schema, the allocation schema and the local schemata. Some important aspects of the design of the fragmentation and allocation schemas were described in detail, and some practical examples were chosen to make the new concepts familiar.
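
As a closing illustration of Section 5.4, the "all beneficial sites" method can be sketched in Python. This is only a minimal sketch under an assumed cost model that does not come from this unit: the benefit of a copy at a site is the number of reads issued at that site (they become local), and the cost is the number of updates issued at all other sites (they must be sent to the copy), scaled by a hypothetical factor update_weight. The function names, the site names and the frequencies are illustrative assumptions.

```python
from typing import Dict, List


def net_benefit(site: str, reads: Dict[str, float], updates: Dict[str, float],
                update_weight: float = 1.0) -> float:
    """Benefit minus cost of keeping one copy of a fragment at `site`.

    Assumed model (not taken from the unit): reads issued at the site become
    local (benefit); updates issued at every other site must be propagated
    to this copy (cost), weighted by `update_weight`.
    """
    remote_updates = sum(u for s, u in updates.items() if s != site)
    return reads.get(site, 0.0) - update_weight * remote_updates


def all_beneficial_sites(reads: Dict[str, float], updates: Dict[str, float],
                         update_weight: float = 1.0) -> List[str]:
    """Return every site where a copy has a strictly positive net benefit.

    If no site is beneficial, fall back to a non-redundant allocation and
    return only the single site with the best measure (best-fit).
    Assumes at least one site appears in `reads` or `updates`.
    """
    sites = sorted(set(reads) | set(updates))
    scores = {s: net_benefit(s, reads, updates, update_weight) for s in sites}
    chosen = [s for s in sites if scores[s] > 0]
    if not chosen:
        chosen = [max(sites, key=lambda s: scores[s])]
    return chosen


# Hypothetical access frequencies for one fragment at three sites.
reads = {"site1": 90, "site2": 50, "site3": 10}
updates = {"site1": 5, "site2": 5, "site3": 20}
print(all_beneficial_sites(reads, updates))  # ['site1', 'site2']
```

With these figures, site 3 gains 10 local reads but would have to receive 10 remote updates, so a copy there does not pay off. Raising update_weight makes replication less attractive, which matches the observation that heavily updated data should have few copies.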