Distributed Database Systems COP5711 What is a Distributed Database System ? A distributed database is a collection of databases which are distributed over different computers of a computer network. • Each site has autonomous processing capability and can perform local applications. • Each site also participates in the execution of at least one global application which requires accessing data at several sites. Multiprocessor Database Computers Cannot run an application by itself Application (front-end) computer Interface Processor Access Processor Access Processor Access Processor What we miss here is the existence of local applications, in the sense that the integration of the system has reached the point where no one of the computers (i.e., IFPs & ACPs) is capable of executing an application by itself. Why Distributed Databases ? 1. Local Autonomy: permits setting and enforcing local policies regarding the use of local data (suitable for organization that are inherently decentralized). 2. Improved Performance: The regularly used data is proximate to the users and given the parallelism inherent in distributed systems. 3. Improved Reliability/Availability: Data replication can be used to obtain higher reliability and availability. The autonomous processing capability of the different sites ensures a graceful degradation property. 4. Incremental Growth: supports a smooth incremental growth with a minimum degree of impact on the already existing sites. 5. Shareability: allows preexisting sites to share data. 6. Reduced Communication Overhead: The fact that many applications are local clearly reduces the communication overhead with respect to centralized databases. Disadvantages of DDBSs Cost: replication of effort (manpower). Security: More difficult to control Complexity: • The possible duplication is mainly due to reliability and efficiency considerations. Data redundancy, however, complicates update operations. • If some sites fail while an update is being executed, the system must make sure that the effects will be reflected on the data residing at the failing sites as soon as the system can recover from the failure. • The synchronization of transactions on multiple sites is considerably harder than for a centralized system. Distributed DBMS Architecture NetworkTransparancy • The user should be protected from the operational details of the network. • It is desirable to hide even the existence of the network, if possible. Location transparency: The command used is independent of the system on which the data is stored. Naming transparency: a unique name is provided for each object in the database. Replication & Fragmentation Transparancy • The user is unaware of the replication of framents • Queries are specified on the relations (rather than the fragments). Copy 1 of R1 Site A Copy 1 of R2 Relation R Fragment R1 Fragment R2 Copy 2 of R1 Site B Fragment R3 Site C Fragment R4 Copy 2 of R2 ANSI/SPARC Architecture External Schema External view External view Conceptual Schema Conceptual view Internal Schema Internal view External view Internal view: deals with the physical definition and organization of data. Conceptual view: abstract definition of the database. It is the “real world” view of the enterprise being modeled in the database. External view: individual user’s view of the database. A Taxonomy of Distributed Data Systems A distributed database can be defined as • a logically integrated collection of shared data which is • physically distributed across the nodes of a computer network. Distributed data systems Homogeneous Federated Loosely coupled (interoperable DB systems using export schema) Heterogeneous (Multidatabase) Unfederated (no local users) Tightly coupled (/w global schema) Architecture of a Homogeneous DDBMS Global user view 1 Global user view n A homogeneous DDBMS resembles a Global Schema centralized DB, but Fragmentation Schema instead of storing all the data at one site, Allocation Schema the data is distributed Local conceptual schema 1 Local conceptual schema n Local internal schema 1 Local internal schema n Local DB 1 Local DB n across a number of sites in a network. Fragmentation Schema & Allocation Schema Fragmentation Schema: describes how the global relations are divided into fragments. Allocation Schema: specifies at which sites each fragment is stored. Example: Fragmentation of global relation R. A B C D E To materialize R, the following operations are required: R = (A B) U ( C D) U E Homogeneous vs. Heterogeneous • Homogeneous DDBMS Global user Local user Multidatabase Management system DBMS Database 1 – No local users Local user – Most systems do not have local schemas (i.e., every user uses the same schema) • Heterogeneous DDBMS – There are both local and global users DBMS DBMS DBMS Database 2 Database 3 Database 4 – Multidatabase systems are split into: • Tightly Coupled Systems: have a global schema • Loosely Coupled Systems: do not have a global schema. Schema Architecture of a TightlyCoupled System Global user view 1 Global user view n Global Conceptual Schema Auxiliary Schema 1 Local user view 1 Local user view 2 An individual node’s participation in the MDB is defined by means of a participation schema. Local Participation Schema 1 Local Participation Schema 1 Auxiliary Schema 1 Local Conceptual Schema 1 Local Conceptual Schema 1 Local user view 1 Local Internal Schema 1 Local Internal Schema 1 Local user view 2 Local DB 1 Local DB 1 Auxiliary Schema (1) Auxiliary schema describes the rules which govern the mappings between the local and global levels. Rules for unit conversion: may be required when one site expresses distance in kilometers and another in miles, … Rules for handling null values: may be necessary where one site stores additional information which is not stored at another site. – Example: One site stores the name, home address and telephone number of its employees, whereas another just stores names and addresses. Auxiliary Schema (2) Rules for naming conflicts: naming conflicts occur when: semantically identical data items are named differently • DNAME Department name (at Site 1) • DEPTNAME Department name (at Site 2) semantically different data items are named identically. • NAME Department name (at Site 1) • NAME Manager name (at Site 2) Rules for handling data representation conflicts: Such conflicts occur when semantically identical data items are represented differently in different data source. Example: Data represented as a character string in one database may be represented as a real number in the other database. Auxiliary Schema (3) Rules for handling data scaling conflicts: Such conflicts occur when semantically identical data items stored in different databases using different units of measure. Example: “Large”, “New”, “Good”, etc. These problems are called domain mismatch problems Loosely-Coupled Systems (Interoperable Database Systems) Local user view 1 Local user view 2 Global user view 1 Global user view 2 Global user view 3 Local Conceptual schema 1 Local Conceptual Schema 2 Local Conceptual Schema n Local internal schema 1 Local internal Schema 2 Local internal Schema n Local DB 1 Local DB 2 Local DB n Loosely-Coupled Systems Global user view 1 Export schema 1 Local user view 1 Local user view 2 Global user view 2 Export schema 2 Export Schema 3 Global user view m Export Schema n Local Conceptual schema 1 Local Conceptual Schema 2 Local Conceptual Schema n Local internal schema 1 Local internal Schema 2 Local internal Schema n Local DB 1 Local DB 2 Local DB n Integration of Heterogeneous Data Models • Provide bidirectional translators between all pairs of models – Advantage: support multiple models at the global level. No need to learn another data model and language – Disadvantage: requires n(n-1) translators, where n is the number of different models. • Adopt a single model (called canonical model) at the global level and map all the local models onto this model – Advantage: requires only 2n translators – Disadvantage: translations must go through the global model. (The 2nd approach is more widely used) Distributed Database Design •Top-Down Approach: The database system is being designed from scratch. • Issues: fragmentation & allocation •Bottom-up Approach: Integrating existing databases into one database • Issues: Design of the export and global schemas. TOP-DOWN DESIGN PROCESS Requirements Analysis Entity analysis + functional analysis System Requirements (Objectives) Conceptual design Global conceptual schema Defining the interfaces for end users View integration View Design Access information External Schema Definitions Distribution Design Local Conceptual Schemas Maps the local conceptual schemas to physical storage devices Physical Design Physical Schema Fragmentation & allocation Design Consideration (1) The organization of distributed systems can be investigated along three dimensions: Level of sharing 1. No sharing: Each application and its data execute at one site. 2. Data sharing: Programs are replicated at all sites, but data files are not. 3. Data + Program Sharing: Both data and programs may be shared. Design Consideration (2) Access Pattern 1. Static: Access patterns do not change. 2. Dynamic: Access patterns change over time. Level of Knowledge 1. No information 2. Partial information: Access patterns may deviate from the predictions. 3. Complete information: Access patterns can reasonably be predicted. Fragmentation Alternatives J JNO JNAME BUDGET J1 J2 J3 J4 Instrumental Database Dev. CAD/CAM Maintenance 150,000 135,000 250,000 350,000 LOC Montreal New York New York Paris Horizontal Partitioning J1 JNO J1 J2 J2 JNAME Instrumental Database Dev. Vertical Partitioning BUDGET LOC 150,000 135,000 Montreal New York JNO JNAME BUDGET LOC J3 J4 CAD/CAM Maintenance. 150,000 310,000 Montreal Paris JNO J1 J2 J3 J4 JNO J1 J2 J3 J4 BUDGET 150,000 135,000 250,000 310,000 JNAME Instrumentation Database Devl CAD/CAM Maintenance LOC Montreal New York New York Paris Why fragment at all? Reasons: • Interquery concurrency • Intraquery concurrency Disadvantages: • Vertical fragmentation may incur overhead. • Attributes participating in a dependency may be allocated to different sites. Integrity checking is more costly. Degree of Fragmentation • Application views are usually subsets of relations. Hence, it is only natural to consider subsets of relations as distribution units. • The appropriate degree of fragmentation is dependent on the applications. Correctness Rules • Vertical Partitioning • Lossless decomposition • Dependency preservation • Horizontal Partitioning • Disjoint fragments Allocation Alternatives •Partitioning: No replication •Partial Replication: Some fragments are replicated •Full Replication: Database exists in its entirety at each site Notations S Title SAL L1 E ENO ENAME TITLE J JNO JNAME BUDGET LOC L2 G L3 ENO JNO RESP DUR L1: 1-to-many relationship S: Owner(L1), Source relation E: Member(L1), Target relation Simple Predicates Given a relation R(A1, A2, …, An) where Ai has domain Di, a simple predicate pj defined on R has the form pj: Ai Value where {, , , , , } and Value D i Example: J JNO J1 J2 J3 J4 JNAME Instrumental Database Dev. CAD/CAM Maintenance Simple predicates: BUDGET 150,000 135,000 250,000 350,000 LOC Montreal New York New York Orlando p1: JNAME = “Maintenance” P2: BUDGET < 200,000 Note: A simple predicate defines a data fragment MINTERM PREDICATE Given a set of simple predicates for relation R. P = {p1, p2, …, pm} The set of minterm predicates M = {m1, m2, …, mn} is defined as * p M = {mi | mi = p j P j where } TITLE SAL Elect. Eng. 40,000 Syst. Analy. 54,000 Mech. Eng. 32,000 Programmer 42,000 p*j p j or p*j p j Possible simple predicates: P1: TITLE=“Elect. Eng.” P2: TITLE=“Syst. Analy” P3: TITLE=“Mech. Eng.” P4: TITLE=“Programmer” P5: SAL ≤ 35,000 P6: SAL > 35,000 Some corresponding minterm predicates: m1 : TITLE " Elect .Eng." SAL 30,000 m 2 : TITLE " Elect .Eng" SAL 30,000 A minterm predicate defines a data fragment Primary Horizontal Fragmentation A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema. E ENO ENAME TITLE J JNO JNAME BUDGET LOC L2 G ENO JNO RESP DUR L3 Owner(L3) = J A possible fragmentation of J is defined as follows: J1 BUDGET200,000 ( J ) J 2 BUDGET200,000 ( J ) Horizontal Fragments Thus, a horizontal fragment Ri of relation R consists of all the tuples of R that satisfy a minterm predicate mi. There are as many horizontal fragments (also called minterm fragments) as there are minterm predicates. Completeness (1) A set of simple predicate Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr. Simple Predicates A1 ≥ k1 Minterm Fragments A3 ≤ k3 A4 = k4 p1 F1 A2 = k2 F2 Applications p1 p3 A1 p3 A2 A3 F3 Complete The fragments look homogeneous A4 Completeness (2) Simple Predicates A1 ≥ k1 Minterm Fragments A3 ≤ k3 A4 = k4 p1 F1 A2 = k2 F2 F3 Applications p1 p3 A1 p3 p4 p5 A2 A3 A4 Set of simple predicates is incomplete Completeness (2) Simple Predicates A1 ≥ k1 Minterm Fragments p1 F1 A2 = k2 F2 A3 ≤ k3 A4 = k4 F31 F3 A5 > k5 Applications p1 p3 A1 p3 p4 p5 A2 A3 A4 F32 Additional simple predicate Now complete ! Completeness (4) A set of simple predicate Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr. J 1 LOC " MONTREAL " ( J ) J 2 LOC " NewYork " ( J ) Case 1: The only application that accesses J wants to access the tuples according to the location. J 3 LOC " Orlando " ( J ) The set of simple predicates LOC=“Montreal” J J1 LOC=“New York” J2 LOC=“Orlando” J3 LOC=“Montreal”, Pr = LOC=“New York”, LOC=“Orlando” is complete because each tuple of each fragment has the same probability of being accessed. Completeness (5) Example: J1 J2 LOC=“Montreal”, Pr = LOC=“New York”, LOC=“Orlando” J3 JNO 001 JNAME Instrumental BUDGET 150,000 LOC Montreal JNO 004 007 JNAME GUI CAD/CAM BUDGET LOC 135,000 New York 250,000 New York JNO 003 JNAME Database Dev. BUDGET LOC 310,000 Orlando Case 2: There is a second application which accesses only those project tuples where the budget is less than $200,000. Since tuple “004” is accessed more frequently than tuple “007”, Pr is not complete. To make the the set complete, we need to add (BUDGET< 200,000) to Pr. Completeness (6) BUDGET<=200,000 LOC=“Montreal” J1 J LOC=“New York” J2 LOC=“Orlando” J3 J11 J12 BUDGET>200,000 BUDGET<=200,000 J21 Small-budget applications J22 BUDGET>200,000 BUDGET<=200,000 J31 J32 BUDGET>200,000 Note: Completeness is a desirable property because a complete set defines fragments that are not only logically uniform in that they all satisfy the minterm predicate, but statistically homogeneous. Redundant Fragmentation Logically uniform & statistically homogeneous fragment Fragment 1 Fragment 2 • Fragments 1 and 2 have the same characteristics • The fragmentation is unnecessary Minimality Relevant: Let mi and mj be two almost identical minterm predicates: mi = p1 Λ p2 Λ p3 fragment fi mj = p1 Λ ¬ p2 Λ p3 fragment fj p2 is relevant if and only if acc (m j ) acc(mi ) card ( f i ) card ( f j ) f p1 f1 p3 f12 p2 fi ¬p2 fj Access frequency Cardinality Prob1 Prob2 A Prob1 ≠ Prob2 Minimality Relevant: Let mi and mj be two almost identical minterm predicates: mi = p1 Λ p2 Λ p3 fragment fi mj = p1 Λ ¬ p2 Λ p3 fragment fj p2 is relevant if and only if acc (m j ) acc(mi ) card ( f i ) card ( f j ) Access frequency Cardinality That is, there should be at least one application that accesses fi and fj differently. i.e., The simple predicate pi should be relevant in determining a fragmentation. Minimal: If all the predicates of a set Pr are relevant, Pr is minimal. A Complete and Minimal Example Two applications: 1. One application accesses the tuples according to location. 2. Another application accesses only those project tuples where the budget is less than $200,000. Case 1: Pr={Loc=“Montreal”, Loc=“New York”, Loc=“Orlando”, BUDGET<=200,000,BUDGET>200,000} is complete and minimal. Case 2: If, however, we were to add the predicate JNAME= “Instrumentation” to Pr, the resulting set would not be minimal since the new predicate is not relevant with respect to the applications. BUDGET<=200,000 LOC=“Montreal” J1 J LOC=“New York” J2 LOC=“Orlando” J3 JNAME = “Instrument” J11 J121 J12 J122 BUDGET>200,000 JNAME! “Instrument” BUDGET<=200,000 J21 [ JNAME = “Instrument” ] is not relevant. J22 BUDGET>200,000 BUDGET<=200,000 J31 J32 BUDGET>200,000 Relevant Irrelevant Application Information • Qualification Information – The fundamental qualification information consists of the predicates used in user queries (i.e., “where” clauses in SQL). – 80/20 rule: 20% of user queries account for 80% of the total data access. One should investigate the more important queries. • Quantitative Information – Minterm Selectivity sel(mi): number of tuples that would be accessed by a query specified according to a given minterm predicate. – Access Freequency acc(qi): the access frequency of queries in a given period. Qualitative information guides the fragmentation activity Quantitative information guides the allocation activity Determine the set of meaningful minterm predicates Applications: • Take the salary and determine a raise accordingly. • The employee records are managed in two places, one handling the records of those with salary less than or equal to $30,000 and the other handling the records of those who earn more than $30,000. Pr={p1: SAL<=30,000, p2: SAL>30,000} is complete and minimal. The minterm predicates: m1 : ( SAL 30,000) ( SAL 30,000) m 2 : ( SAL 30,000) ( SAL 30,000) m3 : ( SAL 30,000) ( SAL 30,000) m 4 : ( SAL 30,000) ( SAL 30,000) Implications: i1 : ( SAL 30,000) ( SAL 30,000) i 2 : ( SAL 30,000) ( SAL 30,000) i 3 : ( SAL 30,000) ( SAL 30,000) i 4 : ( SAL 30,000) ( SAL 30,000) i1 m1 is contradictory i 2 m 4 is contradictory Therefore, we are left with M = {m2, m3} Invalid Implications J JNO J1 J2 J3 J4 Simple predicates p1: LOC = “Montreal” p2: LOC = “New York” p3: LOC = “Orlando” p4: BUDGET ≤ 200,000 p5: BUDGET > 200,000 JNAME Instrumental Database Dev. CAD/CAM Maintenance BUDGET 150,000 135,000 250,000 350,000 VALID Implications i 1 : p 1 p 2 p 3 i 2 : p 2 p 1 p 3 i 3 : p 3 p 1 p 2 i 4 : p 4 p 5 i 5 : p 5 p 4 i 6 : p 4 p 5 i 7 : p 5 p 4 LOC Montreal New York New York Orlando INVALID Implications i 8 : LOC " Montreal" ( BUDGET 200,000) i 9 : LOC " Orlando " ( BUDGET 200,000) Implications should be defined according to the semantics of the database, not according to the current values. Compute Complete & Minimal Set Rule: a relation or fragment is partitioned into at least two parts which are accessed differently by at least one application. Relevant: a simple predicate which satisfies the above rule, is relevant. • Repeat until the predicate set is complete – – – – Find a simple predicate pi that is relevant Determine minterm fragments fi and fj according to pi Accept pi , fi , and fj Remove any pk and fk from acceptance list if pk becomes irrelevant /* the list is minimal */ • Determine the set of minterm predicates M (using the acceptance list) • Determine the set of implications I (among the acceptance list) • For each mi in M, remove mi if it is contradictory according to I Derived Horizontal Fragmentation Derived fragmentation is used to facilitate the join between fragments. In some cases, the horizontal fragmentation of a relation cannot be based on a property of its own attributes, but is derived from the horizontal fragmentation of another relation. Benefits of Derived Fragmentation PAY (TITLE, SAL) Primary Fragmentation: PAY 1 ( TITLE "Assistant Professor")( PAY ) PAY 2 ( TITLE " Associate Professor") ( PAY ) EMP (ENO, ENAME, TITLE) PAY 3 ( TITLE " Full Professor")( PAY ) Using Derived Fragmentation: PAY1 EMP1 EMP2 PAY2 EMP3 PAY3 EMP1 = EMP SJ PAY1 EMP2 = EMP SJ PAY2 EMP3 = EMP SJ PAY3 EMPi and PAYi can be allocated to the same site. Not using derived fragmentation: one can divide EMP into EMP1 and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3 based on SAL. To join EMP and PAY, we have the following scenarios. PAY1 EMP1 EMP2 PAY2 EMP3 PAY3 More communication overhead ! Chain Relationships • R1 (R1PK, …) R2 (R2PK, R1FK, …) R3 (R3PK, R2FK, …) ... • Design the primary fragmenation for R1. Derive the derived fragmentation for Rk as follows: • Rk = Rk SJRKFK=R(k-1)PK R(k-1) • for 2 k n in that order. Derived Fragmentation EMP (ENO, ENAME, TITLE) Join might be required PROJ (PNO, PNAME, BUDGET) EMP_PROJ (ENO, PNO, RESP, DUR) • How do we fragment EMP_PROJ ? – Semi-Join with EMP, or – Semi-Join with PROJ • Criterion: Suport the more-frequent join operation VERTICAL FRAGMENTATION Purpose: Identify fragments Ri such that many applications can be executed using just one fragment. Advantage: When many applications which use R1 and many applications which use R2 are issued at different sites, fragmenting R avoids communication overhead. A7 A1 R2 R1 Site 1 Site 2 Vertical partitioning is more complicated than horizontal partitioning: • Vertical Partitioning: The number of possible fragments is equal to mm where m is the number of nonprimary key attributes • Horizontal Partitioning: 2n possible minterm predicates can be defined, where n is the number of simple predicates in the complete and minimal set Pr. Vertical Fragmentation Approaches Greedy Heuristic Approaches: Split Approach: Global relations are progressively split into fragments. Grouping Approach: Attributes are progressively aggregated to constitute fragments. Correctness: Each attribute of R belongs to at least one fragment. Each fragment includes either a key of R or a “tuple identifier”. Vertical Clustering - Replication In evaluating the convenience of vertical clustering, it is important that overlapping attributes are not heavily updated. Example: EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM) Administrative Applications at Site 1 Applications at all sites Bad Fragmentation: NAME not available in EMP2 1. EMP1(ENUM,NAME,TAX,SAL) 2. EMP2(ENUM,MGRNUM,DNUM) Good Fragmentation: 1. EMP1(ENUM, NAME, TAX, SAL) 2. EMP2(ENUM, NAME, MGRNUM, DNUM) NAME is relatively stable Split Approach • Splitting is considered only for attributes that do not participate in the primary key. • The split approach involves three steps: 1. Obtain attribute affinity matrix. 2. Use a clustering algorithm to group some attributes together based on the attribute affinity matrix. This algorithm produces a clustered affinity matrix. 3. Use a partitioning algorithm to partition attributes such that set of attributes are accessed solely or for the most part by distinct set of applications. Attribute Usage Matrix PROJ PNO PNAME BUDGET LOC A1 q1: SELECT BUDGET FROM PROJ WHERE PNO=Value; q2: SELECT PNAME, BUDGET FROM PROJ; q3: SELECT PNAME FROM PROJ WHERE LOC=Value; q4: SELECT SUM(BUDGET) FROM PROJ WHERE Loc=Value A2 A3 use(qi,Aj) = A4 1 if Aj is referenced by qi 0 otherwise A1 A2 A3 A4 q1 1 0 1 q2 0 1 1 q3 0 1 0 q4 0 0 1 0 0 1 1 Attribute Usage Matrix Attribute Affinity Measure aff ( Ai, Aj ) ref (q ) acc (q ) s k , use ( qk , Ai ) 1 use ( qk , A j ) 1 s For each query qk that uses both Ai and Aj Popularity of using Ai and Aj together Relation R k s Popularity of such Ai-Aj pair at all sites Site n Site m Ai qi qk qi Ak Aj qi Site s ref s (qk ) refs(qk) : Number of accesses to attributes (Ai,Aj) for each execution of qk at site s k qk qi accs (qk ) accs (qk) : Application access frequency of qk at site s. Attribute Affinity Matrix aff ( Ai, Aj ) k , use ( qk , Ai ) s use ( qk , A j ) s ref (q ) acc (q ) s For each query qk that uses both Ai and Aj refs (qk): Number of accesses to attributes (Ai,Aj) for each execution of qk at site s accs (qk): Application access frequency of qk at site s. s k s k Popularity of such Ai-Aj pair at all sites A1 A2 A3 A4 A1 A2 A3 aff ( A2, A3) A4 Attribute Affinity Matrix Attribute Affinity Matrix Example A1 A2 A3 A4 q1 q2 q3 q4 1 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 Attribute Usage Matrix A1 A2 A1 A2 A3 A4 A3 A4 45 0 45 0 0 80 5 75 45 5 53 3 0 75 3 78 Attribute Affinity Matrix (AA) Next Step - Determine clustered affinity (CA) matrix Clustered Affinity Matrix Step 1: Initialize CA Copy first 2 columns A1 A2 A1 A2 A3 A4 A3 A4 A1 A1 45 0 A2 0 80 A3 45 5 A4 0 75 45 0 45 0 0 80 5 75 45 5 53 3 0 75 3 78 Attribute Affinity Matrix (AA) A2 A3 A4 Clustered Affinity Matrix (CA) Clustered Affinity Matrix Step 2: Determine Location for A3 3 possible positions for A3 A0 A1 A2 A3 A4 A1 A2 A1 A2 A3 A0 A3 A1 A3 A4 45 0 45 0 0 80 5 75 45 5 53 3 0 75 3 78 A5 Attribute Affinity Matrix (AA) A1 A3 A2 A0 A1 A2 A3 A4 A1 A2 45 0 0 80 45 5 0 75 A3 A4 A5 Clustered Affinity Matrix (CA) Clustered Affinity Matrix Step 2: Determine the order for A3 n bond ( Ax , Ay ) aff ( Az , Ax ) aff ( Az , Ay ) z 1 Contribution cont ( Ai , Ak , A j ) 2 bond ( Ai , Ak ) 2 bond ( Ak , A j ) 2 bond ( Ai , A j ) Cont(A0,A3,A1) = 8820 Cont(A1,A3,A2) = 10150 Cont(A2,A3,A4) = 1780 Since Cont(A1,A3,A2) is the greatest, [A1,A3,A2] is the best order. A1 A2 A1 A2 A3 A4 A3 A4 45 0 45 0 0 80 5 75 45 5 53 3 0 75 3 78 Attribute Affinity Matrix (AA) A1 A1 A2 A3 A4 A3 A2 A4 45 45 0 0 5 80 45 53 5 0 75 3 Clustered Affinity Matrix (CA) Note: aff(A0,Ai)=aff(Ai,A0)=aff(A5,Ai)=aff(Ai,A5)=0 by definition Clustered Affinity Matrix Step 2: Determine the order for A4 Since Cont(A3,A2,A4) is the biggest, [A3,A2,A4] is the best order. A1 A2 A1 A2 A3 A4 A3 A4 45 0 45 0 0 80 5 75 45 5 53 3 0 75 3 78 Attribute Affinity Matrix (AA) A1 A1 A2 A3 A4 A3 A2 A4 45 45 0 0 0 80 75 5 45 53 5 0 75 78 3 3 Clustered Affinity Matrix (CA) Clustered Affinity Matrix Step 3: Re-order the Rows The rows are organized in the same order as the columns. A1 A1 A2 A3 A4 A3 A2 A4 45 45 0 0 0 80 75 5 45 53 5 0 75 78 3 3 Clustered Affinity Matrix (CA) A1 A1 A3 A2 A4 A3 A2 A4 45 45 0 0 45 53 5 3 0 5 80 75 0 3 75 78 Clustered Affinity Matrix (CA) Partitioning Find the sets of attributes that are accessed, for the most part, by distinct sets of applications We look for a good dividing points along the diagnose Bad grouping since A1 and A2 are never accessed together A1 A3 A2 A4 A1 45 45 0 0 A3 45 53 5 3 A2 0 5 80 75 A4 0 3 75 78 Clustered Affinity Matrix (CA) Cluster 1: Cluster 2: A 1 & A3 A 2 & A4 Two vertical fragments: PROJ1(A1, A3) and PROJ2(A2, A4) A4 and A3 are usually not accessed together A4 and A2 are often accessed together MIXED FRAGMENTATION •Apply horizontal fragmentation to vertical fragments. •Apply vertical fragmentation to horizontal fragments. Example: Applications about work at each department reference tuples of employees in the departments located around the site with 80% probability. EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM) ENUM NAME TAX SAL ENUM NAME MGRNUM DNUM Jacksonville Orlando Miami Vertical fragmentation Horizontal Fragmentation (local work) i: fragment index j: site index k: application index ALLOCATION – Notations fkj: the frequency of application k at site j rki: the number of retrieval references of application k to fragment i. uki: the number of update references of application k to fragment i. nki = rki + uki Site j Fragment i rki uki Application k /w freq. fkj Allocation of Horizontal Fragments (1) No replication: Best Fit Strategy • The number of local references of Ri at site j is Benefit to Site j Bij f kj nki Number of Access by k k All applications k at Site j Frequency of application k • Ri is allocated at site j* such that Bij* is maximum. Advantage: A fragment is allocated to a site that needs it most. Disadvantage: It disregards the “mutual” effect of placing a fragment at a given site if a related fragment is also at that site. Allocation of Horizontal Fragments (2) All beneficial sites approach (replication) Fragment i Site j Bij f kj rki c f kj 'uki k Savings due to retrieval references j ' j k Cost of update references from other sites • Ri is allocated at all sites j* such that Bij* > 0. • When all Bij’s are negative, a single copy of Ri is placed at the site such that Bij* is maximum. Allocation of Horizontal Fragments (3) Another Replication Approach: di The degree of redundancy of Ri Fi The reliability and availability benefit of having Ri fully replicated. (di) The reliability and availability benefit when the fragment has di copies. (d i ) (1 21 d ) F i i (1) 0, (2) F i , (3) 3 F i , 2 4 The benefit of introducing a new copy of Ri at site j : Bij f kj rki c f kj 'uki (d i ) k k j ' j Same as All Beneficial Sites approach β Fi 1 Also takes into account the benefit of availability di Allocation of Vertical Fragments PSr A1 A3 Ri Rs Application type A1 at site PSr , that accesses only Rs B ist At A4 PSs PSt PS4 k Applications of type As at PSs As As f ks n ks k A2 k At f kt n kt f kt n kt A2 Rt A3 PSr ... k 2 f k Should we allocate fragment Rs to site PSs , and fragment Rt to site PSt ? An As PSn PSs n ki 4 l n k A3 Rs .. . f kl n ki A2 Rt A4 f ks n ks A1 ki A1 An At PS4 PSn Al This formula can be used within an exhaustive “splitting” algorithm by trying all possible combinations of sites s and t. PSt SUMMARY Design of a distributed DB consists of four phases: – Phase 1: Global schema design (same as in centralized DB design) – Phase 2: Fragmentation • Horizontal Fragmentation – Primary: Determent a complete and minimal set of predicates – Derived: Use semijoin • Vertical Fragmentation Identify fragments such that many applications can be executed using just one fragment. – Phase 3: Allocation The primary goal is to minize the number of remote accesses. – Phase 4: Physical schema design (same as in centralized DB design). Database Integration Bottom-up Design Overview • The design process in multidatabase systems is bottomup. – The individual databases actually exists – Designing the global conceptual schema (GCS) involves integrating these local databases into a multidatabase. • Database integration can occur in two steps: Schema Translation and Schema Integration. Database 1 Database 2 Database 3 Translator 1 Translator 2 Translator 3 InS1 Intermediate schema in canonical representation InS2 INTEGRATOR GCS InS3 Network Data Model (Review) • There are two basic data structures in the network model: records and sets. Record type: a group of records of the same type. Set type: indicates a many-to-one relationship in the direction of the arrow. DEPARTMENT (DEPT-NAME, BUDGET, MANAGER) Employs owner record type set type EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY) • Implementation of set instances: DEPARTMENT (owner record) Database Jones, L. Patel, J. EMPLOYEE (member records) Vu, K. member record type Example: Three Local Databases Database 1 (Relational Model): S (TITLE, SAL) E (ENO, ENAME, TITLE) J (JNO, JNAME, BUDGET, LOC, CNAME) G (ENO, JNO, RESP, DUR) Database 2 (Network Model): DEPARTMENT (DEPT_NAME, BUDGET, MANAGER) Employs Work Dummy Record Type Worksin EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY) Example: Three Local Databases Database 3 (ER Model): Engineer No. Engineer Name ENGINEER Title Project No. Responsibility N WORKS IN 1 Project Name Budget PROJECT N Salary Location CONTRACTED BY Duration Contract Date 1 CLIENT Client Name Address Schema Translation: Relational to ER S (TITLE, SAL) ENO E (ENO, ENAME, TITLE) J (JNO, JNAME, BUDGET, LOC, CNAME) ENAME N E N G (ENO, JNO, RESP, DUR) ENAME E TITLE J BUDGET LOC S RESP N SAL JNAME CNAME 1 TITLE ENO JNO M DUR PAY • E & J have a many-tomany relationship • E & S have a 1-to-many relationship RESP JNO JNAME M DUR SAL Treat salary as an attribute of an engineer entity J BUDGET CNAME LOC Relationships may be identified from the foreign keys defined for each relation. Schema Translation: Network to ER DEPARTMENT WORK EMPLOYEE N Works-in Employs WORK EMPLOYS Dummy record type DEPARTMENT 1 DEPARTMENT N EMPLOYS M M WORKS-IN 1 EMPLOYEE EMPLOYEE • Map each record type in the network schema to an entity and each set type to a relationship. • Network model uses dummy records in its representation of many-to-many relationships that need to be recognized during mapping. Schema Integration Schema integration follows the translation process and generates the GCS by integrating the intermediate schemas. – Identify the components of a database which are related to one another. • Two components can be related as (1) equivalent, (2) one contained in the other one, (3) overlapped, or (4) disjoint. – Select the best representation for the GCS. – Integrate the components of each intermediate schema. Integration Methodologies Integration Process Binary Ladder Balanced N-ary One-shot Iterative Binary: Decreases the potential integration complexity and lead toward automation techniques. One-shot: There is no implied priority for integration order of schemas, and the trade-off can be made among all schemas rather than among a few. Integration Process Schema integration occurs in a sequence of four steps: • Preintegration: establish the “rules” of the integration process before actual integration occurs. • Comparison: naming and structural conflicts are identified. • Conformation: resolve naming and structural conflicts • Merging and restructuring: all schemas must be merged into a single database schema and then restructured to create the “best integrated schema. Schema Integration: Preintegration 1. An integration method (binary or n-ary) must be selected and the schema integration order defined. – The order implicitly defines priorities. 2. Candidate keys in each schema are identified to enable the integrator to determine dependencies implied by the schemas. 3. The mapping or transformation rules should be described before integration begins. – e.g., mapping from degree Celsius in one schema to degrees Fahrenheit in another. Preintegration Example: InS1 Engineer No. Engineer Name ENGINEER Title Salary Project No. Responsibility N WORKS IN 1 Project Name Budget PROJECT N Location CONTRACTED BY Duration Contract Date 1 CLIENT Client Name Address Preintegration Example: InS2 & InS3 Name Dept-name E# Budget Title EMPLOYEE Address Eno N EMPLOYS 1 Manager DEPARTMENT InS2 Salary JNO Resp Ename Jname Budget ENGINEER Title Sal N EMPLOYS Dur M J Cname Loc InS3 Keys & Integration Order KEYS InS1 InS2 InS3 Integration method InS1: Engineer No. in ENGINEER Project No. in PROJECT Client name in CLIENT InS2: E# in EMPLOYEE Dept-name in DEPARTMENT InS3: Eno in E Jno in J Schema Comparison: Naming Conflict (1) Synonyms: two identical entities that have different names. InS1 ENGINEER Engineering No Engineer Name Salary WORKSIN Responsibility Duration PROJECT Project No Project Name Location InS3 E Eno Ename Sal G Resp Dur J Jno Jname Loc Schema Comparison: Naming Conflict (2) Homonyms: Two different entities that have identical names. • In InS1, ENGINEER.Title refers to the title of engineers. • In InS2, EMPLOYEE.Title refers to the title of all employees. domain (EMPLOYEE.Title) >> domain (ENIGNEREER.Title) Schema Comparison – Relation between Schemas • Two schemas can be related in four possible ways: – They can be identical to one another. – One can be a subset of the other. – Some components from one may occur in other while retaining some unique features – They could be completely different with no overlap. • An attribute in one schema may represent the same information as an entity in another one Schema Comparison Example • InS3 is a subset of InS2 E# Name ENGINEER EMPLOYEE Title Address IS-A relationship EMPLOYS Salary DEPARTMENT • Some parts of InS1 (about engineers) and InS3 (about engineers) occur in InS2 (about employees) Schema Comparison – Structural Conflicts (1) • Type conflicts: occur when the same object is represented by an attribute in one schema and by an entity in another schema. – The client of a project is modeled as an entity in InS1, however – the client is included as an attribute of the J entity in InS3 Resp EMPLOYS Dur JNO M N Jname CONTRACTED BY Budget J Cname InS3 Loc Contract Date 1 CLIENT Client Name Address PROJECT InS1 Schema Comparison – Structural Conflicts (2) This is 1-to-many Dependency conflicts: occur when different relationship modes are used to represent the same thing in different schemas. Eno Engineer No. Title ENGINEER Title N ENGINEER Sal N WORKS IN EMPLOYS Dur 1 PROJECT InS1 Salary This is many-to-many Resp Ename Project No. Engineer Name M J InS3 Schema Comparison: Structural Conflicts (3) • Key conflicts: occur when different candidate keys are available and different primary keys are selected in different schemas • Behavioral conflicts: are implied by the modeling mechanism, – e.g., deletion of the last employee causes the dissolution of the department. Conformation: Naming Conflicts Naming conflicts are resolved simply by renaming conflict ones. Homonyms: Synonyms: rename the schema of InS3 to conform to the naming of InS1. InS3 E Eno Engineering No Ename Engineering Name Sal Salary G Resp Dur Responsibility Duration J Jno Project No Jname Project Name Loc Location InS1 ENGINEER Engineering No Engineer Name Salary WORKSIN Responsibility Duration PROJECT Project No Project Name Location • Prefix each attribute by the name of the entity to which it belong, e.g., ENGINEER.Title EMPLOYEE.Title • and prefix each entity by the name of the schema to which it belongs. e.g., InS1.ENGINEER InS2.EMPLOYEE Resolving Structural Conflicts Transforming entities/attributes/relationships among one another Engineer No. InS3 Project No. Responsibility Engineer Name Project Name Budget ENGINEER Title N Salary Engineer No. M PROJECT Salary Project No. Responsibility Engineer Name N Location Client Name Duration ENGINEER Title WORKS IN WORKS IN M Budget PROJECT N C-P Duration Example: Transform the attribute Client name in InS3 to an entity C to make InS3 conform to the presentation of InS1. Project Name M Client Name C Location New InS3 Schema Integration: Merging & Restructuring Merging requires that the information contained in the participating schemas be retained in the integrated schema. Merging using the IS-A relationship InS2 (Employees) InS3 InS1 (Engineers) (Engineers) Use InS3 as the final schema since it is more general in terms of the C-P relationship (i.e., many-to-many) (next page) Integrate InS1 & InS3 Engineer No. InS1 Engineer Name ENGINEER Title Project No. Responsibility N WORKS IN Salary 1 ENGINEER Title Responsibility Engineer Name Salary N WORKS IN M PROJECT N CONTRACTED BY Duration M Client Name C 1 CLIENT Client Name Project Name Address Budget Location InS3 Budget Location CONTRACTED BY Contract Date Engineer No. PROJECT N Duration Project No. Project Name InS3 is more general Merging & Restructuring Example Final Result: ENGINEER Duration Responsibility N WORKS IN M Project No. Project Name Budget PROJECT Location E# Name EMPLOYEE N EMPLOYS 1 InS2 Title Address Manager CLIENT Client name InS1/InS3 Address SAL DEPARTMENT Budget CONTRACTED BY Dept-name Unfortunately, Conformation and restructuring stages are an art rather then a science Query Processing in Multidatabase Systems Query Processing in Three Steps 1. Global query is decomposed into local queries Schema Integration Local Schema 1 Local Schema 2 Local Schema 3 Translator 1 Translator 2 Translator 3 InS1 InS2 Q1,1 Q1,2 InS3 Q1,3 INTEGRATOR Q1 GCS Query Processing in Three Steps 2. Each local query is translated into queries over the corresponding local database system Schema Integration Local Schema 1 Q’1,1 Local Schema 2 Q’1,3 Q’1,2 Translator 1 Translator 2 InS1 Local Schema 3 Translator 3 InS2 Q1,1 Q1,2 InS3 Q1,3 INTEGRATOR Q1 GCS Query Processing in Three Steps 3. Results of the local queries are combined into the answer Schema Integration Local Schema 1 Q’1,1 Translator 2 InS1 Combine Local Schema 3 Q’1,3 Q’1,2 Translator 1 Final answer Local Schema 2 Translator 3 InS2 Q1,1 Q1,2 InS3 Q1,3 INTEGRATOR Q1 GCS Query Processing in Three Steps 1. Global query is decomposed into local queries 2. Each local query is translated into queries over the corresponding local database system 3. Results of the local queries are combined into the answer Schema Integration Local Schema 1 Local Schema 2 Local Schema 3 Translator 1 Translator 2 Translator 3 InS1 InS2 INTEGRATOR GCS InS3 Outline • Overview of major query processing components in multidatabase systems: – Query Decomposition – Query Translation – Global Query Optimization • Techniques for each of the above components Query Decomposition Query Decomposition Overview Global Query Query decomposition & global optimization SQ1 Query translator 1 TQ1 DB1 SQ2 Query translator 2 TQ2 DB2 ... … ... SQn Query translator n TQn DBn PQ1 … PQn SQi export-schema subquery in global query language TQi target query (local subquery) in local query language PQi postprocessing query used to combine results returned by subqueries to form the answer Assumptions • We use the object-oriented data model to present a query decomposition algorithm • To simplify the discussion, we assume that there are only two export schemas: ES1 Emp1: SSN Name Salary Age ES2 Emp2: SSN Name Salary Rank Definitions • type: Given a class C, the type of C denoted by type(C ), is the set of attributes defined for C and their corresponding domains. • world: the world of C, denoted by world(C ), is the set of realworld objects described by C. • extension: the extension of C, denoted by extension(C ), is the set of instances contained in C. World Type Extension A Class Schema Integration • Integration through outerjoin • Integration through outerunion (generalization) Review: Outerjoin The outerjoin of relation R1 and R2 (R1 ⋈o R2 ) is the union of three components: – the join of R1 and R2, – dangling tuples of R1 padded with null values, and – dangling tuples of R2 padded with null values. Outerjoin Example EmpO = Emp1 ⋈o Emp2 Emp1 OID SSN Name Salary Age 3 6789 Smith 90,000 40 4 4321 Chang 62,000 30 5 8642 Patel 75,000 35 Emp2 OID SSN Name Salary Rank 1 2222 Ahad 98,000 S. Mgr. 2 7531 Wang 95,000 S. Mgr. 3 6789 Smith 25,000 Mgr. OID SSN Name Salary Age Rank 1 2222 Ahad 98,000 null S. Mgr. 2 7531 Wang 95,000 mull S. Mgr. 3 6789 Smith Inconsistent 40 Mgr. 4 4321 Chang 62,000 30 null 5 8642 Patel 75,000 35 null Dangling Tuple Dangling Tuple Outerunion Emp1 EmpG = Emp1 Uo Emp2 OID SSN Name Salary Age 3 6789 Smith 90,000 40 4 4321 Chang 62,000 30 5 8642 Patel 75,000 35 Emp2 OID SSN Name Salary Age Rank 1 2222 Ahad 98,000 null S. Mgr. 2 7531 Wang 95,000 mull S. Mgr. 3 6789 Smith Conflict null Mgr. 3 6789 Smith Conflict 40 null OID SSN Name Salary Rank 4 4321 Chang 62,000 30 null 1 2222 Ahad 98,000 S. Mgr. 5 8642 Patel 75,000 35 null 2 7531 Wang 95,000 S. Mgr. 3 6789 Smith 25,000 Mgr. Schema Integration Using Outerjoin Two classes C1 and C2 can be integrated by equi-outerjoining the two classes on the OID to form a new class C. – extension(C ) = extension(C1 ) ⋈o extension(C2 ) – type(C ) = type(C1 ) ⋃ type(C2 ) – world(C ) = world(C1 ) ⋃ world(C2 ) C1 C2 C Schema Integration thru Generalization Two classes C1 and C2 can be integrated by generalizing the two classes to form the superclass C. Generalization type(C ) = type(C1 ) ⋂ type(C2 ) Outer union extension(C ) = ᅲtype(C) [extension(C1 ) ⋃o extension(C2 )] world(C ) = world(C1 ) ⋃ world(C2 ) Generalization Example Emp1: SSN Name Salary Age Emp2: SSN Name Salary Rank EmpG: SSN Name Salary Generalization • Emp1 and Emp2 will also appear in the global schema since not all information in Emp1 and Emp2 is retained in EmpG EmpG SSN Name Salary Emp1 Age More specific Rank Emp2 Inconsistency Resolution • The schema integration techniques work as long as there is no data inconsistency • If data inconsistency occurs, aggregate functions may be used to resolve the problem. Inconsistency Resolution Example Export Schemas Emp1: SSN Name Salary Age Emp2: SSN Name Salary Rank Integrated Schema EmpG: SSN Name Salary Generalization Aggregate Functions - Examples: EmpO: SSN or Name Salary Age Rank Outer join EmpG.Name = Emp1.Name, if EmpG is in world(Emp1) = Emp2.Name, if EmpG is in world(Emp2) – world(Emp1) EmpG.Salary = Emp1.Salary, if EmpG is in world(Emp1) – world(Emp2) = Emp2.Salary, ifEmpG is in world(Emp2) – world(Emp1) = Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ⋂ world(Emp2) EmpO.Age = Emp1.Age, if EmpO is in world(Emp1) = Null, if EmpO is in world(Emp2) – world(Emp1) EmpO.Rank = Emp2.Rank, if EmpO is in world(Emp2) = Null, if EmpO is in world(Emp1) – world(Emp2) world(Emp2) – world(Emp1) world(Emp1) – world(Emp2) world(Emp1) ⋂ world(Emp2) World (Emp1) World (Emp2) Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From EmpO.Name, EmpO.Rank EmpO Where EmpO.Salary > 80,000 AND EmpO.Age > 35 Assume Outerjoin is used for schema integration Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency. Inconsistency Function: Option 1 (based on Salary) part. 1: world(Emp1) – world(Emp2) part. 2: world(Emp2) – world(Emp1) part. 3: world(Emp1) ⋂ world(Emp2) EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) – world(Emp2) = Emp2.Salary, if EmpO is in world(Emp2) – world(Emp1) world(Emp1) 1 3 2 world(Emp2) = Sum(Emp1.Salary,Emp2.Salary), if EmpO is in world(Emp1) ⋂ world(Emp2) Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From EmpO.Name, EmpO.Rank EmpO Where EmpO.Salary > 80,000 AND EmpO.Age > 35 Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency. Inconsistency Function: EmpO.Age = Emp1.Age, if EmpO is in world(Emp1) = Null, if EmpO is in world(Emp2) – world(Emp1) Option 2 (based on Age) part. 1: world(Emp1) part. 2: world(Emp2) – world(Emp1) world(Emp1) 1 2 world(Emp2) Query Decomposition Step 1: Determine Number of Subqueries Global Query Select From EmpO.Name, EmpO.Rank EmpO Where EmpO.Salary > 80,000 AND EmpO.Age > 35 Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency. Option 1 (based on Salary) Option 2 (based on Age) part. 1: world(Emp1) – world(Emp2) part. 2: world(Emp2) – world(Emp1) part. 3: world(Emp1) ⋂ world(Emp2) part. 1: world(Emp1) part. 2: world(Emp2) – world(Emp1) world(Emp1) 1 world(Emp1) 3 2 world(Emp2) 1 2 world(Emp2) We use Option 1 since it is the finest partition among all the partitions. Query Decomposition Another Example Option 1: Option 2: world(Emp1) 1 world(Emp1) 2 1 world(Emp2) world(Emp2) Use finer partition (Option 3): world(Emp1) 1 3 2 2 world(Emp2) Query Decomposition Step 2: Query Decomposition Global Query: Select EmpO.Name, EmpO.Rank From EmpO Where EmpO.Salary > 80,000 AND EmpO.Age > 35 Partition: 1 world(Emp1) 3 2 part. 1: Select Emp1.Name From Emp1 Where Emp1.Salary > 80,000 AND Emp1.Age > 35 AND Emp1.SSN NOT IN (Select Emp2.SSN From Emp2) part. 2: This subquery is discarded because EmpO.Age is Null. world(Emp2) part. 3: Select Emp1.Name, Emp2.Rank Query Decomposition: Obtain From Emp1, Emp2 a query for each subset in Where Sum(Emp1.Salary, the chosen partition. Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND EmpO.Age = Emp1.Age, if EmpO is in world(Emp1) = Null, if EmpO is in world(Emp2) – world(Emp1) Emp1.SSN = Emp2.SSN EmpO.Salary = Emp1.Salary, if EmpG is in world(Emp1) – world(Emp2) = Emp2.Salary, ifEmpG is in world(Emp2) – world(Emp1) = Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ⋂ world(Emp2) Query Decomposition Step 2: Query Decomposition Global Query: Select EmpO.Name, EmpO.Rank From EmpO Where EmpO.Salary > 80,000 AND EmpO.Age > 35 Query Decomposition: Obtain a query for each subset in the chosen partition. Emp1.Salary Emp1.Age 1 3 Emp1.Salary + Emp2.Salary Emp1.Age world(Emp1) 2 Emp2.Salary Age = null world(Emp2) part. 1: Select Emp1.Name From Emp1 Where Emp1.Salary > 80,000 AND Emp1.Age > 35 AND Emp1.SSN NOT IN (Select Emp2.SSN From Emp2) part. 2: This subquery is discarded because EmpO.Age is Null. part. 3: Select Emp1.Name, Emp2.Rank From Emp1, Emp2 Where Sum(Emp1.Salary, Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND Emp1.SSN = Emp2.SSN Query Decomposition Step 3: Further Decomposition STEP 3: Some resulting query may still reference data from more than one database. They need to be further decomposed into subqueries and possibly also postprocessing queries Before STEP 3: Select Emp1.Name From Emp1 Where Emp1.Salary > 80,000 and Emp1. Age > 35 and Emp1.SSN NOT IN (Select Emp2.SSN From Emp2) Select Emp1.Name From Emp1 Where Emp1.Salary > 80,000 and Emp1. Age > 35 and Emp1.SSN NOT IN X X Insert INTO X Select Emp2.SSN From Emp2) Query Decomposition Step 4: Query Optimization STEP 4: It may be desirable to reduce the number of subqueries by combining subqueries for the same database. Query Translation Query Translation (1) IF THEN Global Query Language ≠ Local Query Language Export Schema Subquery Translator Local Query Language Query Translation (2) IF the source query language has a higher expressive power THEN EITHER – Some source queries cannot be translated; or – they must be translated using both • the syntax of the target query language, and • some facilities of a high-level programming language. Example: A recursive OODB query may not be translated into a relational query using SQL alone. Relation-to-OO Translation OODB Schema: Auto OID Color Manufacturer Company OID Name Profit Headquarter President Equivalent Relational Schema: People OID Name Hometown Automobile Age City OID Name State Foreign key Auto (Auto-OID, Color, Company-OID) Company (Company-OID, Name, Profit, City-OID, People-OID) People (People-OID, Name, Age, City-OID, Auto-OID) City (City-OID, Name, State) Relational-to-OO Example (1) Global Query: Select Auto1.* From 1 2 3 4 5 6 Auto Auto1, Auto Auto2, Company, People, City City1, City City2 Where Auto1.Conmpany-OID = Company.Company-OID AND Company.People-OID = People.People-OID AND People.Age = 52 AND People.Auto-OID = Auto2.Auto-OID AND Auto2.Color = “red” AND People.City-OID = City1.City-OID AND City1.Name = City2.Name AND Company.City-OID = City2.City-OID Relational Predicate Graph: Auto1 1) Company-OID Company 2) People-OID City2 People Age=52 3) Auto-OID City1 Auto2 Color=red Find all red cars own by a 52 year old who is the President 1+2+3of the car manufacturer and lives in the same city of the car manufacturer 4+5+6 Relational-to-OO Example (2) OO Predicate Graph: Auto1 Company-OID Company People-OID City1 Auto1 Company 2) People-OID City2 People Age=52 3) Auto-OID City1 Auto2 Color=red Age=52 Auto-OID Relational Predicate Graph: 1) Company-OID People City2 Auto2 Color=red Relational-to-OO Example (3) OO Predicate Graph: Auto1 Company-OID Company People-OID City1 Predicate 3 People Predicate 1 Age=52 Auto-OID City2 Auto2 Color=red Predicate 2 OO Query: Where Auto.Manufacturer.President.Age = 52 AND Auto.Manufacturer.President.Automobile.Color = red AND Auto.Manufacturer.Headquarter.Name = Auto.Manufacturer.President.Hometown.Name Global Query Optimization Query Optimization (1) CASE 1: A single target query is generated IF the target database system has a query optimizer THEN the query optimizer can be used to optimize the translated query ELSE the translator has to consider the performance issues Query Optimization (2) CASE 2: A set of target queries is needed. • It might pay to have the minimum number of queries – It minimizes the number of invocations of the target system – It may also reduce the cost of combining the partial results • It might pay for a set to contain target queries that can be well coordinated – The results or intermediate results of the queries processed earlier can be used to reduce the cost of processing the remaining queries Global Query Optimization (1) • A query obtained by the query modification process may still reference data from more than one database. Example: part. 3 (i.e., world(Emp1) ⋂ world(Emp2)) on page 126 Select Emp1.Name, Emp2.Rank From Emp1, Emp2 /* access two databases Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND Emp1.SSN = Emp2.SSN → Some global strategy is needed to process such queries Global Query Optimization (2) • Select Emp1.Name, Emp2.Rank From Emp1, Emp2 /* access two databases Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND Emp1.SSN = Emp2.SSN → Some global strategy is needed to process such queries Site 1 Site 1 Emp1 Emp1 Site 2 Site 1 Emp2 Emp1 form result form result 1+2 form result Emp2 Emp2 Site 3 Site 2 Site 2 Data Inconsistency • If C is integrated from C1 and C2 with no data inconsistency on attribute A, then бA op a (C) = бA op a (C1) ⋃ бA op a (C2) • If A has data inconsistency, then the above equality may no longer hold. Example: Consider the select operation EmpO OID SSN Name Salary Age Rank 1 2222 Ahad 98,000 null S. Mgr. 2 7531 Wang 95,000 mull S. Mgr. 3 6789 Smith Inconsistent 40 Mgr. 4 4321 Chang 62,000 30 null 5 8642 Patel 75,000 35 null бEmpO.Salary > 100,000 (EmpO) The correct answer should have the record for Smith. However, the above query returns an empty set Smith does have a combined salary greater than 100,000 Data Inconsistency - Optimization Express an outerjoin (or a generalization) as outer-unions as follows: C1 ⋈o C2 = C1-O ⋃o C2-O ⋃o (C1-C ⋈OID C2-C) C1-O: Those tuples of C1 that have no matching tuples in C2 (private part) C1-C: Those tuples of C1 that have matching tuples in C2 (overlap part) бA op a (C1 ⋈o C2 ) = бA op a (C1-O) ⋃o бA op a (C2-O) ⋃o бA op a (C1-C ⋈ C2-C) Can we improve this term ? Distribution of Selections (1) бA op a (C1 ⋈o C2 ) = бA op a (C1-O) ⋃o бA op a (C2-O) ⋃o бA op a (C1-C ⋈ C2-C) When can we dustribute б over ⋈ ? Expensive operation Attribute A is defined by an aggregate function (see page 124) Distribution of Selection (2) Four cases were identified when all arguments of the aggregate function (for resolving conflicts) are non-negative 1. f(A1,A2) op a ≡ A1 op a AND A2 op a: Aggregate function бA op a (C1-C ⋈ C2-C) = бA op a (C1-C) ⋈ бA op a ( C2-C) Example: max(Emp1-C.Salary, Emp2-C.Salary) < 30K ≡ Emp1-C.Salary < 30K AND Emp2-C.Salary < 30K An aggregate function 2. f(A1,A2) op a ≡ f(A1 op a, A2 op a) op a: бA op a(C1-C ⋈ C2-C) = бA op a(бA1 op a(C1-C) ⋈ бA2 op a(C2-C)) Example: sum(Emp1-C.Salary, Emp2-C.Salary) < 30K ≡ sum(Emp1-C.Salary < 30K, Emp2-C.Salary < 30K) < 30K Distribution of Selection (3) 3. f(A1,A2) op a ≡ f(A1 op’ a, A2 op’ a) op a: бA op a(C1-C ⋈ C2-C) = бA op a(бA1 op’ a(C1-C) ⋈ бA2 op’ a(C2-C)) Example: sum(Emp1-C.Salary, Emp2-C.Salary) = 30K ≡ sum(Emp1-C.Salary ≤ 30K, Emp2-C.Salary ≤ 30K) = 30K 4. No improvement is possible: Example: sum(Emp1-C.Salary, Emp2-C.Salary) > 30K Distribution Rules for б over ⋈ бA op a(C1-C ⋈ C2-C) f op > ≥ ≤ < = ≠ in Not in sum(A1, A2) 4 4 2 2 3 4 4 4 avg(A1, A2) 4 4 2 2 3 4 4 4 max(A1, A2) 4 4 1 1 3 4 4 4 min(A1, A2) 1 1 4 4 3 4 4 4 No improvement possible Problem in Global Query Optimization (1) Important information about local entity sets that is needed to determine global query processing plans may not be provided by the local database systems. – Example: cardinalities availability of fast access paths – Techniques: • Sampling queries may be designed to collect statistics about the local databases. • A monitoring system can be used to collect the completion time for subqueries. This can be used to better estimate subsequent subqueries. Problems in Global Query Optimization (2) • Different query processing algorithms may have been used in different local database systems. → Cooperation across different systems difficult Examples: Semijoin may not be supported on some local systems. • Data transmission between different local database systems may not be fully supported. Examples: – A local database system may not allow update operations – For many nonrelational systems, the instances of one entity set are more likely to be clustered with the instances of other entity sets. Such clustering makes it very expensive to extract data for one entity set. → Need more sophisticated decomposition algorithms.