Distributed Databases

Distributed Database Systems
COP5711
What is a Distributed Database System ?
A distributed database is a collection of databases which are distributed
over different computers of a computer network.
• Each site has autonomous processing capability and can perform local
applications.
• Each site also participates in the execution of at least one global
application which requires accessing data at several sites.
Multiprocessor Database Computers
[Figure: an application (front-end) computer connected to an interface processor (IFP) and several access processors (ACPs); no single processor can run an application by itself.]
What is missing here is the existence of local applications: the
system is so tightly integrated that none of its computers
(i.e., IFPs & ACPs) is capable of executing an application by itself.
Why Distributed Databases ?
1. Local Autonomy: permits setting and enforcing local policies regarding
the use of local data (suitable for organizations that are inherently
decentralized).
2. Improved Performance: regularly used data is kept close to its
users, and queries can exploit the parallelism inherent in distributed systems.
3. Improved Reliability/Availability:
• Data replication can be used to obtain higher reliability and
availability.
• The autonomous processing capability of the different sites
ensures a graceful degradation property.
4. Incremental Growth: supports a smooth incremental growth with a
minimum degree of impact on the already existing sites.
5. Shareability: allows preexisting sites to share data.
6. Reduced Communication Overhead: The fact that many applications
are local clearly reduces the communication overhead with respect to
centralized databases.
Disadvantages of DDBSs
Cost: replication of effort (manpower).
Security: more difficult to control.
Complexity:
• Data may be duplicated, mainly for reliability and
efficiency reasons. Data redundancy, however,
complicates update operations.
• If some sites fail while an update is being executed, the
system must make sure that the effects are
reflected on the data residing at the failed sites as
soon as the system recovers from the failure.
• The synchronization of transactions on multiple sites is
considerably harder than in a centralized system.
Distributed DBMS
Architecture
Network Transparency
• The user should be protected from the
operational details of the network.
• It is desirable to hide even the existence
of the network, if possible.
– Location transparency: the command used is
independent of the system on which the data is
stored.
– Naming transparency: a unique name is
provided for each object in the database.
Replication & Fragmentation Transparency
• The user is unaware of the replication of
fragments.
• Queries are specified on the relations
(rather than the fragments).
[Figure: relation R is split into fragments R1, R2, R3, and R4; R1 and R2 each have two copies, and the fragments and copies are stored across Sites A, B, and C.]
ANSI/SPARC Architecture
[Figure: three levels. External schema (several external views), conceptual schema (conceptual view), internal schema (internal view).]
Internal view: deals with the physical definition and organization of data.
Conceptual view: abstract definition of the database. It is the “real
world” view of the enterprise being modeled in the database.
External view: individual user’s view of the database.
A Taxonomy of Distributed Data Systems
A distributed database
can be defined as
• a logically integrated
collection of shared
data which is
• physically distributed
across the nodes of a
computer network.
Distributed data systems
• Homogeneous
• Heterogeneous (Multidatabase)
– Federated
• Loosely coupled (interoperable DB systems using export schema)
• Tightly coupled (w/ global schema)
– Unfederated (no local users)
Architecture of a Homogeneous DDBMS
A homogeneous DDBMS resembles a centralized DB, but instead of storing all
the data at one site, the data is distributed across a number of
sites in a network.
[Figure: schema levels, top to bottom: global user views 1..n; global schema; fragmentation schema; allocation schema; local conceptual schemas 1..n; local internal schemas 1..n; local DBs 1..n.]
Fragmentation Schema & Allocation Schema
Fragmentation Schema: describes how the global
relations are divided into fragments.
Allocation Schema: specifies at which sites each
fragment is stored.
Example: Fragmentation of global relation R into fragments A, B, C, D, and E.
To materialize R, the following
operations are required:
$R = (A \bowtie B) \cup (C \bowtie D) \cup E$
Homogeneous vs. Heterogeneous
• Homogeneous DDBMS
– No local users
– Most systems do not have
local schemas (i.e., every user
uses the same schema)
• Heterogeneous DDBMS
– There are both local and
global users
– Multidatabase systems are
split into:
• Tightly Coupled Systems:
have a global schema
• Loosely Coupled Systems:
do not have a global
schema.
[Figure: a global user accesses a multidatabase management system that sits on top of several DBMSs (Databases 1 to 4); local users access individual DBMSs directly.]
Schema Architecture of a Tightly-Coupled System
An individual node’s
participation in the MDB
is defined by means of a
participation schema.
[Figure: global user views 1..n are defined over the global conceptual schema; below it, each site has a local participation schema and an auxiliary schema, then its local conceptual schema, local internal schema, and local DB; local user views are attached at each site.]
Auxiliary Schema (1)
The auxiliary schema describes the rules which
govern the mappings between the local and
global levels.
• Rules for unit conversion: may be required when
one site expresses distance in kilometers and
another in miles, …
• Rules for handling null values: may be necessary
where one site stores additional information which
is not stored at another site.
– Example: One site stores the name, home address and
telephone number of its employees, whereas another just
stores names and addresses.
Auxiliary Schema (2)
• Rules for naming conflicts: naming conflicts occur when:
– semantically identical data items are named differently
• DNAME → Department name (at Site 1)
• DEPTNAME → Department name (at Site 2)
– semantically different data items are named identically.
• NAME → Department name (at Site 1)
• NAME → Manager name (at Site 2)
• Rules for handling data representation conflicts: Such
conflicts occur when semantically identical data items
are represented differently in different data sources.
– Example: Data represented as a character string in one
database may be represented as a real number in the other
database.
Auxiliary Schema (3)
Rules for handling data scaling conflicts: Such
conflicts occur when semantically identical
data items are stored in different databases
using different units of measure.
– Example: “Large”, “New”, “Good”, etc.
These problems are called
domain mismatch problems.
Loosely-Coupled Systems
(Interoperable Database Systems)
[Figure: global user views 1 to 3 and local user views 1 and 2 sit above local conceptual schemas 1..n, which map to local internal schemas 1..n and local DBs 1..n; there is no global schema.]
Loosely-Coupled Systems
[Figure: global user views 1..m are defined over export schemas 1..n; each export schema is derived from a local conceptual schema (1..n), which maps to a local internal schema and a local DB; local user views 1 and 2 are attached to the local conceptual schemas.]
Integration of Heterogeneous Data Models
• Provide bidirectional translators between all
pairs of models
– Advantage: support multiple models at the global level.
No need to learn another data model and language
– Disadvantage: requires n(n-1) translators, where n is
the number of different models.
• Adopt a single model (called canonical model) at
the global level and map all the local models onto
this model
– Advantage: requires only 2n translators
– Disadvantage: translations must go through the global
model.
(The 2nd approach is more widely used)
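The translator counts above can be compared numerically. The following is a minimal sketch; the function names are illustrative and do not come from the slides.

```python
def pairwise_translators(n: int) -> int:
    """One translator per ordered pair of distinct data models: n(n-1)."""
    return n * (n - 1)


def canonical_translators(n: int) -> int:
    """One translator to and one from the canonical model per data model: 2n."""
    return 2 * n


for n in (2, 3, 5, 10):
    print(n, pairwise_translators(n), canonical_translators(n))
```

The two counts coincide at n = 3; for larger n the canonical-model approach needs fewer translators.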
Distributed Database Design
• Top-Down Approach: The database system is
being designed from scratch.
– Issues: fragmentation & allocation
• Bottom-Up Approach: Integrating existing
databases into one database
– Issues: Design of the export and global
schemas.
TOP-DOWN DESIGN PROCESS
[Figure: top-down design flow. Requirements Analysis produces the System Requirements (Objectives). These feed Conceptual Design (entity analysis + functional analysis) and View Design (defining the interfaces for end users), which are linked by view integration and produce the Global Conceptual Schema, Access Information, and External Schema Definitions. Distribution Design (fragmentation & allocation) then produces the Local Conceptual Schemas, and Physical Design maps the local conceptual schemas to physical storage devices, producing the Physical Schema.]
Design Consideration (1)
The organization of distributed systems can be
investigated along three dimensions:
Level of sharing
1. No sharing: Each application and its data
execute at one site.
2. Data sharing: Programs are replicated at all
sites, but data files are not.
3. Data + Program Sharing: Both data and
programs may be shared.
Design Consideration (2)
Access Pattern
1. Static: Access patterns do not change.
2. Dynamic: Access patterns change over
time.
Level of Knowledge
1. No information
2. Partial information: Access patterns may
deviate from the predictions.
3. Complete information: Access patterns
can reasonably be predicted.
Fragmentation Alternatives
Relation J:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Paris

Horizontal Partitioning

Fragment J1:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York

Fragment J2:
JNO  JNAME          BUDGET   LOC
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Paris

Vertical Partitioning

Fragment J1:
JNO  BUDGET
J1   150,000
J2   135,000
J3   250,000
J4   350,000

Fragment J2:
JNO  JNAME          LOC
J1   Instrumental   Montreal
J2   Database Dev.  New York
J3   CAD/CAM        New York
J4   Maintenance    Paris
Why fragment at all?
Reasons:
• Interquery concurrency
• Intraquery concurrency
Disadvantages:
• Vertical fragmentation may incur overhead.
• Attributes participating in a dependency may be
allocated to different sites.
– Integrity checking is more costly.
Degree of Fragmentation
• Application views are usually subsets of
relations. Hence, it is only natural to
consider subsets of relations as
distribution units.
• The appropriate degree of fragmentation
is dependent on the applications.
Correctness Rules
• Vertical Partitioning
• Lossless
decomposition
• Dependency
preservation
• Horizontal Partitioning
• Disjoint fragments
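The correctness rules can be checked mechanically on the example relation J. The sketch below assumes the budget threshold of 200,000 as the horizontal split and the attribute groups (JNO, BUDGET) and (JNO, JNAME, LOC) as the vertical split; the helper names are illustrative only.

```python
# Example relation J, using the base-relation values from the slides.
J = [
    {"JNO": "J1", "JNAME": "Instrumental", "BUDGET": 150_000, "LOC": "Montreal"},
    {"JNO": "J2", "JNAME": "Database Dev.", "BUDGET": 135_000, "LOC": "New York"},
    {"JNO": "J3", "JNAME": "CAD/CAM", "BUDGET": 250_000, "LOC": "New York"},
    {"JNO": "J4", "JNAME": "Maintenance", "BUDGET": 350_000, "LOC": "Paris"},
]


def horizontal(relation, predicate):
    """Select the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def vertical(relation, attributes):
    """Project every tuple onto the given attributes (the key must be included)."""
    return [{a: t[a] for a in attributes} for t in relation]


def join_on_key(left, right, key):
    """Rebuild tuples by matching the two vertical fragments on the key."""
    index = {t[key]: t for t in right}
    return [{**t, **index[t[key]]} for t in left if t[key] in index]


def as_set(relation):
    """Order-independent representation used for comparisons."""
    return {tuple(sorted(t.items())) for t in relation}


h1 = horizontal(J, lambda t: t["BUDGET"] <= 200_000)
h2 = horizontal(J, lambda t: t["BUDGET"] > 200_000)
v1 = vertical(J, ["JNO", "BUDGET"])
v2 = vertical(J, ["JNO", "JNAME", "LOC"])

# Horizontal: the union rebuilds J and the fragments are disjoint.
horizontal_ok = (as_set(h1) | as_set(h2)) == as_set(J) and not (as_set(h1) & as_set(h2))
# Vertical: joining on the key is a lossless decomposition.
vertical_ok = as_set(join_on_key(v1, v2, "JNO")) == as_set(J)
print(horizontal_ok, vertical_ok)
```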
Allocation Alternatives
• Partitioning: No replication
• Partial Replication: Some
fragments are replicated
• Full Replication: Database
exists in its entirety at
each site
Notations
S (TITLE, SAL)
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC)
G (ENO, JNO, RESP, DUR)
Links: L1 from S to E; L2 from E to G; L3 from J to G.
L1: 1-to-many relationship
S: Owner(L1), Source relation
E: Member(L1), Target relation
Simple Predicates
Given a relation R(A1, A2, …, An) where Ai has domain Di, a
simple predicate pj defined on R has the form
$p_j: A_i \;\theta\; \mathit{Value}$
where $\theta \in \{=, <, \ne, \le, >, \ge\}$ and $\mathit{Value} \in D_i$.
Example:
Relation J:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Orlando
Simple predicates:
p1: JNAME = “Maintenance”
p2: BUDGET < 200,000
Note: A simple predicate defines a data fragment.
MINTERM PREDICATE
Given a set of simple predicates for relation R,
$P = \{p_1, p_2, \ldots, p_m\}$,
the set of minterm predicates
$M = \{m_1, m_2, \ldots, m_n\}$
is defined as
$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in P} p_j^{*} \,\}$
where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$.
Example relation S:
TITLE         SAL
Elect. Eng.   40,000
Syst. Analy.  54,000
Mech. Eng.    32,000
Programmer    42,000
Possible simple predicates:
p1: TITLE = “Elect. Eng.”
p2: TITLE = “Syst. Analy.”
p3: TITLE = “Mech. Eng.”
p4: TITLE = “Programmer”
p5: SAL ≤ 35,000
p6: SAL > 35,000
Some corresponding minterm predicates:
$m_1: \text{TITLE} = \text{“Elect. Eng.”} \wedge \text{SAL} \le 35{,}000$
$m_2: \text{TITLE} = \text{“Elect. Eng.”} \wedge \text{SAL} > 35{,}000$
A minterm predicate defines a data fragment.
Primary Horizontal Fragmentation
A primary horizontal fragmentation is defined by a selection
operation on the owner relations of a database schema.
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC)
G (ENO, JNO, RESP, DUR)
Links: L2 from E to G; L3 from J to G.
Owner(L3) = J
A possible fragmentation of J is defined as follows:
$J_1 = \sigma_{\text{BUDGET} \le 200{,}000}(J)$
$J_2 = \sigma_{\text{BUDGET} > 200{,}000}(J)$
Horizontal Fragments
Thus, a horizontal fragment Ri of relation R
consists of all the tuples of R that satisfy a
minterm predicate mi.
There are as many horizontal fragments
(also called minterm fragments) as there are
minterm predicates.
Completeness (1)
A set of simple predicates Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
[Figure: simple predicates A1 ≥ k1, A2 = k2, A3 ≤ k3, and A4 = k4 define minterm fragments F1, F2, and F3, which are accessed by the applications.]
Complete ⇒ the fragments look homogeneous.
Completeness (2)
[Figure: the same simple predicates A1 ≥ k1, A2 = k2, A3 ≤ k3, and A4 = k4 and minterm fragments F1, F2, and F3, now with applications that do not access all tuples of a fragment uniformly.]
The set of simple predicates is incomplete.
Completeness (3)
[Figure: an additional simple predicate, A5 > k5, splits fragment F3 into F31 and F32.]
With the additional simple predicate, the set is now complete.
Completeness (4)
A set of simple predicates Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
Case 1: The only application that accesses
J wants to access the tuples according to
the location.
$J_1 = \sigma_{\text{LOC} = \text{“Montreal”}}(J)$
$J_2 = \sigma_{\text{LOC} = \text{“New York”}}(J)$
$J_3 = \sigma_{\text{LOC} = \text{“Orlando”}}(J)$
The set of simple predicates
Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Orlando”}
is complete because each tuple of each
fragment has the same probability of
being accessed.
Completeness (5)
Example:
Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Orlando”}

Fragment J1:
JNO  JNAME          BUDGET   LOC
001  Instrumental   150,000  Montreal

Fragment J2:
JNO  JNAME          BUDGET   LOC
004  GUI            135,000  New York
007  CAD/CAM        250,000  New York

Fragment J3:
JNO  JNAME          BUDGET   LOC
003  Database Dev.  310,000  Orlando

Case 2: There is a second application which accesses only those
project tuples where the budget is at most $200,000.
• Since tuple “004” is accessed more frequently than tuple
“007”, Pr is not complete.
• To make the set complete, we need to add
(BUDGET ≤ 200,000) and (BUDGET > 200,000) to Pr.
Completeness (6)
[Figure: J is first fragmented by LOC into J1 (Montreal), J2 (New York), and J3 (Orlando); each Ji is then split into Ji1 (BUDGET ≤ 200,000, used by the small-budget applications) and Ji2 (BUDGET > 200,000).]
Note: Completeness is a
desirable property because a
complete set defines
fragments that are not only
logically uniform, in that they
all satisfy the minterm
predicate, but also statistically
homogeneous.
Redundant Fragmentation
Logically
uniform &
statistically
homogeneous
fragment
Fragment 1
Fragment 2
• Fragments 1 and 2 have the same
characteristics
• The fragmentation is unnecessary
Minimality
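Relevant:
Let mi and mj be two almost identical minterm predicates:
mi = p1 Λ p2 Λ p3 (fragment fi)
mj = p1 Λ ¬p2 Λ p3 (fragment fj)
p2 is relevant if and only if
$\frac{acc(m_i)}{card(f_i)} \ne \frac{acc(m_j)}{card(f_j)}$
where acc is the access frequency and card is the cardinality.
[Figure: f is restricted by p1 and p3, then split by p2 into fi and by ¬p2 into fj; an application A accesses fi with probability Prob1 and fj with probability Prob2, with Prob1 ≠ Prob2.]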
That is, there should be at least one application that accesses fi and fj
differently.
i.e., The simple predicate pi should be relevant in determining a
fragmentation.
Minimal:
If all the predicates of a set Pr are relevant, Pr is minimal.
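The relevance test compares access frequency per tuple in the two fragments. A minimal sketch follows; the numbers in the usage lines are hypothetical, and `is_relevant` is an illustrative name rather than anything defined in the slides.

```python
from fractions import Fraction


def is_relevant(acc_i: int, card_i: int, acc_j: int, card_j: int) -> bool:
    """A predicate is relevant if the two fragments it separates have
    different access frequencies per tuple: acc(mi)/card(fi) != acc(mj)/card(fj)."""
    return Fraction(acc_i, card_i) != Fraction(acc_j, card_j)


# Hypothetical figures: fragment fi is accessed 30 times per period, fj 10 times,
# and both hold 10 tuples, so the split separates "hot" from "cold" tuples.
print(is_relevant(30, 10, 10, 10))   # True
# Same access rate per tuple in both fragments: the split is unnecessary.
print(is_relevant(20, 10, 40, 20))   # False
```

Exact fractions are used so that equal ratios are not misjudged because of floating-point rounding.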
A Complete and Minimal Example
Two applications:
1. One application accesses the tuples according
to location.
2. Another application accesses only those project
tuples where the budget is at most $200,000.
Case 1: Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Orlando”,
BUDGET<=200,000, BUDGET>200,000} is
complete and minimal.
Case 2: If, however, we were to add the predicate
JNAME=“Instrumentation” to Pr, the resulting
set would not be minimal, since the new predicate
is not relevant with respect to the applications.
[Figure: the LOC and BUDGET fragmentation tree of J (J1, J2, J3, each split into Ji1 for BUDGET ≤ 200,000 and Ji2 for BUDGET > 200,000), with J12 further split into J121 (JNAME = “Instrument”) and J122 (JNAME ≠ “Instrument”). The LOC and BUDGET predicates are marked relevant; JNAME = “Instrument” is marked not relevant.]
Application Information
• Qualitative Information
– The fundamental qualitative information
consists of the predicates used in user
queries (i.e., “where” clauses in SQL).
– 80/20 rule: 20% of user queries account
for 80% of the total data access.
One should therefore investigate the more
important queries.
– Qualitative information guides the
fragmentation activity.
• Quantitative Information
– Minterm selectivity sel(mi): number of
tuples that would be accessed by a query
specified according to a given minterm
predicate.
– Access frequency acc(qi): the access
frequency of queries in a given period.
– Quantitative information guides the
allocation activity.
Determine the set of meaningful minterm predicates
Applications:
• Take the salary and determine a raise accordingly.
• The employee records are managed in two places, one handling the
records of those with salary less than or equal to $30,000 and the other
handling the records of those who earn more than $30,000.
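Pr = {p1: SAL ≤ 30,000, p2: SAL > 30,000} is complete and minimal.
The minterm predicates:
$m_1: (\text{SAL} \le 30{,}000) \wedge (\text{SAL} > 30{,}000)$
$m_2: (\text{SAL} \le 30{,}000) \wedge \neg(\text{SAL} > 30{,}000)$
$m_3: \neg(\text{SAL} \le 30{,}000) \wedge (\text{SAL} > 30{,}000)$
$m_4: \neg(\text{SAL} \le 30{,}000) \wedge \neg(\text{SAL} > 30{,}000)$
Implications:
$i_1: (\text{SAL} \le 30{,}000) \Rightarrow \neg(\text{SAL} > 30{,}000)$
$i_2: \neg(\text{SAL} \le 30{,}000) \Rightarrow (\text{SAL} > 30{,}000)$
$i_3: (\text{SAL} > 30{,}000) \Rightarrow \neg(\text{SAL} \le 30{,}000)$
$i_4: \neg(\text{SAL} > 30{,}000) \Rightarrow (\text{SAL} \le 30{,}000)$
$i_1 \Rightarrow m_1$ is contradictory
$i_2 \Rightarrow m_4$ is contradictory
Therefore, we are left with
M = {m2, m3}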
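The elimination of contradictory minterms can be reproduced with a short script. This sketch does not encode the implications i1 to i4 symbolically; as a swapped-in simplification, it tests each minterm for satisfiability over a sample of salary values (an assumption that is adequate for these two threshold predicates). All names are illustrative.

```python
from itertools import product

# The two simple predicates from the slide, as functions of SAL.
simple_predicates = {
    "p1": lambda sal: sal <= 30_000,
    "p2": lambda sal: sal > 30_000,
}


def minterms(names):
    """All 2^n sign assignments: True keeps p_j, False uses its negation."""
    return [dict(zip(names, signs))
            for signs in product([True, False], repeat=len(names))]


def satisfiable(minterm, predicates, domain):
    """A minterm is kept if at least one domain value satisfies every literal."""
    return any(
        all(predicates[name](value) == sign for name, sign in minterm.items())
        for value in domain
    )


# Assumed sample of the SAL domain; it includes values on both sides of 30,000.
sample_salaries = range(0, 100_001, 1_000)

all_minterms = minterms(list(simple_predicates))
valid = [m for m in all_minterms
         if satisfiable(m, simple_predicates, sample_salaries)]

for m in valid:
    print(m)
```

The enumeration order is m1, m2, m3, m4 as on the slide, so the two surviving assignments correspond to m2 and m3.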
Invalid Implications
Relation J:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Orlando
Simple predicates
p1: LOC = “Montreal”
p2: LOC = “New York”
p3: LOC = “Orlando”
p4: BUDGET ≤ 200,000
p5: BUDGET > 200,000
VALID Implications
$i_1: p_1 \Rightarrow \neg p_2 \wedge \neg p_3$
$i_2: p_2 \Rightarrow \neg p_1 \wedge \neg p_3$
$i_3: p_3 \Rightarrow \neg p_1 \wedge \neg p_2$
$i_4: p_4 \Rightarrow \neg p_5$
$i_5: p_5 \Rightarrow \neg p_4$
$i_6: \neg p_4 \Rightarrow p_5$
$i_7: \neg p_5 \Rightarrow p_4$
INVALID Implications
$i_8: \text{LOC} = \text{“Montreal”} \Rightarrow \neg(\text{BUDGET} > 200{,}000)$
$i_9: \text{LOC} = \text{“Orlando”} \Rightarrow \neg(\text{BUDGET} \le 200{,}000)$
Implications should be
defined according to the
semantics of the database,
not according to the
current values.
Compute Complete & Minimal Set
Rule: a relation or fragment is partitioned into at least two parts which are
accessed differently by at least one application.
Relevant: a simple predicate which satisfies the above rule is relevant.
• Repeat until the predicate set is complete:
– Find a simple predicate pi that is relevant
– Determine minterm fragments fi and fj according to pi
– Accept pi, fi, and fj
– Remove any pk and fk from the acceptance list if pk becomes
irrelevant /* the list is minimal */
• Determine the set of minterm predicates M (using
the acceptance list)
• Determine the set of implications I (among the
acceptance list)
• For each mi in M, remove mi if it is contradictory
according to I
Derived Horizontal Fragmentation
Derived fragmentation is used to facilitate the
join between fragments.
In some cases, the horizontal fragmentation of a
relation cannot be based on a property of its own
attributes, but is derived from the horizontal
fragmentation of another relation.
Benefits of Derived Fragmentation
PAY (TITLE, SAL)
EMP (ENO, ENAME, TITLE)
Primary Fragmentation:
$PAY_1 = \sigma_{\text{TITLE} = \text{“Assistant Professor”}}(PAY)$
$PAY_2 = \sigma_{\text{TITLE} = \text{“Associate Professor”}}(PAY)$
$PAY_3 = \sigma_{\text{TITLE} = \text{“Full Professor”}}(PAY)$
Using Derived Fragmentation:
$EMP_1 = EMP \ltimes PAY_1$
$EMP_2 = EMP \ltimes PAY_2$
$EMP_3 = EMP \ltimes PAY_3$
[Figure: each EMPi is joined only with its matching PAYi.]
EMPi and PAYi can be allocated
to the same site.
Not using derived fragmentation: one can divide EMP into EMP1
and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3
based on SAL. To join EMP and PAY, we have the following
scenarios.
[Figure: without derived fragmentation, an EMP fragment may have to be joined with several PAY fragments.]
More communication
overhead!
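The contrast above can be made concrete with a small sketch of derived fragmentation by semijoin. Only the schemas PAY(TITLE, SAL) and EMP(ENO, ENAME, TITLE) come from the slides; the tuples and function names below are hypothetical.

```python
# Hypothetical sample data for PAY(TITLE, SAL) and EMP(ENO, ENAME, TITLE).
PAY = [
    ("Assistant Professor", 70_000),
    ("Associate Professor", 85_000),
    ("Full Professor", 110_000),
]
EMP = [
    ("E1", "Lee", "Assistant Professor"),
    ("E2", "Diaz", "Full Professor"),
    ("E3", "Khan", "Associate Professor"),
    ("E4", "Moore", "Assistant Professor"),
]


def select_by_title(pay, title):
    """Primary horizontal fragment of PAY: a selection on TITLE."""
    return [row for row in pay if row[0] == title]


def semijoin_on_title(emp, pay_fragment):
    """Derived fragment of EMP: the EMP tuples that join with the PAY fragment."""
    titles = {title for title, _sal in pay_fragment}
    return [row for row in emp if row[2] in titles]


titles = ["Assistant Professor", "Associate Professor", "Full Professor"]
pay_fragments = [select_by_title(PAY, t) for t in titles]
emp_fragments = [semijoin_on_title(EMP, frag) for frag in pay_fragments]

for pay_frag, emp_frag in zip(pay_fragments, emp_fragments):
    print(pay_frag, emp_frag)
```

Because every EMP tuple lands in exactly one EMPi, and that fragment joins only with PAYi, each pair can be placed at one site and joined locally.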
Chain Relationships
• R1 (R1PK, …)
R2 (R2PK, R1FK, …)
R3 (R3PK, R2FK, …)
...
• Design the primary
fragmentation for R1.
Derive the derived
fragmentation for Rk as
follows:
– $R_k = R_k \ltimes_{R_k.\mathrm{FK} = R_{k-1}.\mathrm{PK}} R_{k-1}$
– for $2 \le k \le n$, in that order.
Derived Fragmentation
EMP (ENO, ENAME, TITLE)
PROJ (PNO, PNAME, BUDGET)
EMP_PROJ (ENO, PNO, RESP, DUR)
A join of EMP_PROJ with either EMP or PROJ might be required.
• How do we fragment EMP_PROJ?
– Semi-join with EMP, or
– Semi-join with PROJ
• Criterion: Support the more frequent join
operation
VERTICAL FRAGMENTATION
Purpose: Identify fragments Ri such that
many applications can be executed using
just one fragment.
Advantage: When many applications that
use R1 and many applications that use R2
are issued at different sites, fragmenting
R avoids communication overhead.
[Figure: fragments R1 and R2 of R stored at Site 1 and Site 2.]
Vertical partitioning is more complicated than horizontal
partitioning:
• Vertical Partitioning: The number of possible fragments is
approximately $m^m$, where m is the number of nonprimary key
attributes
• Horizontal Partitioning: $2^n$ possible minterm predicates can
be defined, where n is the number of simple predicates in the
complete and minimal set Pr.
Vertical Fragmentation Approaches
Greedy Heuristic Approaches:
Split Approach: Global relations are
progressively split into fragments.
Grouping Approach: Attributes are
progressively aggregated to constitute
fragments.
Correctness:
Each attribute of R belongs to at least one
fragment.
Each fragment includes either a key of R or a
“tuple identifier”.
Vertical Clustering - Replication
In evaluating the convenience of vertical
clustering, it is important that overlapping
attributes are not heavily updated.
Example:
EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
Administrative Applications
at Site 1
Applications
at all sites
Bad Fragmentation: NAME not available in EMP2
1.
EMP1(ENUM,NAME,TAX,SAL)
2. EMP2(ENUM,MGRNUM,DNUM)
Good Fragmentation:
1.
EMP1(ENUM, NAME, TAX, SAL)
2. EMP2(ENUM, NAME, MGRNUM, DNUM)
NAME is
relatively
stable
Split Approach
• Splitting is considered only for attributes that do
not participate in the primary key.
• The split approach involves three steps:
1. Obtain the attribute affinity matrix.
2. Use a clustering algorithm to group some attributes
together based on the attribute affinity matrix. This
algorithm produces a clustered affinity matrix.
3. Use a partitioning algorithm to partition the attributes
such that each set of attributes is accessed solely, or for
the most part, by a distinct set of applications.
Attribute Usage Matrix
PROJ (PNO, PNAME, BUDGET, LOC), with A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC

q1: SELECT BUDGET
    FROM PROJ
    WHERE PNO=Value;
q2: SELECT PNAME, BUDGET
    FROM PROJ;
q3: SELECT PNAME
    FROM PROJ
    WHERE LOC=Value;
q4: SELECT SUM(BUDGET)
    FROM PROJ
    WHERE LOC=Value;

use(qi, Aj) = 1 if Aj is referenced by qi, 0 otherwise

     A1  A2  A3  A4
q1    1   0   1   0
q2    0   1   1   0
q3    0   1   0   1
q4    0   0   1   1
Attribute Usage Matrix
Attribute Affinity Measure
$aff(A_i, A_j) = \sum_{k \,:\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} \; \sum_{s} ref_s(q_k) \cdot acc_s(q_k)$
The outer sum ranges over each query qk that uses both Ai and Aj; the inner
sum ranges over all sites s. The result measures the popularity of using
Ai and Aj together at all sites.
refs(qk): number of accesses to
attributes (Ai, Aj) for each
execution of qk at site s
accs(qk): application access
frequency of qk at site s.
[Figure: relation R with attributes Ai, Aj, and Ak, accessed by queries qi and qk issued at sites n, m, and s.]
Attribute Affinity Matrix
$aff(A_i, A_j) = \sum_{k \,:\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} \; \sum_{s} ref_s(q_k) \cdot acc_s(q_k)$
The outer sum ranges over each query qk that uses both Ai and Aj; the inner sum
gives the popularity of such an Ai-Aj pair at all sites.
refs(qk): number of accesses
to attributes (Ai, Aj)
for each execution
of qk at site s
accs(qk): application access
frequency of qk at
site s.
[Matrix: rows and columns A1 to A4; entry (i, j) holds aff(Ai, Aj), e.g. row A2, column A3 holds aff(A2, A3).]
Attribute Affinity Matrix Example
Attribute Usage Matrix
     A1  A2  A3  A4
q1    1   0   1   0
q2    0   1   1   0
q3    0   1   0   1
q4    0   0   1   1

Attribute Affinity Matrix (AA)
     A1  A2  A3  A4
A1   45   0  45   0
A2    0  80   5  75
A3   45   5  53   3
A4    0  75   3  78
Next Step - Determine clustered affinity (CA) matrix
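The AA matrix can be recomputed from the usage matrix. The slides do not list the per-site access frequencies, so this sketch assumes one reference per execution and uses total frequencies of 45, 5, 75, and 3 for q1 to q4; these totals are inferred from the off-diagonal AA entries, each of which is produced by a single query.

```python
# Attribute usage matrix from the slides: rows q1..q4, columns A1..A4.
use = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]

# Assumption: ref_s(q_k) * acc_s(q_k) summed over all sites, inferred from AA.
acc_total = [45, 5, 75, 3]


def affinity_matrix(use, acc_total):
    """aff(Ai, Aj) = sum of the access totals of every query that uses both Ai and Aj."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for row, acc in zip(use, acc_total):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += acc
    return aa


AA = affinity_matrix(use, acc_total)
for row in AA:
    print(row)
```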
Clustered Affinity Matrix
Step 1: Initialize CA
Copy the first 2 columns of AA into CA.

Attribute Affinity Matrix (AA)
     A1  A2  A3  A4
A1   45   0  45   0
A2    0  80   5  75
A3   45   5  53   3
A4    0  75   3  78

Clustered Affinity Matrix (CA), columns A3 and A4 not yet placed
     A1  A2
A1   45   0
A2    0  80
A3   45   5
A4    0  75
Clustered Affinity Matrix
Step 2: Determine Location for A3
There are 3 possible positions for A3 in CA: [A3 A1 A2], [A1 A3 A2], and [A1 A2 A3].
A0 and A5 denote the empty boundary columns to the left and right of the matrix.

Clustered Affinity Matrix (CA) so far
     A1  A2
A1   45   0
A2    0  80
A3   45   5
A4    0  75
Clustered Affinity Matrix
Step 2: Determine the order for A3
$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) \cdot aff(A_z, A_y)$
Contribution:
$cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)$
Cont(A0,A3,A1) = 8820
Cont(A1,A3,A2) = 10150
Cont(A2,A3,A4) = 1780
Since Cont(A1,A3,A2) is the greatest, [A1,A3,A2] is the best order.

Clustered Affinity Matrix (CA), column A4 not yet placed
     A1  A3  A2
A1   45  45   0
A2    0   5  80
A3   45  53   5
A4    0   3  75
Note: aff(A0,Ai)=aff(Ai,A0)=aff(A5,Ai)=aff(Ai,A5)=0 by definition
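The three contribution values can be checked directly from the AA matrix. In this sketch a boundary column (A0 or A5) is represented by `None`, whose bond with any attribute is 0; the value 1780 is obtained when the right-hand neighbour of A3 is treated as such an empty boundary, since A4 has not been placed yet. The function names are illustrative.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]
A1, A2, A3, A4 = 0, 1, 2, 3


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay); 0 at a boundary."""
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, i, k, j):
    """Net contribution of placing Ak between Ai and Aj."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


print(cont(AA, None, A3, A1))  # 8820
print(cont(AA, A1, A3, A2))    # 10150
print(cont(AA, A2, A3, None))  # 1780
```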
Clustered Affinity Matrix
Step 2: Determine the order for A4
Since Cont(A3,A2,A4) is the biggest, [A3,A2,A4] is the best order.

Clustered Affinity Matrix (CA)
     A1  A3  A2  A4
A1   45  45   0   0
A2    0   5  80  75
A3   45  53   5   3
A4    0   3  75  78
Clustered Affinity Matrix
Step 3: Re-order the Rows
The rows are organized in the same order as the columns.

Clustered Affinity Matrix (CA) before re-ordering the rows
     A1  A3  A2  A4
A1   45  45   0   0
A2    0   5  80  75
A3   45  53   5   3
A4    0   3  75  78

Clustered Affinity Matrix (CA) after re-ordering the rows
     A1  A3  A2  A4
A1   45  45   0   0
A3   45  53   5   3
A2    0   5  80  75
A4    0   3  75  78
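Steps 1 to 3 can be combined into one routine that places each remaining column at its best position and then re-orders the rows. This is a minimal sketch of that greedy placement under the bond and contribution definitions given in the slides; ties are resolved in favour of the leftmost position, which is an assumption of this sketch.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Bond between two columns; a boundary column (None) has bond 0."""
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, i, k, j):
    """Net contribution of placing column k between columns i and j."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


def clustered_order(aa):
    """Start with the first two columns, then insert each remaining column
    at the position with the greatest contribution."""
    order = [0, 1]
    for k in range(2, len(aa)):
        best_pos, best_val = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            value = cont(aa, left, k, right)
            if best_val is None or value > best_val:
                best_pos, best_val = pos, value
        order.insert(best_pos, k)
    return order


order = clustered_order(AA)                              # column indices, 0 = A1
CA = [[AA[r][c] for c in order] for r in order]          # rows follow the columns
print(["A%d" % (i + 1) for i in order])
for row in CA:
    print(row)
```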
Partitioning
Find the sets of attributes
that are accessed, for the
most part, by distinct sets
of applications.
We look for a good dividing
point along the diagonal.

Clustered Affinity Matrix (CA)
     A1  A3  A2  A4
A1   45  45   0   0
A3   45  53   5   3
A2    0   5  80  75
A4    0   3  75  78

A4 and A3 are usually not accessed together; A4 and A2 are often accessed together.
Grouping A1 with A2 would be a bad grouping, since A1 and A2 are never
accessed together.
Cluster 1: A1 & A3
Cluster 2: A2 & A4
Two vertical fragments:
PROJ1(A1, A3) and PROJ2(A2, A4)
MIXED FRAGMENTATION
• Apply horizontal fragmentation to vertical fragments.
• Apply vertical fragmentation to horizontal fragments.
Example: Applications about work at each department reference tuples
of employees in the departments located around the site with 80%
probability.
EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
[Figure: vertical fragmentation of EMP into (ENUM, NAME, TAX, SAL) and (ENUM, NAME, MGRNUM, DNUM), combined with horizontal fragmentation by site (Jacksonville, Orlando, Miami) for local work.]
ALLOCATION – Notations
i: fragment index
j: site index
k: application index
fkj: the frequency of
application k at site j
rki: the number of retrieval
references of application k
to fragment i.
uki: the number of update
references of application k
to fragment i.
nki = rki + uki
[Figure: application k, issued at site j with frequency fkj, makes rki retrieval and uki update references to fragment i.]
Allocation of Horizontal Fragments (1)
No replication: Best Fit Strategy
• The number of local references of Ri at site j (the benefit to site j) is
$B_{ij} = \sum_{k} f_{kj} \, n_{ki}$
where the sum ranges over all applications k at site j, fkj is the frequency of
application k, and nki is the number of accesses by k.
• Ri is allocated at site j* such that Bij* is maximum.
Advantage: A fragment is allocated to a site that needs it most.
Disadvantage: It disregards the “mutual” effect of placing a
fragment at a given site if a related fragment is also at that
site.
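The best-fit rule can be sketched as follows. The frequency and reference tables are hypothetical; `f[k][j]` and `n[k][i]` follow the notation fkj and nki from the slides, and `best_fit_site` is an illustrative name.

```python
def best_fit_site(f, n, i):
    """Return (site, benefits) for fragment i, where
    benefits[j] = sum over applications k of f[k][j] * n[k][i]."""
    apps = range(len(f))
    sites = range(len(f[0]))
    benefits = [sum(f[k][j] * n[k][i] for k in apps) for j in sites]
    return max(sites, key=lambda j: benefits[j]), benefits


# Hypothetical data: 2 applications, 3 sites, 2 fragments.
f = [
    [10, 0, 2],   # application 0: frequency at sites 0, 1, 2
    [0, 5, 5],    # application 1
]
n = [
    [3, 1],       # application 0: references to fragments 0, 1
    [1, 4],       # application 1
]

for i in range(2):
    site, benefits = best_fit_site(f, n, i)
    print("fragment", i, "-> site", site, benefits)
```

Each fragment is placed independently, which is exactly the limitation noted above: the sketch ignores whether a related fragment already sits at the chosen site.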
Allocation of Horizontal Fragments (2)
All beneficial sites approach (replication)
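For fragment i and site j:
$B_{ij} = \sum_{k} f_{kj} \, r_{ki} - c \times \sum_{k} \sum_{j' \ne j} f_{kj'} \, u_{ki}$
The first term is the savings due to retrieval references at site j; the second
term is the cost of update references from other sites, where c is a constant
relating the cost of an update access to that of a retrieval access.
• Ri is allocated at all sites j* such that Bij* > 0.
• When all Bij’s are negative, a single copy of Ri is
placed at the site such that Bij* is maximum.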
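A minimal sketch of the all-beneficial-sites rule is given below. The input tables and the value of c are hypothetical; `f[k][j]`, `r[k][i]`, and `u[k][i]` follow the slide notation, and the function name is illustrative.

```python
def all_beneficial_sites(f, r, u, i, c):
    """Return (chosen_sites, benefits) for fragment i.
    benefits[j] = retrieval savings at site j minus c times the cost of
    update references issued from all other sites."""
    apps = range(len(f))
    sites = range(len(f[0]))
    benefits = []
    for j in sites:
        savings = sum(f[k][j] * r[k][i] for k in apps)
        update_cost = sum(f[k][jp] * u[k][i]
                          for k in apps for jp in sites if jp != j)
        benefits.append(savings - c * update_cost)
    chosen = [j for j in sites if benefits[j] > 0]
    if not chosen:
        # No site is beneficial: keep a single copy at the least bad site.
        chosen = [max(sites, key=lambda j: benefits[j])]
    return chosen, benefits


# Hypothetical data: 2 applications, 3 sites, 1 fragment.
f = [[10, 0, 2], [0, 5, 5]]
r = [[3], [1]]
u = [[0], [1]]

print(all_beneficial_sites(f, r, u, i=0, c=2))    # ([0, 2], [10, -5, 1])
print(all_beneficial_sites(f, r, u, i=0, c=100))  # ([2], [-970, -495, -489])
```

Raising c makes updates more expensive, so fewer sites qualify for a copy; with a large enough c the rule falls back to a single copy.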
Allocation of Horizontal Fragments (3)
Another Replication Approach:
di: the degree of redundancy of Ri
Fi: the reliability and availability benefit of having Ri fully replicated
β(di): the reliability and availability benefit when the fragment has di
copies
$\beta(d_i) = (1 - 2^{1 - d_i}) \, F_i$
$\beta(1) = 0, \quad \beta(2) = F_i / 2, \quad \beta(3) = 3 F_i / 4, \; \ldots$
The benefit of introducing a new copy of Ri at site j:
$B_{ij} = \sum_{k} f_{kj} \, r_{ki} - c \times \sum_{k} \sum_{j' \ne j} f_{kj'} \, u_{ki} + \beta(d_i)$
The first two terms are the same as in the all-beneficial-sites approach; the
last term also takes into account the benefit of availability.
[Figure: β plotted against di, rising from 0 at di = 1 toward Fi.]
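The availability term can be evaluated exactly with rational arithmetic. This is a small sketch; the value F = 8 in the usage lines is hypothetical.

```python
from fractions import Fraction


def beta(d: int, F: int) -> Fraction:
    """Reliability/availability benefit of d copies: (1 - 2^(1 - d)) * F."""
    return (1 - Fraction(2) ** (1 - d)) * F


F = 8  # hypothetical benefit of full replication
for d in (1, 2, 3, 4):
    print(d, beta(d, F))
```

Each extra copy adds half as much benefit as the previous one, so β approaches F without reaching it.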
Allocation of Vertical Fragments
[Figure: fragment Ri at site PSr is split into Rs and Rt. Application types As (at site PSs), At (at site PSt), A1 (at site PSr, accessing only Rs), A2, A3, ..., An are shown at sites PSr, PSs, PSt, PS4, ..., PSn.]
Should we allocate fragment Rs
to site PSs, and fragment Rt to
site PSt?
[Equation: the benefit B_ist of this allocation, a sum of frequency × reference terms (f·n) over the application types above; the individual terms are not legible here.]
This formula can be used within an exhaustive “splitting”
algorithm by trying all possible combinations of sites s and t.
SUMMARY
Design of a distributed DB consists of four phases:
– Phase 1: Global schema design (same as in centralized DB
design)
– Phase 2: Fragmentation
• Horizontal Fragmentation
– Primary: Determine a complete and minimal set of predicates
– Derived: Use semijoin
• Vertical Fragmentation
Identify fragments such that many applications can be executed
using just one fragment.
– Phase 3: Allocation
The primary goal is to minimize the number of remote accesses.
– Phase 4: Physical schema design (same as in centralized DB
design).
Database Integration
Bottom-up Design
Overview
• The design process in
multidatabase systems is
bottom-up.
– The individual databases
already exist
– Designing the global
conceptual schema (GCS)
involves integrating these
local databases into a
multidatabase.
• Database integration can
occur in two steps:
Schema Translation and
Schema Integration.
[Figure: Databases 1, 2, and 3 pass through Translators 1, 2, and 3 to produce intermediate schemas InS1, InS2, and InS3 in canonical representation; the INTEGRATOR merges them into the GCS.]
Network Data Model (Review)
• There are two basic data structures in the network
model: records and sets.
Record type: a group of records of the same type.
Set type: indicates a many-to-one relationship in the direction of the arrow.
[Figure: owner record type DEPARTMENT (DEPT-NAME, BUDGET, MANAGER) linked by set type Employs to member record type EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY).]
• Implementation of set instances:
[Figure: the DEPARTMENT owner record “Database” linked to its EMPLOYEE member records Jones, L.; Patel, J.; and Vu, K.]
Example: Three Local Databases
Database 1 (Relational Model):
S (TITLE, SAL)
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC, CNAME)
G (ENO, JNO, RESP, DUR)
Database 2 (Network Model):
DEPARTMENT (DEPT_NAME, BUDGET, MANAGER)
EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)
[Figure: DEPARTMENT and EMPLOYEE are connected through the dummy record type Work by the set types Employs and Worksin.]
Example: Three Local Databases
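Database 3 (ER Model):
[Figure: ER diagram. Entity ENGINEER (Engineer No., Engineer Name, Title, Salary); entity PROJECT (Project No., Project Name, Budget, Location); entity CLIENT (Client Name, Address). Relationship WORKS IN between ENGINEER (N) and PROJECT (1), with attributes Responsibility and Duration; relationship CONTRACTED BY between PROJECT (N) and CLIENT (1), with attribute Contract Date.]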
Schema Translation: Relational to ER
S (TITLE, SAL)
ENO
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC, CNAME)
ENAME
N
E
N
G (ENO, JNO, RESP, DUR)
ENAME
E
TITLE
J
BUDGET
LOC
S
RESP
N
SAL
JNAME
CNAME
1
TITLE
ENO
JNO
M
DUR
PAY
• E & J have a many-to-many relationship
• E & S have a 1-to-many
relationship
RESP
JNO
JNAME
M
DUR
SAL
Treat salary as
an attribute of
an engineer
entity
J
BUDGET
CNAME
LOC
Relationships may be identified from
the foreign keys defined for each
relation.
Schema Translation: Network to ER
DEPARTMENT
WORK
EMPLOYEE
N
Works-in
Employs
WORK
EMPLOYS
Dummy
record type
DEPARTMENT
1
DEPARTMENT
N
EMPLOYS
M
M
WORKS-IN
1
EMPLOYEE
EMPLOYEE
• Map each record type in the network schema to an entity
and each set type to a relationship.
• Network model uses dummy records in its representation of
many-to-many relationships that need to be recognized
during mapping.
Schema Integration
Schema integration follows the translation
process and generates the GCS by
integrating the intermediate schemas.
– Identify the components of a database which
are related to one another.
• Two components can be related as (1) equivalent, (2)
one contained in the other one, (3) overlapped, or (4)
disjoint.
– Select the best representation for the GCS.
– Integrate the components of each
intermediate schema.
Integration Methodologies
Integration
Process
Binary
Ladder
Balanced
N-ary
One-shot
Iterative
Binary: Decreases the
potential integration
complexity and leads toward
automation techniques.
One-shot: There is no
implied priority for
integration order of
schemas, and the trade-off
can be made among all
schemas rather than among
a few.
Integration Process
Schema integration occurs in a sequence of four
steps:
• Preintegration: establish the “rules” of the integration
process before actual integration occurs.
• Comparison: naming and structural conflicts are identified.
• Conformation: resolve naming and structural conflicts
• Merging and restructuring: all schemas must be merged into
a single database schema and then restructured to create
the “best” integrated schema.
Schema Integration: Preintegration
1. An integration method (binary or n-ary) must be
selected and the schema integration order defined.
– The order implicitly defines priorities.
2. Candidate keys in each schema are identified to
enable the integrator to determine dependencies
implied by the schemas.
3. The mapping or transformation rules should be
described before integration begins.
– e.g., mapping from degrees Celsius in one schema to
degrees Fahrenheit in another.
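A mapping rule of this kind can be written down before integration starts. The sketch below is only an illustration: the rule table `MAPPING_RULES` and the helper `apply_rule` are hypothetical names, not part of any integration tool.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius value to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def fahrenheit_to_celsius(fahrenheit: float) -> float:
    """Convert a Fahrenheit value to Celsius."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


# Hypothetical rule table, keyed by (source unit, target unit) and
# fixed during preintegration.
MAPPING_RULES = {
    ("celsius", "fahrenheit"): celsius_to_fahrenheit,
    ("fahrenheit", "celsius"): fahrenheit_to_celsius,
}


def apply_rule(value: float, source_unit: str, target_unit: str) -> float:
    """Translate a value from one schema's unit into the other schema's unit."""
    if source_unit == target_unit:
        return value
    return MAPPING_RULES[(source_unit, target_unit)](value)
```

With the rules fixed up front, the integrator can compare attribute values from both schemas in a single unit.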
Preintegration Example: InS1
Engineer
No.
Engineer
Name
ENGINEER
Title
Salary
Project
No.
Responsibility
N
WORKS
IN
1
Project
Name
Budget
PROJECT
N
Location
CONTRACTED
BY
Duration
Contract
Date
1
CLIENT
Client
Name
Address
Preintegration Example: InS2 & InS3
Name
Dept-name
E#
Budget
Title
EMPLOYEE
Address
Eno
N
EMPLOYS
1
Manager
DEPARTMENT
InS2
Salary
JNO
Resp
Ename
Jname
Budget
ENGINEER
Title
Sal
N
EMPLOYS
Dur
M
J
Cname
Loc
InS3
Keys & Integration Order
KEYS
InS1
InS2
InS3
Integration method
InS1:
Engineer No. in ENGINEER
Project No. in PROJECT
Client name in CLIENT
InS2:
E# in EMPLOYEE
Dept-name in DEPARTMENT
InS3:
Eno in E
Jno in J
Schema Comparison:
Naming Conflict (1)
Synonyms: two identical entities that have
different names.
InS1
ENGINEER
Engineer No
Engineer Name
Salary
WORKSIN
Responsibility
Duration
PROJECT
Project No
Project Name
Location
InS3
E
Eno
Ename
Sal
G
Resp
Dur
J
Jno
Jname
Loc
Schema Comparison:
Naming Conflict (2)
Homonyms: Two different entities that have
identical names.
• In InS1, ENGINEER.Title refers to the title of
engineers.
• In InS2, EMPLOYEE.Title refers to the title of all
employees.
domain (EMPLOYEE.Title) >> domain (ENGINEER.Title)
Schema Comparison – Relation
between Schemas
• Two schemas can be related in four
possible ways:
– They can be identical to one another.
– One can be a subset of the other.
– Some components from one may occur in the other
while retaining some unique features.
– They could be completely different with no
overlap.
• An attribute in one schema may represent
the same information as an entity in
another one
Schema Comparison Example
• InS3 is a subset of InS2
E#
Name
ENGINEER
EMPLOYEE
Title
Address
IS-A relationship
EMPLOYS
Salary
DEPARTMENT
• Some parts of InS1 (about engineers) and InS3
(about engineers) occur in InS2 (about employees)
Schema Comparison – Structural
Conflicts (1)
• Type conflicts: occur when the same object is
represented by an attribute in one schema and by an
entity in another schema.
– The client of a project is modeled as an entity in InS1,
however
– the client is included as an attribute of the J entity in InS3
Resp
EMPLOYS
Dur
JNO
M
N
Jname
CONTRACTED
BY
Budget
J
Cname
InS3
Loc
Contract
Date
1
CLIENT
Client
Name
Address
PROJECT
InS1
Schema Comparison – Structural
Conflicts (2)
This is
1-to-many
Dependency conflicts:
occur when different
relationship modes are
used to represent the
same thing in different
schemas.
Eno
Engineer
No.
Title
ENGINEER
Title
N
ENGINEER
Sal
N
WORKS
IN
EMPLOYS
Dur
1
PROJECT
InS1
Salary
This is
many-to-many
Resp
Ename
Project
No.
Engineer
Name
M
J
InS3
Schema Comparison: Structural
Conflicts (3)
• Key conflicts: occur when different candidate keys
are available and different primary keys are
selected in different schemas
• Behavioral conflicts: are implied by the modeling
mechanism,
– e.g., deletion of the last employee causes the dissolution
of the department.
Conformation: Naming Conflicts
Naming conflicts are resolved simply by renaming
the conflicting ones.
Synonyms: rename the schema of InS3
to conform to the naming of InS1.
InS3 → InS1
E → ENGINEER
Eno → Engineer No
Ename → Engineer Name
Sal → Salary
G → WORKSIN
Resp → Responsibility
Dur → Duration
J → PROJECT
Jno → Project No
Jname → Project Name
Loc → Location
Homonyms:
• Prefix each attribute by the name of the entity to which it belongs,
e.g., ENGINEER.Title
EMPLOYEE.Title
• and prefix each entity by the name of the schema to which it belongs,
e.g., InS1.ENGINEER
InS2.EMPLOYEE
Resolving Structural Conflicts
Transforming entities/attributes/relationships among one another
Engineer
No.
InS3
Project
No.
Responsibility
Engineer
Name
Project
Name
Budget
ENGINEER
Title
N
Salary
Engineer
No.
M
PROJECT
Salary
Project
No.
Responsibility
Engineer
Name
N
Location
Client
Name
Duration
ENGINEER
Title
WORKS
IN
WORKS
IN
M
Budget
PROJECT
N
C-P
Duration
Example:
Transform the attribute Client name in
InS3 to an entity C to make InS3
conform to the presentation of InS1.
Project
Name
M
Client
Name
C
Location
New
InS3
Schema Integration:
Merging & Restructuring
Merging requires that the information contained in the
participating schemas be retained in the integrated
schema.
Merging using the IS-A
relationship
InS2
(Employees)
InS3
InS1
(Engineers) (Engineers)
Use InS3 as the final schema
since it is more general in
terms of the C-P relationship
(i.e., many-to-many)
(next page)
Integrate InS1 & InS3
Engineer
No.
InS1
Engineer
Name
ENGINEER
Title
Project
No.
Responsibility
N
WORKS
IN
Salary
1
ENGINEER
Title
Responsibility
Engineer
Name
Salary
N
WORKS
IN
M
PROJECT
N
CONTRACTED
BY
Duration
M
Client
Name
C
1
CLIENT
Client
Name
Project
Name
Address
Budget
Location
InS3
Budget
Location
CONTRACTED
BY
Contract
Date
Engineer
No.
PROJECT
N
Duration
Project
No.
Project
Name
InS3 is
more
general
Merging & Restructuring Example
Final Result:
ENGINEER
Duration
Responsibility
N
WORKS
IN
M
Project
No.
Project
Name
Budget
PROJECT
Location
E#
Name
EMPLOYEE
N
EMPLOYS
1
InS2
Title
Address
Manager
CLIENT
Client
name
InS1/InS3
Address
SAL
DEPARTMENT
Budget
CONTRACTED
BY
Dept-name
Unfortunately, the conformation and
restructuring stages are an art
rather than a science.
Query Processing in
Multidatabase Systems
Query Processing in Three Steps
1. Global query is
decomposed into local
queries
Schema Integration
Local Schema 1
Local Schema 2
Local Schema 3
Translator 1
Translator 2
Translator 3
InS1
InS2
Q1,1
Q1,2
InS3
Q1,3
INTEGRATOR
Q1
GCS
Query Processing in Three Steps
2. Each local query is
translated into
queries over the
corresponding local
database system
Schema Integration
Local Schema 1
Q’1,1
Local Schema 2
Q’1,3
Q’1,2
Translator 1
Translator 2
InS1
Local Schema 3
Translator 3
InS2
Q1,1
Q1,2
InS3
Q1,3
INTEGRATOR
Q1
GCS
Query Processing in Three Steps
3. Results of the local
queries are combined
into the answer
Schema Integration
Local Schema 1
Q’1,1
Translator 2
InS1
Combine
Local Schema 3
Q’1,3
Q’1,2
Translator 1
Final
answer
Local Schema 2
Translator 3
InS2
Q1,1
Q1,2
InS3
Q1,3
INTEGRATOR
Q1
GCS
Query Processing in Three Steps
1.
Global query is
decomposed into local
queries
2. Each local query is
translated into
queries over the
corresponding local
database system
3. Results of the local
queries are combined
into the answer
Schema Integration
Local Schema 1
Local Schema 2
Local Schema 3
Translator 1
Translator 2
Translator 3
InS1
InS2
INTEGRATOR
GCS
InS3
Outline
• Overview of major query processing
components in multidatabase systems:
– Query Decomposition
– Query Translation
– Global Query Optimization
• Techniques for each of the above
components
Query Decomposition
Query Decomposition
Overview
Global Query
Query decomposition &
global optimization
SQ1
Query
translator 1
TQ1
DB1
SQ2
Query
translator 2
TQ2
DB2
...
…
...
SQn
Query
translator n
TQn
DBn
PQ1
…
PQn
SQi: export-schema subquery in global query language
TQi: target query (local subquery) in local query language
PQi: postprocessing query used to combine results
returned by subqueries to form the answer
Assumptions
• We use the object-oriented data model to
present a query decomposition algorithm
• To simplify the discussion, we assume that
there are only two export schemas:
ES1
Emp1: SSN
Name
Salary
Age
ES2
Emp2: SSN
Name
Salary
Rank
Definitions
• type: Given a class C, the type
of C denoted by type(C ), is the
set of attributes defined for C
and their corresponding
domains.
• world: the world of C, denoted
by world(C ), is the set of real-world objects described by C.
• extension: the extension of C,
denoted by extension(C ), is the
set of instances contained in C.
World
Type
Extension
A Class
Schema Integration
• Integration through outerjoin
• Integration through outerunion
(generalization)
Review: Outerjoin
The outerjoin of relations R1 and R2
(R1 ⋈o R2 ) is the union of three
components:
– the join of R1 and R2,
– dangling tuples of R1 padded with null
values, and
– dangling tuples of R2 padded with null
values.
Outerjoin Example
EmpO = Emp1 ⋈o Emp2
Emp1
OID  SSN   Name   Salary  Age
3    6789  Smith  90,000  40
4    4321  Chang  62,000  30
5    8642  Patel  75,000  35
Emp2
OID  SSN   Name   Salary  Rank
1    2222  Ahad   98,000  S. Mgr.
2    7531  Wang   95,000  S. Mgr.
3    6789  Smith  25,000  Mgr.
EmpO
OID  SSN   Name   Salary        Age   Rank
1    2222  Ahad   98,000        null  S. Mgr.
2    7531  Wang   95,000        null  S. Mgr.
3    6789  Smith  Inconsistent  40    Mgr.
4    4321  Chang  62,000        30    null
5    8642  Patel  75,000        35    null
Rows 1 and 2 (from Emp2) and rows 4 and 5 (from Emp1) are dangling tuples.
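The outerjoin example above can be reproduced with a small sketch. It assumes that each relation is a list of dictionaries, that OID values are unique within a relation, and that a shared attribute with two different values is simply marked "Inconsistent"; the function name `outerjoin` is illustrative.

```python
def outerjoin(r1, r2, key):
    """Full outerjoin of two lists of dicts on `key`, padding with None (null)."""
    # Collect the union of attribute names, keeping first-seen order.
    attrs = []
    for row in r1 + r2:
        for a in row:
            if a not in attrs:
                attrs.append(a)
    by_key1 = {row[key]: row for row in r1}
    by_key2 = {row[key]: row for row in r2}
    result = []
    for k in sorted(set(by_key1) | set(by_key2)):
        left, right = by_key1.get(k), by_key2.get(k)
        merged = {}
        for a in attrs:
            lv = left.get(a) if left else None
            rv = right.get(a) if right else None
            if lv is not None and rv is not None and lv != rv:
                merged[a] = "Inconsistent"   # both sides disagree on this attribute
            else:
                merged[a] = lv if lv is not None else rv
        result.append(merged)
    return result


emp1 = [
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"OID": 4, "SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"OID": 5, "SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"OID": 1, "SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"OID": 2, "SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]
emp_o = outerjoin(emp1, emp2, "OID")
```

Dangling tuples come out padded with None, and the Smith row carries the marker that a later inconsistency-resolution step has to replace.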
Outerunion
EmpG = Emp1 ⋃o Emp2
Emp1
OID  SSN   Name   Salary  Age
3    6789  Smith  90,000  40
4    4321  Chang  62,000  30
5    8642  Patel  75,000  35
Emp2
OID  SSN   Name   Salary  Rank
1    2222  Ahad   98,000  S. Mgr.
2    7531  Wang   95,000  S. Mgr.
3    6789  Smith  25,000  Mgr.
EmpG
OID  SSN   Name   Salary    Age   Rank
1    2222  Ahad   98,000    null  S. Mgr.
2    7531  Wang   95,000    null  S. Mgr.
3    6789  Smith  Conflict  null  Mgr.
3    6789  Smith  Conflict  40    null
4    4321  Chang  62,000    30    null
5    8642  Patel  75,000    35    null
Schema Integration Using Outerjoin
Two classes C1 and C2 can be integrated
by equi-outerjoining the two classes on
the OID to form a new class C.
– extension(C ) = extension(C1 ) ⋈o extension(C2 )
– type(C ) = type(C1 ) ⋃ type(C2 )
– world(C ) = world(C1 ) ⋃ world(C2 )
C1
C2
C
Schema Integration thru Generalization
Two classes C1 and C2 can be integrated by
generalizing the two classes to form the
superclass C.
Generalization:
– type(C ) = type(C1 ) ⋂ type(C2 )
– extension(C ) = π_type(C) [extension(C1 ) ⋃o extension(C2 )], where ⋃o is the outer union
– world(C ) = world(C1 ) ⋃ world(C2 )
Generalization Example
Emp1: SSN
Name
Salary
Age
Emp2: SSN
Name
Salary
Rank
EmpG: SSN
Name
Salary
Generalization
• Emp1 and Emp2 will also appear in the
global schema since not all information in
Emp1 and Emp2 is retained in EmpG
EmpG SSN
Name
Salary
Emp1 Age
More
specific
Rank Emp2
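The generalization rules can be sketched as an outer union followed by a projection onto the shared attributes. The sketch assumes relations are lists of dictionaries and uses the Emp1 and Emp2 rows from the earlier example; the function names are illustrative.

```python
def outerunion(r1, r2):
    """Outer union: all rows of both relations over the union of their attributes."""
    attrs = []
    for row in r1 + r2:
        for a in row:
            if a not in attrs:
                attrs.append(a)
    # Attributes missing from a row are padded with None (null).
    return [{a: row.get(a) for a in attrs} for row in r1 + r2]


def generalize(r1, r2, type1, type2):
    """Return (type(C), extension(C)) for the superclass C of C1 and C2."""
    shared = [a for a in type1 if a in type2]        # type(C1) ∩ type(C2)
    projected = [tuple(row[a] for a in shared) for row in outerunion(r1, r2)]
    extension = []
    for t in projected:                              # projection drops duplicates
        if t not in extension:
            extension.append(t)
    return shared, extension


emp1 = [
    {"SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]
emp_g_type, emp_g = generalize(
    emp1, emp2,
    ["SSN", "Name", "Salary", "Age"],
    ["SSN", "Name", "Salary", "Rank"],
)
```

Smith appears twice in the result with two different salaries, which is exactly the data inconsistency that has to be resolved separately.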
Inconsistency Resolution
• The schema integration techniques
work as long as there is no data
inconsistency
• If data inconsistency occurs,
aggregate functions may be used to
resolve the problem.
Inconsistency Resolution Example
Export Schemas
Emp1: SSN
Name
Salary
Age
Emp2: SSN
Name
Salary
Rank
Integrated Schema
EmpG: SSN
Name
Salary
Generalization
Aggregate Functions - Examples:
EmpO: SSN
or
Name
Salary
Age
Rank
Outer
join
EmpG.Name = Emp1.Name, if EmpG is in world(Emp1)
= Emp2.Name, if EmpG is in world(Emp2) – world(Emp1)
EmpG.Salary = Emp1.Salary, if EmpG is in world(Emp1) – world(Emp2)
= Emp2.Salary, if EmpG is in world(Emp2) – world(Emp1)
= Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ⋂ world(Emp2)
EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
= Null, if EmpO is in world(Emp2) – world(Emp1)
EmpO.Rank = Emp2.Rank, if EmpO is in world(Emp2)
= Null, if EmpO is in world(Emp1) – world(Emp2)
world(Emp2) –
world(Emp1)
world(Emp1) –
world(Emp2)
world(Emp1)
⋂
world(Emp2)
World (Emp1)
World (Emp2)
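The aggregate functions above can be read as a case analysis on which world an object belongs to. The following sketch applies them to the Emp1 and Emp2 rows of the outerjoin example. It assumes that matching SSN values identify the same real-world object; the function name `integrate` is illustrative.

```python
def integrate(emp1, emp2):
    """Build integrated rows from Emp1 and Emp2 using the resolution functions."""
    by1 = {r["SSN"]: r for r in emp1}
    by2 = {r["SSN"]: r for r in emp2}
    rows = []
    for ssn in sorted(set(by1) | set(by2)):
        e1, e2 = by1.get(ssn), by2.get(ssn)
        if e1 is not None and e2 is not None:
            salary = e1["Salary"] + e2["Salary"]   # world(Emp1) ∩ world(Emp2): Sum
        elif e1 is not None:
            salary = e1["Salary"]                  # world(Emp1) - world(Emp2)
        else:
            salary = e2["Salary"]                  # world(Emp2) - world(Emp1)
        rows.append({
            "SSN": ssn,
            "Name": e1["Name"] if e1 is not None else e2["Name"],
            "Salary": salary,
            "Age": e1["Age"] if e1 is not None else None,
            "Rank": e2["Rank"] if e2 is not None else None,
        })
    return rows


emp1 = [
    {"SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]
emp_o = integrate(emp1, emp2)
smith = next(r for r in emp_o if r["SSN"] == 6789)
```

Smith is the only object in both worlds, so his integrated salary is the sum 90,000 + 25,000 = 115,000.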
Query Decomposition
Step 1: Determine Number of Subqueries
Global
Query
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35
Assume
Outerjoin is
used for
schema
integration
Obtain a partition of world(EmpO) based on the aggregate
function used to resolve the data inconsistency.
Inconsistency Function:
EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) – world(Emp2)
= Emp2.Salary, if EmpO is in world(Emp2) – world(Emp1)
= Sum(Emp1.Salary, Emp2.Salary), if EmpO is in world(Emp1) ⋂ world(Emp2)
Option 1 (based on Salary)
part. 1: world(Emp1) – world(Emp2)
part. 2: world(Emp2) – world(Emp1)
part. 3: world(Emp1) ⋂ world(Emp2)
Query Decomposition
Step 1: Determine Number of Subqueries
Global
Query
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35
Obtain a partition of world(EmpO) based on the aggregate
function used to resolve the data inconsistency.
Inconsistency Function:
EmpO.Age
= Emp1.Age, if EmpO is in world(Emp1)
= Null, if EmpO is in world(Emp2) – world(Emp1)
Option 2 (based on Age)
part. 1: world(Emp1)
part. 2: world(Emp2) – world(Emp1)
world(Emp1)
1
2
world(Emp2)
Query Decomposition
Step 1: Determine Number of Subqueries
Global
Query
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35
Obtain a partition of world(EmpO) based on the aggregate
function used to resolve the data inconsistency.
Option 1 (based on Salary)
part. 1: world(Emp1) – world(Emp2)
part. 2: world(Emp2) – world(Emp1)
part. 3: world(Emp1) ⋂ world(Emp2)
Option 2 (based on Age)
part. 1: world(Emp1)
part. 2: world(Emp2) – world(Emp1)
We use Option 1 since it is the finest partition among all the partitions.
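Choosing the finest partition can be sketched as computing the common refinement of the candidate partitions. The sketch assumes each partition is a list of frozensets over three atomic regions of the Venn diagram; the names `refine` and `common_refinement` are illustrative. When no candidate is finer than all the others, the refinement yields a new, finer partition.

```python
def refine(p, q):
    """Pairwise intersections of the blocks of two partitions (empty ones dropped)."""
    return [a & b for a in p for b in q if a & b]


def common_refinement(partitions):
    """Coarsest partition that is at least as fine as every given partition."""
    result = list(partitions[0])
    for p in partitions[1:]:
        result = refine(result, p)
    return result


# Atomic regions of the two overlapping worlds.
ONLY1 = "world(Emp1) - world(Emp2)"
ONLY2 = "world(Emp2) - world(Emp1)"
BOTH = "world(Emp1) ∩ world(Emp2)"

option1 = [frozenset({ONLY1}), frozenset({ONLY2}), frozenset({BOTH})]   # Salary
option2 = [frozenset({ONLY1, BOTH}), frozenset({ONLY2})]                # Age
chosen = common_refinement([option1, option2])
```

For the Salary and Age partitions the refinement coincides with Option 1, which matches the choice made above.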
Query Decomposition
Another Example
Option 1:
Option 2:
world(Emp1)
1
world(Emp1)
2
1
world(Emp2)
world(Emp2)
Use finer partition (Option 3):
world(Emp1)
1
3
2
2
world(Emp2)
Query Decomposition
Step 2: Query Decomposition
Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35
Partition: world(Emp1) consists of parts 1 and 3; world(Emp2) consists of parts 2 and 3.
Query Decomposition: Obtain a query for each subset in the chosen partition.
part. 1: Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN NOT IN
(Select Emp2.SSN
From Emp2)
part. 2: This subquery is discarded because
EmpO.Age is Null.
part. 3: Select Emp1.Name, Emp2.Rank
From Emp1, Emp2
Where Sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
Inconsistency functions used:
EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
= Null, if EmpO is in world(Emp2) – world(Emp1)
EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) – world(Emp2)
= Emp2.Salary, if EmpO is in world(Emp2) – world(Emp1)
= Sum(Emp1.Salary, Emp2.Salary), if EmpO is in world(Emp1) ⋂ world(Emp2)
Query Decomposition
Step 2: Query Decomposition
Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35
Query Decomposition: Obtain
a query for each subset in
the chosen partition.
Emp1.Salary
Emp1.Age
1
3
Emp1.Salary +
Emp2.Salary
Emp1.Age
world(Emp1)
2
Emp2.Salary
Age = null
world(Emp2)
part. 1: Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN NOT IN
(Select Emp2.SSN
From Emp2)
part. 2: This subquery is discarded because
EmpO.Age is Null.
part. 3: Select Emp1.Name, Emp2.Rank
From Emp1, Emp2
Where Sum(Emp1.Salary,
Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
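The decomposed queries can be tried out on the sample data from the outerjoin example. The sketch below uses Python's built-in sqlite3 module and, as a simplification, keeps both Emp1 and Emp2 in one in-memory database. The two-argument Sum of the slides is written as `+` because SQL's SUM is a column aggregate, and part 1 returns NULL for Rank since those employees have no Emp2 record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp1 (SSN INTEGER PRIMARY KEY, Name TEXT, Salary INTEGER, Age INTEGER);
CREATE TABLE Emp2 (SSN INTEGER PRIMARY KEY, Name TEXT, Salary INTEGER, "Rank" TEXT);
INSERT INTO Emp1 VALUES (6789, 'Smith', 90000, 40), (4321, 'Chang', 62000, 30),
                        (8642, 'Patel', 75000, 35);
INSERT INTO Emp2 VALUES (2222, 'Ahad', 98000, 'S. Mgr.'), (7531, 'Wang', 95000, 'S. Mgr.'),
                        (6789, 'Smith', 25000, 'Mgr.');
""")

# part. 1: objects in world(Emp1) - world(Emp2); Salary and Age come from Emp1.
part1 = conn.execute("""
    SELECT Emp1.Name, NULL
    FROM Emp1
    WHERE Emp1.Salary > 80000 AND Emp1.Age > 35
      AND Emp1.SSN NOT IN (SELECT Emp2.SSN FROM Emp2)
""").fetchall()

# part. 2 is discarded: Age is null for objects only in world(Emp2).

# part. 3: objects in both worlds; Salary is the sum of the two salaries.
part3 = conn.execute("""
    SELECT Emp1.Name, Emp2."Rank"
    FROM Emp1, Emp2
    WHERE Emp1.Salary + Emp2.Salary > 80000 AND Emp1.Age > 35
      AND Emp1.SSN = Emp2.SSN
""").fetchall()

answer = part1 + part3   # the postprocessing step is a plain union here
```

On this data part 1 is empty and part 3 contributes Smith, whose combined salary is 115,000.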
Query Decomposition
Step 3: Further Decomposition
STEP 3: Some resulting queries may still reference
data from more than one database. They need to be
further decomposed into subqueries and possibly also
postprocessing queries
Before STEP 3:
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 and
Emp1.Age > 35 and
Emp1.SSN NOT IN
(Select Emp2.SSN
From Emp2)
After STEP 3:
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 and
Emp1.Age > 35 and
Emp1.SSN NOT IN X
where X is produced by the subquery:
Insert INTO X
Select Emp2.SSN
From Emp2
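The split in Step 3 can be sketched with two separate in-memory sqlite3 databases standing in for the two sites. The subquery on Emp2 runs at the second database, its result X is shipped to the first, and the remaining query runs there. The employee Lee is a hypothetical row added so that the result is not empty; the other rows come from the earlier example.

```python
import sqlite3

db1 = sqlite3.connect(":memory:")   # stands in for the database holding Emp1
db2 = sqlite3.connect(":memory:")   # stands in for the database holding Emp2

db1.execute("CREATE TABLE Emp1 (SSN INTEGER, Name TEXT, Salary INTEGER, Age INTEGER)")
db1.executemany(
    "INSERT INTO Emp1 VALUES (?, ?, ?, ?)",
    [
        (6789, "Smith", 90000, 40),
        (4321, "Chang", 62000, 30),
        (8642, "Patel", 75000, 35),
        (1111, "Lee", 85000, 50),      # hypothetical row, not in the slides
    ],
)
db2.execute('CREATE TABLE Emp2 (SSN INTEGER, Name TEXT, Salary INTEGER, "Rank" TEXT)')
db2.executemany(
    "INSERT INTO Emp2 VALUES (?, ?, ?, ?)",
    [
        (2222, "Ahad", 98000, "S. Mgr."),
        (7531, "Wang", 95000, "S. Mgr."),
        (6789, "Smith", 25000, "Mgr."),
    ],
)

# Subquery for the second database: compute X.
x = [row[0] for row in db2.execute("SELECT SSN FROM Emp2")]

# Ship X to the first database as a temporary table.
db1.execute("CREATE TEMP TABLE X (SSN INTEGER)")
db1.executemany("INSERT INTO X VALUES (?)", [(ssn,) for ssn in x])

# Remaining query, now referencing a single database.
rows = db1.execute("""
    SELECT Name FROM Emp1
    WHERE Salary > 80000 AND Age > 35
      AND SSN NOT IN (SELECT SSN FROM X)
""").fetchall()
```

Smith satisfies the salary and age conditions but is excluded because his SSN appears in X.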
Query Decomposition
Step 4: Query Optimization
STEP 4: It may be desirable to reduce
the number of subqueries by
combining subqueries for the same
database.
Query Translation
Query Translation (1)
IF Global Query Language ≠ Local Query Language
THEN a translator maps each export schema subquery
into a query in the local query language.
Query Translation (2)
IF the source query language has a higher
expressive power THEN EITHER
– Some source queries cannot be translated; or
– they must be translated using both
• the syntax of the target query language, and
• some facilities of a high-level programming language.
Example: A recursive OODB query may not be
translated into a relational query using SQL
alone.
Relational-to-OO Translation
OODB Schema:
Auto (OID, Color, Manufacturer)
Company (OID, Name, Profit, Headquarter, President)
People (OID, Name, Hometown, Automobile, Age)
City (OID, Name, State)
Equivalent Relational Schema (attributes that reference another relation's OID are foreign keys):
Auto (Auto-OID, Color, Company-OID)
Company (Company-OID, Name, Profit, City-OID, People-OID)
People (People-OID, Name, Age, City-OID, Auto-OID)
City (City-OID, Name, State)
Relational-to-OO Example (1)
Global Query:
Select Auto1.*
From Auto Auto1, Auto Auto2,
Company, People,
City City1, City City2
Where Auto1.Company-OID =
Company.Company-OID AND
Company.People-OID =
People.People-OID AND
People.Age = 52 AND
People.Auto-OID =
Auto2.Auto-OID AND
Auto2.Color = “red” AND
People.City-OID =
City1.City-OID AND
City1.Name = City2.Name AND
Company.City-OID =
City2.City-OID
The six join predicates are numbered 1 to 6: predicates 1 to 3 link Auto1, Company, People, and Auto2; predicates 4 to 6 are the three City conditions.
Relational Predicate Graph:
Auto1
1) Company-OID
Company
2) People-OID
City2
People
Age=52
3) Auto-OID
City1
Auto2
Color=red
Find all red cars owned by a 52-year-old
who is the President of the car manufacturer (predicates 1+2+3)
and lives in the same city as
the car manufacturer (predicates 4+5+6)
Relational-to-OO Example (2)
OO Predicate Graph:
Auto1
Company-OID
Company
People-OID
City1
Auto1
Company
2) People-OID
City2
People
Age=52
3) Auto-OID
City1
Auto2
Color=red
Age=52
Auto-OID
Relational Predicate Graph:
1) Company-OID
People
City2
Auto2
Color=red
Relational-to-OO Example (3)
OO Predicate Graph:
Auto1
Company-OID
Company
People-OID
City1
Predicate 3
People
Predicate 1
Age=52
Auto-OID
City2
Auto2
Color=red
Predicate 2
OO Query:
Where Auto.Manufacturer.President.Age = 52 AND
Auto.Manufacturer.President.Automobile.Color = red AND
Auto.Manufacturer.Headquarter.Name =
Auto.Manufacturer.President.Hometown.Name
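The path expressions in the OO query correspond to following object references. The sketch below models the OODB schema with Python dataclasses and evaluates the three predicates for one Auto object. Class and field names follow the schema; the sample objects and their values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class City:
    name: str
    state: str


@dataclass
class People:
    name: str
    age: int
    hometown: City
    automobile: Optional["Auto"] = None


@dataclass
class Company:
    name: str
    profit: float
    headquarter: City
    president: People


@dataclass
class Auto:
    color: str
    manufacturer: Company


def matches(auto: Auto) -> bool:
    """Evaluate the three OO predicates by navigating from the Auto object."""
    president = auto.manufacturer.president
    return (
        president.age == 52                                    # Predicate 1
        and president.automobile is not None
        and president.automobile.color == "red"                # Predicate 2
        and auto.manufacturer.headquarter.name == president.hometown.name  # Predicate 3
    )


# Hypothetical sample objects.
city_a = City("CityA", "StateA")
city_b = City("CityB", "StateB")
pres_a = People("PresidentA", 52, city_a)
pres_b = People("PresidentB", 40, city_b)
company_a = Company("CompanyA", 1.0, city_a, pres_a)
company_b = Company("CompanyB", 1.0, city_b, pres_b)
car1 = Auto("red", company_a)
car2 = Auto("blue", company_a)
car3 = Auto("red", company_b)
pres_a.automobile = car1
pres_b.automobile = car3

selected = [car for car in (car1, car2, car3) if matches(car)]
```

Each relational join on an OID column becomes a single attribute access, which is why the OO predicate graph has fewer nodes than the relational one.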
Global Query Optimization
Query Optimization (1)
CASE 1: A single target query is generated
IF the target database system has a query
optimizer
THEN the query optimizer can be used
to optimize the translated query
ELSE the translator has to consider the
performance issues
Query Optimization (2)
CASE 2: A set of target queries is needed.
• It might pay to have the minimum number of
queries
– It minimizes the number of invocations of the target
system
– It may also reduce the cost of combining the partial
results
• It might pay for a set to contain target queries
that can be well coordinated
– The results or intermediate results of the queries
processed earlier can be used to reduce the cost of
processing the remaining queries
Global Query Optimization (1)
• A query obtained by the query modification
process may still reference data from
more than one database.
Example: part. 3 (i.e., world(Emp1) ⋂ world(Emp2))
on page 126
Select Emp1.Name, Emp2.Rank
From Emp1, Emp2
/* accesses two databases */
Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
→ Some global strategy is needed to process such queries
Global Query Optimization (2)
• Select Emp1.Name, Emp2.Rank
From Emp1, Emp2
/* accesses two databases */
Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
→ Some global strategy is needed to process such queries
[Figure: three alternative strategies for forming the result from Emp1 (Site 1) and Emp2 (Site 2); one of them forms the result at a third site, Site 3.]
Data Inconsistency
• If C is integrated from C1 and C2 with no
data inconsistency on attribute A, then
σ_{A op a}(C) = σ_{A op a}(C1) ⋃ σ_{A op a}(C2)
• If A has data inconsistency, then the above
equality may no longer hold.
Example: Consider the select operation
EmpO
OID  SSN   Name   Salary        Age   Rank
1    2222  Ahad   98,000        null  S. Mgr.
2    7531  Wang   95,000        null  S. Mgr.
3    6789  Smith  Inconsistent  40    Mgr.
4    4321  Chang  62,000        30    null
5    8642  Patel  75,000        35    null
σ_{EmpO.Salary > 100,000}(EmpO)
The correct answer should
have the record for Smith.
However, the above query
returns an empty set
Smith does have a combined salary greater than 100,000
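The failure of the equality can be checked directly. The sketch assumes Sum is the resolution function for Salary and uses the salaries from the EmpO example, keyed by SSN.

```python
emp1_salary = {6789: 90000, 4321: 62000, 8642: 75000}   # SSN -> Emp1.Salary
emp2_salary = {2222: 98000, 7531: 95000, 6789: 25000}   # SSN -> Emp2.Salary

# Integrated salary: Sum where both sources have a value, else the single value.
integrated = {
    ssn: emp1_salary.get(ssn, 0) + emp2_salary.get(ssn, 0)
    for ssn in emp1_salary.keys() | emp2_salary.keys()
}

# Selection applied to the integrated class.
correct = {ssn for ssn, salary in integrated.items() if salary > 100000}

# Selection pushed down to each source and then unioned.
distributed = (
    {ssn for ssn, salary in emp1_salary.items() if salary > 100000}
    | {ssn for ssn, salary in emp2_salary.items() if salary > 100000}
)
```

Here `correct` contains Smith's SSN while `distributed` is empty, so the selection cannot simply be distributed over the two sources.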
Data Inconsistency - Optimization
Express an outerjoin (or a generalization) as
outer-unions as follows:
C1 ⋈o C2 = C1-O ⋃o C2-O ⋃o (C1-C ⋈OID C2-C)
C1-O: Those tuples of C1 that have no matching tuples
in C2 (private part)
C1-C: Those tuples of C1 that have matching tuples in
C2 (overlap part)
σ_{A op a}(C1 ⋈o C2) = σ_{A op a}(C1-O) ⋃o σ_{A op a}(C2-O)
⋃o σ_{A op a}(C1-C ⋈ C2-C)
Can we improve this term ?
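The rewritten selection can be sketched by splitting each class into its private and overlap parts. The sketch assumes each class is a dictionary from OID to the value of attribute A (Salary from the earlier example) and that a caller-supplied function resolves the overlap; all names are illustrative.

```python
def split_parts(c1, c2):
    """Return (C1-O, C1-C, C2-O, C2-C): private and overlap parts of each class."""
    c1_o = {k: v for k, v in c1.items() if k not in c2}
    c1_c = {k: v for k, v in c1.items() if k in c2}
    c2_o = {k: v for k, v in c2.items() if k not in c1}
    c2_c = {k: v for k, v in c2.items() if k in c1}
    return c1_o, c1_c, c2_o, c2_c


def select_over_outerjoin(c1, c2, resolve, predicate):
    """Selection over C1 outerjoin C2, evaluated part by part."""
    c1_o, c1_c, c2_o, c2_c = split_parts(c1, c2)
    result = {k: v for k, v in c1_o.items() if predicate(v)}        # private part of C1
    result.update({k: v for k, v in c2_o.items() if predicate(v)})  # private part of C2
    for k in c1_c:                                                  # overlap: join on OID
        value = resolve(c1_c[k], c2_c[k])
        if predicate(value):
            result[k] = value
    return result


emp1 = {3: 90000, 4: 62000, 5: 75000}   # OID -> Emp1.Salary
emp2 = {1: 98000, 2: 95000, 3: 25000}   # OID -> Emp2.Salary
over_100k = select_over_outerjoin(emp1, emp2, lambda a, b: a + b, lambda s: s > 100000)
over_80k = select_over_outerjoin(emp1, emp2, lambda a, b: a + b, lambda s: s > 80000)
```

Only the overlap term needs a join across the two databases, which is why the remaining effort goes into that term.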
Distribution of Selections (1)
σ_{A op a}(C1 ⋈o C2) = σ_{A op a}(C1-O) ⋃o σ_{A op a}(C2-O)
⋃o σ_{A op a}(C1-C ⋈ C2-C)
When can we distribute
σ over ⋈ ?
Expensive operation
Attribute A is defined by
an aggregate function
(see page 124)
Distribution of Selection (2)
Four cases were identified when all arguments of the aggregate
function f (used for resolving conflicts) are non-negative:
1. f(A1,A2) op a ≡ A1 op a AND A2 op a:
σ_{A op a}(C1-C ⋈ C2-C) = σ_{A1 op a}(C1-C) ⋈ σ_{A2 op a}(C2-C)
Example: max(Emp1-C.Salary, Emp2-C.Salary) < 30K
≡ Emp1-C.Salary < 30K AND
Emp2-C.Salary < 30K
2. f(A1,A2) op a ≡ f(A1 op a, A2 op a) op a:
σ_{A op a}(C1-C ⋈ C2-C) = σ_{A op a}(σ_{A1 op a}(C1-C) ⋈ σ_{A2 op a}(C2-C))
Example: sum(Emp1-C.Salary, Emp2-C.Salary) < 30K
≡ sum(Emp1-C.Salary < 30K,
Emp2-C.Salary < 30K) < 30K
Distribution of Selection (3)
3. f(A1,A2) op a ≡ f(A1 op’ a, A2 op’ a) op a:
σ_{A op a}(C1-C ⋈ C2-C) = σ_{A op a}(σ_{A1 op’ a}(C1-C) ⋈
σ_{A2 op’ a}(C2-C))
Example: sum(Emp1-C.Salary, Emp2-C.Salary) = 30K
≡ sum(Emp1-C.Salary ≤ 30K,
Emp2-C.Salary ≤ 30K) = 30K
4. No improvement is possible:
Example: sum(Emp1-C.Salary, Emp2-C.Salary) > 30K
Distribution Rules for σ over ⋈
σ_{A op a}(C1-C ⋈ C2-C)
f \ op        >   ≥   ≤   <   =   ≠   in  not in
sum(A1, A2)   4   4   2   2   3   4   4   4
avg(A1, A2)   4   4   2   2   3   4   4   4
max(A1, A2)   4   4   1   1   3   4   4   4
min(A1, A2)   1   1   4   4   3   4   4   4
Each entry is the number of the applicable case; 4 = no improvement possible.
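Two entries of the table can be checked on small data: case 1 for max with <, and case 2 for sum with <. The sketch assumes the overlap parts C1-C and C2-C are dictionaries from OID to a non-negative attribute value; the sample values and function names are hypothetical.

```python
def naive_select(c1_c, c2_c, f, a):
    """Join the overlap parts first, then apply f(A1, A2) < a."""
    return {k for k in c1_c if k in c2_c and f(c1_c[k], c2_c[k]) < a}


def max_less_than(c1_c, c2_c, a):
    """Case 1: max(A1, A2) < a is the same as A1 < a AND A2 < a."""
    left = {k for k, v in c1_c.items() if v < a}     # local selection on C1-C
    right = {k for k, v in c2_c.items() if v < a}    # local selection on C2-C
    return left & right                              # join; no further selection


def sum_less_than(c1_c, c2_c, a):
    """Case 2: prefilter both sides with Ai < a, join, then test the sum again."""
    left = {k: v for k, v in c1_c.items() if v < a}
    right = {k: v for k, v in c2_c.items() if v < a}
    return {k for k in left if k in right and left[k] + right[k] < a}


# Hypothetical non-negative salaries in the overlap parts, keyed by OID.
c1_c = {1: 10000, 2: 35000, 3: 20000}
c2_c = {1: 15000, 2: 1000, 3: 12000}
a = 30000

max_result = max_less_than(c1_c, c2_c, a)
sum_result = sum_less_than(c1_c, c2_c, a)
```

In both cases the local selections shrink the operands before the join; for sum the final selection is still needed because two values below the bound can add up to more than the bound.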
Problems in Global Query
Optimization (1)
Important information about local entity sets that is
needed to determine global query processing plans
may not be provided by the local database systems.
– Examples: cardinalities,
availability of fast access paths
– Techniques:
• Sampling queries may be designed to collect statistics
about the local databases.
• A monitoring system can be used to collect the
completion time for subqueries. This can be used to
better estimate subsequent subqueries.
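A sampling query can be sketched as evaluating a predicate on a small, regularly spaced sample and scaling the result. This is only an illustration under simple assumptions: the local data is a list that can be read with a fixed step, the function names are made up for this sketch, and the salary values are hypothetical.

```python
def estimate_selectivity(rows, predicate, step):
    """Fraction of sampled rows (every `step`-th row) that satisfy the predicate."""
    sample = rows[::step]
    if not sample:
        return 0.0
    return sum(1 for row in sample if predicate(row)) / len(sample)


def estimate_cardinality(total_rows, selectivity):
    """Scale the sampled selectivity up to the whole entity set."""
    return round(total_rows * selectivity)


# Hypothetical local entity set: 100 salaries from 0 to 99,000.
salaries = list(range(0, 100000, 1000))
selectivity = estimate_selectivity(salaries, lambda s: s > 80000, step=10)
estimated_rows = estimate_cardinality(len(salaries), selectivity)
actual_rows = sum(1 for s in salaries if s > 80000)
```

The estimate (10 rows) differs from the true count (19 rows), which shows why such statistics are only a rough guide for the global optimizer.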
Problems in Global Query
Optimization (2)
• Different query processing algorithms may have been
used in different local database systems.
→ Cooperation across different systems is difficult
Examples: Semijoin may not be supported on some
local systems.
• Data transmission between different local database
systems may not be fully supported.
Examples:
– A local database system may not allow update operations
– For many nonrelational systems, the instances of one entity
set are more likely to be clustered with the instances of
other entity sets. Such clustering makes it very expensive to
extract data for one entity set.
→ Need more sophisticated decomposition algorithms.