2-Background.pptx

Outline
• Introduction
• Background
➡ Relational database systems
➡ Computer networks
• Distributed Database Design
• Database Integration
• Semantic Data Control
• Distributed Query Processing
• Multidatabase Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Distributed DBMS
© M. T. Özsu & P. Valduriez
Ch.2/1
Relational Model
• Relation
➡ A relation R with attributes A = {A1, A2, …, An} defined over n domains D = {D1, D2, ..., Dn} (not necessarily distinct) with values {Dom1, Dom2, ..., Domn} is a finite, time-varying set of n-tuples ⟨d1, d2, ..., dn⟩ such that d1 ∈ Dom1, d2 ∈ Dom2, ..., dn ∈ Domn, and each attribute Ai is defined over domain Di (A1 over D1, A2 over D2, ..., An over Dn).
➡ Notation: R(A1, A2, …, An) or R(A1: D1, A2: D2, …, An: Dn)
➡ Alternatively, given R as defined above, an instance of it at a given time is a set of n-tuples:
{⟨A1: d1, A2: d2, …, An: dn⟩ | d1 ∈ Dom1, d2 ∈ Dom2, ..., dn ∈ Domn}
• Tabular structure of data where
➡ R is the table heading
➡ Attributes are table columns
➡ Each tuple is a row
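The definition above can be made concrete with a small sketch. It assumes domains are modeled as Python types and tuples as dictionaries; the names EMP_SCHEME and is_legal_instance are hypothetical and do not come from the slides.

```python
# Hypothetical sketch: a relation scheme maps each attribute to its domain,
# with a domain approximated by a Python type.
EMP_SCHEME = {"ENO": str, "ENAME": str, "SAL": int}

def is_legal_instance(scheme, tuples):
    """True if every tuple has exactly the scheme's attributes and
    every value belongs to the domain of its attribute."""
    for t in tuples:
        if set(t) != set(scheme):
            return False
        if not all(isinstance(t[a], dom) for a, dom in scheme.items()):
            return False
    return True

# Each dictionary is one row of the table; the keys are the columns.
emp = [
    {"ENO": "E1", "ENAME": "J. Doe", "SAL": 40000},
    {"ENO": "E2", "ENAME": "M. Smith", "SAL": 34000},
]
bad = [{"ENO": "E3", "ENAME": "A. Lee", "SAL": "high"}]  # SAL outside its domain

print(is_legal_instance(EMP_SCHEME, emp))
print(is_legal_instance(EMP_SCHEME, bad))
```

A list is used here only because dictionaries are not hashable; conceptually the instance is a set, so row order carries no meaning.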
Relation Schemes and Instances
• Relational scheme
➡ A relation scheme is the definition; i.e., a set of attributes
➡ A relational database scheme is a set of relation schemes:
✦ i.e., a set of sets of attributes
• Relation instance (simply relation)
➡ A relation is an instance of a relation scheme
➡ A relation r over a relation scheme R = {A1, ..., An} is a subset of the Cartesian product of the domains of all attributes, i.e.,
r ⊆ Dom1 × Dom2 × … × Domn
Domains
• A domain is a type in the programming language sense
➡ Name: String
➡ Salary: Real
• Domain values are the set of acceptable values for a variable of a given type.
➡ Name: CdnNames = {…},
➡ Salary: ProfSalary = {45,000 - 150,000}
➡ Simple/Composite domains
✦ Address = Street name + street number + city + province + postal code
• Domain compatibility
➡ Binary operations (e.g., comparison to one another, addition, etc.) can be performed on them.
• Full support for domains is not provided in many current relational DBMSs
Relation Schemes
• Tabular form
EMP
ENO | ENAME | TITLE | SAL | PNO | RESP | DUR
PROJ
PNO | PNAME | BUDGET
• Relation schemes
EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR)
PROJ(PNO, PNAME, BUDGET)
• Relation keys (tuple identifiers): (ENO, PNO) for EMP and PNO for PROJ.
Example Relation Instances
Repetition Anomaly
• The ENAME, TITLE, SAL attribute values are repeated for each project that the employee is involved in.
➡ Waste of space
➡ Complicates updates
EMP
ENO | ENAME | TITLE | SAL | PNO | RESP | DUR
E1 | J. Doe | Elect. Eng. | 40000 | P1 | Manager | 12
E2 | M. Smith | Analyst | 34000 | P1 | Analyst | 24
E2 | M. Smith | Analyst | 34000 | P2 | Analyst | 6
E3 | A. Lee | Mech. Eng. | 27000 | P3 | Consultant | 10
E3 | A. Lee | Mech. Eng. | 27000 | P4 | Engineer | 48
E4 | J. Miller | Programmer | 24000 | P2 | Programmer | 18
E5 | B. Casey | Syst. Anal. | 34000 | P2 | Manager | 24
E6 | L. Chu | Elect. Eng. | 40000 | P4 | Manager | 48
E7 | R. Davis | Mech. Eng. | 27000 | P3 | Engineer | 36
E8 | J. Jones | Syst. Anal. | 34000 | P3 | Manager | 40
Update Anomaly
• If any attribute of an employee (say SAL) is updated, multiple tuples have to be updated to reflect the change.
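A minimal sketch of the update anomaly, using three tuples of the unnormalized EMP relation. The helper raise_salary and the new salary value are hypothetical, chosen only for illustration.

```python
# Tuples follow EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR).
emp = [
    ("E2", "M. Smith", "Analyst", 34000, "P1", "Analyst", 24),
    ("E2", "M. Smith", "Analyst", 34000, "P2", "Analyst", 6),
    ("E3", "A. Lee", "Mech. Eng.", 27000, "P3", "Consultant", 10),
]

def raise_salary(rows, eno, new_sal):
    """Return a copy of rows with SAL replaced in every tuple of employee eno."""
    return [
        (e, n, t, new_sal if e == eno else s, p, r, d)
        for (e, n, t, s, p, r, d) in rows
    ]

updated = raise_salary(emp, "E2", 36000)

# One logical change (a single employee's salary) touches one tuple per project.
touched = sum(1 for old, new in zip(emp, updated) if old != new)
print(touched)
```

If only one of the two E2 tuples were changed, the relation would hold two different salaries for the same employee, which is the inconsistency the anomaly describes.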
Insertion Anomaly
• It may not be possible to store information about a new project until an employee is assigned to it.
Deletion Anomaly
• If an engineer who is the only employee on a project leaves the company, either his personal information cannot be deleted, or the information about that project is lost.
• May have to delete many tuples.
What to do?
• Take each relation individually and “improve” it in terms of the
desired characteristics
➡ Normal forms
✦ Atomic values (1NF)
✦ Can be defined according to keys and dependencies.
✦ Functional dependencies (2NF, 3NF, BCNF)
✦ Multivalued dependencies (4NF)
➡ Normalization
✦ Normalization is a process of concept separation which applies a top-down methodology for producing a schema by subsequent refinements and decompositions.
✦ Do not combine unrelated sets of facts in one table; each relation should contain an independent set of facts.
✦ Universal relation assumption
✦ 1NF to 3NF; 1NF to BCNF
Normalization Issues
• How do we decompose a schema into a desirable normal
form?
• What criteria should the decomposed schemas follow in
order to preserve the semantics of the original schema?
➡ Reconstructability: the original relation can be recovered by joining the decomposed relations, with no spurious tuples
➡ Lossless decomposition: no information loss
➡ Dependency preservation: the constraints (i.e., dependencies) that hold
on the original relation should be enforceable by means of the
constraints (i.e., dependencies) defined on the decomposed relations.
• What happens to queries?
➡ Processing time may increase due to joins
➡ Denormalization
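The reconstructability and lossless criteria can be checked on a concrete instance. The sketch below is an illustration under stated assumptions: tuples are modeled as frozensets of (attribute, value) pairs, and rel, project, and natural_join are hypothetical helper names. It decomposes a small relation two ways and rejoins each decomposition.

```python
def rel(rows):
    """Build a relation (a set) from dict rows."""
    return {frozenset(r.items()) for r in rows}

def project(r, attrs):
    """Projection with duplicate elimination."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}

def natural_join(r, s):
    """Join tuples that agree on every attribute they share."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return out

emp = rel([
    {"ENO": "E1", "ENAME": "J. Doe", "PNO": "P1", "RESP": "Manager"},
    {"ENO": "E2", "ENAME": "M. Smith", "PNO": "P1", "RESP": "Analyst"},
    {"ENO": "E2", "ENAME": "M. Smith", "PNO": "P2", "RESP": "Analyst"},
    {"ENO": "E5", "ENAME": "B. Casey", "PNO": "P2", "RESP": "Manager"},
])

# Decomposition on ENO: the join gives back exactly the original tuples.
rejoined = natural_join(project(emp, {"ENO", "ENAME"}),
                        project(emp, {"ENO", "PNO", "RESP"}))
lossless_ok = rejoined == emp

# Decomposition on RESP: the join invents employee/project pairs.
lossy = natural_join(project(emp, {"ENO", "ENAME", "RESP"}),
                     project(emp, {"RESP", "PNO"}))
spurious = lossy - emp
print(lossless_ok, len(spurious))
```

The second decomposition returns more tuples than the original, yet information is lost: it is no longer possible to tell which employee worked on which project.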
Functional Dependence
• Given relation R defined over U = {A1, A2, ..., An} where X ⊆ U, Y ⊆ U. If, for all pairs of tuples t1 and t2 in any legal instance of relation scheme R,
t1[X] = t2[X] ⇒ t1[Y] = t2[Y],
then the functional dependency X → Y holds in R.
• Example
➡ In relation EMP
✦ (ENO, PNO) → (ENAME, TITLE, SAL, DUR, RESP)
➡ In relation PROJ
✦ PNO → (PNAME, BUDGET)
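The definition of functional dependence translates directly into a check over one instance. This is a sketch with a hypothetical helper name, fd_holds. Note that a single instance can only refute a dependency; the definition requires it to hold in every legal instance.

```python
def fd_holds(rows, X, Y):
    """Check t1[X] = t2[X] => t1[Y] = t2[Y] for all pairs of tuples in rows."""
    seen = {}  # X-value -> the Y-value first observed with it
    for t in rows:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if seen.setdefault(x_val, y_val) != y_val:
            return False  # same X-value, different Y-value
    return True

emp = [
    {"ENO": "E2", "PNO": "P1", "ENAME": "M. Smith", "RESP": "Analyst"},
    {"ENO": "E2", "PNO": "P2", "ENAME": "M. Smith", "RESP": "Analyst"},
    {"ENO": "E3", "PNO": "P3", "ENAME": "A. Lee", "RESP": "Consultant"},
    {"ENO": "E3", "PNO": "P4", "ENAME": "A. Lee", "RESP": "Engineer"},
]

key_fd = fd_holds(emp, ["ENO", "PNO"], ["ENAME", "RESP"])  # not violated
name_fd = fd_holds(emp, ["ENO"], ["ENAME"])                # not violated
resp_fd = fd_holds(emp, ["ENO"], ["RESP"])                 # violated by E3
print(key_fd, name_fd, resp_fd)
```

That ENO alone determines ENAME while the key is (ENO, PNO) is a partial dependency, the kind that 2NF removes.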
Normal Forms Based on FDs
First Normal Form (1NF)
➡ 1NF eliminates the relations within relations or relations as attributes of tuples.
↓ eliminate the partial functional dependencies of non-prime attributes to key attributes
Second Normal Form (2NF)
↓ eliminate the transitive functional dependencies of non-prime attributes to key attributes
Third Normal Form (3NF)
↓ eliminate the partial and transitive functional dependencies of prime (key) attributes to key
Boyce-Codd Normal Form (BCNF)
• Decomposition up to 3NF: lossless and dependency preserving
• Decomposition to BCNF: lossless
Normalized Relations – Example
Relational Algebra
• Specify how to obtain the result using a set of operators
• Form
Operator<parameters> <Operands> → <Result>
➡ Operands: relation(s)
➡ Result: a relation
Relational Algebra Operators
• Fundamental
➡ Selection
➡ Projection
➡ Union
➡ Set difference
➡ Cartesian product
• Additional
➡ Intersection
➡ θ-join
➡ Natural join
➡ Semijoin
➡ Division
• Union compatibility
➡ Same degree
➡ Corresponding attributes defined over the same domain
Selection
• Produces a horizontal subset of the operand relation
• General form
σ_F(R) = {t | t ∈ R and F(t) is true}
where
➡ R is a relation, t is a tuple variable
➡ F is a formula consisting of
✦ operands that are constants or attributes
✦ arithmetic comparison operators <, >, =, ≠, ≤, ≥
✦ logical operators ∧, ∨, ¬
Selection Example
EMP
ENO | ENAME | TITLE
E1 | J. Doe | Elect. Eng.
E2 | M. Smith | Syst. Anal.
E3 | A. Lee | Mech. Eng.
E4 | J. Miller | Programmer
E5 | B. Casey | Syst. Anal.
E6 | L. Chu | Elect. Eng.
E7 | R. Davis | Mech. Eng.
E8 | J. Jones | Syst. Anal.

σ_TITLE='Elect. Eng.'(EMP)
ENO | ENAME | TITLE
E1 | J. Doe | Elect. Eng.
E6 | L. Chu | Elect. Eng.
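The selection example can be sketched in a few lines. It assumes tuples are dictionaries and the formula F is passed as a Python predicate; the function name select is hypothetical, and only four of the eight EMP tuples are listed.

```python
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng."},
    {"ENO": "E6", "ENAME": "L. Chu", "TITLE": "Elect. Eng."},
]

def select(relation, formula):
    """Selection: keep the tuples t of relation for which formula(t) is true."""
    return [t for t in relation if formula(t)]

# Selection with F being TITLE = 'Elect. Eng.'
engineers = select(EMP, lambda t: t["TITLE"] == "Elect. Eng.")
for t in engineers:
    print(t["ENO"], t["ENAME"], t["TITLE"])
```

The result keeps every attribute and drops whole tuples, which is why selection is described as a horizontal subset.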
Projection
• Produces a vertical slice of a relation
• General form
Π_A1,…,An(R) = {t[A1,…,An] | t ∈ R}
where
➡ R is a relation, t is a tuple variable
➡ {A1,…, An} is a subset of the attributes of R over which the projection will
be performed
• Note: projection can generate duplicate tuples. Commercial
systems (and SQL) allow this and provide
➡ Projection with duplicate elimination
➡ Projection without duplicate elimination
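The two flavours of projection differ only in whether duplicates are dropped. A minimal sketch, assuming tuples are dictionaries; the function name project and its distinct flag are hypothetical.

```python
def project(rows, attrs, distinct=True):
    """Projection over attrs; with distinct=False duplicates are kept."""
    out = [tuple(r[a] for a in attrs) for r in rows]
    if distinct:
        # dict.fromkeys keeps the first occurrence of each tuple, in order
        out = list(dict.fromkeys(out))
    return out

emp = [
    {"ENO": "E1", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "TITLE": "Syst. Anal."},
    {"ENO": "E5", "TITLE": "Syst. Anal."},
]

with_duplicates = project(emp, ["TITLE"], distinct=False)
without_duplicates = project(emp, ["TITLE"])
print(len(with_duplicates), len(without_duplicates))
```

In SQL the same distinction appears as SELECT versus SELECT DISTINCT.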
Projection Example
PROJ
PNO | PNAME | BUDGET
P1 | Instrumentation | 150000
P2 | Database Develop. | 135000
P3 | CAD/CAM | 250000
P4 | Maintenance | 310000

Π_PNO,BUDGET(PROJ)
PNO | BUDGET
P1 | 150000
P2 | 135000
P3 | 250000
P4 | 310000
Union
• Similar to set union
• General form
R ∪ S = {t | t ∈ R or t ∈ S}
where R, S are relations, t is a tuple variable
➡ Result contains tuples that are in R, in S, or in both; a tuple that appears in both is kept only once (duplicates removed)
➡ R, S should be union-compatible
Set Difference
• General form
R – S = {t | t ∈ R and t ∉ S}
where R and S are relations, t is a tuple variable
➡ Result contains all tuples that are in R, but not in S.
➡ R – S ≠ S – R
➡ R, S union-compatible
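Union and set difference map onto Python's built-in set operators once the relations are union-compatible. The sketch below models a scheme as a list of (attribute, domain) pairs with Python types standing in for domains; all scheme and attribute names are hypothetical.

```python
def union_compatible(r_scheme, s_scheme):
    """Same degree, and corresponding attributes defined over the same domain."""
    return len(r_scheme) == len(s_scheme) and all(
        dr is ds for (_, dr), (_, ds) in zip(r_scheme, s_scheme)
    )

R_SCHEME = [("ENO", str), ("TITLE", str)]
S_SCHEME = [("EMPNO", str), ("JOB", str)]   # different names, same domains
T_SCHEME = [("PNO", str), ("BUDGET", int)]  # second domain differs

R = {("E1", "Elect. Eng."), ("E2", "Syst. Anal.")}
S = {("E2", "Syst. Anal."), ("E3", "Mech. Eng.")}

union = R | S        # the shared tuple appears once
r_minus_s = R - S
s_minus_r = S - R    # difference is not commutative
print(len(union), r_minus_s, s_minus_r)
```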
Cartesian (Cross) Product
• Given relations
➡ R of degree k1 , cardinality n1
➡ S of degree k2 , cardinality n2
• Cartesian (cross) product:
R × S = {t[A1,…,Ak1, Ak1+1,…,Ak1+k2] | t[A1,…,Ak1] ∈ R and t[Ak1+1,…,Ak1+k2] ∈ S}
The result of R × S is a relation of degree (k1 + k2) and consists of all (n1 * n2) tuples, where each tuple is a concatenation of one tuple of R with one tuple of S.
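The degree and cardinality of a Cartesian product can be verified on a small instance. A minimal sketch using itertools.product from the Python standard library; the function name cartesian is hypothetical.

```python
from itertools import product

EMP = [("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal.")]
PAY = [("Elect. Eng.", 55000), ("Syst. Anal.", 70000), ("Mech. Eng.", 45000)]

def cartesian(r, s):
    """Concatenate every tuple of r with every tuple of s."""
    return [t + u for t, u in product(r, s)]

prod = cartesian(EMP, PAY)
degree = len(prod[0])    # k1 + k2 = 3 + 2
cardinality = len(prod)  # n1 * n2 = 2 * 3
print(degree, cardinality)
```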
Cartesian Product Example
EMP
ENO | ENAME | TITLE
E1 | J. Doe | Elect. Eng.
E2 | M. Smith | Syst. Anal.
E3 | A. Lee | Mech. Eng.
E4 | J. Miller | Programmer
E5 | B. Casey | Syst. Anal.
E6 | L. Chu | Elect. Eng.
E7 | R. Davis | Mech. Eng.
E8 | J. Jones | Syst. Anal.

PAY
TITLE | SALARY
Elect. Eng. | 55000
Syst. Anal. | 70000
Mech. Eng. | 45000
Programmer | 60000

EMP × PAY
ENO | ENAME | EMP.TITLE | PAY.TITLE | SALARY
E1 | J. Doe | Elect. Eng. | Elect. Eng. | 55000
E1 | J. Doe | Elect. Eng. | Syst. Anal. | 70000
E1 | J. Doe | Elect. Eng. | Mech. Eng. | 45000
E1 | J. Doe | Elect. Eng. | Programmer | 60000
E2 | M. Smith | Syst. Anal. | Elect. Eng. | 55000
E2 | M. Smith | Syst. Anal. | Syst. Anal. | 70000
E2 | M. Smith | Syst. Anal. | Mech. Eng. | 45000
E2 | M. Smith | Syst. Anal. | Programmer | 60000
E3 | A. Lee | Mech. Eng. | Elect. Eng. | 55000
E3 | A. Lee | Mech. Eng. | Syst. Anal. | 70000
E3 | A. Lee | Mech. Eng. | Mech. Eng. | 45000
E3 | A. Lee | Mech. Eng. | Programmer | 60000
…
E8 | J. Jones | Syst. Anal. | Elect. Eng. | 55000
E8 | J. Jones | Syst. Anal. | Syst. Anal. | 70000
E8 | J. Jones | Syst. Anal. | Mech. Eng. | 45000
E8 | J. Jones | Syst. Anal. | Programmer | 60000
Intersection
• Typical set intersection
R ∩ S = {t | t ∈ R and t ∈ S}
= R – (R – S)
• R, S union-compatible
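The identity that expresses intersection through set difference can be checked directly. A minimal sketch on two small union-compatible relations with made-up tuples:

```python
R = {("E1", "Elect. Eng."), ("E2", "Syst. Anal."), ("E3", "Mech. Eng.")}
S = {("E2", "Syst. Anal."), ("E3", "Mech. Eng."), ("E4", "Programmer")}

direct = R & S            # tuples in both R and S
derived = R - (R - S)     # remove from R whatever is not in S
print(direct == derived)
```

This is why intersection is listed as an additional operator rather than a fundamental one.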
θ-Join
• General form
R ⋈_F(R.Ai, S.Bj) S = {t[A1,…,An,B1,…,Bm] | t[A1,…,An] ∈ R and t[B1,…,Bm] ∈ S and F(R.Ai, S.Bj) is true}
where
➡ R, S are relations, t is a tuple variable
➡ F(R.Ai, S.Bj) is a formula defined as that of selection.
• A derivative of Cartesian product
➡ R ⋈_F S = σ_F(R × S)
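Because the θ-join is a selection over a Cartesian product, it can be sketched by filtering itertools.product. The GRADE relation, its salary bands, and the name theta_join are hypothetical; they exist only to give a non-equality join condition.

```python
from itertools import product

def theta_join(r, s, F):
    """Selection over the product: keep t + u whenever F(t, u) is true."""
    return [t + u for t, u in product(r, s) if F(t, u)]

EMP = [("E1", "J. Doe", 40000), ("E4", "J. Miller", 24000)]
# Hypothetical relation GRADE(NAME, LOW, HIGH), not part of the slides.
GRADE = [("low", 0, 30000), ("high", 30001, 100000)]

# F: GRADE.LOW <= EMP.SAL and EMP.SAL <= GRADE.HIGH
joined = theta_join(EMP, GRADE, lambda t, u: u[1] <= t[2] <= u[2])
for row in joined:
    print(row)
```

Materializing the full product first is the simplest formulation, not an efficient one; the sketch only mirrors the algebraic definition.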
Join Example
Types of Join
• Equi-join
➡ The formula F only contains equality
➡ R ⋈_R.A=S.B S
• Natural join
➡ Equi-join of two relations R and S over an attribute (or attributes) common to both R and S and projecting out one copy of those attributes
➡ R ⋈ S = Π_R∪S(σ_F(R × S))
Natural Join Example
EMP
ENO | ENAME | TITLE
E1 | J. Doe | Elect. Eng.
E2 | M. Smith | Syst. Anal.
E3 | A. Lee | Mech. Eng.
E4 | J. Miller | Programmer
E5 | B. Casey | Syst. Anal.
E6 | L. Chu | Elect. Eng.
E7 | R. Davis | Mech. Eng.
E8 | J. Jones | Syst. Anal.

PAY
TITLE | SALARY
Elect. Eng. | 55000
Syst. Anal. | 70000
Mech. Eng. | 45000
Programmer | 60000

EMP ⋈ PAY
ENO | ENAME | TITLE | SALARY
E1 | J. Doe | Elect. Eng. | 55000
E2 | M. Smith | Syst. Anal. | 70000
E3 | A. Lee | Mech. Eng. | 45000
E4 | J. Miller | Programmer | 60000
E5 | B. Casey | Syst. Anal. | 70000
E6 | L. Chu | Elect. Eng. | 55000
E7 | R. Davis | Mech. Eng. | 45000
E8 | J. Jones | Syst. Anal. | 70000

Join is over the common attribute TITLE
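A natural join over dictionaries is a compact way to see both halves of the definition: equality on the common attributes, and one copy of those attributes in the result. The function name natural_join is hypothetical, and PAY is deliberately cut down to two tuples to show what happens to an employee without a match.

```python
def natural_join(r, s):
    """Join rows that agree on all shared attribute names; keep one copy of each."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                out.append({**t, **u})  # shared attributes appear once
    return out

EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
]
PAY = [
    {"TITLE": "Elect. Eng.", "SALARY": 55000},
    {"TITLE": "Syst. Anal.", "SALARY": 70000},
]

result = natural_join(EMP, PAY)
for row in result:
    print(row)
```

E4 is absent from the result because no PAY tuple has the title Programmer in this reduced instance.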
Types of Join
• Outer-Join
➡ Ensures that tuples from one or both relations that do not satisfy the join condition still appear in the final result, with the other relation’s attribute values set to NULL
➡ Left outer join
➡ Right outer join
➡ Full outer join
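A left outer join can be sketched by padding unmatched left tuples, with Python's None standing in for NULL. The function name left_outer_join and its s_attrs parameter are hypothetical; the join condition assumed here is equality on the shared attribute names.

```python
def left_outer_join(r, s, s_attrs):
    """Natural join that also keeps unmatched tuples of r, padded with None."""
    out = []
    for t in r:
        matches = [u for u in s
                   if all(t[a] == u[a] for a in t.keys() & u.keys())]
        if matches:
            out.extend({**t, **u} for u in matches)
        else:
            # None for every attribute of s, then t's own values on top
            out.append({**{a: None for a in s_attrs}, **t})
    return out

EMP = [
    {"ENO": "E1", "TITLE": "Elect. Eng."},
    {"ENO": "E4", "TITLE": "Programmer"},
]
PAY = [{"TITLE": "Elect. Eng.", "SALARY": 55000}]

result = left_outer_join(EMP, PAY, ["TITLE", "SALARY"])
for row in result:
    print(row)
```

A right outer join is the same operation with the operands swapped, and a full outer join keeps the unmatched tuples of both sides.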
Outer Join Example
• Left outer join
Semijoin
• Derivation
R ⋉_F S = Π_A(R ⋈_F S) = Π_A(R) ⋈_F Π_A∩B(S) = R ⋈_F Π_A∩B(S)
where
➡ R, S are relations
➡ A is the set of attributes of R and B is the set of attributes of S
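The last form of the derivation says that only the join attribute values of S are needed to compute the semijoin. A minimal sketch for an equality condition on one attribute pair; the function name semijoin and its parameters a and b are hypothetical.

```python
def semijoin(r, s, a, b):
    """Tuples of r that join with at least one tuple of s on r.a = s.b."""
    s_values = {u[b] for u in s}  # projection of s on the join attribute
    return [t for t in r if t[a] in s_values]

EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
]
PAY = [
    {"TITLE": "Elect. Eng.", "SALARY": 55000},
    {"TITLE": "Syst. Anal.", "SALARY": 70000},
]

kept = semijoin(EMP, PAY, "TITLE", "TITLE")
for t in kept:
    print(t)
```

The result has the attributes of EMP only, and SALARY never has to be read, which matches the projection of S onto the join attributes in the derivation.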
Semijoin Example
Division (Quotient)
• Given relations
➡ R of degree k1 (R = {A1,…,Ak1})
➡ S of degree k2 (S = {B1,…,Bk2})
Let A = {A1,…,Ak1} [i.e., R(A)] and B = {B1,…,Bk2} [i.e., S(B)] and B ⊆ A.
Then, T = R ÷ S gives T of degree k1 – k2 [i.e., T(Y) where Y = A – B] such that for a tuple t to appear in T, the values in t must appear in R in combination with every tuple in S.
• Derivation
R ÷ S = Π_Y(R) – Π_Y((Π_Y(R) × S) – R)
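The derivation of division can be followed step by step on a small instance. The sketch assumes that the attributes of S are the last columns of R and that S is non-empty; the relation contents and the function name divide are hypothetical.

```python
def divide(r, s):
    """R ÷ S, assuming the attributes of S are the last columns of R."""
    k = len(next(iter(s)))                          # degree of S
    proj_y = {t[:-k] for t in r}                    # projection of R on Y
    all_pairs = {y + u for y in proj_y for u in s}  # that projection × S
    missing = all_pairs - r                         # combinations absent from R
    return proj_y - {t[:-k] for t in missing}       # drop Y-values with a gap

# R(ENO, PNO): who works on what.  S(PNO): the projects of interest.
R = {("E1", "P1"), ("E1", "P2"), ("E2", "P1"), ("E3", "P1"), ("E3", "P2")}
S = {("P1",), ("P2",)}

T = divide(R, S)
print(sorted(T))
```

E2 is excluded because the combination (E2, P2) is missing from R, so E2 does not appear together with every tuple of S.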
Division Example
Relational Calculus
• Specify the properties that the result should hold
• Tuple relational calculus
• Domain relational calculus
Tuple Relational Calculus
• Query of the form {t | F(t)} where
➡ t is a tuple variable
➡ F is a well-formed formula
• Atomic formula
➡ Tuple-variable membership expressions
✦ R.t or R(t): tuple t belongs to relation R
➡ Conditions
✦ s[A] θ t[B]; s and t are tuple variables, A and B are components of s and t, respectively, θ ∈ {<, >, =, ≠, ≤, ≥}; e.g., s[SAL] > t[SAL]
✦ s[A] θ c; s, A, and θ as defined above, c is a constant; e.g., s[ENAME] = ‘Smith’
• SQL is an example of tuple relational calculus (at least in its simple form)
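Python set comprehensions read much like tuple relational calculus queries, which makes them a convenient way to illustrate the declarative style. In this sketch the Emp type, the salary threshold, and the use of an existential quantifier in the second query are assumptions made for illustration.

```python
from collections import namedtuple

Emp = namedtuple("Emp", ["ENO", "ENAME", "SAL"])
EMP = {
    Emp("E1", "J. Doe", 40000),
    Emp("E2", "M. Smith", 34000),
    Emp("E4", "J. Miller", 24000),
}

# {t | EMP(t) and t[SAL] > 35000}: membership plus a condition of the form s[A] θ c
well_paid = {t for t in EMP if t.SAL > 35000}

# {s | EMP(s) and there exists t (EMP(t) and s[SAL] > t[SAL])}:
# a condition of the form s[A] θ t[B] over two tuple variables
not_lowest = {s for s in EMP if any(s.SAL > t.SAL for t in EMP)}

print(sorted(t.ENO for t in well_paid))
print(sorted(s.ENO for s in not_lowest))
```

Both queries state which tuples qualify and leave the evaluation order unspecified, in contrast to a relational algebra expression.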
Domain Relational Calculus
• Query of the form x1, x2, …, xn | F(x1, x2, …, xn) where
➡ F is a well-formed formula in which x1, x2, …, xn are the free variables
• QBE is an example
Computer Network
• An interconnected collection of autonomous computers that are capable of exchanging information among themselves.
• Components
➡ Hosts (nodes, end systems)
➡ Switches
➡ Communication links
Internet
• Network of networks
Types of Networks
• According to scale (geographic distribution)
➡ Wide area network (WAN)
✦ Distance between any two nodes > 20 km and can go as high as thousands of km
✦ Long delays due to distance traveled
✦ Heterogeneity of transmission media
✦ Speeds of 150 Mbps to 10 Gbps (OC192 on the backbone)
➡ Local area network (LAN)
✦ Limited in geographic scope (usually < 2 km)
✦ Speeds 10-1000 Mbps
✦ Short delays and low noise
➡ Metropolitan area network (MAN)
✦ In between LAN and WAN
Types of Networks (cont’d)
• Topology
➡ Irregular
✦ No regularity in the interconnection – e.g., Internet
➡ Bus
✦ Typical in LANs – Ethernet
✦ Using Carrier Sense Multiple Access with Collision Detection (CSMA/CD)
✓ Listen before and while you transmit
➡ Star
➡ Ring
➡ Mesh
Bus network
Communication Schemes
• Point-to-point (unicast)
➡ One or more (direct or indirect) links between each pair of nodes
➡ Communication always between two nodes
➡ Receiver and sender are identified by their addresses included in the message header
➡ Message may follow one of many links between the sender and receiver using switching or routing
• Broadcast (multi-point)
➡ Messages are transmitted over a shared channel and received by all the nodes
➡ Each node checks the address and, if it is not the intended recipient, ignores the message
➡ Multi-cast: special case
✦ Message is sent to a subset of the nodes
Communication Alternatives
• Twisted pair
• Coaxial
• Fiber optic cable
• Satellite
• Microwave
• Wireless
Data Communication
• Hosts are connected by links, each of which can carry one or more channels
• Link: physical entity; channel: logical entity
• Digital signal versus analog signal
• Capacity – bandwidth
➡ The amount of information that can be transmitted over the channel in a given time unit
• Alternative messaging schemes
➡ Packet switching
✦ Messages are divided into fixed size packets, each of which is routed from the source to the destination
➡ Circuit switching
✦ A dedicated channel is established between the sender and receiver for the duration of the session
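Packet switching can be illustrated by splitting a message into packets and reassembling it at the destination. This is a toy sketch: the packet fields (src, dst, seq, payload) and both function names are hypothetical and do not reflect any real protocol's packet format.

```python
def packetize(message: bytes, size: int, src: str, dst: str):
    """Split message into packets carrying at most size bytes of payload."""
    return [
        {"src": src, "dst": dst, "seq": i // size, "payload": message[i:i + size]}
        for i in range(0, len(message), size)
    ]

def reassemble(packets):
    """Restore the message; packets may arrive in any order."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return b"".join(p["payload"] for p in ordered)

message = b"distributed database"
packets = packetize(message, 8, "A", "B")

# Each packet is routed independently, so arrival order is not guaranteed.
arrived = list(reversed(packets))
print(len(packets), reassemble(arrived) == message)
```

The last packet here is shorter than the others; a scheme with strictly fixed-size packets would pad it, which the sketch leaves out.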
Packet Format
Communication Protocols
• Software that ensures error-free, reliable and efficient communication between hosts
• Layered architecture – hence protocol stack or protocol suite
• TCP/IP is the best-known one
➡ Used in the Internet
Message Transmission using TCP/IP
TCP/IP Protocol