Database Systems Lecture #5

advertisement
Database Systems
Lecture #5
Yan Pan
School of Software, SYSU
2011
Agenda

Last time: FDs

This time:
Anomalies
Normalization: BCNF & 3NF
Next time: RA & SQL
1.
2.

2
Review: FDs

Definition:
If two tuples agree on the attributes
A1, A2, …, An
then they must also agree on the attributes
B1, B2, …, Bm


3
Notation: A1, A2, …, An  B1, B2, …, Bm
Read as: Ai functionally determines Bj
Review: Combining FDs
If some FDs are satisfied, then
others are satisfied too
If all these FDs are true:
name  color
category  department
color, category  price
Then this FD also holds: name, category  price
Why?
4
Problem: find all FDs


Given a relation instance and set of given FDs
Find all FD’s satisfied by that instance

Useful if we don’t get enough information from our
users: need to reverse engineer a data instance

Q: How long does this take?
A: Some time for each subset of atts
Q: How many subsets?





5
powerset
 exponential time in worst-case
But can often be smarter…
Closure Algorithm
Start with X={A1, …, An}.
Repeat:
B1, …, Bn  C is a FD and
B1, …, Bn are all in X
then add C to X.
if
until X didn’t change
6
Example:
name  color
category  department
color, category  price
{name, category}+ =
{name, category, color,
department, price}
Closure alg e.g.
Example:
A, B  C
A, D  B
B
 D
Compute X+, for every set X (AB is shorthand for {A,B}):
A+ = A, B+ = BD, C+ = C, D+ = D
AB+ = ABCD, AC+ = AC, AD+ = ABCD, BC+ = BC, BD+ = BD, CD+ = CD
ABC+ = ABD+ = ACD+ = ABCD (no need to compute–why?)
BCD+ = BCD, ABCD+ = ABCD
What are the keys?
7
Closure alg e.g.
In class:
A, B 
A, D 
B

A, F 
R(A,B,C,D,E,F)
Compute {A,B}+
X = {A, B,
}
Compute {A, F}+
X = {A, F,
}
What are the keys?
8
C
E
D
B
Closure alg e.g.

Product(name, price, category, color)
name, category  price
category  color
FDs are:
Keys are:

{name, category}
Enrollment(student, address, course, room, time)
student  address
room, time  course
student, course  room, time
FDs are:
Keys are:
9
Next topic: Anomalies

Identify anomalies in existing schemata

Decomposition by projection

BCNF

Lossy v. lossless

Third Normal Form
10
Types of anomalies

Redundancy


Update anomalies:


Delete some values & lose other values too
Insert anomalies:

11
Change info in one tuple but not in another
Deletion anomalies:


Repeat info unnecessarily in several tuples
Inserting row means having to insert other, separate info /
null-ing it out
Examples of anomalies
Name
SSN
Mailing-address
Phone
Michael
123
NY
212-111-1111
Michael
123
NY
917-111-1111
Hilary
456
DC
202-222-2222
Hilary
456
DC
914-222-2222
Bill
789
Chappaqua
914-222-2222
Bill
789
Chappaqua
212-333-3333
SSN  Name, Mailing-address






Redundancy: name, maddress
Update anomaly: Bill moves
Delete anom.: Bill doesn’t pay bills, lose phones  lose Bill!
Insert anom: can’t insert someone without a (non-null) phone
Underlying cause: SSN-phone is many-many
Effect: partial dependency ssn  name, maddress,

12
SSN  Phone
Whereas key = {ssn,phone}
Decomposition by projection


Soln: replace anomalous R with projections of R onto
two subsets of attributes
Projection: an operation in Relational Algebra


Projecting R onto attributes (A1,…,An) means
removing all other attributes



13
Corresponds to SELECT command in SQL
Result of projection is another relation
Yields tuples whose fields are A1,…,An
Resulting duplicates ignored
Projection for decomposition
R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)
R1(A1, ..., An, B1, ..., Bm)
R2(A1, ..., An, C1, ..., Cp)
R1 = projection of R on A1, ..., An, B1, ..., Bm
R2 = projection of R on A1, ..., An, C1, ..., Cp
A1, ..., An  B1, ..., Bm  C1, ..., Cp = all attributes
R1 and R2 may (/not) be reassembled to produce original R
14
Decomposition example
Break
the
relation
into
two:
Name
SSN
Mailing-address
Phone
Michael
123
NY
212-111-1111
Michael
123
NY
917-111-1111
Hilary
456
DC
202-222-2222
Hilary
456
DC
914-222-2222
Bill
789
Chappaqua
914-222-2222
Bill
789
Chappaqua
212-333-3333
Name
SSN
Mailing-address
SSN
Phone
Michael
123
NY
123
212-111-1111
Hilary
456
DC
123
917-111-1111
Bill
789
Chappaqua
456
202-222-2222
The anomalies are gone
456
914-222-2222
789
914-222-2222
789
212-333-3333




15
No more redundant data
Easy to for Bill to move
Okay for Bill to lose all phones
Thus: high-level strategy
name
E/R Model:
Product
price
Relational Model:
plus FD’s
Normalization:
Eliminates anomalies
16
Person
buys
name
ssn
Using FDs to produce good schemas
Start with set of relations
Define FDs (and keys) for them based on real
world
Transform your relations to “normal form”
(normalize them)
1.
2.
3.

Intuitively, good design means



17
Do this using “decomposition”
No anomalies
Can reconstruct all (and only the) original information
Decomposition terminology



Projection: eliminating certain atts from relation
Decomposition: separating a relation into two by
projection
Join: (re)assembling two relations


If exactly the original rows are reproduced by joining
the relations, then the decomposition was lossless


18
Whenever a row from R1 and a row from R2 have the same
value for some atts A, join together to form a row of R3
We join on the attributes R1 and R2 have in common (As)
If it can’t, the decomposition was lossy
Lossless Decompositions
A decomposition is lossless if we can recover:
R(A,B,C)
Decompose
R1(A,B)
R2(A,C)
Recover
R’(A,B,C) should be the same as
R(A,B,C)
R’ is in general larger than R. Must ensure R’ = R
19
Lossless decomposition




20
Sometimes the same set of data is reproduced:
Name
Price
Category
Word
100
WP
Oracle
1000
DB
Access
100
DB
Name
Price
Name
Category
Word
100
Word
WP
Oracle
1000
Oracle
DB
Access
100
Access
DB
(Word, 100) + (Word, WP)  (Word, 100, WP)
(Oracle, 1000) + (Oracle, DB)  (Oracle, 1000, DB)
(Access, 100) + (Access, DB)  (Access, 100, DB)
Lossy decomposition






21
Sometimes it’s not:
Name
Price
Category
Word
100
WP
Oracle
1000
DB
Access
100
DB
What’s
wrong?
Category
Name
Category
Price
WP
Word
WP
100
DB
Oracle
DB
1000
DB
Access
DB
100
(Word, WP) + (100, WP)  (Word, 100, WP)
(Oracle, DB) + (1000, DB)  (Oracle, 1000, DB)
(Oracle, DB) + (100, DB)  (Oracle, 100, DB)
(Access, DB) + (1000, DB)  (Access, 1000, DB)
(Access, DB) + (100, DB)  (Access, 100, DB)
Ensuring lossless decomposition
R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)
R1(A1, ..., An, B1, ..., Bm)
R2(A1, ..., An, C1, ..., Cp)
If A1, ..., An  B1, ..., Bm or A1, ..., An  C1, ..., Cp
Then the decomposition is lossless
Note: don’t need both



22
Examples:
name  price, so first decomposition was lossless
category  name and category  price, and so second
decomposition was lossy
Quick lossless/lossy example
X
1
4
Y
2
2
Z
3
5

At a glance: can we decompose into R1(Y,X), R2(Y,Z)?

At a glance: can we decompose into R1(X,Y), R2(X,Z)?
23
Next topic: Normal Forms

First Normal Form = all attributes are atomic


As opposed to set-valued
Assumed all along

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)
24
BCNF definition

A simple condition for removing anomalies from
relations:
A relation R is in BCNF if:
If As  Bs is a non-trivial dependency
in R , then As is a superkey for R

I.e.: The left side must always contain a key
I.e: If a set of attributes determines other attributes,
it must determine all the attributes

Slogan: “In every FD, the left side is a superkey.”

25
BCNF decomposition algorithm
Repeat
choose A1, …, Am  B1, …, Bn that violates the BNCF condition
//Heuristic: choose Bs as large as possible
split R into R1(A1, …, Am, B1, …, Bn) and R2(A1, …, Am, [others])
continue with both R1 and R2
Until no more violations
B’s
R1
26
A’s
Others
R2
Boyce-Codd Normal Form

Name/phone example is not BCNF:
Name
SSN
Mailing-address
Phone
Michael
123
NY
212-111-1111
Michael
123
NY
917-111-1111


{ssn,phone} is key
FD: ssn  name,mailing-address holds


Violates BCNF: ssn is not a superkey
Its decomposition is BCNF

Only superkeys  anything else
Name
SSN
Mailing-address
SSN
PhoneNumber
Michael
123
NY
123
212-111-1111
123
917-111-1111
27
BCNF Decomposition



Larger example: multiple decompositions
{Title, Year, Studio, President, Pres-Address}
FDs:






No many-many this time
Problem cause: transitive FDs:

28
Title, Year  Studio
Studio  President
President  Pres-Address
=> Studio  President, Pres-Address
Title, year  studio  president
BCNF Decomposition
Illegal: As  Bs, where As not a superkey
Decompose: Studio  President, Pres-Address





Result:

1.
2.
Studios(studio, president, pres-address)
Movies(studio, title, year)
Is (2) in BCNF? Is (1) in BCNF?





29
As = {studio}
Bs = {president, pres-address}
Cs = {title, year}
Key: Studio
FD: President  Pres-Address
Q: Does president  studio? If so, president is a key
But if not, it violates BCNF
BCNF Decomposition



Studios(studio, president, pres-address)
Illegal: As  Bs, where As not a superkey
 Decompose: President  Pres-Address




{Studio, President, Pres-Address} becomes


30
As = {president}
Bs = {pres-address}
Cs = {studio}
{President, Pres-Address}
{Studio, President}
Decomposition algorithm example







31
R(N,O,R,P)
F = {N  O, O  R, R  N}
Name
Office
Residence
Phone
George
Pres.
WH
202-…
George
Pres.
WH
486-…
Dick
VP
NO
202-…
Dick
VP
NO
Key: N,P
Violations of BCNF: N  O, OR, N OR
Pick N  OR
Can we rejoin?
What happens if we pick N  O instead?
Can we rejoin?
307-…
An issue with BCNF

We could lose FDs

Relation: R(Title, Theater, Neighboorhood)
FDs:
 Title,N’hood  Theater


Assume a movie shouldn’t play twice in same n’hood
Theater  N’hood
Keys:
Title
Theater
 {Title, N’hood}
Aviator
Angelica
 {Theater, Title}
Life Aquatic Angelica


32
N’hood
Village
Village
Losing FDs


BCNF violation: Theater  N’hood
Decompose:



{Theater, N’Hood}
{Theater, Title}
Resulting relations:
R1
R2
Theater
N’hood
Theater
Title
Angelica
Village
Angelica
Angelica
Aviator
Life Aquatic
33
Losing FDs

Suppose we add new rows to R1 and R2:
R1 Theater
N’hood
R2 Theater
Title
Angelica
Village
Angelica
Life Aquatic
Film Forum
Village
Angelica
Aviator
Film Forum
Life Aquatic
R’

34
Theater
N’hood
Title
Angelica
Village
Life Aquatic
Angelica
Village
Aviator
Film Forum
Village
Life Aquatic
Neither R1 nor R2 enforces FD Title,N’hood  Theater
Third normal form: motivation

Sometimes



In these cases BCNF is too severe a req.


“over-normalization”
Solution: define a weaker normal form, 3NF


35
BCNF is not dependency-preserving, and
Efficient checking for FD violation on updates is important
FDs can be checked on individual relations without
performing a join (no inter-relational FDs)
relations can be converted, preserving both data and FDs
BCNF lossiness

NB: BCNF decomp. is not data-lossy


But: it can lose dependencies


36
Results can be rejoined to obtain the exact original
After decomp, now legal to add rows whose corresponding
rows would be illegal in (rejoined) original
Data-lossy v. FD-lossy
Third Normal Form

Now define the (weaker) Third Normal Form

Turns out: this example was already in 3NF
A relation R is in 3rd normal form if :
For every nontrivial dependency A1, A2, ..., An  B
for R, {A1, A2, ..., An } is a super-key for R,
or B is part of a key, i.e., B is prime
Tradeoff:
BCNF = no FD anomalies, but may lose some FDs
3NF = keeps all FDs, but may have some anomalies
37
Canonical Cover

Consider a set F of functional dependencies and the functional dependency
   in F.
From A  C




A canonical coverFc for F is a set of dependencies such that F logically
implies all dependencies in Fc and Fc logically implies all dependencies in F,
and further


38
Attribute A is extraneous in  if A  and if A is removed from , the Iset
getofAB  C
functional dependencies implied by F doesn’t change.
Given AB  C and A  C then B is extraneous in AB
Attribute A is extraneous in  if A   and if A is removed from , the set of
functional dependencies implied by F doesn’t change.
Given A  BC and A  B then B is extraneous in BC
No functional dependency in Fc contains an extraneous attribute.
Each left side of a functional dependency in Fc is unique.
Canonical Cover

39
Compute a canonical over for F:
repeat
use the union rule to replace any
dependencies in F
1  1 and 1  2 replaced with 1 
12
Find a functional dependency
   with an
extraneous attribute
either in  or in 
If an extraneous attribute is found, delete
it from

until F does not change
Example of Computing a Canonical Cover





40
R = (A, B, C)
F = { A  BC
B  C
A  B
AB  C}
Combine A  BC and A  B into A 
BC
A is extraneous in AB  C because B
 C logically implies AB  C.
C is extraneous in A  BC since A 
BC is logically implied by A  B and B
 C.
The canonical cover is:
AB
BC
A  BC
B  C
AB  C
A  BC
B C
B C
A  BC
B C
A B
B C
3NF Decomposition Algorithm

41
Let Fc be a canonical cover for F;
i := 0;
for each functional dependency    in Fc do
if none of the schemas Rj, 1<= j <= i contains 
then begin
i:=i+1;
Ri:= ;
end
if none of the schemas Rj, 1<= j <= i contains a candidate key for R
then begin
i:=i+1;
Ri:= any candidate key for R;
end
return (R1, R2, …, Ri)
Example

Relation schema:
Banker-info-schema
branch-name, customer-name, banker-name, office-number


42
The functional dependencies for this
relation schema are:
banker-name  branch-name, officenumber
customer-name, branch-name 
banker-name
The key is:
{customer-name, branch-name}
Applying 3NF to banker - info schema

Go through the for loop in the algorithm:
banker-name  branch-name, office-number
is not in any decomposed relation (no decomposed relation so
far)
Create a new relation:
Banker-office-schema ( banker-name, branch-name, office-number )
customer-name, branch-name  banker-name
is not in any decomposed relation (one decomposed relation so
far) Create a new relation:
Banker-schema ( customer-name, branch-name, banker-name )

43
Since Banker-schema contains a candidate key for Banker-infoschema, we are done with the decomposition process.
Comparison of BCNF and 3NF

It is always possible to decompose a relation
into relations in 3NF and



It is always possible to decompose a relation
into relations in BCNF and


44
the decomposition is lossless
dependencies are preserved
the decomposition is lossless
it may not be possible to preserve dependencies
Next week

45
Read ch.5.1-2 (SQL)
Download