CS 405G: Introduction to Database Systems Lecture 9: Normal Forms Instructor: Chen Qian


Mid project report due after Spring Break.
7/1/2016
Chen Qian @ University of Kentucky
Database Normalization
• Database normalization relates to the level of
redundancy in a relational database’s structure.
• The key idea is to reduce the chance of having multiple
different versions of the same data.
• Well-normalized databases have a schema that reflects
the true dependencies between tracked quantities.
• Any increase in normalization generally involves
splitting existing tables into multiple ones, which must
be re-joined each time a query is issued.
Normalization

Normalization is the process of organizing the fields
and tables of a relational database to minimize
redundancy and dependency.

A normal form is a certification that a relation schema
satisfies a particular set of conditions.
Normal Forms


Edgar F. Codd originally established three normal
forms: 1NF, 2NF and 3NF.
3NF is widely considered to be sufficient.

Normalizing beyond 3NF can be tricky with current
SQL technology as of 2005

Full normalization is considered a good exercise to
help discover all potential internal database
consistency problems.
First Normal Form ( 1NF )

A normal form characterizes a relation (not an attribute, a key,
etc.)
  We can only say “this relation (or table) is in 1NF”

A relation is in first normal form if the domain of each
attribute contains only atomic values, and the value of
each attribute contains only a single value from that
domain.
2nd Normal Form


An attribute A of a relation R is a nonprimary attribute if
it is not part of any key in R; otherwise, A is a primary
attribute.
R is in (general) 2nd normal form if no nonprimary
attribute A in R is partially functionally dependent
on any key of R.
Redundancy Example

A composite key may give rise to a partial dependency of a
nonprimary attribute.

e.g. EID, PID -> Ename holds, but Ename depends on EID alone

In this case, the attribute (Ename) should be moved, together
with the part of the key it fully depends on (EID), into a new table.

So, to check whether a table includes this kind of redundancy, try
every nonprimary attribute and check whether it fully
depends on every key.
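The check described above can be sketched in Python. This is a minimal illustration, not part of the original slides: the helper names (closure, partial_dependencies) are hypothetical, and the FDs EID -> Ename, email and PID -> Pname are assumptions read off the employee/project example used in this lecture.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def partial_dependencies(key, nonprimary, fds):
    """List (part_of_key, attribute) pairs where a proper subset of the
    key already determines a nonprimary attribute."""
    found = []
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds)
            for attr in sorted(nonprimary):
                if attr in determined:
                    found.append((subset, attr))
    return found

# Assumed FDs for the running example (not stated in full on the slide).
fds = [
    (frozenset({"EID"}), frozenset({"Ename", "email"})),
    (frozenset({"PID"}), frozenset({"Pname"})),
    (frozenset({"EID", "PID"}), frozenset({"Hours"})),
]
key = {"EID", "PID"}
nonprimary = {"Ename", "email", "Pname", "Hours"}

violations = partial_dependencies(key, nonprimary, fds)
for subset, attr in violations:
    print(subset, "->", attr)
```

Under these assumptions, Hours is the only nonprimary attribute that fully depends on the key; every reported pair is a 2NF violation.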
Second Normal Form ( 2NF )

• 2NF prescribes full functional dependency on the
primary key.
• It most commonly applies to tables that have composite
primary keys, where two or more attributes comprise the
primary key.
• It requires that there are no non-trivial functional
dependencies of a non-key attribute on a part (subset) of
a candidate key.
• A table is said to be in the 2NF if and only if it is in the
1NF and every non-key attribute is irreducibly dependent
on the primary key.
Decomposition
EID   PID  Ename         email           Pname         Hours
1234  10   John Smith    jsmith@ac.com   B2B platform  10
1123  9    Ben Liu       bliu@ac.com     CRM           40
1234  9    John Smith    jsmith@ac.com   CRM           30
1023  10   Susan Sidhuk  ssidhuk@ac.com  B2B platform  40

Decomposition (EID in the second table is a foreign key into the first):

EID   Ename         email
1234  John Smith    jsmith@ac.com
1123  Ben Liu       bliu@ac.com
1023  Susan Sidhuk  ssidhuk@ac.com

EID   PID  Pname         Hours
1234  10   B2B platform  10
1123  9    CRM           40
1234  9    CRM           30
1023  10   B2B platform  40



Decomposition eliminates redundancy
To get back to the original relation, use natural join.
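As a rough check of the claim that a natural join restores the original relation, the sketch below projects the example table onto the two decomposed schemas and joins them back. The helpers project, natural_join and as_set are hypothetical names written for this illustration; rows are modelled as plain dictionaries.

```python
def project(rows, attrs):
    """Relational projection: keep only attrs and drop duplicate tuples."""
    out = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in out:
            out.append(t)
    return out

def natural_join(left, right):
    """Combine rows that agree on every attribute name they share."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[a] == r[a] for a in shared):
                merged = {**l, **r}
                if merged not in out:
                    out.append(merged)
    return out

def as_set(rows):
    """Order-insensitive view of a relation, for comparison."""
    return {tuple(sorted(r.items())) for r in rows}

cols = ["EID", "PID", "Ename", "email", "Pname", "Hours"]
data = [
    (1234, 10, "John Smith", "jsmith@ac.com", "B2B platform", 10),
    (1123, 9, "Ben Liu", "bliu@ac.com", "CRM", 40),
    (1234, 9, "John Smith", "jsmith@ac.com", "CRM", 30),
    (1023, 10, "Susan Sidhuk", "ssidhuk@ac.com", "B2B platform", 40),
]
original = [dict(zip(cols, row)) for row in data]

employee = project(original, ["EID", "Ename", "email"])
assignment = project(original, ["EID", "PID", "Pname", "Hours"])
rejoined = natural_join(employee, assignment)

print(len(employee), len(assignment), len(rejoined))
print(as_set(rejoined) == as_set(original))
```

John Smith's name and email are stored once after the decomposition instead of once per project, and the join on EID rebuilds the same four rows.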
Decomposition

Decomposition may be applied recursively
EID   PID  Pname         Hours
1234  10   B2B platform  10
1123  9    CRM           40
1234  9    CRM           30
1023  10   B2B platform  40

is decomposed further into:

PID  Pname
10   B2B platform
9    CRM

EID   PID  Hours
1234  10   10
1123  9    40
1234  9    30
1023  10   40
Unnecessary decomposition


EID   Ename         email
1234  John Smith    jsmith@ac.com
1123  Ben Liu       bliu@ac.com
1023  Susan Sidhuk  ssidhuk@ac.com

is decomposed into:

EID   Ename
1234  John Smith
1123  Ben Liu
1023  Susan Sidhuk

EID   email
1234  jsmith@ac.com
1123  bliu@ac.com
1023  ssidhuk@ac.com

Fine: join returns the original relation
Unnecessary: no redundancy is removed, and now EID
is stored twice!
Bad decomposition


EID   PID  Hours
1234  10   10
1123  9    40
1234  9    30
1023  10   40

is decomposed into:

EID   PID
1234  10
1123  9
1234  9
1023  10

EID   Hours
1234  10
1123  40
1234  30
1023  40

Association between PID and hours is lost
Join returns more rows than the original relation
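A small Python sketch makes the extra rows concrete. It models the relation from this slide as a set of (EID, PID, Hours) tuples; the variable names are chosen for illustration only.

```python
# The original relation as a set of (EID, PID, Hours) tuples.
R = {(1234, 10, 10), (1123, 9, 40), (1234, 9, 30), (1023, 10, 40)}

# The bad decomposition: project onto (EID, PID) and (EID, Hours).
S = {(eid, pid) for eid, pid, hours in R}
T = {(eid, hours) for eid, pid, hours in R}

# Natural join of S and T on the shared attribute EID.
joined = {(eid, pid, hours)
          for eid, pid in S
          for eid2, hours in T
          if eid == eid2}

# Tuples produced by the join that were never in R.
spurious = joined - R

print(len(R), len(joined))   # 4 6
print(sorted(spurious))      # [(1234, 9, 10), (1234, 10, 30)]
```

Employee 1234 works on two projects, so the join pairs each of that employee's projects with each of their hour values and can no longer tell which pairing was real.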
Lossless join decomposition

Decompose relation R into relations S and T
  attrs(R) = attrs(S) ∪ attrs(T)
  S = π_{attrs(S)}(R)
  T = π_{attrs(T)}(R)

The decomposition is a lossless join decomposition if,
given known constraints such as FD’s, we can guarantee
that R = S ⋈ T

Any decomposition gives R ⊆ S ⋈ T (why?)
  A lossy decomposition is one with R ⊂ S ⋈ T
Loss? But I got more rows!

“Loss” refers not to the loss of tuples, but to the loss of
information

Or, the ability to distinguish different original tuples

EID   PID  Hours
1234  10   10
1123  9    40
1234  9    30
1023  10   40

decomposed into:

EID   PID
1234  10
1123  9
1234  9
1023  10

EID   Hours
1234  10
1123  40
1234  30
1023  40
Questions about decomposition

When to decompose

How to come up with a correct decomposition (i.e.,
lossless join decomposition)
Third normal form
• 3NF requires that there are no non-trivial
functional dependencies of non-key attributes on
something other than a superset of a candidate key.
• Recall: a non-trivial FD is one whose RHS is not a
subset of its LHS.
• In summary, all non-key attributes are mutually
independent.
Ename and email have no FD between them

EID   Ename         email
1234  John Smith    jsmith@ac.com
1123  Ben Liu       bliu@ac.com
1023  Susan Sidhuk  ssidhuk@ac.com
Boyce-Codd normal form (BCNF)
• BCNF requires that there are no non-trivial
functional dependencies of attributes on something
other than a superset of a candidate key (called a
superkey).
• All attributes are dependent on a key, a whole key
and nothing but a key (excluding trivial
dependencies, like A->A).
• A table is said to be in the BCNF if and only if
it is in the 3NF and every non-trivial, left-irreducible
functional dependency has a candidate key as its determinant.
• In more informal terms, a table is in BCNF if it
is in 3NF and the only determinants are the
candidate keys.
Non-key FD’s

Consider a non-trivial FD X -> Y where X is not a super
key
  Since X is not a super key, there are some attributes (say
  Z) that are not functionally determined by X

X  Y  Z
a  b  c
a  b  d

That b is always associated with a is recorded by multiple rows:
redundancy, update anomaly, deletion anomaly
Dealing with Nonkey Dependency: BCNF

A relation R is in Boyce-Codd Normal Form if
  For every non-trivial FD X -> Y in R, X is a super key
  That is, all FDs follow from “key -> other attributes”

When to decompose
  As long as some relation is not in BCNF

How to come up with a correct decomposition
  Always decompose on a BCNF violation (details next)
  Then it is guaranteed to be a lossless join decomposition
BCNF decomposition algorithm

Find a BCNF violation
  That is, a non-trivial FD X -> Y in R where X is not a
  super key of R

Decompose R into R1 and R2, where
  R1 has attributes X ∪ Y
  R2 has attributes X ∪ Z, where Z contains all attributes of
  R that are in neither X nor Y (i.e. Z = attr(R) - X - Y)

Repeat until all relations are in BCNF
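The algorithm can be sketched in Python as follows. This is an illustrative sketch rather than a definitive implementation: the function names are hypothetical, the violation search enumerates attribute subsets (exponential, acceptable only for small schemas), and it takes Y to be everything X determines inside the relation. The FD EID, PID -> hours is an assumption added so that WorkOn has a key; EID -> Ename, email comes from the example in this lecture.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(rel, fds):
    """Return (X, Y) for a non-trivial FD X -> Y within rel where X is
    not a super key of rel, or None if rel is in BCNF."""
    attrs = sorted(rel)
    for size in range(1, len(attrs)):
        for x in combinations(attrs, size):
            determined = closure(x, fds) & rel
            y = determined - set(x)
            # Non-trivial (y is non-empty) and x does not determine all of rel.
            if y and determined != rel:
                return frozenset(x), frozenset(y)
    return None

def bcnf_decompose(rel, fds):
    """Split rel on violations until every piece is in BCNF."""
    rel = frozenset(rel)
    violation = find_violation(rel, fds)
    if violation is None:
        return [rel]
    x, y = violation
    r1 = x | y      # attributes X and Y
    r2 = rel - y    # attributes X and Z, with Z = attr(R) - X - Y
    return bcnf_decompose(r1, fds) + bcnf_decompose(r2, fds)

work_on = {"EID", "Ename", "email", "PID", "hours"}
fds = [
    ({"EID"}, {"Ename", "email"}),
    ({"EID", "PID"}, {"hours"}),   # assumed key dependency
]
result = bcnf_decompose(work_on, fds)
for r in result:
    print(sorted(r))
```

Different choices of violation can lead to different final schemas, so the output of a sketch like this is one valid BCNF decomposition, not the only one.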
BCNF decomposition example
WorkOn (EID, Ename, email, PID, hours)
  BCNF violation: EID -> Ename, email
  Decomposed into:
    Student (EID, Ename, email): BCNF
    Grade (EID, PID, hours): BCNF
Another example
WorkOn (EID, Ename, email, PID, hours)
  BCNF violation: email -> EID
  Decomposed into:
    StudentID (email, EID): BCNF
    StudentGrade’ (email, Ename, PID, hours)
      BCNF violation: email -> Ename
      Decomposed into:
        StudentName (email, Ename): BCNF
        Grade (email, PID, hours): BCNF
Exercise

Property(Property_id#, County_name, Lot#, Area,
Price, Tax_rate)

  Property_id# -> County_name, Lot#, Area, Price, Tax_rate
  County_name, Lot# -> Property_id#, Area, Price, Tax_rate
  County_name -> Tax_rate
  Area -> Price
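One way to start on the exercise is to compute the closure of each left-hand side and see which ones are super keys. The Python sketch below does that for the four FDs listed; the helper name closure and the dictionary is_superkey are illustrative choices, not part of the exercise.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

attrs = {"Property_id#", "County_name", "Lot#", "Area", "Price", "Tax_rate"}
fds = [
    (frozenset({"Property_id#"}),
     frozenset({"County_name", "Lot#", "Area", "Price", "Tax_rate"})),
    (frozenset({"County_name", "Lot#"}),
     frozenset({"Property_id#", "Area", "Price", "Tax_rate"})),
    (frozenset({"County_name"}), frozenset({"Tax_rate"})),
    (frozenset({"Area"}), frozenset({"Price"})),
]

# A left-hand side is a super key when its closure covers every attribute.
is_superkey = {tuple(sorted(lhs)): closure(lhs, fds) == attrs for lhs, _ in fds}
for lhs, flag in is_superkey.items():
    print(lhs, "super key" if flag else "NOT a super key")
```

Any FD whose left-hand side is not a super key is a BCNF violation and a candidate for the first decomposition step.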
Exercise
Property(Property_id#, County_name, Lot#, Area, Price,
Tax_rate)
  BCNF violation: County_name -> Tax_rate
  Decomposed into:
    LOTS1 (County_name, Tax_rate): BCNF
    LOTS2 (Property_id#, County_name, Lot#, Area, Price)
      BCNF violation: Area -> Price
      Decomposed into:
        LOTS2A (Area, Price): BCNF
        LOTS2B (Property_id#, County_name, Lot#, Area): BCNF
Why is BCNF decomposition lossless
Given non-trivial X -> Y in R where X is not a super key of
R, need to prove:
• Anything we project always comes back in the join:
  R ⊆ π_{XY}(R) ⋈ π_{XZ}(R)
    Sure; and it doesn’t depend on the FD
• Anything that comes back in the join must be in the
original relation:
  R ⊇ π_{XY}(R) ⋈ π_{XZ}(R)
    Proof makes use of the fact that X -> Y
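
One way to fill in the second direction is sketched below in LaTeX. The tuple names t, u and v are introduced here for illustration, and the argument relies on attrs(R) being exactly X, Y and Z together.

```latex
\begin{align*}
&\text{Let } t \in \pi_{XY}(R) \bowtie \pi_{XZ}(R). \\
&\text{Then there exist } u, v \in R \text{ with } u[XY] = t[XY] \text{ and } v[XZ] = t[XZ]. \\
&\text{Since } u[X] = t[X] = v[X] \text{ and } X \to Y \text{ holds in } R,\; u[Y] = v[Y]. \\
&\text{Hence } v[Y] = u[Y] = t[Y], \text{ and together with } v[XZ] = t[XZ] \text{ this gives } v = t. \\
&\text{Therefore } t \in R.
\end{align*}
```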
Recap


Functional dependencies: a generalization of the key
concept

Partial dependencies: a source of redundancy
  Use 2nd Normal form to remove partial dependency

Non-key functional dependencies: a source of
redundancy

BCNF decomposition: a method for removing ALL
functional dependency related redundancies
  Plus, BCNF decomposition is a lossless join
  decomposition
Normalization
There is a sequence to normal forms:
1NF is considered the weakest,
2NF is stronger than 1NF,
3NF is stronger than 2NF, and
BCNF is considered the strongest
Also,
any relation that is in BCNF, is in 3NF;
any relation in 3NF is in 2NF; and
any relation in 2NF is in 1NF.
Normalization
[Figure: nested regions, with BCNF inside 3NF inside 2NF inside 1NF]
a relation in BCNF is also in 3NF
a relation in 3NF is also in 2NF
a relation in 2NF is also in 1NF
First Normal Form
The following is not in 1NF
EmpNum  EmpPhone  EmpDegrees
123     233-9876
333     233-1231  BA, BSc, PhD
679     233-1231  BSc, MSc

EmpDegrees is a multi-valued field:
employee 679 has two degrees: BSc and MSc
employee 333 has three degrees: BA, BSc, PhD
First Normal Form
Employee
EmpNum  EmpPhone
123     233-9876
333     233-1231
679     233-1231

EmployeeDegree
EmpNum  EmpDegree
333     BA
333     BSc
333     PhD
679     BSc
679     MSc

An outer join between Employee and EmployeeDegree will
produce the information we saw before
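The outer join mentioned above can be sketched in plain Python. The function left_outer_join is a hypothetical helper written for this illustration; it keeps employee 123, who has no degree rows, by pairing that employee with None.

```python
employee = [
    {"EmpNum": 123, "EmpPhone": "233-9876"},
    {"EmpNum": 333, "EmpPhone": "233-1231"},
    {"EmpNum": 679, "EmpPhone": "233-1231"},
]
employee_degree = [
    (333, "BA"), (333, "BSc"), (333, "PhD"),
    (679, "BSc"), (679, "MSc"),
]

def left_outer_join(employees, degrees):
    """Pair each employee with each of their degrees; employees with
    no degree still appear once, with None in the degree column."""
    rows = []
    for emp in employees:
        matches = [deg for num, deg in degrees if num == emp["EmpNum"]]
        if not matches:
            rows.append((emp["EmpNum"], emp["EmpPhone"], None))
        for deg in matches:
            rows.append((emp["EmpNum"], emp["EmpPhone"], deg))
    return rows

rows = left_outer_join(employee, employee_degree)
for row in rows:
    print(row)
```

An inner join would silently drop employee 123, which is why the slide asks for an outer join.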
Second Normal Form
Consider this InvLine table (in 1NF):
InvLine (InvNum, LineNum, ProdNum, Qty, InvDate)
  InvNum, LineNum -> ProdNum, Qty
  InvNum, ProdNum -> LineNum, Qty
  InvNum -> InvDate

There are two candidate keys: {InvNum, LineNum} and {InvNum, ProdNum}.
Qty and InvDate are the nonkey attributes, and InvDate is
dependent on InvNum alone.
Since there is a determinant (InvNum) that is not a
candidate key, InvLine is not BCNF
InvLine is not 2NF since there is a partial
dependency of InvDate on InvNum
InvLine is only in 1NF
Second Normal Form
InvLine (InvNum, LineNum, ProdNum, Qty, InvDate)
The above relation has redundancies: the invoice date is
repeated on each invoice line.
We can improve the database by decomposing the relation
into two relations:
  (InvNum, LineNum, ProdNum, Qty)
  (InvNum, InvDate)
Third Normal Form
Consider this Employee relation
Employee (EmpNum, EmpName, DeptNum, DeptName)
Candidate keys are? …
EmpName, DeptNum, and DeptName are non-key attributes.
DeptNum determines DeptName, a non-key attribute, and
DeptNum is not a candidate key.
Is the relation in 3NF? … no
Is the relation in BCNF? … no
Is the relation in 2NF? … yes
Third Normal Form
(EmpNum, EmpName, DeptNum, DeptName)
We correct the situation by decomposing the original relation
into two 3NF relations. Note the decomposition is lossless.
  (EmpNum, EmpName, DeptNum)
  (DeptNum, DeptName)
Verify these two relations are in 3NF.
Are they in BCNF?
Boyce-Codd Normal Form
BCNF is defined very simply:
a relation is in BCNF if it is in 1NF and if every
determinant is a candidate key.
In 3NF, but not in BCNF:
(student_no, course_no, instr_no)
  Instructor teaches one course only.
  Student takes a course and has one instructor.
  {student_no, course_no} -> instr_no
  instr_no -> course_no
Not in BCNF since we have instr_no -> course_no, but instr_no is not a
candidate key.
(student_no, course_no, instr_no) is decomposed into:
  (student_no, instr_no)
    {student_no, instr_no} -> student_no
    {student_no, instr_no} -> instr_no
  (course_no, instr_no)
    instr_no -> course_no
Boyce-Codd Normal Form
InvLine (InvNum, LineNum, ProdNum, Qty)
  InvNum, LineNum -> ProdNum, Qty
  InvNum, ProdNum -> LineNum, Qty
{InvNum, LineNum} and {InvNum, ProdNum} are
the two candidate keys.
Since every determinant is a candidate key, the
relation is in BCNF
This relation is about Invoice lines only.
(inv_no, line_no, prod_no, prod_desc, qty)
2NF, but not in 3NF, nor in BCNF:
(inv_no, line_no, prod_no, prod_desc, qty)
since prod_no is not a candidate key and we have:
prod_no -> prod_desc.
EmployeeDept (ename, ssn, bdate, address, dnumber, dname)
Summary

Philosophy behind BCNF, 4NF:
Data should depend on the key, the whole key, and
nothing but the key!

Philosophy behind 3NF:
… But not at the expense of more expensive constraint
enforcement!