TDDC94 Fall 2007 TDDC94—Normalisation Lecture Overview

advertisement
TDDC94
Fall 2007
Lecture Overview
TDDC94—Normalisation
User 4
Model
Real World
Database
Updates
Answers
UserQueries
3
Updates
Answers
UserQueries
2
UserQueries
1
Updates
Answers
Updates Queries Answers
Processing of
queries and updates
Database
management
system
Almut Herzog
Attentec AB
almhe@attentec.se
almhe@ida.liu.se
Access to stored data
Physical
database
September 4,
2007
TDDC94—Normalisation
Good Design
Informal Measures of Good Design
Can we be sure that a translation from EER-diagram to
relational tables results in good database design?
Confronted with a deployed database, how can we be
sure that it is well-designed?
What is good database design?
Four informal measures: ...
Formal measure: Normalisation
TDDC94—Normalisation
2
3
Informal Measures of Good Design
Reducing NULL values in tuples (not so big issue
really)
Why is the attribute there in the first place if almost every
tuple has a NULL-value for it? Unclear semantics.
Storage is not really a problem in today’s databases
But it is very costly to have joins on such columns because
the NULL-value must be taken into account with an OUTER
join which is an expensive operation.
Disallowing the possibility of generating spurious tuples
Semantics of the attributes (it is straightforward to
understand why the attribute is in the relation, putting
department data in the EMP-table is for example not
straightforward)
Reducing redundant information in tuples because
redundancy can cause update anomalies:
Given: EMP(EMPID, EMPNAME, DEPTNAME, DEPTMGR)
Insertion anomaly: if the EMP table also contains DEPT-info,
the insertion of a new employee must contain correct,
consistant DEPT-information.
Deletion anomaly: If the last employee of the department is
deleted, all department information is lost.
Modification anomaly: If a department receives a new
manager, all employee tuples need to be updated with that
information.
TDDC94—Normalisation
4
Functional dependencies (FD)
Let R be a relational schema with the attributes A1,...,An and let X and Y
be subsets of {A1,...,An}.
Let r(R) denote a relation in relational schema R.
We
Wesay
saythat
that XXfunctionally
functionallydetermines
determinesY,
Y,
XX YY
Despite the mathematical definition
an FD cannot be determined
automatically.
It
is
a
property
of
the
semantics
of attributes.in r(R):
ififfor
r(R)and
andfor
forall
allrelations
relations in r(R):
foreach
eachpair
pairof
oftuples
tuplestt11, ,tt22∈∈r(R)
IfIftt1[X]
[X]==tt2 [X]
[X]then
thenwe
wemust
mustalso
alsohave
havett1[Y]
[Y]==tt2 [Y]
[Y]
1
2
1
2
Figure 10.5 in the book …
Only join on foreign key/primary key-attributes.
Lossless join property: It shall be possible to recreate the
universal relation without spurious tuples.
TDDC94—Normalisation
Author: Almut Herzog
5
TDDC94—Normalisation
6
1
TDDC94
Fall 2007
Definitions
Definitions
An FD X Y is a full functional dependency (FFD) if
removal of any attribute Ai from X means that the
dependency does not hold any more.
Key: A set of attributes that uniquely and minimally
identifies a tuple of a relation.
Superkey: a set of attributes uniquely (but not minimally!)
identifying a tuple of a relation.
Candidate key: If there is more than one key in a relation, the
keys are called candidate keys.
Primary key: One candidate key is chosen to be the primary
key.
Prime attribute: An attribute A that is part of a candidate key
X (A ⊆ X)
Inference rules…
TDDC94—Normalisation
TDDC94—Normalisation
7
8
1NF – rather a definition
The relation must not have non-atomic values.
Normal Forms
1NF, 2NF, 3NF, BCNF, 4NF, 5NF
Minimise redundancy
Minimise update anomalies
Normal form ↑ = redundancy ↓ and relations become
smaller
Rnon1NF
ID
Name
LivesIn
100
Pettersson
{Stockholm, Linköping}
101
Andersson
{Linköping}
102
Svensson
{Ystad, Hjo, Berlin}
R11NF
Normalisation
TDDC94—Normalisation
TDDC94—Normalisation
Author: Almut Herzog
Linköping
Pettersson
102
Ystad
101
Andersson
102
Hjo
102
Svensson
102
Berlin
10
3NF – 2NF + no non-key attribute shall be functionally dependent on any other nonkey attribute (= no transitive dependencies)
Rnon3NF
Work%
EmpName
ID
Name
Zip
City
Dev
50
Baker
100
Andersson
58214
Linköping
100
Support
50
Baker
101
Björk
10223
Stockholm
200
Dev
80
Miller
102
Carlsson
58214
Linköping
R22NF
R13NF
Dept
Work%
Normalisation
R23NF
ID
Name
Zip
Zip
City
100
Andersson
58214
58214
Linköping
10223
Stockholm
EmpID
EmpName
100
Dev
50
100
Baker
100
Support
50
101
Björk
10223
200
Miller
200
Dev
80
102
Carlsson
58214
11
Linköping
100
Dept
EmpID
Stockholm
100
101
100
R12NF
LivesIn
100
Name
EmpID
Normalisation
ID
ID
TDDC94—Normalisation
9
2NF: no non-key attribut shall be functionally dependent on parts
of any candidate key
Rnon2NF
R21NF
TDDC94—Normalisation
12
2
TDDC94
Fall 2007
Boyce-Codd Normal Form
BCNF
Every determinant is a superkey.
“colloquial and historical definition, which does
not consider candidate keys”:
3NF + no attribute of the primary key is fully
functional dependent on a non-key attribute
Warning! With this definition you do not
always get it right, use the formal one
instead:
Every determinant is a superkey (in practice:
every determinant is a candidate key)
TDDC94—Normalisation
13
Properties of decomposition
Keep all attributes from the universal relation R.
Preserve the identified functional dependencies.
Lossless join
It must be possible to join the smaller tables to arrive at
composite information without spurious tuples.
TDDC94—Normalisation
15
PID
PersonName
PID, Country
NumVisitsInCountry
Country
Continent
Continent
ContinentSize
With these FFDs which are the keys in R (= can we find a
FD that contains all attributes?)
Use the inference rules and get started.
Example: Given R(A,B,C,D) and the functional dependencies
AB CD, C B
R is in 3NF but not in BCNF.
C is determinant but not superkey, i.e. it cannot be turned into a
key by taking away 0 or more attributes.
In reality, a relation that is in 3NF is most often also in BCNF
(but you cannot be sure ☺)
TDDC94—Normalisation
14
Normalisation
Given the universal relationen
R(PID, PersonName,
Country, Continent, ContinentSize, NumVisitsInCountry)
How does one elicit the functional dependencies?
Which are the keys in R?
TDDC94—Normalisation
Country
Continent
leads to
Country
16
Continent
ContinentSize
Continent, ContinentSize (transitive rule)
PID, Country
PID, Country
PID, Country
Continent, ContinentSize (augmentation rule)
PersonName (augmentation rule)
NumVisitsInCountry
leads to
PID, Country
Continent, ContinentSize, PersonName,
NumVisitsInCountry (additive rule)
Thus Person, Country is key for R.
TDDC94—Normalisation
Author: Almut Herzog
17
TDDC94—Normalisation
18
3
TDDC94
Fall 2007
Is
R (PID, Country, Continent, ContinentSize, PersonName, NumVisitsInCountry)
Are R1, R21, R22 in 3NF?
in 2NF?
No, because PersonName is only ffd on PID, thus we split
R1(PID, PersonName)
R2(PID, Country, Continent, ContinentSize, NumVisitsInCountry)
R22(PID, Country, NumVisitsInCountry),
R1(PID, PersonName):
Yes, because there is only one non-key attribute.
Is R2 in 2NF?
No, because Continent and ContinentSize are only ffd on Country, thus
R1(PID, PersonName)
R21(Country, Continent, ContinentSize)
R22(PID, Country, NumVisitsInCountry)
R1, R21, R22 are now in 2NF
R21(Country, Continent, ContinentSize):
No, because Continent determines ContinentSize, thus
R211(Country, Continent)
R212(Kontinent, ContinentSize)
R1, R22, R211, R212 are now in 3NF
2NF:
2NF: non
non non-key
non-key attribute
attribute shall
shall be
be
functionally
functionally dependent
dependent on
on parts
parts of
of
any
candidate
key.
any candidate key.
TDDC94—Normalisation
3NF:
3NF: 2NF
2NF ++ no
no non-key
non-key attribute
attribute
shall
shall be
be fd
fd on
on any
any other
other non-key
non-key
attribute
attribute (=
(= no
no transitive
transitive
dependencies)
dependencies)
TDDC94—Normalisation
19
Är R1, R22, R211, R212 i BCNF?
20
4NF and 5NF
BCNF:
BCNF: Every
Every determinant
determinant is
is aa
superkey.
superkey.
R22(PID, Country, NumVisitsInCountry),
R1(PID, PersonName):
R211(Country, Continent)
R212(Continent, ContinentSize)
YES
Don’t get fooled by candidate keys as in
Rx(PID1, PID2, PName)
Relevant if primary key consists of more than two
attributes.
Relevant if confronted with an existing database.
Can the universal relation R be recreated from R1, R22, R211 and
R212 without spurious tuples?
TDDC94—Normalisation
21
Summary and Open Issues
Good design: informal and formal properties of relations
Functional dependencies, and thus normal forms, are about
attribute semantics (= real-world knowledge), normalisation can
only be automated if FDs are given.
TDDC94—Normalisation
22
1. Which normal form?
The database contains data about cars, their owners and when the car
was registered for that owner.
Are high normal forms good design when it comes to
performance?
No, denormalisation may be required.
TDDC94—Normalisation
Author: Almut Herzog
23
PersonID
FirstName
LastName
LicensePlate
RegistrationDate
Birthdate
1000
Ann
Anderson
ABC123
2004-10-12
1981-04-04
1010
Ben
Benson
DEF234
2003-02-12
1945-12-12
1000
Ann
Anderson
ABC123
2001-04-23
1981-04-04
TDDC94—Normalisation
24
4
TDDC94
Fall 2007
2. Which normal form?
3. Which normal form?
A database contains data about registered cars and their make
(type).
Date
Flight
Aircraft
Pilot
13-Jan-2005
TGU7
Airbus 300
John
14-Jan-2005
TGU7
Boeing 747
Daniel
12-Jan-2005
SKX6
Airbus 300
John
LicensePlate
Type
Maker
ABC123
C70
Volvo
DEF234
S40
Volvo
13-Jan-2005
SKX6
Boeing 747
Ann
FGH345
Corolla
Toyota
14-Jan-2005
SKX6
Fokker 50
Mary
TDDC94—Normalisation
Author: Almut Herzog
The database contains data about flights, aircrafts and their pilots.
Flights use different aircrafts depending on the number of booked
passengers.
25
TDDC94—Normalisation
26
5
Download