Normalisation

advertisement
MSc IT
UFIE8K-15-M Data Management
Prakash Chatterjee
http://www.cems.uwe.ac.uk/~p-chatterjee/
Department of Computer Science and Creative Technologies
University of the West of England
Lecture 6 : Normalisation
Normalisation (1)
 What is Normalisation?
Informally, normalisation can be thought of as a process defined within the theory of relational database to break up larger
relations into many small ones using a set of rules. Normalisation resolve problems with data anomalies and redundancy.
It is essentially a two-step process to:
1. put the data into tabular form (by removing repeating groups); and
2. to remove duplicated records to separate tables.
As we work through the normalisation process, we will make use of data that relates to the Bus Depots’ Database – a
description and E-R model of which was handed out in last weeks session and is also available from the resource area.
UFIE8K-15-M Data
Management 2008
2
Normalisation (2)
 Un-normalised data (1)
Well-normalised databases have a design that reflects the true dependencies
between entities, allowing the data to be updated quickly with little risk of
introducing inconsistencies. Before discussing how to design a well-normalised
database using Codd's normalisation techniques, we first consider a poor
database design.
Consider for example a relation 'bus' which includes bus registration number,
model, type number, type description, depot name (note that names have
changed slightly from the study for the purposes of this example):
registration no
model
type number
type description
depot
Al 23ABC
Routemaster
1
doubledecker
Holloway
D678FGH
Volvo 8700
2
metrobus
Holloway
H2591JK
Daf SB220
3
midibus
Hornsey
P200IJK
Mercedes 709D
2
metrobus
Hornsey
P300RTY
Mercedes Citaro
4
bendy-bus
Hornsey
R678FDS
Daf SB220
1
doubledecker
W653TJH
Routemaster
1
doubledecker
UFIE8K-15-M Data
Management 2008
3
Normalisation (3)
 Un-normalised data (2)
There are several problems with the previous relation:
 Redundancy - the 'type description' is repeated for each 'type
number' in the relation. The 'model' is also repeated for a particular
'type description', for example a Routemaster is always a
doubledecker bus
 Update anomalies - as a consequence of the redundancy, we
could update the 'type description' in one tuple, while leaving it fixed
in another
 Deletion anomalies - if we should delete all the buses of a
particular type, we might lose all the information about that type
 Insertion anomalies - the inverse to deletion anomalies is we
cannot record a new type in our table unless there exists a bus of
that type - for example if there is the type 'open top' we cannot
store this in our database. To get around this we might put null
values in the type number and description components of a tuple
for that bus, but when we enter an item for that supplier, will we
remember to delete the tuple with nulls?
UFIE8K-15-M Data
Management 2008
4
Normalisation (4)
 Functional dependencies (1)
Determinants
A formal definition for the term functional dependence is:
Given a relation which has attributes (x, y, ...), we say that an
attribute y is functionally dependent on another attribute x, if (and only
if) each x value has associated with it precisely one y value (at any one
time).
For example, examine the following relation:
Cleaner no.
(cno)
Cleaner name
(cname)
Cleaner salary
(csalary)
Depot no.
(dno)
110
John
2550
101
111
Jean
2500
101
112
Betty
2400
102
113
Vince
2800
102
114
Jay
3000
102
UFIE8K-15-M Data
Management 2008
5
Normalisation (5)
In the previous diagram, attributes cname, csalary and dno are each
functionally dependent on attribute cno - given a particular cno value,
there exists precisely one corresponding value for each of the cname,
csalary and dno.
In general then, the same x-values may appear in many different tuples
of the relation; if y is functionally dependent on x, then every one of
these tuples must contain the same yvalue.
Going back to the cleaner example, we can represent these functional
dependencies diagrammatically as:
UFIE8K-15-M Data
Management 2008
6
Normalisation (6)
The previous figure is an example of a determinacy diagram. The
arrow line can be read as 'depends on' (reading from left to right). So
we say, for example, 'cno depends on cname'. We can also 'read' the
diagram from right to left. This time the arrowed line is read as
'functionally dependent on'. So we say, for example 'cname is
functionally dependent on cno'.
The attribute or group of attributes on the left-hand side are called the
determinant. The determinant of a value is not necessarily the primary
key. In the example, cno is a determinant of cname because knowing
the cleaner's number we can determine the cleaner's name.
Recognising the functional dependencies is an essential part of
understanding the meaning or semantics of the data. The fact that
cname, csalary and dno are functionally dependent on cno means that
each cleaner has one name, has one salary and works at precisely one
depot.
UFIE8K-15-M Data
Management 2008
7
Normalisation (7)
 Functional dependencies (2)
Composite attributes
The notion of functional dependence can be extended to cover the case
where the determinant (particularly the primary key) is composite, i.e. it
consists of more that one attribute.
Full functional dependence
An attribute y is defined to be fully fully functionally dependent on attribute x
if it is functionally dependent on x and not functionally dependent on any
subset of the attributes of x where it is a composite attribute.
Partial dependencies
The opposite of full functional dependence is partial dependence. Where
we have data values that depend on only a part of the primary key, then we
have a partial dependency.
Transitive dependencies
This occurs when the value of an attribute is not determined directly from
the primary key, but through the value of another attribute and this attribute
in turn is determined by the primary key.
UFIE8K-15-M Data
Management 2008
8
Normalisation (8)
 The normal forms
A number of normal forms have been proposed, but the first five normal
forms have been widely accepted.
The normal forms progress from first normal form, to second, and so
on. Data in second normal form implies that it is also in first normal
form - i.e. each level of normalisation implies that the previous level has
been met.
Other normal forms [have been proposed] such as Boyce-Codd
(BCNF) which is an extension of 3NF, [lying between 3NF and 4NF.]
UFIE8K-15-M Data
Management 2008
9
Normalisation (9)
Correspondence between the normal forms:
UFIE8K-15-M Data
Management 2008
10
Normalisation (10)
 Normal form example
Consider the following example forms that record information about
cleaners at the Middlesex Depot and the buses they look after. Note
that three extra attributes, roster number, roster date and job complete
have been added to the original model. The cleaner ticks against the
appropriate job after he/she has completed the cleaning of a particular
bus.
UFIE8K-15-M Data
Management 2008
11
Normalisation (11)
The un-normalised relation:
UFIE8K-15-M Data
Management 2008
12
Normalisation (12)
 First normal form (1 NF)
The next step in the normalisation process is to remove the repeating groups
from the unnormalised relation. A relation is in 1 NF if - and only if - all domains
contain only atomic or single values, i.e. all repeating groups of data are
removed.
A repeating group is a group of attributes that occurs a number of times for each
record in the relation. So for example, in the Roster relation, each roster record
has a group of buses (roster record 104 has 6 buses).
Selecting a suitable key for the table
In order to convert an un-normalised relation into first normal form, we must
identify the key attribute(s) involved. From the un-normalised relation we can
see that each roster has a roster no, each cleaner a cno, each depot a dno,
each bus a reg-no and each type a tno. In order to convert an un-normalised
relation into normal form, we also have to identify a key for the whole relation.
Bearing this definition in mind, on examination the primary key of the relation is
roster-no, reg no.
We now draw the determinacy diagram for the roster relation, showing the
attributes which are dependent on the primary key:
UFIE8K-15-M Data
Management 2008
13
Normalisation (13)
Determinacy diagram for the first normal form:
UFIE8K-15-M Data
Management 2008
14
Normalisation (14)
Roster relation in first normal form:
UFIE8K-15-M Data
Management 2008
15
Normalisation (15)
 The problems with 1 NF are:

Redundancy - e.g. roster date, cleaner name etc repeated

Insertion anomaly - a cleaner cannot be inserted into the
database unless he/she has a bus to clean

Deletion anomaly - deleting a tuple might lose information from the
database. For example, if a cleaner cleaning a particular bus leaves
the company, then we lose information for the buses he cleaned

Update anomaly - e.g. a change to the cleaner name means it
must change in all tuples which include that cleaner name.
UFIE8K-15-M Data
Management 2008
16
Normalisation (16)
 Second normal form (2NF)
We now describe the second step in the normalisation process using
the relation above which is in first normal form.
Firstly we determine the functional dependencies on the identifying
attributes (i.e. the primary key (roster_no, reg_no) and its parts.
If the key is composite, the other attributes must be functionally
dependent on the whole of the key. In other words we are looking for
partial functional dependencies. In the example, roster date is
functionally dependent on the partial key roster_no - there is only one
roster_date for a particular roster_no. Also cno, cname, dno, dname etc
are all functionally dependent on the partial key reg_no. The attribute
'status', however, is the only attribute fully functionally dependent on the
whole of the primary key.
UFIE8K-15-M Data
Management 2008
17
Normalisation (17)
Determinacy diagram for the second normal form:
UFIE8K-15-M Data
Management 2008
18
Normalisation (18)
Roster in second normal form:
UFIE8K-15-M Data
Management 2008
19
Normalisation (19)
2NF has less redundancy than 1NF as we have removed
repeating groups. However there are still a number of
problems:




Redundancy - for example, in the Bus relation, cleaner name is
repeated for each cleaner number
Insertion anomaly - a cleaner cannot be inserted into the
database unless he/she is responsible for at least one bus
Deletion anomaly - deleting a tuple might lose information from the
database. For example, if we delete a cleaner who is only
responsible for that one bus, then we lose information about the
cleaner
Update anomaly - e.g. a change to the cleaner name means
changes must be made in all tuples which include that cleaner
name.
UFIE8K-15-M Data
Management 2008
20
Normalisation (20)
 Third normal form (3NF)
A 3NF relation is in 2NF but also it must satisfy the non-transitive
dependency rule, which states that every non-key attribute must be
non-transitively dependent on the primary key. Another way of saying
this is that a relation is in 3NF if all its non-key attributes are directly
dependent on the primary key. Transitive dependencies are resolved by
creating new relations for each entity.
There are three transitive dependencies in the Bus relation above as is
illustrated by vertical lines in the 2NF determinacy diagram. For
example: cno is functionally dependent on reg_no; cname is
functionally dependent on reg no. Additionally, cname is functionally
dependent cno.
We therefore have the transitive dependency:
reg no determines cno and cno determines cname then
reg_no determines cname
Two other transitive dependencies are identified involving tname and
dname. The determinacy diagrams for third normal form are given
below:
UFIE8K-15-M Data
Management 2008
21
Normalisation (21)
Determinacy diagram for the third normal form:
UFIE8K-15-M Data
Management 2008
22
Normalisation (22)
Roster in third normal form:
UFIE8K-15-M Data
Management 2008
23
Normalisation (23)
Normal
form
What is it?
What does
process do?
How is it achieved?
1 NF
Relation in 1 NF if
- it contains scalar (atomic)
values only
- Removes
repeating groups
- Make a separate relation
for each group of related
attributes
- Give each new relation a
primary key
2 NF
Relation in 2NF if
- in I NF
- all non-key attributes are
dependent on the whole
of the primary key and
not part of it
- Removes
redundant data
- If an attribute depends on
only part of a multi
value key, remove it to a
separate table
3 NF
Relation in 3NF if
- in 2NF
- non-key attributes are
dependent on primary
key and independent of
each other
- i.e. non-key attribute
must be non-transitively
dependent on the primary
key
- a non-key attribute is
changed, that change
should not affect the
others
- Removes
attributes not
dependent on
the key thereby
further reducing
redundancy
- Make a separate relation
for attributes transitively
dependent on the
primary key
- Give each new relation a
primary key
- Original relation will
include a foreign key to
link to new relation
UFIE8K-15-M Data
Management 2008
24
Bibliography / Readings / Home based activities
Bibliography
-
An Introduction to Database Systems (8th ed.), C J Date, Addison Wesley 2004
Database Management Systems, P Ward & G Defoulas, Thomson 2006
Database Systems Concepts (4th ed.), A Silberschatz, H F Korth & S Sudarshan,
McGraw-Hill 2002
Readings
-
Introduction to SQL’ McGraw-Hill/Osbourne (handout)
Home based activities
-
Ensure you download xampp and install on home PC or laptop (if you have a
slow home internet connection – download to data key or CD here at UWE)
Copy the SQL Workbook onto your data key or CD.
Import the tables from the SQL Workbook into your home MySQL DB. Begin
working through some of the query examples in the workbook using
PHPMyAdmin.
UFIE8K-15-M Data
Management 2008
25
Download