Normalization - The University of North Carolina at Pembroke

Normalization
CSC 3800
Fall 2008
Database Normalization

Database normalization is the process of
removing redundant data from your tables to
improve storage efficiency, data integrity, and
scalability.
 In the relational model, methods exist for
quantifying how efficient a database is. These
classifications are called normal forms (or
NF), and there are algorithms for converting a
given database between them.
 Normalization generally involves splitting
existing tables into multiple ones, which must
be re-joined or linked each time a query is
issued.
History

Edgar F. Codd first proposed the process of
normalization and what came to be known as
the 1st normal form in his 1970 paper “A Relational
Model of Data for Large Shared Data Banks.”
Codd stated:
“There is, in fact, a very simple elimination
procedure which we shall call normalization.
Through decomposition nonsimple domains are
replaced by ‘domains whose elements are
atomic (nondecomposable) values.’”
Normal Form

Edgar F. Codd originally established
three normal forms: 1NF, 2NF and 3NF.
There are now others that are generally
accepted, but 3NF is widely considered
to be sufficient for most applications.
Most tables when reaching 3NF are also
in BCNF (Boyce-Codd Normal Form).
Why Normalize?

Flexibility
  Structure supports many ways to look at the data
Data Integrity
  “Modification Anomalies”: Deletion, Insertion, Update
Efficiency
  Eliminate redundant data and save space
Normalization Defined
“In relational database design, the process
of organizing data to minimize duplication.
 Normalization usually involves dividing a
database into two or more tables and
defining relationships between the tables.
 The objective is to isolate data so that
additions, deletions, and modifications of a
field can be made in just one table and then
propagated through the rest of the database
via the defined relationships.” - Webopedia,

http://webopedia.internet.com/TERM/n/normalization.html
The Normal Forms
A series of logical steps to take to
normalize data tables
 First Normal Form
 Second
 Third
 Boyce Codd
 There’s more, but beyond scope of this
class

Un-normalized Table
OrderDate   Customer   Items
11/30/1998  Joe Smith  Hammer, Saw, Nails

or

OrderDate   Customer   Item1   Item2  Item3
11/30/1998  Joe Smith  Hammer  Saw    Nails
First Normal Form

Remove horizontal redundancies
  No two columns hold the same information
  No single column holds more than a single item
Each row must be unique
  Use a primary key
Benefits
  Easier to query/sort the data
  More scalable
  Each row can be identified for updating

First Normal Form

All columns (fields) must be atomic
Means: no repeating items in columns

OrderDate   Customer   Items
11/30/1998  Joe Smith  Hammer, Saw, Nails

OrderDate   Customer   Item1   Item2  Item3
11/30/1998  Joe Smith  Hammer  Saw    Nails

Solution: make a separate table for each set of
attributes with a primary key (parser, append query)

Customers (CustomerID, Name)
Orders (OrderID, Item, CustomerID, OrderDate)
First Normal Form Tables
Customers
CustomerID  Name
1           Joe Smith

Orders
OrderID  Item    CustomerID  OrderDate
1        Hammer  1           11/30/1998
1        Saw     1           11/30/1998
1        Nails   1           11/30/1998
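
The first-normal-form tables above can be sketched in SQL. The following is a minimal illustration using Python's built-in sqlite3 module, not the tool used in the course; the composite key (OrderID, Item) is an assumption that follows from the one-item-per-row layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    -- One row per ordered item; (OrderID, Item) is assumed to identify a row.
    CREATE TABLE Orders (
        OrderID    INTEGER NOT NULL,
        Item       TEXT    NOT NULL,
        CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
        OrderDate  TEXT    NOT NULL,
        PRIMARY KEY (OrderID, Item)
    );
    INSERT INTO Customers VALUES (1, 'Joe Smith');
    INSERT INTO Orders VALUES
        (1, 'Hammer', 1, '11/30/1998'),
        (1, 'Saw',    1, '11/30/1998'),
        (1, 'Nails',  1, '11/30/1998');
""")

# With one item per row, "who ordered a saw?" is a plain equality test
# instead of a substring search through 'Hammer, Saw, Nails'.
saw_buyers = conn.execute(
    "SELECT c.Name FROM Orders AS o "
    "JOIN Customers AS c ON c.CustomerID = o.CustomerID "
    "WHERE o.Item = ?",
    ("Saw",),
).fetchall()
print(saw_buyers)
```

Because each column is atomic, the same query keeps working no matter how many items an order contains.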
Second Normal Form (2NF)

In 1NF and every non-key column is fully
dependent on the (entire) primary key

Means: Do(es) the key field(s) imply the rest of
the fields? Do we need to know both OrderID and
Item to know the Customer and Date? Clue:
repeating fields

Orders
OrderID  Item    CustomerID  OrderDate
1        Hammer  1           11/30/1998
1        Saw     1           11/30/1998
1        Nails   1           11/30/1998
Second Normal Form
Table must be in First Normal Form
Remove vertical redundancy
  The same value should not repeat across rows
Composite keys
  All columns in a row must refer to BOTH parts of the key
Benefits
  Increased storage efficiency
  Less data repetition

Solution: Remove to a separate table (Make Table)

Orders (OrderID, CustomerID, OrderDate)
OrderDetails (OrderID, Item)
Second Normal Form Tables
Customers
CustomerID  Name
1           Joe Smith

Orders
OrderID  CustomerID  OrderDate
1        1           11/30/1998

OrderDetails
OrderID  Item
1        Hammer
1        Saw
1        Nails
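
The split into Orders and OrderDetails can be sketched as two INSERT ... SELECT statements, which play roughly the role of a Make Table query. This is only an illustration in Python's sqlite3; the staging table name Orders1NF is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Starting point: the 1NF table (Orders1NF is a made-up staging name).
    CREATE TABLE Orders1NF (OrderID INTEGER, Item TEXT,
                            CustomerID INTEGER, OrderDate TEXT);
    INSERT INTO Orders1NF VALUES
        (1, 'Hammer', 1, '11/30/1998'),
        (1, 'Saw',    1, '11/30/1998'),
        (1, 'Nails',  1, '11/30/1998');

    -- CustomerID and OrderDate depend on OrderID alone, so they move out.
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL,
        OrderDate  TEXT    NOT NULL
    );
    INSERT INTO Orders
        SELECT DISTINCT OrderID, CustomerID, OrderDate FROM Orders1NF;

    -- Only the columns that need the whole (OrderID, Item) key stay here.
    CREATE TABLE OrderDetails (
        OrderID INTEGER NOT NULL REFERENCES Orders(OrderID),
        Item    TEXT    NOT NULL,
        PRIMARY KEY (OrderID, Item)
    );
    INSERT INTO OrderDetails SELECT OrderID, Item FROM Orders1NF;
""")

orders = conn.execute("SELECT * FROM Orders").fetchall()
details = conn.execute(
    "SELECT OrderID, Item FROM OrderDetails ORDER BY Item"
).fetchall()
print(orders, details)
```

The customer and date are now stored once per order rather than once per item.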
Third Normal Form (3NF)

In 2NF and every non-key column is mutually
independent

Means: no calculated columns (Total below can be
derived from Quantity and Price)

Item    Quantity  Price  Total
Hammer  2         $10    $20
Saw     5         $40    $200
Nails   8         $1     $8
Third Normal Form

Table must be in Second Normal Form
  If your table is 2NF, there is a good chance it is 3NF
All columns must relate directly to the primary key
Benefits
  No extraneous data
Solution: Put calculations in queries and forms

OrderDetails (OrderID, Item, Quantity, Price)

Put expression in text control or in query:
=Quantity * Price
Third Normal Form (3NF)
OrderDetails
OrderID  Item    Quantity  Price
1        Hammer  2         $10
1        Saw     5         $40
1        Nails   8         $1
SELECT OrderID, Item, Quantity, Price, Price * Quantity AS Total FROM OrderDetails;
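
As a quick check of the idea, the sketch below (Python's sqlite3, prices stored as whole dollars for simplicity) computes the total in the query and shows that a changed quantity can never leave a stale total behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderDetails (
        OrderID  INTEGER NOT NULL,
        Item     TEXT    NOT NULL,
        Quantity INTEGER NOT NULL,
        Price    INTEGER NOT NULL,   -- whole dollars, an assumption for brevity
        PRIMARY KEY (OrderID, Item)
    );
    INSERT INTO OrderDetails VALUES
        (1, 'Hammer', 2, 10), (1, 'Saw', 5, 40), (1, 'Nails', 8, 1);
""")

# Total is derived at query time, so it is never stored.
totals = conn.execute(
    "SELECT Item, Price * Quantity AS Total FROM OrderDetails ORDER BY Item"
).fetchall()

# Updating Quantity needs no second update to keep Total in step.
conn.execute("UPDATE OrderDetails SET Quantity = 3 WHERE Item = 'Hammer'")
hammer_total = conn.execute(
    "SELECT Price * Quantity FROM OrderDetails WHERE Item = 'Hammer'"
).fetchone()[0]
print(totals, hammer_total)
```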
Kumar Madurai: http://www.mgt.buffalo.edu/courses/mgs/404/mfc/lecture4.ppt
Boyce-Codd Normal Form (BCNF) - Examples
 A more restricted version of 3NF (known as
Boyce-Codd Normal Form) requires that the
determinant of every functional dependency in
a relation be a key - for every FD: X => Y, X is
a key
 Consider the following relation:
STU-MAJ-ADV (Student-Id, Major, Advisor)
Advisor => Major, but Advisor is not a key
 Boyce-Codd Normal Form for above:
STU-ADV (Student-Id, Advisor)
ADV-MAJ (Advisor, Major)
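
The BCNF test above can be checked mechanically with an attribute closure. The sketch below is one way to do it in Python, not part of the original slides; the dependency (Student-Id, Major) => Advisor is an assumption added so that the relation has a key, since the slide states only Advisor => Major.

```python
def closure(attrs, fds):
    """Return every attribute determined by `attrs` under the dependencies `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """List non-trivial dependencies whose determinant does not determine the whole relation."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and closure(lhs, fds) != relation
    ]


stu_maj_adv = frozenset({"Student-Id", "Major", "Advisor"})
fds = [
    # Assumed: a student has one advisor per major (not stated on the slide).
    (frozenset({"Student-Id", "Major"}), frozenset({"Advisor"})),
    # From the slide: Advisor => Major.
    (frozenset({"Advisor"}), frozenset({"Major"})),
]
violations = bcnf_violations(stu_maj_adv, fds)

# After decomposition, Advisor is the key of ADV-MAJ, so nothing is violated.
adv_maj = frozenset({"Advisor", "Major"})
adv_maj_violations = bcnf_violations(
    adv_maj, [(frozenset({"Advisor"}), frozenset({"Major"}))]
)
print(violations, adv_maj_violations)
```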
2/16/98
MGS 404
Primary Key
Unique identifier for every row in the table
  Integers instead of text to save memory, increase speed
  Can be “composite”
Surrogate is best bet!
  Meaningless, numeric column acting as primary key
  in lieu of something like SSN or phone number
  (both can be reissued!)

Relationships

One to many to enforce “Referential Integrity”
Two “foreign” keys make a composite primary key and “relate” many to many tables
A look up table - it doesn’t reference any others
Table Prefixes Aid Development


First, we’ll replace the text PK with a number
The Items table is a “look up” table with the tlkp prefix
  tlkp: “lookup” table (no “foreign keys”)
OrderDetails is renamed “trelOrderItem”, a “relational” table
  trel: “relational” (or junction or linking) table

Before:
OrderDetails (OrderID, Item)

After (two foreign keys make a primary key):
tlkpItems (ItemID, ItemName)
trelOrderItem (OrderID, ItemID)
tblOrders (OrderID, CustomerID, OrderDate)
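
A junction table of this shape might be declared as follows. This is a sketch in Python's sqlite3 using the table names from the slide; the column types and the UNIQUE constraint on ItemName are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tlkpItems (
        ItemID   INTEGER PRIMARY KEY,
        ItemName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE tblOrders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL,
        OrderDate  TEXT    NOT NULL
    );
    -- The two foreign keys together form the composite primary key.
    CREATE TABLE trelOrderItem (
        OrderID INTEGER NOT NULL REFERENCES tblOrders(OrderID),
        ItemID  INTEGER NOT NULL REFERENCES tlkpItems(ItemID),
        PRIMARY KEY (OrderID, ItemID)
    );
    INSERT INTO tlkpItems VALUES (1, 'Hammer'), (2, 'Saw'), (3, 'Nails');
    INSERT INTO tblOrders VALUES (1, 1, '11/30/1998');
    INSERT INTO trelOrderItem VALUES (1, 1), (1, 2), (1, 3);
""")

# Item names are looked up through the junction table.
item_names = [
    name
    for (name,) in conn.execute(
        "SELECT i.ItemName FROM trelOrderItem AS oi "
        "JOIN tlkpItems AS i ON i.ItemID = oi.ItemID "
        "WHERE oi.OrderID = 1 ORDER BY i.ItemID"
    )
]

# The composite key rejects listing the same item twice on one order.
try:
    conn.execute("INSERT INTO trelOrderItem VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(item_names, duplicate_rejected)
```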
Referential Integrity

Every piece of “foreign” key data has a
primary key on the “one” side of the relationship
  No “orphan” records. Every child has a parent
  Can’t delete records from the primary table if they are referenced in a related table
Benefits - Data Integrity and Propagation
  If fields are updated in the main table, the change is reflected in all queries
  Can’t add a record to a related table without adding it to the main table
  Cascade Delete: if a record is deleted from the primary table, all
  children are deleted - use with care! Better idea to “archive”
  Cascade Update: if the primary key field changes, the foreign key
  changes with it
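
These rules can be tried out in SQLite, which enforces foreign keys only after `PRAGMA foreign_keys = ON`. The sketch below uses Python's sqlite3 with the Customers and Orders tables from earlier slides; the second order row and the new key value 7 are made-up test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL
            REFERENCES Customers(CustomerID)
            ON UPDATE CASCADE ON DELETE RESTRICT,
        OrderDate  TEXT NOT NULL
    );
    INSERT INTO Customers VALUES (1, 'Joe Smith');
    INSERT INTO Orders VALUES (1, 1, '11/30/1998');
""")


def rejected(sql):
    """Run a statement and report whether a constraint blocked it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# No orphans: customer 99 does not exist.
orphan_rejected = rejected("INSERT INTO Orders VALUES (2, 99, '12/01/1998')")
# A parent with children cannot be deleted.
delete_rejected = rejected("DELETE FROM Customers WHERE CustomerID = 1")
# Cascade update: changing the primary key carries through to the child row.
conn.execute("UPDATE Customers SET CustomerID = 7 WHERE CustomerID = 1")
order_customer = conn.execute(
    "SELECT CustomerID FROM Orders WHERE OrderID = 1"
).fetchone()[0]
print(orphan_rejected, delete_rejected, order_customer)
```

ON DELETE RESTRICT is used here rather than a cascading delete, in line with the caution above.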
When Not to Normalize

Want to keep tables simple so users can make
their own queries
  Avoid processing multiple tables
Archiving Records
  If no need to perform complex queries or “resurrect”
  Flatten and store in one or more tables
Testing shows Normalization has poorer
performance
  “Sounds Like” field example
  Can also try temp tables produced from Make Table queries
Real World - School Data
(One wide table, shown here with one field per line and one column per record.)

Field                Record 1           Record 2        Record 3
Student Last         Smith              Mills           Jones
Student First        Renee              Lucy            Brendan
Street Address       5551 Private Hill  4902 Acme Ct    5304 Gains Street
City                 Annandale          Annandale       Fairfax
State                Virginia           Virginia        Virginia
Postal Code          22003-             22003-          22032-
Home Phone           (703) 323-0893     (703) 764-5829  (703) 978-1083
Parent 1             Ann Jones          Barbara Mills   Jennifer Jones
Parent 2             Theodore Smith     Steve Mills     Stephen Jones
Previous Teacher     Hamil              Hamil           Hamil
Current Teacher      Burke              Burke           Burke
Program              PF                 PF              PH
First Year Enrolled  /0                 96/97           96/97
Last Year Attended   0                  0               0
Birthday             6/25/93            8/14/93         6/13/94
Age in Sept          5                  5               4
Map Coord            22 A-3             21 F-3          21 A-4
Notes                ….
One Possible Design
Examples
1. Eliminate Repeating Groups


In the original member list, each member name is followed by any databases that
the member has experience with. Some might know many, and others might not
know any. To answer the question, "Who knows DB2?" we need to perform an
awkward scan of the list looking for references to DB2. This is inefficient and an
extremely untidy way to store information.
Moving the known databases into a separate table helps a lot. Separating the
repeating groups of databases from the member information results in first
normal form. The MemberID in the database table matches the primary key in
the member table, providing a foreign key for relating the two tables with a join
operation. Now we can answer the question by looking in the database table for
"DB2" and getting the list of members.
1NF Tables
Original Table
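
The "Who knows DB2?" question can be sketched against such a layout. The example below uses Python's sqlite3; the table name KnownDatabase, the member names, and the rows are all hypothetical, since the original figures are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Member (
        MemberID INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL
    );
    -- One row per (member, database) pair instead of a list inside a field.
    CREATE TABLE KnownDatabase (
        MemberID     INTEGER NOT NULL REFERENCES Member(MemberID),
        DatabaseName TEXT    NOT NULL,
        PRIMARY KEY (MemberID, DatabaseName)
    );
    -- Hypothetical sample data.
    INSERT INTO Member VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO KnownDatabase VALUES (1, 'DB2'), (1, 'Oracle'), (3, 'DB2');
""")

# Look in the database table for "DB2", then join back to the member names.
db2_members = [
    name
    for (name,) in conn.execute(
        "SELECT m.Name FROM KnownDatabase AS k "
        "JOIN Member AS m ON m.MemberID = k.MemberID "
        "WHERE k.DatabaseName = 'DB2' ORDER BY m.Name"
    )
]
print(db2_members)
```

A member who knows no databases (Bob here) simply has no rows in the second table.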
2. Eliminate Redundant Data





In the Database Table, the primary key is made up of the MemberID and the DatabaseID. This makes sense
for other attributes like "Where Learned" and "Skill Level" attributes, since they will be different for every
member/database combination. But the database name depends only on the DatabaseID. The same
database name will appear redundantly every time its associated ID appears in the Database Table.
Suppose you want to reclassify a database - give it a different DatabaseID. The change has to be made for
every member that lists that database! If you miss some, you'll have several members with the same
database under different IDs. This is an update anomaly.
Or suppose the last member listing a particular database leaves the group. His records will be removed from
the system, and the database will not be stored anywhere! This is a delete anomaly. To avoid these problems,
we need second normal form.
To achieve this, separate the attributes depending on both parts of the key from those depending only on the
DatabaseID. This results in two tables: "Database" which gives the name for each DatabaseID, and
"MemberDatabase" which lists the databases for each member.
Now we can reclassify a database in a single operation: look up the DatabaseID in the "Database" table and
change its name. The result will instantly be available throughout the application.
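
A minimal sketch of the two tables, in Python's sqlite3, shows the single-operation rename. The table names come from the text; the SkillLevel column follows the "Skill Level" attribute mentioned above, and the rows and the new name 'IBM DB2' are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Database" is quoted because it is also an SQL keyword.
    CREATE TABLE "Database" (
        DatabaseID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    CREATE TABLE MemberDatabase (
        MemberID   INTEGER NOT NULL,
        DatabaseID INTEGER NOT NULL REFERENCES "Database"(DatabaseID),
        SkillLevel TEXT,
        PRIMARY KEY (MemberID, DatabaseID)
    );
    -- Hypothetical sample data.
    INSERT INTO "Database" VALUES (1, 'DB2'), (2, 'Oracle');
    INSERT INTO MemberDatabase VALUES
        (1, 1, 'expert'), (2, 1, 'novice'), (2, 2, 'expert');
""")

# The name lives in exactly one row, so one UPDATE changes it everywhere.
cursor = conn.execute(
    'UPDATE "Database" SET Name = ? WHERE DatabaseID = ?', ("IBM DB2", 1)
)
rows_changed = cursor.rowcount

names_seen_by_members = conn.execute(
    "SELECT md.MemberID, d.Name FROM MemberDatabase AS md "
    'JOIN "Database" AS d ON d.DatabaseID = md.DatabaseID '
    "WHERE d.DatabaseID = 1 ORDER BY md.MemberID"
).fetchall()
print(rows_changed, names_seen_by_members)
```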
3. Eliminate Columns Not
Dependent On Key

The Member table satisfies first normal form - it contains no repeating
groups. It satisfies second normal form - since it doesn't have a
composite (multi-column) key. But the key is MemberID, and the company
name and location describe only a company, not a member. To achieve third
normal form, they must be moved into a separate table. Since they describe
a company, CompanyCode becomes the key of the new "Company" table.
 The motivation for this is the same as for second normal form: we want to
avoid update and delete anomalies. For example, suppose no members
from IBM were currently stored in the database. With the previous
design, there would be no record of its existence, even though 20 past
members were from IBM!
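
The IBM scenario can be replayed with a separate Company table. This is a small sketch in Python's sqlite3; the member row is invented, and the column names other than CompanyCode are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Company (
        CompanyCode TEXT PRIMARY KEY,
        Name        TEXT NOT NULL,
        Location    TEXT NOT NULL
    );
    CREATE TABLE Member (
        MemberID    INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL,
        CompanyCode TEXT NOT NULL REFERENCES Company(CompanyCode)
    );
    INSERT INTO Company VALUES ('IBM', 'IBM', 'Armonk');
    INSERT INTO Member VALUES (1, 'Alice', 'IBM');   -- hypothetical member
""")

# The last IBM member leaves the group...
conn.execute("DELETE FROM Member WHERE CompanyCode = 'IBM'")

# ...but the company's name and location are still on record.
companies = conn.execute(
    "SELECT CompanyCode, Location FROM Company"
).fetchall()
print(companies)
```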
BCNF. Boyce-Codd Normal Form


Boyce-Codd Normal Form states mathematically that:
A relation R is said to be in BCNF if whenever X -> A
holds in R, and A is not in X, then X is a candidate key
for R.
BCNF covers very specific situations where 3NF
misses dependencies among attributes that belong to
candidate keys. Typically, any relation that is
in 3NF is also in BCNF. However, a 3NF relation can
fail to be in BCNF only when (a) there are multiple
candidate keys, (b) the keys are composed of multiple
attributes, and (c) there are common attributes
between the keys.
Basically, a humorous way to remember BCNF is that
all functional dependencies are:
"The key, the whole key, and nothing but the key, so
help me Codd."
Example 2
The First Normal Form
For a table to be in first normal form, data must be broken
up into the smallest units possible. For example, the
following table is not in first normal form.
Name          Address                                 Phone
Sally Singer  123 Broadway New York, NY, 11234        (111) 222-3345
Jason Jumper  456 Jolly Jumper St. Trenton NJ, 11547  (222) 334-5566
To conform to first normal form, this table would
require additional fields. The name field should be
divided into first and last name, and the address
should be divided into street, city, state, and zip,
like this.

ID   First  Last    Street                City      State  Zip    Phone
564  Sally  Singer  123 Broadway          New York  NY     11234  (111) 222-3345
565  Jason  Jumper  456 Jolly Jumper St.  Trenton   NJ     11547  (222) 334-5566
In addition to breaking data up into the smallest
meaningful values, tables in first normal form
should not contain repeating groups of fields
such as in the following table.

Rep ID  Representative    Client 1  Time 1  Client 2  Time 2  Client 3     Time 3
TS-89   Gilroy Gladstone  US Corp.  14 hrs  Taggarts  26 hrs  Kilroy Inc.  9 hrs
RK-56   Mary Mayhem       Italiana  67 hrs  Linkers   2 hrs
The problem here is that each representative can have multiple clients,
and not all will have three. Some may have fewer, as is the case in the
second record, tying up storage space in your database that is not
being used, and some may have more, in which case there are not
enough fields. The solution to this is to add a record for each new
piece of information.

Rep ID  Rep First Name  Rep Last Name  Client       Time With Client
TS-89   Gilroy          Gladstone      US Corp      14 hrs
TS-89   Gilroy          Gladstone      Taggarts     26 hrs
TS-89   Gilroy          Gladstone      Kilroy Inc.  9 hrs
RK-56   Mary            Mayhem         Italiana     67 hrs
RK-56   Mary            Mayhem         Linkers      2 hrs

Notice the splitting of the first and last name fields again.
This table is now in first normal form. Note that by avoiding repeating groups of
fields, we have created a new problem: there are identical values in the
primary key field, which violates the rules of the primary key. To remedy this,
we need some other way of identifying each record. This can be done
with the creation of a new key called client ID.

Rep ID*  Rep First Name  Rep Last Name  Client ID*  Client       Time With Client
TS-89    Gilroy          Gladstone      978         US Corp      14 hrs
TS-89    Gilroy          Gladstone      665         Taggarts     26 hrs
TS-89    Gilroy          Gladstone      782         Kilroy Inc.  9 hrs
RK-56    Mary            Mayhem         221         Italiana     67 hrs
RK-56    Mary            Mayhem         982         Linkers      2 hrs
This new field can now be used in conjunction with the Rep ID field to create
a multiple field primary key. This will prevent confusion if ever more than
one Representative were to serve a single client.
Second Normal Form
The second normal form applies only to tables with
multiple field primary keys. Take the following table
for example.
Rep ID*  Rep First Name  Rep Last Name  Client ID*  Client       Time With Client
TS-89    Gilroy          Gladstone      978         US Corp      14 hrs
TS-89    Gilroy          Gladstone      665         Taggarts     26 hrs
TS-89    Gilroy          Gladstone      782         Kilroy Inc.  9 hrs
RK-56    Mary            Mayhem         221         Italiana     67 hrs
RK-56    Mary            Mayhem         982         Linkers      2 hrs
RK-56    Mary            Mayhem         665         Taggarts     4 hrs
This table is already in first normal form. It has a primary key consisting of
Rep ID and Client ID since neither alone can be considered a unique value.
The second normal form states that each field in a multiple field
primary key table must be directly related to the entire primary
key. In other words, each non-key field should be a fact about
all the fields in the primary key. Only fields that are absolutely
necessary should show up in our table; all other fields should
reside in different tables. To find out which fields are
necessary we should ask a few questions of our database. In our
preceding example, I should ask the question "What information
is this table meant to store?" Currently, the answer is not obvious.
It may be meant to store information about individual clients, or it
could be holding data for employees' time cards. As a further
example, if my database is going to contain records of employees
I may want a table of demographics and a table for payroll. The
demographics table will hold all the employees' personal information
and will assign each of them an ID number. I should not have to enter
the data twice, so the payroll table should refer to
each employee only by their ID number. I can then link the two
tables by a relationship and will then have access to all the
necessary data.
In the table of the preceding example we are devoting three fields to
the identification of the employee and two to the identification of the
client. I could identify them with only one field each -- the primary
key. I can then take out the extraneous fields and put them in their
own tables. For example, my database would then look like the
following.

Rep ID*  Client ID*  Time With Client
TS-89    978         14 hrs
TS-89    665         26 hrs
TS-89    782         9 hrs
RK-56    221         67 hrs
RK-56    982         2 hrs
RK-56    665         4 hrs
The above table contains time card information.
Rep ID*  First Name  Last Name
TS-89    Gilroy      Gladstone
RK-56    Mary        Mayhem
The above table contains Employee Information.
Client ID*  Client Name
978         US Corp
665         Taggarts
782         Kilroy Inc.
221         Italiana
982         Linkers
The above table contains Client Information
These tables are now in normal form. By splitting off the
unnecessary information and putting it in its own tables,
we have eliminated redundancy and put our first table in
second normal form. These tables are now ready to be
linked through relationship to each other.
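
The three tables and their links can be sketched in Python's sqlite3. The table names Rep, Client, and TimeCard and the Hours column are illustrative choices; the rows are the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rep (
        RepID     TEXT PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL
    );
    CREATE TABLE Client (
        ClientID   INTEGER PRIMARY KEY,
        ClientName TEXT NOT NULL
    );
    -- Hours is a fact about the (RepID, ClientID) pair, so it stays here.
    CREATE TABLE TimeCard (
        RepID    TEXT    NOT NULL REFERENCES Rep(RepID),
        ClientID INTEGER NOT NULL REFERENCES Client(ClientID),
        Hours    INTEGER NOT NULL,
        PRIMARY KEY (RepID, ClientID)
    );
    INSERT INTO Rep VALUES
        ('TS-89', 'Gilroy', 'Gladstone'), ('RK-56', 'Mary', 'Mayhem');
    INSERT INTO Client VALUES
        (978, 'US Corp'), (665, 'Taggarts'), (782, 'Kilroy Inc.'),
        (221, 'Italiana'), (982, 'Linkers');
    INSERT INTO TimeCard VALUES
        ('TS-89', 978, 14), ('TS-89', 665, 26), ('TS-89', 782, 9),
        ('RK-56', 221, 67), ('RK-56', 982, 2),  ('RK-56', 665, 4);
""")

# A join brings the names back whenever they are needed.
taggarts_hours = conn.execute(
    "SELECT r.LastName, t.Hours FROM TimeCard AS t "
    "JOIN Rep AS r ON r.RepID = t.RepID "
    "JOIN Client AS c ON c.ClientID = t.ClientID "
    "WHERE c.ClientName = 'Taggarts' ORDER BY r.LastName"
).fetchall()
print(taggarts_hours)
```

Client 665 is served by both representatives, which is exactly the case the composite key (Rep ID, Client ID) is meant to handle.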
Third Normal Form

Third normal form builds on second normal form:
every non-key field must depend on the primary key
directly, not on another non-key field. In
other words, each non-key field in the table
should be a fact about the primary key and nothing
else. Either of the preceding two tables acts as an
example of third normal form since all the fields in
each table are necessary to describe the primary
key.
 Once all the tables in a database have been
taken through the third normal form, we can
begin to set up relationships.
Example 3
Repeating Groups And Normalization To
First Normal Form (1nf)
SALES-INFORMATION
Invoice#  Date    Customer#  Salesperson  Region  Item#  Description  Price  Quantity
1001      7/1/92  456        John         West    121    Widget       $2.25  45
1002      7/1/92  329        Mary         East    348    Gear         $3.70  10
1003      7/1/92  897        Al           West    540    Bolt         $0.40  5
INVOICE-ITEMS (1NF): Invoice#, Item#, Description, Price, Quantity
INVOICES (2NF): Invoice#, Date, Customer#, Salesperson, Region
What Is The Problem With Description/Price?
Insert anomalies
Delete anomalies
Update anomalies
Decomposition Of A
First-normal-form (1nf) Table
INVOICE-ITEMS (1NF): Invoice#, Item#, Description, Price, Quantity
INVOICE-ITEMS-QTY (2NF): Invoice#, Item#, Quantity
ITEMS (2NF): Item#, Description, Price
You can only have a 2nd Normal Form problem if there is a composite primary Key
Database Normalization
Functional dependency is key in understanding the
process of normalization. Functional dependency
means that if there is only one possible value of Y
for every value of X, then Y is functionally
dependent on X.
Think of an invoice table. Two fields would be
invoice # and date. Which field is functionally
dependent on the other?
INVOICE #
DATE
Date is functionally dependent on invoice number.
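
The definition can be turned into a small check over sample rows. The Python sketch below is an illustration only: the function name is made up, and the invoice numbers and dates are taken from the SALES-INFORMATION example. Sample data can disprove a dependency but can never prove that one holds in general.

```python
def is_functionally_dependent(rows, x, y):
    """Return True if every value of column x maps to exactly one value of column y."""
    seen = {}
    for row in rows:
        # setdefault stores the first y seen for this x and returns it afterwards.
        if seen.setdefault(row[x], row[y]) != row[y]:
            return False
    return True


invoices = [
    {"invoice": 1001, "date": "7/1/92"},
    {"invoice": 1002, "date": "7/1/92"},
    {"invoice": 1003, "date": "7/1/92"},
]

# Each invoice number has one date, so date depends on invoice number.
date_depends_on_invoice = is_functionally_dependent(invoices, "invoice", "date")
# One date has several invoice numbers, so the reverse does not hold.
invoice_depends_on_date = is_functionally_dependent(invoices, "date", "invoice")
print(date_depends_on_invoice, invoice_depends_on_date)
```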
Dependencies
Functional Dependency is “good”. With
functional dependency the primary key
(Attribute A) determines the value of all
the other non-key attributes (Attributes
B, C, D, etc.)
 Transitive dependency is “bad”.
Transitive dependency exists if the
primary key (Attribute A) determines
non-key Attribute B, and Attribute B
determines non-key Attribute C.

Decomposition Of A
Second-normal-form (2nf) Table
SALES (2NF): Invoice#, Date, Customer#, Salesperson, Region
Salesperson determines Region. This is a transitive
dependency which must be eliminated for 3NF
INVOICES (3NF): Invoice#, Date, Customer#, Salesperson
SALESPERSON-REGION (3NF): Salesperson, Region
Summary Of 3nf Relations For
Sales Database
INVOICES (3NF)
Invoice#  Date    Customer#  Salesperson
1001      7/1/92  456        John
1002      7/1/92  329        Mary
1003      7/1/92  897        Al

SALESPERSON-REGION (3NF)
Salesperson  Region
John         West
Mary         East
Al           West

INVOICE-ITEMS-QTY (3NF)
Invoice#  Item#  Quantity
1001      121    45
1002      348    10
1003      540    5

ITEMS (3NF)
Item#  Description  Price
121    Widget       $2.25
348    Gear         $3.70
540    Bolt         $0.40
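
One way to confirm that nothing was lost in the decomposition is to join the four 3NF tables back together. The sketch below uses Python's sqlite3; the table and column names are adapted (no "#" or hyphens) and are otherwise assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalespersonRegion (
        Salesperson TEXT PRIMARY KEY,
        Region      TEXT NOT NULL
    );
    CREATE TABLE Invoices (
        InvoiceNo   INTEGER PRIMARY KEY,
        InvoiceDate TEXT    NOT NULL,
        CustomerNo  INTEGER NOT NULL,
        Salesperson TEXT    NOT NULL REFERENCES SalespersonRegion(Salesperson)
    );
    CREATE TABLE Items (
        ItemNo      INTEGER PRIMARY KEY,
        Description TEXT NOT NULL,
        Price       REAL NOT NULL
    );
    CREATE TABLE InvoiceItemsQty (
        InvoiceNo INTEGER NOT NULL REFERENCES Invoices(InvoiceNo),
        ItemNo    INTEGER NOT NULL REFERENCES Items(ItemNo),
        Quantity  INTEGER NOT NULL,
        PRIMARY KEY (InvoiceNo, ItemNo)
    );
    INSERT INTO SalespersonRegion VALUES
        ('John', 'West'), ('Mary', 'East'), ('Al', 'West');
    INSERT INTO Invoices VALUES
        (1001, '7/1/92', 456, 'John'),
        (1002, '7/1/92', 329, 'Mary'),
        (1003, '7/1/92', 897, 'Al');
    INSERT INTO Items VALUES
        (121, 'Widget', 2.25), (348, 'Gear', 3.70), (540, 'Bolt', 0.40);
    INSERT INTO InvoiceItemsQty VALUES
        (1001, 121, 45), (1002, 348, 10), (1003, 540, 5);
""")

# Joining on the keys rebuilds the original SALES-INFORMATION rows.
rebuilt = conn.execute("""
    SELECT i.InvoiceNo, i.InvoiceDate, i.CustomerNo, i.Salesperson, s.Region,
           q.ItemNo, it.Description, it.Price, q.Quantity
    FROM Invoices AS i
    JOIN SalespersonRegion AS s ON s.Salesperson = i.Salesperson
    JOIN InvoiceItemsQty  AS q ON q.InvoiceNo = i.InvoiceNo
    JOIN Items            AS it ON it.ItemNo = q.ItemNo
    ORDER BY i.InvoiceNo
""").fetchall()
for row in rebuilt:
    print(row)
```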
END
Other Slides
Table 1
Title                      Author1               Author2         ISBN        Subject           Pages  Publisher
Database System Concepts   Abraham Silberschatz  Henry F. Korth  0072958863  MySQL, Computers  1168   McGraw-Hill
Operating System Concepts  Abraham Silberschatz  Henry F. Korth  0471694665  Computers         944    McGraw-Hill
Orders Table Problems

This table is not very efficient with storage.

This design does not protect data integrity.

Third, this table does not scale well.
First Normal Form
In our Table, we have two violations of
First Normal Form:
 First, we have more than one author field,
 Second, our subject field contains more
than one piece of information. With more
than one value in a single field, it would
be very difficult to search for all books on
a given subject.

First Normal Table

Table 2
Title                      Author                ISBN        Subject    Pages  Publisher
Database System Concepts   Abraham Silberschatz  0072958863  MySQL      1168   McGraw-Hill
Database System Concepts   Henry F. Korth        0072958863  Computers  1168   McGraw-Hill
Operating System Concepts  Henry F. Korth        0471694665  Computers  944    McGraw-Hill
Operating System Concepts  Abraham Silberschatz  0471694665  Computers  944    McGraw-Hill
Additional Problems
We now have two rows for a single book.
Additionally, we would be violating the
Second Normal Form…
 A better solution to our problem would be
to separate the data into separate tables:
an Author table and a Subject table to
store our information, removing that
information from the Book table:

Second Normal Tables
Subject Table
Subject_ID  Subject
1           MySQL
2           Computers

Author Table
Author_ID  Last Name     First Name
1          Silberschatz  Abraham
2          Korth         Henry

Book Table
ISBN        Title                      Pages  Publisher
0072958863  Database System Concepts   1168   McGraw-Hill
0471694665  Operating System Concepts  944    McGraw-Hill
Additional Problems
Each table has a primary key, used for
joining tables together when querying the
data. A primary key value must be unique
within the table (no two books can have
the same ISBN), and a primary
key is also an index, which speeds up data
retrieval based on the primary key.
 Now to define relationships between the
tables

Relationships
Book_Author Table
ISBN        Author_ID
0072958863  1
0072958863  2
0471694665  1
0471694665  2

Book_Subject Table
ISBN        Subject_ID
0072958863  1
0072958863  2
0471694665  2
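
With the subjects in their own table, searching for all books on a given subject becomes a simple join. The sketch below uses Python's sqlite3 with the rows shown above; storing ISBN as text (so the leading zero is kept) and the helper name titles_on are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ISBN is stored as text so the leading zero survives.
    CREATE TABLE Book (ISBN TEXT PRIMARY KEY, Title TEXT NOT NULL);
    CREATE TABLE Subject (Subject_ID INTEGER PRIMARY KEY, Subject TEXT NOT NULL);
    CREATE TABLE Book_Subject (
        ISBN       TEXT    NOT NULL REFERENCES Book(ISBN),
        Subject_ID INTEGER NOT NULL REFERENCES Subject(Subject_ID),
        PRIMARY KEY (ISBN, Subject_ID)
    );
    INSERT INTO Book VALUES
        ('0072958863', 'Database System Concepts'),
        ('0471694665', 'Operating System Concepts');
    INSERT INTO Subject VALUES (1, 'MySQL'), (2, 'Computers');
    INSERT INTO Book_Subject VALUES
        ('0072958863', 1), ('0072958863', 2), ('0471694665', 2);
""")


def titles_on(subject):
    """Return the titles of all books filed under the given subject."""
    rows = conn.execute(
        "SELECT b.Title FROM Book AS b "
        "JOIN Book_Subject AS bs ON bs.ISBN = b.ISBN "
        "JOIN Subject AS s ON s.Subject_ID = bs.Subject_ID "
        "WHERE s.Subject = ? ORDER BY b.Title",
        (subject,),
    )
    return [title for (title,) in rows]


computer_books = titles_on("Computers")
mysql_books = titles_on("MySQL")
print(computer_books, mysql_books)
```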
Second Normal Form

As the First Normal Form deals with
redundancy of data across a horizontal row,
Second Normal Form (or 2NF) deals with
redundancy of data in vertical columns.
 As stated earlier, the normal forms are
progressive, so to achieve Second Normal
Form, the tables must already be in First
Normal Form.
 The Book Table will be used for the 2NF
example
2NF Table
Publisher Table
Publisher_ID  Publisher Name
1             McGraw-Hill

Book Table
ISBN        Title                      Pages  Publisher_ID
0072958863  Database System Concepts   1168   1
0471694665  Operating System Concepts  944    1
2NF

Here we have a one-to-many relationship
between the book table and the publisher. A
book has only one publisher, and a publisher will
publish many books. When we have a one-to-many
relationship, we place a foreign key in the
Book Table, pointing to the primary key of the
Publisher Table.
 The other requirement for Second Normal Form
is that you cannot have any data in a table with a
composite key that does not relate to all portions
of the composite key.
Third Normal Form
Third normal form (3NF) requires that
there are no functional dependencies of
non-key attributes on something other
than a candidate key.
 A table is in 3NF if all of the non-primary
key attributes are mutually independent
 There should not be transitive
dependencies

Boyce-Codd Normal Form

BCNF requires that the table is in 3NF and that
the only determinants are candidate keys