Normalization of a database

Database Requires Normalization
• Related data in a database must be organized in a
set of related tables following certain relational
rules
– How do we do this? What (fields) should be in each table?
• Poorly designed databases lose data integrity
over time, become slow, and lose the ability to support
queries
A well-designed table is one that:
• minimizes redundant data
• represents a single subject (e.g., Sample, River, Country)
• has a primary key
• does not have multi-part fields (e.g., ‘123 Nice Ave, Atlanta, GA’)
– A column should not hold different items in one value
• does not have duplicate fields (e.g., Analysis1, Analysis2, …)
– The same kind of thing should not be spread over several columns
• does not have fields that depend on fields (subkeys) other
than the PK
Normalization
• Is the gradual, sequential process of organizing the data in a
database so that its tables follow the rules listed in
the previous slide
• Normalization commonly involves the following three normal forms
(in order):
• First, Second, and Third Normal Form, or:
1NF, 2NF, 3NF
– This is commonly done during the early stages of modeling, on
UML class diagrams
– The next slide shows a database with one un-normalized
table with many problems!
Example of an un-normalized Student table
Student      | Exam               | Format          | Grade | Instructor | TA              | Date
-------------+--------------------+-----------------+-------+------------+-----------------+-----------
Dobb, Min    | Structural Geology | Essay           | A-    | Babaie     | Gabbri, Boris   | 2012-08-02
Dobrin, Garn | GIS                | Lab Exercise    | B+    | Dai        | Mafique, Marie  | 2014-02-15
Petri, Tuff  | Remote Sensing     | Essay           | B-    | Kiage      | Karsto, Travert | 2010-09-18
Lac, Du      | GIS                | Lab Exercise    | B     | Dai        | Mafique, Marie  | 2014-02-15
Dobb, Min    | GIS                | Lab Exercise    | A-    | Dai        | Mafique, Marie  | 2014-02-15
Lac, Du      | Petrology          | Multiple choice | B+    | Hidalgo    | Phenos, Meg     | 2014-05-12
Petri, Tuff  | Petrology          | Multiple Choice | B     | Hidalgo    | Phenos, Meg     | 2014-05-12
Problems: mixed names (Student); repeated types (Exam, Format); repeated (Instructor); mixed & repeated (TA)
Goal of Normalization
• Eliminate redundant (duplicated) data, which makes a database large,
inefficient, and slow; this in turn prevents data manipulation (insert,
delete, update) anomalies and loss of data integrity
– If the same data are stored in different rows, changes made in
different places may not match (mistakes happen when entering data)
– We want a change to happen in one place (one row) and then
propagate throughout
– Redundancy reduces flexibility
– Redundancy creates insert, delete, and update anomalies:
– Update: we cannot change the name of Mafique Marie to Basique Marie in one row
if she marries; every row that lists her has to be changed.
– Insert: we cannot insert a new instructor since we do not have a table for
instructors
– Delete: we cannot delete a row without deleting other information
• So, we have to create other tables, assign a PK to each
one, and make sure that each piece of information shows up once in
the database
• The process eliminates redundant data (storing the same
data in more than one table) and ensures data
dependencies are logical (only storing related data in a
table; not shoes and frogs)
– Normalization reduces the amount of space a database
consumes and ensures data is logically stored
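The update anomaly can be reproduced in a few lines. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database; the table name StudentExam is made up for illustration, and only four columns of the Student table are used.

```python
import sqlite3

# Three rows of the un-normalized Student table (only the columns needed here).
rows = [
    ("Dobrin, Garn", "GIS", "Dai", "Mafique, Marie"),
    ("Lac, Du", "GIS", "Dai", "Mafique, Marie"),
    ("Dobb, Min", "GIS", "Dai", "Mafique, Marie"),
]

conn = sqlite3.connect(":memory:")
# StudentExam is a hypothetical table name used only for this sketch.
conn.execute(
    "CREATE TABLE StudentExam (Student TEXT, Exam TEXT, Instructor TEXT, TA TEXT)"
)
conn.executemany("INSERT INTO StudentExam VALUES (?, ?, ?, ?)", rows)

# Update anomaly: the TA's name is changed in only one of the rows that store it.
conn.execute("UPDATE StudentExam SET TA = 'Basique, Marie' WHERE Student = 'Lac, Du'")

# The same person now appears under two different names.
ta_names = sorted(r[0] for r in conn.execute("SELECT DISTINCT TA FROM StudentExam"))
print(ta_names)  # ['Basique, Marie', 'Mafique, Marie']
```

If the TA name were stored once, in its own table, a single UPDATE would change it everywhere it is referenced.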
Alumni Database: The First Attempt
• In this set of slides we will design and normalize the first
version of a database called AlumniDB
– NOTE: You are going to build the Alumni DB in the E3 exercise!
• The initial AlumniDB database may just have a few tables,
like the single table in the next slide
• As you can see, this early version of the table has
redundancies, is inefficient, and therefore is not useful!
• It must be changed through the three ordered
normalization steps (1NF, 2NF, 3NF)
Alumni Table First Version: Inefficient
Alum       | GradYear | CurrentJob         | Donation (USD) | CurrentSchool | workPhone   | CellPhone   | Address
-----------+----------+--------------------+----------------+---------------+-------------+-------------+-----------------------------------
John Sedi  | 2000     | IBM, Google        | 50             |               | 111-2223333 | 222-3334444 | 123 2nd Ave, Los Angeles, CA 90014
Joe Strat  | 2010     |                    |                | Univ. of VA   | 678-3456666 |             | 345 First Ave, Richmond, VA 23219
Liz Hidro  | 1998     | HydroPool, Chevron | 100            |               | 456-3449988 |             | 444 Kelly St, Frankfort, KY 40601
Rocky Tuff | 2002     |                    |                | Univ. of MA   | 999-8874447 |             | 987 Red Rock St, Waltham, MA 02154
Joe Strat  | 2010     |                    |                | Univ. of VA   | 678-3456666 |             | 345 First Ave, Richmond, VA 23219
There are a few problems with this table (see items in red font)!
It first needs to go through the 1NF (see next slide)
First Normal Form (1NF)
• 1NF deals with duplicative data across multiple columns!
– NOTE: The two phone columns hold the same type of data
• 1NF sets the very basic rules to make sure that:
– Separate tables are created for each group of related
data (e.g., Lake, IsotopicAge, Fold, Rock), i.e.,
– each table represents a distinct entity (or subject)
1NF ensures that:
We do not have multiple values in a single column, and
We do not have multiple columns of similar data
1. Repeated columns are not allowed.
• Duplicative (repeating) columns that contain the same type
of data are removed from the table
– There should be no repeated groups of related data:
Mineral1, Mineral2, Mineral3, or
cellPhone, homePhone, workPhone
• These should go to new Mineral and Phone tables!
2. No multi-valued attributes (columns) are allowed.
• All columns contain a single value (i.e., are indivisible), i.e.,
– All attributes must be atomic (e.g., XRF), not multi-valued (like the
address in the Alumni table or Multiple Choice and Essay in the
Student table).
• Otherwise, we will have problems retrieving data by a specified value.
• In other words, each cell must have only one value,
e.g., XRF, not ‘XRF, REE, Isotope’
3. There should be a set of one or more columns that uniquely
identifies each row,
i.e., there should be a primary key (PK)
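The three rules can be sketched on one row of the Alumni table. This is only an illustration in plain Python: the function to_1nf and the column names Employer, Kind, and Number are assumptions made for this sketch, not names from the slides.

```python
# One row of the first-version Alumni table, held as a Python dict.
alum = {
    "Alum": "John Sedi",
    "GradYear": 2000,
    "CurrentJob": "IBM, Google",   # multi-valued cell (breaks rule 2)
    "workPhone": "111-2223333",    # repeated column group (breaks rule 1)
    "CellPhone": "222-3334444",
}

def to_1nf(row, alum_id):
    """Split one wide row into rows for three narrower tables (hypothetical helper)."""
    # Rule 3: every table gets a key; AlumID identifies the alum.
    alumni = {"AlumID": alum_id, "Alum": row["Alum"], "GradYear": row["GradYear"]}
    # Rule 2: one row per job instead of a comma-separated list in one cell.
    jobs = [
        {"AlumID": alum_id, "Employer": name.strip()}
        for name in row["CurrentJob"].split(",")
        if name.strip()
    ]
    # Rule 1: one row per phone; the kind of phone becomes data, not a column name.
    phones = [
        {"AlumID": alum_id, "Kind": kind, "Number": row[col]}
        for kind, col in (("work", "workPhone"), ("cell", "CellPhone"))
        if row.get(col)
    ]
    return alumni, jobs, phones

alumni, jobs, phones = to_1nf(alum, alum_id=1)
print(jobs)    # two rows, one per employer
print(phones)  # two rows, one per phone number
```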
The Alumni table is NOT in First Normal Form (1NF)
Why the Alumni table fails 1NF:
• Violates rule: “There should be no repeating columns”
We have repeating data types (workPhone and CellPhone)
• Violates rule: “Each column must have a single value”
There are two current jobs given for some people.
The Address field is composite (street, city, state, and zip code in one value)
• Violates rule: “There must be a primary key to uniquely identify rows”
There is none!
Alumni Table: Modified; Satisfies 1NF
AlumID | Alum       | GradYear | CurrentSchool | Donation (USD) | Address
-------+------------+----------+---------------+----------------+-----------------------------------
1      | John Sedi  | 2000     |               |                | 123 2nd Ave, Los Angeles, CA 90014
2      | Joe Strat  | 2010     | Univ. of VA   | 50             | 345 First Ave, Richmond, VA 23219
3      | Liz Hidro  | 1998     |               |                | 444 Kelly St, Frankfort, KY 40601
4      | Rocky Tuff | 2002     | Univ. of MA   |                | 987 Red Rock St, Waltham, MA 02154
5      | Joe Strat  | 2010     | Univ. of VA   | 100            | 345 First Ave, Richmond, VA 23219
This table is in First Normal Form (1NF), but it is NOT in 2NF
• The Job, GradSchool, and phone data are moved to their own tables
because they are not dependent on the PK (AlumID).
• Records for Joe Strat and Univ. of VA are repeated!
• Move everything except the Alum data (keep GradYear) into new
tables
• Add first_name, last_name, etc. to the Alumni table
Second Normal Form (2NF)
2NF deals with redundancy across multiple rows!
• 2NF helps to further remove duplicative data
• For a table to be in 2NF:
• It should meet all the requirements of the first normal form
• In addition, we should take the following steps:
– Identify columns whose data repeat in different places, and
move them to their own table
• In the 1NF Alumni table, we see that the data for Joe Strat are repeated.
Solution: Move the alum columns (with the address and
school) into their own tables, called Alum and School
– Every non-key attribute must be dependent on all parts of
the Primary Key (PK)
• If not, move them to a new table with their own PK and FK
2NF: Eliminate partial dependencies
• Non-key columns must refer to the entire
composite key (if it exists), not just part of it.
• For example, the PK in the Student table (copied
in next slide) is the composite (Student, Exam).
– The ExamFormat column depends on (i.e., is an
attribute of) the Exam, not on the Student.
– This means that the data belong to another table
– This is taken care of by the 2NF
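A partial dependency can be spotted in the sample rows with a small check. The sketch below assumes plain Python; the helper determines is hypothetical, and the capitalization of "Multiple Choice" is made uniform here. A check on sample rows can only show that the data are consistent with a dependency; it does not prove the rule holds in general.

```python
from collections import defaultdict

# (Student, Exam, ExamFormat, Grade) taken from the Student table.
rows = [
    ("Dobb, Min", "Structural Geology", "Essay", "A-"),
    ("Dobrin, Garn", "GIS", "Lab Exercise", "B+"),
    ("Petri, Tuff", "Remote Sensing", "Essay", "B-"),
    ("Lac, Du", "GIS", "Lab Exercise", "B"),
    ("Dobb, Min", "GIS", "Lab Exercise", "A-"),
    ("Lac, Du", "Petrology", "Multiple Choice", "B+"),
    ("Petri, Tuff", "Petrology", "Multiple Choice", "B"),
]
COLS = {"Student": 0, "Exam": 1, "ExamFormat": 2, "Grade": 3}

def determines(rows, lhs, rhs):
    """Return True if, in these rows, the lhs columns determine the rhs column."""
    seen = defaultdict(set)
    for r in rows:
        key = tuple(r[COLS[c]] for c in lhs)
        seen[key].add(r[COLS[rhs]])
    # A dependency holds when every lhs value maps to exactly one rhs value.
    return all(len(values) == 1 for values in seen.values())

# ExamFormat depends on Exam alone: a partial dependency on the key (Student, Exam).
print(determines(rows, ["Exam"], "ExamFormat"))        # True
# Grade needs the whole composite key.
print(determines(rows, ["Exam"], "Grade"))             # False
print(determines(rows, ["Student"], "Grade"))          # False
print(determines(rows, ["Student", "Exam"], "Grade"))  # True
```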
Example of an un-normalized Student table
Student      | Exam               | ExamFormat      | Grade | Instructor | TA              | Date
-------------+--------------------+-----------------+-------+------------+-----------------+-----------
Dobb, Min    | Structural Geology | Essay           | A-    | Babaie     | Gabbri, Boris   | 2012-08-02
Dobrin, Garn | GIS                | Lab Exercise    | B+    | Dai        | Mafique, Marie  | 2014-02-15
Petri, Tuff  | Remote Sensing     | Essay           | B-    | Kiage      | Karsto, Travert | 2010-09-18
Lac, Du      | GIS                | Lab Exercise    | B     | Dai        | Mafique, Marie  | 2014-02-15
Dobb, Min    | GIS                | Lab Exercise    | A-    | Dai        | Mafique, Marie  | 2014-02-15
Lac, Du      | Petrology          | Multiple choice | B+    | Hidalgo    | Phenos, Meg     | 2014-05-12
Petri, Tuff  | Petrology          | Multiple Choice | B     | Hidalgo    | Phenos, Meg     | 2014-05-12
Problems: mixed names (Student); repeated types (Exam, ExamFormat); repeated (Instructor); mixed & repeated (TA)
Third Normal Form (3NF)
• Third normal form is about dependency
• For a table to be in the 3NF:
• It must meet all the requirements of the 2NF, and:
• Every non-key attribute must be mutually independent
– Changing one non-key column should not require changing another non-key column
If it does, remove the interdependent attributes
• No transitive functional dependencies
– Remove columns that depend on other non-key columns rather than
directly on the primary key
• Remove columns whose values depend on columns other than
the PK
– This means: we have to remove the subkeys
– Create new tables
– Assign new primary keys and foreign keys after changes
3NF: Eliminate transitive dependencies
• If a non-key column refers not to (i.e., is
independent of) the PK but to another column, it
should be removed to another table.
• For example, the TA column in the Student table
does not depend on the PK (Student, Exam); it
depends on the Instructor column.
• TA is removed to the new Instructor table
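The move of TA into an Instructor table can be sketched as follows. This is an illustration in plain Python using four columns of the Student table; the variable names are assumptions, and the check at the end only confirms that joining the two smaller tables gives back the original rows.

```python
# (Student, Exam, Instructor, TA) taken from the Student table.
rows = [
    ("Dobb, Min", "Structural Geology", "Babaie", "Gabbri, Boris"),
    ("Dobrin, Garn", "GIS", "Dai", "Mafique, Marie"),
    ("Petri, Tuff", "Remote Sensing", "Kiage", "Karsto, Travert"),
    ("Lac, Du", "GIS", "Dai", "Mafique, Marie"),
    ("Dobb, Min", "GIS", "Dai", "Mafique, Marie"),
    ("Lac, Du", "Petrology", "Hidalgo", "Phenos, Meg"),
    ("Petri, Tuff", "Petrology", "Hidalgo", "Phenos, Meg"),
]

# New Instructor table: each instructor's TA is stored exactly once.
instructor_ta = {}
for _, _, instructor, ta in rows:
    # setdefault keeps the first TA seen; a mismatch would mean that
    # TA does not depend on Instructor in this data.
    assert instructor_ta.setdefault(instructor, ta) == ta

# The remaining table keeps Instructor as a foreign key and drops TA.
student_exam = [(s, e, i) for s, e, i, _ in rows]

# Joining the two tables reproduces the original rows, so nothing was lost.
rejoined = [(s, e, i, instructor_ta[i]) for s, e, i in student_exam]
print(rejoined == rows)    # True
print(len(instructor_ta))  # 4 instructor rows instead of 7 repeated TA values
```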
3NF …
• There should be no transitive functional dependencies
• If x → y, i.e., x functionally determines y (y is
functionally dependent on x), then given x, we can find y.
– Example: in the Address table, given the nine-digit zip code, we
can find the city and state because they are functionally dependent
on the zip code. The opposite is not true: given a city, we cannot
find the zip code (Note: some cities have several zip codes;
the same city name can occur in different states)
• By definition, a superkey (e.g., the primary key) functionally
determines all other attributes in the table
• The zip code is a subkey (not a superkey) because it only
determines the city and state part of the Address table,
not the other attributes
Student: StudentID, StudentFirst, StudentMiddle, StudentLast
Grade: StudentID, ExamID, Grade
Exam: ExamID, InstructorID, Exam, Date
Format: ExamID, Format
Instructor: InstructorID, Instructor, TA
All entities broken into separate tables
PKs defined (shown in bold; some are composite; e.g., in Exam)
Each table has unique information about something or subject
Alumni Table, modified again: Satisfies 2NF
AlumID | GradYear | Address
-------+----------+-----------------------------------
1      | 2000     | 123 2nd Ave, Los Angeles, CA 90014
2      | 2010     | 345 First Ave, Richmond, VA 23219
3      | 1998     | 444 Kelly St, Frankfort, KY 40601
4      | 2002     | 987 Red Rock St, Waltham, MA 02154
This table is in Second Normal Form (2NF)
But NOT in 3NF: there is a subkey (zip code) upon which the city and
state depend. Zip code is not a PK.
• We remove the subkey and put it in a new table
• We break the Address data into the following tables:
ZipCodes, Cities, and States, because these do not relate to
any specific alum.
• However, these are directly related to each other (street address
relies on city, city on state)
• To take care of the transitive functional dependency, take 3 steps:
– Remove all the attributes that depend on the subkey (e.g., zip code)
from the table (e.g., city and state from the Address table)
– Move them into a new table (e.g., call it ZipLocations, with zipCode,
city, and state attributes)
– Keep a copy of the subkey attribute (i.e., zipCode) in the original
table as a foreign key
• The Address table now has firstname, lastname, street (these 3
make the composite PK), and zipCode (as FK to the other table).
• Summary: Subkeys always result in redundant data and must be
removed!
• In other words, remove subsets of data that apply to multiple rows of a
table and place them in separate tables
– i.e., remove duplicative data
– For example, break address into its independent constituents that
do not depend on each other
• Create relationships between these new tables and their predecessors
through the use of foreign keys
Alumni Table, 4th attempt: Satisfies 3NF
3NF Alumni Table
AlumID | GradYear | StreetNumber | StreetName  | Zip
-------+----------+--------------+-------------+------
1      | 2000     | 123          | 2nd Ave     | 90014
2      | 2010     | 345          | 1st Ave     | 23219
3      | 1998     | 444          | Kelly St    | 40601
4      | 2002     | 987          | Red Rock St | 02154
Plus other tables!
3NF Zipcodes Table
Zip   | CityID
------+-------
90014 | 1234
23219 | 5678
40601 | 4321
02154 | 8765
3NF Cities Table
CityID | Name        | StateID
-------+-------------+--------
1234   | Los Angeles | 5
5678   | Richmond    | 46
4321   | Frankfort   | 17
8765   | Waltham     | 21
3NF States Table
StateID | Name          | Abbrev
--------+---------------+-------
5       | California    | CA
46      | Virginia      | VA
17      | Kentucky      | KY
21      | Massachusetts | MA
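The four 3NF tables can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under some assumptions: the column types, the REFERENCES clauses, and storing Zip as text (so that 02154 keeps its leading zero) are choices made here, not details given on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE States   (StateID INTEGER PRIMARY KEY, Name TEXT, Abbrev TEXT);
CREATE TABLE Cities   (CityID INTEGER PRIMARY KEY, Name TEXT,
                       StateID INTEGER REFERENCES States(StateID));
CREATE TABLE Zipcodes (Zip TEXT PRIMARY KEY,
                       CityID INTEGER REFERENCES Cities(CityID));
CREATE TABLE Alumni   (AlumID INTEGER PRIMARY KEY, GradYear INTEGER,
                       StreetNumber TEXT, StreetName TEXT,
                       Zip TEXT REFERENCES Zipcodes(Zip));
""")

# Parent tables are filled first so that every foreign key has a target.
conn.executemany("INSERT INTO States VALUES (?, ?, ?)", [
    (5, "California", "CA"), (46, "Virginia", "VA"),
    (17, "Kentucky", "KY"), (21, "Massachusetts", "MA")])
conn.executemany("INSERT INTO Cities VALUES (?, ?, ?)", [
    (1234, "Los Angeles", 5), (5678, "Richmond", 46),
    (4321, "Frankfort", 17), (8765, "Waltham", 21)])
conn.executemany("INSERT INTO Zipcodes VALUES (?, ?)", [
    ("90014", 1234), ("23219", 5678), ("40601", 4321), ("02154", 8765)])
conn.executemany("INSERT INTO Alumni VALUES (?, ?, ?, ?, ?)", [
    (1, 2000, "123", "2nd Ave", "90014"), (2, 2010, "345", "1st Ave", "23219"),
    (3, 1998, "444", "Kelly St", "40601"), (4, 2002, "987", "Red Rock St", "02154")])

# Following the foreign keys rebuilds the full address from the four tables.
query = """
SELECT a.AlumID,
       a.StreetNumber || ' ' || a.StreetName || ', ' ||
       c.Name || ', ' || s.Abbrev || ' ' || a.Zip
FROM Alumni a
JOIN Zipcodes z ON z.Zip = a.Zip
JOIN Cities c   ON c.CityID = z.CityID
JOIN States s   ON s.StateID = c.StateID
ORDER BY a.AlumID
"""
addresses = dict(conn.execute(query).fetchall())
print(addresses[4])  # 987 Red Rock St, Waltham, MA 02154
```

Each city and state name is stored once; a correction to a name is made in a single row and shows up in every address built by the join.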
Fourth Normal Form (4NF)
• Normalizing a database to the 3NF is usually sufficient
• The fourth normal form (4NF) has one additional
requirement
• Meet all the requirements of the third normal form
• A relation is in 4NF if it has no multi-valued
dependencies
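A multi-valued dependency appears when two independent lists about the same subject share one table. The sketch below is an illustration in plain Python, reusing the jobs and phones of the first alum from the first-version Alumni table; the table layouts (AlumID, Job, Phone), (AlumID, Job), and (AlumID, Phone) are assumptions made for this example.

```python
from itertools import product

jobs = ["IBM", "Google"]                  # one independent fact about alum 1
phones = ["111-2223333", "222-3334444"]   # another independent fact about alum 1

# Single table (AlumID, Job, Phone): every job has to be paired with every phone.
combined = [(1, job, phone) for job, phone in product(jobs, phones)]
print(len(combined))  # 4 rows to record 2 jobs and 2 phones

# 4NF decomposition: one table per independent multi-valued fact.
alum_job = sorted({(alum_id, job) for alum_id, job, _ in combined})
alum_phone = sorted({(alum_id, phone) for alum_id, _, phone in combined})
print(len(alum_job), len(alum_phone))  # 2 2
```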