Table normalization

Normalization: Kroenke Chapters 3
and 4
A relation is categorized by one of
several normal forms.
An aid to design
helps characterize relations that
experience anomalies in update
operations
Higher normal forms TEND to give a better
design, but this is not guaranteed.
Remember the one fact-one place theme!
Deletion anomaly:
Deleting 1 fact inadvertently deletes another.
Insertion anomaly:
inserting 1 fact not possible without inserting
another seemingly unrelated fact.
First Normal Form – 1NF
A relation is 1NF if each attribute is
atomic
That is, attributes are simple types (int,
float, string, char, etc)
Second Normal Form – 2NF
How about this as a base table?
[Figure: sample table with its primary key marked]
Definition (functional dependence):
R is a relation;
X and Y are attributes of R.
Y is functionally dependent on X iff
each X-value in R has precisely one
Y-value in R associated with it.
A common notation is X → Y.
Example:
In the supplier’s S table, Status, City, and
Name are functionally dependent on S#.
In the SP table, Qty is functionally
dependent on the combined attributes of
S# and P#
[Diagram: in S, S# → Name, S# → Status, and S# → City; in SP, (S#, P#) → Qty]
City and status are not functionally
dependent on each other.
There may be several entries containing
‘London’ but different status values.
There may be several entries containing a
status of 50 but have different cities.
QTY is not functionally dependent on
either P# or S#.
S1 might have multiple QTY values for
different parts
Similar for P1
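As a rough illustration of the definition, the sketch below tests whether a table instance is consistent with X → Y. The helper name `fd_holds_in` and all sample rows are made up for this example. A check like this can only refute a proposed dependency for the rows at hand; whether the dependency really holds depends on what the data means.

```python
def fd_holds_in(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, then returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True

# Hypothetical rows for S(S#, Name, Status, City)
S = [
    {"S#": "S1", "Name": "Smith", "Status": 20, "City": "London"},
    {"S#": "S2", "Name": "Jones", "Status": 10, "City": "Paris"},
    {"S#": "S4", "Name": "Clark", "Status": 50, "City": "London"},
    {"S#": "S5", "Name": "Adams", "Status": 50, "City": "Athens"},
]
# Hypothetical rows for SP(S#, P#, Qty)
SP = [
    {"S#": "S1", "P#": "P1", "Qty": 300},
    {"S#": "S1", "P#": "P2", "Qty": 200},
    {"S#": "S2", "P#": "P1", "Qty": 100},
]

print(fd_holds_in(S, ["S#"], ["Status", "City", "Name"]))  # True
print(fd_holds_in(S, ["City"], ["Status"]))    # False: London has 20 and 50
print(fd_holds_in(S, ["Status"], ["City"]))    # False: 50 appears in two cities
print(fd_holds_in(SP, ["S#", "P#"], ["Qty"]))  # True
print(fd_holds_in(SP, ["S#"], ["Qty"]))        # False: S1 has two quantities
print(fd_holds_in(SP, ["P#"], ["Qty"]))        # False: P1 has two quantities
```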
Def: Y is Fully Functionally Dependent on
X if X → Y but Y is not functionally
dependent on any proper subset of X.
In S, (S#, Status) → City, but not fully,
because S# → City.
In SP, (S#, P#) → Qty, fully, because
neither S# nor P# by itself determines Qty.
The functional dependence requires BOTH
S# and P#.
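The definition of full functional dependence can be sketched as a search over proper subsets of X. This is only an instance-level check on invented sample rows; the function names are hypothetical, and the empty subset is ignored for simplicity.

```python
from itertools import combinations

def fd_holds_in(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True

def fully_dependent_in(rows, lhs, rhs):
    """True if lhs -> rhs holds and no proper, non-empty subset of lhs determines rhs."""
    if not fd_holds_in(rows, lhs, rhs):
        return False
    for size in range(1, len(lhs)):
        for subset in combinations(lhs, size):
            if fd_holds_in(rows, list(subset), rhs):
                return False
    return True

# Hypothetical sample rows
S = [
    {"S#": "S1", "Status": 20, "City": "London"},
    {"S#": "S2", "Status": 10, "City": "Paris"},
    {"S#": "S4", "Status": 50, "City": "London"},
]
SP = [
    {"S#": "S1", "P#": "P1", "Qty": 300},
    {"S#": "S1", "P#": "P2", "Qty": 200},
    {"S#": "S2", "P#": "P1", "Qty": 100},
]

print(fully_dependent_in(S, ["S#", "Status"], ["City"]))  # False: S# alone suffices
print(fully_dependent_in(SP, ["S#", "P#"], ["Qty"]))      # True: both are needed
```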
Functional dependence is a semantic notion.
You must understand the meaning of the data;
an FD is NOT a consequence of the values that
happen to be in the table.
For example, suppose that each city in S
has the same status.
Is it coincidence or by design?
Why is this important?
What if we combined the relations S and
SP into a single relation, First, as shown a
few slides earlier?
First(S#, P#, Status, City, QTY)
The primary key is (S#, P#), which was
underlined on the slide.
Insertion Anomalies:
Cannot enter fact that a supplier is located
in a city unless that supplier already
supplies some part.
Why?
Deletion Anomalies
Suppose S3 no longer supplies P2.
Delete (S3, P2, 10, Paris, 200)
If that is the ONLY part S3 supplies, you
lose the fact that S3 is in Paris.
Update Anomalies
S1 moves from London to Amsterdam.
May have to update many entries.
Violates the “one fact, one place”
guideline
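The three anomalies can be reproduced on a toy copy of First. This is a minimal sketch: the dictionary layout, the `insert` and `cities_of` helpers, and every row except (S3, P2, 10, Paris, 200) are assumptions made for illustration.

```python
# Hypothetical rows of First(S#, P#, Status, City, Qty); the key is (S#, P#).
first = {
    ("S1", "P1"): {"Status": 20, "City": "London", "Qty": 300},
    ("S1", "P2"): {"Status": 20, "City": "London", "Qty": 200},
    ("S3", "P2"): {"Status": 10, "City": "Paris", "Qty": 200},
}

def insert(table, s, p, **attrs):
    """Insert a row, rejecting a null in any primary-key attribute."""
    if s is None or p is None:
        raise ValueError("primary key attributes may not be null")
    table[(s, p)] = attrs

def cities_of(table, supplier):
    """All cities recorded for one supplier."""
    return {row["City"] for (s, _p), row in table.items() if s == supplier}

# Insertion anomaly: S4 is in Rome but supplies no part yet, so there is no P#.
try:
    insert(first, "S4", None, Status=50, City="Rome", Qty=None)
    insertion_rejected = False
except ValueError:
    insertion_rejected = True

# Update anomaly: S1's move has to be applied to every S1 row.
rows_touched = 0
for (s, _p), row in first.items():
    if s == "S1":
        row["City"] = "Amsterdam"
        rows_touched += 1

# Deletion anomaly: removing S3's only shipment also removes S3's city.
del first[("S3", "P2")]

print(insertion_rejected)      # True
print(rows_touched)            # 2
print(cities_of(first, "S3"))  # set()
```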
These problems are caused by
dependencies on a proper subset of the
primary key.
See also Kroenke’s example on page
95 and the text on page 96.
Second Normal Form (2NF)
A relation is 2NF iff it is 1NF and
every non-key attribute is fully
functionally dependent on the
primary key.
There are no attributes dependent on
a proper subset of the primary key.
Table First is NOT 2NF. Some nonkey
attributes are not fully dependent on the
primary key (S#, P#).
Some are dependent on S# only
The S and SP tables ARE 2NF.
They are a better design in this case.
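One way to see why First fails 2NF while S and SP pass is to scan the stated FDs for a left side that is a proper subset of the key. The sketch below assumes the FDs are written out by hand and inspects only the FDs as listed (it does not derive implied ones); `partial_dependencies` is a hypothetical helper name.

```python
def partial_dependencies(key, fds):
    """List (lhs, attribute) pairs where a non-key attribute depends on a
    proper, non-empty subset of the key."""
    key = frozenset(key)
    found = []
    for lhs, rhs in fds:
        lhs = frozenset(lhs)
        if lhs and lhs < key:  # proper subset of the key
            for attr in sorted(rhs):
                if attr not in key:
                    found.append((sorted(lhs), attr))
    return found

first_fds = [({"S#"}, {"Status", "City"}), ({"S#", "P#"}, {"Qty"})]
s_fds = [({"S#"}, {"Status", "City"})]
sp_fds = [({"S#", "P#"}, {"Qty"})]

print(partial_dependencies({"S#", "P#"}, first_fds))
# [(['S#'], 'City'), (['S#'], 'Status')]  -> First is not 2NF
print(partial_dependencies({"S#"}, s_fds))         # []
print(partial_dependencies({"S#", "P#"}, sp_fds))  # []
```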
Similar example in Fig 3-10 on page 106
of text.
How about this table?
Does it contain redundancy?
Are there update anomalies?
Transitive dependencies
Suppose a supplier’s status is determined by
the supplier’s city.
That is, City → Status.
Since S# → City also holds, S# → Status is a
result of these dependencies.
A Transitive dependency exists as shown
below.
[Diagram: S# → City → Status]
Similarly, a Housing table that links a
student with a dorm and a residence fee
would also likely have a transitive
dependence.
[Diagram: SID → Dorm → Fee]
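Transitive dependencies like these can be derived mechanically with the standard attribute-closure computation. The sketch below is one minimal way to write it; the FD lists are transcribed from the two examples above, and the function name `closure` is just a label chosen here.

```python
def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its whole left side is already in the result.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

supplier_fds = [(["S#"], ["City"]), (["City"], ["Status"])]
housing_fds = [(["SID"], ["Dorm"]), (["Dorm"], ["Fee"])]

print(sorted(closure(["S#"], supplier_fds)))    # ['City', 'S#', 'Status']
print(sorted(closure(["City"], supplier_fds)))  # ['City', 'Status']
print(sorted(closure(["SID"], housing_fds)))    # ['Dorm', 'Fee', 'SID']
```

Status lands in the closure of S# only by way of City, which is exactly the transitive path.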
Insertion anomalies:
Cannot state fact that a supplier in Rome
must have a status of 50 unless there is
already a supplier there.
Cannot state fact a dorm has a specific
cost unless there is already a student
there.
Deletion anomalies
Delete (S5, 30, Athens)
If it’s the ONLY Athens, lose fact that
status for Athens must be 30.
Delete (100, Randolph, $3200) from the
Housing table.
If that’s the only “Randolph” then you
lose the connection between dorm and
cost.
Update anomalies
“Change status of London supplier” may
mean multiple updates.
Violates the “one fact” – “one place”
rule. i.e. that each fact should be stored in
one place.
Third Normal Form (3NF)
A relation is 3NF iff it is 2NF and every
non-key attribute is nontransitively
dependent on the primary key,
i.e. non-key attributes are mutually independent.
Again, it’s a consequence of the meaning
of the data, not the data itself.
Suppose all London suppliers had a status
of 50.
Is that coincidence?
Is it by design?
Question:
Is 3NF better than 2NF?
Maybe.
In the cases presented here, probably so.
An employee table where
EmpID → Address → ZipCode is not 3NF.
We may not care about
Address → ZipCode unless it’s a UPS or
Post Office application.
Table Decomposition
Dividing a table into 2 or more tables to
achieve a higher normal form.
Previously we divided First into tables S
and SP to achieve 2NF.
Now we find that S is not 3NF, so we
should decompose S into two tables.
We have three options:
1. SS(S#, Status) and CS(City, Status)
2. SC(S#, City) and SS(S#, Status)
3. SC(S#, City) and CS(City, Status)
Which is best?
Need to ask:
Does the decomposition result in a loss of
information?
For example, can we still relate the attributes
that have been separated into two tables?
Are the two relations independent of each
other?
Option 1:
SS(S#, Status) and CS(City, Status)
Cannot get the city of a supplier.
Can you see why?
Option 2:
SC(S#, City) and SS(S#, Status)
Relations not independent.
If two suppliers are in the same city, must make
sure they have the same status.
Requires monitoring of changes, possibly the
use of triggers. Extra work.
CAN get the status of a city but ONLY if
there’s a supplier there. Otherwise there’s
a loss of information.
Can’t store the status of a city unless
there’s a supplier there.
Option 3:
SC(S#, City) and CS(City, Status)
Two relations are independent.
No loss of information
Best option
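The "loss of information" question for the three options can be tested by projecting S and joining the pieces back together. The sketch below uses invented rows in which City → Status holds and two cities share a status; the helper names are hypothetical.

```python
def project(rows, attrs):
    """Project onto attrs, dropping duplicate rows."""
    distinct = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in distinct]

def natural_join(left, right):
    """Join two row lists on whatever attributes they share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined

def same_rows(a, b):
    """Compare two row lists as sets, ignoring order and duplicates."""
    canon = lambda rows: {tuple(sorted(r.items())) for r in rows}
    return canon(a) == canon(b)

# Hypothetical S(S#, City, Status) rows
S = [
    {"S#": "S1", "City": "London", "Status": 20},
    {"S#": "S2", "City": "Paris", "Status": 10},
    {"S#": "S3", "City": "Paris", "Status": 10},
    {"S#": "S4", "City": "London", "Status": 20},
    {"S#": "S5", "City": "Athens", "Status": 30},
    {"S#": "S6", "City": "Madrid", "Status": 30},
]

options = {
    1: (["S#", "Status"], ["City", "Status"]),
    2: (["S#", "City"], ["S#", "Status"]),
    3: (["S#", "City"], ["City", "Status"]),
}
lossless = {
    n: same_rows(natural_join(project(S, a), project(S, b)), S)
    for n, (a, b) in options.items()
}
print(lossless)  # {1: False, 2: True, 3: True}
```

Option 1 fails because joining on Status pairs S5 with both Athens and Madrid. Note that this test covers only information loss: option 2 passes it, yet its two tables are still not independent.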
Decompose the Housing table into one of:
1. SD(SID, Dorm) and DF(Dorm, Fee)
2. SD(SID, Dorm) and SF(SID, Fee)
3. SF(SID, Fee) and DF(Dorm, Fee)
Which is better?
Construct a similar argument
[Diagram: SID → Dorm → Fee]
Best decomposition frequently follows
the FD arrows.
This is a guideline, not an absolute rule.
Determinants
Consider SMA(SID, MID, AID) where a
student has one advisor for a major and
an advisor advises for one major.
This table is 3NF since there is only one
non-key attribute.
If S2 drops Physics, you may lose the
fact that A3 advises for Physics.
SID   MID   AID
S1    Math  A1
S1    Phys  A2
S2    Math  A1
S2    Phys  A3
[Diagram: (SID, MID) → AID; AID → MID]
Def: If Y is fully functionally dependent on
X then X is a determinant.
Def: A tuple is an entry from a relation. The
name is rooted in the historical development
by E. F. Codd, who used mathematical
models to describe relations.
Def: An attribute (or minimal set of attributes)
is a candidate key if it uniquely identifies a
tuple. A primary key is chosen from the list of
candidate keys.
Every candidate key is a determinant.
Boyce-Codd Normal Form (BCNF)
A relation is BCNF if every determinant
is a candidate key.
SMA is NOT BCNF since AID is a
determinant but not a candidate key.
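The BCNF test for SMA can be sketched with attribute closures: an FD violates BCNF when its left side does not determine every attribute. The code assumes the two FDs stated for SMA, (SID, MID) → AID and AID → MID; the function names are hypothetical, and the brute-force key search is only practical for small relations.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attributes, fds):
    """Minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == set(attributes):
                keys.append(combo)
    return keys

def bcnf_violations(attributes, fds):
    """Non-trivial FDs whose left side is not a superkey."""
    return [
        (lhs, rhs) for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != set(attributes)
    ]

sma = ["SID", "MID", "AID"]
sma_fds = [(("SID", "MID"), ("AID",)), (("AID",), ("MID",))]

print(candidate_keys(sma, sma_fds))   # [('AID', 'SID'), ('MID', 'SID')]
print(bcnf_violations(sma, sma_fds))  # [(('AID',), ('MID',))]
```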
Possible decompositions:
SA(SID, AID) and AM(AID, MID)
No Loss but relations are not independent.
How do you “Find the major of S1”?
It requires a search of two tables, which
seems somewhat counterintuitive.
SM(SID, MID) and AM(AID, MID).
Cannot get advisor of a student.
SA(SID, AID) and SM(SID, MID).
Cannot get who advises what.
None of the three possible
decompositions seems satisfactory.
Solution: Look at bigger picture (E-R
diagram)
[E-R diagram: entities Student (S), Major (M), and Advisor (A), with one relationship labeled redundant]
Relations:
S, M, A (With a foreign key matching the
primary key in M), SM, and SA to
implement the many-many relationships
NOTE: With BOTH SM and SA, it is
possible for inconsistency to occur.
Could have (S1, M5) in SM;
(S1, A3) in SA;
and have M8 as a foreign key for advisor
A3 in the Advisor table.
Would need software or triggers to
assure consistency which adds to
overhead.
On the other hand, relationship between S
and M is derived from relationships
between S and A and between A and M.
This provides an argument that the
relationship between S and M should not
be shown as a separate relationship
Of course, then the fact that “a student is
majoring in something” is NOT explicitly
stored.
The design is based on business rules
which we assume to be correct.
May not always be the case.
Maybe the business rule that states “a
student is majoring in something” is
flawed.
Allows a student to choose a major
without having an advisor first.
Perhaps a better rule is “a student has an
advisor, which determines the major”.
It would be a model that forces student to
choose an advisor, which may be a better
rule since many students do NOT seek
out advisors in timely fashion.
Multivalued Dependencies (MVDs)
Consider SMA(Student, Major,
Activity)
A student can have multiple majors and
participate in multiple activities.
This relation is BCNF vacuously (There
are no determinants)
Can’t store the major of a student unless
that student has an activity.
Student  Major  Activity
S1       Math   Swimming
S1       Math   Football
S2       Phys   Baseball
S2       Math   Baseball
Another example
CIX(Courses, Instructor, teXt)
To implement training programs or corporate
sponsored courses.
Courses taught by many instructors and an
instructor can lead many courses.
Similar for text and courses
Instructors do NOT choose texts
There are NO determinants in this table
Can’t store the text for a course unless there
is an instructor.
Yet another example on page 95-96.
Def:
Suppose A, B, and C are attributes of a
relation.
A multivalued dependency (MVD)
A →→ B holds in R if for each A there are
multiple B values which are independent of
any C values.
Fourth Normal Form (4NF)
A relation is 4NF if it is BCNF and has
no multi-valued dependencies.
SMA is NOT 4NF.
Would decompose into SM and SA.
No loss since there’s no connection
between M and A.
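The "no loss" claim for SMA can be checked on the sample Student/Major/Activity rows shown earlier: project onto SM and SA, join on Student, and compare. This is a sketch; the extra student S3 and the "Chess" activity are invented for the demonstration.

```python
# SMA(Student, Major, Activity) rows from the sample table
sma = {
    ("S1", "Math", "Swimming"),
    ("S1", "Math", "Football"),
    ("S2", "Phys", "Baseball"),
    ("S2", "Math", "Baseball"),
}

# Decompose into SM(Student, Major) and SA(Student, Activity).
sm = {(s, m) for s, m, _a in sma}
sa = {(s, a) for s, _m, a in sma}

# Rejoin on Student: every major of a student pairs with every activity.
rejoined = {(s, m, a) for s, m in sm for s2, a in sa if s == s2}
print(rejoined == sma)  # True: nothing lost, nothing spurious

# In the single table, a new activity for S2 needs one row per major.
rows_needed_in_sma = len({m for s, m in sm if s == "S2"})
print(rows_needed_in_sma)  # 2 (in SA it is a single row)

# After decomposition a major can be stored without any activity.
sm.add(("S3", "Chem"))   # hypothetical new student
sa.add(("S2", "Chess"))  # hypothetical new activity, one row
```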
CIX is NOT 4NF.
Would decompose into CI and CX.
No loss since there’s no connection
between I and X.
There is a 5NF, but we will not cover it.
Relations that violate it rarely occur in practice.
Domain/Key Normal Form (DK/NF)
Landmark paper: Ronald Fagin, “A
Normal Form for Relational Databases
That Is Based on Domains and Keys”,
ACM Transactions on Database Systems,
September 1981.
In this paper he
Defined DK/NF
Proved that a relation in DK/NF has NO
modification anomalies
A relation having no modification anomalies
must be in DK/NF
What is it? First, some definitions:
Constraint:
a rule governing static values of attributes.
e.g. rules such as 0<=gpa<=4; credits >=0;
functional dependencies; multivalued
dependencies.
key:
unique identifier of a tuple.
domain:
description of an attribute’s allowable values
Def: A relation is DK/NF (Domain
Key Normal Form) if every
constraint is a logical consequence of
the definition of its keys and
domains.
Without an example this probably
makes little sense.
Ex. (from a previous edition of
Kroenke):
Track students, faculty, and who advises
whom.
Possible relations:
Student(SID, Sname, FID)
and
Faculty(FID, Fname, FacStatus)
Constraints
FacStatus=0 or 1 (undergrad/graduate);
FID begins with 1;
SID must not begin with 1;
SID of grad students begins with 9.
Only graduate faculty can advise graduate
students.
Alternative constraint statement:
“Grad student must be advised by Grad
Faculty” → “If SID starts with 9 then
FacStatus of the advisor must be 1”
Difficult to enforce through the database
design since the relevant data lies in two
distinct tables.
Each relation is still 1NF through 4NF
Decomposing tables:
Kroenke discussed Themes.
Each relation has a theme.
3 themes here:
Faculty
grad advising
undergrad advising
Possible Tables:
Faculty(FID, Fname, FacStatus)
G-ADV(GSID, Sname, GFID)
UG-ADV(UGSID, Sname, FID)
Domain Definitions
FID in CDDD where C=1; D=decimal
digit
This is a generic notation for our
purposes here.
In Access you’d write: FID Like "1###"
In SQL Server you’d write: FID LIKE
'1[0-9][0-9][0-9]' (see the F-Adv table in
the university database)
Fname in Char(30)
FacStatus in [0, 1]
GSID in CDDD where C=9; D=Decimal
digit
UGSID in CDDD where C != 1 and C != 9;
D = decimal digit
See G-Adv and UG-ADV tables for exact
syntax
Sname in CHAR(30)
GFID in {Select FID of Faculty, where
FacStatus=1} (assuming the DBMS supports
this type of constraint)
There is a trigger in G-Adv to implement the
equivalent of this.
Relations & Keys:
All constraints are met by enforcing key
and domain restrictions. i.e. it is DK/NF
This relation is guaranteed to have NO
modification anomalies.
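To make the idea concrete, the sketch below enforces the three tables using only domain checks and key uniqueness. It is a rough Python stand-in for what the DBMS would do, not the university database's actual definitions: the regular expressions encode the CDDD patterns, the helper names and sample people are invented, and the check that an undergraduate's FID exists in Faculty is an added assumption.

```python
import re

# Domain definitions (CDDD patterns)
FID_DOMAIN = re.compile(r"1[0-9]{3}")
GSID_DOMAIN = re.compile(r"9[0-9]{3}")
UGSID_DOMAIN = re.compile(r"[02-8][0-9]{3}")  # first digit is neither 1 nor 9

faculty = {}  # FID -> (Fname, FacStatus)
g_adv = {}    # GSID -> (Sname, GFID)
ug_adv = {}   # UGSID -> (Sname, FID)

def check(condition, message):
    if not condition:
        raise ValueError(message)

def add_faculty(fid, fname, fac_status):
    check(FID_DOMAIN.fullmatch(fid), "FID outside its domain")
    check(len(fname) <= 30, "Fname longer than 30 characters")
    check(fac_status in (0, 1), "FacStatus must be 0 or 1")
    check(fid not in faculty, "duplicate key FID")
    faculty[fid] = (fname, fac_status)

def add_grad_advisee(gsid, sname, gfid):
    # Domain of GFID: the FIDs of faculty whose FacStatus is 1
    grad_faculty = {f for f, (_name, status) in faculty.items() if status == 1}
    check(GSID_DOMAIN.fullmatch(gsid), "GSID outside its domain")
    check(len(sname) <= 30, "Sname longer than 30 characters")
    check(gfid in grad_faculty, "GFID must be a graduate faculty member")
    check(gsid not in g_adv, "duplicate key GSID")
    g_adv[gsid] = (sname, gfid)

def add_undergrad_advisee(ugsid, sname, fid):
    check(UGSID_DOMAIN.fullmatch(ugsid), "UGSID outside its domain")
    check(len(sname) <= 30, "Sname longer than 30 characters")
    check(fid in faculty, "FID must reference Faculty")  # assumed reference check
    check(ugsid not in ug_adv, "duplicate key UGSID")
    ug_adv[ugsid] = (sname, fid)

add_faculty("1001", "Able", 1)   # graduate faculty
add_faculty("1002", "Baker", 0)  # undergraduate faculty
add_grad_advisee("9001", "Cole", "1001")
add_undergrad_advisee("2001", "Egan", "1002")
try:
    add_grad_advisee("9002", "Dunn", "1002")  # Baker is not graduate faculty
except ValueError as err:
    print(err)  # GFID must be a graduate faculty member
```

Because GFID's domain depends on the current Faculty rows, a later change to a faculty member's FacStatus would need the same check, which is the role the G-Adv trigger plays.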
Examples:
A company hires student interns to work on various
projects under the guidance of company employees.
Semantics are as follows:
A student intern can work on several projects and a project
can use several interns.
A project can have several team leaders (or co-leaders)
which are company employees but an employee works on
only one project.
For each project on which an intern participates there is one
team leader to which that intern must report.
Consider a table, IPE, consisting of 3 attributes: Intern ID
(I#), Project ID (P#), and employee ID (E#). So for
example, if (I4, P5, E3) is an entry in this table then it
means that Intern I4 is working on project P5 and must
report to team leader E3 for that project.
List FDs; what should the primary key
be? Find the lowest normal form that is
violated?
(I#, P#) → E# and E# → P#
Primary key should be (I#, P#)
violates BCNF
Consider three possible decompositions of the
above relation as follows. Primary keys are
underlined.
Table IE(I#, E#) and IP(I#, P#).
IE might contain (I2, E4) and (I2, E6); IP might
contain (I2, P3) and (I2, P5);
What project does E4 work on?
Lose the project to which an employee is
assigned.
Table IE(I#, E#) and EP(E#, P#)
Since E# → P# we can get the project
that an intern is working on through a
join of these two tables.
Table EP(E#, P#) and IP(I#, P#)
EP might contain (E2, P4) and (E4, P4); IP
might contain (I2, P4)
To which employee does I2 report?
Lose the employee to which an intern
reports.
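The three IPE decompositions can be compared by rejoining the projections of a small instance. The rows below are invented but respect (I#, P#) → E# and E# → P#; any tuple produced by a join that is not in the original table is spurious, which is how the lost information shows up.

```python
# Hypothetical IPE(I#, P#, E#) rows
ipe = {("I2", "P3", "E4"), ("I2", "P5", "E6"), ("I1", "P3", "E7")}

ie = {(i, e) for i, _p, e in ipe}
ip = {(i, p) for i, p, _e in ipe}
ep = {(e, p) for _i, p, e in ipe}

# Rejoin each pair of projections on the attribute they share.
ie_ip = {(i, p, e) for i, e in ie for i2, p in ip if i == i2}
ie_ep = {(i, p, e) for i, e in ie for e2, p in ep if e == e2}
ep_ip = {(i, p, e) for e, p in ep for i, p2 in ip if p == p2}

print(sorted(ie_ip - ipe))  # [('I2', 'P3', 'E6'), ('I2', 'P5', 'E4')]
print(ie_ep == ipe)         # True: IE with EP recovers the table
print(sorted(ep_ip - ipe))  # [('I1', 'P3', 'E4'), ('I2', 'P3', 'E7')]
```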
Assume the following scenario in a university
in which a student is paid by a department to do
work for a faculty member. Semantics are as
follows:
A department can hire many students and a faculty member
can have many students working for him/her.
Each student can work for only one faculty member and is paid
through the faculty member’s department budget.
A department has many faculty members but each faculty
member is a member of one department.
There is a table consisting of 3 attributes: Student ID (SID),
Department ID (DID), and Faculty ID (FID). So for example,
if (S4, D5, F3) is an entry in this table then it means that
Student S4 is working for faculty member F3 who, in turn, is a
member of department D5.
List FDs; what should the primary key
be? Find the lowest normal form that is
violated?
SID FID  DID
Primary key should be SID
Violates 3NF
SID FID  DID
Consider three possible decompositions
of the above relation as follows:
SF(SID,FID) and SD(SID, DID): By doing
a join between these tables, you can get the
department of a faculty member
But only IF the faculty member has a student
employee.
Also, tables are not independent
SID FID  DID
SF(SID, FID) and FD(FID, DID):
By doing a join between these tables,
you can get the department that is
paying the student.
SID FID  DID
FD(FID, DID) and SD(SID, DID)
Can you construct an example that shows
you may not get the faculty member for
whom a student is working?
From a previous exam
An organization needs to track many ongoing projects,
the department responsible for each project, and which
employees are project leaders. Rules are as follows
Each project is the responsibility of a single department.
Each project has one project leader who is a member of the
department responsible for the project.
A department can have many employees and be responsible for
many projects.
An employee can be a project leader for several projects.
Each employee is assigned to one department.
Proceed as in the previous slides