Lecture Notes

Normalization
INLS 256-02, Database I
Normalization refers to a set of rigorous, formally defined standards for good design,
together with methods for testing a DB’s design against those standards.
4 Indications of Quality:
1. Semantics.
2. Redundancy. You want to avoid redundancy as much as possible.
Insertion Anomalies: Example: (s#, name, address, p#, qty), where s#, p# is the
key. Each time you enter a new part for a supplier, you must be sure that the name
and address are the same as in all the other tuples for that supplier number.
Otherwise, depending on which tuple you look at, you could get different
information. Another example: (VID, model, SSN, name, address, purchase-date),
where VID, SSN is the key. Each time you enter a new car, you must also enter the
SSN, name, and address information of the owner.
So the two problems of insertion are 1. redundancy gives opportunities for
inconsistency and 2. mixing information from different entities in one tuple means
that you have to “cheat” to represent an entity when you only have part of the
information.
Deletion Anomalies: Using the example above, what happens when we delete the
last shipment of a particular supplier? We lose the address and name information
of the supplier. Because we’ve mixed entities, we can lose “permanent”
information on one when we just mean to delete a single occurrence of something.
Modification Anomalies: In the example, suppose a supplier moves. We
must change its address in all the tuples that it appears in, or else we’ll have
inconsistent information. Similarly, if an owner moves, the address must be
changed in all records. Note that the cascade update option in enforcing
referential integrity might catch some of these changes, for anything that has been
declared as a foreign key, but not for other fields.
3. Null values.
4. Spurious tuples. A spurious tuple is one that you “invent” by joining two badly
designed tables together.
Functional Dependency
Functional dependency has to do with the semantics (domains) of the tables in your DB,
so in order to determine where the dependencies lie, you must understand the domain.
These are constraints that must hold at all times in the DB, so you can’t just tell what is
required or allowed by looking at some existing tuples.
In a relation R, an attribute Y is functionally dependent on attribute X if each X-value in
the relation is associated with only one Y-value at any one time. X and Y may be
composite attributes.
R.XR.Y
X functionally determines Y (note that Y may be associated with more than one X)
e.g. in suppliers:
S(s#, name, status, city)
Where status refers to the shipping speed.
S.s#name
S.s#status
S.s#city
If X is a candidate key, then all Y’s will necessarily be functionally dependent on X, by
definition of a candidate key (it is unique for each tuple). This is especially important in
terms of the primary key.
E.g. in shipments:
SP(s#, p#, qty)
(s# p#)qty
The composite key functionally determines quantity.
Note that this doesn’t mean that X HAS to be a candidate key. We can rephrase the
definition to say that Y is dependent on X if whenever two tuples have the same X value,
they will have the same Y value.
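The rephrased definition can be turned into a small check. The sketch below is only an illustration in Python: the sample tuples and the function name violates_fd are made up for this example and are not part of the course material. Remember that existing tuples can only show that a dependency is violated; they can never prove that it holds, because that depends on the semantics of the domain.

```python
def violates_fd(rows, lhs, rhs):
    """Return True if two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first Y-value seen for this X-value; any later
        # tuple with the same X must carry the same Y.
        if seen.setdefault(x, y) != y:
            return True
    return False

# Hypothetical sample tuples for S(s#, name, status, city).
suppliers = [
    {"s#": 1, "name": "AAA", "status": 10, "city": "Durham"},
    {"s#": 2, "name": "BBB", "status": 20, "city": "Raleigh"},
    {"s#": 3, "name": "CCC", "status": 20, "city": "Durham"},
]

print(violates_fd(suppliers, ["s#"], ["city"]))      # False
print(violates_fd(suppliers, ["status"], ["city"]))  # True
```

In this sample, two suppliers share status 20 but sit in different cities, so status cannot determine city. The absence of a violation for s# is consistent with s# -> city, but does not establish it.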
Full functional dependence: This is a slightly more restrictive definition. Y is fully
functionally dependent on X if it is functionally dependent on all of X, not just on a
subset. E.g., if X is composite, it must be functionally dependent on the combination of
attributes making up X, not just on one of them.
E.g.:
(s#, status)city
city is dependent on the combination of s# and status, but is only fully dependent on s#.
We will generally use the full functional dependence definition.
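One way to test full dependence, given a list of dependencies, is to check that no proper subset of X already determines Y. The Python sketch below assumes the dependencies for the supplier relation are written down as (left side, right side) pairs; the helper names closure and is_full_dependency are invented for this illustration.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_full_dependency(lhs, rhs, fds):
    """True if lhs -> rhs holds and no proper subset of lhs determines rhs."""
    lhs, rhs = set(lhs), set(rhs)
    if not rhs <= closure(lhs, fds):
        return False  # not a dependency at all
    for size in range(1, len(lhs)):
        for subset in combinations(sorted(lhs), size):
            if rhs <= closure(subset, fds):
                return False  # a smaller left side is enough
    return True

# Assumed dependencies for S(s#, name, status, city): s# determines the rest.
fds = [(("s#",), ("name", "status", "city"))]

print(is_full_dependency({"s#", "status"}, {"city"}, fds))  # False
print(is_full_dependency({"s#"}, {"city"}, fds))            # True
```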
Normalization:
This is a process of looking at the tables in a DB to see if they pass a series of tests.
Generally, if a table fails a test, the solution is to break it up into smaller tables. Normal
forms are increasingly strict levels of tests, and are designed to avoid the problems of
anomalies, redundancy, and ambiguity we discussed earlier. Generally, you want to
normalize up to 3rd or BC normal form, although there are situations where, because of
the semantics or use of the DB, that won’t be practical. When you make a decision not to
normalize to some degree, you must make it consciously, and try to safeguard the DB
from the problems that could arise.
Normalization up to BC normal form is based on functional dependencies and key
constraints. We’ve just learned about functional dependency; here’s some review of
terms.
superkey: any combination of attributes that uniquely identifies each tuple in a relation.
candidate key: any minimal combination of attributes that uniquely identifies each tuple
in a relation – you can’t remove any attribute and still have unique identification.
primary key: the candidate key chosen to be the unique identifier for a relation.
secondary key: all the rest of the candidate keys.
prime attribute: an attribute that is a member of a candidate key.
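These terms can be made concrete with a brute-force search, which is fine for the small relations used in these notes. The sketch below uses the shipments relation SP(s#, p#, qty) with the single dependency (s#, p#) -> qty; the function names are made up for this illustration.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_superkey(attrs, attributes, fds):
    return closure(attrs, fds) == set(attributes)

def candidate_keys(attributes, fds):
    """Minimal superkeys, found by trying attribute sets from small to large."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            if any(key <= set(combo) for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if is_superkey(combo, attributes, fds):
                keys.append(set(combo))
    return keys

sp_attributes = {"s#", "p#", "qty"}
sp_fds = [(("s#", "p#"), ("qty",))]

keys = candidate_keys(sp_attributes, sp_fds)
prime = set().union(*keys)

print([sorted(k) for k in keys])                                 # [['p#', 's#']]
print(sorted(prime))                                             # ['p#', 's#']
print(is_superkey({"s#", "p#", "qty"}, sp_attributes, sp_fds))   # True
```

Here (s#, p#, qty) is a superkey but not a candidate key, because qty can be removed and the remaining attributes still identify each tuple.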
1NF – First Normal Form: A relation is in 1NF if and only if all underlying domains
contain atomic values only. 1NF isn’t very interesting – it is a stepping stone to others.
For instance, if we have (s#, name, city), where each supplier may have several branch
locations in different cities, then the relation has a domain that allows sets of cities for
values, thus they aren’t atomic. The solution is to decompose the table into 2 tables: (s#,
name) and (s#, location).
s#   name   city
1    AAA    Durham
2    BBB    Durham, Raleigh
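The decomposition into (s#, name) and (s#, location) can be sketched in Python as follows. The list-valued city field stands in for the non-atomic domain, and the variable names are chosen only for this illustration.

```python
# The non-1NF table: the city value is a set of cities, not an atomic value.
suppliers = [
    {"s#": 1, "name": "AAA", "city": ["Durham"]},
    {"s#": 2, "name": "BBB", "city": ["Durham", "Raleigh"]},
]

# (s#, name): one tuple per supplier.
supplier_names = [(row["s#"], row["name"]) for row in suppliers]

# (s#, location): one tuple per supplier and city, so every value is atomic.
supplier_locations = [(row["s#"], city)
                      for row in suppliers
                      for city in row["city"]]

print(supplier_names)      # [(1, 'AAA'), (2, 'BBB')]
print(supplier_locations)  # [(1, 'Durham'), (2, 'Durham'), (2, 'Raleigh')]
```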
2NF – Second Normal Form: A relation is in 2NF if and only if it is in 1NF and every
nonprime attribute is fully dependent on the primary key.
Generalized 2NF: a relation is in 2NF if every nonprime attribute is fully dependent on
every candidate key, not just on the primary key.
Ex1(s#, tax, city, p#, qty) (where tax equals the tax rate of the city)
where the primary key is (s#, p#). Tax is determined by the city, and so is functionally
dependent on city. Note that the functional dependency diagram has dependency arrows
that don’t just come out of the key – they also come from other attributes. Also note that
the arrow to city only comes out of s#, not out of the entire key, and the arrow to tax
comes out of city. This structure will have update anomalies.
Insert. You cannot enter the existence of a new supplier and city unless that supplier is
shipping a part. This is because of the entity integrity rule, that all fields in the key
must have values. p# is part of the key.
Delete. If a supplier has only one shipment, and it gets deleted, you also delete all
knowledge of that supplier, such as the city.
Update. Because of the redundancy, if a supplier moves to a different city, then you must
either 1. find all occurrences of the supplier and change the city or 2. change one
occurrence, and have an inconsistent DB.
The solution is to divide this into two tables, where the key of the new table will be the
attribute that was independently determining the values of some attributes (here, s#). So
now we have
Ex2a(s#, tax, city)
Ex2b(s#, p#, qty)
The functional dependency diagram shows that each of them now contains only attributes
that are fully dependent on the primary key of that relation. Insert – can now insert the
existence of a supplier into Ex2a, without a shipment. Delete – can now delete a shipment
from Ex2b without losing information about the supplier. Update – can now update the
city in only one place.
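The split of Ex1 into Ex2a and Ex2b can be sketched in Python as two projections. The sample tuples are invented for this illustration (tax is written as a whole-number percentage), and project is a helper defined only here.

```python
def project(rows, attrs):
    """Project rows onto attrs, dropping duplicate tuples."""
    result = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in result:
            result.append(t)
    return result

# Hypothetical tuples for Ex1(s#, tax, city, p#, qty).
ex1 = [
    {"s#": 1, "tax": 7, "city": "Durham", "p#": "P1", "qty": 100},
    {"s#": 1, "tax": 7, "city": "Durham", "p#": "P2", "qty": 200},
    {"s#": 2, "tax": 8, "city": "Raleigh", "p#": "P1", "qty": 300},
]

ex2a = project(ex1, ["s#", "tax", "city"])   # one tuple per supplier
ex2b = project(ex1, ["s#", "p#", "qty"])     # one tuple per shipment

# Joining the two tables on s# gives back exactly the original tuples.
rejoined = [{**a, **b} for a in ex2a for b in ex2b if a["s#"] == b["s#"]]

print(len(ex2a))  # 2
print(len(rejoined) == len(ex1) and all(r in ex1 for r in rejoined))  # True
```

Supplier 1's city and tax rate are now stored once in Ex2a instead of once per shipment, which is what removes the update anomaly.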
3NF – Third Normal Form: A relation is in 3NF if and only if it is in 2NF and every
nonprime attribute is nontransitively dependent on the primary key.
Generalized 3NF: every nonprime attribute is fully functionally dependent on every
key, and nontransitively dependent on every key. In practice, this can be tested as
follows: a relation R is in 3NF if, whenever a nontrivial X->A holds, either (a) X is a
superkey of R, or (b) A is a prime attribute of R.
A transitive dependence is when
r.a -> r.b and r.b -> r.c
hold. Therefore, the transitive dependency
r.a -> r.c
also holds. This can be seen in the functional dependency diagram for Ex2a.
Ex2a(s#, tax, city)
Tax rate is dependent on city. City is dependent on s#. Therefore, tax rate is dependent on
s# through city. This shows up in the diagram as arrows that originate from places
other than the key. This also gives anomalies.
Insert – cannot enter that a city has a tax rate unless we have a supplier there. Again, this
is because we cannot have a null primary key.
Delete – if there is only one supplier in a city, when we delete the supplier, we delete the
tax information for that city.
Update – if we change the tax rate for a city, we must either 1. find all suppliers in that
city and change the tax rate for each or 2. change only one and have an inconsistent DB.
The solution is to break the relation into two relations. The point here is to get rid of the
extra arrows, and make simple functional dependencies. So the two new relations are
Ex3a(s#, city), where city is a foreign key (FK) referencing Ex3b
Ex3b(city, tax)
Now the functional dependency diagrams are simple, there are no transitive
dependencies, all attributes are fully dependent on the key, and they are in 3NF.
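The 3NF test (X is a superkey, or A is a prime attribute) can be sketched in Python. This is only an illustration: it checks the dependencies that are listed, finds candidate keys by brute force, and uses helper names invented here. The dependencies are the ones described for Ex2a.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attributes, fds):
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            if any(key <= set(combo) for key in keys):
                continue  # not minimal
            if closure(combo, fds) == set(attributes):
                keys.append(set(combo))
    return keys

def violations_3nf(attributes, fds):
    """Listed dependencies X -> A where X is no superkey and A is not prime."""
    prime = set().union(*candidate_keys(attributes, fds))
    bad = []
    for lhs, rhs in fds:
        for a in sorted(set(rhs) - set(lhs)):  # skip trivial parts
            if closure(lhs, fds) != set(attributes) and a not in prime:
                bad.append((tuple(lhs), a))
    return bad

ex2a_fds = [(("s#",), ("city",)), (("city",), ("tax",))]

print(violations_3nf({"s#", "tax", "city"}, ex2a_fds))        # [(('city',), 'tax')]
print(violations_3nf({"s#", "city"}, [(("s#",), ("city",))]))   # []
print(violations_3nf({"city", "tax"}, [(("city",), ("tax",))])) # []
```

The one violation reported for Ex2a is exactly the extra arrow from city to tax; after the split, neither table has any.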
BCNF – Boyce-Codd Normal Form: A relation is in BCNF if and only if every
determinant is a candidate key. This is just like 3NF except that it is stricter, in that
we have removed the exception that allows A to be a prime attribute of R. I.e., a relation
R is in BCNF if, whenever a nontrivial X->A holds, X is a superkey of R (all attributes
are functionally determined by keys).
See property example and discussion in book (pages 491-495). Basically, BCNF is
3NF without the exception that in X->A, A can be a prime attribute. I.e., even if A is a
prime attribute, it is still a violation if X is not a superkey.
A determinant is any attribute (or set of attributes) on which another attribute is fully
functionally dependent.
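The BCNF test is simpler to sketch than the 3NF test, because there is no prime-attribute exception to check. The Python example below does not use the property example from the book; it uses a hypothetical relation R(student, course, instructor), chosen only because it is in 3NF but not in BCNF. The function names are invented here.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(attributes, fds):
    """Listed nontrivial dependencies whose left side is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not set(rhs) <= set(lhs)
            and closure(lhs, fds) != set(attributes)]

# Hypothetical relation: each instructor teaches only one course, and a
# student has one instructor per course.
r = {"student", "course", "instructor"}
r_fds = [
    (("student", "course"), ("instructor",)),
    (("instructor",), ("course",)),
]

print(bcnf_violations(r, r_fds))  # [(('instructor',), ('course',))]
```

Here course is a prime attribute, since (student, course) is a candidate key, so instructor -> course is allowed by 3NF. It still violates BCNF because instructor is not a superkey.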
Multivalued Dependencies – 4NF
A problem with multivalued dependencies occurs when you are trying to express 2
independent 1:N relations, or multi-valued attributes, in the same relation. For example,
in your initial design process, you may have seen something like:
manager   phone#
manager   employee
e.g.,
manager   phone#
Fred      999-1212, 999-1312, 999-1313
manager   employee
Fred      George, Linda, Ellen
where the manager is associated with (multidetermines) a set of phone numbers, and also a set
of employees, but the phone numbers and the employees have nothing to do with each
other.
Of course, you can’t have a relation that looks like the ones above – it is excluded by
1NF.
You are trying to express the idea that the manager is associated with a set of employees,
and a set of phone numbers, but that the employees and the phone numbers are
independent of one another.
So, you might design a relation that looks like
mgr    phone#     emp
Fred   999-1212   George
Fred   999-1312   Linda
Fred   999-1313   Ellen
etc.
But that implies a relationship (connection) between 999-1212 and George. To avoid that
appearance, you would have to store all combinations of phone# and employee.
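The "all combinations" requirement can be checked mechanically. The Python sketch below treats the table as (mgr, phone#, emp) triples and tests whether, for each manager, every phone number is paired with every employee. The function name satisfies_mvd is made up for this illustration.

```python
from itertools import product

def satisfies_mvd(rows):
    """rows are (x, y, z) triples; True if, for each x, every y that
    appears with x is paired with every z that appears with x."""
    groups = {}
    for x, y, z in rows:
        ys, zs, pairs = groups.setdefault(x, (set(), set(), set()))
        ys.add(y)
        zs.add(z)
        pairs.add((y, z))
    return all(pairs == set(product(ys, zs))
               for ys, zs, pairs in groups.values())

phones = ["999-1212", "999-1312", "999-1313"]
employees = ["George", "Linda", "Ellen"]

# The three-row table pairs each phone with just one employee.
paired = list(zip(["Fred"] * 3, phones, employees))

# Storing all combinations removes the implied connection.
all_combinations = [("Fred", p, e) for p, e in product(phones, employees)]

print(satisfies_mvd(paired))            # False
print(satisfies_mvd(all_combinations))  # True
print(len(all_combinations))            # 9
```

Nine tuples are needed to say what is really only six facts (three phones and three employees), which is the redundancy that 4NF is meant to remove.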
Two 1:N relations (or multivalued attributes), A:B, A:C, where B and C are independent
of each other.
x->>y x determines a set of values y
x->>z x determines a set of values z
The only time a multi-valued dependency is a problem is when you have more than one
mvd, and the y and z values are independent.
A trivial mvd is one where
1. The y attribute(s) are a subset of the x attributes. That is, if you made them
distinct from each other, there would no longer be an mvd. E.g., abc->>b.
2. The union of x and y make up the entire relation – there are no other attributes in
the table.
Otherwise, you have a nontrivial mvd, and these are the potential problems.
There are lots of redundancies – room for anomalies.
Note that the relations with non-trivial mvd’s tend to be all-key relations, where the key
is the entire relation.
The cure: 4NF
A relation is in 4NF if for every nontrivial mvd x->>y, x is a superkey (any combination
of attributes, not necessarily minimal, that is a unique ID for the relation). The
manager table used as an example above is not in 4NF. mgr->>phone# is a nontrivial mvd
because phone# is not a subset of mgr, and there is also employee in the table; mgr is
not a superkey. Similarly, mgr->>employee is a nontrivial mvd.
As usual, the way to get a relation into 4NF is to decompose it, to get the mvds into
separate relations:
(manager, phone#)
(manager,employee)
which are now trivial mvds, making up the entire table. This decomposition will have the
lossless join property.
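As a rough check of the lossless join claim, the Python sketch below decomposes the all-combinations manager table into (manager, phone#) and (manager, employee) and joins them back on manager. The data are the example values from these notes; the variable names are chosen only for this illustration.

```python
from itertools import product

phones = ["999-1212", "999-1312", "999-1313"]
employees = ["George", "Linda", "Ellen"]

# The single table that stores every phone#/employee combination.
mgr_table = {("Fred", p, e) for p, e in product(phones, employees)}

# The 4NF decomposition: one relation per mvd.
mgr_phone = {(m, p) for m, p, _ in mgr_table}
mgr_emp = {(m, e) for m, _, e in mgr_table}

# Natural join of the two relations on manager.
rejoined = {(m, p, e)
            for m, p in mgr_phone
            for m2, e in mgr_emp
            if m == m2}

print(len(mgr_table), len(mgr_phone), len(mgr_emp))  # 9 3 3
print(rejoined == mgr_table)                         # True
```

The join gives back exactly the original nine tuples, with nothing lost and no spurious tuples added, while the decomposed tables store only six tuples in total.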
The Overall Idea
Remember that the goal here is to get a good design. Starting from an ER diagram is one
way, although you still might want to check normalization of tables. But starting with a
bunch of tables and then normalizing them (or starting with one enormous table) is
another approach. We have been talking about normalization as something that you do
regarding just one table in the database. It is also important to look at your DB design in
terms of how the tables relate to each other, and how you can combine them. Merely
having a bunch of tables in 3NF or BCNF is not enough.
Some definitions: In a database design, we have a decomposition D of the universal
relation R. This is the way that all of the attributes have been decomposed into tables.
There is a set of functional dependencies F that hold over the attributes of R; this
depends on the semantics of the DB and how things work in the world it models.
Dependency Preserving Decomposition:
In decomposition, it is possible to lose a functional dependency – this is undesirable, so a
good decomposition will preserve dependencies. There are two ways of storing functional
dependencies: they can be in the same table, or they can be inferred from different tables.
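Whether a dependency survives a decomposition can be tested without joining any tables. The Python sketch below uses the standard textbook procedure: start from X, repeatedly add whatever can be derived inside each individual table, and see whether Y is reached. The function names are invented here, and the dependencies are the ones described for Ex2a.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_preserved(lhs, rhs, tables, fds):
    """True if lhs -> rhs can still be derived table by table."""
    reached = set(lhs)
    changed = True
    while changed:
        changed = False
        for table in tables:
            table = set(table)
            # What the attributes reached so far determine within this table.
            gained = closure(reached & table, fds) & table
            if not gained <= reached:
                reached |= gained
                changed = True
    return set(rhs) <= reached

fds = [(("s#",), ("city",)), (("city",), ("tax",))]

good = [{"s#", "city"}, {"city", "tax"}]   # Ex3a, Ex3b
bad = [{"s#", "city"}, {"s#", "tax"}]      # a hypothetical alternative split

print(all(is_preserved(l, r, good, fds) for l, r in fds))  # True
print(is_preserved(("city",), ("tax",), bad, fds))         # False
```

In the alternative split, city and tax never appear in the same table, so city -> tax can only be checked by joining the tables, which is the situation a dependency-preserving decomposition avoids.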
Lossless (Additive) Joins:
Another important feature of a good decomposition is that it gives lossless joins. This is
the problem of spurious tuples. The term “lossless” refers to the problem of losing
information – the way that you lose information here is by getting noise (spurious tuples)
into your table.
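A small Python sketch of how spurious tuples arise: a hypothetical relation (s#, city, p#), with made-up tuples, is split into (s#, city) and (city, p#) and joined back on city. Nothing here comes from the course examples except the attribute names.

```python
def project(rows, positions):
    """Project tuples onto the given column positions (duplicates vanish)."""
    return {tuple(row[i] for i in positions) for row in rows}

# Hypothetical relation (s#, city, p#): both suppliers are in Durham,
# and each supplies a different part.
r = {(1, "Durham", "P1"), (2, "Durham", "P2")}

s_city = project(r, (0, 1))   # (s#, city)
city_p = project(r, (1, 2))   # (city, p#)

# Natural join on city.
joined = {(s, c, p) for s, c in s_city for c2, p in city_p if c == c2}

spurious = joined - r
print(sorted(spurious))  # [(1, 'Durham', 'P2'), (2, 'Durham', 'P1')]
```

The join returns four tuples where the original had two. No tuple was dropped, but we can no longer tell which supplier ships which part, so information has been lost.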
Properties of lossless join decomposition:
1. For 2-relation DB schemas: the attributes common to both relations must functionally
determine either those attributes that appear in only the first relation, or those that
appear in only the second relation.
2. Once you have established a decomposition with the lossless join property, you
can further decompose one of its tables without losing the property. So, to
decompose and maintain lossless joins:
For each table in the DB that isn’t in BCNF, find the functional dependency X->Y that
is in violation (that is, whose determinant X is not a candidate key),
and break the relation into two. One relation contains the X and Y attributes from
the functional dependency. The other contains X together with the rest of the
attributes (everything except Y).
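Both properties can be sketched in Python: a test for the 2-relation case, and one decomposition step that splits a table on a violating dependency. The function names are invented for this illustration, and the dependencies are the ones described for Ex2a.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    """Property 1: the common attributes must determine r1 - r2 or r2 - r1."""
    r1, r2 = set(r1), set(r2)
    determined = closure(r1 & r2, fds)
    return (r1 - r2) <= determined or (r2 - r1) <= determined

def split_on_violation(attributes, lhs, rhs):
    """Property 2 step: one table gets X and Y, the other keeps X and the rest."""
    x, y = set(lhs), set(rhs)
    return x | y, set(attributes) - (y - x)

fds = [(("s#",), ("city",)), (("city",), ("tax",))]
ex2a = {"s#", "tax", "city"}

# city -> tax violates BCNF in Ex2a, so split on it.
t1, t2 = split_on_violation(ex2a, ("city",), ("tax",))
print(sorted(t1), sorted(t2))                              # ['city', 'tax'] ['city', 's#']
print(lossless_binary(t1, t2, fds))                        # True

# Splitting on tax instead shares an attribute that determines nothing.
print(lossless_binary({"s#", "tax"}, {"city", "tax"}, fds))  # False
```

Because the split keeps X in both tables and X determines Y, the shared attributes always determine everything that appears only in the first table, which is why this step cannot lose the lossless join property.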
You can’t always perform the “ideal” decomposition, that is in BCNF and preserves
dependencies. You may only be able to get to 3NF. You must then decide whether to
leave it there, and build in protection for update anomalies, or to decompose even further,
with the resulting loss of performance.
In terms of design, remember that it isn’t a good idea to design a table that will get too
many nulls. It is better to break it up into another table. However, this could also result in
the problem of dangling tuples. The representation of a “thing” is broken up into 2
tables. To get the full information on the “thing”, you join the tables together. However,
if some tuples have either null values on a join attribute, or don’t appear at all in one
table, they won’t appear in the result, unless you know in advance that you should do an
outer join.
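The dangling tuple problem can be sketched in Python with two small tables. The data are made up for this illustration: supplier 3 has no row in the locations table, so it is a dangling tuple.

```python
# Hypothetical tables: (s#, name) and (s#, city).
suppliers = [(1, "AAA"), (2, "BBB"), (3, "CCC")]
locations = [(1, "Durham"), (2, "Durham"), (2, "Raleigh")]

# Ordinary (inner) join on s#: supplier 3 silently disappears.
inner = [(s, name, city)
         for s, name in suppliers
         for s2, city in locations
         if s == s2]

# Left outer join: keep every supplier, padding with None where
# there is no matching location.
left_outer = []
for s, name in suppliers:
    cities = [city for s2, city in locations if s2 == s]
    for city in cities or [None]:
        left_outer.append((s, name, city))

print(len(inner))       # 3
print(left_outer[-1])   # (3, 'CCC', None)
```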