Module 5: Normalization
Overview
In this module, we introduce the concept of functional dependency in the relational model and then examine the
important process of normalization. Through normalization we will see how a set of formal techniques is used to
decompose complex relational database relations that have redundancy and update anomalies into a set of
smaller, less redundant relations, without losing any information.
Objectives
After completing this module, you should be able to:
describe the concept of functional dependency in a relation
discuss the importance of Armstrong's rules and closure sets of functional dependencies and attributes
explain the purpose of finding a superkey for a relation
describe how non-loss decompositions of a relation can be accomplished using projections
explain what is meant by reversibility of decompositions using natural joins
explain the requirements for a relation to be in first normal form (1NF), second normal form (2NF) and
third normal form (3NF)
explain why normalization beyond 3NF may sometimes be required, and briefly describe Boyce-Codd
normal form (BCNF), fourth normal form (4NF) and fifth normal form (5NF)
decompose a relation into a set of 3NF relations
Commentary
I. Functional Dependency
II. Update Anomalies
III. First Normal Form (1NF)
IV. Second Normal Form (2NF)
V. Third Normal Form (3NF)
VI. Boyce-Codd Normal Form (BCNF)
VII. Fourth Normal Form (4NF)
VIII. Fifth Normal Form (5NF)
IX. Normalization Case Study
I. Functional Dependency
Normalization is a formal process used with relational databases to remove redundancy from individual
relations and alleviate the serious problems this redundancy causes. In this module we will be using the formal
relational model terminology of relation, tuple, and attribute, although the concepts of normalization apply
equally well to relational database tables, rows, and columns.
In order to understand what makes a relation "well formed" (i.e., normalized), it is first necessary to discuss the
concept of functional dependency. A functional dependency is a relationship between sets of attributes in a
relation that exists for all time. Assume that we have a relation ORDERS (as shown in figure 5.1, below) to
record the orders that our customers place in an organization.
Figure 5.1
In this relation, attribute Order_ID is the primary key, uniquely identifying each of the tuples, with each tuple
containing data for one order. We can see that for any value of Order_ID we are always able to determine the
Customer_ID. Likewise if we are given the value for a Customer_ID we are always able to determine the
Last_Name and First_Name of the customer. We can express these two functional dependencies as:
Order_ID → Customer_ID
Customer_ID → {Last_Name, First_Name}
In the first functional dependency we can see that Order_ID determines the Customer_ID. We record this
relationship by saying that Order_ID is the determinant and Customer_ID is the dependent. In the second
functional dependency, the dependent is a set of two attributes. When either the determinant or dependent is a
single attribute (i.e., a singleton set), we can drop the brackets around the attribute. Note that if we know the
Customer_ID we cannot necessarily determine the Order_ID, so the relationship Customer_ID → Order_ID is
not a functional dependency.
In these two relationships, and in others that exist in the ORDERS relation, knowing the values of the set of
attributes on the left-hand side of the relationship functionally determines the values of the set of attributes
on the right-hand side. A functional dependency is thus a many-to-one relationship between attribute sets in
a single relation. Many instances of values of the determinant set always yield the same value for the
dependent set.
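The many-to-one idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the original module: the function name `holds` and the sample tuples are hypothetical, and the data of figure 5.1 is not reproduced. Note also that a check against sample data can only refute a functional dependency; it cannot prove that the dependency holds for all time.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not contradicted by rows."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in determinant)
        right = tuple(row[a] for a in dependent)
        # setdefault remembers the first dependent value seen for each
        # determinant value; a different later value breaks the dependency.
        if seen.setdefault(left, right) != right:
            return False
    return True

# Hypothetical sample tuples (assumed, not taken from figure 5.1).
orders = [
    {"Order_ID": "O1", "Customer_ID": "C1", "Last_Name": "Smith"},
    {"Order_ID": "O2", "Customer_ID": "C1", "Last_Name": "Smith"},
    {"Order_ID": "O3", "Customer_ID": "C2", "Last_Name": "Lee"},
]

print(holds(orders, ["Order_ID"], ["Customer_ID"]))   # True
print(holds(orders, ["Customer_ID"], ["Order_ID"]))   # False
```

Customer C1 appears with two different orders, which is why Customer_ID does not determine Order_ID in this sample.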
We can show the functional dependencies that exist in the ORDERS relation via a functional dependency
diagram. Two examples of functional dependency diagrams are shown in Figure 5.2.
Figure 5.2A
Figure 5.2B
These two diagrams vary only as to whether the attributes in the relation are connected in a row, or not
touching. In both diagramming formats, however, the lines emanate from determinant attribute(s) and the
arrows point towards dependent attribute(s).
In the remaining discussion of this section, we will see how the theoretical aspects of functional dependency can
allow us to find two practical items—a primary key for a relation, and a minimal set of functional dependencies.
The two functional dependencies we listed earlier for the ORDERS relation are just a few of the many total
functional dependencies that exist for the relation. Among the others are:
Order_ID → Date_Placed
Customer_ID → Age
Customer_ID → Zip_Code
Zip_Code → {City, State}
The entire set of functional dependencies that exist for a relation is called the closure set of functional
dependencies. To determine the closure set of functional dependencies of a relation from some initial
set, we can apply Armstrong's six well-known inference rules to infer additional functional
dependencies from those we already know to exist. One example of Armstrong's inference rules is the
transitive rule, which says that if one set of attributes determines a second set, and this second set
determines a third set, then the first set determines the third set.
If we start with an initial set of attributes of a relation and a set of functional dependencies, we can
iteratively apply Armstrong's inference rules until the closure set of attributes for our initial set
is reached. Whenever the determinant of a functional dependency is already contained in our set, we add its
dependent attributes, so the set grows into a larger subset of the attributes of the relation. The process stops
when no additional attributes can be added, thus achieving the closure set of attributes.
If the closure set of attributes contains all attributes of the relation, then the initial set is a superkey.
We can then reduce the superkey by removing attributes not needed to determine uniqueness, until we are left
with a minimal set of attributes. At this point we have a candidate key, and it can be chosen as the primary
key of the relation.
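The closure procedure and the reduction of a superkey to a candidate key can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the attribute list for ORDERS is assumed from the functional dependencies named in the text rather than read from figure 5.1.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (determinant, dependent) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already hold the whole determinant, we gain its dependents.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def reduce_to_candidate_key(superkey, all_attrs, fds):
    """Drop attributes one at a time while the rest still determines everything."""
    key = set(superkey)
    for attr in sorted(superkey):
        if closure(key - {attr}, fds) == all_attrs:
            key.discard(attr)
    return key

# Functional dependencies of ORDERS as listed in the text.
FDS = [
    ({"Order_ID"}, {"Customer_ID", "Date_Placed"}),
    ({"Customer_ID"}, {"Last_Name", "First_Name", "Age", "Zip_Code"}),
    ({"Zip_Code"}, {"City", "State"}),
]
# Assumed complete attribute list of ORDERS.
ALL_ATTRS = {"Order_ID", "Date_Placed", "Customer_ID", "Last_Name",
             "First_Name", "Age", "Zip_Code", "City", "State"}

print(closure({"Order_ID"}, FDS) == ALL_ATTRS)       # True: a superkey
print(closure({"Customer_ID"}, FDS) == ALL_ATTRS)    # False: Order_ID is missing
print(reduce_to_candidate_key({"Order_ID", "Customer_ID"}, ALL_ATTRS, FDS))
```

The last call starts from the superkey {Order_ID, Customer_ID} and removes Customer_ID, leaving the candidate key {Order_ID}.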
Also, once we have determined the closure set of functional dependencies for a relation, we can use
Armstrong's inference rules and other functional dependency rules to derive an irreducible cover set of
functional dependencies. This set is the minimal set of functional dependencies the database designer needs
to enforce in the database to ensure that the total integrity of the relation has been maintained.
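One common way to derive an irreducible cover is sketched below. The three steps (singleton dependents, removal of extraneous determinant attributes, removal of implied dependencies) are a standard textbook procedure and not necessarily the exact method this course intends; the function names and the sample dependencies are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (determinant, dependent) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def irreducible_cover(fds):
    # Step 1: give every dependency a single-attribute dependent.
    cover = [(frozenset(lhs), frozenset({a})) for lhs, rhs in fds for a in rhs]
    # Step 2: drop determinant attributes that are not needed.
    reduced = []
    for lhs, rhs in cover:
        for attr in sorted(lhs):
            if len(lhs) > 1 and rhs <= closure(lhs - {attr}, cover):
                lhs = lhs - {attr}
        reduced.append((lhs, rhs))
    # Step 3: drop dependencies that the remaining ones already imply.
    result = list(dict.fromkeys(reduced))
    for fd in list(result):
        rest = [other for other in result if other != fd]
        if fd[1] <= closure(fd[0], rest):
            result = rest
    return result

# Order_ID -> City is implied by the other three through transitivity.
fds = [
    ({"Order_ID"}, {"Customer_ID"}),
    ({"Customer_ID"}, {"Zip_Code"}),
    ({"Zip_Code"}, {"City"}),
    ({"Order_ID"}, {"City"}),
]
for lhs, rhs in irreducible_cover(fds):
    print(sorted(lhs), "->", sorted(rhs))
```

Only the three direct dependencies remain; the designer does not need to enforce Order_ID → City separately.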
II. Update Anomalies
Although we are not given the source of the ORDERS database, it is typical of the "one-table" databases some
novice database users might develop with a spreadsheet, or similar non-RDBMS software. A quick glance at the
data shows that we have a serious problem of redundancy. For example, every time customer C00006 (or any
other customer) places an order, we need to repeat their Last_Name, First_Name, Age, Zip_Code, City, and
State. This not only wastes disk space but can easily lead to data entry errors, resulting in data inconsistency
existing in our database.
A more serious problem is that of the update anomalies that exist. For instance, we cannot add a new
customer until that customer places an order. This unfortunate situation is called an insertion anomaly.
If we decide to remove our orders from the database at the end of every year and archive them to a file, we
would probably also be losing customer data in the process. This would not be a desirable situation, because
our existing customers are likely to place future orders. This problem is called a deletion anomaly.
A final problem is the modification anomalies that exist when we need to change a customer's Last_Name,
First_Name, Age, Zip_Code, City, or State. Each tuple containing this data for the customer must be updated or
we will have another situation with data inconsistency.
The basic problem that we have with the ORDERS relation is that it contains too much data together. Relations
should contain data pertaining to only a single theme. Stated another way, "each database fact should be kept
in a single place in the database." But in our ORDERS relation we have lumped many unrelated attributes
together in one place. We have data regarding orders and data regarding customers all in the same relation.
We would have done better to have kept order data in one relation and customer data in another, and this likely
would have been accomplished had we used a technique such as entity/relationship diagramming of end-user
requirements, as discussed in module 3. Intuitively we know that we have serious problems with the ORDERS
relation. In the next section of this module we will discuss the formal techniques that exist to alleviate those
problems.
III. First Normal Form (1NF)
As mentioned above, normalization is a formal process used to reduce redundancy in a relation while still
maintaining the relation's functional dependencies. It involves decomposing a relation into two or more relations
using specific guidelines to achieve normal forms.
Normal forms range from the least restrictive first normal form (1NF) through second normal form (2NF),
third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form
(5NF). The last two normal forms are based on the more general concepts of multi-valued dependencies and
join dependencies, respectively. This section discusses 1NF. Later sections in this module will discuss the higher
normal forms.
First normal form requires that all values in the relation be single, atomic values. Let us clarify this status by
looking at a structure that is not even in 1NF. This sort of structure can be shown by the modified,
unnormalized ORDERS structure shown in figure 5.3.
Figure 5.3
Notice in this structure that we have repeating groups for the Order_ID and Date_Placed attributes. Every
time a customer places a new order, it would add another Order_ID and Date_Placed set of values to their
"tuple." Such an unnormalized structure does not even qualify as a relation, since all relations are normalized
by definition. Relations are only allowed to have atomic values at each tuple/attribute intersection. Since 1NF
requires only atomic values, with no repeating groups, all relations are in at least 1NF.
The current interpretation of the relational model allows nested relations to exist inside other relations. For
example, we could create a nested relation consisting of just the attributes Order_ID and Date_Placed in the
first tuple of the structure in figure 5.3. This modified structure with a nested relation of two "attributes" and
two "tuples," although technically a relation, would not be in 1NF, because it doesn't contain atomic values. It is
possible to create a set of 1NF relations from a nested relation by unnesting the inner relations.
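Unnesting a repeating group can be sketched in Python as shown here. The record layout, the `Orders` field name, and the sample values are hypothetical stand-ins for the structure of figure 5.3, which is not reproduced.

```python
# Hypothetical unnormalized structure: one record per customer, with a
# repeating group of (Order_ID, Date_Placed) pairs.
unnormalized = [
    {"Customer_ID": "C1", "Last_Name": "Smith",
     "Orders": [("O1", "2024-01-05"), ("O2", "2024-02-11")]},
    {"Customer_ID": "C2", "Last_Name": "Lee",
     "Orders": [("O3", "2024-01-20")]},
]

def unnest(records):
    """Emit one tuple per order so that every value is atomic."""
    flat = []
    for rec in records:
        for order_id, date_placed in rec["Orders"]:
            flat.append({
                "Order_ID": order_id,
                "Date_Placed": date_placed,
                "Customer_ID": rec["Customer_ID"],
                "Last_Name": rec["Last_Name"],
            })
    return flat

for row in unnest(unnormalized):
    print(row)
```

The result has three tuples, one per order, and the customer data is repeated in each, which is exactly the redundancy the higher normal forms address.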
IV. Second Normal Form (2NF)
A relation is in second normal form (2NF) if it's in 1NF and there are no partial dependencies on the
primary key. In other words, all non-key (i.e., not part of the primary key) attributes must be dependent on the
full primary key.
We can see that the ORDERS relation of figure 5.1 is already in 2NF since there are no partial dependencies. All
attributes are dependent on the primary key of Order_ID. This can be seen clearly on the functional dependency
diagrams of figure 5.2. In fact, any relation with a single-attribute primary key is automatically in 2NF.
If we had another relation, such as RATINGS, as shown in figure 5.4, we would have a situation with partial
dependencies.
Figure 5.4
The primary key of relation RATINGS is the composite of attributes SSN and Job, which is why an outer box is
drawn around both attributes in the functional dependency diagram of relation RATINGS in figure 5.5. Attribute
Skill_Level is dependent on both SSN and Job, the full primary key. This is shown by the arrow emanating from
the outer box containing the primary key attributes to attribute Skill_Level. But attributes Last_Name and
First_Name are dependent only on attribute SSN, part of the primary key, as shown by the arrow emanating
only from SSN in figure 5.5. Thus SSN → {Last_Name, First_Name} is a partial dependency.
Figure 5.5
Besides exhibiting some redundancy, relation RATINGS also has some update anomalies. If we need to put a
new EMPLOYEE into the relation, we can't do so until the employee has at least one JOB. This is an insertion
anomaly. Likewise we have a deletion anomaly because if we delete a JOB from the relation we might also
delete an employee. A modification anomaly exists because of the partial dependency of Last_Name and
First_Name on SSN. If we update a specific person's (i.e., SSN's) Last_Name or First_Name we need to make
sure we update all tuples with that SSN or we will have data inconsistency.
To remove these update anomalies we need to convert relation RATINGS into a set of 2NF relations by
performing a non-loss decomposition of the original relation into a set of two smaller (degree-wise) relations,
JOB_RATINGS and EMPLOYEES (as shown in figure 5.6). The decomposition is done via the relational algebra
operation of projection. The original functional dependencies of the RATINGS relation are maintained in the
two new relations. Notice also that these two relations have a one-to-many relationship.
Figure 5.6A
Figure 5.6B
In the new JOB_RATINGS and EMPLOYEES relations all non-key attributes are dependent on the full primary
key. We have removed the update anomaly problems with employees, since they are now kept in the
EMPLOYEES relation, whether or not they have a JOB. Updating an employee's Last_Name or First_Name is now
done in a single place in the database. Also notice that since EMPLOYEES is a relation, and relations do not have
duplicate tuples, we have only two tuples (versus the three in the original relation).
We can assure ourselves that this was a non-loss decomposition by performing a natural join of the
JOB_RATINGS and EMPLOYEES relations to return to the original RATINGS relation. The relational algebra's
natural join operation is the reversibility operation for normalization.
If we had performed a decomposition by projection, but not followed the functional dependencies of the
RATINGS relation (for example SSN and Skill_Level for one relation), we would have found that when
performing the natural join we would have additional, spurious tuples, that were not part of the original
relation. The existence of spurious tuples is an indicator of a lossy decomposition.
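The difference between a non-loss and a lossy decomposition can be demonstrated with a small sketch. The helper functions and the RATINGS tuples are hypothetical (figure 5.4 is not reproduced); the sample is chosen so that two employees share three tuples, as in the text.

```python
def project(rows, attrs):
    """Projection: keep only attrs and drop duplicate tuples."""
    distinct = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(distinct)]

def natural_join(r, s):
    """Natural join on every attribute name the two relations share."""
    common = [a for a in r[0] if a in s[0]]
    return [{**x, **y} for x in r for y in s
            if all(x[a] == y[a] for a in common)]

def same_tuples(r, s):
    """Compare two relations as sets of tuples, ignoring order."""
    def as_set(rows):
        return {frozenset(row.items()) for row in rows}
    return as_set(r) == as_set(s)

# Hypothetical RATINGS data.
ratings = [
    {"SSN": "111", "Job": "Welder", "Skill_Level": 3,
     "Last_Name": "Smith", "First_Name": "Ann"},
    {"SSN": "111", "Job": "Painter", "Skill_Level": 1,
     "Last_Name": "Smith", "First_Name": "Ann"},
    {"SSN": "222", "Job": "Welder", "Skill_Level": 2,
     "Last_Name": "Lee", "First_Name": "Bo"},
]

# Decomposition that follows the functional dependencies.
job_ratings = project(ratings, ["SSN", "Job", "Skill_Level"])
employees = project(ratings, ["SSN", "Last_Name", "First_Name"])
rejoined = natural_join(job_ratings, employees)

# Decomposition that ignores them (SSN and Skill_Level in one relation).
skills = project(ratings, ["SSN", "Skill_Level"])
others = project(ratings, ["SSN", "Job", "Last_Name", "First_Name"])
lossy = natural_join(skills, others)

print(len(employees), same_tuples(rejoined, ratings))  # 2 True
print(len(lossy))                                       # 5
```

The second join returns five tuples instead of three: employee 111 has two skill levels and two jobs, so the join pairs every skill level with every job and produces two spurious tuples.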
Another situation where we might want to decompose a relation into two smaller relations is where we would
have a lot of nulls. One example might be for a relation with data about the states, or countries of the world,
including ocean shoreline and shipping data. For many land-locked states or countries we would have no
shoreline data. If shoreline data were a major part of the relation then the nulls would be significant. But by
creating a separate "shoreline data" relation, with a one-to-one relationship to the state or country relation, we
would save disk space. Note that this is technically not a normalization issue but rather a relation decomposition
issue.
V. Third Normal Form (3NF)
Our ORDERS relation in figures 5.1 and 5.2, although in 2NF, still has some update anomalies. For example, we
cannot insert, delete, or update customer data without possibly affecting the order data. This is because
Last_Name, First_Name, Age, Zip_Code, City, and State depend on the primary key only transitively, through
Customer_ID. A transitive dependency is a functional dependency between sets of non-key attributes. We also
have a transitive dependency of City and State on Zip_Code. Note that neither Customer_ID nor Zip_Code is part
of the primary key of ORDERS.
By removing transitive dependencies, and assuming we have no partial dependencies, we create relations in
third normal form (3NF). This is done for our ORDERS relation by first creating the ORDERS2 and CUSTOMERS
relations of figure 5.7. The ORDERS2 relation no longer has a transitive dependency and is now in 3NF.
Figure 5.7A
Figure 5.7B
The CUSTOMERS relation still has the transitive dependency with Zip_Code, so we need to create new relations
CUSTOMERS2 and ZIP_CODES as shown in figure 5.8.
Figure 5.8A
Figure 5.8B
From the single ORDERS relation we now have three 3NF relations of ORDERS2, CUSTOMERS2, and ZIP_CODES.
If we were to perform natural joins of these three relations we would end up with our original ORDERS relation.
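At the schema level, the decomposition described above amounts to giving each determinant its own relation. The sketch below is illustrative: the function names are hypothetical, and the attribute lists are inferred from the functional dependencies in the text rather than read from figures 5.7 and 5.8.

```python
PRIMARY_KEY = "Order_ID"

# Determinant -> dependents, as listed in the text for ORDERS.
FDS = {
    "Order_ID": ["Customer_ID", "Date_Placed"],
    "Customer_ID": ["Last_Name", "First_Name", "Age", "Zip_Code"],
    "Zip_Code": ["City", "State"],
}

def transitive_determinants(primary_key, fds):
    """Determinants other than the primary key signal transitive dependencies."""
    return [det for det in fds if det != primary_key]

def decompose(fds):
    """One relation per determinant: the determinant becomes its key."""
    return {det: [det] + deps for det, deps in fds.items()}

print(transitive_determinants(PRIMARY_KEY, FDS))
for key, attrs in decompose(FDS).items():
    print(key, "->", attrs)
```

The three resulting attribute lists correspond to ORDERS2, CUSTOMERS2, and ZIP_CODES, and the shared attributes Customer_ID and Zip_Code are what the natural joins use to rebuild the original relation.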
In practice, for most online transaction processing (OLTP) applications, 3NF relations are the goal. But
since joining relations (tables really) is an expensive operation in a relational database, if we have applications
such as decision support systems (DSSs), where we might perform many ad hoc queries against tables, we
might wish to denormalize our 3NF tables back into 2NF tables for performance considerations.
How does normalization, as discussed in this module, compare to relational database design, using techniques
such as entity/relationship diagramming? Remember that when we used ERDs, we were working from user
specifications, not sets of data, so we were trying to define real-world entities for an application. In the case of
normalization, as in this module, we started with the data, but the problem was that too much data was stored
in one place. Although approached differently, both processes result in creating a desirable relational database
design of 3NF tables that has minimal redundancy by keeping facts in one place only and removing update
anomalies. Note that the process of normalization to 3NF can also be used after a database has initially been
designed via ERDs. For example, we may not have initially realized the Zip_Code to City and State transitive
dependencies in our ERDs.
VI. Boyce-Codd Normal Form (BCNF)
As stated in the last section, in practice 3NF is usually sufficient to minimize redundancy. Some relation
situations, however, have anomalies that require even higher normalization than 3NF. As originally defined, 3NF
did not address these cases, so the Boyce-Codd normal form (BCNF) was developed.
For example, consider the relation DB_DATA as shown in figure 5.9.
Figure 5.9
This relation has two overlapping, composite candidate keys as expressed by the following functional
dependencies:
{Student, Course} → Professor
{Student, Professor} → Course
We also have the functional dependency:
Professor → Course
Relation DB_DATA is in 3NF, but it still has update anomalies. For example, if we delete the fact that Jones is
studying IFSM 420, we delete Professor Anyanso. We could remove the anomaly by forming the two relations,
STUDENT_PROF and PROF_COURSE, as shown in figure 5.10.
Figure 5.10A
Figure 5.10B
Although these two new relations remove the previous deletion anomaly, unfortunately they cannot be updated
independently. For example, we cannot insert Jones and Pickering into the STUDENT_PROF relation because
Jones is already taking IFSM 420 from Professor Anyanso and this would violate the {Student, Course} →
Professor functional dependency.
A relation is in BCNF when all the determinants are candidate keys. Notice that the candidate key of
STUDENT_PROF is the composite {Student, Professor} and that, given Professor → Course, the candidate key of
PROF_COURSE is Professor alone. In fact, these are the only determinants of each relation.
The requirement that all determinants be candidate keys is a stronger (i.e., more restrictive) definition than the
original 3NF proposed by Codd. As noted above, BCNF may be required when there are two or more candidate
keys and these candidate keys are composite and they overlap—an anomalous situation. Relations that are in
3NF but not in BCNF are rare in practice. Frequently, relations requiring normalization to BCNF result from poor
database design.
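The BCNF test (every determinant must be a candidate key) can be sketched with the attribute closure. This is a minimal illustration with hypothetical function names; it flags any nontrivial dependency whose determinant does not determine all attributes of the relation.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (determinant, dependent) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(attrs, fds):
    """Nontrivial dependencies whose determinant is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != attrs]

DB_DATA = {"Student", "Course", "Professor"}
DB_DATA_FDS = [
    ({"Student", "Course"}, {"Professor"}),
    ({"Student", "Professor"}, {"Course"}),
    ({"Professor"}, {"Course"}),
]

print(bcnf_violations(DB_DATA, DB_DATA_FDS))
print(bcnf_violations({"Professor", "Course"}, [({"Professor"}, {"Course"})]))
print(bcnf_violations({"Student", "Professor"}, []))
```

Only Professor → Course is reported for DB_DATA, because Professor alone does not determine Student. The two decomposed relations report no violations.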
VII. Fourth Normal Form (4NF)
Some 3NF relations may have multivalued dependencies of the form:
X →→ {A, B}
This means that each value of attribute set X may have various values of the combination of attribute sets A
and B. For example a Course may have various Professors and various Texts. We cannot via functional
dependency state which unique Professor and Text a course might have, because many combinations are
legitimate.
To remove multivalued dependencies, we need to decompose a relation into multiple fourth normal form
(4NF) relations. In the above example we would have two relations with attributes (and primary keys) of X and
A, and X and B.
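The Course example can be sketched to show what a multivalued dependency requires of the data and why the two-relation decomposition loses nothing. The function name and the course, professor, and text values are hypothetical.

```python
from itertools import product

def mvd_holds(rows):
    """rows is a set of (x, a, b) tuples; True if X ->> A | B holds."""
    by_x = {}
    for x, a, b in rows:
        a_vals, b_vals, pairs = by_x.setdefault(x, (set(), set(), set()))
        a_vals.add(a)
        b_vals.add(b)
        pairs.add((a, b))
    # For each X value, every A value must appear with every B value.
    return all(pairs == set(product(a_vals, b_vals))
               for a_vals, b_vals, pairs in by_x.values())

# Hypothetical data: (Course, Professor, Text).
course_data = {
    ("DB101", "Kim", "Text A"),
    ("DB101", "Kim", "Text B"),
    ("DB101", "Ortiz", "Text A"),
    ("DB101", "Ortiz", "Text B"),
}

# The two 4NF relations, obtained by projection.
course_prof = {(c, p) for c, p, _ in course_data}
course_text = {(c, t) for c, _, t in course_data}

# Natural join of the two projections on Course.
rejoined = {(c, p, t) for c, p in course_prof
            for c2, t in course_text if c == c2}

print(mvd_holds(course_data))      # True
print(rejoined == course_data)     # True
```

The single relation needs four tuples to record two professors and two texts, while the two 4NF relations need two tuples each.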
A multivalued dependency is actually a generalization of the functional dependency in the lower normal forms.
Relations with multivalued dependencies are rare in practice, and you will not be asked to perform these
normalizations in this course.
VIII. Fifth Normal Form (5NF)
An even more anomalous type of relation exhibits join dependencies, which are a generalization of
multivalued dependencies, and require decomposition to achieve fifth normal form (5NF). Note in the 5NF
case that three or more relations are needed in the decomposition process rather than just two, as has
previously been the case with the lower forms.
Fifth normal form is the ultimate normal form for decompositions based on the relational algebra project
operation and is guaranteed to be free of anomalies. Fifth normal form is also referred to as projection-join
normal form.
IX. Normalization Case Study
This section is intended to reinforce the concepts of normalization from 1NF through 3NF by presenting a
specific example of a relational database table (the relational database terms table, row, and column will be
used in this section) with redundancy and update anomalies.
Starting with the table in figure 5.11 for an equipment-rental application, we see an application showing several
rental transactions for pieces of equipment rented by customers, including when the equipment was rented,
when it was returned, what it cost, and which salesman handled the transaction. (Due to the large width of this
single table, it is displayed below in two parts, A and B.)
Figure 5.11A
Figure 5.11B
Note that the Equipment column actually appears only once in this table. It is repeated in figure 5.11B for
clarity only. You may notice that this table has a lot of redundancy. The pieces of equipment and customers
are repeated several times even in this small sample of data. Each time a piece of equipment is rented, the
information about that equipment is repeated. Also, each time a customer rents a piece of equipment, all the
information about the customer is repeated. Information about the salesmen is also redundant.
What is the major problem with this table's design? Very simply, too much information is grouped together. It is
unlikely that an experienced database developer would design a table like this for a relational database, but it is
instructive for a normalization exercise. For our example, all we want is a good design; we already have a set of
data, so in light of the discussion on relational database design strategies earlier in this module, normalization is
probably the best way for us to proceed.
We will want to draw a functional dependency diagram of this application, but before we can draw it, we have to
determine the primary key for the RENTALS table (this name was chosen because the primary information
being stored for this application concerns rentals).
In thinking about this information, you may have concluded that a specific tool, rented by a specific customer,
on a specific day is probably the unique identifier for the table. This conclusion, however, assumes a business
rule that the same tool cannot be rented by the same customer on the same day more than once. Note that we
draw a box around the three attributes (i.e., columns) comprising the primary key for this table. The functional
dependency diagram for the RENTALS table is shown in figure 5.12.
Figure 5.12
Let's review some of the functional dependency relationships (stated as English sentences here, versus as
functional dependency expressions):
We need only to know the serial number to know the piece of equipment and the tool category.
If we know the piece of equipment, we know the category, but not vice versa.
If we know the serial number, customer account number, and date out, we know the salesman's ID.
If we know the salesman's ID, we know the salesman's name, but not vice versa.
We know the sales type if we know the salesman's ID or the sales position, but we do not know the
sales type if we know only the salesman's name.
What is the highest normal form of this table? Given that there are no repeating groups, it is certainly in 1NF.
Is it also in 2NF? Note that not all non-key attributes are dependent on the full primary key. DATE_IN,
RETN_COND, COST, and SALES_ID are dependent on the full primary key, but EQUIPMENT, NAME, etc., clearly
are not. This fact tells us that the table is only in 1NF.
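The reasoning about partial and transitive dependencies in RENTALS can be sketched as a simple classification of determinants. This is an assumption-laden illustration: DATE_OUT is an assumed column name for the "date out" key attribute, only columns named in the text are used, and the dependency list is a simplified reading of figure 5.12, which is not reproduced.

```python
# Assumed composite primary key of RENTALS (DATE_OUT is a guessed name).
KEY = frozenset({"SERIAL_NO", "ACCT_NO", "DATE_OUT"})

# Simplified dependencies, restricted to columns named in the text.
FDS = [
    (KEY, {"DATE_IN", "RETN_COND", "COST", "SALES_ID"}),
    (frozenset({"SERIAL_NO"}), {"EQUIPMENT"}),
    (frozenset({"EQUIPMENT"}), {"CATEGORY"}),
    (frozenset({"ACCT_NO"}), {"NAME"}),
    (frozenset({"SALES_ID"}), {"SALES_POS"}),
]

def classify(key, fds):
    """Split determinants into partial (part of key) and transitive (non-key)."""
    partial = [lhs for lhs, _ in fds if lhs < key]          # proper subset of key
    transitive = [lhs for lhs, _ in fds if not (lhs & key)]  # no key attribute
    return partial, transitive

partial, transitive = classify(KEY, FDS)
print("partial:", [sorted(d) for d in partial])
print("transitive:", [sorted(d) for d in transitive])
```

The partial determinants (SERIAL_NO and ACCT_NO) are why the table is not in 2NF; the transitive ones (EQUIPMENT and SALES_ID) are what remains to be removed on the way to 3NF.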
Do you recognize any problems with this table as it appears?
Several update anomalies exist:
If we update a customer's address, we need to make sure that all occurrences of the address are
changed, to be consistent (the rows of the RENTALS table shown may only be a small part of an
extremely large table).
If we delete a piece of equipment, we also delete all customer information and salesman information
involved in the rental.
If we add a piece of equipment, we cannot enter that information into the table until the equipment is
rented by a customer (assuming that customer information cannot be NULL).
If we add a salesman, we cannot enter his information into the table until he rents something out
(assuming also that customers and rental information cannot be NULL).
Clearly, we need to remedy the redundancy in this table by normalizing the table to remove the update
anomalies above.
To normalize from 1NF to 2NF, we need to take non-loss decompositions. We do this by taking virtual
"horizontal slices" through the current primary key. For example, we "cut out" some of the key attributes from
the RENTALS table. We can see all the dependencies from the SERIAL_NO and ACCT_NO attributes, so they
should be "cut out." Specifically, we should create another relation (i.e., table) that has only SERIAL_NO as the
primary key. The ACCT_NO attribute is handled similarly. As shown in figure 5.13, this then forms the set of
relations.
Figure 5.13
Note that the tables RENTALS, EQUIPMENT, and CUSTOMERS are all in 2NF because all their non-key attributes
are fully dependent on all attributes of the primary key. EQUIPMENT and CUSTOMERS have single-attribute
primary keys, so these cases are trivial.
To determine whether these three tables are in 3NF, we ask whether all of the non-key attributes are fully
dependent on the primary key, and only the primary key. In other words, have all the transitive dependencies
been removed? The answer is no. In RENTALS, for example, the primary key determines the SALES_ID and the
SALES_ID determines the SALES_POS, so the primary key determines the SALES_POS only transitively. Similar
situations can be seen for the non-key attributes in tables EQUIPMENT and CUSTOMERS.
To convert from 2NF to 3NF, we take virtual "vertical slices" through the non-key attributes. For example, we
separate attribute SALES_POS from table RENTALS, CATEGORY from table EQUIPMENT, and CITY and STATE
from table CUSTOMERS, yielding the set of tables shown in figure 5.14.
Figure 5.14
Figure 5.14 shows seven tables that are all in 3NF. For several reasons, this is a better design than the single
RENTALS table or the 2NF tables. The redundancy is removed as well as the troublesome update anomalies:
Each fact is stored in only one place, thus requiring less disk storage for the database (note the data of
the seven tables in figure 5.15).
If we add a customer we need only add him or her to the CUSTOMERS table.
If a piece of equipment is deleted, only one table is affected.
You can see that no information is lost with this design because the projections can be reversed by joining
operations of the tables to yield the original RENTALS table.
Figure 5.15A
Figure 5.15B
Figure 5.15C
Figure 5.15D
Figure 5.15E
Figure 5.15F
Figure 5.15G