Module 5: Normalization

Overview

In this module, we introduce the concept of functional dependency in the relational model and then examine the important process of normalization. Through normalization we will see how a set of formal techniques is used to decompose complex relations that have redundancy and update anomalies into a set of smaller, less redundant relations, without losing any information.

Objectives

After completing this module, you should be able to:

describe the concept of functional dependency in a relation
discuss the importance of Armstrong's rules and closure sets of functional dependencies and attributes
explain the purpose of finding a superkey for a relation
describe how non-loss decompositions of a relation can be accomplished using projections
explain what is meant by reversibility of decompositions using natural joins
explain the requirements for a relation to be in first normal form (1NF), second normal form (2NF) and third normal form (3NF)
explain why normalization beyond 3NF may sometimes be required, and briefly describe Boyce-Codd normal form (BCNF), fourth normal form (4NF) and fifth normal form (5NF)
decompose a relation into a set of 3NF relations

Commentary

I. Functional Dependency
II. Update Anomalies
III. First Normal Form (1NF)
IV. Second Normal Form (2NF)
V. Third Normal Form (3NF)
VI. Boyce-Codd Normal Form (BCNF)
VII. Fourth Normal Form (4NF)
VIII. Fifth Normal Form (5NF)
IX. Normalization Case Study

I. Functional Dependency

Normalization is a formal process used with relational databases to remove redundancy from individual relations and alleviate the serious problems this redundancy causes. In this module we will be using the formal relational model terminology of relation, tuple, and attribute, although the concepts of normalization apply equally well to relational database tables, rows, and columns.
In order to understand what makes a relation "well formed" (i.e., normalized), it is first necessary to discuss the concept of functional dependency. A functional dependency is a relationship between sets of attributes in a relation that exists for all time. Assume that we have a relation ORDERS (as shown in figure 5.1, below) to record the orders that our customers place in an organization.

Figure 5.1

In this relation, attribute Order_ID is the primary key, uniquely identifying each of the tuples, with each tuple containing data for one order. We can see that for any value of Order_ID we are always able to determine the Customer_ID. Likewise, if we are given the value for a Customer_ID we are always able to determine the Last_Name and First_Name of the customer. We can express these two functional dependencies as:

Order_ID → Customer_ID
Customer_ID → {Last_Name, First_Name}

In the first functional dependency we can see that Order_ID determines the Customer_ID. We record this relationship by saying that Order_ID is the determinant and Customer_ID is the dependent. In the second functional dependency, the dependent is a set of two attributes. When either the determinant or dependent is a single attribute (i.e., a singleton set), we can drop the braces around the attribute. Note that if we know the Customer_ID we cannot necessarily determine the Order_ID, so the relationship Customer_ID → Order_ID is not a functional dependency.

In these two relationships, and in others that exist in the ORDERS relation, knowing the values of the set of attributes on the left-hand side of the relationship functionally determines the values of the set of attributes on the right-hand side. A functional dependency is thus a many-to-one relationship between attribute sets in a single relation. Many different values of the determinant set may share the same value of the dependent set, but each value of the determinant set always yields exactly one value of the dependent set.
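The many-to-one test described above can be sketched in a few lines of Python. This is only an illustration: the function name `fd_holds` and the sample tuples are hypothetical and are not the data of figure 5.1. Keep in mind that a functional dependency must hold for all time, so a sample of tuples can refute a dependency but can never prove it.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if no two rows agree on the determinant but differ on the dependent."""
    seen = {}
    for row in rows:
        lhs = tuple(row[a] for a in determinant)
        rhs = tuple(row[a] for a in dependent)
        # Remember the first dependent value seen for each determinant value.
        if seen.setdefault(lhs, rhs) != rhs:
            return False
    return True


# Hypothetical sample tuples (assumed for illustration, not taken from figure 5.1).
orders = [
    {"Order_ID": "O1", "Customer_ID": "C1", "Last_Name": "Smith", "First_Name": "Ann"},
    {"Order_ID": "O2", "Customer_ID": "C1", "Last_Name": "Smith", "First_Name": "Ann"},
    {"Order_ID": "O3", "Customer_ID": "C2", "Last_Name": "Jones", "First_Name": "Bob"},
]

order_to_customer = fd_holds(orders, ["Order_ID"], ["Customer_ID"])
customer_to_name = fd_holds(orders, ["Customer_ID"], ["Last_Name", "First_Name"])
customer_to_order = fd_holds(orders, ["Customer_ID"], ["Order_ID"])

print(order_to_customer, customer_to_name, customer_to_order)
```

In this sample the first two checks succeed, while Customer_ID → Order_ID fails because customer C1 appears with two different orders.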
We can show the functional dependencies that exist in the ORDERS relation via a functional dependency diagram. Two examples of functional dependency diagrams are shown in figure 5.2.

Figure 5.2A
Figure 5.2B

These two diagrams vary only as to whether the attributes in the relation are connected in a row, or not touching. In both diagramming formats, however, the lines emanate from determinant attribute(s) and the arrows point towards dependent attribute(s).

In the remaining discussion of this section, we will see how the theoretical aspects of functional dependency can allow us to find two practical items: a primary key for a relation, and a minimal set of functional dependencies.

The two functional dependencies we listed earlier for the ORDERS relation are just a few of the many functional dependencies that exist for the relation. Among the others are:

Order_ID → Date_Placed
Customer_ID → Age
Customer_ID → Zip_Code
Zip_Code → {City, State}

The entire set of functional dependencies that exist for a relation is called the closure set of functional dependencies. In order to determine the closure set of functional dependencies of a relation from some initial set, we can apply the set of six well-known Armstrong's inference rules to infer additional functional dependencies from those we already know to exist. One example of Armstrong's inference rules is the transitive rule, which says that if one set of attributes determines a second set, and this second set determines a third set, then the first set determines the third set.

If we start with an initial set of attributes for a relation, and an initial set of functional dependencies, we can iteratively apply Armstrong's inference rules until the closure set of attributes under those functional dependencies is reached. As each additional functional dependency is applied, our set of attributes can grow into a larger subset of the entire set of attributes of the relation.
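The iteration just described can be sketched with the standard attribute-closure algorithm, shown here in Python. The function names are hypothetical, and the attribute list is assumed to be the nine ORDERS attributes named in the text (figure 5.1 may contain others). The sketch also includes the superkey test and the reduction to a candidate key that the discussion turns to next.

```python
def attribute_closure(attributes, fds):
    """Grow the attribute set by applying the dependencies until nothing more can be added."""
    closure = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know every determinant attribute, we also know the dependents.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def reduce_to_candidate_key(superkey, all_attributes, fds):
    """Drop attributes one at a time while the remainder still determines everything."""
    key = set(superkey)
    for attribute in sorted(superkey):
        trial = key - {attribute}
        if trial and attribute_closure(trial, fds) == set(all_attributes):
            key = trial
    return key


# Attributes and dependencies of ORDERS as named in the text (assumed to be complete).
ORDERS_ATTRS = {"Order_ID", "Date_Placed", "Customer_ID", "Last_Name",
                "First_Name", "Age", "Zip_Code", "City", "State"}
ORDERS_FDS = [
    ({"Order_ID"}, {"Customer_ID"}),
    ({"Order_ID"}, {"Date_Placed"}),
    ({"Customer_ID"}, {"Last_Name", "First_Name"}),
    ({"Customer_ID"}, {"Age"}),
    ({"Customer_ID"}, {"Zip_Code"}),
    ({"Zip_Code"}, {"City", "State"}),
]

order_closure = attribute_closure({"Order_ID"}, ORDERS_FDS)
customer_closure = attribute_closure({"Customer_ID"}, ORDERS_FDS)
candidate_key = reduce_to_candidate_key(
    {"Order_ID", "Customer_ID", "Zip_Code"}, ORDERS_ATTRS, ORDERS_FDS)

print(order_closure == ORDERS_ATTRS)  # Order_ID reaches every attribute
print(sorted(customer_closure))       # Customer_ID does not reach Order_ID or Date_Placed
print(candidate_key)
```

Under these assumptions, the closure of {Order_ID} covers the whole relation, so Order_ID is a superkey, and the larger superkey {Order_ID, Customer_ID, Zip_Code} reduces to {Order_ID}.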
The process stops when no additional attributes can be added, thus achieving the closure set of attributes. If the closure set of attributes contains all attributes of the relation, then the initial set of attributes is a superkey. We can then reduce the superkey by removing attributes not needed to determine uniqueness, until we are left with a minimal set of attributes. At this point we have a candidate key, and it can be chosen as the primary key of the relation.

Also, once we have determined the closure set of functional dependencies for a relation, we can use Armstrong's inference rules and other functional dependency rules to derive an irreducible cover set of functional dependencies. This set is the minimal set of functional dependencies the database designer needs to enforce in the database to ensure that the total integrity of the relation has been maintained.

II. Update Anomalies

Although we are not given the source of the ORDERS database, it is typical of the "one-table" databases some novice database users might develop with a spreadsheet, or similar non-RDBMS software. A quick glance at the data shows that we have a serious problem of redundancy. For example, every time customer C00006 (or any other customer) places an order, we need to repeat their Last_Name, First_Name, Age, Zip_Code, City, and State. This not only wastes disk space but can easily lead to data entry errors, resulting in inconsistent data in our database.

A more serious problem is that of the update anomalies that exist. For instance, we cannot add a new customer until that customer places an order. This unfortunate situation is called an insertion anomaly. If we decide to remove our orders from the database at the end of every year and archive them to a file, we would probably also be losing customer data in the process. This would not be a desirable situation, because our existing customers are likely to place future orders. This problem is called a deletion anomaly. A final problem is the modification anomaly that exists when we need to change a customer's Last_Name, First_Name, Age, Zip_Code, City, or State. Each tuple containing this data for the customer must be updated, or the data will again become inconsistent.

The basic problem that we have with the ORDERS relation is that it contains too much data together. Relations should contain data pertaining to only a single theme. Stated another way, "each database fact should be kept in a single place in the database." But in our ORDERS relation we have lumped many unrelated attributes together in one place. We have data regarding orders and data regarding customers all in the same relation. We would have done better to have kept order data in one relation and customer data in another, and this likely would have been accomplished had we used a technique such as entity/relationship diagramming of end-user requirements, as discussed in module 3.

Intuitively we know that we have serious problems with the ORDERS relation. In the next section of this module we will discuss the formal techniques that exist to alleviate those problems.

III. First Normal Form (1NF)

As mentioned above, normalization is a formal process used to reduce redundancy in a relation while still maintaining the relation's functional dependencies. It involves decomposing a relation into two or more relations using specific guidelines to achieve normal forms. Normal forms range from the least restrictive first normal form (1NF) through second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF). The last two normal forms are based on the more general concepts of multivalued dependencies and join dependencies, respectively. This section discusses 1NF. Later sections in this module will discuss the higher normal forms.
First normal form requires that all values in the relation be single, atomic values. Let us clarify this requirement by looking at a structure that is not even in 1NF, such as the modified, unnormalized ORDERS structure shown in figure 5.3.

Figure 5.3

Notice in this structure that we have repeating groups for the Order_ID and Date_Placed attributes. Every time a customer places a new order, another Order_ID and Date_Placed set of values is added to that customer's "tuple." Such an unnormalized structure does not even qualify as a relation, since all relations are normalized by definition. Relations are only allowed to have atomic values at each tuple/attribute intersection. Since 1NF requires only atomic values, with no repeating groups, all relations are in at least 1NF.

The current interpretation of the relational model allows nested relations to exist inside other relations. For example, we could create a nested relation consisting of just the attributes Order_ID and Date_Placed in the first tuple of the structure in figure 5.3. This modified structure with a nested relation of two "attributes" and two "tuples," although technically a relation, would not be in 1NF, because it doesn't contain atomic values. It is possible to create a set of 1NF relations from a nested relation by unnesting the inner relations.

IV. Second Normal Form (2NF)

A relation is in second normal form (2NF) if it is in 1NF and there are no partial dependencies on the primary key. In other words, all non-key attributes (i.e., attributes that are not part of the primary key) must be dependent on the full primary key.

We can see that the ORDERS relation of figure 5.1 is already in 2NF since there are no partial dependencies. All attributes are dependent on the primary key of Order_ID. This can be seen clearly on the functional dependency diagrams of figure 5.2. In fact, any relation with a single-attribute primary key is automatically in 2NF.
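The 2NF test can be sketched as a search for dependencies whose determinant is only part of the primary key. The Python below is a minimal illustration with hypothetical names; it inspects only the dependencies it is given rather than their full closure. The ORDERS dependencies come from section I, and the RATINGS dependencies are those of the relation introduced just below.

```python
def partial_dependencies(primary_key, fds):
    """List dependencies in which a proper subset of the key determines non-key attributes."""
    key = set(primary_key)
    found = []
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        non_key = rhs - key
        # A non-empty proper subset of the key on the left signals a partial dependency.
        if lhs and lhs < key and non_key:
            found.append((sorted(lhs), sorted(non_key)))
    return found


# ORDERS has a single-attribute primary key, so no proper subset of it can be a determinant.
orders_fds = [
    ({"Order_ID"}, {"Customer_ID", "Date_Placed"}),
    ({"Customer_ID"}, {"Last_Name", "First_Name", "Age", "Zip_Code"}),
    ({"Zip_Code"}, {"City", "State"}),
]
orders_partial = partial_dependencies({"Order_ID"}, orders_fds)

# RATINGS has the composite primary key {SSN, Job}.
ratings_fds = [
    ({"SSN", "Job"}, {"Skill_Level"}),
    ({"SSN"}, {"Last_Name", "First_Name"}),
]
ratings_partial = partial_dependencies({"SSN", "Job"}, ratings_fds)

print(orders_partial)
print(ratings_partial)
```

With these inputs, ORDERS yields no partial dependencies, while RATINGS reports that SSN alone determines the two name attributes.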
If we had another relation, such as RATINGS, as shown in figure 5.4, we would have a situation with partial dependencies.

Figure 5.4

The primary key of relation RATINGS is the composite of attributes SSN and Job, which is why an outer box is drawn around both attributes in the functional dependency diagram of relation RATINGS in figure 5.5. Attribute Skill_Level is dependent on both SSN and Job, the full primary key. This is shown by the arrow emanating from the outer box containing the primary key attributes to attribute Skill_Level. But attributes Last_Name and First_Name are dependent only on attribute SSN, part of the primary key, as shown by the arrow emanating only from SSN in figure 5.5. Thus SSN → {Last_Name, First_Name} is a partial dependency.

Figure 5.5

Besides exhibiting some redundancy, relation RATINGS also has some update anomalies. If we need to put a new employee into the relation, we can't do so until the employee has at least one job. This is an insertion anomaly. Likewise we have a deletion anomaly, because if we delete a job from the relation we might also delete an employee. A modification anomaly exists because of the partial dependency of Last_Name and First_Name on SSN. If we update a specific person's (i.e., SSN's) Last_Name or First_Name, we need to make sure we update all tuples with that SSN or we will have data inconsistency.

To remove these update anomalies we need to convert relation RATINGS into a set of 2NF relations by performing a non-loss decomposition of the original relation into two smaller (degree-wise) relations, JOB_RATINGS and EMPLOYEES (as shown in figure 5.6). The decomposition is done via the relational algebra operation of projection. The original functional dependencies of the RATINGS relation are maintained in the two new relations. Notice also that these two relations have a one-to-many relationship.
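The decomposition by projection can be sketched in Python as follows. The helper names and the three sample tuples are hypothetical (they are not the data of figure 5.4), and the relations are modeled simply as lists of dictionaries. Joining the two projections back together is one way to confirm that nothing was lost.

```python
def project(rows, attributes):
    """Relational projection: keep the named attributes and drop duplicate tuples."""
    seen = set()
    result = []
    for row in rows:
        values = tuple(row[a] for a in attributes)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(attributes, values)))
    return result


def natural_join(left, right):
    """Natural join: combine tuples that agree on every attribute the relations share."""
    if not left or not right:
        return []
    common = [a for a in left[0] if a in right[0]]
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in common)]


def as_set(rows):
    """Compare relations as sets of tuples, ignoring row and attribute order."""
    return {frozenset(row.items()) for row in rows}


# Hypothetical RATINGS tuples (assumed for illustration only).
ratings = [
    {"SSN": "000-00-0001", "Job": "Welder", "Skill_Level": 3,
     "Last_Name": "Adams", "First_Name": "Pat"},
    {"SSN": "000-00-0001", "Job": "Painter", "Skill_Level": 1,
     "Last_Name": "Adams", "First_Name": "Pat"},
    {"SSN": "000-00-0002", "Job": "Welder", "Skill_Level": 2,
     "Last_Name": "Baker", "First_Name": "Lee"},
]

job_ratings = project(ratings, ["SSN", "Job", "Skill_Level"])
employees = project(ratings, ["SSN", "Last_Name", "First_Name"])
rejoined = natural_join(job_ratings, employees)

print(len(job_ratings), len(employees))
print(as_set(rejoined) == as_set(ratings))
```

Because the projection follows the functional dependencies, each employee's name is stored once in `employees`, and the join of the two projections reproduces the original sample tuples.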
Figure 5.6A
Figure 5.6B

In the new JOB_RATINGS and EMPLOYEES relations all non-key attributes are dependent on the full primary key. We have removed the update anomaly problems with employees, since they are now kept in the EMPLOYEES relation, whether or not they have a job. Updating an employee's Last_Name or First_Name is now done in a single place in the database. Also notice that since EMPLOYEES is a relation, and relations do not have duplicate tuples, we have only two tuples (versus the three in the original relation).

We can assure ourselves that this was a non-loss decomposition by performing a natural join of the JOB_RATINGS and EMPLOYEES relations to return to the original RATINGS relation. The relational algebra's natural join operation is the reversibility operation for normalization. If we had performed a decomposition by projection, but not followed the functional dependencies of the RATINGS relation (for example, SSN and Skill_Level for one relation), we would have found that the natural join produced additional, spurious tuples that were not part of the original relation. The existence of spurious tuples is an indicator of a lossy decomposition.

Another situation where we might want to decompose a relation into two smaller relations is where we would have a lot of nulls. One example might be a relation with data about the states, or countries of the world, including ocean shoreline and shipping data. For many land-locked states or countries we would have no shoreline data. If shoreline data were a major part of the relation then the nulls would be significant. But by creating a separate "shoreline data" relation, with a one-to-one relationship to the state or country relation, we would save disk space. Note that this is technically not a normalization issue but rather a relation decomposition issue.

V. Third Normal Form (3NF)

Our ORDERS relation in figures 5.1 and 5.2, although in 2NF, still has some update anomalies. For example, we cannot insert, delete, or update customer data without possibly affecting the order data. This is because Last_Name, First_Name, Age, Zip_Code, City, and State depend on the primary key Order_ID only transitively, through Customer_ID. A transitive dependency is a functional dependency between sets of non-key attributes. We also have a transitive dependency of City and State on Customer_ID through Zip_Code. Note that neither Customer_ID nor Zip_Code is part of the primary key of ORDERS.

By removing transitive dependencies, and assuming we have no partial dependencies, we create relations in third normal form (3NF). This is done for our ORDERS relation by first creating the ORDERS2 and CUSTOMERS relations of figure 5.7. The ORDERS2 relation no longer has a transitive dependency and is now in 3NF.

Figure 5.7A
Figure 5.7B

The CUSTOMERS relation still has the transitive dependency with Zip_Code, so we need to create new relations CUSTOMERS2 and ZIP_CODES as shown in figure 5.8.

Figure 5.8A
Figure 5.8B

From the single ORDERS relation we now have the three 3NF relations ORDERS2, CUSTOMERS2, and ZIP_CODES. If we were to perform natural joins of these three relations we would end up with our original ORDERS relation.

In practice, for most online transaction processing (OLTP) applications, 3NF relations are the goal. But since joining relations (tables, really) is an expensive operation in a relational database, if we have applications such as decision support systems (DSSs), where we might perform many ad hoc queries against tables, we might wish to denormalize our 3NF tables back into 2NF tables for performance considerations.

How does normalization, as discussed in this module, compare to relational database design, using techniques such as entity/relationship diagramming?
Remember that when we used ERDs, we were working from user specifications, not sets of data, so we were trying to define real-world entities for an application. In the case of normalization, as in this module, we started with the data, but the problem was that too much data was stored in one place. Although approached differently, both processes result in a desirable relational database design of 3NF tables that has minimal redundancy, keeping each fact in one place only and removing update anomalies.

Note that the process of normalization to 3NF can also be used after a database has initially been designed via ERDs. For example, we may not have initially realized the Zip_Code to City and State transitive dependencies in our ERDs.

VI. Boyce-Codd Normal Form (BCNF)

As stated in the last section, in practice 3NF is usually sufficient to minimize redundancy. Some relations, however, have anomalies that require even higher normalization than 3NF. As originally defined, 3NF did not address these cases, so the Boyce-Codd normal form (BCNF) was developed. For example, consider the relation DB_DATA as shown in figure 5.9.

Figure 5.9

This relation has two overlapping, composite candidate keys, as expressed by the following functional dependencies:

{Student, Course} → Professor
{Student, Professor} → Course

We also have the functional dependency:

Professor → Course

Relation DB_DATA is in 3NF, but it still has update anomalies. For example, if we delete the fact that Jones is studying IFSM 420, we delete Professor Anyanso. We could remove the anomaly by forming the two relations, STUDENT_PROF and PROF_COURSE, as shown in figure 5.10.

Figure 5.10A
Figure 5.10B

Although these two new relations remove the previous deletion anomaly, unfortunately they cannot be updated independently.
For example, we cannot insert Jones and Pickering into the STUDENT_PROF relation because Jones is already taking IFSM 420 from Professor Anyanso and this would violate the {Student, Course} → Professor functional dependency.

A relation is in BCNF when all the determinants are candidate keys. Notice that the candidate key of STUDENT_PROF is the composite {Student, Professor} and that of PROF_COURSE is Professor, since Professor → Course. In fact, these are the only determinants of each relation. The requirement that all determinants be candidate keys is a stronger (i.e., more restrictive) definition than the original 3NF proposed by Codd.

As noted above, BCNF may be required when there are two or more candidate keys, these candidate keys are composite, and they overlap, which is an anomalous situation. Relations that are in 3NF but not in BCNF are rare in practice. Frequently, relations requiring normalization to BCNF result from poor database design.

VII. Fourth Normal Form (4NF)

Some 3NF relations may have multivalued dependencies of the form:

X →→ {A, B}

This means that each value of attribute set X may have various values of the combination of attribute sets A and B. For example, a Course may have various Professors and various Texts. We cannot state via a functional dependency which unique Professor and Text a course might have, because many combinations are legitimate. To remove multivalued dependencies, we need to decompose a relation into multiple fourth normal form (4NF) relations. In the above example we would have two relations with attributes (and primary keys) of X and A, and X and B.

A multivalued dependency is actually a generalization of the functional dependency in the lower normal forms. Relations with multivalued dependencies are rare in practice, and you will not be asked to perform these normalizations in this course.

VIII. Fifth Normal Form (5NF)

An even more anomalous type of relation exhibits join dependencies, which are a generalization of multivalued dependencies, and requires decomposition to achieve fifth normal form (5NF). Note in the 5NF case that three or more relations are needed in the decomposition process rather than just two, as has previously been the case with the lower forms. Fifth normal form is the ultimate normal form for decompositions based on the relational algebra project operation and is guaranteed to be free of anomalies. Fifth normal form is also referred to as projection-join normal form.

IX. Normalization Case Study

This section is intended to reinforce the concepts of normalization from 1NF through 3NF by presenting a specific example of a relational database table with redundancy and update anomalies. (The relational database terms table, row, and column will be used in this section.)

The table in figure 5.11, from an equipment-rental application, shows several rental transactions for pieces of equipment rented by customers, including when the equipment was rented, when it was returned, what it cost, and which salesman handled the transaction. (Due to the large width of this single table, it is displayed below in two parts, A and B.)

Figure 5.11A
Figure 5.11B

Note that the Equipment column actually appears only once in this table. It is repeated in figure 5.11B for clarity only.

You may notice that this table has a lot of redundancy. The pieces of equipment and customers are repeated several times even in this small sample of data. Each time a piece of equipment is rented, the information about that equipment is repeated. Also, each time a customer rents a piece of equipment, all the information about the customer is repeated. Information about the salesmen is also redundant.

What is the major problem with this table's design?
Very simply, too much information is grouped together. It is unlikely that an experienced database developer would design a table like this for a relational database, but it is instructive for a normalization exercise.

For our example, all we want is a good design; we already have a set of data, so in light of the discussion on relational database design strategies earlier in this module, normalization is probably the best way for us to proceed.

We will want to draw a functional dependency diagram of this application, but before we can draw it, we have to determine the primary key for the RENTALS table (this name was chosen because the primary information being stored for this application concerns rentals). In thinking about this information, you may have concluded that a specific tool, rented by a specific customer, on a specific day is probably the unique identifier for the table. This conclusion, however, assumes a business rule that the same tool cannot be rented by the same customer on the same day more than once. Note that we draw a box around the three attributes (i.e., columns) comprising the primary key for this table. The functional dependency diagram for the RENTALS table is shown in figure 5.12.

Figure 5.12

Let's review some of the functional dependency relationships (stated here as English sentences rather than as functional dependency expressions):

We need only to know the serial number to know the piece of equipment and the tool category.
If we know the piece of equipment, we know the category, but not vice versa.
If we know the serial number, customer account number, and date out, we know the salesman's ID.
If we know the salesman's ID, we know the salesman's name, but not vice versa.
We know the sales type if we know the salesman's ID or the sales position, but we do not know the sales type if we know only the salesman's name.

What is the highest normal form of this table?
Given that there are no repeating groups, it is certainly in 1NF. Is it also in 2NF? Note that not all non-key attributes are dependent on the full primary key. DATE_IN, RETN_COND, COST, and SALES_ID are dependent on the full primary key, but EQUIPMENT, NAME, etc., clearly are not. This fact tells us that the table is only in 1NF.

Do you recognize any problems with this table as it appears? Several update anomalies exist:

If we update a customer's address, we need to make sure that all occurrences of the address are changed, to be consistent (the rows of the RENTALS table shown may only be a small part of an extremely large table).
If we delete a piece of equipment, we also delete all customer information and salesman information involved in the rental.
If we add a piece of equipment, we cannot enter that information into the table until the equipment is rented by a customer (assuming that customer information cannot be NULL).
If we add a salesman, we cannot enter his information into the table until he rents something out (assuming also that customers and rental information cannot be NULL).

Clearly, we need to remedy the redundancy in this table by normalizing the table to remove the update anomalies above.

To normalize from 1NF to 2NF, we need to perform non-loss decompositions. We do this by taking virtual "horizontal slices" through the current primary key. For example, we "cut out" some of the key attributes from the RENTALS table. We can see all the dependencies from the SERIAL_NO and ACCT_NO attributes, so they should be "cut out." Specifically, we should create another relation (i.e., table) that has only SERIAL_NO as the primary key. The ACCT_NO attribute is handled similarly. This forms the set of relations shown in figure 5.13.

Figure 5.13

Note that the tables RENTALS, EQUIPMENT, and CUSTOMERS are all in 2NF because all their non-key attributes are fully dependent on all attributes of the primary key.
EQUIPMENT and CUSTOMERS have only single-attribute primary keys, so these cases are trivial!

To determine whether these three tables are in 3NF, we ask whether all of the non-key attributes are fully dependent on the primary key, and only the primary key. In other words, have all the transitive dependencies been removed? The answer is no. In RENTALS, for example, the primary key determines the SALES_ID, the SALES_ID determines the SALES_POS, and the primary key also determines the SALES_POS. Similar situations can be seen for the non-key attributes in tables EQUIPMENT and CUSTOMERS.

To convert from 2NF to 3NF, we take virtual "vertical slices" through the non-key attributes. For example, we separate attribute SALES_POS from table RENTALS, CATEGORY from table EQUIPMENT, and CITY and STATE from table CUSTOMERS, yielding the set of tables shown in figure 5.14.

Figure 5.14

Figure 5.14 shows seven tables that are all in 3NF. For several reasons, this is a better design than the single RENTALS table or the 2NF tables. The redundancy is removed, as well as the troublesome update anomalies:

Each fact is stored in only one place, thus requiring less disk storage for the database (note the data of the seven tables in figure 5.15).
If we add a customer we need only add him or her to the CUSTOMERS table.
If a piece of equipment is deleted, only one table is affected.

You can see that no information is lost with this design because the projections can be reversed by joining the tables to yield the original RENTALS table.

Figure 5.15A
Figure 5.15B
Figure 5.15C
Figure 5.15D
Figure 5.15E
Figure 5.15F
Figure 5.15G
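The 2NF-to-3NF step of the case study turns on spotting dependencies between non-key columns. A minimal Python sketch of that check follows. The function name is hypothetical, DATE_OUT is an assumed column name for the "date out" part of the key, and only the dependencies stated in the text are listed, so this is an illustration rather than a full model of figures 5.13 and 5.14.

```python
def transitive_dependencies(primary_key, fds):
    """List dependencies whose determinant and dependent are both made of non-key columns."""
    key = set(primary_key)
    found = []
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        dependents = rhs - key - lhs
        # A determinant that shares nothing with the key is a non-key determinant.
        if lhs and lhs.isdisjoint(key) and dependents:
            found.append((sorted(lhs), sorted(dependents)))
    return found


# 2NF RENTALS table: the composite key determines SALES_ID, which determines SALES_POS.
rentals_key = {"SERIAL_NO", "ACCT_NO", "DATE_OUT"}  # DATE_OUT is an assumed name
rentals_fds = [
    (rentals_key, {"DATE_IN", "RETN_COND", "COST", "SALES_ID"}),
    ({"SALES_ID"}, {"SALES_POS"}),
]

# 2NF EQUIPMENT table: the serial number determines the equipment, which determines the category.
equipment_key = {"SERIAL_NO"}
equipment_fds = [
    ({"SERIAL_NO"}, {"EQUIPMENT", "CATEGORY"}),
    ({"EQUIPMENT"}, {"CATEGORY"}),
]

rentals_transitive = transitive_dependencies(rentals_key, rentals_fds)
equipment_transitive = transitive_dependencies(equipment_key, equipment_fds)

print(rentals_transitive)
print(equipment_transitive)
```

Each dependency reported here marks a "vertical slice": the determinant and its dependents move to a new table, and the determinant stays behind in the original table as the link between the two.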