Chapter 5 – Normalization

Module 5 – Logical Design: Table Mapping and Normalization Goals: To learn the process of table mapping normalization and understand why it is part of the database development process. Objectives: To be able to apply these concepts to a variety of problems. Introduction: Now that the systems analysis and design and conceptual design have been completed, it is possible to progress to the next step, logical design. We begin by using table mapping techniques to convert the ERD to a diagram that exposes which entities and attributes are related, or functionally dependent, on others in the ERD. It is with this knowledge that can proceed to the technique of normalization, which validates the ERD. Let’s look at the steps. Transforming ER Diagrams into Relations Step 1: Map Regular Entities Each regular entity type is transformed into a relation. Composite attributes should be broken down into each individual attribute. Multivalued Attributes: Two new relations are created when a multivalued attribute exists for an entity. The first relation contains all of the attributes of the entity type except the multivalued attribute. The second relation contains the two attributes that form the primary key of the second relation. The first attribute is the primary key of the first relation, which becomes a foreign key to the first relation. The second attribute is the multivalued attribute. The name of the second relation should capture the meaning of the multivalued attribute. Step 2: Map Weak Entities For each weak entity type, create a new relation and include all of the simple attributes as attributes of this relation. Include the primary key of the owner relation as a foreign key attribute in the new relation. The primary key of the new relation is a combination of the primary key of the owner and the partial identifier of the weak entity type. Step 3: Map Binary Relationships A. Map Binary One-to-Many Relationships First, create a relation for each of the two entity types participating in the relationship. Next, include the primary key attribute of the entity on the one-side of the relationship as a foreign key on the manyside of the relationship. B. Map Binary Many-to-Many Relationships Suppose that there is a binary many-to-many relationship between two entities, A and B. For such a relationship, we create a new relation C. Include as a foreign key attribute in C the primary key for each of the two participating entity types. These attributes become the primary key of C. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -1- C. Map Binary One-to-One Relationships First, two relations are created. Second, the primary key of one of the relations is included as a primary key in the other relation. Step 4: Map Associative Entities First, create three relations, one for each of the participating entity types and a third for the associative entity. The second step depends on whether an identifier was assigned to the associative entity. Identifier Not Assigned Default primary key for associative relation is the two primary key attributes of the other two relations. These are then foreign keys into the other two relations. Identifier Assigned Identifier is the primary key. The primary keys of the two participating entity types are foreign keys in the associative relation. Step 5: Map Unary Relationships A. Unary One-To-Many Relationships The entity type is mapped to a relation as in step 1. A foreign key attribute is added which references the primary key values. B. Unary Many-To-Many Relationships Two relations are created. One to represent the entity type in the relationship and the other an associative relation to represent the M:N relationship itself. The primary key of the associative relation consists of two attributes. These attributes both take their name from the primary key of the other relation. They are both foreign keys into the other relation. Step 6: Map Ternary (and n-ary) Relationships Convert to an associative entity. Create the three relations. Create a fourth relation for the associative entity. The primary key consists of the primary keys of the other three relations. These attributes are then foreign keys into the three relations. Example Now that we have gone over all of the rules, let’s look at an example. Figure 1 shows an ER Diagram for Wilber’s Bait World. We wish to convert this from the ER Model into a logical model. Let’s take each part piece by piece: Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -2- Figure 1 You will notice a few things about this ER diagram. First, there are two associative entities: Cust_Order and the Has between Baitshop and Item. You will also notice that there is a weak entity, Dependents. Also, notice that we have several composite attributes, all for addresses. Finally, we have a multi-valued attribute on employee, called skill. Let’s take this step by step. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -3- 1. Map Regular Entities 1.1. Baitshop Baitshop has a composite attribute, address. We will need to break this down into separate fields. Also, we will underline Shop_No since it is the primary key. Figure 2 shows the relation, or table, that will result from the Baitshop entity: SHOP_NO SHOP_NAM STREET CITY STAT ZIPCO TELEPHO E EFigure 2 DE NE 1.2. Employee First, let’s create a new relation for the employee entity: EMP_ID F K NAME STREET MANAGERI D TAXRA SHOPRE TE NT CITY STAT ZIPCOD TELEPHON E E Figure 3E Employee also has a multi-valued attribute, so we need to create a second table. We will call this employee_skill. As we have been instructed in Step 1, we should include the primary key of the Employee table as well as the multivalued attribute itself, as shown in Figure 4: EMP_ID Skill Figure 4 Emp_ID is a foreign key to the employee table 1.3. Customer Like the Employee entity, the Customer entity has a composite attribute, address. The table that results from this entity is shown in Figure 5: CUSTOMER_ NAM STRE CIT STAT ZIPCOD WTELEPHO HTELEPHO ID E ET Y EFigure 5E NE NE CREDIT_LI MIT 1.4. Item Figure 6 shows the table that results from the Item entity: ITEM_NO 2. F K DESCRIPTI TYPE PRIC WHOLESALER_I ON Figure 6 E D Map Weak Entities We have one weak entity, Dependents. Figure 7 shows the table that results from the Dependents entity: EMP_ID NAM GENDE R Figure 7E Note that the primary key is composed of EMP_ID and NAME. Also, EMP_ID is a foreign key to the employee table. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -4- 3. Map Binary Relationships 3.1. Map Binary One-to-Many Relationships We have one binary one-to-many relationship between Baitshop and Employee. Remember our rule, the primary key from the one side migrates to the many side. So, we change the employee relation to include the primary key from the Baitshop relation. The new tables are represented in Figure 8: SHOP_NO SHOP_NAME STREET CITY STATE ZIPCODE TELEPHONE MANAGERID TAXRATE SHOPRENT FK EMP_ID NAME STREET CITY STATE ZIPCODE TELEPHONE SHOP_NO Figure 8 3.2. Map Binary Many-to-Many Relationships The only binary Many-to-Many relationship that we have is between Item and the associative entity, Cust_Order. We will map this later, once we resolve the associative entity. 3.3. Map Binary One-to-One Relationships We have a one-to-one relationship between employee and baitshop. We can choose to place the primary key of either one in the other. However, since all baitshops must have a manager, it would be best to place the employee id in the baitshop table. We will call this ManagerID. We represent this in Figure 9 SHOP_NO SHOP_NAME STREE T EMP_ID NAME CITY STATE ZIPCODE STREET TELEPHONE MANAGERID TAXR SHOP MANAGERID ATE RENT FK CITY STATE ZIPCODE TELEPHONE SHOP_NO Figure 9 4. Map Associative Entities 4.1. Identifier Not Assigned To Baitshop In this example, we have both cases. Let’s look at the first case, identifier not assigned. The Has associative entity between BaitShop and Item does not have an identifier. We will convert this to a relation, called Baitshop_Invent. According to our rules, we need to create a primary key that consists of the two primary keys from BaitShop and Item. These become foreign keys into Baitshop and Item. There are also 2 attributes, Quantity and Last_Order. These need to be included in the relation also.Figure 10 shows this new relation. SHOP_NO ITEM_NO To Item QUANTITY LAST_ORDER Figure 10 Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -5- 4.2. Identifier Assigned In this case, we use the identifier as the primary key. In our case study, we have an associative entity between Baitshop and Customer. We will also include the primary keys from both of these, but not as a composite primary key but rather as foreign keys. We will also include the Purchase_Date as a field. The relation is represented in Figure 11 and we will call it Cust_Order. Figure 11 ORDER_I D SHOP_NO CUSTOMER_I D PURCHASE_DATE To To Customer Baitsho p Relationships 5. Map Unary We have no Unary relationships in our example. 6. Map Ternary Relationships We have no Ternary relationships. 7. Miscellaneous One final relationship that we have to clean up, the Has between Item and Cust_Order. This is a many-to-many relationship. Therefore, we need to create a new relation, which we will call Order_Line. Order_Line is represented in Figure 12 and has a composite primary key composed of the primary key from Order_Line and the primary key from Item. We also include the quantity attribute. ORDER_I D ITEM_NO QUANTITY Figure 12 Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -6- The Complete Logical Model We will now put the entire logical model together. It is represented below: Baitshop SHOP_NO SHOP_NAME STREE T Employee CITY STATE ZIPCODE TELEPHONE MANAGERID TAXR SHOP MANAGERID ATE RENT FK FK NAME EMP_ID STREET CITY STATE ZIPCODE TELEPHONE SHOP_NO FK Emp_Skill EMP_ID Skill Dependent EMP_ID NAM GENDER E Customer CUSTOMER_ NAME STREET CITY STATE ZIPCODE WTELEPHONE HTELEPHONE CREDIT_LIMIT ID Item ITEM_NO DESCRIPTION TYPE PRICE WHOLESALER_ID FK Baitshop_Invent SHOP_NO ITEM_NO QUANTITY LAST_ORDER FK FK Cust_Order FK ORDER_I D SHOP_NO CUSTOMER_I D ITEM_NO QUANTITY PURCHASE_DATE Order_Line ORDER_I D FK Normalization Normalization is an important part of the design of relational databases. Normalization is the process of applying a set of rules against a proposed set of tables to identify and eliminate problems in the proposed data structure. This is Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -7- part of the process of converting our conceptual design, as represented by our ER diagram, into a logical design before physical implementation. Normalization is used to validate, not replace, the ER diagram. If the process of normalization reveals errors in the ERD, the ERD must then be updated. The resulting relations are then turned into a table map. The table map is another way to graphically represent the database and how the tables relate to each other. This table map, which will be covered in this text, is generally considered part of appropriate system documentation. It is also a good tool for the database developer to keep on hand to remind him or her of the structure of the database. It is especially handy for large projects, unless one has a photographic memory with abundant storage. Many beginners and lazy database designers proceed to the logical step by simply composing a table map of the entities and skipping the normalization process. This is a sloppy and dangerous practice, since normalization often reveals subtle but important nuances of how data in relational tables behave. These oversights will never remain buried once the implementation is in use. It is also a way for the database developer to guarantee s/he has a firm grasp of the inherent meaning of the data. Review of Keys As previously discussed, the relational model represents data as a series of tables. Data is represented as a set of rows comprised of columns. The order of the columns is insignificant. All rows must have a column that acts as a primary key (with its properties of never being blank and always unique). Each column contains an attribute, or a descriptor of a piece of information relevant to the database. Each table, made up of these rows, is called a relation. Any attribute, or combination of attributes, that could possibly serve as a key is called a candidate key. It is from the pool of potential keys, the candidates, that primary, secondary and foreign keys are chosen. A candidate key has the properties that it will uniquely identify the row, whether it is a simple attribute or combination of attributes. It also must be nonredundant; that is, no attribute of the key can be deleted without losing its unique identification status. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -8- Primary keys are assigned for several reasons: to serve as a value to locate the exact relation desired; to guarantee uniqueness of each row; and to be used in referential integrity, to be discussed later in the text. We denote primary keys by underlining them in their rows, like this: CUSTOMER (Customer_ID, Customer_last, Customer_first, Street, City, Zip, Telephone) There are cases in which the primary keys may be more than one attribute. This is the case with weak entities, which inherit the primary key from the strong entity in addition to its own identifier. This is known as a composite key, and is denoted in this manner: DEPENDENT (Employee_ID, Dependent_ID, Dependent_name, date_of_birth) Remember from our discussion of keys that there can be foreign keys as well. A foreign key is a primary key from another relation that has emigrated to another relation. This can serve two purposes: one, the insertion of the primary key prevents the need to duplicate attributes from another table. As an example, we need to know which customer has placed an order, but we don’t want to duplicate this customer information in the order table. By simply storing the Customer_ID, we can join the two tables when necessary to recreate a customer’s order. The second use of a foreign key is referential integrity. Referential integrity can also be programmed into the back end of the database at physical implementation. What referential integrity means is that every child must have a parent. Using the same example of the customers and the orders, we would never want to insert an order record containing a Customer_ID that did correspond back to a valid relation in the table. Foreign keys are denoted with a broken line: ORDERS (Order_ID, Customer_ID, Date) Why Normalize? Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. -9- Every time a change is made to a database, consider it a chance for the database to be damaged. There are three points of entry for errors that will lead to inconsistency and duplication in the database: when inserting a new record, changing the value of an existing record, and deleting records from the file. Proper design in this stage will help prevent these problems, called anomalies. There are ways during the physical implementation that database developers can further protect the database from damage. For now, we rely on normalization. Consider the following table: StudentID S_Name TeacherID T_Name InstruID Instrument 01474 01423 01233 12100 Luigi Ming Veronique Victor 143002 143002 122179 144300 Wolf Wolf Greylock Blair 31 31 22 93 Oboe Oboe Violin Cello This design was quite common at one time. This file structure, called a flat file, evolved from when data was kept on punched cards. Punched cards had eighty columns. All data pertaining to one record was on one card. The stack of punched cards was the database! This is also the format of data in spreadsheets. It is also an easy way to design a file structure: simply place all necessary fields in one big record. It is not out of the realm of possibility that a database developer today will encounter such a file structure. Projects often spring from spreadsheets that can no longer hold the size of the database necessary, or when the ability to extract data from the spreadsheet reaches a point of diminishing returns. The data may be exported from an older technology, such as a mainframe. Or, the newly assigned database developer may simply be the victim of an untrained or lazy database developer. This is a table in use by the ersatz Passageways Music School, a nonprofit organization that provides instruments and musical instruction to low income students. Each student chooses one instrument type to learn, but can choose another. Each instructor may teach one, or more, instruments. Each week, their assigned teacher meets with them Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 10 - after school for lessons. The instruments have an ID on them to discourage theft. Each student comes in, signs out the assigned instrument, has a lesson, then signs the instrument back in. The school assigns only one instrument of a type to a student, since different instruments sometimes feel differently to the students. What’s wrong with this picture? Look at the structure and ask yourself why this is a bad design. Remember that one always keeps the bottom line in mind when designing data. The bottom line is generally how the data can be more productively used or more information extracted. Ask yourself the following questions:  What will happen when a student takes more than one instrument?  What will happen when an instructor is able to teach more than one instrument or student?  What will happen when an instrument is assigned to different students for their individual sessions? In each case, the answer is the same: when there is another record inserted into this table, the data pertaining to the other columns gets repeated. Consider the first question; Ming is doing so well with the oboe that she decides to try the violin as well. The record that reflects this decision is inserted. StudentID S_Name TeacherID T_Name InstruID Instrument 01474 01423 01233 12100 01423 Luigi Ming Veronique Victor Ming 143002 143002 122179 144300 122179 Wolf Wolf Greylock Blair Greylock 31 31 22 93 24 Oboe Oboe Violin Cello Violin When a student takes more than one instrument, his or her information will be repeated. So is the information for the instructor. This is unnecessary duplication. Why do we care if information is duplicated? Consider the issue of space. Repeating information unnecessarily takes up space on our physical server. While this may seem an insignificant matter, when databases increase in size it becomes quite significant. Larger files cost in terms of disk space, and amount of time to join, back up and restore. It will also take more time to search when searching on a single attribute. To retrieve a listing of students being taught by Mr. Greylock, for example, a search of the entire database must take place. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 11 - And what happens if this search for Mr. Greylock’s students hits a record where his name is misspelled as Mr. Graylock? That record would be skipped, since Greylock and Graylock do not match. The more times an attribute is entered, the more chances for user error. Some intrepid flat-filers at this point might try simply repeating columns instead of rows. If a student takes a second instrument, add a column called Instru2. This requires repeating the columns TeacherID, T_name, and InstruID as well. These are called repeating groups. It also inserts a blank value for every student not taking a second instrument. This will work for spreadsheet applications, but not for modern back ends that will reject the blank keys. This is known as an insert anomaly. This awkward structure will also cause very awkward queries to be written to extract needed information. Repeating groups correspond to multivalued attributes in our ERD. It is possible, and in fact likely, that a student could be learning more than one instrument, or that a teacher could teach more than one instrument, just as it is likely that an employee have more than one skill or one dependent. Repeating groups are always eliminated in relational databases. And what happens when Passageways accepts a new student who has not been assigned a teacher as yet? We know that keys are necessary to make a row unique. In this example, a composite key is needed to guarantee uniqueness: StudentID + TeacherID + InstruID. Since we know that the properties of a key require that they can never be null, we can’t insert this new student until a teacher and an instrument are assigned. This is again an insert anomaly. Now assume that Miss Wolf has gotten married and is changing her name to Fox. How many times does the change need to be executed? Once for every student, or twice in this example. This consumes systems resources as the file is scanned for all values of “Wolf.” Physical disk retrieval is the slowest part of any system. We want to go to the disk as little as possible. We are also increasing the chance for errors with each update. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 12 - Finally, if one record is skipped for one reason or the other—say a new student for Miss Wolf had not yet been posted to the database while the name change is running—we now have a database with both values, Wolf and Fox. The database is said to be inconsistent—the values are not the same for all matching rows. These types of errors are called update anomalies. Look at the record that details Victor taking cello lessons from Mrs. Blair. What if Victor decides to drop out, and we need to delete his record? Now we have lost all data pertaining to Mrs. Blair; that she exists and that she teaches the cello. This is a delete anomaly. This delete anomaly also costs us data if Mrs. Blair decides to leave, or decides to stop teaching the cello. It is the problem of repetition, wasted space, and insert, update and delete anomalies that require normalization to take place. The relational model is built upon a strong foundation of relational algebra and calculus, but it is not necessary to know the math to perform a theory-based normalization process. Let’s begin. The Start of Normalization: 0NF (Zero Normal Form) The start of normalization is called 0NF (Zero Normal Form). This step is often skipped. 0NF is simply all columns written out, bad design, repeating groups and all. 0NF is the structure we would see if we allowed the database designer to insert the attempted Instru2, Instru3 columns. These should have been eliminated at the ERD level. In cases of designs by people unfamiliar with ERD techniques, this occurs often. When converting legacy file structures of this type, the first step is to eliminate these repeating groups. 1NF (First Normal Form) All repeating groups have been eliminated. In addition, all composite attributes are broken down into simple attributes (Address may be broken down into Number, Street, City, and Zip). At this point, 1NF (First Normal Form) has been reached. 1NF is the table design that we see at the beginning of this chapter. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 13 - In order to progress to second normal form, we need to understand functional dependencies. In 2NF, there are no partial functional dependencies. In robust relational tables, we want full functional dependencies. What is meant by full functional dependencies? An attribute is functionally dependent upon another attribute if X determines Attribute Y. Y is said to be functionally dependent upon the value of X. Knowing the value of X would be sufficient to find out the value of Y. Also, for any value of Y, there is only one value of X. X, the determing value, is the determinant. The notation is XY. We can say that Social Security Number is a determinant of a person’s Last name and First Name. For any SSN that we see, we can determine the first and last name of the person. Also, for any given person (as represented by first and last name, which admittedly is not unique) there is only one SSN. SSN  Person’s Name We draw this out using dependency maps like this: SSN Name Partial functional dependency is where a nonkey attribute depends on part of the key but not all of it. Leaving partial functional dependencies in will create problems in the relational model. It is not possible to have partial functional dependencies in relations with a simple key (since there is no other part of the key for a nonkey attribute to depend upon). There is a composite key for the above table. The dependency map looks like this: StudentID S_name TeacherID T_Name InstruID Instrument Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 14 - The student name is determined by the student ID, the teacher name is determined by the teacher ID, and the instrument is determined by the instrument ID. For each value of student name, teacher name, and instrument, there is only one ID. The problem is that all of these values are in one table. The purpose of this database is keep track of which student is taking which instrument with which teacher, and that purpose it meets. However, it is full of partial key dependencies. The key is StudentID + TeacherID + InstruID. While the student’s name is dependent upon the student ID, it is not dependent on the teacher ID. The student’s name depends upon only one of the three attributes in the key and thus violates the rule against having partial key dependencies. The same goes for the teacher’s name and instrument. How do we solve this? By breaking the relation into three relations, each with its own key and only those attributes that meet the full functional dependency requirement. We now have three relations: STUDENT (StudentID, S_name) TEACHER (TeacherID, T_name) INSTRUMENT (InstruID, Instrument) Each nonkey attribute in the relation is fully functionally dependent upon its key, or ID. We can now progress to 3NF. 3NF (Third Normal Form) Third normal form (3NF) is 2NF with the addition of a new rule: no transitive dependencies. Remember how full functional dependency works: The value of X will always determine the value of Y, and for every Y there is only one X. If only part of the dependency is met, the dependency is said to be transitive. Let’s say that now that we have a better file structure, Passageways has decided to add more attributes. The new structure is: Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 15 - STUDENT (StudentID, S_last, S_first, S_Street, City, Zip) TEACHER (TeacherID, T_name, T_first, T_Street, City, Zip) INSTRUMENT (InstruID, Instrument) The zip code is the transitive dependency. If we know the StudentID, we can find out the zip code, but for any given zip code, we cannot identify the student. There may be more than one student in a given zip code. So, even though zip code describes the employee, it is a transitive dependency, and in theory, should be removed. In this case, splitting the tables would cause another join when the student or teacher address is extracted. STUDENT (StudentID, S_last, S_first, S_Street, Zip) TEACHER (TeacherID, T_name, T_first, T_Street, Zip) ZIPCODE (Zip, City, State) INSTRUMENT (InstruID, Instrument) The zip field is underlined to denote that it is now a foreign key. It may sometimes be worth leaving in a transitive dependency to save on performance and simplicity. Putting the information about the city and zip code back into the original relations is called denormalization. We will learn more about how to make these decisions in later chapters. When we move from 2NF to 3NF, it is necessary for us to know something about the meaning of the data and how it will be used. It is also important not to lose any information when we decompose the tables. The goal is “lossless decomposition,” or, checking to see if decomposing the original relations into several smaller relations has lost any meaning from the data. In this case, we have. Remember that the goal of this database is to keep track of which students are taking which lessons from which instructors. Do you see any way of extracting this information from this design? We need to have another table. Let’s call it “Lessons,” since it reflects a student taking lessons from one instructor with one instrument. What needs to go into this relation? LESSONS (StudentID, TeacherID, InstruID) Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 16 - Why are all the attributes keys? Since we must have uniqueness, and we don’t want duplicate records inserted. The back end will reject student assignments for different teachers or instruments. This database is now in 3NF. Go back and verify full functional dependency. Advanced Normal Forms It was thought that 3NF was a far as one needed to go to ensure a robust design. Experience showed that still anomalies existed beyond 3NF. These are Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF) and Fifth Normal Form (5NF). 4NF (Fourth Normal Form) Boyce-Codd Normal Form is a stronger form of 3NF. Since this example does not lend itself to a straightforward demonstration of BCNF, we will return to BCNF after demonstrating 4NF and 5NF. Our complete database now looks like this: STUDENT (StudentID, S_last, S_first, S_Street, Zip) TEACHER (TeacherID, T_name, T_first, T_Street, Zip) ZIPCODE (Zip, City, State) INSTRUMENT (InstruID, Instrument) LESSONS (StudentID, TeacherID, InstruID) Refer back to the original narrative; the database also exists to keep track of which student has which instrument during a lesson. Look at the Lessons relation. We can see that it is possible to extract that information from the relation. But look at the relation again; we know that a student signs out an instrument, meets with an instructor, and then signs the instrument back in. Doesn’t this relation imply a fixed relationship between the student, the instructor, and the instrument? (It actually also implies that an instructor only teaches one instrument, which is not likely to be true all the time.) Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 17 - Since moving from lower to higher normal forms always requires knowing the meaning of the data stored, and the purpose of the data stored, we must know if there is a fixed relationship. For the previous normal forms, that assumption was correct. Unfortunately now, the director of Passageways tells us that this is no longer true. There are not enough teachers to guarantee that each student will always get the same one each time. A teacher may be desired during the same hour for more than one student, so one will get the teacher s/he prefers and one will not. This will require you to move to 4NF. We need to progress to 4NF when there are multivalued attributes A, B and C, where the sets of values for B and C are independent of one another. In this case, the instrument being taught (attribute C) is independent of the teacher (attribute B). To move to 4NF, we again decompose the relations by splitting them. The resulting relations are: STUDENT_INSTRUMENT (StudentID, InstruID) STUDENT_TEACHER (StudentID, TeacherID) With this structure, we can keep track of which student has used an instrument, however many s/he plays, without redundancy or null values. We can also keep track of which student saw what teacher, regardless of how many teachers have instructed that student. The underlying issue is that the previous relation of Assignments was representing two 1:M relationships. 1 Student can play Many instruments, and 1 Student can have Many teachers. Two relations with one 1:M relationship each replaced this one relation with two underlying 1:M relationships. As data entry took place, think of what would happen considering the number of instruments being learned, and the number of teachers giving lessons in that instrument. The number of records in the database would have expanded exponentially. The resulting two relations will save considerable space while adding flexibility. This is the definition of 4NF: it is 3NF plus the removal of more than one 1:M relationships with a relation. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 18 - 5NF (Fifth Normal Form) 5NF occurs when constraints are placed against the data. Looking at the resulting two relations of STUDENT_INSTRUMENT and STUDENT_TEACHER, couldn’t an inference be made that if a student takes a given instrument, and sees a given teacher, that that teacher seen plays any of the instruments the student is taking? Even if that inference could not be made, it’s possible that a new volunteer attempting to assign different students to different instructors could use some help. We actually lost information when we decomposed the LESSONS (StudentID, TeacherID, InstruID) relation to meet the requirements of 4NF. We could certainly have discovered which teacher could teach which instrument from this relation; if a teacher was in a lesson with a student, we could assume that teacher could play it. When we split the table into the STUDENT_INSTRUMENT and STUDENT_TEACHER relations, we lost this ability. We needed more meaning, that is, to know the constraint that a teacher could not teach every instrument offered. 5NF occurs when we have constraints placed upon independent multivalued attributes. The independent multivalued attributes in this case are the instruments and the teachers, and they are independent of one another. We can easily model this constraint by adding another relation: TEACHER_INSTRUMENT (TeacherID, InstruID) So, our final design looks like this: STUDENT (StudentID, S_last, S_first, S_Street, Zip) TEACHER (TeacherID, T_name, T_first, T_Street, Zip) ZIPCODE (Zip, City, State) INSTRUMENT (InstruID, Instrument) STUDENT_INSTRUMENT (StudentID, InstruID) STUDENT_TEACHER (StudentID, TeacherID) TEACHER_INSTRUMENT (TeacherID, InstruID) Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 19 - We wound up with seven tables from one original table. BCNF (Boyce-Codd Normal Form) Boyce-Codd Normal Form (BCNF) is a stronger form of 3NF. BCNF is based on the principle that every determinant (value that determines another) in the relation must be a candidate key (able to uniquely identify the row). Relations that have only one candidate key are automatically in 3NF, just like relations that have no nonkey attributes are automatically in 3NF. Every relation in BCNF will be in 3NF, but not every 3NF is in BCNF. Relations in 3NF that violate BCNF are quite rare. Violations of BCNF occur when there two or more candidate keys that overlap each other, that is to say the candidate keys share some attribute in common. BCNF states that for a dependency XY to remain in a relation, X must be a candidate key. To test for BCNF, identify all determinants and make sure they are all candidate keys. Passageways now wants us to model the student room assignments. A student may only receive one lesson per day, and there is only one student in a room at a time. Teachers are assigned one room each day, but there can be more than one teacher using a room on that day. Consider the following relation, which they believe to be in 3NF: STUDENT_APPOINTMENT (StudentID, Appt_date, Appt_time, TeacherID, Room) Assume that a student can have only one appointment per day. The dependency map looks like this: Student ID Appt_date Appt_time TeacherID Room This candidate key has been chosen as the primary key. This is appropriate, since we are modeling students and their appointments. We don’t need to include appointment time in the key since students are limited to one appointment per day. Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 20 - Room Appt_date Appt_time Student ID This is a second functional dependency found in this relation. This one may remain, since this combination of attributes is also a candidate key. Room Appt_date Appt_time TeacherID Student ID This third functional dependency could also be a candidate key. Appt_date TeacherID Room This functional dependency poses a problem. The room is determined by the teacher and the date, but this is not a candidate key because it does not uniquely identify the row—there can be more than one teacher in a room per day, so knowing the room number may get us more than one combination of teacher and date. Why would this cause problems? Let’s populate the table and see what may happen. StudentID Appt_date Appt_time TeacherID Room 01474 01423 01233 12100 10-02-00 10-02-00 10-02-00 10-03-00 10:00 AM 11:00 AM 01:00 PM 12:00 PM Wolf Wolf Greylock Blair A A A B Ask yourself the following questions:  What will happen when a teacher calls in to cancel all the day’s appointments?  What will happen if the teacher is reassigned to a different room?  How many times are you storing teacher information for one day? This relation causes duplication and insert anomalies. Our solution is to decompose the original relation into two relations as follows: Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 21 - STUDENT_APPOINTMENTS (StudentID, Appt_date, Appt_time, TeacherID) ROOM_ASSIGNMENTS (TeacherID, Appt_date, Room) Now, the information concerning a teacher’s daily room assignments are not repeated. Is it always a good practice to decompose into BCNF? We should only split tables if all functional dependencies are preserved. If a determinant is separated from the attributes that it determines, we might not decompose. In this case, we have lost a functional dependency. We have lost the functional dependency that tells us that: Room + Appt_date + Appt_time  TeacherID, StudentID, since the information about what student is in the appointment. In this case, losing this functional dependency saves us some update problems, so we will leave the relations as they stand. We should do out Wilbur’s Bait world here Copyright Lisa MacLean, Wentworth Institute of Technology, 2001. All rights reserved. Unauthorized use or copying strictly prohibited without consent. - 22 -

Chapter 5 – Normalization

Related documents

Products

Support

Chapter 5 – Normalization

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib