Normalization
INLS 256-02, Database I

Normalization refers to rigorous, formally defined standards for good design, and to methods for testing a DB’s design against those standards.

4 Indications of Quality:

1. Semantics.

2. Redundancy. You want to avoid redundancy as much as possible.

Insertion Anomalies: Example: (s#, name, address, p#, qty), where (s#, p#) is the key. Each time we enter a new shipment of a part, we must be sure that the name and address are the same as in all the other tuples for that supplier number. Otherwise, depending on which tuple you look at, you could get different information.

Another example: (VID, model, SSN, name, address, purchase-date), where (VID, SSN) is the key. Each time we want to enter a new car, we must also enter the SSN and address information of the owner.

So the two problems of insertion are
1. redundancy gives opportunities for inconsistency, and
2. mixing information from different entities in one tuple means that you have to “cheat” to represent an entity when you only have part of the information.

Deletion Anomalies: Using the supplier example above, what happens when we delete the last shipment of a particular supplier? We lose the name and address information of the supplier. Because we’ve mixed entities, we can lose “permanent” information on one entity when we only mean to delete a single occurrence of something.

Modification Anomalies: In the supplier example, suppose a company moves. We must change its address in every tuple in which it appears, or else we’ll have inconsistent information. Similarly, if a car owner moves, the address must be changed in all of that owner’s records. Note that the cascade update option in enforcing referential integrity might catch some of these changes for anything that has been declared as a foreign key, but not for other fields.

3. Null values.

4. Spurious tuples. A spurious tuple is one that you “invent” by joining two badly designed tables together.
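The idea of a spurious tuple can be made concrete with a small sketch. The data, the attribute split, and the helper names project and natural_join below are all hypothetical and chosen only for illustration; the point is that joining two projections on a shared attribute that is not a key can produce tuples that were never in the original table.

```python
def project(rows, attrs):
    """Project a relation (a list of dicts) onto attrs, dropping duplicate tuples."""
    distinct = sorted({tuple(row[a] for a in attrs) for row in rows})
    return [dict(zip(attrs, values)) for values in distinct]


def natural_join(left, right):
    """Natural join: pair up rows that agree on every shared attribute name."""
    if not left or not right:
        return []
    common = [a for a in left[0] if a in right[0]]
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if all(left_row[a] == right_row[a] for a in common)
    ]


# Hypothetical data: which supplier ships which part, and from which city.
shipments = [
    {"s#": "S1", "p#": "P1", "city": "Durham"},
    {"s#": "S2", "p#": "P2", "city": "Durham"},
]

# A poor decomposition: city is the only shared attribute,
# and it does not identify a supplier or a part.
supplier_city = project(shipments, ["s#", "city"])
part_city = project(shipments, ["p#", "city"])

joined = natural_join(supplier_city, part_city)
spurious = [row for row in joined if row not in shipments]

for row in spurious:
    print("spurious:", row)
```

Every original tuple is still present in the join, but so are pairings such as S1 with P2 that were never recorded, and nothing in the joined table distinguishes the real tuples from the invented ones.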
Functional Dependency

Functional dependency has to do with the semantics (domains) of the tables in your DB, so in order to determine where the dependencies lie, you must understand the domain. Dependencies are constraints that must hold at all times in the DB, so you can’t tell what is required or allowed just by looking at some existing tuples.

In a relation R, an attribute Y is functionally dependent on attribute X if each X-value in the relation is associated with only one Y-value at any one time. X and Y may be composite attributes.

R.X -> R.Y    X functionally determines Y (note that Y may be associated with more than one X)

E.g., in suppliers: S(s#, name, status, city), where status refers to the shipping speed.

S.s# -> name
S.s# -> status
S.s# -> city

If X is a candidate key, then all Y’s will necessarily be functionally dependent on X, by definition of a candidate key (it is unique for each tuple). This is especially important in terms of the primary key. E.g., in shipments: SP(s#, p#, qty)

(s#, p#) -> qty

The composite key functionally determines quantity.

Note that this doesn’t mean that X HAS to be a candidate key. We can rephrase the definition to say that Y is dependent on X if, whenever two tuples have the same X value, they also have the same Y value.

Full functional dependence: This is a slightly more restrictive definition. Y is fully functionally dependent on X if it is functionally dependent on all of X, not just on a subset. That is, if X is composite, Y must depend on the combination of attributes making up X, and not on any one of them alone. E.g.:

(s#, status) -> city

city is dependent on the combination of s# and status, but it is fully dependent only on s#. We will generally use the full functional dependence definition.

Normalization: This is a process of looking at the tables in a DB to see if they pass a series of tests. Generally, if a table fails a test, the solution is to break it up into smaller tables.
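The rephrased definition of functional dependency (two tuples with the same X value must have the same Y value) translates directly into a check over a table’s current contents. The sketch below is illustrative only: the function name fd_violations and the sample supplier rows are assumptions, not part of the course material. As noted above, a dependency is a statement about the domain, so a check like this can show that a proposed dependency is false for the data at hand, but it can never prove that one must always hold.

```python
def fd_violations(rows, x, y):
    """Return pairs of rows that agree on the attributes in x but differ on y."""
    first_seen = {}
    conflicts = []
    for row in rows:
        x_value = tuple(row[a] for a in x)
        y_value = tuple(row[a] for a in y)
        if x_value not in first_seen:
            first_seen[x_value] = (y_value, row)
        elif first_seen[x_value][0] != y_value:
            conflicts.append((first_seen[x_value][1], row))
    return conflicts


# Hypothetical contents of S(s#, name, status, city).
suppliers = [
    {"s#": "S1", "name": "Smith", "status": 20, "city": "London"},
    {"s#": "S2", "name": "Jones", "status": 10, "city": "Paris"},
    {"s#": "S3", "name": "Blake", "status": 30, "city": "Paris"},
]

# s# -> city: no two rows share an s#, so nothing contradicts it.
s_to_city = fd_violations(suppliers, ["s#"], ["city"])

# city -> status: two Paris suppliers have different statuses.
city_to_status = fd_violations(suppliers, ["city"], ["status"])

print("s# -> city contradicted by:", s_to_city)
print("city -> status contradicted by:", city_to_status)
```

Here the data is consistent with s# -> city and contradicts city -> status, which matches the intent of the supplier example: status is a property of the supplier, not of the city.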
Normal forms are increasingly strict levels of tests, and are designed to avoid the problems of anomalies, redundancy, and ambiguity we discussed earlier. Generally, you want to normalize up to 3rd or BC normal form, although there are situations where, because of the semantics or use of the DB, that won’t be practical. When you decide not to normalize to some degree, you must make that decision consciously, and try to safeguard the DB from the problems that could arise.

Normalization up to BC normal form is based on functional dependencies and key constraints. We’ve just learned about functional dependency; here’s some review of terms.

superkey: any combination of attributes that uniquely identifies each tuple in a relation.
candidate key: any minimal combination of attributes that uniquely identifies each tuple in a relation; you can’t remove any attribute and still have unique identification.
primary key: the candidate key chosen to be the unique identifier for a relation.
secondary key: any of the remaining candidate keys.
prime attribute: an attribute that is a member of a candidate key.

1NF – First Normal Form: A relation is in 1NF if and only if all underlying domains contain atomic values only.

1NF isn’t very interesting; it is a stepping stone to the others. For instance, if we have (s#, name, city), where each supplier may have several branch locations in different cities, then the relation has a domain that allows sets of cities as values, and those values aren’t atomic:

s#   name   city
1    AAA    Durham
2    BBB    Durham, Raleigh

The solution is to decompose the table into 2 tables: (s#, name) and (s#, location).

2NF – Second Normal Form: A relation is in 2NF if and only if it is in 1NF and every nonprime attribute is fully dependent on the primary key. Generalized 2NF: every nonprime attribute is fully dependent on every candidate key, not just the primary key.
Ex1(s#, tax, city, p#, qty) (where tax is the tax rate of the city)

The primary key is (s#, p#). tax is determined by the city, and so is functionally dependent on city. The dependencies are:

(s#, p#) -> qty
s# -> city
city -> tax

Note that the functional dependency diagram has dependency arrows that don’t just come out of the key; they also come from other attributes. Also note that the arrows to tax and city come only out of s#, not out of the entire key. This structure will have update anomalies.

Insert. You cannot enter the existence of a new supplier and city unless that supplier is shipping a part. This is because of the entity integrity rule, that all fields in the key must have values, and p# is part of the key.
Delete. If a supplier has only one shipment, and it gets deleted, you also delete all knowledge of that supplier, such as the city.
Update. Because of the redundancy, if a supplier moves to a different city, then you must either 1. find all occurrences of the supplier and change the city, or 2. change one occurrence, and have an inconsistent DB.

The solution is to divide this into two tables, where the key of the new table is the attribute that was independently determining the values of some attributes. So now we have

Ex2a(s#, tax, city)
Ex2b(s#, p#, qty)

The functional dependency diagram shows that each of them now contains only attributes that are fully dependent on the primary key of its relation.

Insert: we can now insert the existence of a supplier into Ex2a, without a shipment.
Delete: we can now delete a shipment from Ex2b without losing information about the supplier.
Update: we can now update the city in only one place.

3NF – Third Normal Form: A relation is in 3NF if and only if it is in 2NF and every nonprime attribute is nontransitively dependent on the primary key. Generalized 3NF: every nonprime attribute is fully functionally dependent on every key, and nontransitively dependent on every key.
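The 2NF analysis of Ex1 above can be sketched in code. The sketch uses attribute closure, a standard technique that these notes do not otherwise cover: the closure of a set of attributes is everything those attributes determine under the given dependencies. The function names and the representation of dependencies as (left side, right side) pairs of sets are assumptions made for illustration, and the sketch assumes the relation has a single candidate key, so every attribute outside that key is nonprime.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes functionally determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, attributes, fds):
    """Map each proper subset of the key to the nonprime attributes it determines.

    Assumes `key` is the only candidate key of the relation.
    """
    nonprime = set(attributes) - set(key)
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds) & nonprime
            if determined:
                found[subset] = determined
    return found


# Dependencies stated for Ex1 in the text.
fds = [
    ({"s#", "p#"}, {"qty"}),
    ({"s#"}, {"city"}),
    ({"city"}, {"tax"}),
]

ex1 = partial_dependencies(("s#", "p#"), ["s#", "tax", "city", "p#", "qty"], fds)
ex2a = partial_dependencies(("s#",), ["s#", "tax", "city"], fds)
ex2b = partial_dependencies(("s#", "p#"), ["s#", "p#", "qty"], fds)

print("Ex1 partial dependencies:", ex1)
print("Ex2a partial dependencies:", ex2a)
print("Ex2b partial dependencies:", ex2b)
```

For Ex1 the sketch reports that s# alone determines city and tax, which is exactly the partial dependency that violates 2NF. Ex2a and Ex2b report none, so both pass the 2NF test.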
As a practical test: a relation R is in 3NF if, for every nontrivial functional dependency X -> A that holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute of R.

A transitive dependence occurs when r.a -> r.b and r.b -> r.c hold. Therefore, the transitive dependency r.a -> r.c also holds. This can be seen in the functional dependencies for Ex2a.

Ex2a(s#, tax, city)

Tax rate is dependent on city. City is dependent on s#. Therefore, tax rate is dependent on s# through city. In the diagram, this shows up as arrows that originate from places other than the key. This also gives anomalies.

Insert: we cannot enter that a city has a tax rate unless we have a supplier there. Again, this is because we cannot have a null primary key.
Delete: if there is only one supplier in a city, when we delete the supplier, we delete the tax information for that city.
Update: if we change the tax rate for a city, we must either 1. find all suppliers in that city and change the tax rate for each, or 2. change only one and have an inconsistent DB.

The solution is to break the relation into two relations. The point here is to get rid of the extra arrows, and make simple functional dependencies. So the two new relations are

Ex3a(s#, city)    (city is a foreign key referencing Ex3b)
Ex3b(city, tax)

Now the functional dependency diagrams are simple, there are no transitive dependencies, all attributes are fully dependent on the key, and the relations are in 3NF.

BCNF – Boyce-Codd Normal Form: A relation is in BCNF if and only if every determinant is a candidate key.

This is just like the 3NF test except that it is stricter: the exception that allows A to be a prime attribute of R has been removed. That is, a relation R is in BCNF if, whenever X -> A holds, X is a superkey of R (all attributes are functionally determined by keys). See the property example and discussion in the book (pages 491-495).

Basically, BCNF is 3NF without the exception that, in X -> A, A can be a prime attribute. That is,
if A is a prime attribute, X -> A is still a violation whenever X is not a superkey. A determinant is any attribute on which another attribute is functionally dependent.

Multivalued Dependencies – 4NF

A problem with multivalued dependencies occurs when you are trying to express 2 independent 1:N relations, or multi-valued attributes, in the same relation. For example, in your initial design process, you may have seen something like:

manager   phone#         manager   employee
Fred      999-1212       Fred      George
          999-1312                 Linda
          999-1313                 Ellen

where the manager is associated with (multidetermines) a set of phone numbers, and also a set of employees, but the phone numbers and the employees have nothing to do with each other.

Of course, you can’t have a relation that looks like the ones above; it is excluded by 1NF. You are trying to express the idea that the manager is associated with a set of employees, and a set of phone numbers, but that the employees and the phone numbers are independent of one another. So, you might design a relation that looks like

mgr    phone#      emp
Fred   999-1212    George
Fred   999-1312    Linda
Fred   999-1313    Ellen
etc.

But that implies a relationship (connection) between 999-1212 and George. To avoid that appearance, you would have to store all combinations of phone# and employee.

In general, the situation is two 1:N relations (or multivalued attributes), A:B and A:C, where B and C are independent of each other.

x->>y    x determines a set of values y
x->>z    x determines a set of values z

The only time a multi-valued dependency is a problem is when you have more than one mvd, and the y and z values are independent.

A trivial mvd is one where
1. The y attribute(s) are a subset of the x attributes. That is, if you made them distinct from each other, there would no longer be an mvd. E.g., abc->>b.
2. The union of x and y makes up the entire relation; there are no other attributes in the table.

Otherwise, you have a nontrivial mvd, and these are the potential problems. There are lots of redundancies, and so lots of room for anomalies.
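The cost of storing all combinations of phone# and employee can be seen in a small sketch. It reuses the Fred example from the text; the variable names are assumptions for illustration. The single combined table needs one row per (phone, employee) pair, while two separate tables need one row per phone and one row per employee, and joining those two tables on the manager gives back exactly the combined table.

```python
from itertools import product

phones = ["999-1212", "999-1312", "999-1313"]
employees = ["George", "Linda", "Ellen"]

# One table holding both independent facts: every phone must be paired with
# every employee so that no particular pairing appears meaningful.
combined = [("Fred", phone, emp) for phone, emp in product(phones, employees)]

# Two tables, one per independent multivalued fact.
mgr_phone = sorted({(mgr, phone) for mgr, phone, _ in combined})
mgr_emp = sorted({(mgr, emp) for mgr, _, emp in combined})

# Joining the two tables on the manager reproduces the combined table.
rejoined = {
    (mgr, phone, emp)
    for mgr, phone in mgr_phone
    for mgr2, emp in mgr_emp
    if mgr == mgr2
}

print(len(combined), "rows in the combined table")
print(len(mgr_phone) + len(mgr_emp), "rows in the two separate tables")
print("join reproduces the combined table:", rejoined == set(combined))
```

With three phones and three employees the combined table already needs nine rows where six would do, and adding a fourth phone number means inserting one row per employee.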
Note that relations with nontrivial mvds tend to be all-key relations, where the key is the entire relation.

The cure: 4NF

A relation is in 4NF if, for every nontrivial mvd x->>y, x is a superkey (any combination of attributes, not necessarily minimal, that is a unique ID for the relation).

The manager table used as an example above is not in 4NF. mgr->>phone# is a nontrivial mvd because phone# is not a subset of mgr, and mgr and phone# together do not make up the entire relation (there is also employee). Similarly, mgr->>employee is a nontrivial mvd.

As usual, the way to get a relation into 4NF is to decompose it, to get the mvds into separate relations:

(manager, phone#)
(manager, employee)

Each mvd is now trivial, because its attributes make up the entire table. This decomposition will have the lossless join property.

The Overall Idea

Remember that the goal here is to get a good design. Starting from an ER diagram is one way, although you still might want to check the normalization of the resulting tables. Starting with a bunch of tables (or with one enormous table) and then normalizing them is another approach.

We have been talking about normalization as something that you do to just one table in the database. It is also important to look at your DB design in terms of how the tables relate to each other, and how you can combine them. Merely having a bunch of tables in 3NF or BCNF is not enough.

Some definitions: In a database design, we have a decomposition D of the universal relation R. This is the way that all of the attributes have been decomposed into tables. There is a set of functional dependencies F that hold over the attributes of R; this depends on the semantics of the DB and how things work in the world it models.

Dependency Preserving Decomposition: In decomposition, it is possible to lose a functional dependency. This is undesirable, so a good decomposition will preserve dependencies. A dependency is preserved in one of two ways: all of its attributes appear together in the same table, or it can be inferred from the dependencies that hold in the different tables.
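Whether a decomposition preserves a given dependency can be checked mechanically. The sketch below uses the standard textbook procedure: start from the left side of the dependency and repeatedly add whatever can be derived while staying inside one table at a time. The function names and the set-based representation are assumptions for illustration, and the second decomposition is a deliberately poor, hypothetical split of Ex2a that is not proposed anywhere in these notes.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_preserved(fd, tables, fds):
    """True if fd can be enforced using only dependencies local to single tables."""
    lhs, rhs = fd
    reachable = set(lhs)
    changed = True
    while changed:
        changed = False
        for table in tables:
            # What the attributes we already have can determine inside this table.
            gained = closure(reachable & table, fds) & table
            if not gained <= reachable:
                reachable |= gained
                changed = True
    return rhs <= reachable


# Dependencies of Ex2a(s#, tax, city) from the text.
fds = [({"s#"}, {"city"}), ({"city"}, {"tax"})]

# The decomposition used in the text: Ex3a(s#, city) and Ex3b(city, tax).
good = [{"s#", "city"}, {"city", "tax"}]

# A hypothetical alternative split: (s#, city) and (s#, tax).
bad = [{"s#", "city"}, {"s#", "tax"}]

lost_in_good = [fd for fd in fds if not is_preserved(fd, good, fds)]
lost_in_bad = [fd for fd in fds if not is_preserved(fd, bad, fds)]

print("lost by Ex3a/Ex3b:", lost_in_good)
print("lost by the alternative split:", lost_in_bad)
```

The alternative split loses city -> tax: city and tax never appear in the same table, and nothing in the individual tables implies the dependency, so enforcing it would require joining the tables on every update.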
Lossless (Additive) Joins: Another important feature of a good decomposition is that it gives lossless joins. This is the problem of spurious tuples. The term “lossless” refers to the problem of losing information; the way that you lose information here is by getting noise (spurious tuples) into your table.

Properties of lossless join decomposition:
1. For 2-relation DB schemas: the attributes that appear in both relations must functionally determine either all the attributes that appear only in the first relation, or all those that appear only in the second relation.
2. Once you have established a decomposition with the lossless join property, you can further decompose one of its tables without losing the property.

So, to decompose and maintain lossless joins: for each table in the DB that isn’t in BCNF, find the functional dependency X -> Y that is in violation (that is, whose determinant X is not a candidate key), and break the relation into two. One relation contains the X and Y attributes from the functional dependency. The other contains X together with the rest of the attributes.

You can’t always perform the “ideal” decomposition, one that is in BCNF and also preserves dependencies. You may only be able to get to 3NF. You must then decide whether to leave it there, and build in protection against update anomalies, or to decompose even further, with the resulting loss of performance.

In terms of design, remember that it isn’t a good idea to design a table that will get too many nulls. It is better to move the affected attributes into another table. However, this can also result in the problem of dangling tuples. The representation of a “thing” is broken up into 2 tables. To get the full information on the “thing”, you join the tables together. However, if some tuples either have null values on a join attribute, or don’t appear at all in one table, they won’t appear in the result, unless you know in advance that you should do an outer join.
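Property 1 of lossless join decompositions, stated earlier in this section, can be turned into a short test for a two-table decomposition. The sketch below relies on attribute closure (everything a set of attributes determines under the given dependencies); the function names and the set-based representation are assumptions for illustration. The first decomposition is the Ex2a/Ex2b split from the 2NF example, and the second is a hypothetical split that shares only the city attribute.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def lossless_binary(r1, r2, fds):
    """Property 1: the shared attributes must determine all of r1-only or all of r2-only."""
    determined = closure(r1 & r2, fds)
    return (r1 - r2) <= determined or (r2 - r1) <= determined


# Dependencies stated for Ex1(s#, tax, city, p#, qty).
fds = [
    ({"s#", "p#"}, {"qty"}),
    ({"s#"}, {"city"}),
    ({"city"}, {"tax"}),
]

# Ex2a(s#, tax, city) and Ex2b(s#, p#, qty) share s#, which determines tax and city.
ex2_split = lossless_binary({"s#", "tax", "city"}, {"s#", "p#", "qty"}, fds)

# Hypothetical split sharing only city, which determines neither s# nor p#.
city_split = lossless_binary({"s#", "city"}, {"p#", "city"}, fds)

print("Ex2a / Ex2b lossless:", ex2_split)
print("split on city lossless:", city_split)
```

The Ex2a/Ex2b split passes because the shared attribute s# determines everything that appears only in Ex2a. The split on city fails the test, so joining those two tables back together can introduce spurious tuples.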