The Relational Model – Functional Dependencies & Normalization

Objectives
  Optimal database design: selection of appropriate relations/tables for a given set of attributes
  Minimize update anomalies:
    Redundancy
    Update
    Inconsistent data
    Additions
    Deletions

Definition of Anomaly
  Something that deviates from our expectations

Example
  CUSTNUMB  CUSTNAME   CUSTADDR      SNUMB  SLSRNAME
  123       Jones, R.  19 Oak St.    3      Adams, M.
  456       Lan, J.    4 Pine St.    6      Smith, R.
  461       Chu, W.    22 Main St.   12     Brown, M.
  489       Obie, S.   76 High St.   6      Smith, R.
  514       Wise, R    17 Birch St.  3      Adams, M.
  ...       ...        ...           ...    ...
  999       Side, E.   87 Bay St.    12     Brown, N.

Specific Anomalies In This Relation
  Redundancy: Why repeat the sales rep name for Adams in each record? Suppose Adams has 500 customers. Then Adams' name is repeated 500 times!
  Update: Suppose sales rep Mary Adams marries and changes her name. How many rows do we need to update?
  Inconsistent data: Notice that Brown's first initial varies: M, N.
  Additions: New sales rep J. Doe can't be entered until he has a customer.
  Deletions: Delete all customers of Adams, and we lose the name of sales rep Adams.

Decomposition Of Relations
  The previous table can be decomposed into the following two tables:

  CUSTNUMB  CUSTNAME   CUSTADDR      SNUMB
  123       Jones, R.  19 Oak St.    3
  456       Lan, J.    4 Pine St.    6
  461       Chu, W.    22 Main St.   12
  489       Obie, S.   76 High St.   6
  514       Wise, R    17 Birch St.  3
  ...       ...        ...           ...
  999       Side, E.   87 Bay St.    12

  SNUMB  SLSRNAME
  3      Adams, M.
  6      Smith, R.
  12     Brown, M.

Notice That This Decomposition Resolved All Database Anomalies
  Redundancy: none exists.
  Update: just change Mary Adams' last name (once) in the salesrep relation.
  Inconsistent data: impossible, since M. Brown's name appears only once!
  Additions: add new sales rep J. Doe to the salesrep relation.
  Deletions: we can delete all of Adams' customers and still have Adams in salesrep.

Conceptual Tools Needed For Decomposition
  Functional dependencies
  Lossless join decomposition
  Normal forms

Functional Dependencies

Common Issue in Designing a New Database From Existing Data
  We have obtained one or more tables of existing data (such as from a spreadsheet or extracts from an existing corporate database).
  The data is to be stored in a new database.
  DATABASE DESIGN QUESTION: Should the data be stored as received, or should it be transformed for storage?

Should We Combine ORDER_ITEM and SKU_DATA into One Table (SKU_DATA)?
  Should we store these two tables as they are, or should we combine them into one table in our new database?

But First
  We need to understand:
    The relational model
    Relational model terminology

The Relational Model
  Introduced in 1970
  Created by E.F. Codd
  He was an IBM engineer
  The model used mathematics known as “relational algebra”
  Now the standard model for commercial DBMS products

Important Relational Model Terms
  Entity
  Relation
  Functional Dependency
  Determinant
  Candidate Key
  Composite Key
  Primary Key
  Surrogate Key
  Foreign Key
  Referential integrity constraint
  Normal Form
  Multivalued Dependency (new for us)

Entity
  An entity is some identifiable thing that users want to track:
    Customers
    Computers
    Sales

Relations
  A relation is a two-dimensional table that has the following characteristics:
    Rows contain data about an entity.
    Columns contain data about attributes of the entity.
    All entries in a column are of the same kind.
    Each column has a unique name.
    Cells of the table hold a single value.
    The order of the columns is unimportant.
    The order of the rows is unimportant.
    No two rows may be identical.

A Typical Relation

Tables That Are Not Relations: Multiple Entries per Cell

Tables That Are Not Relations: Table with Required Row Order

A Valid Relation with Values of Different Length

An INVALID Relation
  Cells in a valid relation are supposed to hold a single value, but the Phone cells for Employees 400 and 700 hold multiple phone numbers.

Alternative Terminology
  Although not all tables are relations, as we have seen on the previous slides, the terms table and relation are generally used interchangeably.
  The following sets of terms are equivalent:

Functional Dependency
  A functional dependency occurs when the value of one (set of) attribute(s) determines the value of a second (set of) attribute(s):
    StudentID ----> StudentName
    StudentID ----> (DormName, DormRoom, Fee)
  The attribute on the left side of the functional dependency is called the determinant.
  Functional dependencies may be based on equations:
    ExtendedPrice = Quantity X UnitPrice
    (Quantity, UnitPrice) ----> ExtendedPrice
  But functional dependencies are definitely not equations!

Functional Dependencies Are Not Equations: An Example
  We can deduce the following set of functional dependencies from the above diagram:
    ObjectColor ----> Weight
    ObjectColor ----> Shape
    ObjectColor ----> (Weight, Shape)
  But does Shape functionally determine anything? (NO!)

Composite Determinants
  Composite determinant: a determinant of a functional dependency that consists of more than one attribute.
    (StudentName, ClassName) ----> (Grade)

Functional Dependency Rules (Not a complete list)
  If A ----> (B, C), then A ----> B and A ----> C.
  If (A, B) ----> C, then we cannot conclude that either A or B determines C by itself.

Functional Dependency Review
  A functional dependency occurs when the value of one (set of) attribute(s) determines the value of a second (set of) attribute(s):
    StudentID ----> StudentName
    StudentID ----> (DormName, DormRoom, Fee)
  The attribute on the left side of the functional dependency is called the determinant; the attribute on the right side is called the dependent.
  Functional dependencies may be based on equations:
    ExtendedPrice = Quantity X UnitPrice
    (Quantity, UnitPrice) ----> ExtendedPrice
  Functional dependencies are not equations.

Composite Determinants
  Composite determinant: a determinant of a functional dependency that consists of more than one attribute.
  Example of a composite determinant:
    (StudentName, ClassName) ----> (Grade)

Find the Functional Dependencies in the SKU_DATA Table
  Ask yourself the question: if we know the value of a particular attribute, will that value determine a unique value of some other attribute? (If “yes,” then we have a functional dependency between the attributes.)

Functional Dependencies in the SKU_DATA Table
  SKU ----> (SKU_Description, Department, Buyer)
  SKU_Description ----> (SKU, Department, Buyer)
  Buyer ----> Department

Find the Functional Dependencies in the ORDER_ITEM Table

Functional Dependencies in the ORDER_ITEM Table
  (OrderNumber, SKU) ----> (Quantity, Price, ExtendedPrice)
    Note that OrderNumber by itself does not functionally determine any other attribute.
    While SKU, from the data, does appear to functionally determine Price, we always need to be very careful in making inferences from data. Prices may change in the future, and the price might often be tied to a particular order. So we prefer to use the composite of SKU and OrderNumber as the determinant, rather than SKU by itself.
  (Quantity, Price) ----> (ExtendedPrice)
    Note that this is derived from the equation ExtendedPrice = Quantity * Price.

When Are Determinant Values Unique?
  A determinant has unique values (i.e., all values are different) in a relation if, and only if, it functionally determines every other attribute in the relation.
  So, in SKU_DATA, SKU has all different (unique) values, and it functionally determines every attribute in the table.
  On the other hand, Buyer, though a determinant, does not have unique values, and does not functionally determine all the other attributes in the relation.
  So, you cannot find the determinants of all functional dependencies simply by looking for unique values in one column.

  A     B     C     D     E
  a(1)  b(1)  c(1)  d(1)  e(1)
  a(1)  b(1)  c(2)  d(1)  e(1)
  a(2)  b(1)  c(1)  d(1)  e(1)
  a(2)  b(2)  c(1)  d(2)  e(1)
  a(2)  b(2)  c(2)  d(3)  e(2)

  BC ----> D (True or False?)
  B ----> A (True or False?)
  D ----> BE (True or False?)
  AB ----> C (True or False?)

The Answers
  BC ----> D: True
  B ----> A: False (b(1) appears with both a(1) and a(2))
  D ----> BE: True
  AB ----> C: False (a(1), b(1) appears with both c(1) and c(2))

Deducing Functional Dependencies
  Since BC ----> D and D ----> BE, can we conclude that BC ----> BE? YES! (We will call this transitivity.)
  If BC ----> D and BC ----> A, can we conclude that D ----> A? NO! Nor can we conclude A ----> D.

Superkeys & FD's
  A superkey is an attribute or a set of attributes that identifies an entity UNIQUELY.
  In a relation (table), a SUPERKEY is any column or set of columns whose values can be used to distinguish one row from another.
  Since a superkey identifies each item uniquely, it functionally determines all the attributes of a relation.
    STUID is a superkey
    SOCSEC is a superkey
    STUNAME is NOT a superkey
    STUID, STUNAME IS a superkey
    STUID, ANY OTHER SET OF ATTRIBUTES is a superkey

The Formal Theory Definition Of A Superkey
  A set of attributes K is a superkey of relation (table) R if K ----> R.
  In other words, a superkey functionally determines all the attributes in R.

More On Superkeys
  A superkey X is a candidate key if it is minimal, i.e., X minus {any attribute of X} is NOT a superkey.
  A primary key is a candidate key which we choose to be THE "key."

Superkeys, Candidate Keys And Primary Keys
  Superkey: a set of attributes which functionally determines all of the attributes in the relation.
  Candidate key: from the set of superkeys, we eliminate all those superkeys which have "extra" attributes (a superkey has an "extra" attribute if, when we remove this attribute, the resulting set of attributes is also a superkey).
  Primary key: if there is more than 1 candidate key, then the candidate key we choose for THE key is called the primary key; if there is exactly 1 candidate key, then that candidate key is the primary key.

Example - Obtain Candidate Keys
  Consider the following scheme from an airline database system:
    ( P (pilot), F (flight#), D (date), T (scheduled time to depart) )
  We have the following FD's:
    F ----> T
    PDT ----> F
    FD ----> P
  Provide some superkeys: PDT is a superkey, and FD is a superkey.
  Is PDT a candidate key? PD is not a superkey, nor is DT, nor is PT. So, PDT is a candidate key.
  FD is also a candidate key, since neither F nor D is a superkey.

Surrogate Keys
  A surrogate key is an artificial attribute/column added to a relation to serve as a primary key:
    Often DBMS supplied
    Short, numeric and never changes: an ideal primary key!
    Has artificial values that are meaningless to users
    Normally hidden in forms and reports

Example of Surrogate Keys
  (NOTE: The primary key of the relation is underlined below)
  RENTAL_PROPERTY without surrogate key:
    RENTAL_PROPERTY (Street, City, State/Province, Zip/PostalCode, Country, Rental_Rate)
  RENTAL_PROPERTY with surrogate key:
    RENTAL_PROPERTY (PropertyID, Street, City, State/Province, Zip/PostalCode, Country, Rental_Rate)

Trivial FD's
  A functional dependency is defined to be trivial if it is satisfied by every relation.
  Example of a trivial functional dependency: AB ----> A is satisfied by every relation involving A and B.

Trivial FD's
  Generalization and rule for trivial FD's: an FD is trivial if it has the form X ----> Y, where Y is a subset of X.
  So, ABCD ----> ABC is a trivial FD.
  A trivial FD does not make a significant statement about real world constraints, so we are only interested in non-trivial FD's.

Another FD “Rule”
  If (A, B) ----> C, then we cannot conclude that either A or B by itself functionally determines C.

Normal Forms
  There are numerous "normal forms," which are categorizations based upon the kinds of “problems” that relations have.
  These will be discussed:
    First Normal Form (1NF)
    Second Normal Form (2NF)
    Third Normal Form (3NF)
    Boyce-Codd Normal Form (BCNF)

First Normal Form
  A relation is in first normal form (1NF) iff every attribute in every row contains only a single value.
  A 1NF relation cannot have any row that contains a repeating group of attribute values.

Example Of A Relation Not In 1NF
  Ordnumb  Orddte  Partnumb  Numbord
  12489    30109   AX12      11
  12491    30209   BT04      1
                   BZ66      1
  12495    30409   CX11      2

  We can convert the above table to 1NF by flattening:

  Ordnumb  Orddte  Partnumb  Numbord
  12489    30109   AX12      11
  12491    30209   BT04      1
  12491    30209   BZ66      1
  12495    30409   CX11      2

Second Normal Form
  Definition: an attribute is a non-key attribute if it is not a part of the primary key.
  Definition: a relation is in second normal form (2NF) if it is in first normal form and no non-key attribute is dependent on only a portion of the primary key (which can only happen when the primary key is composite, consisting of 2 or more attributes).

Example Of A Relation In 1NF, But Not 2NF
  Ordnumb  Orddte  Partnumb  PartDesc  Numbord  Quoprice
  12489    90509   AX12      MOUSE     11       14.95
  12491    90509   BT04      DRV270G   1        120.99
  12491    90509   BZ66      DRV180G   1        80.95
  12495    90709   AX12      MOUSE     4        14.95

  The following FD's hold on this relation:
    Ordnumb ----> Orddte
    Partnumb ----> PartDesc
    Ordnumb, Partnumb ----> Numbord, Quoprice
  The relation is NOT in 2NF because PartDesc is dependent on only a portion of the primary key, and similarly for Orddte.

Transform Relation To 2NF
  First, take each subset of the set of attributes which make up the primary key, and begin a relation with this subset as its primary key:
    (Ordnumb)
    (Partnumb)
    (Ordnumb, Partnumb)
  Then, place each of the other attributes with the appropriate primary key, i.e., place each one with the minimal collection on which it depends:
    (Ordnumb, Orddte)
    (Partnumb, PartDesc)
    (Ordnumb, Partnumb, Numbord, Quoprice)

Third Normal Form
  A relation is in Third Normal Form (3NF) iff it is in Second Normal Form and there is no non-key attribute which is functionally dependent upon another non-key attribute in any functional dependency ("each non-key attribute must depend upon the key, the whole key, and nothing but the key").

Example Of Relation In 2NF, But Not 3NF
  Consider STUDENT (STUID, STUNAME, MAJOR, CREDITS, FSJS) with the following FD's:
    STUID ----> STUNAME, MAJOR, CREDITS, FSJS
    CREDITS ----> FSJS
  Since the non-key attribute FSJS depends on the non-key attribute CREDITS, STUDENT is not in 3NF.
  To create 3NF here, form a new relation (STATS) with the functionally dependent attribute and its determinant:
    R1: STU2 (STUID, STUNAME, MAJOR, CREDITS)
    R2: STATS (CREDITS, FSJS)

Boyce-Codd Normal Form (BCNF)
  Reminder: a determinant is an attribute (or collection of attributes) that functionally determines another attribute (or set of attributes), i.e., it is the LHS of a functional dependency.
  Example: in sosec ----> stuname, sosec is a determinant.
  Def.: A relation is in Boyce-Codd normal form if every determinant is a candidate key.

Another Example Of 2NF Relation (Not In 3NF And Not In BCNF)
  GIVEN: PC (TAGNUM, COMPID, EMPNUM, EMPNAME, LOCATION) and the following functional dependencies:
    FD1: TAGNUM ----> COMPID, EMPNUM, EMPNAME, LOCATION
    FD2: EMPNUM ----> EMPNAME

This Relation Satisfies 2NF, But Not 3NF Or BCNF
  TAGNUM  COMPID  EMPNUM  EMPNAME       LOCATION
  32808   M759    611     DINH, M.      ACCOUNTING
  37691   B121    124     ALVAREZ, R    SALES
  57772   C007    567     FEINSTEIN, B  INFO SYSTEMS
  59836   B221    124     ALVAREZ, R    HOME
  77740   M759    567     FEINSTEIN, B  HOME

Some Anomalies Present In This Relation
  UPDATE: If Betty Feinstein gets married, we must change more than 1 record.
  INCONSISTENT DATA: Potential problem due to redundancy.
  ADDITIONS: New employee 347 cannot be added until a PC is assigned.

Why Is The PC Relation Not In 3NF Or Boyce Codd Normal Form?
  1) It is in 2NF (there is no non-key attribute dependent on only a portion of the primary key, since the primary key consists of only 1 attribute).
  2) The primary key is TAGNUM.
  3) The only candidate key is TAGNUM.
  4) There are 2 determinants: TAGNUM and EMPNUM.
  5) Since EMPNUM is a determinant but not a candidate key, the relation is not in BCNF. And it's not in 3NF either, because the non-key attribute EMPNAME depends on the non-key attribute EMPNUM.

Changing Our PC Relation To 3NF
  PC (TAGNUM, COMPID, EMPNUM, EMPNAME, LOCATION)
  is replaced by
  PC (TAGNUM, COMPID, EMPNUM, LOCATION) and EMPLOYEE (EMPNUM, EMPNAME)

Transforming A 3NF Relation To BCNF
  1) For each determinant that is not a candidate key, remove from the relation the attributes which are functionally determined by this determinant.
  2) Create a new table containing the determinant and all the attributes from the original relation which were functionally determined by it.
  3) Make the determinant the primary key of this new relation.

Important Points
  A relation in 3NF may or may not be in Boyce Codd Normal Form.
  BUT, a relation in Boyce Codd Normal Form will ALWAYS be in 3NF.
  {Some textbooks consider Boyce Codd Normal Form to be "the" third Normal Form. Ours does not.}

Example Of A Relation In 3NF Which Is NOT In BCNF
  Suppose that, in a given university:
  1. Students may have one or more majors.
  2. A major may have several faculty members as advisers.
  3. A faculty member can advise in only one major area.

  SID  MAJOR       FACNAME
  100  Math        Cauchy
  150  Psychology  Jung
  200  Math        Riemann
  250  Math        Cauchy
  300  Psychology  Perls
  300  Math        Riemann

Things To Note From This Example
  The primary key is not SID!! The primary key consists of two attributes: SID and MAJOR.
  There is an important functional dependency corresponding to the statement "A faculty member can advise students in only one major area":
    FACNAME ----> MAJOR
  The relation IS in 2NF, since there are no non-key attributes dependent on only a portion of the primary key.
  The relation is in 3NF, since the only non-key attribute, FACNAME, does not depend on another non-key attribute. It is NOT in BCNF, since FACNAME is a determinant but not a candidate key.
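
The claims about the SID / MAJOR / FACNAME relation can be checked against its six sample rows. The following is a minimal Python sketch, not part of the original slides: the helper name holds and the variable advising are illustrative assumptions. It tests whether any two rows agree on the left side of a proposed FD but differ on the right side.

```python
def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key;
        # a later, different value means the FD is violated.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample rows copied from the advising table above.
advising = [
    {"SID": 100, "MAJOR": "Math", "FACNAME": "Cauchy"},
    {"SID": 150, "MAJOR": "Psychology", "FACNAME": "Jung"},
    {"SID": 200, "MAJOR": "Math", "FACNAME": "Riemann"},
    {"SID": 250, "MAJOR": "Math", "FACNAME": "Cauchy"},
    {"SID": 300, "MAJOR": "Psychology", "FACNAME": "Perls"},
    {"SID": 300, "MAJOR": "Math", "FACNAME": "Riemann"},
]

print(holds(advising, ["FACNAME"], ["MAJOR"]))         # True
print(holds(advising, ["SID"], ["MAJOR"]))             # False: SID 300 has two majors
print(holds(advising, ["SID", "MAJOR"], ["FACNAME"]))  # True
```

As with the ORDER_ITEM example, sample data can only refute a functional dependency. A result of True means the rows are consistent with the FD, not that the business rule is guaranteed.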

The ADVISOR Relation Transformed To Boyce Codd Normal Form
  STU-ADV (SID, FACNAME)

  SID  FACNAME
  100  Cauchy
  150  Jung
  200  Riemann
  250  Cauchy
  300  Perls
  300  Riemann

  ADV-MAJOR (FACNAME, MAJOR)

  FACNAME  MAJOR
  Cauchy   Math
  Jung     Psychology
  Riemann  Math
  Perls    Psychology

Going Directly to BCNF

Example 1 of Going Directly to BCNF
  The SKU_DATA table

Working Through The Example
  SKU_DATA (SKU, SKU_Description, Department, Buyer)
  Identify the FDs:
    a) SKU ----> (SKU_Description, Department, Buyer)
    b) SKU_Description ----> (SKU, Department, Buyer)
    c) Buyer ----> Department
  SKU and SKU_Description are candidate keys; Buyer is NOT a candidate key, so SKU_DATA is not in BCNF.
  Placing the columns of the problem FD (c) into a separate relation, with the determinant Buyer as the primary key, and making Buyer a foreign key in the SKU_DATA relation, we obtain:
    SKU_DATA2 (SKU, SKU_Description, Buyer)
    BUYER (Buyer, Department)
    Where SKU_DATA2.Buyer must exist in BUYER.Buyer

The Resulting Populated SKU_DATA2 and BUYER Relations, in BCNF

Example 2 of Going Directly to BCNF
  The EQUIPMENT_REPAIR table

Working Through The Example
  EQUIPMENT_REPAIR (ItemNumber, Type, AcquisitionCost, RepairNumber, RepairDate, RepairAmount)
  Identify the FDs:
    a) ItemNumber ----> (Type, AcquisitionCost)
    b) RepairNumber ----> (ItemNumber, Type, AcquisitionCost, RepairDate, RepairAmount)
  RepairNumber is a candidate key; ItemNumber is NOT a candidate key, so EQUIPMENT_REPAIR is not in BCNF.
  Placing the columns of the problem FD (a) into a separate relation, with the determinant ItemNumber as the primary key, and making ItemNumber a foreign key in the REPAIR relation, we obtain:
    ITEM (ItemNumber, Type, AcquisitionCost)
    REPAIR (RepairNumber, RepairDate, RepairAmount, ItemNumber)
    Where REPAIR.ItemNumber must exist in ITEM.ItemNumber

The Resulting Populated REPAIR and ITEM Relations, in BCNF

Summary Of Normal Forms We Have Covered
  1NF – A table that qualifies as a relation is in 1NF.
  2NF – A relation is in 2NF if all of its non-key attributes are dependent on all of the primary key.
  3NF – A relation is in 3NF if it is in 2NF and there is no non-key attribute which is functionally dependent upon another non-key attribute in any functional dependency; equivalently, no non-key attribute has a determinant other than a candidate key; or, equivalently, there are no transitive dependencies (i.e., there are no FDs A ----> B and B ----> C where B and C are non-key attributes).
  Boyce-Codd Normal Form (BCNF) – A relation is in BCNF if every determinant is a candidate key.
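
The candidate-key search from the airline example and the BCNF test from the SKU_DATA example can both be mechanized with an attribute-closure routine. The sketch below is illustrative only: the function names closure, candidate_keys and bcnf_violations are assumptions introduced here, not part of the slides, and the brute-force key search is only practical for small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """Return the minimal attribute sets whose closure covers the schema."""
    schema = set(schema)
    keys = []
    # Try smaller sets first so any superset of a found key can be skipped.
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            combo = set(combo)
            if any(key <= combo for key in keys):
                continue  # has an "extra" attribute, so not minimal
            if schema <= closure(combo, fds):
                keys.append(combo)
    return keys


def bcnf_violations(schema, fds):
    """Return the non-trivial FDs whose determinant is not a superkey."""
    schema = set(schema)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and not schema <= closure(lhs, fds)
    ]


# Airline scheme from the candidate-key example: F -> T, PDT -> F, FD -> P.
AIRLINE = ("P", "F", "D", "T")
AIRLINE_FDS = [(("F",), ("T",)), (("P", "D", "T"), ("F",)), (("F", "D"), ("P",))]

# SKU_DATA from the first "going directly to BCNF" example.
SKU_DATA = ("SKU", "SKU_Description", "Department", "Buyer")
SKU_FDS = [
    (("SKU",), ("SKU_Description", "Department", "Buyer")),
    (("SKU_Description",), ("SKU", "Department", "Buyer")),
    (("Buyer",), ("Department",)),
]

print([sorted(k) for k in candidate_keys(AIRLINE, AIRLINE_FDS)])
# [['D', 'F'], ['D', 'P', 'T']]
print(bcnf_violations(SKU_DATA, SKU_FDS))
# [(('Buyer',), ('Department',))]
```

The output agrees with the worked examples: FD and PDT are the candidate keys of the airline scheme, and Buyer ----> Department is the one FD that keeps SKU_DATA out of BCNF.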