CS 405G: Introduction to Database Systems Lecture 9: Normal Forms Instructor: Chen Qian Mid project report due after Spring Break. 7/1/2016 Chen Qian @ University of Kentucky 2 Database Normalization Database normalization relates to the level of redundancy in a relational database’s structure. The key idea is to reduce the chance of having multiple different version of the same data. Well-normalized databases have a schema that reflects the true dependencies between tracked quantities. Any increase in normalization generally involves splitting existing tables into multiple ones, which must be re-joined each time a query is issued. 7/1/2016 Chen Qian @ University of Kentucky 3 Normalization A normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. A normal form is a certification that tells whether a relation schema is in a particular state 7/1/2016 Chen Qian @ University of Kentucky 4 Normal Forms Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. 3NF is widely considered to be sufficient. Normalizing beyond 3NF can be tricky with current SQL technology as of 2005 Full normalization is considered a good exercise to help discover all potential internal database consistency problems. 7/1/2016 Chen Qian @ University of Kentucky 5 First Normal Form ( 1NF ) NF is to characterize a relation (not an attribute, a key, etc…) We can only say “this relation or table is in 1NF” A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain. 7/1/2016 Chen Qian @ University of Kentucky 6 7/1/2016 Chen Qian @ Univ of Kentucky 2nd Normal Form An attribute A of a relation R is a nonprimary attribute if it is not part of any key in R, otherwise, A is a primary attribute. R is in (general) 2nd normal form if every nonprimary attribute A in R is not partially functionally dependent on any key of R 7/1/2016 Chen Qian @ University of Kentucky 8 Redundancy Example If a key will result a partial dependency of a nonprimary attribute. e.g. EID, PID-> Ename In this case, the attribute (Ename) should be separated with its full dependency key (EID) to be a new table. So, to check whether a table includes redundancy. Try every nonprimary attribute and check whether it fully depends on any key. 7/1/2016 Chen Qian @ University of Kentucky 9 7/1/2016 Chen Qian @ Univ of Kentucky Second normal Form ( 2NF ) 2NF prescribes full functional dependency on the primary key. It most commonly applies to tables that have composite primary keys, where two or more attributes comprise the primary key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (subset) of a candidate key. A table is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key 7/1/2016 Chen Qian @ University of Kentucky 11 Decomposition EID PID Ename email Pname Hours 1234 10 John Smith jsmith@ac.com B2B platform 10 1123 9 Ben Liu bliu@ac.com CRM 40 1234 9 John Smith jsmith@ac.com CRM 30 1023 10 Susan Sidhuk Decomposition ssidhuk@ac.com B2B platform 40 Foreign key EID Ename email EID PID Pname Hours 1234 John Smith jsmith@ac.com 1234 10 B2B platform 10 1123 Ben Liu bliu@ac.com 1123 9 CRM 40 1023 Susan Sidhuk ssidhuk@ac.com 1234 9 CRM 30 1023 10 B2B platform 40 Decomposition eliminates redundancy To get back to the original relation, use natural join. 7/1/2016 Chen Qian @ University of Kentucky 12 Decomposition Decomposition may be applied recursively 7/1/2016 EID PID Pname Hours 1234 10 B2B platform 10 1123 9 CRM 40 1234 9 CRM 30 1023 10 B2B platform 40 PID Pname EID PID Hours 10 B2B platform 1234 10 10 9 CRM 1123 9 40 1234 9 30 1023 10 40 Chen Qian @ University of Kentucky 13 Unnecessary decomposition EID Ename email 1234 John Smith jsmith@ac.com 1123 Ben Liu bliu@ac.com 1023 Susan Sidhuk ssidhuk@ac.com EID Ename EID email 1234 John Smith 1234 jsmith@ac.com 1123 Ben Liu 1123 bliu@ac.com 1023 Susan Sidhuk 1023 ssidhuk@ac.com Fine: join returns the original relation Unnecessary: no redundancy is removed, and now EID is stored twice-> 7/1/2016 Chen Qian @ University of Kentucky 14 Bad decomposition EID PID Hours 1234 10 10 1123 9 40 1234 9 30 1023 10 40 EID PID EID Hours 1234 10 1234 10 1123 9 1123 40 1234 9 1234 30 1023 10 1023 40 Association between PID and hours is lost Join returns more rows than the original relation 7/1/2016 Chen Qian @ University of Kentucky 15 Lossless join decomposition Decompose relation R into relations S and T attrs(R) = attrs(S) attrs(T) S = πattrs(S) ( R ) T = πattrs(T) ( R ) The decomposition is a lossless join decomposition if, given known constraints such as FD’s, we can guarantee that R = S T Any decomposition gives R S T (why?) A lossy decomposition is one with R S T 7/1/2016 Chen Qian @ University of Kentucky 16 Loss? But I got more rows-> “Loss” refers not to the loss of tuples, but to the loss of information Or, the ability to distinguish different original tuples 7/1/2016 EID PID Hours 1234 10 10 1123 9 40 1234 9 30 1023 10 40 EID PID EID Hours 1234 10 1234 10 1123 9 1123 40 1234 9 1234 30 1023 10 1023 40 Chen Qian @ University of Kentucky 17 Questions about decomposition When to decompose How to come up with a correct decomposition (i.e., lossless join decomposition) 7/1/2016 Chen Qian @ University of Kentucky 18 Third normal form • 3NF requires that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. • Recall: non-trivial FD means LHS has no intersection with RHS. • In summary, all non-key attributes are mutually independent. 7/1/2016 Chen Qian @ University of Kentucky 19 Ename and email has no FD 7/1/2016 EID Ename email 1234 John Smith jsmith@ac.com 1123 Ben Liu bliu@ac.com 1023 Susan Sidhuk ssidhuk@ac.com Chen Qian @ University of Kentucky 20 Boyce-Codd normal form (BCNF) • BCNF requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). • All attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). 7/1/2016 Chen Qian @ University of Kentucky 21 • A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, leftirreducible functional dependency has a candidate key as its determinant. • In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys. 7/1/2016 Chen Qian @ University of Kentucky 22 Non-key FD’s Consider a non-trivial FD X -> Y where X is not a super key Since X is not a super key, there are some attributes (say Z) that are not functionally determined by X X Y Z a b c a b d That b is always associated with a is recorded by multiple rows: redundancy, update anomaly, deletion anomaly 7/1/2016 Chen Qian @ University of Kentucky 23 Dealing with Nonkey Dependency: BCNF A relation R is in Boyce-Codd Normal Form if When to decompose For every non-trivial FD X -> Y in R, X is a super key That is, all FDs follow from “key -> other attributes” As long as some relation is not in BCNF How to come up with a correct decomposition Always decompose on a BCNF violation (details next) Then it is guaranteed to be a lossless join decomposition 7/1/2016 Chen Qian @ University of Kentucky 24 BCNF decomposition algorithm Find a BCNF violation Decompose R into R1 and R2, where That is, a non-trivial FD X -> Y in R where X is not a super key of R R1 has attributes X Y R2 has attributes X Z, where Z contains all attributes of R that are in neither X nor Y (i.e. Z = attr(R) – X – Y) Repeat until all relations are in BCNF 7/1/2016 Chen Qian @ University of Kentucky 25 BCNF decomposition example WorkOn (EID, Ename, email, PID, hours) BCNF violation: EID -> Ename, email Student (EID, Ename, email) BCNF 7/1/2016 Grade (EID, PID, hours) BCNF Chen Qian @ University of Kentucky 26 Another example WorkOn (EID, Ename, email, PID, hours) BCNF violation: email -> EID StudentID (email, EID) BCNF StudentGrade’ (email, Ename, PID, hours) BCNF violation: email -> Ename StudentName (email, Ename) Grade (email, PID, hours) BCNF BCNF 7/1/2016 Chen Qian @ University of Kentucky 27 Exercise Property(Property_id#, County_name, Lot#, Area, Price, Tax_rate) Property_id#-> County_name, Lot#, Area, Price, Tax_rate County_name, Lot# -> Property_id#, Area, Price, Tax_rate County_name -> Tax_rate area -> Price 7/1/2016 Chen Qian @ University of Kentucky 28 Exercise Property(Property_id#, County_name, Lot#, Area, Price, Tax_rate) BCNF violation: County_name -> Tax_rate LOTS1 (County_name, Tax_rate ) BCNF LOTS2 (Property_id#, County_name, Lot#, Area, Price) BCNF violation: Area -> Price LOTS2A (Area, Price) BCNF LOTS2B (Property_id#, County_name, Lot#, Area) BCNF 7/1/2016 Chen Qian @ University of Kentucky 29 Why is BCNF decomposition lossless Given non-trivial X -> Y in R where X is not a super key of R, need to prove: Anything we project always comes back in the join: R πXY ( R ) πXZ ( R ) Sure; and it doesn’t depend on the FD Anything that comes back in the join must be in the original relation: R πXY ( R ) πXZ ( R ) Proof makes use of the fact that X -> Y 7/1/2016 Chen Qian @ University of Kentucky 30 Recap Functional dependencies: a generalization of the key concept Partial dependencies: a source of redundancy Use 2nd Normal form to remove partial dependency Non-key functional dependencies: a source of redundancy BCNF decomposition: a method for removing ALL functional dependency related redundancies Plus, BNCF decomposition is a lossless join decomposition 7/1/2016 Chen Qian @ University of Kentucky 31 Normalization There is a sequence to normal forms: 1NF is considered the weakest, 2NF is stronger than 1NF, 3NF is stronger than 2NF, and BCNF is considered the strongest Also, any relation that is in BCNF, is in 3NF; any relation in 3NF is in 2NF; and any relation in 2NF is in 1NF. 7/1/2016 32 Normalization 1NF a relation in BCNF, is also in 3NF 2NF a relation in 3NF is also in 2NF 3NF BCNF 7/1/2016 a relation in 2NF is also in 1NF 33 First Normal Form The following is not in 1NF EmpNum 123 333 679 EmpPhone 233-9876 233-1231 233-1231 EmpDegrees BA, BSc, PhD BSc, MSc EmpDegrees is a multi-valued field: employee 679 has two degrees: BSc and MSc employee 333 has three degrees: BA, BSc, PhD 7/1/2016 34 First Normal Form EmployeeDegree Employee EmpNum 123 333 679 EmpPhone 233-9876 233-1231 233-1231 EmpNum EmpDegree 333 BA 333 333 679 679 BSc PhD BSc MSc An outer join between Employee and EmployeeDegree will produce the information we saw before 7/1/2016 35 Second Normal Form Consider this InvLine table (in 1NF): InvNum LineNum ProdNum InvNum, LineNum InvNum There are two candidate keys. LineNum InvDate Qty is the only nonkey attribute, and it is dependent on InvNum Since there is a determinant that is not a candidate key, InvLine is not BCNF InvLine is not 2NF since there is a partial dependency of InvDate on InvNum 7/1/2016 InvDate ProdNum Qty InvNum, ProdNum Qty InvLine is only in 1NF 36 Second Normal Form InvLine InvNum LineNum ProdNum Qty InvDate The above relation has redundancies: the invoice date is repeated on each invoice line. We can improve the database by decomposing the relation into two relations: 7/1/2016 InvNum LineNum InvNum InvDate ProdNum Qty 37 Third Normal Form Consider this Employee relation EmpNum EmpName DeptNum Candidate keys are? … DeptName EmpName, DeptNum, and DeptName are non-key attributes. DeptNum determines DeptName, a non-key attribute, and DeptNum is not a candidate key. Is the relation in 3NF? … no Is the relation in BCNF? … no Is the relation in 2NF? … yes 7/1/2016 38 Third Normal Form EmpNum EmpName DeptNum DeptName We correct the situation by decomposing the original relation into two 3NF relations. Note the decomposition is lossless. EmpNum EmpName DeptNum DeptNum DeptName Verify these two relations are in 3NF. Are they in BCNF? 7/1/2016 39 Boyce-Codd Normal Form Boyce-Codd Normal Form BCNF is defined very simply: a relation is in BCNF if it is in 1NF and if every determinant is a candidate key. 7/1/2016 40 In 3NF, but not in BCNF: Instructor teaches one course only. student_no course_no instr_no Student takes a course and has one instructor. {student_no, course_no} instr_no instr_no course_no since we have instr_no course-no, but instr_no is not a Candidate key. 7/1/2016 41 student_no course_no instr_no student_no instr_no course_no instr_no {student_no, instr_no} student_no {student_no, instr_no} instr_no instr_no course_no 7/1/2016 42 Boyce-Codd Normal Form InvNum LineNum InvNum, LineNum ProdNum ProdNum Qty InvNum, ProdNum Qty {InvNum, LineNum} and {InvNum, ProdNum} are the two candidate keys. LineNum There are two candidate keys. Since every determinant is a candidate key, the relation is in BCNF This relation is about Invoice lines only. 7/1/2016 43 inv_no 7/1/2016 line_no prod_no prod_desc qty 44 2NF, but not in 3NF, nor in BCNF: inv_no line_no prod_no prod_desc qty since prod_no is not a candidate key and we have: prod_no prod_desc. 7/1/2016 45 EmployeeDept ename 7/1/2016 ssn bdate address dnumber dname 46 Summary Philosophy behind BCNF, 4NF: Data should depend on the key, the whole key, and nothing but the key! Philosophy behind 3NF: … But not at the expense of more expensive constraint enforcement! 7/1/2016 47