Normalisation

Reading: Connolly and Begg, chapters 13 & 14 (4th ed)

[Diagram: Relation (A, B, C, D, E, F), labelled "1NF?", split into Relation1 (A, B), Relation2 (A*, C, D, E*) and Relation3 (E, F). Caption: "Help me Codd!!"]


Normalisation

From this…

[Table: a single unnormalised relation with the columns id, lecturer_name, lecturer_address, qual, position, requires, title, result, code, student_address, sex, student_name and regno, in which the same lecturer and student values are repeated on many rows]

…to this

In 3+ easy(?) steps


What is normalisation?

A method for database design
– Takes a set of attributes and derives the relational model
– By separating out the required tables
Theory examines how "good" a schema is
– Transform non-normalised schemas
– Minimise storage
Completely different approach to ERM
– But should get the same result
A minimum of 3 steps are used:
– For each stage, the normal form gets stronger (i.e. removes redundancy) so less open to update anomalies
All based on functional dependencies


Functional Dependency

Underpins the normalisation process
If every value of column A uniquely determines the value in column B, then
– B is functionally dependent on A (B depends on A)
– A determines B, or, formally, A → B (A is called the determinant)
For example,
– EmpID → Age, Dept (A → B, C)
– Employee ID, Project → Role (X, Y → Z)
Note multiple attributes are often involved

[Diagram: functional dependencies drawn over the attributes EmpID, Project, Age, Dept, Dsize, Budget, Role]


Rules for functional dependency

A → B does NOT automatically mean B → A
– E.g. student ID → name but not name → ID
Transitive dependency: if A → B and B → C then A → C
Many other rules
– E.g. if X, Y → Z but X → Z also
– In this case Z is partially dependent on X, Y
"Transitive" and "partial" dependency are two key concepts of the normalisation process


A Question for you!

[Diagram: the attributes EmpID, Project, Age, Dept, Dsize, Budget, Role with four functional dependencies labelled A, B, C and D]

EmpID  Project  Age  Dept  Dsize  Budget  Role
E1     P2       33   D2    10     100     Analyst
E1     P1       33   D2    10     200     Prog.
E2     P1       34   D5    10     200     Prog.
E2     P2       34   D5    20     100     Analyst

Which functional dependency is violated by the data?
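
A question like this can also be checked mechanically: a dependency X → Y is violated as soon as two rows agree on X but differ on Y. The following is a minimal Python sketch and not part of the original slides. The helper name fd_holds is hypothetical, and the list of candidate dependencies is an assumption read off the attribute diagram; the four rows are the ones in the table above.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if rows that agree on lhs always agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val the first time key is seen,
        # otherwise it returns the value stored earlier.
        if seen.setdefault(key, val) != val:
            return False
    return True


COLUMNS = ["EmpID", "Project", "Age", "Dept", "Dsize", "Budget", "Role"]
DATA = [
    ("E1", "P2", 33, "D2", 10, 100, "Analyst"),
    ("E1", "P1", 33, "D2", 10, 200, "Prog."),
    ("E2", "P1", 34, "D5", 10, 200, "Prog."),
    ("E2", "P2", 34, "D5", 20, 100, "Analyst"),
]
rows = [dict(zip(COLUMNS, r)) for r in DATA]

# Assumed candidate dependencies (not stated as a list in the slides).
CANDIDATE_FDS = [
    (["EmpID"], ["Age", "Dept"]),
    (["Dept"], ["Dsize"]),
    (["Project"], ["Budget"]),
    (["EmpID", "Project"], ["Role"]),
]

for lhs, rhs in CANDIDATE_FDS:
    status = "holds" if fd_holds(rows, lhs, rhs) else "violated"
    print(f"{', '.join(lhs)} -> {', '.join(rhs)}: {status}")
```

Note that data can only disprove a dependency. A dependency that holds in four sample rows is still a statement about the meaning of the attributes, not about the sample.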
Unnormalised Form

Relation contains:
– non-atomic attribute values

ID  Employee  Salary  Project
1   Grey      31000   A
2   Brown     35000   B,C
3   White     55000   A,B,C
4   Black     47000   A,C

Violation of 1NF: non-atomic values in Project


First Normal Form

Permits only single (atomic) attribute values

ID  Employee  Salary  Project  Budget
1   Grey      31000   A        10
2   Brown     35000   B        5
2   Brown     35000   C        5
3   White     55000   A        5
3   White     55000   B        5
3   White     55000   C        5
4   Black     47000   A        10
4   Black     47000   C        5

Flattening the rows gives atomic values but introduces redundancy (ID, Employee and Salary are repeated).

Remove the repeating group to another table, along with a copy of the primary key:

ID  Employee  Salary
1   Grey      31000
2   Brown     35000
3   White     55000
4   Black     47000

ID (fk)  Project  Budget
1        A        10
2        B        5
2        C        5
3        A        5
3        B        5
3        C        5
4        A        10
4        C        5


Second Normal Form

Full Functional Dependency (FFD)
X → Y is FFD
– if removal of any attribute from X removes the dependency
X → Y is partially dependent
– if removal of an attribute from X leaves the dependency intact
2NF test
– involves testing for partial dependency on the PK (therefore the PK MUST be composite to test for 2NF)
Relation R is in 2NF if:
– every non-primary-key attribute in R is FFD on the primary key of R


[Diagram: functional dependencies drawn over the attributes EmpID, Project, Age, Dept, Dsize, Budget, Role]

So which FDs are violating 2NF?
"Second Normalised" by:
– removing the partially dependent non-primary-key attributes and placing each with the part of the primary key on which it is fully functionally dependent

{EmpID, Age, Dept, Dsize}
{Project, Budget}
{EmpID*, Project*, Role}

2NF


Third Normal Form

Remove Transitive Dependency
Conditions
– A non-primary-key attribute Z is transitively dependent on primary key X if: X → Y; Y → Z (the Y attribute provides the transition to the PK)

A  [EmpID*  Project*  Role]
B  [EmpID  Age  Dept  Dsize]
C  [Project  Budget]
D  None of the above

Which of the above could have transitive dependency?
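
The 2NF test described above (does any proper part of a composite primary key determine a non-key attribute on its own?) can be sketched with an attribute closure. This is an illustrative Python sketch, not part of the original slides. The helper names closure and partial_dependencies are hypothetical, and the dependency list is an assumption chosen to agree with the 2NF decomposition {EmpID, Age, Dept, Dsize}, {Project, Budget}, {EmpID*, Project*, Role}.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes determined by attrs under fds (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs once lhs is covered and rhs adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(pk, attributes, fds):
    """Map each proper subset of pk to the non-key attributes it determines."""
    non_key = set(attributes) - set(pk)
    found = {}
    for size in range(1, len(pk)):
        for subset in combinations(sorted(pk), size):
            determined = closure(subset, fds) & non_key
            if determined:
                found[subset] = determined
    return found


ATTRIBUTES = {"EmpID", "Project", "Age", "Dept", "Dsize", "Budget", "Role"}
PRIMARY_KEY = {"EmpID", "Project"}

# Assumed functional dependencies (inferred, not listed in the slides).
FDS = [
    (frozenset({"EmpID"}), frozenset({"Age", "Dept"})),
    (frozenset({"Dept"}), frozenset({"Dsize"})),
    (frozenset({"Project"}), frozenset({"Budget"})),
    (frozenset({"EmpID", "Project"}), frozenset({"Role"})),
]

partial = partial_dependencies(PRIMARY_KEY, ATTRIBUTES, FDS)
for part_of_key, dependants in sorted(partial.items()):
    print(part_of_key, "->", sorted(dependants))
```

Each entry reported is a partial dependency, so its attributes move to a new relation keyed on that part of the primary key. Attributes that need the whole key (here Role) stay behind with the composite key.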
Here is an un-normalised table

Ord#  Date     Cust#  Name   Prod#  Desc   Qty  Supplier  Tel
1     12/1/01  1      Jones  1      Disk   3    X         101
1     12/1/01  1      Jones  2      CD     5    Y         223
2     13/1/01  2      Black  1      Disk   1    X         101
2     13/1/01  2      Black  2      CD     1    Y         223
2     13/1/01  2      Black  3      Mouse  1    X         101
3     13/1/01  1      Jones  3      Mouse  1    X         101


Normalise it to 1NF

Ord#  Date     Cust#  Name
1     12/1/01  1      Jones
2     13/1/01  2      Black
3     13/1/01  1      Jones

Ord# (fk)  Prod#  Desc   Qty  Supplier  Tel
1          1      Disk   3    X         101
1          2      CD     5    Y         223
2          1      Disk   1    X         101
2          2      CD     1    Y         223
2          3      Mouse  1    X         101
3          3      Mouse  1    X         101


Now we normalise this to 2NF, remembering to test on the PK for any partial dependency

Ord#  Date     Cust#  Name
1     12/1/01  1      Jones
2     13/1/01  2      Black
3     13/1/01  1      Jones

Already in 2NF

Ord# (fk)  Prod# (fk)  Qty
1          1           3
1          2           5
2          1           1
2          2           1
2          3           1
3          3           1

Prod#  Desc   Supplier  Tel
1      Disk   X         101
2      CD     Y         223
3      Mouse  X         101


So, any transitive dependency?

Yes! But not in all …

– Ord#, Date, Cust#, Name: Name depends on Cust#, so split into (Ord#, Date, Cust# (fk)) and (Cust#, Name)
– Prod#, Desc, Supplier, Tel: Tel depends on Supplier, so split into (Prod#, Desc, Supplier (fk)) and (Supplier, Tel)
– Ord#, Prod#, Qty: OK!


Final Decomposition

Ord# {fk}  Prod# {fk}  Qty
1          1           3
1          2           5
2          1           1
2          2           1
2          3           1
3          3           1

Ord#  Date     Cust# (fk)
1     12/1/01  1
2     13/1/01  2
3     13/1/01  1

Cust#  Name
1      Jones
2      Black

Prod#  Desc   Supplier (fk)
1      Disk   X
2      CD     Y
3      Mouse  X

Supplier  Tel
X         101
Y         223

Now in 3NF


The underlying E-R Model …

[E-R diagram, shown next to the original un-normalised order table: Customer makes Order; Order has Product; Supplier despatches Product. Multiplicities on the diagram: 1..1, 0..*, 0..*, 0..*, 1..*, 1..1]

How many tables would you get from mapping?


So Normalisation to 3NF is Normal!!

Remember, 2NF and 3NF disallow partial and transitive dependencies respectively on the PK, otherwise they are open to update anomalies
But … even at 3NF, a relation may, on rare occasions, still be open to update anomalies due to redundancy
So we look briefly at these
– Boyce-Codd
– 4NF


Boyce-Codd NF

Is a stronger normal form than 3NF
Definition: a relation is in BCNF if, and only if, every determinant is a candidate key
And remember that a candidate key is any key that could become the PK of the relation (i.e. there may be competition for it!)
Potential to violate BCNF comes from:
– A relation containing at least 2 composite candidate keys
– Or candidate keys overlapping (i.e.
they have at least one attribute in common)


BCNF Example

Consider the candidate keys for:

clientNo  interviewDate  interviewTime  staffNo  roomNo
CR76      13/5/08        10.30          SG5      G101
CR56      13/5/08        12.00          SG5      G101
CR74      13/5/08        12.00          SG37     G102
CR56      1/7/08         10.30          SG5      G102

FD1 {PK}: clientNo, interviewDate → interviewTime, staffNo, roomNo
FD2 {CK}: staffNo, interviewDate, interviewTime → clientNo
FD3 {CK}: roomNo, interviewDate, interviewTime → staffNo, clientNo
FD4: staffNo, interviewDate → roomNo

PK is primary key and CK is candidate key. But what about FD4? Its determinant (staffNo, interviewDate) is not a candidate key

Adapted from Connolly and Begg, 2005, 4th ed., page 420


So new decomposition?

clientNo  interviewDate*  interviewTime  staffNo*
CR76      13/5/08         10.30          SG5
CR56      13/5/08         12.00          SG5
CR74      13/5/08         12.00          SG37
CR56      1/7/08          10.30          SG5

interviewDate  staffNo  roomNo
13/5/08        SG5      G101
13/5/08        SG37     G102
1/7/08         SG5      G102

So duplication in the room number is now eradicated


4NF

Comes from 2 multivalued attributes in a relation
E.g. for each value of A there is a set of values for B and a set for C, while B and C remain independent of each other

[Diagram: entity Branch with attributes BranchNo, staffName[1..*], ownerName[1..*]]

So if you model your databases from ERMs this type of dependency should not arise.


Example of 4NF

branchNo  staffName  ownerName
C003      Anne       Carol
C003      David      Carol
C003      Anne       Tina
C003      David      Tina

branchNo*  staffName
C003       Anne
C003       David

branchNo*  ownerName
C003       Carol
C003       Tina

Note: if step 9 is applied to the multi-valued attributes then we map this correctly and avoid such redundancy, as the two smaller tables would be the result of the mapping!

Adapted from Connolly and Begg, 2005, 4th ed., page 428


Normal Form Summary

A relation's degree of normalisation
Stronger in format at each stage
– less vulnerable to update anomalies
First Normal Form (1NF)
– The relation has no non-atomic values
– Or the relation has "no repeating group"
2nd Normal Form (2NF)
– The relation has no partial dependencies
– All non-key attributes are fully functionally dependent on the PK
3rd Normal Form (3NF)
– The relation has no transitive dependencies
Boyce-Codd
– Every determinant is a candidate key
4NF
– no non-trivial multi-valued dependencies
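

As a closing check on the order example, a decomposition is only useful if joining the smaller tables gives back exactly the original rows. The following Python sketch is not part of the original slides: it projects the un-normalised order table onto the five 3NF tables and rejoins them. The helper names project and natural_join are hypothetical, and plain lists of dicts stand in for relations.

```python
COLUMNS = ("Ord#", "Date", "Cust#", "Name", "Prod#", "Desc", "Qty", "Supplier", "Tel")
DATA = [
    (1, "12/1/01", 1, "Jones", 1, "Disk", 3, "X", 101),
    (1, "12/1/01", 1, "Jones", 2, "CD", 5, "Y", 223),
    (2, "13/1/01", 2, "Black", 1, "Disk", 1, "X", 101),
    (2, "13/1/01", 2, "Black", 2, "CD", 1, "Y", 223),
    (2, "13/1/01", 2, "Black", 3, "Mouse", 1, "X", 101),
    (3, "13/1/01", 1, "Jones", 3, "Mouse", 1, "X", 101),
]
flat = [dict(zip(COLUMNS, r)) for r in DATA]


def project(rows, cols):
    """Relational projection: keep only cols and drop duplicate rows."""
    seen, out = set(), []
    for row in rows:
        values = tuple(row[c] for c in cols)
        if values not in seen:
            seen.add(values)
            out.append(dict(zip(cols, values)))
    return out


def natural_join(left, right):
    """Join two relations on every column name they share."""
    if not left or not right:
        return []
    shared = [c for c in left[0] if c in right[0]]
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[c] == r[c] for c in shared)
    ]


def as_set(rows):
    """Order-independent view of a relation over the original columns."""
    return {tuple(row[c] for c in COLUMNS) for row in rows}


# The five tables of the final decomposition.
orders = project(flat, ("Ord#", "Date", "Cust#"))
customers = project(flat, ("Cust#", "Name"))
order_lines = project(flat, ("Ord#", "Prod#", "Qty"))
products = project(flat, ("Prod#", "Desc", "Supplier"))
suppliers = project(flat, ("Supplier", "Tel"))

# Rejoin along the foreign keys and compare with the starting table.
rejoined = order_lines
for relation in (orders, customers, products, suppliers):
    rejoined = natural_join(rejoined, relation)

print("lossless:", as_set(rejoined) == as_set(flat))
```

The row counts of the projected tables (3, 2, 6, 3 and 2) match the tables in the final decomposition, and each customer name and supplier telephone number is now stored once.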