The Relational Model and Normalization

The Relational Model (page 113)
- Broad, flexible model
- Basis for almost all DBMS products
- E.F. Codd defined well-structured “normal forms” of relations (“normalization”)

Relational Data Model
- A relational data model organizes data as a set of relations, or two-dimensional tables.
- A relation is viewed as a two-dimensional table with the following properties:
  - Each column contains values of the same attribute, and each table cell must hold a single (simple) value
  - Each column has a distinct name (attribute name), and the order of columns is immaterial
  - Each row is distinct; duplicate rows are not allowed
  - The sequence of the rows is immaterial

An Example Relation

Employee
Employee Number   Name              Department Number   Salary                Date Started
(Key)             (Candidate Key)   (Foreign Key)       (Non-key Attribute)   (Non-key Attribute)
28719             Smith Tom         172                 18,000                12/03/84
53730             Jones Bill        044                 20,000                01/05/83
79313             Ropley Ed         044                 11,000                18/09/81
51616             Fair Carolyn      090                 50,000                05/12/79
61930             Hall Albert       090                 25,000                21/06/82

Terminology in a Relation
- Tuple - a row or record
- Column - values of an attribute
- Domain - the set of possible values for an attribute
- Key - primary key (unique ID)
- Concatenated key - two or more attributes used together to identify a record (e.g., Student ID & Course ID to identify a Grade record)
- Foreign key (cross-reference key) - a non-key attribute in one relation that also appears as a primary key in another relation

An E-R Model for Student Registration System
[E-R diagram: entities Course, Instructor, Student, and Enrollment; relationships Teaches and Advises, with 1 and M cardinality markers; attributes shown include Course Number, Instructor ID, Description, Room, Rank, Name, Grade, Student Number, Major, and Student Name.]

Convert E-R Model to Relational Tables
- Create one table for each entity, with its key and attributes
- Introduce a foreign key on the “many” side to represent a 1:M relationship

A Relational Model for Student Registration System
- Course Table (Course ID, Description, Credit, Instructor ID)
- Instructor Table (Instructor ID, Instructor Name, Rank)
- Student Table (Student ID, Student Name, Major, Advisor ID)
- Enrollment Table (Course ID, Student ID, Grade)

Relational Database
Advantages
- Easy to understand and use
- Powerful data manipulation capability
- Implicit association to meet different needs.
- Flexible, best for DSS
- Normalization theory for database design
Disadvantages
- Keys are stored redundantly as logical pointers to implement relationships
- Inefficient for high-volume transaction processing
- Lack of semantic quality control

Equivalent Relational Terms
[Figure 5-1, page 114; © 2000 Prentice Hall]

Normalization
- Reduce complex user views to a set of small, stable data structures
- Eliminate errors and inconsistencies related to adding, deleting, or updating record occurrences

Modification Anomalies
- Insertion anomalies - a record cannot be added because a value is missing for one or more fields
- Deletion anomalies - deleting a record causes an unintended loss of other information
- Update anomalies - updating is made needlessly complicated by redundancy

Functional Dependence
Given a relation R, attribute Y of R is functionally dependent on attribute X of R if and only if, whenever two tuples of R agree on their X-value, they must necessarily agree on their Y-value. We write R.X --> R.Y

Example: (Student ID, Student Name, Course ID, Course Title, Grade)
- Student ID --> Student Name
- Course ID --> Course Title
- Student ID -?-> Course ID
- Course Title -?-> Student Name
- Student ID -?-> Grade
- Course ID -?-> Grade

Normal Forms
A relation is said to be in a particular normal form if it satisfies a certain specified set of constraints.
- 1NF (no repeating groups)
- 2NF (no partial dependencies)
- 3NF (no transitive dependencies)
- Boyce-Codd NF
- 4NF (no multivalued dependencies)
- 5NF
- Domain-Key NF

First Normal Form
- A relation is in first normal form if it contains no repeating groups
- An un-normalized relation contains repeating groups

Grade Report with a repeating group of courses for each student:
(Student ID, Student Name, Campus Address, Major, Course ID, Course Title, Instructor Name, Instructor Location, Grade)
Remove the repeating group:
(Student ID, Student Name, Campus Address, Major) (3NF)
(Student ID, Course ID, Course Title, Instructor Name, Instructor Location, Grade) (1NF)

Second Normal Form
A relation is in second normal form if it is already in first normal form and any partial functional dependencies on the primary key have been removed.
[Diagram: attributes A, B, C, D, illustrating partial functional dependencies on the primary key.]

(Student ID, Course ID, Course Title, Instructor Name, Instructor Location, Grade) (1NF)
- Primary key is Student ID + Course ID
- Student ID + Course ID --> Grade
- Course ID --> Course Title (partial dependency)
Removing partial dependencies:
(Student ID, Course ID, Grade) (3NF)
(Course ID, Course Title, Instructor Name, Instructor Location) (2NF)

Third Normal Form
A relation is in third normal form if it is already in second normal form and contains no transitive dependencies.
- Transitive dependency - a nonkey attribute is dependent on one or more other nonkey attributes
[Diagram: attributes A, B, C, D, illustrating transitive dependencies.]

(Course ID, Course Title, Instructor Name, Instructor Location) (2NF)
- Course ID --> Instructor Name --> Instructor Location
- Instructor Name is nonkey
- Instructor Location is dependent on Instructor Name
Remove the transitive dependency:
(Course ID, Course Title, Instructor Name) (3NF)
(Instructor Name, Instructor Location) (3NF)

[Figure 5-7: Third Normal Form, “if it is in second normal form and has no transitive dependencies”; © 2000 Prentice Hall]

Practice: Mountain View Community Hospital

Mountain View Community Hospital Physician Report
Physician: A Campbell        Specialty: Internal Medicine

Date       Patient-Code   Patient-Name      Procedure      Charge
-----------------------------------------------------------------
10/17/96   32968          Baker, Marry S.   Examination     35.00
                                            X-ray           75.00
10/17/96   39271          Emery, Nancy      Examination     35.00
                                            Chemotherapy    50.00
10/18/96   32968          Baker, Marry S.   Examination     35.00
-----------------------------------------------------------------

Normalize a table
Report (Doctor Name, Specialty, Date, Patient Code, Patient Name, Procedure Name, Charge)
Analyzing functional dependency:
- Assume no duplicate Doctor Name; otherwise introduce a Doctor ID
- Assume no duplicate Procedure Name; otherwise introduce a Procedure code
- Assume charge is determined by procedure
- Assume a patient may visit a doctor more than once during the same day

Answer
- Doctors (Doctor ID, Doctor Name, Specialty)
- Patients (Patient Code, Patient Name)
- Visit (Visit ID, Doctor ID, Patient Code, Date)
- Treatment (Visit ID, Procedure ID)
- Procedure (Procedure ID, Procedure Name, Charge)
Here the Visit ID is automatically generated by the system.

An E-R Model for Hospital Treatment
[E-R diagram: entities Procedure, Doctors, Patients, Visit, and Treatment, with 1 and M cardinality markers; attributes shown include Charge, Procedure ID, Doctor ID, Description, Name, Rate, Specialty, Visit ID, Patient Code, Date/Time, and Patient Name.]

E-R model improvement criteria vs. Normalization Theory
- Each entity must have a key (simple or composite) (basic requirement of a relational table)
- Introduce a composite entity to convert an m:n relationship into two 1:m relationships, and give it a composite key (the way m:n relationships are represented in a relational database)
- Convert a multivalued attribute into an attribute entity or weak entity (1NF)
- Make each entity represent a simple object or concept (2NF and 3NF)
- Divide a complex entity into several related simple entities (2NF and 3NF)
- Make each attribute associate with only one entity unless it is a foreign key (3NF)
A good E-R model usually satisfies 3NF.
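The functional-dependency assumptions in the Mountain View practice can be tested against the sample rows of the Physician Report. The following is a minimal Python sketch, not part of the original slides: the function name `holds` and the dictionary field names are illustrative choices, and the rows are transcribed from the report with the repeating group expanded. A sample can only refute a dependency; it cannot prove that the dependency holds for all future data.

```python
# Rows transcribed from the Physician Report, one per procedure performed.
ROWS = [
    {"date": "10/17/96", "patient_code": "32968", "patient_name": "Baker, Marry S.",
     "procedure": "Examination", "charge": 35.00},
    {"date": "10/17/96", "patient_code": "32968", "patient_name": "Baker, Marry S.",
     "procedure": "X-ray", "charge": 75.00},
    {"date": "10/17/96", "patient_code": "39271", "patient_name": "Emery, Nancy",
     "procedure": "Examination", "charge": 35.00},
    {"date": "10/17/96", "patient_code": "39271", "patient_name": "Emery, Nancy",
     "procedure": "Chemotherapy", "charge": 50.00},
    {"date": "10/18/96", "patient_code": "32968", "patient_name": "Baker, Marry S.",
     "procedure": "Examination", "charge": 35.00},
]


def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs while disagreeing on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, then returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True


procedure_determines_charge = holds(ROWS, ["procedure"], ["charge"])
code_determines_name = holds(ROWS, ["patient_code"], ["patient_name"])
date_determines_patient = holds(ROWS, ["date"], ["patient_code"])
```

On this sample, Procedure --> Charge and Patient Code --> Patient Name are not contradicted, while Date --> Patient Code is refuted because two different patients appear on 10/17/96.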
Boyce-Codd Normal Form
[Figure 5-8: Boyce-Codd Normal Form, “if every determinant is a candidate key”; © 2000 Prentice Hall]

(Student, Major, Advisor) (3NF, with key Student + Major) or (Student, Advisor, Major) (1NF, with key Student + Advisor)
- A student may have more than one major, with one advisor in each major
- Student + Major --> Advisor
- Student + Advisor --> Major
- Advisor --> Major (Advisor determines Major, but Advisor is not a candidate key)
Decomposition:
(Student, Advisor) (BCNF)
(Advisor, Major) (BCNF)

- A relation is in BCNF if and only if it is in 3NF and every determinant is a candidate key
- A determinant is any attribute (simple or composite) on which some other attribute is fully functionally dependent
Situation in which a 3NF relation may fail to be in BCNF:
1. Multiple candidate keys
2. Those candidate keys are composite
3. The candidate keys overlap

Fourth Normal Form
A relation is in fourth normal form if it is in BCNF and contains no multivalued dependencies.

Multivalued Dependency
There are three attributes (e.g., A, B, C) in a relation. For each value of A there is a well-defined set of values of B and a well-defined set of values of C. The set of values of B is independent of the set of values of C, and vice versa.

(Course, Instructor, Textbook) (BCNF)
- One course is taught by several instructors
- Every instructor of a course uses the same set of textbooks
Decomposition:
(Course, Textbook) (4NF)
(Course, Instructor) (4NF)

Course   Instructor   Textbook
1ka3     David        Intro. Web design
1ka3     Smith        Intro. Web design
1ka3     David        Intro. Access
1ka3     Smith        Intro. Access

Course   Instructor        Course   Textbook
1ka3     David             1ka3     Intro. Web design
1ka3     Smith             1ka3     Intro. Access
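The fourth normal form decomposition of (Course, Instructor, Textbook) loses no information: joining the two projections on Course reproduces the original four rows. The short Python sketch below illustrates this with the sample data from the slides; the helper names `project` and `join_on_course` are made up for this example and do not come from the source.

```python
# The (Course, Instructor, Textbook) relation from the slides, as a set of tuples.
ROWS = {
    ("1ka3", "David", "Intro. Web design"),
    ("1ka3", "Smith", "Intro. Web design"),
    ("1ka3", "David", "Intro. Access"),
    ("1ka3", "Smith", "Intro. Access"),
}


def project(rows, positions):
    """Project each tuple onto the given column positions, dropping duplicates."""
    return {tuple(row[i] for i in positions) for row in rows}


def join_on_course(course_instructor, course_textbook):
    """Natural join of (Course, Instructor) with (Course, Textbook) on Course."""
    return {
        (course, instructor, textbook)
        for (course, instructor) in course_instructor
        for (other_course, textbook) in course_textbook
        if course == other_course
    }


course_instructor = project(ROWS, (0, 1))  # two rows instead of four
course_textbook = project(ROWS, (0, 2))    # two rows instead of four
rejoined = join_on_course(course_instructor, course_textbook)
lossless = rejoined == ROWS
```

Each projection stores two rows, so adding a third instructor would add one row to (Course, Instructor) rather than one row per textbook in the combined relation.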
Fifth Normal Form (page 125)
- Every join dependency is a consequence of its relation keys
- Not in 5NF: Person-using-skills-on-jobs (Person, Skill, Job)
- 5NF: Has-skill (Person, Skill), Need-skill (Skill, Job), Assigned-to-job (Person, Job)

Domain Key Normal Form
“if every constraint on the relation is a logical consequence of the definition of keys and domains”
- Constraint: “a rule governing static values of attributes”
- Key: “unique identifier of a tuple”
- Domain: “description of an attribute’s allowed values”

Example of non-DK/NF
Enrollment (Student ID, Course ID, Grade)
- Key constraint: Student ID + Course ID --> Grade
- Domain constraint: Student ID: 7 digits; Course ID: 3 digits; Grade: A, B, C, D, F, P
- General constraint: if Course ID < 900 then Grade in {A, B, C, D, F} else Grade in {P, F}
Since the general constraint cannot be inferred from the key constraint or the domain constraint, the relation is not in DK/NF.

Remarks on Normalization
- The notions of dependency and normalization are semantic in nature
- The normalization guidelines should be regarded primarily as a discipline to help database design
Limitations of normalization
- The result may not be natural, e.g. zip code, area code for phone #
- May ignore operational considerations: some values need not change, others may change over time, e.g. (Order#, Prod#, Description, Unit-price, Quantity)
- Difficult to enforce integrity control: with (Order#, Prod#, Quantity) and (Prod#, Description, Unit-price), a Prod# in the first relation may not be valid. This integrity control is now provided by the relational DBMS.

Denormalization
- Normalization is only one of many database design goals.
- Normalized (decomposed) tables require additional processing, reducing system speed.
- Normalization purity is often difficult to sustain in the modern database environment.
- The conflict among design efficiency, information requirements, and processing speed is often resolved through compromises that include denormalization.
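Returning to the integrity-control remark on (Order#, Prod#, Quantity) and (Prod#, Description, Unit-price): the sketch below shows one way a relational DBMS can reject an invalid Prod#, using Python's standard `sqlite3` module with an in-memory database. The table names, column names, and sample values ('P1', 'Widget', and so on) are hypothetical and chosen only for illustration. The final query also shows the join that the decomposed tables need in order to rebuild the original order line, which is the extra processing the denormalization discussion refers to.

```python
import sqlite3


def build_db():
    """Create the two decomposed tables with a foreign key from order_line to product."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE product ("
        " prod_no TEXT PRIMARY KEY, description TEXT, unit_price REAL)"
    )
    conn.execute(
        "CREATE TABLE order_line ("
        " order_no TEXT,"
        " prod_no TEXT REFERENCES product(prod_no),"
        " quantity INTEGER,"
        " PRIMARY KEY (order_no, prod_no))"
    )
    conn.execute("INSERT INTO product VALUES ('P1', 'Widget', 2.50)")  # hypothetical data
    return conn


def try_insert(conn, order_no, prod_no, quantity):
    """Return True if the order line is accepted, False if the DBMS rejects it."""
    try:
        conn.execute(
            "INSERT INTO order_line VALUES (?, ?, ?)", (order_no, prod_no, quantity)
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_db()
valid = try_insert(conn, "O1", "P1", 3)    # P1 exists in product
invalid = try_insert(conn, "O1", "P9", 1)  # P9 does not exist, so it is rejected

# Rebuilding the unnormalized order line requires a join.
rows = conn.execute(
    "SELECT o.order_no, o.prod_no, p.description, p.unit_price, o.quantity"
    " FROM order_line AS o JOIN product AS p ON p.prod_no = o.prod_no"
).fetchall()
```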