ISOM
MIS710 Module 1b
Relational Model and Normalization
Arijit Sengupta

Structure of this semester
MIS710
0. Intro: Database Fundamentals
1. Design: Conceptual Modeling, Relational Model, Normalization
2. Querying: Query Languages, Advanced SQL
3. Applications: Java DB Applications – JDBC
4. Advanced Topics: Transaction Management, Data Mining
Newbie, Users, Designers, Developers, Professionals

Today’s Buzzwords
• Relational Model
• Superkey, Candidate Key, Primary Key and Foreign Key
• Entity Integrity Rule
• Referential Integrity Rule
• Normalization
• First, Second, Third, and Boyce-Codd Normal Forms
• Unnormalization

Objectives of this lecture
• Understand the Relational Model and its properties
• Understand the notion of keys
• Understand the use and importance of referential integrity
• Provide an alternative way to design relations using semantics rather than concepts
• Take an existing “flat file” design and create a relational design from it through the process of Normalization
• Identify sources of problems (or anomalies) within a given relational design
• Argue about improvements to designs created by others

Relational Data Model
• Originally proposed by Codd in 1970
• Based on mathematical set theory

ID   Name    Age   Address       GPA
S1   Jose    21    Stoned Hill   3.1
S2   Alice   18    BigHead       3.2
S3   Lin     32    Done-Audy     2.9
S4   Joyce   20    Atlanta       3.7
S5   Sunil   27    Mare-iota     3.2

Diagram labels: Relation (the whole table), Attribute Names (header row), Attributes (columns), Tuples (rows), Attribute Values (cells)

Relation: Properties
• A relation is a set of tuples
• A tuple is a set of attribute-value pairs
  - Ordering of attributes is immaterial
  - Ordering of tuples is immaterial
• Tuples are distinct from one another
• Attributes contain atomic values only

Example (Name and Address are not atomic):

Emp#   Name                  Address
E1     'Jose' 'M.' 'Smith'   '3413 Main Street', 'Atlanta', GA

Attributes
• Attribute name
  - Attribute names are unique within a relation
• Attribute domain
  - Set of all possible values an attribute may take
  - Domain (GPA) =
  - Domain (name) =
  - Domain (DateOfBirth) =
  - Domain (year) =
• Number of attributes: degree of the relation

Tuples
• Aggregation of attribute values
  - S1 = (s1, ‘Jose’, 21, ‘StonedHill’, 3.1)
  - S2 = (s2, ‘Alice’, 18, ‘BigHead’, 3.2)
• Cardinality: Number of tuples in a relation

ID   Name    Age   Address       GPA
S1   Jose    21    Stoned Hill   3.1
S2   Alice   18    BigHead       3.2
S3   Lin     32    Done-Audy     2.9
S4   Joyce   20    Atlanta       3.7
S5   Sunil   27    Mare-iota     3.2

• What is the difference between the cardinality and the degree?

Primary Keys
• Superkey: SK, a subset of attributes of R satisfying uniqueness, that is, no two tuples have the same combination of values for these attributes.
• Candidate Key: K, a superkey satisfying minimality, that is, no component of K can be eliminated without destroying the uniqueness property.
• Primary Key: PK, the selected candidate key K.
• Can a primary key be composed of multiple attributes?
• Can a relation have multiple primary keys?

Keys - example
Disk (ISBN#, Artist_name, Album_name, Year, Producer, Genre, time, price)
• Superkeys?
• Candidate keys?
• Primary key?

Entity Integrity Rule
• The primary key of a base relation cannot contain a NULL value.
• Enforcement of the rule: An update which results in a NULL value in the primary key must be rejected.
• Are the following OK?
(Primary key: Course, Section)

Course   Section   Meets   Enrolled
201      1         MW      20
201      NULL      TTh     25
NULL     NULL      MWF     18

Foreign Key
• Attribute(s) of one relation that reference(s) the PK of another relation
• FK may or may not be (a part of) the PK of this relation

Physician (ID, Name, …)           Patient (ID, Name, PhysID*, …)
Club (ID, Name, …)                Player (ID, Name, ?*, …)
Order (OrdID, Date, …, ?*)        Customer (ID, Name, …, ?*)
Dept (DeptID, Name, …, ?*)        Employee (EID, Name, …, ?*)
Course (CourseID, Name, …, ?*)    Student (SID, Name, …, ?*)
Class (ClassID, Meets, …, ?*)     Registration (?)

• Can an FK refer to a part of the PK of another relation?
• Can an FK refer to a PK of the same relation?

Foreign Key ..
• FK and referenced PK may have different names
• The values of FK must draw from the value set of PK

[Diagram labels: Primary Key, Value Set, Foreign Key, Domain]

• How do we define the Domain of an FK?
• Can an FK have a NULL value?
• What can we enforce with PKs and FKs?

Referential Integrity Rule
• If FK is the foreign key of a relation R2, which matches the primary key PK of the relation R1, then:
  - the FK value must match the PK value in some tuple of R1, or
  - the FK value may be NULL, but only if the FK is not (a part of) the PK of R2.
• Enforcement of the Rule
  - An update on either a referenced PK or an FK must satisfy the rule. Otherwise, the operation is rejected.
• Which operation on the primary key may violate this rule?
• Which operation on the foreign key may violate this rule?

Referential Integrity Enforcement
• If an operation violates referential integrity:
  - Restrict: reject the operation
  - Cascade: try to propagate the operation to all dependent FK values; if that is not possible, reject the operation
  - Nullify (or Default): set all dependent FK values to NULL (or a default value); if that is not possible, reject the operation
• Cases for each of the above situations?
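
The three enforcement options above can be tried out directly. What follows is a minimal sketch, not part of the original slides: it assumes SQLite through Python's standard sqlite3 module (the slides themselves use Oracle-style DDL), reuses the Physician/Patient pair from the Foreign Key slide with simplified columns, and introduces a hypothetical helper, delete_parent, purely for illustration.

```python
import sqlite3


def delete_parent(action: str):
    """Delete a referenced physician under the given ON DELETE action.

    Returns (outcome, remaining patient rows). Hypothetical helper,
    written only to illustrate Restrict / Cascade / Nullify.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign-key enforcement off unless it is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE physician (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE patient ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        f" phys_id INTEGER REFERENCES physician(id) ON DELETE {action})"
    )
    conn.execute("INSERT INTO physician VALUES (1, 'Dr. A')")
    conn.execute("INSERT INTO patient VALUES (10, 'Pat', 1)")
    try:
        # The delete touches a PK value that a patient row still references.
        conn.execute("DELETE FROM physician WHERE id = 1")
        outcome = "deleted"
    except sqlite3.IntegrityError:
        outcome = "rejected"
    rows = conn.execute("SELECT id, phys_id FROM patient").fetchall()
    conn.close()
    return outcome, rows


if __name__ == "__main__":
    for action in ("RESTRICT", "CASCADE", "SET NULL"):
        print(action, delete_parent(action))
```

With RESTRICT the delete is rejected and the patient row is untouched; with CASCADE the patient row disappears along with the physician; with SET NULL the patient row survives with a NULL phys_id. Other database systems may differ in syntax and in which actions they support.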

Creating Relations

    create table STUDENT (
        ID    char(11) not null primary key,
        Name  char(30) not null,
        age   int,
        GPA   number(2,1));

    create table COURSE (
        courseno     char(6) not null primary key,
        coursename   char(30) not null,
        credithours  number(2,1));

    -- REGISTRATION links students to courses; its PK is composite,
    -- and deleting a student also deletes that student's registrations.
    create table REGISTRATION (
        ID         references STUDENT (ID) on delete cascade,
        CourseNum  references COURSE (courseno),
        primary key (ID, CourseNum));

Normalization - Motivating Example

SID   Name     Grade   Course#   Text   Major   Dept
s1    Joseph   A       CIS800    b1     CIS     CIS
s1    Joseph   B       CIS820    b2     CIS     CIS
s1    Joseph   A       CIS872    b5     CIS     CIS
s2    Alice    A       CIS800    b1     CS      MCS
s2    Alice    A       CIS872    b5     CS      MCS
s3    Tom      B       CIS800    b1     Acct    Acct
s3    Tom      B       CIS872    b5     Acct    Acct
s3    Tom      A       CIS860    b1     Acct    Acct

• Is there any redundant data?
• Can we insert a new course# with a new textbook?
• What should be done if ‘CIS’ is changed to ‘MIS’?
• What would happen if we remove all CIS 800 students?

Why Normalization?
• Poor relation design causes anomalies
  - Insertion anomalies: Insertion of some piece of information cannot be performed unless other irrelevant information is added to it.
  - Update anomalies: Update of a single piece of information requires updates to multiple tuples.
  - Deletion anomalies: Deletion of a piece of information removes other unrelated but necessary information.
• Normalization improves the design to remove these anomalies

Why Normalization?
• Benefits: normalized relations
  - contain a minimum amount of redundancy
  - allow users to insert, delete, and modify tuples in the relation without errors or inconsistencies
  - improve the quality of information in the database
  - decrease storage space for the database
• Costs: normalized relations
  - may contribute to performance problems
  - may require more storage in some cases

Unnormalized Relation

STUDENT   STUDENT   COURSE   COURSE         INSTR
ID        NAME      ID       NAME           NAME     ROOM   CREDITS   GRADE
224       Waters    CIS20    Intro CBIS     Greene   205G   5         A
                    CIS40    Database Mgt   Hong     311S   5         B
                    CIS50    Sys.Analysis   Purao    139S   5         B
351       Byron     CIS30    COBOL          Brown    629G   3         B
                    CIS50    Sys.Analysis   Purao    139S   5         C
421       Smith     CIS20    Intro CBIS     Greene   205G   5         B
                    CIS30    COBOL          Brown    629G   3         B
                    CIS50    Sys.Analysis   Purao    139S   5         B

• Create a ‘Definition’ for this relation.
• Do you see any problems in the definition?
• Do you see any anomalies in the data?

Normal Forms
Nested forms, outermost to innermost: NF2, 1NF, 2NF, 3NF, BCNF

Unnormalized Relation
  ↓ Only atomic attributes
First Normal Form
  ↓ Remove nonkey dependency
Second Normal Form
  ↓ Remove transitive dependency
Third Normal Form
  ↓
Higher Order Forms
  - Dependency preservation: BCNF
  - Remove Multi-valued Dependencies: 4NF
  - Remove Join Dependencies: 5NF

The Basis of Normalization
• Functional Dependency (FD)
  - Consider two attributes, X and Y, and two arbitrary tuples r1 and r2 of a relation R.
• Y is functionally dependent on X iff:
  - value of X in r1 = value of X in r2 implies value of Y in r1 = value of Y in r2
• Also stated as: R.X → R.Y or X → Y

Properties of FDs
• If R.X → R.Y or X → Y
  - X is called the determinant of Y.
  - X may or may not be the key attribute of R.
  - An FD changes with its semantic meaning
      Name → Address?
  - X and Y may be composite
  - X and Y may be mutually dependent on each other
      Husband → Wife, Wife → Husband
  - The same Y value may occur in multiple tuples
      Course# → Text

Fully Functional Dependencies
• When is X → Y an FFD?
  - When Y is not functionally dependent on any proper subset of X
• Is X → Y a fully functional dependency (FFD)?
  - ( SID, Course# ) → Name?
  - ( SID, Course# ) → Grade?
  - ( SID, Name ) → Major?
  - ( SID, Name ) → SID?
• By default, the term FD refers to FFD

Transitive Dependencies
• Given attributes X, Y, and Z of a relation R,
• Z is transitively dependent on X (X → Z) iff X → Y and Y → Z
• For example: SID → Dept, SID → Major, Dept → School, Major → Dept
• Do you see any Transitive Functional Dependencies?

Some Inference Rules for FDs
• An FD is redundant if it can be derived from other FDs based on a set of inference rules. Some of these rules are:
• Reflexive rule: If X ⊇ Y, then X → Y
  - X always determines a subset of itself.
• Augmentation rule: If X → Y, then XZ → YZ
  - Adding the same attribute(s) to both sides does not change the FD.
• Transitive rule: If X → Y and Y → Z, then X → Z
  - Functional dependencies can be ‘chained’.
• Decomposition rule: If X → YZ, then X → Y and X → Z
• Given: { SID → Name, SID → Major, Major → Dept }, which of the following is/are redundant?
  - SID → School, SID → Dept, Dept → School
  - SID → ( Name, Major ), (SID, Name) → (Major, Name)
  - SID → SID, SID → (Name, SID)

First Normal Form
• DEFINITION
  - A relation R is in first normal form (1NF) if and only if all underlying domains contain atomic values only.
• Translation
  - To be in first normal form the table must not contain any repeating attributes.
• Implication
  - Are all ‘relations’ in First Normal Form (1NF)?

Example - 1NF
The ‘unnormalized’ relation has been decomposed in two.

Relation: Student

StudentID   StudentName
224         Waters
351         Byron
421         Smith

Relation: Student-Course

StudentID   Course#   Course Title   Instrname   ROOM   CREDITS   GRADE
224         CIS20     Intro CBIS     Greene      205G   5         A
224         CIS40     Database Mgt   Hong        311S   5         B
224         CIS50     Sys.Analysis   Purao       139S   5         B
351         CIS30     COBOL          Brown       629G   3         B
351         CIS50     Sys.Analysis   Purao       139S   5         C
421         CIS20     Intro CBIS     Greene      205G   5         B
421         CIS30     COBOL          Brown       629G   3         B
421         CIS50     Sys.Analysis   Purao       139S   5         B

• What are the PKs?

Anomalies (with only 1NF)
• Insertion Anomaly
  - A new course cannot be inserted in the database (relation Student-Course) until a student registers for that course.
• Update Anomaly
  - If the instructor of a course is changed, this fact would have to be noted at many places in the database (many tuples of the relation Student-Course).
• Deletion Anomaly
  - Withdrawal of all students from an existing course (that is, deletion of related tuples from the relation Student-Course) will result in unwarranted removal of that course from the database.

Anomalies in 1NF
Course (SID, Name, Grade, Course#, Text, Major, Dept)
• 1NF relations have anomalies
  - Redundant Information?
  - Update Anomalies?
  - Insertion Anomalies?
  - Deletion Anomalies?

[Dependency diagram over: SID, Name, Major, Dept, Course#, Grade, Text]

Second Normal Form
• DEFINITION
  - A relation R is in second normal form (2NF) if and only if it is in 1NF and every nonkey attribute is dependent on the full primary key.
• Translation
  - A table is in second normal form if there are no partial dependencies.
• Implication
  - What kinds of primary keys may lead to a violation of the Second Normal Form (2NF)?

Bubble Chart
• Reconsider the example ..

[Bubble chart: compound key StudentId + CourseId with StudentName, CourseTitle, Credits, Instructor, Classroom, Grade]

Dealing with Compound Keys
• Revised Bubble Chart

[Revised bubble chart over: StudentId, CourseId, StudentName, CourseTitle, Credits, Instructor, Classroom, Grade]

Example - 2NF

STUDENT   STUDENT
ID        NAME
224       Waters
351       Byron
421       Smith

COURSE   COURSE
ID       TITLE              CREDITS
CIS20    Intro to CIS       5
CIS30    Java               3
CIS40    DBMS               5
CIS50    Systems Analysis   5

STUDENT   COURSE
ID        ID       GRADE
224       CIS20    A
224       CIS40    B
224       CIS50    B
351       CIS30    B
351       CIS50    C
421       CIS20    B
421       CIS30    B
421       CIS50    B

Anomalies with (only) 2NF

STUDENT   STUDENT                       ADVISOR   ADVISOR   TOTAL
ID        NAME      STATUS   ADVISOR    OFFICE    PHONE     CREDITS
224       Waters    Junior   Young      CBA221    726104    105
351       Byron     Soph     Greene     CBA215    718434    77
421       Smith     Junior   Young      CBA221    726104    97

• Insertion anomaly
  - Information about a faculty member (potential advisor) cannot be added to the database unless a student is assigned to him/her.
• Update anomaly
  - If the advisor’s office location or phone were changed, many tuples would need to be changed.
• Deletion anomaly
  - If all students assigned to an advisor graduate, information about the advisor will disappear from the database.

Third Normal Form
• DEFINITION
  - A relation R is in third normal form (3NF) if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.
• Translation
  - A table is in Third Normal Form if every non-key attribute is determined by the key, and nothing else.
• Implication
  - How many total attributes must the relation have for a possible violation of the Third Normal Form (3NF)?

3NF Example
• Chalk out the relations.

[Bubble charts: StudentId with StudentName, TotalCredits, Advisor, Status; Advisor with AdvisorOffice, AdvisorPhone]

• How do you maintain the student-advisor relation?

Boyce-Codd Normal Form (BCNF)
• Update anomalies can occur in a 3NF relation R if
  - R has multiple candidate keys,
  - those candidate keys are composite, and
  - the candidate keys overlap.
  Computer-Lab (SID, Account, Class, Hours)
• A relation R is in BCNF iff every determinant is a candidate key.

The Normalization Process
1. Flatten the table completely (no composite columns)
2. Find the key and “all” FDs (well, as many as you can possibly detect)
3. Find partial dependencies and decompose the relation using them (2NF)
4. Find transitive dependencies and decompose using them (3NF)
5. Remember: this is not a deterministic method. It depends on the order in which FDs are chosen, so the same relation with the same set of FDs can lead to different decompositions!
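
Steps 2 to 4 of the process above can be sketched in code. This is an illustrative sketch only, not the lecture's method: the FD list is an assumption read off the motivating Course (SID, Name, Grade, Course#, Text, Major, Dept) example, all function names are hypothetical, and the partial/transitive tests are simplified for a relation with a single candidate key.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(relation, fds):
    """Return all minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = frozenset(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == relation:
                keys.append(candidate)
    return keys


def partial_dependencies(key, fds):
    """FDs whose determinant is a proper subset of the key (2NF violations)."""
    return [(lhs, rhs) for lhs, rhs in fds if lhs < key and not rhs <= key]


def transitive_dependencies(key, fds):
    """FDs from nonkey attributes to nonkey attributes (3NF violations).

    Simplification: assumes the relation has only this one candidate key.
    """
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if lhs.isdisjoint(key) and rhs.isdisjoint(key)
    ]


# Assumed FDs for the motivating example; not stated as a list in the slides.
R = frozenset({"SID", "Name", "Grade", "Course#", "Text", "Major", "Dept"})
FDS = [
    (frozenset({"SID"}), frozenset({"Name"})),
    (frozenset({"SID"}), frozenset({"Major"})),
    (frozenset({"Major"}), frozenset({"Dept"})),
    (frozenset({"Course#"}), frozenset({"Text"})),
    (frozenset({"SID", "Course#"}), frozenset({"Grade"})),
]

if __name__ == "__main__":
    key = candidate_keys(R, FDS)[0]
    print("key:", sorted(key))
    print("partial:", partial_dependencies(key, FDS))
    print("transitive:", transitive_dependencies(key, FDS))
```

Under these assumed FDs the only candidate key is (SID, Course#). The partial dependencies on SID and Course# suggest splitting off student and course relations for 2NF, and Major → Dept suggests a further split for 3NF. The brute-force key search is exponential in the number of attributes, which is acceptable only for classroom-sized relations.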

Lossless Decomposition
• A bad decomposition loses information
• In a good decomposition
  - The join of decomposed relations restores the original relation
  - Decomposed relations can be maintained independently
• Rissanen’s rule for non-loss decomposition: Two projections R1 and R2 of a relation R are independent iff:
  - Every FD in R can be logically deduced from those in R1 and R2, and
  - The common attributes of R1 and R2 form a candidate key for at least one of the pair.

Higher Normal Forms
• Fourth Normal Form
  - Multivalued Dependencies (Fagin 1977)
• Fifth Normal Form
  - Join Dependencies (Fagin 1979)
• Other Dependencies
  - Inclusion Dependencies (Casanova 1981)
  - Template Dependencies (Sadri 1982)
  - Domain-Key Normal Form (Fagin 1981)

In-class Exercise – Normalize this: