INFYS540 Normalizing Your Database and Why you WANT to do it! Lesson 7 Chapter 5 Appendix 3/18/2016 1 Data Redundancy Problems • Redundancy breeds errors – Same data defined in multiple places is BAD – Spelling/typographical error prone – Lack of data integrity • Inability to perform simple queries • Inflexibility and inscalability • Impossible to MAINTAIN! 3/18/2016 3 Relational Database PROJECTCHIEF Project Project Chief Computing 333-22-1111 Intranet 987-65-4321 Contracting 123-45-6789 CAT 333-22-1111 EMPLOYEES LName FName Jones Mike Smith Tony Lee Bruce Doodle Yankee 1 SSN 123-45-6789 987-65-4321 567-89-1234 333-22-1111 What is a candidate key? What is a primary key? What is a foreign key? 3/18/2016 Dept M&B C2G MLD M&B 1 DEPARTMENTS Dept Dept. Director MLD 181-94-5676 C2G 987-65-4321 M&B 123-45-6789 Room B115 123 147 5 Purpose of Normalization • Take advantage of the powerful tools available in a DBMS • There are five levels of Normalization – The higher the Normal Form the “better” and more efficient the database – But, increasing the levels of Normal Form takes time and effort – For most applications, 3rd Normal Form will solve most potential problems with a DB 3/18/2016 11 Normalizing Database • Process of creating well-structured tables. • Improve performance, integrity of data • 5-step process (w/ 2 rules) to achieve Third Normal Form (3NF) • First two steps put DB into a form so you can normalize it 3/18/2016 12 Rule #1 in Databases Never design redundant data into a Database duplicate data is not consistent duplicate data wastes space 3/18/2016 13 Step 1. Primary Keys • A primary key is one or more data fields (columns) that uniquely identify each record in the table • What would the primary key be below? – “table of employees, assigned to a department.” EMPLOYEES LName FName Jones Mike Smith Tony Lee Bruce 3/18/2016 SSN 123-45-6789 987-65-4321 567-89-1234 Dept Math M&B Science 14 Step 1. Primary Keys • Answer: The SSN • It is the only “guaranteed” unique column in the table. Names are easily repeated. EMPLOYEES LName FName Jones Mike Smith Tony Lee Bruce 3/18/2016 SSN 123-45-6789 987-65-4321 567-89-1234 Dept Math M&B Science 15 Step 1. Primary Keys • Now try the following example: • “A table of projects assigned to employees, listing the project name and the employee’s function on the project.” EmpProj A Counter --The MS Access Default Key Counter SSN Project Function 1 123-45-6789 Dining Designer 2 123-45-6789 Computing Designer 3 987-65-4321 Contracting Designer 4 444-55-6666 Intranet Webmaster 5 222-99-7777 Dining Overwatch 3/18/2016 16 Step 1. Primary Keys • It is the combination of the SSN and the Project fields. Why? EMPLOYEES’ PROJECTS Counter SSN 1 123-45-6789 2 123-45-6789 3 987-65-4321 4 444-55-6666 5 222-99-7777 3/18/2016 Project Dining Computing Contracting Intranet Dining Function Designer Designer Designer Webmaster Overwatch 17 Step 1. Primary Keys • Because, you can have the following: EMPLOYEES’ PROJECTS Counter SSN 1 123-45-6789 2 123-45-6789 3 987-65-4321 4 444-55-6666 5 222-99-7777 Project Dining Dining Intranet Intranet Dining Function Designer Designer Designer Webmaster Overwatch • Redundant records! (Redundancy = BAD) 3/18/2016 18 Rule #2 about Databases NEVER Use a Counter as a Primary Key 3/18/2016 19 Step 2: Eliminate Many-to-Many Relationships • What is wrong with the following table? • “a table of personnel authorized access to a project” PROJECTS QUERY ACCESS Project Access_1 Dining 222-99-7777 Computing 222-99-7777 Intranet 987-65-4321 3/18/2016 Access_2 181-94-5676 181-94-5676 818-49-6765 Access_3 123-45-6789 20 Step 2: Eliminate Many-to-Many Relationships • Here’s essentially what this table looks like within the Access relationships diagram: Projects: Project Project Chief Department Access_1 Access_2 Access_3 3/18/2016 Employees: SSN Last Name First Name .... 21 Step 2: Eliminate Many-to-Many Relationships • Here’s how you model it in a database: – Break it up into two one-to-many relationships Projects: Project Project Chief Department .... 1 1 Employees: SSN Last Name First Name .... Access to Project Info: Project SSN 3/18/2016 22 Step 2: Eliminate Many-to-Many Relationships • How to do it: – The primary key of the new table is the composite of the primary keys of the existing tables. • Primary key of Projects = Project Name • Primary key of Employees = SSN • New table primary key of Project Name and SSN 3/18/2016 23 Step 2: Eliminate Many-to-Many Relationships – No artificial restrictions on number of people with access – You can add attributes about the types of access granted – You can easily query who has access to information about each project PROJ QUERY ACCESS EMPLOYEE LName Jones Smith Lee Doodle 3/18/2016 FName Mike Tony Bruce Yankee SSN 123-45-6789 987-65-4321 567-89-1234 333-22-1111 Project Dining Dining Computing Computing Intranet Intranet Intranet SSN 222-99-7777 181-94-5676 222-99-7777 181-94-5676 987-65-4321 818-49-6765 123-45-6789 PROJECT Project Computing Intranet Contracting CAT ProjectChief Dept 333-22-1111 MATH 987-65-4321 M&B 123-45-6789 M&B 333-22-1111 Admin 24 What is wrong with the following? “A table of PCs, which are loaded with many different applications, and assigned to a user.” PCSerial# 10291 10301 10311 3/18/2016 LoadedSoftware Word, Powerpoint, ccMail Word, Powerpoint, Lotus Notes Word, LotusNotes, Borland C++ Assigned Jones Smith Hacker 25 Step 3: Achieving 1NF: All Data must be Atomic • “Atomic” - the data occupying a field cannot be further broken down. – i.e., no multi-data entries – i.e., “No attributes can have more than one value for a single instance of an entity” PCSerial# 10291 LoadedSoftware Word, Powerpoint, ccMail Assigned Jones • If not atomic, updating is complex and error prone • If not atomic, can not easily query the database 3/18/2016 26 Step 3 Answer PCSerial# 10291 10291 10291 10301 10301 10301 10311 10311 10311 3/18/2016 LoadedSoftware Word Powerpoint ccMail Word Powerpoint LotusNotes Word LotusNotes Borland C++ Assigned Jones Jones Jones Smith Smith Smith Hacker Hacker Hacker 27 Step 3. Achieving 1NF: All Data must be Atomic Another source of redundancy: calculated fields TotalYTD Age DaysRemaining Solution: Use a Query! Remove all calculated fields from table and create a query ...then use the query whenever you need up-to-date data 3/18/2016 28 Step 4. Achieving 2NF: Eliminate Partial Dependencies • What is a partial dependency? – Look at the table. What’s redundant? – “A table of functions an employee is assigned to for a project, and the project chief.” EMPLOYEES’ PROJECTS SSN Project 123-45-6789 Dining 123-45-6789 Computing 123-45-6789 Intranet 987-65-4321 Intranet 444-55-6666 Intranet 222-99-7777 Dining 3/18/2016 Function Designer Designer Member Designer Webmaster Overwatch Project Chief 222-99-7777 333-88-5656 987-65-4321 987-65-4321 987-65-4321 222-99-7777 29 Step 4. Achieving 2NF: Eliminate Partial Dependencies • Function depends on the entire primary key: SSN and Project. • ProjectChief is dependent on just a portion of the primary key EMPLOYEES’ PROJECTS SSN Project 123-45-6789 Dining 123-45-6789 Computing 123-45-6789 Intranet 987-65-4321 Intranet 444-55-6666 Intranet 222-99-7777 Dining 3/18/2016 Function Designer Designer Member Designer Webmaster Overwatch ProjectChief 222-99-7777 333-88-5656 987-65-4321 987-65-4321 987-65-4321 222-99-7777 30 Step 4. Achieving 2NF: Eliminate Partial Dependencies • Why is this bad? – Well, what’s wrong with the following? EMPLOYEES’ PROJECTS SSN Project 123-45-6789 Dining 123-45-6789 Computing 123-45-6789 Intranet 987-65-4321 Intranet 444-55-6666 Intranet 222-99-7777 Dining 3/18/2016 Function Designer Designer Member Designer Webmaster Overwatch Project Chief 222-99-7777 333-88-5656 987-65-4321 987-65-4321 222-99-7777 222-99-7777 31 Step 4. Achieving 2NF: Eliminate Partial Dependencies • A partial dependency (PD) occurs when a nonkey field depends on only a part of the primary key, and not the whole primary key. • PDs are a relation. So, we need a new table..... EMPLOYEES’ PROJECTS SSN Project 123-45-6789 Dining 123-45-6789 Computing 123-45-6789 Intranet 987-65-4321 Intranet 444-55-6666 Intranet 222-99-7777 Dining 3/18/2016 Function Designer Designer Member Designer Webmaster Overwatch Project Chief 222-99-7777 333-88-5656 987-65-4321 987-65-4321 987-65-4321 222-99-7777 32 Step 4. Achieving 2NF: Eliminate Partial Dependencies • Here’s how it should look...... EMPLOYEES’ PROJECTS SSN Project 123-45-6789 Dining 123-45-6789 Computing 123-45-6789 Intranet 987-65-4321 Intranet 444-55-6666 Intranet 222-99-7777 Dining PROJECTS Project Project Chief Dining 222-99-7777 Computing 333-88-5656 Intranet 987-65-4321 3/18/2016 Function Designer Designer Member Designer Webmaster Overwatch 33 Step 5: Achieving 3NF: Eliminate Transitive Dependencies • What is wrong with the following table? PROJECTS Project Dining Computing Intranet Contracting CAT 3/18/2016 Project Chief 222-99-7777 333-88-5656 987-65-4321 187-87-8787 333-22-1111 Dept. Admin Admin M&B M&B Grounds Dept. Director 181-94-5676 181-94-5676 818-49-6765 818-49-6765 123-45-6789 Room B115 B115 123 123 147 34 Step 5: Achieving 3NF: Eliminate Transitive Dependencies • We have fields dependent on a non-key field: – The Director and Room fields clearly relate to the Dept., and have nothing to do with the project. (Dept is a “determinant” that is not a candidate key) PROJECTS Project Dining Computing Intranet Contracting CAT 3/18/2016 Project Chief 222-99-7777 333-88-5656 987-65-4321 187-87-8787 333-22-1111 Dept. Admin Admin M&B M&B GRND Dept. Director 181-94-5676 181-94-5676 818-49-6765 818-49-6765 123-45-6789 Room B115 B115 123 123 147 35 Step 5: Achieving 3NF: Eliminate Transitive Dependencies • A transitive dependency occurs when a non-key field depends on another non-key field. • Why is this bad?. – A typo appeared in the Contracting line. A database without the transitive dependency would not have allowed this to happen. PROJECTS Project Dining Computing Intranet Contracting CAT 3/18/2016 Project Chief 222-99-7777 333-88-5656 987-65-4321 187-87-8787 333-22-1111 Dept. Admin Admin M&B M&B GRND Dept. Director 181-94-5676 181-94-5676 818-49-6765 818-49-6765 123-45-6789 Room B115 B115 123 124 147 36 Step 5: Achieving 3NF: Eliminate Transitive Dependencies • How to do it: a. Which fields are dependent on a non-key field in the table? (Director, Room) b. Which fields are these dependent on? (Dept) c. Create a new table with (b) as the primary key. d. Put (a) in the new table. e. Remove (a) from the old table. 3/18/2016 37 Step 5: Achieving 3NF: Eliminate Transitive Dependencies • Here are the new tables. PROJECTS Project Dining Computing Intranet Contracting CAT DEPARTMENTS Dept. Name Admin M&B GRND 3/18/2016 Project Chief 222-99-7777 333-88-5656 987-65-4321 187-87-8787 333-22-1111 Dept. Admin Admin M&B M&B GRND Dept. Director 181-94-5676 818-49-6765 123-45-6789 Room B115 123 147 38 Data Analysis: Normalization • An entity is in first normal form (1NF) if there are no attributes that can have more than one value for a single instance of the entity. • An entity is in second normal form (2NF) if it is already in 1NF, and if the values of all non-primary key attributes are dependent on the full primary key – not just part of it. • An entity is in third normal form (3NF) if it is already in 2NF, and if the values of its non-primary key attributes are not dependent on any other non-primary key attributes. 3/18/2016 39 Conclusion • • • • • • • • 3/18/2016 Rule1: Never design redundant data into a database Rule2: Never use a counter as Primary Key Identify proper primary keys (1NF) Break up many-to-many relationships (1NF) 1NF: Break all data into atomic components 2NF: Identify/eliminate partial dependencies 3NF: Eliminate transitive dependencies Common sense test 41 What is a Good Data Model? – A good data model is simple. • As a general rule, the data attributes that describe an entity should describe only that entity. – A good data model is essentially nonredundant. • This means that each data attribute, other than foreign keys, describes at most one entity. – A good data model should be flexible and adaptable to future needs. • We should make the data models as application-independent as possible to encourage database structures that can be extended or modified without impact to current programs. 3/18/2016 42 • Data and Referential Integrity – There are at least three types of data integrity that must be designed into any database - key integrity, domain integrity and referential integrity. – Key Integrity: • Every table should have a primary key (which may be concatenated). – The primary key must be controlled such that no two records in the table have the same primary key value. – The primary key for a record must never be allowed to have a NULL value. 3/18/2016 49 • Data and Referential Integrity – Domain Integrity: • Appropriate controls must be designed to ensure that no field takes on a value that is outside of the range of legal values. – Referential Integrity: • A referential integrity error exists when a foreign key value in one table has no matching primary key value in the related table. 3/18/2016 50 • Referential Integrity: • Referential integrity is specified in the form of deletion rules as follows: – No restriction. • Any record in the table may be deleted without regard to any records in any other tables. – Delete:Cascade. • A deletion of a record in the table must be automatically followed by the deletion of matching records in a related table. – Delete:Restrict. • A deletion of a record in the table must be disallowed until any matching records are deleted from a related table. – Delete:Set Null. • A deletion of a record in the table must be automatically followed by setting any matching keys in a related table to the value NULL. 3/18/2016 51 • Database Capacity Planning – A database is stored on disk. • The database administrator will want an estimate of disk capacity for the new database to ensure that sufficient disk space is available. – Database capacity planning can be calculated with simple arithmetic as follows. 1 For each table, sum the field sizes. – This is the record size for the table. 2 For each table, multiply the record size times the number of entity instances to be included in the table. – This is the table size. 3/18/2016 54 • Database Capacity Planning – Database capacity planning can be calculated with simple arithmetic as follows. (continued) 3 Sum the table sizes. – This is the database size. 4 Optionally, add a slack capacity buffer (e.g., 10%) to account for unanticipated factors or inaccurate estimates above. – This is the anticipated database capacity. 3/18/2016 55