Database Design - Normalization Normalization are a set of techniques for organizing data into tables in order to... – Eliminate most redundancy – Prevent incompleteness Much of this presentation is devoted to a rather long example... will work through it together 1 Two Steps to Normalization 1. Put the data into tabular form (by removing repeating groups) 2. Remove duplicated data into separate tables 2 First - A Simple Example Graphic 2-1 - Employee Qualifications 3 Put Data Into Tabular Form Problem: What do we do if an employee has more than one qualification? Graphic 2-2 4 1. Divide Data Into Two Tables Employee Table – Emp#, Name, Dept#, DeptName, DeptLocation Qualification Table – Emp#, Qualification Desc, Qualification Year Graphic 2-3 5 2. Remove Duplicated Data Notice that Dept. Number "05" is "Auditing" and is located at "HO" It is repeated for every employee in that department – This wastes space – Makes updating data more complicated – "Elegance" rule is violated 6 Whose Data is it Anyway? The basic problem is that department name and location are really data about departments rather than employees It belongs in a separate Department table Graphic 2-4 7 A Very Basic Example This example was presented informally The rules of normalization have their foundation in mathematics – On one hand, we can have confidence in normalization as a technique – On the other hand, it's very easy to become lost in mathematical terminology and proofs – Remember that data modeling is "design" and we should be careful about anything that lead us to one "right" answer. 8 Relational Notation The sample tables we've seen so far take up a lot of space We need a more concise notation If we eliminate the sample rows, we are left with the table names and columns Graphic 2-5 9 Relational Notation - cont. Text books usually use data displayed as in Graphic 2-5 Designers usually want a little more information Graphic 2-6 10 Exercise Normalize the database shown in the graphic Take about 10 minutes Remember two steps... – Put the information in tabular format – Eliminate redundant information Graphic 2-E1 11 A More Complex Example We'll introduce the rules of normalization as we proceed The rules can be daunting at first, but we'll look at the problems they solve 12 Hospital Survey Example The form displayed in Graphic 2-7 is one used in an actual survey of antibiotic drug usage in Australian hospitals Survey used to determine which drugs and dosages were being used for various operations, to ensure that patients were properly prescribed for, and that the public was not paying for unnecessary drugs 13 One Form for Each Operation Each hospital in the survey was given a unique hospital number All hospital numbers were prefixed with 'H' Hospitals fell into three categories: – 'P' for public – 'V for private – 'T' for training » All training hospitals were public so, 'T' implied 'P' 14 Operations All Operation Numbers were assigned sequentially be each hospital Operation Code was a standard international code for the named operation Procedure Group was a broader classification 15 Surgeons Surgeon Number was allocated by individual hospitals to allow surgeons to remain anonymous The prefix 'S' stood for surgeon 16 Drugs The Total Drug Cost was the total cost of all drug doses for the operation The bottom of the form recorded the individual antibiotic drugs used in the operation The Drug Code consisted of the short name for the drug and the size of the dose 17 As the Study Continued... ... It was decided to replace the heaps of forms with a computerized database Graphic 2-8 shows the initial database design using the relational notation This was done by one person, who was the data modular, the physical database designer, and the programmer 18 Exercise Normalize the database as shown in Graphic 2-8 Take about 15 minutes Remember two steps... – Put the information in tabular format – Eliminate redundant information 19 Determining Columns Normalization relies on certain assumptions about the way data is represented – We need to make sure that these are valid There are some problems normalization does not solve – It is better to address these at the outset, rather than carry excess baggage through the whole normalization process The following steps are necessary... 20 One Fact per Column First we make sure that each column in the table represents one fact only – The Drug code column holds both a short name for the drug and a dosage size – The Dosage size consists of a numeric size and a unit of measure – The Hospital Category really provides two facts » Hospital Public or Private? » Does the hospital provide training? 21 Hidden Data Make sure that we have not lost any data in the translation to tabular form Common problem: We cannot rely on the rows of the table being stored in any particular order – Suppose the original survey forms had been filed in order of return – We would need a Return Date or Sequence column 22 Derivable Data We need to remove any data that can be derived from other data in the table and amend the columns accordingly – Remember our basic objective is nonredundancy – The Total drug cost was derived by adding together the Drug Costs multiplied by the Number of Doses 23 Determining the Key Key - a minimal set of columns that hold a different combination of values for each row in the table The value of the Key uniquely identifies one row on the table – The combination of Hospital Number and Operation number – In relational notation, underline the key column(s) 24 Exercise Using Graphic 2-8, Clean up the design (i.e. list all the columns that are valid in the new design) Remember... – – – – One fact per column Don't lose hidden data Remove data that can be derived Determine the Key Take about 10 minutes 25 After Tidying Up... Graphic 2-9 26 Repeating Groups & 1st Normal Form Our first task was to put the data in tabular format… – it might seem like we've done this – but we've actually hidden a problem about the drug administered data 27 Limiting Number of Occurrences The drug administration data is the major cause of the tables complexity and inelegance – Drug Short Name 2, Drug Short Name 4 etc. The columns needed to accommodate up to four drugs account for most of the complexity 28 Why Only Four Drugs? Why not five or six or more? Four drugs represented the maximum arrived at by asking one of the survey teams "What would be the maximum number of drugs used in an operation?" In fact, this number was frequently exceeded – some operations had 10 or more 29 What's the Problem? Part of the problem is that the question was poorly put… a line on the form was required for each drug-dose combination, rather than just for each different drug The maximum number of drugs could increase later, so this model still rates poorly against the stability criterion 30 Paper Over the Differences With the paper form, a continuation sheet was simply attached when more drugs were needed We could add more columns to the table easily, but the application program changes would be much more difficult 31 What Was Done... The original designer decided to handle continuations by suffixing the operation number with "a", "b", or "c" to indicate a continuation This caused program changes and comprised the original simplicity of the system 32 Data Reusability & Program Complexity The main difficulties are with data reusability and program complexity Program can easily answer questions… – How many operations were performed by neurosurgeons? – Which hospital is spending the most money on drugs? But not … – How much money was spent on Ampicillin? 33 Another Way You might argue that some queries are always going to be more difficult than others What would happen if we had designed the table on the basis of "one row per drug"? 34 Recognizing Repeating Groups A Set of columns repeated a number of times - a "repeating group" - results in… – inflexibility – complexity – poor data reusability Graphic 2-10 – curly brackets indicate a repeating group with an indefinite number of occurrences 35 Solution A general and flexible solution should not set any limit on the maximum number of occurrences It should also handle of few or no occurrences (the drug-free operation) 36 Removing Repeating Groups First step of Normalization: – Put the data in tabular form by identifying and eliminating repeating group Split the file into two tables – basic operation data – (repeating) drug administration table 37 Split into Two Tables Remove all repeating group columns to a new table (each occurrence of group is a row in the new table Include the primary key of the original table in the new table (this makes a foreign key in the new table) 38 Split into Two Tables - cont. Add a 'Sequence' column if needed Name the new table Identify and underline the primary key of the new table 39 Exercise Using Graphic 2-10 remove all repeating groups 40 Solution Graphic 2-11 41 Determining Key of New Table Not always an easy task Question: What is the minimum number of combination of columns needed to uniquely identify one row? 42 A Six Column Key Hospital Number - FK Operation Number - FK Drug Short Name Dose Unit of Measure Method of Administration 43 First Normal Form Our tables are now technically in first normal form (1NF). What have we achieved? – All data of the same kind in now held in the same place – The number of different drug dosages that can be recorded for an operation is effectively unlimited – An operation that don't use any drugs is allowed 44 Problems with First Normal Form Look at Operation table in Graphic 2-11 Every row for an operation (e.g. Hospital number 17) will contain the facts that its name is St. Vincent's and Fred Fleming is the contact person Criterion of non-redundancy is not being met 45 More Problems Change a fact about a hospital and you will have to change it for every operation in the hospital If we delete the last operation for the hospital, then we effectively delete the hospital 46 Eliminating Redundancy Solve all these problems by removing the hospital information to a separate table, in which each hospital number appears once only (which becomes the primary key) Graphic 2-12 47 Determinants For a given hospital number there could be only one... – – – – hospital name hospital type contact person training status The hospital number is a determinant for the other columns 48 Formal Procedure 1. Identify and determinants (other than the primary key) and the columns they determine 2. Create a separate table for each determinant and its columns (the determinant becomes the primary key) 3. Name the new table 4. Remove the determined columns from the original table, leave the determinant 49 Other Determinants Hospital number + Surgeon Number => Surgeon Specialty Operation code => Operation Name, Procedure Group Drug Short Name => Drug Name, Manufacturer Drug Short Name + Method of Administration + Size of Dose + Unit of Measure = Cost of Dose 50 Exercise Finish normalizing Graphic 2-10 51 Third Normal Form Graphic This 2-13 is the final model model is in Third Normal Form (3NF) 52 Where is Second Normal Form? Our approach took us directly to Third Normal Form Most texts treat this as a two stage process – deal first with the determinants that are part of the table's key (Second Normal Form) – then with the non-key determinants 53 Is Third Normal Form the End? Unfortunately, no Boyce-Codd Normal Form Fourth Normal Form Fifth Normal Form We'll discuss later... 54 Third Normal Form Hint Every non-key column must be a fact about the key, the whole key, and nothing but the key 55 Candidate Keys Sometime more than one column or combination of columns could serve as a primary key (e.g. Drug Name instead of Drug Short Name) These are called Candidate Keys In step 1 we should really say... – Identify any determinants, other than candidate keys... 56 Foreign Keys When we removed repeating groups we carried the primary key of the original table to cross-reference back These columns are called Foreign Keys 57 Self-Referencing Tables A Foreign Key may refer back to the Primary Key of the same table – Example: An Employees table might have employee id as its Primary Key and a Foreign Key of Manager id which refers to another row in the same table The convention to represent Foreign Keys is an asterisk... – EMPLOYEE (Employee Id, Name, Manager id* 58 Referential Integrity If the Employees table has a Department Id foreign key referring back to the Departments table's Primary key, then we would expect to find a valid Department Id on the Departments table for every value of Department id in the Employees table If not, our database lacks Referential Integrity 59 Summary Normalization is a set of techniques for organizing data into tables to eliminate certain types of redundancy an incompleteness Normalization relies on correct identification of determinants and keys 60 Last Slide - Normalization Assignment #9 due next week 61