Monash University
School of Information Management and Systems
IMS1907 Database Systems
Semester 2, 2004
Tutorial Weeks 8 & 9 - Normalisation of Data
Tutors Notes

Tutorial Objectives:
- to develop further understanding of detailed data modelling
- to practise normalisation (detailed data modelling) skills

Tutorial Resources:
- Tutorial 8 & 9 handout – Normalisation of data
- IMS1907 Lecture Weeks 7 and 8 – Data modelling and Normalisation

Tutorial Task:
These questions will be considered over the next 2 weeks. There is no need to address all questions – choose a selection that gradually increases in complexity over weeks 8 and 9. Stress the importance of practice to learning this skill, and encourage students to do all these exercises and to find additional exercises in textbooks.

1. Provide short answers to the following questions:

a) What are the objectives of the process of normalising data?
Stable, robust, flexible data structures – it is easy to add new attributes and new structures, and the existing data structures do not change much.

b) What is meant by the term 'primary key'?
An attribute or combination of attributes that uniquely identifies all other attributes in a record or relation.

c) What is meant by the term 'foreign key'?
An attribute that appears in one table as a PK and also in a second table as a means of joining the tables together.

d) What is meant by the term 'candidate key'?
Where two or more attributes (or combinations of attributes) in a relation can each be used to uniquely identify all the other attributes in the relation, we refer to them as candidate keys and choose one to act as the PK. Any further dependencies on the other candidate keys are ignored from that point. We generally choose the most stable attribute or the one over which we have most control.

e) What is meant by the term 'functional dependency'?
A functional dependency occurs where, for each value of an attribute (or combination of attributes) in a relation, there is only ever one value of a second attribute. All non-key attributes in a relation should be fully functionally dependent on the PK.

f) Describe the steps required to convert an unnormalised relation to Third Normal Form relations.
1 – identify PK
2 – identify repeating groups
3 – remove repeating groups → 1NF
4 – identify partial dependencies within the PK
5 – identify partial dependencies of non-key attributes on part of the PK
6 – remove partial dependencies → 2NF
7 – identify transitive dependencies between non-key attributes
8 – remove transitive dependencies → 3NF

2. Investigate the data in the following examples to establish all the business rules. Some clues are given in each case. Draw an ER model for the relation. Then use the steps of normalisation to fully normalise the data given. Check your answer by drawing a data structure diagram for the answer and comparing it with the expected diagram of your business rules.

These are very straightforward – you can do a selection of these but recommend that students attempt them all. It is important that you work through these – show how you get to each form and indicate all dependencies using arrows – I expect this in the exam.

a) EMPLOYEE (EMP-NO., EMP-NAME, SALARY, (PROJ-NO, PROJ-NAME, COMPLETION-DATE))
[Each project has a due date by which completion is expected.]
EMPLOYEE (EMP-NO., EMP-NAME, SALARY, (PROJ-NO, PROJ-NAME, COMPLETION-DATE))

1NF
EMPLOYEE (EMP-NO., EMP-NAME, SALARY)
EMPLOYEE-PROJECT (EMP-NO, PROJ-NO, PROJ-NAME, COMPLETION-DATE)

2NF
The proj-name and completion-date depend on proj-no, resulting in the following relations:
EMPLOYEE (EMP-NO., EMP-NAME, SALARY)
PROJECT (PROJ-NO, PROJ-NAME, COMPLETION-DATE)
EMPLOYEE-PROJECT (EMP-NO, PROJ-NO)

There are no transitive dependencies, so 2NF=3NF

b) EMPLOYEE (EMP-NO, EMP-NAME, EMP-LOCATION, DEPT-NO, DEPT-NAME)
[Each employee has an office that is his/her "location"]

The data set here indicates that an employee only works for one dept – it would result in a slightly different 3NF if the opposite were true.

EMPLOYEE (EMP-NO, EMP-NAME, EMP-LOCATION, DEPT-NO, DEPT-NAME)

1NF
EMPLOYEE (EMP-NO, EMP-NAME, EMP-LOCATION, DEPT-NO, DEPT-NAME)

There is only a single-part PK, so there are no partial dependencies, so 1NF=2NF

3NF
The following transitive dependency exists in the 2NF relation: dept-name depends on dept-no. This results in the following 3NF relations:
EMPLOYEE (EMP-NO, EMP-NAME, EMP-LOCATION, DEPT-NO)
DEPT (DEPT-NO, DEPT-NAME)

c) PROGRAMMER (PROGRAMMER-ID, PROGRAMMER-NAME, (PACKAGE-NO, PACKAGE-NAME, NO-HRS-WORKED))
[A package is a collection of programs. Several programmers may work on the same package at the same time. For costing purposes, the Department wants to know how many hours each programmer spent on each package.]

PROGRAMMER (PROGRAMMER-ID, PROGRAMMER-NAME, (PACKAGE-NO, PACKAGE-NAME, NO-HRS-WORKED))

1NF
PROGRAMMER (PROGRAMMER-ID, PROGRAMMER-NAME)
PROGRAMMER-PACKAGE (PROGRAMMER-ID, PACKAGE-NO, PACKAGE-NAME, NO-HRS-WORKED)

2NF
The package-name depends on the package-no, so the following 2NF relations result:
PROGRAMMER (PROGRAMMER-ID, PROGRAMMER-NAME)
PACKAGE (PACKAGE-NO, PACKAGE-NAME)
PROGRAMMER-PACKAGE (PROGRAMMER-ID, PACKAGE-NO, NO-HRS-WORKED)

There are no transitive dependencies, so 2NF=3NF

d) PART (PART-NO, PART-DESCRIPTION, (SUPPLIER-NO, SUPPLIER-NAME, SUPPLIER-ADDRESS, PRICE))
[The same part may be available from different suppliers at different prices.]
PART (PART-NO, PART-DESCRIPTION, (SUPPLIER-NO, SUPPLIER-NAME, SUPPLIER-ADDRESS, PRICE))

1NF
PART (PART-NO, PART-DESCRIPTION)
PART-SUPPLIER (PART-NO, SUPPLIER-NO, SUPPLIER-NAME, SUPPLIER-ADDRESS, PRICE)

2NF
The supplier-name and supplier-address depend on the supplier-no, so the following 2NF relations result:
PART (PART-NO, PART-DESCRIPTION)
SUPPLIER (SUPPLIER-NO, SUPPLIER-NAME, SUPPLIER-ADDRESS)
PART-SUPPLIER (PART-NO, SUPPLIER-NO, PRICE)

There are no transitive dependencies, so 2NF=3NF

e) EMPLOYEE (EMP-NO, EMP-NAME, (SKILL-CODE, SKILL-DESC), SALARY)
[An employee's initial salary may have taken his skill levels into account but there is no direct relationship between skills and salary level.]

EMPLOYEE (EMP-NO, EMP-NAME, (SKILL-CODE, SKILL-DESC), SALARY)

1NF
EMPLOYEE (EMP-NO, EMP-NAME, SALARY)
EMPLOYEE-SKILL (EMP-NO, SKILL-CODE, SKILL-DESC)

2NF
The skill-desc depends on the skill-code, so the following 2NF relations result:
EMPLOYEE (EMP-NO, EMP-NAME, SALARY)
SKILL (SKILL-CODE, SKILL-DESC)
EMPLOYEE-SKILL (EMP-NO, SKILL-CODE)

There are no transitive dependencies, so 2NF=3NF

f) REGION (REGION-NAME, REGION-MANAGER, LOCATION, (CUST-NAME, CUST-ADDRESS))
[A customer is serviced only by his local region. Any customers with multiple branches are given different customer names.]

REGION (REGION-NAME, REGION-MANAGER, LOCATION, (CUST-NAME, CUST-ADDRESS))

The PK here is interesting – strictly speaking the only key we need is cust-name, as this will uniquely ID the rest of the relation. We will allow the region-name to be included, but strictly speaking it is redundant and over-specified.
Also, region-name, region-manager and location can be considered as candidate keys as they identify each other – we choose region-name as the most stable.

1NF
REGION (REGION-NAME, REGION-MANAGER, LOCATION)
REGION-CUSTOMER (REGION-NAME, CUST-NAME, CUST-ADDRESS)

2NF
The cust-address depends on the cust-name, so the following 2NF relations result:
REGION (REGION-NAME, REGION-MANAGER, LOCATION)
CUSTOMER (CUST-NAME, CUST-ADDRESS)
REGION-CUSTOMER (REGION-NAME, CUST-NAME)

Although region-manager and location are dependent on each other, we ignore this because we discarded them as candidate keys, so 2NF=3NF

The next question was covered in the workshop so it can be ignored – solution included anyway.

2. The data in the following table contains an example of data that is not fully normalised.

Book-no  Copy  Call-no  Borrower-no  Name   Address
1256     3     102.64c  12345        Adams  Brighton
3297     1     356.66d  35666        Boyle  Caulfield
2672     1     785.99e  24287        Boyle  Frankston
1256     1     102.64c  35926        Brown  Caulfield
3357     2     557.22a  23510        Dent   Prahran
6889     2     229.89d  35926        Brown  Caulfield

- Draw the ER model for this relation
- Define the create anomaly using data from the table.
- Define the delete anomaly using data from the table.
- Define the update anomaly using data from the table.
- Express the structure of the table above as a set of 3NF relations. Show the steps you follow to obtain these relations.
- Draw the resulting DSD

BOOK (Book-no, copy, call-no, borrower-no, name, address)

Book-no and call-no are candidate keys – we control book-no, so we choose it. The PK of this relation needs both parts (book-no and copy) to uniquely ID the borrower of a book, but the relation needs to be rearranged to ID repeating groups.
BOOK (Book-no, call-no, (copy, borrower-no, name, address))

1NF
BOOK (Book-no, call-no)
BORROWED-BOOK (Book-no, copy, borrower-no, name, address)

No partial dependencies exist in BORROWED-BOOK, so 1NF=2NF

3NF
Name and address are dependent on borrower-no, resulting in the following 3NF relations:
BOOK (Book-no, call-no)
BORROWER (borrower-no, name, address)
BORROWED-BOOK (Book-no, copy, borrower-no)

3. Consider the data in the table below. The table has been designed to record information about purchase orders. Some business rules may be inferred from the data values but you should list any further assumptions you make.

PO-No   Supp-No  Supp-Name  Item-No  Item-Desc  Qty  Cost  PO-Date
158976  4576     Grey       3593     Nut        35   $8    15/5/98
158976  4576     Grey       9284     Bolt       40   $25   15/5/98
158976  4576     Grey       3598     Washer     30   $5    15/5/98
454638  7589     White      3485     Spring     400  $200  14/5/98
374365  3849     White      3593     Nut        10   $3    12/5/98
374365  3489     White      5467     Screw      11   $3    12/5/98

- Draw the ER model for this relation
- Consider the create, delete and update anomalies using the table data.
- Express the structure of the table above as a set of 3NF relations. Show the steps you follow to obtain these relations.
- Draw the resulting DSD

PO (PO-no, supp-no, supp-name, item-no, item-desc, qty, cost, PO-date)

This needs rearranging to ID repeating groups.

PO (PO-no, PO-date, supp-no, supp-name, (item-no, item-desc, qty, cost))

1NF
PO (PO-no, PO-date, supp-no, supp-name)
PO-ITEM (PO-no, item-no, item-desc, qty, cost)

2NF
Item-desc depends on item-no, and there is no apparent relationship between item-no and its cost. This results in the following 2NF relations:
PO (PO-no, PO-date, supp-no, supp-name)
ITEM (item-no, item-desc)
PO-ITEM (PO-no, item-no, qty, cost)

3NF
Supp-name depends on supp-no, resulting in the following 3NF relations:
PO (PO-no, PO-date, supp-no)
SUPPLIER (supp-no, supp-name)
ITEM (item-no, item-desc)
PO-ITEM (PO-no, item-no, qty, cost)

4. The TopTech Computer Training Company offers courses in IT to businesses and other organisations.
The following database table currently stores the trainee records for the Course and Trainee Records information system at TopTech. Assume that this is the sole table to record data about companies, trainees and the courses they take. Some of the business rules may be inferred from the data in the table. Other business rules you need to consider are as follows:
– A trainee may have attended courses while working for the same or different companies
– A trainee can only attend one course on a given day
– More than one course may be conducted on a single day
– A trainee may 'fail' a course and therefore attend it twice

Company No  Company Name          Phone      Trainee No  Trainee Name   Address    Course     Date     Paid
123         BJP                   9812 3456  4067        Bill Nguyen    Clayton    Notes 1    1/6/98   YES
123         BJP                   9812 3456  4067        Bill Nguyen    Clayton    Notes 2    9/6/98   YES
245         Henderson Consulting  9574 1234  2122        Amanda Pappas  Dandenong  Notes 1    20/7/98  YES
378         Dunlop Uni            9905 5000  4067        Bill Nguyen    Clayton    MS Office  3/8/98   NO
378         Dunlop Uni            9905 5000  3095        Jenny Tran     Berwick    Notes 1    20/7/98  NO
378         Dunlop Uni            9905 5000  1997        John Murphy    Altona     MS Office  3/8/98   NO

- Draw the ER model based on the above information
- Check that your ER diagram is correct and captures all business rules
- Express the structure of the table above as a set of 3NF relations. Show the steps you follow to obtain these relations.
- Draw the resulting DSD

COMPANY (Company-no, name, phone, (trainee-no, trainee-name, address, (course, (date, paid))))

1NF
COMPANY (Company-no, name, phone)
COMPANY-TRAINEE (Company-no, trainee-no, trainee-name, address)
COMPANY-TRAINEE-COURSE (Company-no, trainee-no, course)
COURSE-STATUS (Company-no, trainee-no, course, date, paid)

2NF
Trainee-name and address depend on trainee-no.
Company-no, course and paid are all dependent on trainee-no and date (i.e. if we know trainee-no and date we know the company, the course and its paid status, because the trainee can only be working for one company on that date and can only attend one course).
This results in the following 2NF relations:
COMPANY (Company-no, name, phone)
TRAINEE (trainee-no, trainee-name, address)
COMPANY-TRAINEE (Company-no, trainee-no)
COMPANY-TRAINEE-COURSE (Company-no, trainee-no, course)
COURSE-STATUS (Company-no, trainee-no, course, date, paid)

There are no transitive dependencies, so 2NF=3NF

The DSD for this is as follows:
[DSD diagram: boxes for COMPANY, COMPANY-TRAINEE, TRAINEE, COMPANY-TRAINEE-COURSE and COURSE-STATUS; the connecting lines are not recoverable]

5. The relation below and its accompanying business rules are a portion of the data from a personnel system. Convert this relation into a set of third normal form relations.

Employee (Personnel number, employee name, employee address, employee telephone number, employee date of birth, department number, department name, commencement date, job title, (training course name, training course date, course duration, skill level acquired), (project number, project name, project start date, project end date))

1. Each project may have a number of employees assigned to it.
2. Each course may be attended by a number of employees.
3. Every time a training course is run it is of the same duration.
4. Project start date is the date on which a project commences and project end date is the date on which it is completed.

List any further assumptions you make about the business rules that apply.
Assumptions:
– commencement date and job title relate to the starting date with the company and the job title at that time
– an employee may attend a particular course more than once
– an employee only attends one course on a given day
– each course has a fixed duration

Employee (Personnel number, employee name, employee address, employee telephone number, employee date of birth, department number, department name, commencement date, job title, (training course name, training course date, course duration, skill level acquired), (project number, project name, project start date, project end date))

1NF
Employee (Personnel number, employee name, employee address, employee telephone number, employee date of birth, department number, department name, commencement date, job title)
Employee-Course (Personnel number, training course name, training course date, course duration, skill level acquired)
Employee-Project (Personnel number, project number, project name, project start date, project end date)

2NF
Training course duration is dependent on training course name.
Training course name is dependent on personnel number and training course date.
Project name, project start date and project end date are dependent on project number.
Employee (Personnel number, employee name, employee address, employee telephone number, employee date of birth, department number, department name, commencement date, job title)
Course (training course name, course duration)
Employee-Course (Personnel number, training course name, training course date, skill level acquired)
Project (project number, project name, project start date, project end date)
Employee-Project (Personnel number, project number)

3NF
Department name depends on department number. This results in the following 3NF relations:
Employee (Personnel number, employee name, employee address, employee telephone number, employee date of birth, department number, commencement date, job title)
Department (department number, department name)
Course (training course name, course duration)
Employee-Course (Personnel number, training course name, training course date, skill level acquired)
Project (project number, project name, project start date, project end date)
Employee-Project (Personnel number, project number)

6. The SIMS Alumni Association wishes to keep records of all their past students and their employment history, i.e. the companies they have worked for and the positions they have held within the companies. The data for these records is contained in a single table as follows:

Student No  Student Name  Student Address  Course Code  Course Name  Company Name  Company Address  Position Held       Date
1256        Jane          Brighton         9458         M. Comp      Boles         Carnegie         Programmer          050297
3297        Jack          Caulfield        2358         Bach. I.S.   Felstra       Melbourne        Programmer          030593
2672        Bill          Frankston        9458         M. Comp      Mobil         Melbourne        Analyst Programmer  020794
1256        Jane          Brighton         9458         M. Comp      Waysafe       Geelong          Systems Analyst     100195
1256        Jane          Brighton         2358         Bach. I.S.   Sands         Caulfield        Systems Analyst     050593
1256        Jane          Brighton         9458         M. Comp      Waysafe       Geelong          IT Project Manager  180899

The business rules may be inferred from the data in the table but list any further assumptions you make.
- Draw the ER model based on the above information
- Check that your ER diagram is correct and captures all business rules
- Express the structure of the table above as a set of 3NF relations. Show the steps you follow to obtain these relations.
- Draw the resulting DSD.
From the table above the following UNF relation can be formed:
STUDENT (student-no, student-name, student-address, course-code, course-name, company-name, company-address, position-held, date)

From the table above the following rules can be determined:
– A student can take more than one course
– A student can work at more than one company
– A student can have held more than one position with the same company but on different dates

The following assumption is also made:
– On any given date a student will only hold one position with one company

The following UNF can be determined:
STUDENT (student-no, student-name, student-address, (course-code, course-name), (company-name, company-address, (position-held, date)))

1NF
STUDENT (student-no, student-name, student-address)
STUDENT-COURSE (student-no, course-code, course-name)
STUDENT-COMPANY (student-no, company-name, company-address)
STUDENT-COMPANY-POSITION (student-no, company-name, position-held, date)

The following dependencies exist:
course-code → course-name
company-name → company-address
student-no, date → company-name, position-held (on any particular date, a student can only have been holding one position at one company)

This leads to the following 2NF relations:

2NF
STUDENT (student-no, student-name, student-address)
COURSE (course-code, course-name)
STUDENT-COURSE (student-no, course-code)
COMPANY (company-name, company-address)
STUDENT-COMPANY (student-no, company-name)
STUDENT-COMPANY-POSITION (student-no, company-name, position-held, date)

There are no transitive dependencies, so 2NF=3NF

7. The relations below represent a portion of the data from an insurance claims system. Convert these relations into a set of third normal form relations. Show all intermediate forms of the relation between unnormalised and third normal form. List any assumptions you make concerning the "business rules".
Claim (claim no., claim date, claimant name, claim type, claim type payment rate, claim details, policy no., policy type)
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed, (claim no., date of claim, claim type, claim amount paid))

The following relatively straightforward ER can be drawn:
[ER diagram: POLICY and CLAIM entities]

So merging the two relations above we arrive at the following two UNF relations:
Claim (claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details, policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed, (claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details))

We should now normalise both of these separately and merge our final 3NF relations.

First we normalise the Claim relation:
Claim (claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details, policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)

There are no repeating groups, so the relation is already in 1NF.

1NF
Claim (claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details, policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)

There are no partial dependencies, so 1NF=2NF

2NF
Claim (claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details, policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)

The following transitive dependencies exist:
claim type → claim type payment rate
policy no. → policy type, policy date, policy renewal date, policy amount due, no. of claims processed

These result in the following 3NF relations:

3NF
Claim-Type (claim type, claim type payment rate)
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)
Claim (claim no., claim date, claimant name, claim type, claim amount paid, claim details, policy no.)

Now we normalise the Policy relation:
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed, (claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details))

1NF
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)
Policy-Claim (policy no., claim no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details)

The following dependencies exist:
claim no. → policy no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details

Removing these dependencies we end up with:

2NF
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)
Claim (claim no., policy no., claim date, claimant name, claim type, claim type payment rate, claim amount paid, claim details)

The following transitive dependency exists:
claim type → claim type payment rate

Removing this dependency we end up with:

3NF
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)
Claim (claim no., policy no., claim date, claimant name, claim type, claim amount paid, claim details)
Claim-Type (claim type, claim type payment rate)

Merging the two groups of relations we end up with the following final 3NF relations:

3NF
Policy (policy no., policy type, policy date, policy renewal date, policy amount due, no. of claims processed)
Claim (claim no., policy no., claim date, claimant name, claim type, claim amount paid, claim details)
Claim-Type (claim type, claim type payment rate)
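
The question 7 answer can be cross-checked mechanically. What follows is a minimal Python sketch under stated assumptions: the names FDS, closure, violations, MERGED_CLAIM and FINAL_3NF are illustrative only, the attribute names are underscore versions of those used above, and the dependencies are the three identified in the solution. The check is deliberately simplified: it only tests the listed dependencies, and it flags any whose determinant does not identify the whole relation. For these relations that gives the same result as looking for transitive dependencies by hand.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

FD = Tuple[FrozenSet[str], FrozenSet[str]]


def fd(lhs: Iterable[str], rhs: Iterable[str]) -> FD:
    """Build one functional dependency lhs -> rhs."""
    return frozenset(lhs), frozenset(rhs)


# Dependencies taken from the question 7 solution (names are assumptions).
FDS: List[FD] = [
    fd(["claim_no"], ["policy_no", "claim_date", "claimant_name",
                      "claim_type", "claim_amount_paid", "claim_details"]),
    fd(["claim_type"], ["claim_type_payment_rate"]),
    fd(["policy_no"], ["policy_type", "policy_date", "policy_renewal_date",
                       "policy_amount_due", "no_of_claims_processed"]),
]


def closure(attrs: Iterable[str], fds: List[FD]) -> Set[str]:
    """Return every attribute determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violations(relation: Iterable[str],
               fds: List[FD]) -> List[Tuple[List[str], List[str]]]:
    """List dependencies inside `relation` whose determinant is not a key of it."""
    attrs = set(relation)
    found = []
    for lhs, rhs in fds:
        dependents = (rhs & attrs) - lhs
        if lhs <= attrs and dependents and not attrs <= closure(lhs, fds):
            found.append((sorted(lhs), sorted(dependents)))
    return found


# The merged UNF Claim relation holds everything that claim_no determines.
MERGED_CLAIM = sorted(closure(["claim_no"], FDS))

# The final 3NF relations from the solution.
FINAL_3NF = {
    "Policy": ["policy_no", "policy_type", "policy_date",
               "policy_renewal_date", "policy_amount_due",
               "no_of_claims_processed"],
    "Claim": ["claim_no", "policy_no", "claim_date", "claimant_name",
              "claim_type", "claim_amount_paid", "claim_details"],
    "Claim-Type": ["claim_type", "claim_type_payment_rate"],
}

print("Merged Claim:", violations(MERGED_CLAIM, FDS))
for name, attrs in FINAL_3NF.items():
    print(name + ":", violations(attrs, FDS))
```

Run against the merged Claim relation, the sketch reports the claim type and policy no. dependencies that the solution removes; run against each of the three final relations, it reports nothing. The same approach could be reused for the other exercises by swapping in their dependencies.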