Normalization: Kroenke Chapters 3 and 4

A relation is categorized by one of several normal forms.
Normal forms are an aid to design: they help characterize relations that experience anomalies in update operations.
Higher normal forms TEND to be better designs, but this is not guaranteed.
Remember the "one fact, one place" theme!

Deletion anomaly: deleting one fact inadvertently deletes another.
Insertion anomaly: inserting one fact is not possible without inserting another, seemingly unrelated, fact.

First Normal Form (1NF)

A relation is 1NF if each attribute is atomic. That is, attributes are simple types (int, float, string, char, etc.).

Second Normal Form (2NF)

How about this as a base table?
[Figure: candidate base table with its primary key marked]

Definition: R is a relation; X and Y are attributes (or sets of attributes) of R. Y is functionally dependent on X iff each X-value in R has precisely one Y-value in R associated with it. A common notation is X -> Y.

Example: In the suppliers table S, Status, City, and Name are functionally dependent on S#. In the SP table, Qty is functionally dependent on the combined attributes S# and P#.
[Diagram: in S, S# -> Status, S# -> City, S# -> Name; in SP, (S#, P#) -> Qty]

City and Status are not functionally dependent on each other.
There may be several entries containing 'London' but with different status values.
There may be several entries with a status of 50 but with different cities.

Qty is not functionally dependent on either P# or S# alone.
S1 might have multiple Qty values for different parts.
Similarly for P1.

Def: Y is fully functionally dependent on X if X -> Y but Y is not functionally dependent on any proper subset of X.
In S: (S#, Status) -> City, but not fully, because S# -> City.
In SP: (S#, P#) -> Qty, fully, because neither S# nor P# by itself determines Qty. The functional dependence requires BOTH S# and P#.

Functional dependence is a semantic notion. You must understand the meaning of the data; a dependence is NOT a consequence of the values that happen to be in the table. For example, suppose that all suppliers in the same city in S have the same status. Is it coincidence or by design?

Why is this important?
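A stored table can refute a functional dependence but can never prove one, which is why the meaning of the data matters. As a minimal sketch of that idea, the Python below scans sample rows for a counterexample to a proposed dependence. The function name violates_fd and the sample rows are hypothetical, made up for illustration; they are not the textbook's tables.

```python
def violates_fd(rows, lhs, rhs):
    """Return True if some lhs-value is paired with two different rhs-values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, then returns the stored value
        if seen.setdefault(x, y) != y:
            return True
    return False

# Hypothetical sample data (assumption, for illustration only)
S = [
    {"S#": "S1", "Status": 20, "City": "London"},
    {"S#": "S2", "Status": 10, "City": "Paris"},
    {"S#": "S3", "Status": 30, "City": "Paris"},
    {"S#": "S4", "Status": 20, "City": "London"},
]
SP = [
    {"S#": "S1", "P#": "P1", "Qty": 300},
    {"S#": "S1", "P#": "P2", "Qty": 200},
    {"S#": "S2", "P#": "P1", "Qty": 300},
    {"S#": "S2", "P#": "P2", "Qty": 400},
]

print(violates_fd(S, ["S#"], ["City"]))        # False: no counterexample found
print(violates_fd(S, ["City"], ["Status"]))    # True: Paris appears with 10 and 30
print(violates_fd(SP, ["S#"], ["Qty"]))        # True: S1 has 300 and 200
print(violates_fd(SP, ["S#", "P#"], ["Qty"]))  # False: no counterexample found
```

A result of False only means these particular rows contain no counterexample. Whether the dependence really holds is still a question about the business rules.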
What if we combined the relations S and SP into a single relation, First, as in a few slides previous?

First(S#, P#, Status, City, QTY), with primary key (S#, P#).

Insertion anomalies:
Cannot enter the fact that a supplier is located in a city unless that supplier already supplies some part. Why?

Deletion anomalies:
Suppose S3 no longer supplies P2. Delete (S3, P2, 10, Paris, 200). If that is the ONLY part S3 supplied, you lose the fact that S3 is in Paris.

Update anomalies:
S1 moves from London to Amsterdam. May have to update many entries. This violates the "one fact, one place" guideline.

These problems are caused by dependences on a proper subset of the primary key. See also Kroenke's example on page 95 and the text on page 96.

Second Normal Form (2NF)

A relation is 2NF iff it is 1NF and every non-key attribute is fully functionally dependent on the primary key. That is, no non-key attribute depends on a proper subset of the primary key.
Table First is NOT 2NF. Some non-key attributes are not fully dependent on the primary key (S#, P#); some depend on S# only.
The S and SP tables ARE 2NF. They are a better design in this case.
Similar example in Fig 3-10 on page 106 of the text.

How about this table? Does it contain redundancy? Are there update anomalies?

Transitive dependencies

Suppose a supplier's status is determined by the supplier's city. That is, City -> Status. Since also S# -> City, the dependence S# -> Status is a result of these two dependencies. A transitive dependency exists:
S# -> City -> Status

Similarly, a Housing table that links a student with a dorm and a residence fee would also likely have a transitive dependence:
SID -> Dorm -> Fee

Insertion anomalies:
Cannot state the fact that a supplier in Rome must have a status of 50 unless there is already a supplier there.
Cannot state the fact that a dorm has a specific cost unless there is already a student there.

Deletion anomalies:
Delete (S5, 30, Athens). If it is the ONLY Athens entry, you lose the fact that the status for Athens must be 30.
Delete (100, Randolph, $3200) from the Housing table. If that is the only "Randolph" entry, you lose the connection between dorm and cost.

Update anomalies:
"Change status of London supplier" may mean multiple updates. This violates the "one fact, one place" rule, i.e. that each fact should be stored in one place.

Third Normal Form (3NF)

A relation is 3NF iff it is 2NF and every non-key attribute is nontransitively dependent on the primary key, i.e. non-key attributes are mutually independent.

Again, it is a consequence of the meaning of the data, not the data itself. Suppose all London suppliers had a status of 50. Is that coincidence? Is it by design?

Question: Is 3NF better than 2NF? Maybe. In the cases presented here, probably so. An employee table where EmpID -> Address -> ZipCode is not 3NF. We may not care about Address -> ZipCode unless it is a UPS or Post Office application.

Table Decomposition

Dividing a table into 2 or more tables to achieve a higher normal form. Previously we divided First into tables S and SP to achieve 2NF. Now we find that S is not 3NF, so we should decompose S into two tables. We have options:
1. SS(S#, Status) and CS(City, Status)
2. SC(S#, City) and SS(S#, Status)
3. SC(S#, City) and CS(City, Status)

Which is best? Need to ask:
Does the decomposition result in a loss of information? For example, can we still relate the attributes that have been separated into two tables?
Are the two relations independent of each other?

Option 1: SS(S#, Status) and CS(City, Status)
Cannot get the city of a supplier. Can you see why?

Option 2: SC(S#, City) and SS(S#, Status)
Relations are not independent. If two suppliers are in the same city, must make sure they have the same status. This requires monitoring of changes, possibly the use of triggers. Extra work.
CAN get the status of a city, but ONLY if there is a supplier there. Otherwise there is a loss of information: you can't store the status of a city unless there is a supplier there.
Option 3: SC(S#, City) and CS(City, Status)
The two relations are independent.
No loss of information.
Best option.

Decompose the Housing table into one of:
1. SD(SID, Dorm) and DF(Dorm, Fee)
2. SD(SID, Dorm) and SF(SID, Fee)
3. SF(SID, Fee) and DF(Dorm, Fee)
Which is better? Construct a similar argument.
SID -> Dorm -> Fee

The best decomposition frequently follows the FD arrows. This is a guideline, not an absolute rule.

Determinants

Consider SMA(SID, MID, AID), where a student has one advisor for a major and an advisor advises for one major.
This table is 3NF since there is only one non-key attribute.
S2 drops Physics and you may lose the fact that A3 advises for Physics.

SID  MID   AID
S1   Math  A1
S1   Phys  A2
S2   Math  A1
S2   Phys  A3

[Diagram: (SID, MID) -> AID and AID -> MID]

Def: If Y is fully functionally dependent on X, then X is a determinant.

Def: A tuple is an entry (row) from a relation. The name is rooted in the historical development by E. F. Codd, who used mathematical models to describe relations.

Def: An attribute (or minimal set of attributes) is a candidate key if it uniquely identifies a tuple. A primary key is chosen from the list of candidate keys. Every candidate key is a determinant.

Boyce-Codd Normal Form (BCNF)

A relation is BCNF if every determinant is a candidate key.
SMA is NOT BCNF since AID is a determinant but not a candidate key.

Possible decompositions:
1. SA(SID, AID) and AM(AID, MID). No loss, but the relations are not independent. How do you "find the major of S1"? It requires a search of two tables, which seems somewhat counterintuitive.
2. SM(SID, MID) and AM(AID, MID). Cannot get the advisor of a student.
3. SA(SID, AID) and SM(SID, MID). Cannot get who advises what.
None of the three possible decompositions seems satisfactory.

Solution: Look at the bigger picture (E-R diagram).
[E-R diagram: entities Major (M), Student (S), and Advisor (A); one relationship is labeled "redundant"]

Relations: S, M, A (with a foreign key matching the primary key in M), SM, and SA to implement the many-many relationships.
NOTE: With BOTH SM and SA, it is possible for inconsistency to occur.
For instance, we could have (S1, M5) in SM and (S1, A3) in SA, with M8 as the foreign key for advisor A3 in the Advisor table. Software or triggers would be needed to assure consistency, which adds overhead.

On the other hand, the relationship between S and M is derived from the relationships between S and A and between A and M. This provides an argument that the relationship between S and M should not be shown as a separate relationship. Of course, then the fact that "a student is majoring in something" is NOT explicitly stored.

The design is based on business rules, which we assume to be correct. That may not always be the case.
Maybe the business rule that states "a student is majoring in something" is flawed: it allows a student to choose a major without having an advisor first.
Perhaps a better rule is "a student has an advisor, which determines the major". That model forces a student to choose an advisor, which may be a better rule since many students do NOT seek out advisors in a timely fashion.

Multivalued Dependencies (MVDs)

Consider SMA(Student, Major, Activity).
A student can have multiple majors and participate in multiple activities.
This relation is BCNF vacuously (there are no determinants).
Can't store the major of a student unless that student has an activity.

Student  Major  Activity
S1       Math   Swimming
S1       Math   Football
S2       Phys   Baseball
S2       Math   Baseball

Another example: CIX(Courses, Instructor, teXt), to implement training programs or corporate-sponsored courses.
A course can be taught by many instructors and an instructor can lead many courses.
Similarly for texts and courses.
Instructors do NOT choose texts.
There are NO determinants in this table.
Can't store the text for a course unless there is an instructor.
Yet another example on pages 95-96.

Def: Suppose A, B, and C are attributes of a relation R. A multivalued dependency (MVD) A ->> B holds in R if for each A-value there are multiple B-values that are independent of any C-values.
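The independence in the MVD definition can be read as follows: for each A-value, every one of its B-values must appear with every one of its C-values. The Python below is a minimal sketch of that check. The function name mvd_holds and the S3 rows are hypothetical additions for illustration; the S1 and S2 rows are the SMA entries shown earlier. As with functional dependencies, a sample of rows can only refute the dependency, since the real rule comes from the meaning of the data.

```python
from itertools import product

def mvd_holds(rows):
    """rows: iterable of (a, b, c) tuples. Check a ->> b on this sample.

    For each a-value, the (b, c) pairs present must equal the full
    cross product of that a-value's b-values and c-values.
    """
    groups = {}
    for a, b, c in rows:
        groups.setdefault(a, set()).add((b, c))
    for pairs in groups.values():
        b_values = {b for b, _ in pairs}
        c_values = {c for _, c in pairs}
        if pairs != set(product(b_values, c_values)):
            return False
    return True

# SMA(Student, Major, Activity) entries from the table above
sma = [
    ("S1", "Math", "Swimming"),
    ("S1", "Math", "Football"),
    ("S2", "Phys", "Baseball"),
    ("S2", "Math", "Baseball"),
]

# Hypothetical rows: S3 has two majors and two activities,
# but only two of the four required combinations are stored.
broken = sma + [
    ("S3", "Math", "Chess"),
    ("S3", "Phys", "Golf"),
]

print(mvd_holds(sma))     # True
print(mvd_holds(broken))  # False
```

The second result shows the cost of keeping majors and activities in one table: every major/activity combination has to be stored for each student, or the table misrepresents two facts that are actually independent.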
Fourth Normal Form (4NF)

A relation is 4NF if it is BCNF and has no multivalued dependencies.
SMA is NOT 4NF. Would decompose into SM and SA. No loss since there is no connection between M and A.
CIX is NOT 4NF. Would decompose into CI and CX. No loss since there is no connection between I and X.

There is a 5NF but we will not cover it. Relations that need it rarely occur in practice.

Domain/Key Normal Form (DK/NF)

Landmark paper: Ronald Fagin, "A Normal Form for Relational Databases That Is Based on Domains and Keys", ACM Transactions on Database Systems, September 1981. In this paper he:
Defined DK/NF.
Proved that a relation in DK/NF has NO modification anomalies.
Proved that a relation having no modification anomalies must be in DK/NF.

What is it? First, some definitions:
Constraint: a rule governing static values of attributes, e.g. rules such as 0 <= gpa <= 4; credits >= 0; functional dependencies; multivalued dependencies.
Key: unique identifier of a tuple.
Domain: description of an attribute's allowable values.

Def: A relation is DK/NF (Domain/Key Normal Form) if every constraint is a logical consequence of the definition of its keys and domains.

Without an example this probably makes little sense.

Ex. (from a previous edition of Kroenke): Track students, faculty, and who advises whom.
Possible relations: Student(SID, Sname, FID) and Faculty(FID, Fname, FacStatus)
Constraints:
FacStatus = 0 or 1 (undergrad/graduate).
FID begins with 1.
SID must not begin with 1.
SID of grad students begins with 9.
Only graduate faculty can advise graduate students.
Alternative constraint statements: "Grad student must be advised by grad faculty"; "If SID starts with 9 then FacStatus of the advisor must be 1".
This is difficult to enforce through the database design since the relevant data lies in two distinct tables.
Each relation is still 1NF through 4NF.

Decomposing tables: Kroenke discussed themes. Each relation has a theme.
3 themes here:
Faculty
Grad advising
Undergrad advising

Possible tables:
Faculty(FID, Fname, FacStatus)
G-ADV(GSID, Sname, GFID)
UG-ADV(UGSID, Sname, FID)

Domain definitions:
FID in CDDD where C = 1; D = decimal digit
  This is a generic notation for our purposes here. In Access you'd write: FID like "1###". In SQL Server you'd write: FID like "1[0-9][0-9][0-9]". (See F-Adv table in the university database.)
Fname in CHAR(30)
FacStatus in {0, 1}
GSID in CDDD where C = 9; D = decimal digit
UGSID in CDDD where C != 1; C != 9; D = decimal digit
  See G-Adv and UG-ADV tables for exact syntax.
Sname in CHAR(30)
GFID in {Select FID of Faculty, where FacStatus = 1} (assuming the DBMS supports this type of constraint)
  There is a trigger in G-Adv to implement the equivalent of this.

Relations & Keys:
All constraints are met by enforcing key and domain restrictions, i.e. the design is DK/NF.
These relations are guaranteed to have NO modification anomalies.

Examples:

A company hires student interns to work on various projects under the guidance of company employees. Semantics are as follows:
A student intern can work on several projects and a project can use several interns.
A project can have several team leaders (or co-leaders), who are company employees, but an employee works on only one project.
For each project on which an intern participates, there is one team leader to whom that intern must report.

Consider a table, IPE, consisting of 3 attributes: intern ID (I#), project ID (P#), and employee ID (E#). So for example, if (I4, P5, E3) is an entry in this table, then intern I4 is working on project P5 and must report to team leader E3 for that project.

List FDs; what should the primary key be? What is the lowest normal form that is violated?
FDs: (I#, P#) -> E# and E# -> P#
Primary key should be (I#, P#).
Violates BCNF.

Consider three possible decompositions of the above relation as follows. Primary keys are underlined on the original slides.

Tables IE(I#, E#) and IP(I#, P#).
IE might contain (I2, E4) and (I2, E6); IP might contain (I2, P3) and (I2, P5). What project does E4 work on? We lose the project to which an employee is assigned.

Tables IE(I#, E#) and EP(E#, P#).
Since E# -> P#, we can get the project that an intern is working on through a join of these two tables.

Tables EP(E#, P#) and IP(I#, P#).
EP might contain (E2, P4) and (E4, P4); IP might contain (I2, P4). To which employee does I2 report? We lose the employee to whom an intern reports.

Assume the following scenario in a university in which a student is paid by a department to do work for a faculty member. Semantics are as follows:
A department can hire many students and a faculty member can have many students working for him/her.
Each student can work for only one faculty member and is paid through the faculty member's department budget.
A department has many faculty members but each faculty member is a member of one department.

There is a table consisting of 3 attributes: student ID (SID), department ID (DID), and faculty ID (FID). So for example, if (S4, D5, F3) is an entry in this table, then student S4 is working for faculty member F3 who, in turn, is a member of department D5.

List FDs; what should the primary key be? What is the lowest normal form that is violated?
FDs: SID -> FID and FID -> DID
Primary key should be SID.
Violates 3NF.

Consider three possible decompositions of the above relation as follows:

SF(SID, FID) and SD(SID, DID): By doing a join between these tables, you can get the department of a faculty member, but only IF the faculty member has a student employee. Also, the tables are not independent.

SF(SID, FID) and FD(FID, DID): By doing a join between these tables, you can get the department that is paying the student.

FD(FID, DID) and SD(SID, DID): Can you construct an example that shows you may not get the faculty member for whom a student is working?
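One way to experiment with these decompositions is to project a sample table onto the two attribute lists, join the projections back together, and compare the result with the original rows. The Python below is a minimal sketch of that test for SF and FD. The helper names (project, natural_join, is_lossless) and the three sample rows are hypothetical, chosen only for illustration. Passing on one sample does not prove a decomposition is lossless in general; that conclusion comes from the functional dependencies.

```python
def project(rows, attrs):
    """Project a list of dict rows onto attrs, dropping duplicate rows."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(seen)]

def natural_join(left, right):
    """Join two lists of dict rows on the attribute names they share."""
    if not left or not right:
        return []
    common = sorted(set(left[0]) & set(right[0]))
    joined = []
    for l_row in left:
        for r_row in right:
            if all(l_row[a] == r_row[a] for a in common):
                joined.append({**l_row, **r_row})
    return joined

def is_lossless(rows, attrs1, attrs2):
    """True if rejoining the two projections gives back exactly the original rows."""
    rejoined = natural_join(project(rows, attrs1), project(rows, attrs2))

    def as_set(rs):
        return {tuple(sorted(r.items())) for r in rs}

    return as_set(rejoined) == as_set(rows)

# Hypothetical sample data (assumption): F1 and F2 are both in department D1
work = [
    {"SID": "S1", "FID": "F1", "DID": "D1"},
    {"SID": "S2", "FID": "F2", "DID": "D1"},
    {"SID": "S3", "FID": "F3", "DID": "D2"},
]

# SF(SID, FID) and FD(FID, DID): the join on FID recovers the original rows
print(is_lossless(work, ["SID", "FID"], ["FID", "DID"]))  # True
```

The same helper can be pointed at the other two decompositions by changing the attribute lists, which is a quick way to check an example you construct by hand.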
From a previous exam

An organization needs to track many ongoing projects, the department responsible for each project, and which employees are project leaders. Rules are as follows:
Each project is the responsibility of a single department.
Each project has one project leader, who is a member of the department responsible for the project.
A department can have many employees and be responsible for many projects.
An employee can be a project leader for several projects.
Each employee is assigned to one department.

Proceed as in the previous slides.