Normalization Process

Relations can fall into one or more categories (or classes) called normal forms.

Normal Form: A class of relations free from a certain set of modification anomalies.

Normal forms are given names such as:
  o First normal form (1NF)
  o Second normal form (2NF)
  o Third normal form (3NF)
  o Boyce-Codd normal form (BCNF)
  o Fourth normal form (4NF)
  o Fifth normal form (5NF)
  o Domain-Key normal form (DK/NF)

These forms are cumulative. A relation in third normal form is also in 2NF and 1NF.

The normalization process for a given relation consists of:
  1. Specify the key of the relation.
  2. Specify the functional dependencies of the relation. Sample data (tuples) for the relation can assist with this step.
  3. Apply the definition of each normal form (starting with 1NF).
  4. If a relation fails to meet the definition of a normal form, change the relation (most often by splitting the relation into two new relations) until it meets the definition.
  5. Re-test the modified/new relations to ensure they meet the definitions of each normal form.

In the next set of notes, each of the normal forms will be defined along with an example of the normalization steps.

First Normal Form (1NF)

A relation is in first normal form if it meets the definition of a relation:
  1. Each attribute (column) value must be a single value only.
  2. All values for a given attribute (column) must be of the same type.
  3. Each attribute (column) name must be unique.
  4. The order of attributes (columns) is insignificant.
  5. No two tuples (rows) in a relation can be identical.
  6. The order of the tuples (rows) is insignificant.

If you have a key defined for the relation, then you can meet the unique row requirement.
Example relation in 1NF:

STOCKS (Company, Symbol, Headquarters, Date, Close_Price)

Company     Symbol   Headquarters         Date         Close Price
Microsoft   MSFT     Redmond, WA          09/07/2010   23.96
Microsoft   MSFT     Redmond, WA          09/08/2010   23.93
Microsoft   MSFT     Redmond, WA          09/09/2010   24.01
Oracle      ORCL     Redwood Shores, CA   09/07/2010   24.27
Oracle      ORCL     Redwood Shores, CA   09/08/2010   24.14
Oracle      ORCL     Redwood Shores, CA   09/09/2010   24.33

Note that the key (which consists of the Symbol and the Date) can uniquely determine the Company, Headquarters and Close Price of the stock. Here we assume that Symbol must be unique but Company, Headquarters, Date and Price are not unique.

Second Normal Form (2NF)

A relation is in second normal form (2NF) if all of its non-key attributes are dependent on all of the key.

Relations that have a single attribute for a key are automatically in 2NF. This is one reason why we often use artificial identifiers as keys.

In the example below, Close Price is dependent on the whole key (Symbol, Date).

The following example relation is not in 2NF:

STOCKS (Company, Symbol, Headquarters, Date, Close_Price)

Company     Symbol   Headquarters         Date         Close Price
Microsoft   MSFT     Redmond, WA          09/07/2010   23.96
Microsoft   MSFT     Redmond, WA          09/08/2010   23.93
Microsoft   MSFT     Redmond, WA          09/09/2010   24.01
Oracle      ORCL     Redwood Shores, CA   09/07/2010   24.27
Oracle      ORCL     Redwood Shores, CA   09/08/2010   24.14
Oracle      ORCL     Redwood Shores, CA   09/09/2010   24.33

List the functional dependencies (FD):
  FD1: Symbol, Date -> Company, Headquarters, Close Price
  FD2: Symbol -> Company, Headquarters

Consider that Symbol, Date -> Close Price, so we might use Symbol, Date as our key. However, Symbol -> Headquarters. This violates the rule for 2NF. Also, consider the insertion and deletion anomalies.

Another name for this is a partial key dependency: Symbol is only a “part” of the key and it determines a non-key attribute.
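The partial key dependency described above can be looked for mechanically in the sample data. The following is a minimal Python sketch, not part of the original notes; the helper names fd_holds and partial_dependencies are hypothetical. Note that sample tuples can only refute a functional dependency, never prove it, so any dependency the sketch reports still has to be confirmed against the real-world rules.

```python
from itertools import combinations

COLUMNS = ("Company", "Symbol", "Headquarters", "Date", "Close_Price")
ROWS = [
    ("Microsoft", "MSFT", "Redmond, WA", "09/07/2010", 23.96),
    ("Microsoft", "MSFT", "Redmond, WA", "09/08/2010", 23.93),
    ("Microsoft", "MSFT", "Redmond, WA", "09/09/2010", 24.01),
    ("Oracle", "ORCL", "Redwood Shores, CA", "09/07/2010", 24.27),
    ("Oracle", "ORCL", "Redwood Shores, CA", "09/08/2010", 24.14),
    ("Oracle", "ORCL", "Redwood Shores, CA", "09/09/2010", 24.33),
]
STOCKS = [dict(zip(COLUMNS, row)) for row in ROWS]


def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs is not contradicted by any pair of rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """List (subset, attribute) pairs where a proper subset of the key
    appears to determine a non-key attribute (a 2NF violation)."""
    found = []
    for size in range(1, len(key)):
        for subset in combinations(key, size):
            for attr in non_key:
                if fd_holds(rows, subset, (attr,)):
                    found.append((subset, attr))
    return found


violations = partial_dependencies(
    STOCKS, ("Symbol", "Date"), ("Company", "Headquarters", "Close_Price")
)
print(violations)
# [(('Symbol',), 'Company'), (('Symbol',), 'Headquarters')]
```

Both reported pairs correspond to FD2 (Symbol -> Company, Headquarters), which is why STOCKS is not in 2NF.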
One Solution: Split this up into two new relations:

COMPANY (Company, Symbol, Headquarters)
STOCK_PRICES (Symbol, Date, Close_Price)

At this point we have two new relations in our relational model. The original “STOCKS” relation we started with is removed from the model.

Sample data and functional dependencies for the two new relations:

COMPANY relation:

Company     Symbol   Headquarters
Microsoft   MSFT     Redmond, WA
Oracle      ORCL     Redwood Shores, CA

  FD1: Symbol -> Company, Headquarters

STOCK_PRICES relation:

Symbol   Date         Close Price
MSFT     09/07/2010   23.96
MSFT     09/08/2010   23.93
MSFT     09/09/2010   24.01
ORCL     09/07/2010   24.27
ORCL     09/08/2010   24.14
ORCL     09/09/2010   24.33

  FD1: Symbol, Date -> Close Price

In checking these new relations we can confirm that they meet the definition of 1NF (each one has well defined unique keys) and 2NF (no partial key dependencies).

Third Normal Form (3NF)

A relation is in third normal form (3NF) if it is in second normal form and it contains no transitive dependencies.

Consider relation R containing attributes A, B and C: R(A, B, C)

If A -> B and B -> C then A -> C

Transitive Dependency: Three attributes with the above dependencies.

Example: At CUNY:
  Course_Code -> Course_Number, Section
  Course_Number, Section -> Classroom, Professor

Consider one of the new relations we created in the STOCKS example for 2nd normal form:

Company     Symbol   Headquarters
Microsoft   MSFT     Redmond, WA
Oracle      ORCL     Redwood Shores, CA

The functional dependencies we can see are:
  Symbol -> Company
  Company -> Headquarters
  so therefore: Symbol -> Headquarters

This is a transitive dependency. What happens if we remove Oracle? We lose information about 2 different facts.

The solution again is to split this relation up into two new relations:

STOCK_SYMBOLS (Company, Symbol)
COMPANY_HEADQUARTERS (Company, Headquarters)

This gives us the following sample data and FD for the new relations:

Company     Symbol
Microsoft   MSFT
Oracle      ORCL

  FD1: Symbol -> Company

Company     Headquarters
Microsoft   Redmond, WA
Oracle      Redwood Shores, CA

  FD1: Company -> Headquarters

Again, each of these new relations should be checked to ensure they meet the definition of 1NF, 2NF and now 3NF.

Boyce-Codd Normal Form (BCNF)

A relation is in BCNF if every determinant is a candidate key.

Recall that not all determinants are keys. Those determinants that are keys we initially call candidate keys. Eventually, we select a single candidate key to be the key for the relation.

Consider the following example:
  o Funds consist of one or more Investment Types.
  o Funds are managed by one or more Managers.
  o Investment Types can have one or more Managers.
  o Managers only manage one type of investment.

Relation: FUNDS (FundID, InvestmentType, Manager)

FundID   InvestmentType    Manager
99       Common Stock      Smith
99       Municipal Bonds   Jones
33       Common Stock      Green
22       Growth Stocks     Brown
11       Common Stock      Smith

  FD1: FundID, InvestmentType -> Manager
  FD2: FundID, Manager -> InvestmentType
  FD3: Manager -> InvestmentType

In this case, the combination FundID and InvestmentType forms a candidate key because we can use FundID, InvestmentType to uniquely identify a tuple in the relation.

Similarly, the combination FundID and Manager also forms a candidate key because we can use FundID, Manager to uniquely identify a tuple.

Manager by itself is not a candidate key because we cannot use Manager alone to uniquely identify a tuple in the relation.

Is this relation FUNDS (FundID, InvestmentType, Manager) in 1NF, 2NF or 3NF? Given we pick FundID, InvestmentType as the primary key:

1NF for sure.
2NF because the non-key attribute (Manager) is dependent on all of the key.
3NF because there are no transitive dependencies.

However, consider what happens if we delete the tuple with FundID 22. We lose the fact that Brown manages the InvestmentType “Growth Stocks.”

Therefore, while the FUNDS relation is in 1NF, 2NF and 3NF, it is not in BCNF because not all determinants (Manager in FD3) are candidate keys.

The following are steps to normalize a relation into BCNF:
  1. List all of the determinants.
  2. See if each determinant can act as a key (candidate keys).
  3. For any determinant that is not a candidate key, create a new relation from the functional dependency. Retain the determinant in the original relation.

For our example: FUNDS (FundID, InvestmentType, Manager)

  1. The determinants are:
       FundID, InvestmentType
       FundID, Manager
       Manager
  2. Which determinants can act as keys?
       FundID, InvestmentType   YES
       FundID, Manager          YES
       Manager                  NO
  3. Create a new relation from the functional dependency:
       MANAGERS (Manager, InvestmentType)
       FUND_MANAGERS (FundID, Manager)

In this last step, we have retained the determinant “Manager” in the original relation (now FUND_MANAGERS) and created the new relation MANAGERS from FD3.

Each of the new relations should be checked to ensure they meet the definitions of 1NF, 2NF, 3NF and BCNF.

Fourth Normal Form (4NF)

A relation is in fourth normal form if it is in BCNF and it contains no multivalued dependencies.

Multivalued Dependency: A type of functional dependency where the determinant can determine more than one value.

More formally, there are 3 criteria:
  1. There must be at least 3 attributes in the relation. Call them A, B, and C, for example.
  2. Given A, one can determine multiple values of B. Given A, one can determine multiple values of C.
  3. B and C are independent of one another.

Book example: Student has one or more majors. Student participates in one or more activities.
StudentID   Major        Activities
100         CIS          Baseball
100         CIS          Volleyball
100         Accounting   Baseball
100         Accounting   Volleyball
200         Marketing    Swimming

  FD1: StudentID ->-> Major
  FD2: StudentID ->-> Activities

Portfolio ID   Stock Fund            Bond Fund
999            Janus Fund            Municipal Bonds
999            Janus Fund            Dreyfus Short-Intermediate Municipal Bond Fund
999            Scudder Global Fund   Municipal Bonds
999            Scudder Global Fund   Dreyfus Short-Intermediate Municipal Bond Fund
888            Kaufmann Fund         T. Rowe Price Emerging Markets Bond Fund

A few characteristics:
  1. No regular functional dependencies.
  2. All three attributes taken together form the key.
  3. Latter two attributes are independent of one another.
  4. Insertion anomaly: Cannot add a stock fund without adding a bond fund (NULL value). Must always maintain the combinations to preserve the meaning.

Stock Fund and Bond Fund form a multivalued dependency on Portfolio ID.

  PortfolioID ->-> Stock Fund
  PortfolioID ->-> Bond Fund

Resolution: Split into two tables with the common key:

Portfolio ID   Stock Fund
999            Janus Fund
999            Scudder Global Fund
888            Kaufmann Fund

Portfolio ID   Bond Fund
999            Municipal Bonds
999            Dreyfus Short-Intermediate Municipal Bond Fund
888            T. Rowe Price Emerging Markets Bond Fund

Fifth Normal Form (5NF)

Also called “Projection Join” normal form. There are certain conditions under which, after decomposing a relation, it cannot be reassembled back into its original form. We don’t consider these issues here.

Domain Key Normal Form (DK/NF)

A relation is in DK/NF if every constraint on the relation is a logical consequence of the definition of keys and domains.

Constraint: A rule governing static values of an attribute such that we can determine if this constraint is True or False. Examples:
  1. Functional Dependencies
  2. Multivalued Dependencies
  3. Inter-relation rules
  4. Intra-relation rules
However: Does not include time-dependent constraints.

Key: Unique identifier of a tuple.
Domain: The physical (data type, size, NULL values) and semantic (logical) description of what values an attribute can hold.

There is no known algorithm for converting a relation directly into DK/NF.

De-Normalization

Consider the following relation:

CUSTOMER (CustomerID, Name, Address, City, State, Zip)

This relation is not in DK/NF because it contains a functional dependency not implied by the key:

  Zip -> City, State

We can normalize this into DK/NF by splitting the CUSTOMER relation into two:

CUSTOMER (CustomerID, Name, Address, Zip)
ZIPCODES (Zip, City, State)

We may pay a performance penalty: each customer address lookup requires that we look in two relations (tables). In such cases, we may de-normalize the relations to achieve a performance improvement.

All-in-One Database Normalization Example

Many of you asked for a “complete” example that would run through all of the normal forms from beginning to end using the same tables. This is tough to do, but here is an attempt:

Example relation:

EMPLOYEE (Name, Project, Task, Office, Floor, Phone)

Note: Keys are underlined.

Example Data:

Name   Project   Task   Office   Floor   Phone
Bill   100X      T1     400      4       1400
Bill   100X      T2     400      4       1400
Bill   200Y      T1     400      4       1400
Bill   200Y      T2     400      4       1400
Sue    100X      T33    442      4       1442
Sue    200Y      T33    442      4       1442
Sue    300Z      T33    442      4       1442
Ed     100X      T2     588      5       1588

  o Name is the employee’s name.
  o Project is the project they are working on. Bill is working on two different projects, Sue is working on 3.
  o Task is the current task being worked on. Bill is now working on Tasks T1 and T2. Note that Tasks are independent of the project. Examples of a task might be faxing a memo or holding a meeting.
  o Office is the office number for the employee. Bill works in office number 400.
  o Floor is the floor on which the office is located.
  o Phone is the phone extension. Note this is associated with the phone in the given office.

First Normal Form

Assume the key is Name, Project, Task. Is EMPLOYEE in 1NF?
Second Normal Form

List all of the functional dependencies for EMPLOYEE. Are all of the non-key attributes dependent on all of the key?

Split into two relations EMPLOYEE_PROJECT_TASK and EMPLOYEE_OFFICE_PHONE.

EMPLOYEE_PROJECT_TASK (Name, Project, Task)

Name   Project   Task
Bill   100X      T1
Bill   100X      T2
Bill   200Y      T1
Bill   200Y      T2
Sue    100X      T33
Sue    200Y      T33
Sue    300Z      T33
Ed     100X      T2

EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone)

Name   Office   Floor   Phone
Bill   400      4       1400
Sue    442      4       1442
Ed     588      5       1588

Third Normal Form

Assume each office has exactly one phone number. Are there any transitive dependencies? Where are the modification anomalies in EMPLOYEE_OFFICE_PHONE?

Split EMPLOYEE_OFFICE_PHONE.

EMPLOYEE_PROJECT_TASK (Name, Project, Task)

Name   Project   Task
Bill   100X      T1
Bill   100X      T2
Bill   200Y      T1
Bill   200Y      T2
Sue    100X      T33
Sue    200Y      T33
Sue    300Z      T33
Ed     100X      T2

EMPLOYEE_OFFICE (Name, Office, Floor)

Name   Office   Floor
Bill   400      4
Sue    442      4
Ed     588      5

EMPLOYEE_PHONE (Office, Phone)

Office   Phone
400      1400
442      1442
588      1588

Boyce-Codd Normal Form

List all of the functional dependencies for EMPLOYEE_PROJECT_TASK, EMPLOYEE_OFFICE and EMPLOYEE_PHONE. Look at the determinants. Are all determinants candidate keys?

Fourth Normal Form

Are there any multivalued dependencies? What are the modification anomalies?

Split EMPLOYEE_PROJECT_TASK.

EMPLOYEE_PROJECT (Name, Project)

Name   Project
Bill   100X
Bill   200Y
Sue    100X
Sue    200Y
Sue    300Z
Ed     100X

EMPLOYEE_TASK (Name, Task)

Name   Task
Bill   T1
Bill   T2
Sue    T33
Ed     T2

EMPLOYEE_OFFICE (Name, Office, Floor)

Name   Office   Floor
Bill   400      4
Sue    442      4
Ed     588      5

EMPLOYEE_PHONE (Office, Phone)

Office   Phone
400      1400
442      1442
588      1588

At each step of the process, we did the following:
  1. Write out the relation.
  2. (Optionally) write out some example data.
  3. Write out all of the functional dependencies.
  4. Starting with 1NF, go through each normal form and state why the relation is in the given normal form.
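One way to gain confidence in the fourth normal form split is to check that joining EMPLOYEE_PROJECT and EMPLOYEE_TASK on Name gives back exactly the tuples of EMPLOYEE_PROJECT_TASK. The following is a minimal Python sketch of that check using the sample data above; it is an illustration added here rather than part of the original example, and it only tests the sample tuples, not the relation in general.

```python
# Sample tuples of EMPLOYEE_PROJECT_TASK (Name, Project, Task) from the example.
EMPLOYEE_PROJECT_TASK = {
    ("Bill", "100X", "T1"),
    ("Bill", "100X", "T2"),
    ("Bill", "200Y", "T1"),
    ("Bill", "200Y", "T2"),
    ("Sue", "100X", "T33"),
    ("Sue", "200Y", "T33"),
    ("Sue", "300Z", "T33"),
    ("Ed", "100X", "T2"),
}

# Project onto (Name, Project) and (Name, Task); sets drop duplicate tuples.
employee_project = {(name, project) for name, project, _ in EMPLOYEE_PROJECT_TASK}
employee_task = {(name, task) for name, _, task in EMPLOYEE_PROJECT_TASK}

# Natural join of the two projections on the common attribute Name.
rejoined = {
    (name, project, task)
    for name, project in employee_project
    for other_name, task in employee_task
    if name == other_name
}

print(len(employee_project), len(employee_task))  # 6 4
print(rejoined == EMPLOYEE_PROJECT_TASK)          # True
```

The join reproduces the original eight tuples because Project and Task are independent of one another for each Name, which is exactly the multivalued dependency condition.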
Another short example

Consider the following example of normalization for a CUSTOMER relation.

Relation Name

CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone)

Example Data

CustomerID   Name         Street          City            State   Zip     Phone
C101         Bill Smith   123 First St.   New Brunswick   NJ      07101   732-555-1212
C102         Mary Green   11 Birch St.    Old Bridge      NJ      07066   908-555-1212

Functional Dependencies

  CustomerID -> Name, Street, City, State, Zip, Phone
  Zip -> City, State

Normalization

1NF: Meets the definition of a relation.
2NF: All non-key attributes are dependent on all of the key.
3NF: Relation CUSTOMER is not in 3NF because there is a transitive dependency: CustomerID -> Zip and Zip -> City, State.

Solution: Split CUSTOMER into two relations:

CUSTOMER (CustomerID, Name, Street, Zip, Phone)
ZIPCODES (Zip, City, State)

Check both CUSTOMER and ZIPCODES to ensure they are both in 1NF up to BCNF.

As a final step, consider de-normalization.

Relational Algebra - Example

Contents
  Symbolic Notation
  Usage
  Rename Operator
  Derivable Operators
  Equivalence
  Equivalences
  Comparing RA and SQL
  Comparing RA and SQL

Consider the following SQL to find which departments have had employees on the 'Further Accounting' course.

  SELECT DISTINCT dname
  FROM department, course, empcourse, employee
  WHERE cname = 'Further Accounting'
    AND course.courseno = empcourse.courseno
    AND empcourse.empno = employee.empno
    AND employee.depno = department.depno;

The equivalent relational algebra is

  PROJECT_{dname} (department JOIN_{depno = depno} (
    PROJECT_{depno} (employee JOIN_{empno = empno} (
      PROJECT_{empno} (empcourse JOIN_{courseno = courseno} (
        PROJECT_{courseno} (SELECT_{cname = 'Further Accounting'} (course))
      ))
    ))
  ))

Symbolic Notation

From the example, one can see that for complicated cases a large amount of the answer is formed from operator names, such as PROJECT and JOIN. It is therefore commonplace to use symbolic notation to represent the operators.
  SELECT       -> σ (sigma)
  PROJECT      -> π (pi)
  PRODUCT      -> × (times)
  JOIN         -> |×| (bow-tie)
  UNION        -> ∪ (cup)
  INTERSECTION -> ∩ (cap)
  DIFFERENCE   -> - (minus)
  RENAME       -> ρ (rho)

Usage

The symbolic operators are used as with the verbal ones. So, to find all employees in department 1:

  SELECT_{depno = 1}(employee)
  becomes
  σ_{depno = 1}(employee)

Conditions can be combined together using ^ (AND) and v (OR). For example, all employees in department 1 called 'Smith':

  SELECT_{depno = 1 ^ surname = 'Smith'}(employee)
  becomes
  σ_{depno = 1 ^ surname = 'Smith'}(employee)

The use of the symbolic notation can lend itself to brevity. Even better, when the JOIN is a natural join, the JOIN condition may be omitted from |×|. The earlier example resulted in:

  PROJECT_{dname} (department JOIN_{depno = depno} (
    PROJECT_{depno} (employee JOIN_{empno = empno} (
      PROJECT_{empno} (empcourse JOIN_{courseno = courseno} (
        PROJECT_{courseno} (SELECT_{cname = 'Further Accounting'} (course))
      ))
    ))
  ))

becomes

  π_{dname} (department |×| (
    π_{depno} (employee |×| (
      π_{empno} (empcourse |×| (
        π_{courseno} (σ_{cname = 'Further Accounting'} (course))
      ))
    ))
  ))

Rename Operator

The rename operator returns an existing relation under a new name. ρ_{A}(B) is the relation B with its name changed to A. For example, find the employees in the same department as employee 3:

  π_{emp2.surname, emp2.forenames} (
    σ_{employee.empno = 3 ^ employee.depno = emp2.depno} (
      employee × (ρ_{emp2}(employee))
    )
  )

Derivable Operators

Fundamental operators: σ, π, ×, ∪, -, ρ
Derivable operators: |×|, ∩

  A ∩ B ⇔ A - (A - B)

Equivalence

  A |×|_{c} B ⇔ π_{a1,a2,...,aN}(σ_{c}(A × B))

where c is the join condition (e.g. A.a1 = B.a1), and a1,a2,...,aN are all the attributes of A and B without repetition.

c is called the join-condition, and is usually the comparison of primary and foreign key. Where there are N tables, there are usually N-1 join-conditions. In the case of a natural join, the conditions can be omitted, but otherwise omitting conditions results in a cartesian product (a common mistake to make).
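The equivalence above can be tried out with a few lines of Python. This is a minimal sketch, assuming relations are represented as lists of dictionaries; the functions product, select and project and the sample employee and department rows are hypothetical illustrations, not taken from any particular DBMS or from the example database.

```python
from itertools import product as pairs


def product(a, a_name, b, b_name):
    """A × B, with attribute names qualified as 'relation.attribute'."""
    return [
        {**{f"{a_name}.{k}": v for k, v in row_a.items()},
         **{f"{b_name}.{k}": v for k, v in row_b.items()}}
        for row_a, row_b in pairs(a, b)
    ]


def select(rel, condition):
    """σ: keep the rows for which condition(row) is true."""
    return [row for row in rel if condition(row)]


def project(rel, attrs):
    """π: keep only the named attributes and drop duplicate rows."""
    out = []
    for row in rel:
        reduced = {a: row[a] for a in attrs}
        if reduced not in out:
            out.append(reduced)
    return out


# Hypothetical sample data.
employee = [
    {"empno": 1, "surname": "Smith", "depno": 1},
    {"empno": 2, "surname": "Jones", "depno": 2},
    {"empno": 3, "surname": "Brown", "depno": 1},
]
department = [
    {"depno": 1, "dname": "Accounts"},
    {"depno": 2, "dname": "Sales"},
]

# Without a join condition we get the full cartesian product.
crossed = product(employee, "employee", department, "department")

# employee |×| department written as π(σ_c(employee × department)).
joined = project(
    select(crossed, lambda r: r["employee.depno"] == r["department.depno"]),
    ["employee.empno", "employee.surname", "employee.depno", "department.dname"],
)

print(len(crossed))  # 6
print(len(joined))   # 3
```

The six rows of crossed against the three rows of joined show the common mistake mentioned above: leaving out the join condition pairs every employee with every department.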
Equivalences

The same relational algebraic expression can be written in many different ways. The order in which tuples appear in relations is never significant.

  A × B ⇔ B × A
  A ∩ B ⇔ B ∩ A
  A ∪ B ⇔ B ∪ A
  (A - B) is not the same as (B - A)
  σ_{c1}(σ_{c2}(A)) ⇔ σ_{c2}(σ_{c1}(A)) ⇔ σ_{c1 ^ c2}(A)
  π_{a1}(A) ⇔ π_{a1}(π_{a1,etc}(A))
  where etc represents any other attributes of A.

Many other equivalences exist.

While equivalent expressions always give the same result, some may be much easier to evaluate than others. When any query is submitted to the DBMS, its query optimiser tries to find the most efficient equivalent expression before evaluating it.

Comparing RA and SQL

Relational algebra:
  o is closed (the result of every expression is a relation)
  o has a rigorous foundation
  o has simple semantics
  o is used for reasoning, query optimisation, etc.

SQL:
  o is a superset of relational algebra
  o has convenient formatting features, etc.
  o provides aggregate functions
  o has complicated semantics
  o is an end-user language.

Comparing RA and SQL

Any relational language as powerful as relational algebra is called relationally complete. A relationally complete language can perform all basic, meaningful operations on relations. Since SQL is a superset of relational algebra, it is also relationally complete.

Concurrency using Transactions

Contents
  Transactions
  Transaction Schedules
  Lost Update scenario
  Uncommitted Dependency
  Inconsistency
  Serialisability
  Precedence Graph
  Precedence Graph : Method
  Example 1
  Example 2

The goal in a 'concurrent' DBMS is to allow multiple users to access the database simultaneously without interfering with each other.

A problem with multiple users using the DBMS is that it may be possible for two users to try and change data in the database simultaneously. If this type of action is not carefully controlled, inconsistencies are possible.

To control data access, we first need a concept to allow us to encapsulate database accesses. Such encapsulation is called a 'Transaction'.
Transactions

A transaction (ACID) is a unit of logical work and recovery:
  o A - atomicity (for integrity)
  o C - consistency preservation
  o I - isolation
  o D - durability

Transactions are available in SQL. Some applications require nested or long transactions.

After work is performed in a transaction, two outcomes are possible:
  o Commit - Any changes made during the transaction by this transaction are committed to the database.
  o Abort - None of the changes made during the transaction by this transaction are made to the database. The result of this is as if the transaction was never started.

Transaction Schedules

A transaction schedule is a tabular representation of how a set of transactions were executed over time. This is useful when examining problem scenarios. Within the diagrams various nomenclatures are used:
  o READ(a) - This is a read action on an attribute or data item called 'a'.
  o WRITE(x,a) - This is a write action on an attribute or data item called 'a', where the value 'x' is written into 'a'.
  o tn (e.g. t1, t2, t10) - This indicates the time at which something occurred. The units are not important, but tn always occurs before tn+1.

Consider transaction A, which loads in a bank account balance X (initially 20) and adds 10 pounds to it. Such a schedule would look like this:

Time   Transaction A
t1     TOTAL:=READ(X)
t2     TOTAL:=TOTAL+10
t3     WRITE(TOTAL,X)

Now consider that, at the same time as transaction A runs, transaction B runs. Transaction B gives all accounts a 10% increase. Will X be 32 or 33?

Time   Transaction A     Value TOTAL   Transaction B           Value BALANCE
t1                                     BALANCE:=READ(X)        20
t2     TOTAL:=READ(X)    20
t3     TOTAL:=TOTAL+10   30
t4     WRITE(TOTAL,X)    30
t5                                     BALANCE:=BALANCE*110%   22
t6                                     WRITE(BALANCE,X)        22

Whoops... X is 22! Depending on the interleaving, X can also be 32, 33, or 30. Let's classify erroneous scenarios.

Lost Update scenario

Time   Transaction A   Transaction B
t1     X=READ(R)
t2                     Y=READ(R)
t3     WRITE(X,R)
t4                     WRITE(Y,R)

Transaction A's update is lost at t4, because Transaction B overwrites it.
B missed A's update at t3 as it got the value of R at t2.

Uncommitted Dependency

Time   Transaction A   Transaction B
t1                     WRITE(X,R)
t2     Y=READ(R)
t3                     ABORT

Transaction A is allowed to READ (or WRITE) item R which has been updated by another transaction but not committed (and in this case ABORTed).

Inconsistency

Time   X    Y    Z    Transaction A   SUM   Transaction B
t1     40   50   30   SUM:=READ(X)    40
t2     40   50   30   SUM+=READ(Y)    90
t3     40   50   30                         ACC1=READ(Z)
t4     40   50   20                         WRITE(ACC1-10,Z)
t5     40   50   20                         ACC2=READ(X)
t6     50   50   20                         WRITE(ACC2+10,X)
t7     50   50   20                         COMMIT
t8     50   50   20   SUM+=READ(Z)    110

SUM should have been 120...

Serialisability

A 'schedule' is the actual execution sequence of two or more concurrent transactions.

A schedule of two transactions T1 and T2 is 'serialisable' if and only if executing this schedule has the same effect as either T1;T2 or T2;T1.

Precedence Graph

In order to know that a particular transaction schedule can be serialised, we can draw a precedence graph. This is a graph of nodes and edges, where the nodes are the transaction names and the edges are attribute collisions.

The schedule is said to be serialisable if and only if there are no cycles in the resulting diagram.

Precedence Graph : Method

To draw one:
  1. Draw a node for each transaction in the schedule.
  2. Where transaction A writes to an attribute which transaction B has read from, draw a line pointing from B to A.
  3. Where transaction A writes to an attribute which transaction B has written to, draw a line pointing from B to A.
  4. Where transaction A reads from an attribute which transaction B has written to, draw a line pointing from B to A.

Example 1

Consider the following schedule of transactions TRAN1 and TRAN2:

Time   Operation
t1     READ(A)
t2     READ(B)
t3     READ(A)
t4     READ(B)
t5     WRITE(x,B)
t6     WRITE(y,B)

Example 2

Consider the following schedule of transactions TRAN1, TRAN2 and TRAN3:

Time   Operation
t1     READ(A)
t2     READ(B)
t3     READ(A)
t4     READ(B)
t5     WRITE(x,A)
t6     WRITE(v,C)
t7     WRITE(w,B)
t8     WRITE(z,C)
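The precedence graph method can also be sketched in Python. The code below is an illustration under stated assumptions rather than a definitive implementation: a schedule is assumed to be a time-ordered list of (transaction, action, item) tuples, and the function names precedence_edges and has_cycle are hypothetical. It is applied to the Lost Update scenario from earlier (A reads R at t1, B reads R at t2, A writes R at t3, B writes R at t4) and to a serial run of the same two transactions.

```python
def precedence_edges(schedule):
    """Build the precedence graph edges for a time-ordered schedule.

    An edge (earlier, later) is added when two different transactions
    touch the same item and at least one of the two actions is a WRITE.
    """
    edges = set()
    for i, (t_early, act_early, item_early) in enumerate(schedule):
        for t_late, act_late, item_late in schedule[i + 1:]:
            conflict = (
                t_early != t_late
                and item_early == item_late
                and "WRITE" in (act_early, act_late)
            )
            if conflict:
                edges.add((t_early, t_late))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


# Lost Update scenario: the interleaved schedule from the notes.
lost_update = [
    ("A", "READ", "R"),   # t1
    ("B", "READ", "R"),   # t2
    ("A", "WRITE", "R"),  # t3
    ("B", "WRITE", "R"),  # t4
]
# The same operations run serially (A completes before B starts).
serial = [
    ("A", "READ", "R"),
    ("A", "WRITE", "R"),
    ("B", "READ", "R"),
    ("B", "WRITE", "R"),
]

print(has_cycle(precedence_edges(lost_update)))  # True
print(has_cycle(precedence_edges(serial)))       # False
```

For the Lost Update schedule the edges are A to B and B to A, which form a cycle, so the schedule is not serialisable; the serial schedule only has the edge A to B.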