Database Systems Lecture #5 Yan Pan School of Software, SYSU 2011 Agenda Last time: FDs This time: Anomalies Normalization: BCNF & 3NF Next time: RA & SQL 1. 2. 2 Review: FDs Definition: If two tuples agree on the attributes A1, A2, …, An then they must also agree on the attributes B1, B2, …, Bm 3 Notation: A1, A2, …, An B1, B2, …, Bm Read as: Ai functionally determines Bj Review: Combining FDs If some FDs are satisfied, then others are satisfied too If all these FDs are true: name color category department color, category price Then this FD also holds: name, category price Why? 4 Problem: find all FDs Given a relation instance and set of given FDs Find all FD’s satisfied by that instance Useful if we don’t get enough information from our users: need to reverse engineer a data instance Q: How long does this take? A: Some time for each subset of atts Q: How many subsets? 5 powerset exponential time in worst-case But can often be smarter… Closure Algorithm Start with X={A1, …, An}. Repeat: B1, …, Bn C is a FD and B1, …, Bn are all in X then add C to X. if until X didn’t change 6 Example: name color category department color, category price {name, category}+ = {name, category, color, department, price} Closure alg e.g. Example: A, B C A, D B B D Compute X+, for every set X (AB is shorthand for {A,B}): A+ = A, B+ = BD, C+ = C, D+ = D AB+ = ABCD, AC+ = AC, AD+ = ABCD, BC+ = BC, BD+ = BD, CD+ = CD ABC+ = ABD+ = ACD+ = ABCD (no need to compute–why?) BCD+ = BCD, ABCD+ = ABCD What are the keys? 7 Closure alg e.g. In class: A, B A, D B A, F R(A,B,C,D,E,F) Compute {A,B}+ X = {A, B, } Compute {A, F}+ X = {A, F, } What are the keys? 8 C E D B Closure alg e.g. Product(name, price, category, color) name, category price category color FDs are: Keys are: {name, category} Enrollment(student, address, course, room, time) student address room, time course student, course room, time FDs are: Keys are: 9 Next topic: Anomalies Identify anomalies in existing schemata Decomposition by projection BCNF Lossy v. lossless Third Normal Form 10 Types of anomalies Redundancy Update anomalies: Delete some values & lose other values too Insert anomalies: 11 Change info in one tuple but not in another Deletion anomalies: Repeat info unnecessarily in several tuples Inserting row means having to insert other, separate info / null-ing it out Examples of anomalies Name SSN Mailing-address Phone Michael 123 NY 212-111-1111 Michael 123 NY 917-111-1111 Hilary 456 DC 202-222-2222 Hilary 456 DC 914-222-2222 Bill 789 Chappaqua 914-222-2222 Bill 789 Chappaqua 212-333-3333 SSN Name, Mailing-address Redundancy: name, maddress Update anomaly: Bill moves Delete anom.: Bill doesn’t pay bills, lose phones lose Bill! Insert anom: can’t insert someone without a (non-null) phone Underlying cause: SSN-phone is many-many Effect: partial dependency ssn name, maddress, 12 SSN Phone Whereas key = {ssn,phone} Decomposition by projection Soln: replace anomalous R with projections of R onto two subsets of attributes Projection: an operation in Relational Algebra Projecting R onto attributes (A1,…,An) means removing all other attributes 13 Corresponds to SELECT command in SQL Result of projection is another relation Yields tuples whose fields are A1,…,An Resulting duplicates ignored Projection for decomposition R(A1, ..., An, B1, ..., Bm, C1, ..., Cp) R1(A1, ..., An, B1, ..., Bm) R2(A1, ..., An, C1, ..., Cp) R1 = projection of R on A1, ..., An, B1, ..., Bm R2 = projection of R on A1, ..., An, C1, ..., Cp A1, ..., An B1, ..., Bm C1, ..., Cp = all attributes R1 and R2 may (/not) be reassembled to produce original R 14 Decomposition example Break the relation into two: Name SSN Mailing-address Phone Michael 123 NY 212-111-1111 Michael 123 NY 917-111-1111 Hilary 456 DC 202-222-2222 Hilary 456 DC 914-222-2222 Bill 789 Chappaqua 914-222-2222 Bill 789 Chappaqua 212-333-3333 Name SSN Mailing-address SSN Phone Michael 123 NY 123 212-111-1111 Hilary 456 DC 123 917-111-1111 Bill 789 Chappaqua 456 202-222-2222 The anomalies are gone 456 914-222-2222 789 914-222-2222 789 212-333-3333 15 No more redundant data Easy to for Bill to move Okay for Bill to lose all phones Thus: high-level strategy name E/R Model: Product price Relational Model: plus FD’s Normalization: Eliminates anomalies 16 Person buys name ssn Using FDs to produce good schemas Start with set of relations Define FDs (and keys) for them based on real world Transform your relations to “normal form” (normalize them) 1. 2. 3. Intuitively, good design means 17 Do this using “decomposition” No anomalies Can reconstruct all (and only the) original information Decomposition terminology Projection: eliminating certain atts from relation Decomposition: separating a relation into two by projection Join: (re)assembling two relations If exactly the original rows are reproduced by joining the relations, then the decomposition was lossless 18 Whenever a row from R1 and a row from R2 have the same value for some atts A, join together to form a row of R3 We join on the attributes R1 and R2 have in common (As) If it can’t, the decomposition was lossy Lossless Decompositions A decomposition is lossless if we can recover: R(A,B,C) Decompose R1(A,B) R2(A,C) Recover R’(A,B,C) should be the same as R(A,B,C) R’ is in general larger than R. Must ensure R’ = R 19 Lossless decomposition 20 Sometimes the same set of data is reproduced: Name Price Category Word 100 WP Oracle 1000 DB Access 100 DB Name Price Name Category Word 100 Word WP Oracle 1000 Oracle DB Access 100 Access DB (Word, 100) + (Word, WP) (Word, 100, WP) (Oracle, 1000) + (Oracle, DB) (Oracle, 1000, DB) (Access, 100) + (Access, DB) (Access, 100, DB) Lossy decomposition 21 Sometimes it’s not: Name Price Category Word 100 WP Oracle 1000 DB Access 100 DB What’s wrong? Category Name Category Price WP Word WP 100 DB Oracle DB 1000 DB Access DB 100 (Word, WP) + (100, WP) (Word, 100, WP) (Oracle, DB) + (1000, DB) (Oracle, 1000, DB) (Oracle, DB) + (100, DB) (Oracle, 100, DB) (Access, DB) + (1000, DB) (Access, 1000, DB) (Access, DB) + (100, DB) (Access, 100, DB) Ensuring lossless decomposition R(A1, ..., An, B1, ..., Bm, C1, ..., Cp) R1(A1, ..., An, B1, ..., Bm) R2(A1, ..., An, C1, ..., Cp) If A1, ..., An B1, ..., Bm or A1, ..., An C1, ..., Cp Then the decomposition is lossless Note: don’t need both 22 Examples: name price, so first decomposition was lossless category name and category price, and so second decomposition was lossy Quick lossless/lossy example X 1 4 Y 2 2 Z 3 5 At a glance: can we decompose into R1(Y,X), R2(Y,Z)? At a glance: can we decompose into R1(X,Y), R2(X,Z)? 23 Next topic: Normal Forms First Normal Form = all attributes are atomic As opposed to set-valued Assumed all along Second Normal Form (2NF) Third Normal Form (3NF) Boyce Codd Normal Form (BCNF) Fourth Normal Form (4NF) Fifth Normal Form (5NF) 24 BCNF definition A simple condition for removing anomalies from relations: A relation R is in BCNF if: If As Bs is a non-trivial dependency in R , then As is a superkey for R I.e.: The left side must always contain a key I.e: If a set of attributes determines other attributes, it must determine all the attributes Slogan: “In every FD, the left side is a superkey.” 25 BCNF decomposition algorithm Repeat choose A1, …, Am B1, …, Bn that violates the BNCF condition //Heuristic: choose Bs as large as possible split R into R1(A1, …, Am, B1, …, Bn) and R2(A1, …, Am, [others]) continue with both R1 and R2 Until no more violations B’s R1 26 A’s Others R2 Boyce-Codd Normal Form Name/phone example is not BCNF: Name SSN Mailing-address Phone Michael 123 NY 212-111-1111 Michael 123 NY 917-111-1111 {ssn,phone} is key FD: ssn name,mailing-address holds Violates BCNF: ssn is not a superkey Its decomposition is BCNF Only superkeys anything else Name SSN Mailing-address SSN PhoneNumber Michael 123 NY 123 212-111-1111 123 917-111-1111 27 BCNF Decomposition Larger example: multiple decompositions {Title, Year, Studio, President, Pres-Address} FDs: No many-many this time Problem cause: transitive FDs: 28 Title, Year Studio Studio President President Pres-Address => Studio President, Pres-Address Title, year studio president BCNF Decomposition Illegal: As Bs, where As not a superkey Decompose: Studio President, Pres-Address Result: 1. 2. Studios(studio, president, pres-address) Movies(studio, title, year) Is (2) in BCNF? Is (1) in BCNF? 29 As = {studio} Bs = {president, pres-address} Cs = {title, year} Key: Studio FD: President Pres-Address Q: Does president studio? If so, president is a key But if not, it violates BCNF BCNF Decomposition Studios(studio, president, pres-address) Illegal: As Bs, where As not a superkey Decompose: President Pres-Address {Studio, President, Pres-Address} becomes 30 As = {president} Bs = {pres-address} Cs = {studio} {President, Pres-Address} {Studio, President} Decomposition algorithm example 31 R(N,O,R,P) F = {N O, O R, R N} Name Office Residence Phone George Pres. WH 202-… George Pres. WH 486-… Dick VP NO 202-… Dick VP NO Key: N,P Violations of BCNF: N O, OR, N OR Pick N OR Can we rejoin? What happens if we pick N O instead? Can we rejoin? 307-… An issue with BCNF We could lose FDs Relation: R(Title, Theater, Neighboorhood) FDs: Title,N’hood Theater Assume a movie shouldn’t play twice in same n’hood Theater N’hood Keys: Title Theater {Title, N’hood} Aviator Angelica {Theater, Title} Life Aquatic Angelica 32 N’hood Village Village Losing FDs BCNF violation: Theater N’hood Decompose: {Theater, N’Hood} {Theater, Title} Resulting relations: R1 R2 Theater N’hood Theater Title Angelica Village Angelica Angelica Aviator Life Aquatic 33 Losing FDs Suppose we add new rows to R1 and R2: R1 Theater N’hood R2 Theater Title Angelica Village Angelica Life Aquatic Film Forum Village Angelica Aviator Film Forum Life Aquatic R’ 34 Theater N’hood Title Angelica Village Life Aquatic Angelica Village Aviator Film Forum Village Life Aquatic Neither R1 nor R2 enforces FD Title,N’hood Theater Third normal form: motivation Sometimes In these cases BCNF is too severe a req. “over-normalization” Solution: define a weaker normal form, 3NF 35 BCNF is not dependency-preserving, and Efficient checking for FD violation on updates is important FDs can be checked on individual relations without performing a join (no inter-relational FDs) relations can be converted, preserving both data and FDs BCNF lossiness NB: BCNF decomp. is not data-lossy But: it can lose dependencies 36 Results can be rejoined to obtain the exact original After decomp, now legal to add rows whose corresponding rows would be illegal in (rejoined) original Data-lossy v. FD-lossy Third Normal Form Now define the (weaker) Third Normal Form Turns out: this example was already in 3NF A relation R is in 3rd normal form if : For every nontrivial dependency A1, A2, ..., An B for R, {A1, A2, ..., An } is a super-key for R, or B is part of a key, i.e., B is prime Tradeoff: BCNF = no FD anomalies, but may lose some FDs 3NF = keeps all FDs, but may have some anomalies 37 Canonical Cover Consider a set F of functional dependencies and the functional dependency in F. From A C A canonical coverFc for F is a set of dependencies such that F logically implies all dependencies in Fc and Fc logically implies all dependencies in F, and further 38 Attribute A is extraneous in if A and if A is removed from , the Iset getofAB C functional dependencies implied by F doesn’t change. Given AB C and A C then B is extraneous in AB Attribute A is extraneous in if A and if A is removed from , the set of functional dependencies implied by F doesn’t change. Given A BC and A B then B is extraneous in BC No functional dependency in Fc contains an extraneous attribute. Each left side of a functional dependency in Fc is unique. Canonical Cover 39 Compute a canonical over for F: repeat use the union rule to replace any dependencies in F 1 1 and 1 2 replaced with 1 12 Find a functional dependency with an extraneous attribute either in or in If an extraneous attribute is found, delete it from until F does not change Example of Computing a Canonical Cover 40 R = (A, B, C) F = { A BC B C A B AB C} Combine A BC and A B into A BC A is extraneous in AB C because B C logically implies AB C. C is extraneous in A BC since A BC is logically implied by A B and B C. The canonical cover is: AB BC A BC B C AB C A BC B C B C A BC B C A B B C 3NF Decomposition Algorithm 41 Let Fc be a canonical cover for F; i := 0; for each functional dependency in Fc do if none of the schemas Rj, 1<= j <= i contains then begin i:=i+1; Ri:= ; end if none of the schemas Rj, 1<= j <= i contains a candidate key for R then begin i:=i+1; Ri:= any candidate key for R; end return (R1, R2, …, Ri) Example Relation schema: Banker-info-schema branch-name, customer-name, banker-name, office-number 42 The functional dependencies for this relation schema are: banker-name branch-name, officenumber customer-name, branch-name banker-name The key is: {customer-name, branch-name} Applying 3NF to banker - info schema Go through the for loop in the algorithm: banker-name branch-name, office-number is not in any decomposed relation (no decomposed relation so far) Create a new relation: Banker-office-schema ( banker-name, branch-name, office-number ) customer-name, branch-name banker-name is not in any decomposed relation (one decomposed relation so far) Create a new relation: Banker-schema ( customer-name, branch-name, banker-name ) 43 Since Banker-schema contains a candidate key for Banker-infoschema, we are done with the decomposition process. Comparison of BCNF and 3NF It is always possible to decompose a relation into relations in 3NF and It is always possible to decompose a relation into relations in BCNF and 44 the decomposition is lossless dependencies are preserved the decomposition is lossless it may not be possible to preserve dependencies Next week 45 Read ch.5.1-2 (SQL)