Mid3 Revision 2
Prof. Sin Min Lee
Department of Computer Science
San Jose State University

Functional Dependencies
• FDs are defined over two sets of attributes: X, Y
• Notation: X -> Y reads as "X determines Y"
• If X -> Y, then all tuples that agree on X must also agree on Y

X  Y  Z
1  2  3
2  4  5
1  2  4
1  2  7
2  4  8
3  7  9

Q6. (1 point) Given the relation Supplies:

Snumber | Pnumber | Qty
--------|---------|----
101     | 1       | 20
101     | 2       | 30
102     | 1       | 14
103     | 4       | 21
104     | 4       | 10
105     | 1       | 5

what will be returned by the SQL query:

Select Pnumber
From Supplies
Group By Pnumber
Having Count(*) = (Select Max(Count(*)) From Supplies Group By Pnumber)

(a) 1  (b) 2  (c) 3  (d) 4
Answer: a (Pnumber 1 occurs three times, more than any other part number)

Q4. (1 point) Consider the relation R(ABCDE) with FDs:
FD1: AB -> D, FD2: AB -> E, FD3: D -> A, and FD4: D -> B.
The number of keys of R is:
(a) 1  (b) 2  (c) 3  (d) 10
Answer: (b). Candidate keys (2): {A,B,C}, {C,D}.
Superkeys (10): {C,D}, {A,B,C}, {A,C,D}, {B,C,D}, {C,D,E}, {A,B,C,D}, {A,B,C,E}, {A,C,D,E}, {B,C,D,E}, {A,B,C,D,E}

2nd Normal Form
• has to be in 1st Normal Form
• each attribute A in relation schema R meets one of the following criteria:
  - It appears in a candidate key.
  - It is not partially dependent on a candidate key.
• No need to check if the primary key has only one attribute
• Create a new relation for each partial key and its dependent attributes

Partial dependency
A functional dependency α -> β is called a partial dependency if there is a proper subset γ of α such that γ -> β. We say that β is partially dependent on α.

2NF example

A  B  C  D
1  1  3  2
2  2  3  1
3  2  4  4
4  1  4  3
1  2  1  2

2nd Normal Form (cont.)
Lots(Property-Id#, County-name, Lot#, Area, Price, Tax-Rate)
Tax-Rate is partially dependent on candidate key {County-name, Lot#}

2NF (cont.)
(Property-Id#, County-name, Lot#, Area, Price)
(County-name, Tax-Rate)

3rd Normal Form
• in 2nd Normal Form
• no non-key attribute is functionally dependent on another non-key attribute

3NF (cont.)
Before: (Property-Id#, County-name, Lot#, Area, Price)
After:  (Property-Id#, County-name, Lot#, Area)
        (Area, Price)

Inventory(PartNbr, {Warehouse, Location}, QOH, Weight, PartColor)
PartNbr -> Weight, PartColor
PartNbr + Warehouse -> QOH (QOH is Quantity On Hand)
Warehouse -> Location

Sample Data

PartNbr  Warehouse  Location  QOH  Weight  PartColor
01       500        NW        135  11.75   Blue
01       600        SW        210  11.75   Blue
01       800        East      192  11.75   Blue
02       500        NW        75   2.50    Red
02       800        East      45   2.50    Red
03       500        NW        290  21.35   Green
03       600        SW        83   21.35   Green

Which Normal form is the Inventory table in?
Answer: the key is {PartNbr, Warehouse}; the table is in 1NF but not 2NF, because Weight and PartColor depend on PartNbr alone and Location depends on Warehouse alone.

Q2. (1 point) Hospital(Patient, Insurance, Doctor, {Test, Result})
Patient -> Insurance, Doctor
Patient + Test -> Result

Sample Data

Patient    Insurance   Doctor      Test        Result
Tweety     Red Cross   Livingston  Brain Scan  Not Found
Tweety     Red Cross   Livingston  Blood work  Yes and red
Sylvester  Red Shield  Kilder      Cat Scan    Yes he is a Cat
Sylvester  Red Shield  Kilder      X Rays      No broken bones
Sylvester  Red Shield  Kilder      Flea check  None

Which Normal form is the Hospital table in?

Q7. (1 point) Given the following table
(a) Draw the functional dependency graph of this table.
(b) Can D be in 3NF?

Closure of F
Let F be a set of functional dependencies. The closure of F, denoted by F+, is the set of all functional dependencies logically implied by F.

Armstrong's Axioms
• Reflexivity rule. If α is a set of attributes and β ⊆ α, then α -> β holds.
• Augmentation rule. If α -> β holds and γ is a set of attributes, then γα -> γβ holds.
• Transitivity rule. If α -> β holds and β -> γ holds, then α -> γ holds.

Q3. (1 point) Suppose we have R(A,B,C,D) with
FD1: A,B -> C
FD2: A,C -> B
FD3: B,D -> A
Identify all the candidate keys.
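Q3 can be worked mechanically by computing attribute closures. The following is a minimal Python sketch, not part of the original slides: the helper names `closure`, `candidate_keys`, and `fd` are hypothetical, and the three FDs are read as AB -> C, AC -> B, BD -> A. It tries attribute sets from smallest to largest and keeps those whose closure is the whole schema and that contain no smaller key.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left side is already derived.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema (brute force)."""
    schema = frozenset(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = frozenset(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == schema:
                keys.append(candidate)
    return keys

def fd(lhs, rhs):
    """Build one FD from two strings of single-letter attribute names."""
    return (frozenset(lhs), frozenset(rhs))

# Q3, under the assumed reading of the three FDs.
q3_fds = [fd("AB", "C"), fd("AC", "B"), fd("BD", "A")]
q3_keys = candidate_keys("ABCD", q3_fds)
print(sorted("".join(sorted(key)) for key in q3_keys))  # ['ACD', 'BD']
```

D appears on no right-hand side, so every key must contain it; the search confirms that {B,D} and {A,C,D} are the only minimal choices. Applied to the FDs of Q4, the same function returns {C,D} and {A,B,C}. The enumeration is exponential in the number of attributes, which is acceptable only for exam-sized schemas.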
Decompositions in General
R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)
R1(A1, ..., An, B1, ..., Bm)
R2(A1, ..., An, C1, ..., Cp)
If A1, ..., An -> B1, ..., Bm, then the decomposition is lossless.
Note: we do not necessarily need A1, ..., An -> C1, ..., Cp.
Example: name -> price, hence the first decomposition is lossless.

BCNF Decomposition Algorithm
Repeat
  choose A1, ..., Am -> B1, ..., Bn that violates the BCNF condition
  split R into R1(A1, ..., Am, B1, ..., Bn) and R2(A1, ..., Am, [others])
  continue with both R1 and R2
Until no more violations
[Diagram: R1 covers the A's and the B's; R2 covers the A's and the others]
Is there a 2-attribute relation that is not in BCNF?

Summary of BCNF Decomposition
Find a dependency that violates the BCNF condition: A1, A2, ..., An -> B1, B2, ..., Bm
Heuristic: choose B1, B2, ..., Bm "as large as possible"
Decompose: R1(A's, B's) and R2(A's, others)
Continue until there are no BCNF violations left.
2-attribute relations are always in BCNF.

Example Decomposition
Person(name, SSN, age, hairColor, phoneNumber)
SSN -> name, age
age -> hairColor
Decompose into BCNF (in class):
Step 1: find all keys (How? Compute S+ for various sets S)
Step 2: now decompose

Other Example
R(A,B,C,D)
A -> B, B -> C
Key: AD
Violations of BCNF: A -> B, A -> C, A -> BC
Pick A -> BC: split into R1(A,B,C) and R2(A,D)
What happens if we pick A -> B first?

Q5. (1 point) Given the FDs {B -> D, AB -> C, D -> B} and the relation R(A, B, C, D), give two distinct lossless-join decompositions into BCNF, indicating the keys of each of the resulting relations.
Answer:
Relations in the first lossless-join decomposition: R1(A, B, C), R2(B, D)
Relations in the second lossless-join decomposition: R1(A, C, D), R2(B, D)

Lossless Decompositions
A decomposition is lossless if we can recover the original relation:
R(A,B,C) is decomposed into R1(A,B) and R2(A,C)
R1 and R2 are joined to recover R'(A,B,C), which should be the same as R(A,B,C)
R' is in general larger than R. We must ensure R' = R.

Q8. (2 points) Consider the relation schema R(A,B,C,D) with FDs F = {AB -> C; BC -> D; A -> B}. Which FD has an extraneous attribute on the left-hand side?
a. AB -> C
b. BC -> D
c. Both (b) and (a)
d. None of the above
Answer: a (since A -> B, attribute B is extraneous in AB -> C)

Multivalued Dependencies (MVDs)
Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency α ->> β holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:
  t1[α] = t2[α] = t3[α] = t4[α]
  t3[β] = t1[β]
  t3[R - β] = t2[R - β]
  t4[β] = t2[β]
  t4[R - β] = t1[R - β]

MVD (Cont.)
Tabular representation of α ->> β
X ->> Y is trivial if (a) Y ⊆ X or (b) Y ∪ X = R

Multivalued Dependencies
There are database schemas in BCNF that do not seem to be sufficiently normalized.
Consider a database classes(course, teacher, book) such that (c,t,b) ∈ classes means that t is qualified to teach c, and b is a required textbook for c.
The database is supposed to list for each course the set of teachers, any one of which can be the course's instructor, and the set of books, all of which are required for the course (no matter who teaches it).

Multivalued Dependencies

classes
course             teacher    book
database           Avi        DB Concepts
database           Avi        Ullman
database           Hank       DB Concepts
database           Hank       Ullman
database           Sudarshan  DB Concepts
database           Sudarshan  Ullman
operating systems  Avi        OS Concepts
operating systems  Avi        Shaw
operating systems  Jim        OS Concepts
operating systems  Jim        Shaw

There are no non-trivial functional dependencies and therefore the relation is in BCNF.
Insertion anomalies: if Sara is a new teacher who can teach database, two tuples need to be inserted:
(database, Sara, DB Concepts)
(database, Sara, Ullman)

Multivalued Dependencies
Therefore, it is better to decompose classes into:

teaches
course             teacher
database           Avi
database           Hank
database           Sudarshan
operating systems  Avi
operating systems  Jim

text
course             book
database           DB Concepts
database           Ullman
operating systems  OS Concepts
operating systems  Shaw

We shall see that these two relations are in Fourth Normal Form (4NF).

MVD (Cont.)
Tabular representation of α ->> β

Example (attribute closure):
F = { A -> B, B -> C }
A+ = ABC
B+ = BC
C+ = C
AB+ = ABC

First Normal Form
Every field contains only atomic values: no lists or sets. This is implicit in our definition of the relational model.

Second Normal Form
Every non-key attribute is fully functionally dependent on the ENTIRE primary key. Mainly of historical interest.

Intuitively, in a BCNF relation, the only nontrivial dependencies are those in which a key determines some attributes.
Each tuple can be thought of as an entity or relationship, identified by a key and described by the remaining attributes.
[Figure: FDs in a BCNF Relation. Key -> Nonkey attr_1, Nonkey attr_2, ..., Nonkey attr_k]
[Figure: Partial Dependencies. Key, Attributes X, Attributes A; A not in a key]
[Figure: Transitive Dependencies. Key, Attributes X, Attributes A; two cases: A not in a key, A in a key]

Motivation of 3NF
By making an exception for certain dependencies involving key attributes, we can ensure that every relation schema can be decomposed into a collection of 3NF relations using only lossless-join, dependency-preserving decompositions. Such a guarantee does not exist for BCNF relations. 3NF weakens the BCNF requirements just enough to make this guarantee possible.
Unlike BCNF, some redundancy is possible with 3NF. The problems associated with partial and transitive dependencies persist if there is a nontrivial dependency X -> A and X is not a superkey, even if the relation is in 3NF because A is part of a key.

Reserves
Assume: sid -> cardno (a sailor uses a unique credit card to pay for reservations).
Reserves is not in 3NF: sid is not a key and cardno is not part of a key. In fact, (sid, bid, day) is the only key.
(sid, cardno) pairs are stored redundantly.

Reserves
Assume: sid -> cardno, and cardno -> sid (we know that credit cards also uniquely identify the owner).
Reserves is in 3NF: (cardno, bid, day) is also a key for Reserves, so cardno is part of a key.
sid -> cardno does not violate 3NF.
Suppose that in our banking example, we had an alternative design including the schema:
BC-schema = (loan#, cname, street, ccity)
We can see this is not BCNF, as the functional dependency cname -> street ccity holds on this schema, and cname is not a superkey.
If we have customers who have several addresses, though, then we no longer wish to enforce this functional dependency, and the schema is in BCNF.
However, we now have the repetition of information problem. For each address, we must repeat the loan numbers for a customer, and vice versa.
Figure 1 shows a tabular representation of a multivalued dependency α ->> β. It looks horrendously complicated, but is really rather simple. A simple example is a table with the schema (name, address, car), as shown in Figure 2.

Figure 1:
      α           β               R - α - β
t1    a1 ... ai   ai+1 ... aj     aj+1 ... an
t2    a1 ... ai   bi+1 ... bj     bj+1 ... bn
t3    a1 ... ai   ai+1 ... aj     bj+1 ... bn
t4    a1 ... ai   bi+1 ... bj     aj+1 ... an

Figure 2: (name, address, car) where name ->> address and name ->> car

name  address    car
Tom   North Rd.  Toyota
Tom   Oak St.    Honda
Tom   North Rd.  Honda
Tom   Oak St.    Toyota

What is a Decomposition?
Let R be a relation schema. A set of relation schemas {R1, R2, ..., Rn} is a decomposition of R if
R = R1 ∪ R2 ∪ ... ∪ Rn
That is, {R1, R2, ..., Rn} is a decomposition of R if, for i = 1, 2, ..., n, each Ri is a subset of R, and every attribute in R appears in at least one Ri.

Normalization Using Functional Dependencies
Desirable properties of Decomposition

1. Lossless-Join Decomposition
Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
  R1 ∩ R2 -> R1
  R1 ∩ R2 -> R2

2. Dependency Preservation
When an update is made to the database, the system should be able to check whether it satisfies all the given functional dependencies. If we want to check updates efficiently, we should design relational-database schemas that allow update validation without the computation of joins.
To decide whether joins must be computed, we need to determine which functional dependencies can be tested by checking each relation individually.

Cont.
Let F be a set of functional dependencies on a schema R, and let R1, R2, ..., Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F+ that include only attributes of Ri.
Let F' = F1 ∪ F2 ∪ ... ∪ Fn. F' is a set of functional dependencies on schema R; in general, F' ≠ F. However, it may be that F'+ = F+. If the latter is true, then every dependency in F is logically implied by F', and if we verify that F' is satisfied, we have verified that F is satisfied.
We say that a decomposition having the property F'+ = F+ is a dependency-preserving decomposition.

Algorithm to test dependency preservation
  compute F+;
  for each schema Ri in D do
  begin
    Fi := the restriction of F+ to Ri;
  end
  F' := ∅
  for each restriction Fi do
  begin
    F' := F' ∪ Fi
  end
  compute F'+
  if (F'+ = F+) then return (true)
  else return (false);
Note: since the first step, the computation of F+, takes exponential time, it is often easier not to apply the algorithm.

Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form α -> β, where α ⊆ R and β ⊆ R, at least one of the following holds:
• α -> β is a trivial functional dependency (β ⊆ α)
• α is a superkey for schema R.

Cont.
A database design is in BCNF if each member of the set of relation schemas that constitutes the design is in BCNF.
To determine whether these schemas are in BCNF, we need to determine what functional dependencies apply to them.
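Because computing F+ is exponential, dependency preservation is usually tested another way. The sketch below is not the algorithm listed above: it is the standard polynomial-time alternative, which grows each FD's left-hand side using only attribute closures projected onto the individual schemas. The function names are hypothetical. The first example reuses the FDs and first decomposition from Q5; the second schema is an illustrative assumption that does not appear in the slides.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def preserves_dependencies(fds, decomposition):
    """True if every FD in `fds` is implied by the restrictions to each schema."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                # What the part of `result` visible in ri determines, kept within ri.
                gained = closure(result & ri, fds) & ri
                if not gained <= result:
                    result |= gained
                    changed = True
        if not rhs <= result:
            return False  # this FD cannot be checked without a join
    return True

def fd(lhs, rhs):
    """Build one FD from two strings of single-letter attribute names."""
    return (frozenset(lhs), frozenset(rhs))

# Q5: F = {B -> D, AB -> C, D -> B}, decomposed into R1(A,B,C) and R2(B,D).
q5_fds = [fd("B", "D"), fd("AB", "C"), fd("D", "B")]
print(preserves_dependencies(q5_fds, [frozenset("ABC"), frozenset("BD")]))  # True

# Hypothetical counterexample: R(A,B,C), A -> B, B -> C, split into (A,B) and (A,C).
chain_fds = [fd("A", "B"), fd("B", "C")]
print(preserves_dependencies(chain_fds, [frozenset("AB"), frozenset("AC")]))  # False
```

In the counterexample, B -> C is lost because no single schema contains both B and C and nothing derivable within one schema bridges them.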
Note: examples are available in text P225-226

BCNF Decomposition Algorithm
  result := {R};
  done := false;
  compute F+;
  while (not done) do
    if (there is a schema Ri in result that is not in BCNF)
    then begin
      let α -> β be a nontrivial functional dependency that holds on Ri
        such that α -> Ri is not in F+, and α ∩ β = ∅;
      result := (result - Ri) ∪ (Ri - β) ∪ (α, β);
    end
    else done := true;

Cont.
• Not every BCNF decomposition is dependency preserving.
• We cannot always satisfy all three design goals:
  1. BCNF
  2. Lossless join
  3. Dependency preservation

Cont.
Example: Banker-schema = (branch-name, customer-name, banker-name)
This banker-schema indicates that a customer has a "personal banker" in a particular branch.
The set F of functional dependencies that we require to hold on the banker-schema is
  banker-name -> branch-name
  branch-name customer-name -> banker-name
Banker-schema is not in BCNF because banker-name is not a superkey.

Third Normal Form
A relation schema R is in 3NF with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form α -> β, where α ⊆ R and β ⊆ R, at least one of the following holds:
• α -> β is a trivial functional dependency.
• α is a superkey for R.
• Each attribute A in β - α is contained in a candidate key for R.

Transitive Dependencies
The definition of 3NF allows certain functional dependencies that are not allowed in BCNF. A dependency α -> β that satisfies only the third condition of the 3NF definition is not allowed in BCNF, but is allowed in 3NF. These dependencies are examples of transitive dependencies.

Cont.
If a relation schema is in BCNF, then all functional dependencies are of the form "superkey determines a set of attributes," or the dependency is trivial. So a BCNF schema cannot have any transitive dependencies. Every BCNF schema is also in 3NF, and BCNF is therefore a more restrictive constraint than 3NF.
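The banker-schema example can be checked mechanically against both definitions. The following Python sketch is an illustration under stated assumptions, not part of the slides: the function names are hypothetical, and it inspects only the FDs listed in F, which is enough for the original schema but not for sub-schemas produced by a decomposition.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = frozenset(combo)
            if any(key <= candidate for key in keys):
                continue
            if closure(candidate, fds) == schema:
                keys.append(candidate)
    return keys

def normal_form_violations(schema, fds):
    """Return (FDs violating BCNF, FDs violating 3NF) among the FDs in `fds`."""
    schema = frozenset(schema)
    # Attributes that appear in at least one candidate key.
    prime = set().union(*candidate_keys(schema, fds))
    bcnf, third = [], []
    for lhs, rhs in fds:
        if rhs <= lhs or closure(lhs, fds) == schema:
            continue  # trivial FD, or the left side is a superkey
        bcnf.append((lhs, rhs))
        if not (rhs - lhs) <= prime:
            third.append((lhs, rhs))  # fails the third 3NF condition as well
    return bcnf, third

banker_schema = {"branch-name", "customer-name", "banker-name"}
banker_fds = [
    (frozenset({"banker-name"}), frozenset({"branch-name"})),
    (frozenset({"branch-name", "customer-name"}), frozenset({"banker-name"})),
]
bcnf_bad, third_bad = normal_form_violations(banker_schema, banker_fds)
print(len(bcnf_bad), len(third_bad))  # 1 0
```

The single BCNF violation is banker-name -> branch-name. It is tolerated by 3NF because branch-name belongs to the candidate key {branch-name, customer-name}.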
Algorithm for Dependency-preserving, lossless-join decomposition into 3NF
  let Fc be a canonical cover for F;
  i := 0;
  for each functional dependency α -> β in Fc do
    if none of the schemas Rj, j = 1, 2, ..., i contains αβ
    then begin
      i := i + 1;
      Ri := αβ;
    end
  if none of the schemas Rj, j = 1, 2, ..., i contains a candidate key for R
  then begin
    i := i + 1;
    Ri := any candidate key for R;
  end
  return (R1, R2, ..., Ri)

Comparison of BCNF and 3NF
3NF has the advantage that it is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation. So when BCNF cannot be reached without losing dependency preservation, it is generally preferable to choose 3NF.

Conclusion
Now we have three design goals for a relational-database design:
  1. BCNF
  2. Lossless join
  3. Dependency preservation
If we cannot achieve all three, we can settle for:
  1. 3NF
  2. Lossless join
  3. Dependency preservation

Testing for Lossless Join
Fortunately, there is a simple test to determine whether a decomposition into two schemas is lossless.
Let R1 and R2 be a decomposition of R.
Let F be the set of FDs of R.
If either (R1 ∩ R2) -> (R1 - R2) or (R1 ∩ R2) -> (R2 - R1) belongs to F+, the decomposition is lossless.

Data Mining and KDD
Putting the results in practical use

What is Data Mining?
"the automated extraction of hidden predictive information from large databases"
• Algorithms produce patterns, rules
• Predict future trends/behavior
• Used to make business decisions

Classification
• Items belong to classes
• Given past items' classification, predict the class of a new item
• Example: issuing credit cards
  - Use information: income, educational background, age, current debts
  - Credit worthiness: bad, good, excellent

Decision Tree Classifiers
• Each internal node has a predicate
• Each leaf node is a class
• To classify an instance:
  - Start at the root node
  - At each internal node, make a decision
  - Traverse the tree until a leaf node is reached

[Figure: Credit Risk Decision Tree]

Decision Tree Construction
Some Definitions
• Purity: the more instances in a leaf that belong to a single class, the greater the purity
• Best Split: the split giving the maximum information gain ratio (info gain / info content)
• Choose the attribute and condition resulting in maximum purity

Association Rules
antecedent -> consequent
• if -> then
• beer -> diaper (Walmart)
• economy bad -> higher unemployment
• higher unemployment -> higher unemployment benefits cost
Rules are associated with a population, support, and confidence.

Association Rules
• Population: instances such as grocery store purchases
• Support: % of the population satisfying both antecedent and consequent
• Confidence: % of instances where the consequent is true when the antecedent is true

Association Rules
Population: MS, MSA, MSB, MA, MB, BA (M = Milk, S = Soda, A = Apple, B = Beer)
Support (M -> S) = 3/6: (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB, BA)
Confidence (M -> S) = 3/5: (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB)

Clustering
"The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to all available variables."

Clustering
Birch Algorithm
• points are inserted into a multidimensional tree
• items are guided to leaf nodes "near" representative internal nodes
• nearby points are clustered into one leaf node

Clustering
Example of clustering: predict what new movies a person is interested in, based on
1) the person's past movie preferences
2) others with similar preferences
3) preferences of those in the pool for new movies

Clustering
1) cluster people with similar movie preferences
2) given a new movie goer, find a cluster of similar movie goers
3) then predict the cluster's new movie preferences

Amazon Examples
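The support and confidence figures from the Association Rules slide can be recomputed directly from the six listed purchases. This is a small Python sketch, not part of the slides; the function names are hypothetical, and the rule is read as M -> S (Milk implies Soda). Exact fractions are used so the results match the slide's 3/6 and 3/5.

```python
from fractions import Fraction

# The six purchases from the slide: M = Milk, S = Soda, A = Apple, B = Beer.
population = [set("MS"), set("MSA"), set("MSB"), set("MA"), set("MB"), set("BA")]

def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return Fraction(sum(both <= basket for basket in baskets), len(baskets))

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction with the consequent."""
    ante = set(antecedent)
    both = ante | set(consequent)
    with_antecedent = sum(ante <= basket for basket in baskets)
    return Fraction(sum(both <= basket for basket in baskets), with_antecedent)

print(support("M", "S", population))     # 1/2, i.e. 3/6
print(confidence("M", "S", population))  # 3/5
```

Three of the six purchases contain both Milk and Soda, and five contain Milk, which gives the two ratios. The sketch assumes the antecedent occurs in at least one basket; otherwise the confidence is undefined and `Fraction` raises `ZeroDivisionError`.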