TDDC94 Fall 2007

Normalisation
Almut Herzog, Attentec AB
almhe@attentec.se, almhe@ida.liu.se
September 4, 2007

Lecture Overview
  [Figure: Users 1 to 4 send queries and updates to the database management system and receive answers. The DBMS covers processing of queries and updates and access to stored data in the physical database. Further labels: Model, Real World, Database.]

Good Design
  Can we be sure that a translation from EER-diagram to relational tables results in good database design?
  Confronted with a deployed database, how can we be sure that it is well-designed?
  What is good database design?
    Four informal measures: ...
    Formal measure: Normalisation

Informal Measures of Good Design
  Semantics of the attributes
    It should be straightforward to understand why an attribute is in the relation. Putting department data in the EMP-table, for example, is not straightforward.
  Reducing redundant information in tuples
    Redundancy can cause update anomalies. Given: EMP(EMPID, EMPNAME, DEPTNAME, DEPTMGR)
    Insertion anomaly: if the EMP table also contains DEPT-info, the insertion of a new employee must contain correct, consistent DEPT-information.
    Deletion anomaly: if the last employee of the department is deleted, all department information is lost.
    Modification anomaly: if a department receives a new manager, all employee tuples of that department need to be updated with that information.
  Reducing NULL values in tuples (not such a big issue, really)
    Why is the attribute there in the first place if almost every tuple has a NULL-value for it? Unclear semantics.
    Storage is not really a problem in today's databases.
    But joins on such columns are very costly, because the NULL-value must be taken into account with an OUTER join, which is an expensive operation.
  Disallowing the possibility of generating spurious tuples
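The update anomalies described for EMP(EMPID, EMPNAME, DEPTNAME, DEPTMGR) can be made concrete with a small in-memory table. This is only a sketch: the rows, the helper names `managers_per_department` and `inconsistent_departments`, and the new manager "Carlsson" are made up for illustration and do not come from the lecture.

```python
# Hypothetical sample rows for EMP(EMPID, EMPNAME, DEPTNAME, DEPTMGR).
emp = [
    {"EMPID": 1, "EMPNAME": "Ann", "DEPTNAME": "Dev", "DEPTMGR": "Baker"},
    {"EMPID": 2, "EMPNAME": "Ben", "DEPTNAME": "Dev", "DEPTMGR": "Baker"},
    {"EMPID": 3, "EMPNAME": "Cid", "DEPTNAME": "Support", "DEPTMGR": "Miller"},
]


def managers_per_department(rows):
    """Map each department name to the set of managers recorded for it."""
    result = {}
    for row in rows:
        result.setdefault(row["DEPTNAME"], set()).add(row["DEPTMGR"])
    return result


def inconsistent_departments(rows):
    """Departments for which more than one manager is recorded."""
    per_dept = managers_per_department(rows)
    return sorted(dept for dept, mgrs in per_dept.items() if len(mgrs) > 1)


# Modification anomaly: the new manager is written to only one Dev tuple,
# so the redundant copies of the department data now disagree.
after_partial_update = [dict(row) for row in emp]
after_partial_update[0]["DEPTMGR"] = "Carlsson"

# Deletion anomaly: deleting the last Support employee also deletes
# everything the table knew about the Support department.
after_delete = [row for row in emp if row["EMPID"] != 3]

print(inconsistent_departments(after_partial_update))
print("Support" in managers_per_department(after_delete))
```

With department data kept in a separate relation, the manager would be stored once, and neither problem could occur.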
Informal Measures of Good Design: spurious tuples
  Figure 10.5 in the book …
  Only join on foreign key/primary key-attributes.
  Lossless join property: it shall be possible to recreate the universal relation without spurious tuples.

Functional dependencies (FD)
  Let R be a relational schema with the attributes A1,...,An and let X and Y be subsets of {A1,...,An}.
  Let r(R) denote a relation in relational schema R.
  We say that X functionally determines Y, X → Y,
  if for all relations r(R) and for each pair of tuples t1, t2 ∈ r(R):
    if t1[X] = t2[X] then we must also have t1[Y] = t2[Y]
  Despite the mathematical definition, an FD cannot be determined automatically. It is a property of the semantics of attributes.

Definitions
  An FD X → Y is a full functional dependency (FFD) if removal of any attribute Ai from X means that the dependency does not hold any more.
  Inference rules…
  Superkey: a set of attributes uniquely (but not minimally!) identifying a tuple of a relation.
  Key: a set of attributes that uniquely and minimally identifies a tuple of a relation.
  Candidate key: if there is more than one key in a relation, the keys are called candidate keys.
  Primary key: one candidate key is chosen to be the primary key.
  Prime attribute: an attribute A that is part of a candidate key X (A ∈ X).

1NF (rather a definition)
  The relation must not have non-atomic values.
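The FD definition (equal X-values force equal Y-values for every pair of tuples) can be tested against one relation instance. The sketch below does that with a hypothetical helper `fd_holds`; the sample rows are the ID/Name/Zip/City rows used in the 3NF example of these slides. Such a test can only refute an FD: that it holds in one instance does not prove it, since an FD is a statement about all possible instances and therefore about attribute semantics.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no pair of tuples in `rows` violates lhs -> rhs."""
    seen = {}  # X-value -> the Y-value first seen together with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y for a new x, otherwise returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True


people = [
    {"ID": 100, "Name": "Andersson", "Zip": "58214", "City": "Linköping"},
    {"ID": 101, "Name": "Björk", "Zip": "10223", "City": "Stockholm"},
    {"ID": 102, "Name": "Carlsson", "Zip": "58214", "City": "Linköping"},
]

print(fd_holds(people, ["Zip"], ["City"]))   # not refuted by this instance
print(fd_holds(people, ["City"], ["ID"]))    # refuted: two IDs share a city
```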
Normal Forms
  1NF, 2NF, 3NF, BCNF, 4NF, 5NF
  Minimise redundancy
  Minimise update anomalies
  Normal form ↑ = redundancy ↓ and relations become smaller

1NF example

  R (not in 1NF)
    ID   Name        LivesIn
    100  Pettersson  {Stockholm, Linköping}
    101  Andersson   {Linköping}
    102  Svensson    {Ystad, Hjo, Berlin}

  Normalisation gives:

  R1 (1NF)                R2 (1NF)
    ID   Name               ID   LivesIn
    100  Pettersson         100  Stockholm
    101  Andersson          100  Linköping
    102  Svensson           101  Linköping
                            102  Ystad
                            102  Hjo
                            102  Berlin

2NF: no non-key attribute shall be functionally dependent on parts of any candidate key

  R (not in 2NF)
    EmpID  Dept     Work%  EmpName
    100    Dev      50     Baker
    100    Support  50     Baker
    200    Dev      80     Miller

  Normalisation gives:

  R1 (2NF)                  R2 (2NF)
    EmpID  Dept     Work%     EmpID  EmpName
    100    Dev      50        100    Baker
    100    Support  50        200    Miller
    200    Dev      80

3NF: 2NF + no non-key attribute shall be functionally dependent on any other non-key attribute (= no transitive dependencies)

  R (not in 3NF)
    ID   Name       Zip    City
    100  Andersson  58214  Linköping
    101  Björk      10223  Stockholm
    102  Carlsson   58214  Linköping

  Normalisation gives:

  R1 (3NF)                  R2 (3NF)
    ID   Name       Zip       Zip    City
    100  Andersson  58214     58214  Linköping
    101  Björk      10223     10223  Stockholm
    102  Carlsson   58214

Boyce-Codd Normal Form (BCNF)
  Every determinant is a superkey.
  "Colloquial and historical definition, which does not consider candidate keys": 3NF + no attribute of the primary key is fully functionally dependent on a non-key attribute.
  Warning! With this definition you do not always get it right. Use the formal one instead: every determinant is a superkey (in practice: every determinant is a candidate key).

Properties of decomposition
  Keep all attributes from the universal relation R.
  Preserve the identified functional dependencies.
  Lossless join: it must be possible to join the smaller tables to arrive at composite information without spurious tuples.
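The lossless join property can be illustrated by projecting a relation onto two attribute sets and joining the projections again. The sketch below uses the ID/Name/Zip/City rows of the 3NF example; the helpers `project`, `natural_join` and `as_set` are hypothetical names, and the second, deliberately poor decomposition into (ID, City) and (Name, Zip, City) is an assumption added for contrast, not something taken from the slides.

```python
def project(rows, attrs):
    """Projection with duplicate elimination; returns a list of dicts."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def natural_join(left, right):
    """Natural join on all attribute names the two relations share."""
    if not left or not right:
        return []
    common = [a for a in left[0] if a in right[0]]
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[a] == r[a] for a in common)
    ]


def as_set(rows):
    """Order-independent representation of a relation instance."""
    return {tuple(sorted(row.items())) for row in rows}


r = [
    {"ID": 100, "Name": "Andersson", "Zip": "58214", "City": "Linköping"},
    {"ID": 101, "Name": "Björk", "Zip": "10223", "City": "Stockholm"},
    {"ID": 102, "Name": "Carlsson", "Zip": "58214", "City": "Linköping"},
]

# Decomposition from the 3NF example: joined on Zip, the key of (Zip, City).
good = natural_join(project(r, ["ID", "Name", "Zip"]), project(r, ["Zip", "City"]))

# Assumed poor decomposition: joined on City, which is a key of neither part.
bad = natural_join(project(r, ["ID", "City"]), project(r, ["Name", "Zip", "City"]))

spurious = as_set(bad) - as_set(r)
print(as_set(good) == as_set(r), len(bad), len(spurious))
```

In the poor decomposition the two Linköping residents are paired with each other's names, which is exactly the kind of spurious tuple the slide on informal measures warns about.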
Boyce-Codd Normal Form: example
  Given R(A,B,C,D) and the functional dependencies AB → CD, C → B.
  R is in 3NF but not in BCNF: the candidate keys are AB and AC, so B is a prime attribute and C → B does not violate 3NF.
  C is a determinant but not a superkey, i.e. it cannot be turned into a key by taking away 0 or more attributes.
  In reality, a relation that is in 3NF is most often also in BCNF (but you cannot be sure ☺)

Normalisation
  Given the universal relation
    R(PID, PersonName, Country, Continent, ContinentSize, NumVisitsInCountry)
  How does one elicit the functional dependencies?
  Which are the keys in R?

  Functional dependencies:
    PID → PersonName
    PID, Country → NumVisitsInCountry
    Country → Continent
    Continent → ContinentSize
  With these FFDs, which are the keys in R (= can we find an FD that contains all attributes)?
  Use the inference rules and get started.

  Country → Continent
  Continent → ContinentSize
    leads to Country → Continent, ContinentSize (transitive rule)
    leads to PID, Country → Continent, ContinentSize (augmentation rule)
  PID → PersonName
    leads to PID, Country → PersonName (augmentation rule)
  PID, Country → NumVisitsInCountry
    leads to PID, Country → Continent, ContinentSize, PersonName, NumVisitsInCountry (additive rule)
  Thus PID, Country is key for R.

Is R(PID, Country, Continent, ContinentSize, PersonName, NumVisitsInCountry) in 2NF?
  No, because PersonName is only fully functionally dependent (ffd) on PID, thus we split:
    R1(PID, PersonName)
    R2(PID, Country, Continent, ContinentSize, NumVisitsInCountry)

Are R1, R21, R22 in 3NF? (R21 and R22 result from the 2NF decomposition of R2.)
  R22(PID, Country, NumVisitsInCountry), R1(PID, PersonName): Yes, because there is only one non-key attribute.

Is R2 in 2NF?
  No, because Continent and ContinentSize are only ffd on Country, thus:
    R1(PID, PersonName)
    R21(Country, Continent, ContinentSize)
    R22(PID, Country, NumVisitsInCountry)
  R1, R21, R22 are now in 2NF.
  Reminder, 2NF: no non-key attribute shall be functionally dependent on parts of any candidate key.

Is R21(Country, Continent, ContinentSize) in 3NF?
  No, because Continent determines ContinentSize, thus:
    R211(Country, Continent)
    R212(Continent, ContinentSize)
  R1, R22, R211, R212 are now in 3NF.
  Reminder, 3NF: 2NF + no non-key attribute shall be fd on any other non-key attribute (= no transitive dependencies).

Are R1, R22, R211, R212 in BCNF?
  R22(PID, Country, NumVisitsInCountry)
  R1(PID, PersonName)
  R211(Country, Continent)
  R212(Continent, ContinentSize)
  YES
  Reminder, BCNF: every determinant is a superkey.
  Don't get fooled by candidate keys as in Rx(PID1, PID2, PName).
  Can the universal relation R be recreated from R1, R22, R211 and R212 without spurious tuples?

4NF and 5NF
  Relevant if primary key consists of more than two attributes.
  Relevant if confronted with an existing database.

Summary and Open Issues
  Good design: informal and formal properties of relations.
  Functional dependencies, and thus normal forms, are about attribute semantics (= real-world knowledge). Normalisation can only be automated if FDs are given.
  Are high normal forms good design when it comes to performance? No, denormalisation may be required.

1. Which normal form?
  The database contains data about cars, their owners and when the car was registered for that owner.
  Data for exercise 1:
    PersonID  FirstName  LastName  LicensePlate  RegistrationDate  Birthdate
    1000      Ann        Anderson  ABC123        2004-10-12        1981-04-04
    1010      Ben        Benson    DEF234        2003-02-12        1945-12-12
    1000      Ann        Anderson  ABC123        2001-04-23        1981-04-04

2. Which normal form?
  A database contains data about registered cars and their make (type).

    LicensePlate  Type     Maker
    ABC123        C70      Volvo
    DEF234        S40      Volvo
    FGH345        Corolla  Toyota

3. Which normal form?
  The database contains data about flights, aircraft and their pilots. Flights use different aircraft depending on the number of booked passengers.

    Date         Flight  Aircraft    Pilot
    13-Jan-2005  TGU7    Airbus 300  John
    14-Jan-2005  TGU7    Boeing 747  Daniel
    12-Jan-2005  SKX6    Airbus 300  John
    13-Jan-2005  SKX6    Boeing 747  Ann
    14-Jan-2005  SKX6    Fokker 50   Mary
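For exercises like these, the first step after eliciting the FDs is to find the candidate keys. Once the FDs are given, that step can be automated with attribute closure. The following is a minimal sketch, not the procedure used in the lecture: the function names `closure` and `candidate_keys` are hypothetical, and the brute-force search over attribute subsets is only suitable for small schemas. It is run on the FDs of the worked example R(PID, PersonName, Country, Continent, ContinentSize, NumVisitsInCountry) and on the BCNF example R(A,B,C,D).

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs whenever lhs is covered and rhs adds something
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return keys


# Worked example from the lecture.
R = {"PID", "PersonName", "Country", "Continent", "ContinentSize",
     "NumVisitsInCountry"}
fds_R = [
    ({"PID"}, {"PersonName"}),
    ({"PID", "Country"}, {"NumVisitsInCountry"}),
    ({"Country"}, {"Continent"}),
    ({"Continent"}, {"ContinentSize"}),
]

# BCNF example: R(A,B,C,D) with AB -> CD and C -> B.
S = {"A", "B", "C", "D"}
fds_S = [({"A", "B"}, {"C", "D"}), ({"C"}, {"B"})]

print(candidate_keys(R, fds_R))
print(candidate_keys(S, fds_S))
```

The search works upwards by subset size, so every set it accepts is minimal. The hard part remains eliciting the FDs from the attribute semantics, which no program can do from the tables alone.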