DATA NORMALIZATION
Carbaquil, Carmela Dawn P.
May 27, 2023

What has to be broken before you can use it? E__
How many months of the year have 28 days? __
What is full of holes but still holds water? S_ON_E

Relation
A relation is a two-dimensional table of data consisting of rows (records) and columns (attributes or fields).
A relation (entity) must have a unique name.
Every attribute value must be atomic (not multivalued, not composite).
Every row must be unique (no two rows may have exactly the same values for all their fields).
Attributes (columns) in a table must have unique names.
The order of the columns (field names) is irrelevant.
The order of the rows (records) is irrelevant.

Integrity Constraints
Domain Constraints – the allowable values for an attribute.
Entity Integrity – no primary key attribute may be null; all primary key fields MUST have data.
Referential Integrity – a foreign key value in one relation must match a primary key value in another relation, or the foreign key value must be null.

Data Normalization
Primarily a tool to validate and improve a logical design so that it satisfies certain constraints that avoid unnecessary duplication of data.
The process of decomposing relations with anomalies to produce smaller, well-structured relations.
Developed by E. F.
Codd in 1972.

Well-Structured Relations
A well-structured relation contains minimal data redundancy and allows users to insert, delete, and update rows without causing data inconsistencies.
The goal is to avoid anomalies:
a) Insertion Anomaly
b) Deletion Anomaly
c) Modification Anomaly

Anomalies
Insertion Anomaly – adding new rows forces the user to create duplicate data.
Deletion Anomaly – deleting rows may cause a loss of data that would be needed for other future rows.
Modification Anomaly – changing data in a row forces changes to other rows because of duplication.

Functional Dependency
A functional dependency is a constraint between two attributes in which the value of one attribute is determined by the value of another attribute.
For any relation R, attribute B is functionally dependent on attribute A (written A → B) if the value of A uniquely determines the value of B.
The attributes on the left side of the arrow in a functional dependency (e.g., SSSNo, ISBN) are called determinants, while the attributes on the right side are the dependents.
A candidate key is a unique identifier (one or more attributes) that satisfies the properties of (a) unique identification and (b) non-redundancy. One of the candidate keys becomes the primary key.

Steps in Normalization
Normal Form – a state of a relation in which certain rules regarding relationships between attributes (functional dependencies) are satisfied.

What can you break, even if you never pick it up or touch it? PR_MI_E
What goes up but never comes down? A_E

First Normal Form (1NF)
A relation that has a primary key and in which there are no repeating groups.
Attributes are atomic (simple) and single-valued.
Hence, multivalued attributes are eliminated.
A primary key has been identified.

Second Normal Form (2NF)
A relation in first normal form in which every non-key attribute is fully functionally dependent on the primary key.
A relation with no partial dependency on a composite primary key.
A partial dependency occurs when a non-key attribute (dependent) depends on only part of the primary key (one of the attributes of a composite primary key).
Steps to convert a relation with partial dependencies into second normal form:
1. Create a new relation for each primary key attribute (or combination of attributes) that is a determinant in a partial dependency. That attribute serves as the primary key of the new relation.
2. Move the non-key attributes that depend on that primary key attribute from the old relation to the new relation.

Third Normal Form (3NF)
The relation must be in second normal form and have no transitive dependencies.
A transitive dependency is a functional dependency between the primary key and one or more attributes that depend on the primary key via another non-key attribute.
Steps to convert a relation with transitive dependencies into third normal form:
1. For each non-key attribute (or set of attributes) that is a determinant in the relation, create a new relation and make that attribute its primary key.
2. Move the attributes that are functionally dependent on the new relation's primary key into the new relation.
3. Make the new relation's primary key a foreign key in the old relation to associate the two relations.

It belongs to you, but other people use it more than you do. What is it? N_ME

Boyce-Codd Normal Form (BCNF)
Most third normal form relations are also BCNF relations.
A third normal form relation can fail to be in BCNF only when all of the following hold:
A. The candidate keys in the relation are composite keys (they are not single attributes).
B. There is more than one candidate key in the relation.
C.
The keys are not disjoint; that is, some attributes are common to more than one key.

In the above table, the functional dependencies are as follows:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP_ID, EMP_DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a key.
Candidate keys of the decomposed tables:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}

Fourth Normal Form (4NF)
A relational table is in fourth normal form (4NF) if it is in BCNF and all multivalued dependencies are also functional dependencies.
Fourth normal form is based on the concept of multivalued dependencies.
A multivalued dependency occurs when, in a relational table containing at least three columns, one column has multiple rows whose values match the value of a single row of one of the other columns.
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities. Hence, there is no relationship between COURSE and HOBBY.
In the STUDENT relation, the student with STU_ID 21 has two courses, Computer and Math, and two hobbies, Dancing and Singing. So there is a multivalued dependency on STU_ID, which leads to unnecessary repetition of data.

Fifth Normal Form (5NF)
A table is in fifth normal form (5NF) if it cannot be losslessly decomposed into any number of smaller tables.
Fifth normal form is based on the concept of join dependency.
Join dependency means that a table, after it has been decomposed into three or more smaller tables, must be capable of being joined again on common keys to form the original table.
5NF indicates that an entity cannot be decomposed further.
In the above table, John takes both the Computer and Math classes in Semester 1, but he does not take the Math class in Semester 2.
In this case, the combination of all these fields is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the subject or who will be taking it, so we leave Lecturer and Subject as NULL. However, all three columns together act as the primary key, so we cannot leave the other two columns blank.

What has a head and a tail but no body? C__N

CODD’s Rules
Dr. Edgar F. Codd did extensive research on the relational model of database systems and came up with twelve rules that a database must obey in order to be a true relational database.
These rules can be applied to a database system that is capable of managing its stored data using only its relational capabilities. This is the foundation rule, which provides the base for the other rules.

Rule 1: Information rule
This rule states that all information (data) stored in the database must be a value of some table cell. Everything in a database must be stored in table format. This information can be user data or metadata.

Rule 2: Guaranteed access rule
This rule states that every single data element (value) is guaranteed to be logically accessible through a combination of table name, primary key (row value), and attribute name (column value). No other means, such as pointers, can be used to access data.

Rule 3: Systematic treatment of NULL values
This rule states that NULL values in the database must be given a systematic treatment, because a NULL may have several meanings: the data is missing, the data is not known, the data is not applicable, and so on.

Rule 4: Active online catalog
This rule states that the structural description of the whole database must be stored in an online catalog (a data dictionary) that authorized users can access. Users can query the catalog with the same query language they use to access the database itself.
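
As a minimal sketch of Rule 4, the snippet below uses Python's built-in sqlite3 module. The choice of SQLite is an assumption, since the slides do not name a DBMS, and the student table is a hypothetical example. SQLite exposes its catalog as a table named sqlite_master, so the same SELECT syntax used for user data also reads the schema.

```python
import sqlite3

# In-memory database; the "student" table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (stu_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO student VALUES (21, 'John')")

# Ordinary query against user data.
user_rows = conn.execute("SELECT stu_id, name FROM student").fetchall()

# Same query language, this time against SQLite's catalog table.
catalog_rows = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(user_rows)     # [(21, 'John')]
print(catalog_rows)  # [('table', 'student')]
```

Other database systems expose their catalogs under different names, but the idea is the same: the schema is itself queryable data.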

Rule 5: Comprehensive data sub-language rule
This rule states that a database must support a language with a linear syntax that is capable of data definition, data manipulation, and transaction management operations. The database can be accessed only by means of this language, either directly or through an application. If the database can be accessed or manipulated in some way without the help of this language, it is a violation.

Rule 6: View updating rule
This rule states that all views of a database that can theoretically be updated must also be updatable by the system.

Rule 7: High-level insert, update, and delete rule
This rule states that the database must support high-level insert, update, and delete operations. These must not be limited to a single row; that is, the database must also support union, intersection, and minus operations to yield sets of data records.

Rule 8: Physical data independence
This rule states that the application should not be concerned with how the data is physically stored. Also, any change in the physical structure must not have any impact on the application.

Rule 9: Logical data independence
This rule states that the logical data must be independent of its user’s view (application). Any change in the logical data must not require any change in the application using it. For example, if two tables are merged or one is split into two different tables, the change should have no impact on the user application. This is one of the most difficult rules to apply.

Rule 10: Integrity independence
This rule states that the database must be independent of the application using it. All its integrity constraints can be modified independently without any change in the application. This rule makes the database independent of the front-end application and its interface.
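
Rule 10 can be illustrated with a small sketch, again assuming Python's built-in sqlite3 module and hypothetical department and employee tables (neither appears in the slides). The constraints are declared in the schema, so the database itself rejects invalid rows regardless of which application submits them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        age     INTEGER CHECK (age >= 18),               -- domain constraint
        dept_id INTEGER REFERENCES department(dept_id)   -- referential integrity
    )
""")
conn.execute("INSERT INTO department VALUES (1, 'Design')")
conn.execute("INSERT INTO employee VALUES (100, 30, 1)")     # valid row
conn.execute("INSERT INTO employee VALUES (101, 25, NULL)")  # null foreign key is allowed

def try_insert(row):
    """Return True if the database accepts the row, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert((102, 12, 1)),   # violates the CHECK on age
    try_insert((103, 40, 99)),  # dept_id 99 has no matching department
    try_insert((100, 40, 1)),   # duplicate primary key
    try_insert((104, 40, 1)),   # satisfies every constraint
]
print(results)  # [False, False, False, True]
```

Because the rules live in the table definitions, they could be changed with DDL alone, without touching the calling code.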

Rule 11: Distribution independence
This rule states that the end user must not be able to see that the data is distributed over various locations. To the user, the data should always appear to be located at one site only. This rule is regarded as the foundation of distributed database systems.

Rule 12: Non-subversion rule
This rule states that if a system has an interface that provides access to low-level records, that interface must not be able to subvert the system and bypass security and integrity constraints.

Why do we need to learn data normalization?
How can you say that a database is well structured?

Group Assignment
Answer the activity about data normalization attached in MS Teams. To be submitted on or before the next meeting.