Class Agenda: 03/13 – 03/15
- Review database design – core concepts
  - Review design for ERD Scenarios #3 & #4.
  - Review concepts of normalization.
  - Do practice design from forms using the Replica Toy database (ERD scenario #5).
- Discuss issues in database design and normalization.
- Discuss concepts of data warehouse design.
  - Establish environment surrounding DW design.
  - Contrast methods of DW design.

Goals for Transaction Database Design
- Protect the integrity of the data.
- Reduce data redundancy.
- Prevent data anomalies.
- Provide for change.
- Prevent inflexible data structures.
- Anticipate changes.
- Provide access to complete data for decision making.

What is normalization?
- Normalization is a formal, process-oriented approach to data modeling.
- Normalization is the process of:
  - examining groups of data attributes;
  - splitting them into appropriate entities;
  - identifying the relationships between the entities; and
  - identifying appropriate primary and foreign keys.

Two methods of applying normalization
1. Use it to help in designing a database.
  - Normalization starts with a single entity.
  - Normalization breaks that entity into a series of additional entities.
  - More entities are discovered and named during the process.
  - Entities are linked during the process.
2. Use it to validate the design of a database.
  - Identify entities from the meaning of the data.
  - Create conceptual and logical data models.
  - Apply the rules of normalization to ensure a stable, nonredundant design.

Normalization Vocabulary: Functional Dependency and Determinants
- A social security number determines your name and address.
  SSN → name, address
- A vehicle id number determines the make and model of a car.
  VIN → make, model
- Name and address are “functionally dependent” on SSN.
- SSN “determines” name and address.
- Functional dependency diagram format:
  CrsNum → CrsDescription, CrsCredits
  ZipCode → City, State (this implies that a zip code uniquely identifies a city and state in the U.S. postal system)
  PatID, TrtDateTime → TstResults, TrtID, LocID

Normal forms relevant to business-oriented databases
- First normal form: Remove repeating groups.
- Second normal form: Remove partial functional dependencies.
- Third normal form: Remove transitive dependencies.

First Normal Form
- First normal form: Remove repeating groups.
- A repeating group is an attribute or group of attributes that can have more than one value for an instance of an entity.
- Example of repeating groups:
  StudentID → StudentName, StudentAddress, CourseID1, DateTaken1, Grade1, CourseID2, DateTaken2, Grade2, CourseID3, DateTaken3, Grade3, CourseID4, DateTaken4, Grade4…

Other examples of a repeating group
  Serial# → model#, customer name, customer address, feature 1 chosen, feature 2 chosen, feature 3 chosen…
  PatientID → name, address, zip, first insurance company, second insurance company, third insurance company…

To remedy a problem discovered with normalization
To get a data model into an appropriate normal form:
- Identify the problem (repeating group, partial functional dependency, or transitive dependency) and place the “problem” attributes in one or more new separate entities in the model.
- Identify a primary key for the new entity. The key may be concatenated if it is an associative entity, rather than a strong entity.
- Create relationships between existing and new entities.
- Divide m:n relationships with appropriate intersection entities.

Second Normal Form
- Second normal form: Remove partial functional dependencies.
- A partial functional dependency is a situation in which one or more non-key attributes are functionally dependent on part, but not all, of the primary key.
- Partial functional dependencies occur only with entities that have concatenated primary keys.
- Examples of partial functional dependencies:
  PatID, TrtDateTime → PatName, TstResults, TrtType, TrtDescription, LocName, TrtID, LocID
  CourseID, StudentID → CourseTitle, Grade

Third Normal Form
- Third normal form: Remove transitive dependencies.
- A transitive dependency occurs when a non-key attribute is functionally dependent on one or more non-key attributes.
- Examples of transitive dependencies:
  TrackingNumber → ShipmentDate, OrderID, ItemID, ShipmentLocationID, LocationDescription, QuantityShipped
  PatID, TrtDateTime → TstResults, TrtType, TrtDescription, LocName, TrtID, LocID

Issues in Database Design
- Characteristics of business-oriented databases:
  - Used to store transactions.
  - Updated quickly and frequently, but not always accurately.
  - Accessed online real-time.
  - Support operational decision making.
- Assuming that the data stored is accurate, what is the biggest potential problem with a transaction database in third normal form?
- How do most organizations solve that problem?
- What do organizations potentially lose when they solve that problem?

Major purposes of a data warehouse
- To create a data storage designed to facilitate managerial decision making.
  - Integrated data.
  - Subject-oriented.
  - Time-variant.
  - Non-volatile.
- To create a data storage that has better quality, more consistent data than existing operational databases.

[Figure: data warehouse architecture. Labels: Operational Transaction and External Data Sources; Extract Transform Load (ETL) Processes; Reconciled Enterprise Data Warehouse (Data Warehouse Server); Extract Load Processes; Data Mart Tier; User Departments]

Goals of data warehouse design
- Make accurate information easily accessible.
- Present information consistently.
- Be adaptive and flexible to change.
- Provide reasonable and expected performance for information to support decision making.
- Minimize data redundancy.
- Protect/secure information.

Three different data models
- Transaction (operational) data model:
  - Contains current data required by separate and/or integrated operational systems.
  - Supports the transactional processing of the organization.
  - Is frequently used to support day-to-day decision making.
  - 3rd normal form.
- Reconciled (enterprise data warehouse) data model:
  - Contains detailed, current data intended to be the single, authoritative source for all decision support applications.
  - Usually in 3rd normal form.
- Derived (data mart) data model:
  - Contains data that are selected, formatted and aggregated for end-user decision support applications.
  - Star schema. Probably not normalized.

Comparison – Replica Toys
[Figures: Transaction data model; Reconciled data model; Derived (data mart) data model]

Reconciled and Derived Data Models
Reconciled (EDW):
- Independent of specific decisions
- Centralized control; usually owned by IT
- Historical
- Not summarized
- Normalized
- Flexible
- Many data sources
- Long life
- Starts large, becomes larger
Derived (Data Mart):
- Specific decisions
- One central subject
- Usually accessed directly by users; usually decentralized into user area
- Closely defined subject area
- Detailed and/or summarized
- Usually denormalized
- Restrictive – few sources
- Short life span
- Starts small, becomes large

Two approaches to design
Enterprise Data Warehouse (Inmon):
- Focus is on enterprise subjects that will be needed to support comprehensive decision making.
- Emphasis on creating design that is consistent among subject areas.
- Implementation is of a data mart.
- Uses ERD for modeling.
- Relies on comprehensive blueprint for interrelation of data.
Interrelated Data Marts (Kimball):
- Focus is on business subject area for data warehouse.
- Emphasis on creating simple design that can be implemented quickly.
- Implementation is of a data mart.
- Uses “dimensional model” for modeling. Kind of like an ERD with UML-type aspects.
- Relies on consistent interrelation of data by integration of existing data models.

Compare/Contrast Approaches
Similarities:
- Both focus on subject areas for development of data model.
- Both require extensive input from data warehouse stakeholders.
- Both produce a subject-oriented, non-volatile, time-related data warehouse.
- Both try to quickly implement a prototype data mart.
Differences:
- Inmon creates a more integrated and consistent data warehouse by attempting to design an enterprise-wide warehouse at the beginning of the first data warehouse project. This is called a “reconciled” DW design.
- Kimball relies on future project teams referencing existing data warehouse models for new projects.

What do both approaches yield?
- A design for a data mart.
- The design for a data mart relies on the concept of a data warehouse “cube.”
- A cube is a logical construct containing a “fact” table that is accessed through multiple “dimension” tables.
- A fact table contains values that a manager uses to make decisions.
- A dimension table is used as a reference for the values in the fact table.

Steps of data warehouse design
1. Identify the stakeholders that need data to support their decisions.
2. Define and describe the data needs of those stakeholders.
3. Define the subject area.
4. Choose an EDW plus data mart, or just a data mart.
5. Select the data of interest.
6. Add element of time.
7. Add derived data.
8. Determine granularity level.
9. Summarize data.
10. Identify and attempt to solve potential performance issues.

How do you identify those people within an organization who require data to support their decision-making processes?

Define and describe the data needs
- Usually termed “stakeholder analysis”.
- Differing levels of decision making require differing sets of data.
  - Internal vs. external data.
  - Integrated vs. non-integrated data.
  - Detailed vs. summarized data.
- Different stakeholders require different access mechanisms.
  - Online vs. reports.
  - Pre-formatted vs. ad-hoc availability of data.
- Different stakeholders require different timing.
  - Online, real time vs. delay.
  - Relative size of delay/timeliness is always an issue.

Stakeholder Analysis Table Example – Replica Toys
Columns: Stakeholder | Decision Making Responsibilities | Existing Information? | Additional Information? | Availability of Additional Information?
Marketing Analyst
- Decision: Decide what features are most valuable to which customers.
  - Existing information: No data related to features currently available.
  - Additional information: Features selected by customers.
  - Availability: Not in existing system and cannot be compiled manually. Maybe telephone survey? Maybe registration system?
- Decision: Determine trends in toy purchases.
  - Existing information: Purchases by toy by customer.
  - Additional information: Need customer order data with more specific parameters.
  - Availability: See if available in customer order system.

Distribution Manager
- Decision: Determine trends in use of distribution outlets.
  - Existing information: Customer order data by distribution outlet.
  - Additional information: Customer order data by distribution outlet.
- Decision: Determine distribution outlet profitability.
  - Existing information: Purchases by toy by customer by distribution outlet.
  - Additional information: Purchase price by toy by customer by distribution outlet.

Quality control specialist
- Decision: Evaluate comparative defects of toys within and across product lines.
  - Existing information: Support call data. Product return data.
  - Additional information: Detailed problem reports including date, toy, problem, extent of damage.
  - Availability: Not available in current support call and product return systems. Could be added.

Development engineer
- Decisions: Evaluate relative safety issues with existing product line. Determine potential safety issues with new product development.
  - Existing information: Support call data. Product return data. Safety test data.
  - Additional information: Detailed problem reports including date, toy, problem, injury, relative impact of injury, potential responsibility.
  - Availability: Not available in current support call and product return systems. Could be added. Engineering safety test data is available.

Define the subject area
Potential subject areas in common to many businesses:
- Customers: People and organizations who acquire and/or use the company’s products.
- Equipment: Machinery, devices, tools and their components.
- Facilities: Real estate and their components.
- Sales: Transactions that move a product from company to a customer.
- Suppliers: Entities that provide a company with goods and services.
- Products: Goods and services that the company, or its competitors, provide to customers.
- Materials: Goods and services that the company uses to produce its products.
- Financials: Information about money that is received, retained, expended, invested or in any way tracked by the company.
- Human resources: Individuals who perform work for the company – may be employees, contractors, or simply positions.

Select the data of interest
- Use the existing transaction database model.
- Identify and understand the necessary business decisions.
- Identify external data that could help support decisions.
- Use tables to help sort available attributes. Example: Table 4.1 on pgs 104-106 of chapter 4 in “Mastering Data Warehouse Design.”

Add element of time
- A data warehouse is a historical model rather than a current “point in time” model.
- Must have a way to incorporate changes that occur over time.
- Important issues:
  - Fact table must include a time component.
  - Ranges of time vs. effective period in time.
  - Time also relates to dimension tables.
  - May have to deal with differing time periods. Examples are fiscal years, “holiday rush,” billing cycle, etc.

Add derived data
- Derived data includes any kind of calculated field.
- Examples: total sales; net sales amount; total funds raised; total cost of products.
- Issues:
  - Must be identified, defined and agreed upon by data warehouse stakeholders.
  - Must be documented in metadata.
  - Must be consistent.

Determine granularity level
- What are the benefits and drawbacks of a low level of granularity?
- What are the benefits and drawbacks of a high level of granularity?
- What factors should be considered when determining the level of granularity in the data warehouse?

Summarize (aggregate) data
- What is summarized data?
- How is data summarized?
- Does summarized data save disk space?
- Why summarize data?

Identify and solve performance issues
- What are the potential performance problems that can occur with a data warehouse?
- Why is performance a consideration during data warehouse design?
- What can a designer do to alleviate potential performance problems?
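
Worked sketch: time, derived data, granularity and summarization

The design steps "Add element of time", "Add derived data", "Determine granularity level" and "Summarize data" can be tried out on a very small star schema. The sketch below is only an illustration under stated assumptions, not the Replica Toys design: the table names (dim_date, dim_toy, fact_sales), the column names, the toys and all amounts are hypothetical, and the fact table grain is assumed to be one row per toy per day. Amounts are stored as integer cents so the derived net amount is exact.

```python
import sqlite3

# Hypothetical star schema: one fact table referenced by two dimension tables.
SCHEMA = """
CREATE TABLE dim_date (                 -- time dimension
    date_key   INTEGER PRIMARY KEY,     -- e.g. 20240313
    full_date  TEXT NOT NULL,
    month      TEXT NOT NULL            -- coarser level used for summaries
);
CREATE TABLE dim_toy (
    toy_key      INTEGER PRIMARY KEY,
    toy_name     TEXT NOT NULL,
    product_line TEXT NOT NULL
);
CREATE TABLE fact_sales (               -- assumed grain: one row per toy per day
    date_key       INTEGER NOT NULL REFERENCES dim_date(date_key),
    toy_key        INTEGER NOT NULL REFERENCES dim_toy(toy_key),
    quantity       INTEGER NOT NULL,
    gross_cents    INTEGER NOT NULL,
    discount_cents INTEGER NOT NULL,
    net_cents      INTEGER NOT NULL     -- derived data: gross minus discount
);
"""

# Made-up sample data for illustration only.
DATES = [
    (20240313, "2024-03-13", "2024-03"),
    (20240314, "2024-03-14", "2024-03"),
    (20240402, "2024-04-02", "2024-04"),
]
TOYS = [
    (1, "Replica Truck", "Vehicles"),
    (2, "Replica Plane", "Aircraft"),
]
# (date_key, toy_key, quantity, gross_cents, discount_cents)
SALES = [
    (20240313, 1, 2, 4000, 500),
    (20240314, 1, 1, 2000, 0),
    (20240314, 2, 3, 9000, 1000),
    (20240402, 2, 1, 3000, 0),
]


def build_demo():
    """Create the in-memory schema and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", DATES)
    conn.executemany("INSERT INTO dim_toy VALUES (?, ?, ?)", TOYS)
    # The derived net amount is computed once, at load time, so every
    # user of the fact table sees the same agreed-upon definition.
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)",
        [(d, t, q, g, disc, g - disc) for d, t, q, g, disc in SALES],
    )
    return conn


def monthly_summary(conn):
    """Summarize the daily grain up to month and product line."""
    return conn.execute(
        """
        SELECT d.month, t.product_line, SUM(f.quantity), SUM(f.net_cents)
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_key = f.date_key
        JOIN dim_toy  AS t ON t.toy_key  = f.toy_key
        GROUP BY d.month, t.product_line
        ORDER BY d.month, t.product_line
        """
    ).fetchall()


if __name__ == "__main__":
    for row in monthly_summary(build_demo()):
        print(row)
```

With the sample rows above, the four daily fact rows summarize to three monthly rows (for example, March "Vehicles" totals 3 units and 5500 cents net). Keeping the fact table at the daily grain and summarizing on demand is one possible choice; storing the summary instead is smaller and faster to query, but the daily detail can no longer be recovered from it.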