02-2-DATA MODELING and IMPORTANCE

Topics Covered:
WHY DO DATA MODELING?
MODELING
-- Entity-Relationship Model
-- Dimensional Modeling – Quick Look
-- Fact Table
-- Dimension Tables

Document1 by rt -- 12 March 2016

WHY DO DATA MODELING?

Data modeling is the process of analyzing and representing the "things" (entities) about which an enterprise must know. (Source reported from Larry English's teachings as far back as 1994)

When they built the CN Tower they used a lot of plans: plans for HVAC, telecommunications, and the physical structure. The problem of building the Tower was complex. When you designed your database in an earlier subject, you were doing the "plan" before implementing an actual database.

Data modeling is
-- A plan
-- A guide
on how to build something.

As mentioned above, you are already familiar with data modeling from the design course. You should remember doing
-- Entity Attribute Lists
-- Entity Relationship Diagrams

MODELING

Entity-Relationship Model

The ER model describes data as Entities, Relationships and Attributes.

-- What are the advantages?
- The data exists in a highly normalized form
- The model seeks to drive all the redundancy out of the data
This means that a transaction that changes any data only needs to touch the database in one place, which speeds up transaction processing.

-- Disadvantages
For queries that span many records or many tables, the ER model is
- Too complex for users to understand
- Too complicated to be navigated successfully by DBMS software (code)
- Very slow when summarizing large amounts of data
- Expensive, because multiple JOINs are required and they use more resources to process

So what is the alternative?

Dimensional Modeling – Quick Look

Dimensional modeling is a technique that helps the database designer build information structures that satisfy the way a business user asks for information.
The most important advantage is that it matches end users' needs for simplicity.

The purpose of dimensional modeling is to provide the data warehouse and managed-query tools with a database definition that lends itself to subject-oriented information processing.

Information can be re-arranged to be presented to the end user in multiple views, from different perspectives, and very fast.

Components of a DW (data warehouse)

First, it is still a relational database. That means it has tables that can be accessed like any other relational database, by SQL.

2 types of tables
-- FACT TABLES
-- DIMENSION TABLES
This is where the name dimensional model comes from.

2 styles of dimensional models
-- STAR SCHEMA
-- SNOWFLAKE SCHEMA

General description
There is one dominant table (the fact table) in the center of the schema, with multiple joins connecting it to the other tables (the dimension tables).

DIMENSIONAL MODEL: STAR SCHEMA

The concentration in this course will be on the STAR schema.

SALES FACT (center of the star)
    Time key
    Product key
    Store key
    Dollars sold
    Units sold
    Dollars cost

TIME DIMENSION
    Time key
    Day-of-week
    Month
    Quarter
    Year
    Holiday flag

PRODUCT DIMENSION
    Product key
    Description
    Brand
    Category

STORE DIMENSION
    Store key
    Store name
    City or Province
    Floor plan type

Fact Table

The fact table is part of the star schema in a DW. Fact tables contain 2 things ONLY. Each row in a fact table is composed of
-- A set of keys
-- A set of measures

Keys
The keys are the foreign keys that connect to the primary keys in the dimension tables. Together, the set of keys forms the composite primary key of the fact table.

Measures
The most useful fact tables also contain one or more numerical measures, or facts, that occur for the combination of keys that define each record. Measures represent how the business measures performance. As you can see from the diagram above, total dollar sales, total dollar costs and total products sold are ways of measuring the business.
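The star schema above can be sketched as real tables. The following is only a minimal illustration using Python's built-in sqlite3 module. The table names, the lower-case column names, the trimmed-down set of dimension attributes and all of the sample rows are assumptions made up for the example; they are not taken from the course database.

```python
import sqlite3

# Hypothetical names adapted from the star schema diagram (assumption).
# Note: SQLite only enforces REFERENCES when PRAGMA foreign_keys is ON;
# here they simply document which dimension each key points to.
SCHEMA = """
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, province TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    dollars_sold REAL,
    units_sold   INTEGER,
    dollars_cost REAL,
    PRIMARY KEY (time_key, product_key, store_key)  -- composite key made of the foreign keys
);
"""


def build_demo_db():
    """Create an in-memory star schema and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)",
                     [(1, "January", "Q1", 2016), (2, "February", "Q1", 2016)])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)",
                     [(1, "Bananas", "Brand A", "Fruit"),
                      (2, "Frozen peas", "Brand B", "Frozen")])
    conn.execute("INSERT INTO store_dim VALUES (1, 'Store 1', 'Province A')")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, 100.0, 50, 60.0),
                      (2, 1, 1, 120.0, 60, 70.0),
                      (1, 2, 1, 80.0, 20, 50.0)])
    return conn


def dollars_sold_by_category(conn):
    """Join the fact table to one dimension and sum a measure per category."""
    query = """
        SELECT p.category, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN product_dim AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """
    return conn.execute(query).fetchall()
```

With the made-up rows above, dollars_sold_by_category(build_demo_db()) returns one total per product category. Notice that the query needs only one join from the fact table to the dimension that supplies the grouping attribute.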
We can look at measures such as dollars sold broken down by various time periods, by products, by a store attribute, or by a combination of any of these.

The best and most useful facts (measures) are
1 Numeric
2 Additive
3 Continuously valued

This makes analysis of the data much easier, and it distinguishes the attributes in the fact table from the attributes in the dimension tables.

To repeat, an important point to understand is that the fact table contains numeric values. It does not contain character fields such as name or province.

Additive (More on this topic later)

Additive ability is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and often the most useful thing to do with so many records is to add them up and present a summary by category for management to analyze.

The fact table is the largest table in a data warehouse. There can be millions of rows, up to several hundred billion rows.

Think of a company like Wal-Mart. For Wal-Mart to understand its customers, each customer's purchasing habits need to be recorded, along with purchasing habits by store, by time of week, and by province or region. The combinations of how inventory or products are sold result in a massive amount of data extracted from EVERY purchase in EVERY store over a period of years.

As another example, consider the amount of data stored by Facebook or Google. These are massive databases of customer information.

To show cumulative figures, the data needs to be summarized, for example as Total $ Sales of a PRODUCT, Category or Brand across many time periods. So what is stored will be the different groupings of Sales. What is NOT stored is a single sale of a product on an order. Similarly, in a banking operation, the account balances over a period of a month are not stored as a measure: adding up a month of balances would be unlikely to have value to the organization, so a balance is not additive.
NOTE: It is a common error to include numeric data in the fact table that is not additive.

You know that all numeric data can be added. In fact, since people's names are really a set of bits, in theory they could be added also. However, if we added up all the names in a class and displayed the result, the key thing to notice is that the result would be meaningless. So for a numeric item to be put in the fact table, it must also be meaningful to add up the results.

For example, if we took the total dollar sales for each day and added those up for a whole week, the result would still have meaning. You would have an important value, namely weekly dollar sales.

However, if the numeric quantity on hand (QOH) of a single product in a store was added to the QOH of all the other products in that same aisle, the total quantity of products in that aisle would have no business value or meaning. For example, in a grocery store the QOH of all canned goods is 23,721. Is that all beans? Is it large cans, small cans, mostly vegetables, or is it juice? The number 23,721 has no value to the business.

Another star schema example (figure not reproduced)

A better choice of name for a measure might be Total Dollar Cost.

Using a smaller, less detailed fact table with the following attributes as an example:
    Time ID
    Product ID
    Store ID
    Total Dollar Sales
    Total Dollar Cost
    Total Units Sold

A single row of data represents the total dollar sales, total dollar cost and total number of units sold for one combination of the keys, that is, for 1 product, in 1 store, in 1 time period. Assuming the time period to be a week or month, there would be many sales of that product in a single store. The row is an accumulation of many transactions from the OLTP system.
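The accumulation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how a real load process is written: the tuple layout, the IDs ("2016-W10", "P1", "S1") and the dollar amounts are all made up for the example.

```python
from collections import defaultdict


def build_fact_rows(transactions):
    """Accumulate OLTP sale lines into one fact row per key combination.

    Each transaction is a hypothetical tuple:
    (time_id, product_id, store_id, dollar_sales, dollar_cost, units_sold)
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for time_id, product_id, store_id, sales, cost, units in transactions:
        row = totals[(time_id, product_id, store_id)]
        row[0] += sales   # Total Dollar Sales
        row[1] += cost    # Total Dollar Cost
        row[2] += units   # Total Units Sold
    # One entry per (Time ID, Product ID, Store ID), as in the fact table.
    return {key: tuple(values) for key, values in totals.items()}


# Made-up sale lines: two sales of product P1 and one of P2,
# all in store S1 during the same week.
SAMPLE_TRANSACTIONS = [
    ("2016-W10", "P1", "S1", 2.50, 1.50, 1),
    ("2016-W10", "P1", "S1", 5.00, 3.00, 2),
    ("2016-W10", "P2", "S1", 4.00, 2.00, 1),
]

FACT_ROWS = build_fact_rows(SAMPLE_TRANSACTIONS)
```

Here the two P1 sales collapse into a single fact row, while P2 gets a row of its own. The three measures can be added this way precisely because they are additive.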
Dimensional Model (DM) vs Entity-Relationship (ER) Model

The key to understanding the relationship between DM and ER is that a single ER diagram breaks down into multiple DM diagrams, or 'stars'.

Think of a large ER diagram as representing every possible business process within an application. The ER diagram may have Sales Calls, Order Entries, Shipment Invoices, Customer Payments, and Product Returns, all on the same diagram.

Figure labels: Returns, Shipments, Orders, Sales Contact, Payments

Up until now you have been dealing with small sets of tables, maybe 12 to 15. However, corporate databases contain far more tables. The sample that follows is a slightly more complicated ERD. It is by no means a full representation of a large company. It contains very little of the accounting functions that need to be stored, or of the tables associated with employees, such as benefits.

Sample of larger ERD (figure not reproduced)

Dimension Tables

Dimension tables hold textual descriptions of the dimensions, in other words the attributes of a store, customer or product.

Looking at the earlier star schema example, you can see that if you were to add sample data to the dimension table attributes, the data would be textual. Examples for CUSTOMER would be customer name, address, etc. Examples for PRODUCT would be product name, product category name, brand name and other text-like attributes.

The purpose of these attributes is to filter how the facts are presented.

Why do numeric measures like total dollar sales need to be filtered? Management wants something more than a total sales figure. Management asks for total dollar sales in the last quarter by single product, by product sub category, by product category or by brand. Sometimes it wants the data broken down further, or differently, by province or by store within categories of products.
For example: what is happening in Sales of FROZEN ITEMS (fish, vegetables, packaged dinners, ice cream) by Province and Store for last week compared to the same week the previous year?

From this we can see that we can break down large totals, like the total sales dollars for all products in this quarter, into sub amounts by displaying total dollar sales by product category. In this way the textual attributes are ways of breaking down or filtering the measures.

Another example might be showing sales of Bananas (text) and Grapefruit (text):

                January    February
Bananas          12,375      13,200
Grapefruit       10,101      10,917

ASIDE: Notice how time is important in providing meaning to the data.

What is important is to determine what type of textual data is to be stored in a dimension table. Not all the data found in the OLTP system that describes a product is worth moving to the data warehouse. For example, sales by products coloured yellow might not be worthwhile unless the product is cars and not groceries. Later on, with practice doing some cases, you will be able to determine which attributes should go into a dimension table.

Another characteristic of dimension tables is that they have levels, or categories, that populate the table. (more later)

In the example above we can see that products are contained within a level called product sub category name, which in turn is contained in a level called product category name. These levels are a way to provide different groupings for the measures.

Examples of levels:
- Weeks, Month, Quarter, Year
- City, Province, Region, Country
- Customer, Branch, Region

ASIDE: These levels are due to denormalizing the OLTP tables into a single table.
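To see how levels give different groupings of the same measure, here is a small sketch in Python. It assumes a denormalized time dimension held as a dictionary, with one entry per time key carrying its month, quarter and year. The January and February amounts reuse the Bananas figures from the table above; the April row and all of the key values are made up so that a second quarter exists.

```python
from collections import defaultdict

# Hypothetical denormalized time dimension: every row carries all of its levels.
TIME_DIM = {
    1: {"month": "January", "quarter": "Q1", "year": 2016},
    2: {"month": "February", "quarter": "Q1", "year": 2016},
    3: {"month": "April", "quarter": "Q2", "year": 2016},
}

# (time_key, total dollar sales); the April amount is invented for illustration.
FACTS = [(1, 12375), (2, 13200), (3, 9000)]


def roll_up(facts, time_dim, level):
    """Sum the measure for each distinct value of the chosen level."""
    totals = defaultdict(int)
    for time_key, dollars in facts:
        totals[time_dim[time_key][level]] += dollars
    return dict(totals)


BY_MONTH = roll_up(FACTS, TIME_DIM, "month")
BY_QUARTER = roll_up(FACTS, TIME_DIM, "quarter")
BY_YEAR = roll_up(FACTS, TIME_DIM, "year")
```

The fact rows never change. Only the level chosen from the dimension changes, and because every level sits in the same dimension row, no extra joins are needed to move from month to quarter to year.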