Contents of this slideshow:
• What is a data warehouse?
• Multi-dimensional data modeling

An example of a data warehouse:
A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table.

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Orders: Order#, Ordertype
Dimension Products: Product#, Product-name, Price
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name

The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!

Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:

Fact table Orderdetails: Product#, Order#, Qty, Price
Dimension hierarchy:
  Orders: Order#, Customer#, Date
  Customers: Customer#, Customer-name

In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy.
Drill-down = "de-aggregate" = break an aggregate into its constituents.
Roll-up = aggregate along one or more dimensions.

Two different types of drilling:
• Drilling in dimension hierarchies.
• Drilling between dimensions.

Figure: the star schema with fact table Orderdetails and the dimensions Orders, Products, Salesmen and Time.

Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model? Which star schema would you recommend to be implemented first?

Product: Product#, ProductName, Price
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Order-DetailHistory: Inv-Item#, Order#, Seq#, State, Timestamp
InvoyceHistory: Invoice#, Timestamp, State, Notes
Product-Stock: Product#, Location#, Qty
Order: Order#, OrderDate, Balance, State
Shipping: Shipping#, ShipMethod, ShipCharge, State, ShipDate
Address: Address#, Name, Add1, Add2, City, State, Zip
Invoice: Invoice#, CreationDate
Location: Location#, Address
Customer: Customer#, Kredit-Limit, Balance
UserSession: Session#, IPaddress, #Click, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount
Payment: Payment#, Ammount, State, Timestamp
CreditCard: Card#, HolderName, ExpireDate
Other labels in the figure: Shipping, Billing

Data marts = Kimball uses the word for any multidimensional database/star schema.
A galaxy is a set of multidimensional databases with conformed (i.e. mutually adapted) dimensions:

The value chain:
Fact table SaleOrderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#
Fact table Storage-per-product: Product#, Date#, End-of-day-storage-qty
Fact table Purchaseorderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
Time dimension hierarchy:
  Year: yy
  Month: yy, mm
  Day: yy, mm, dd
Product dimension hierarchy:
  Products: Product#, Product-name
  Product groups: Product-group#, Product-group-name

Suppose an enterprise has a data mart for Purchase and another data mart for Sale as illustrated above. Is it possible to calculate the revenue per month for the last year by using such a galaxy?

Conformed dimensions = dimensions designed to be common to different data marts in order to make drill-across operations possible.
Conformed facts = measures with common units of measurement and granularities that make it possible to integrate measures from different fact tables.
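The drill-across operation that conformed dimensions make possible can be sketched as follows: each fact table is first aggregated to the same level of the shared Time dimension (here: month), and only then are the two results joined. This is a minimal sketch in Python with the standard sqlite3 module, not a definitive design. The column names (product_no, date_no and so on), the sample rows, and the choice to ignore Discount (its unit is not given on the slide) are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Conformed Time dimension shared by both data marts (assumed layout).
CREATE TABLE day (date_no INTEGER PRIMARY KEY, yy INTEGER, mm INTEGER, dd INTEGER);
CREATE TABLE sale_orderdetails (product_no INTEGER, sale_order_no INTEGER,
                                qty INTEGER, sale_price REAL, date_no INTEGER);
CREATE TABLE purchase_orderdetails (product_no INTEGER, purchase_order_no INTEGER,
                                    purchase_price REAL, qty INTEGER, date_no INTEGER);
-- Hypothetical sample rows.
INSERT INTO day VALUES (1, 2023, 1, 10), (2, 2023, 1, 20), (3, 2023, 2, 5);
INSERT INTO sale_orderdetails VALUES
    (1, 100, 2, 50.0, 1), (1, 101, 1, 50.0, 2), (1, 102, 3, 50.0, 3);
INSERT INTO purchase_orderdetails VALUES
    (1, 900, 30.0, 4, 1), (1, 901, 30.0, 2, 3);
""")

# Aggregate each fact table separately to (year, month), then join the results.
# Joining the raw fact tables directly would multiply rows and inflate the sums.
DRILL_ACROSS = """
WITH sales AS (
    SELECT d.yy, d.mm, SUM(s.qty * s.sale_price) AS sale_amount
    FROM sale_orderdetails s JOIN day d ON d.date_no = s.date_no
    GROUP BY d.yy, d.mm),
purchases AS (
    SELECT d.yy, d.mm, SUM(p.qty * p.purchase_price) AS purchase_amount
    FROM purchase_orderdetails p JOIN day d ON d.date_no = p.date_no
    GROUP BY d.yy, d.mm)
SELECT s.yy, s.mm, s.sale_amount, p.purchase_amount
FROM sales s JOIN purchases p ON p.yy = s.yy AND p.mm = s.mm
ORDER BY s.yy, s.mm
"""
rows = con.execute(DRILL_ACROSS).fetchall()
for row in rows:
    print(row)
```

The inner join keeps only months that occur in both data marts; a month with sales but no purchases would need an outer join instead.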
The value chain:
Fact table SaleOrderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#
Fact table Storage-per-product: Product#, Date#, End-of-day-storage-qty
Fact table Purchaseorderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
Time dimension hierarchy:
  Year: yy
  Month: yy, mm
  Day: yy, mm, dd
Product dimension hierarchy:
  Products: Product#, Product-name
  Product groups: Product-group#, Product-group-name

Is it possible to calculate the revenue per month for the last year if the data mart for Purchase and the data mart for Sale do not have conformed dimensions or facts?

Data warehouse aggregating to the product level:

  SELECT Orderdetails.Product#, SUM(Qty*Price) AS Turnover
  FROM Orderdetails
  JOIN Products ON Orderdetails.Product# = Products.Product#
  GROUP BY Orderdetails.Product#;

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Orders: Order#, Ordertype
Dimension Products: Product#, Product-name, Price
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name

Drill down to the Product per Salesman level:

  SELECT Orderdetails.Product#, Orderdetails.Salesman#, SUM(Qty*Price) AS Turnover
  FROM Orderdetails
  JOIN Products ON Orderdetails.Product# = Products.Product#
  JOIN Salesmen ON Orderdetails.Salesman# = Salesmen.Salesman#
  GROUP BY Orderdetails.Product#, Orderdetails.Salesman#;

Where should the Price be stored?

Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:

Fact table Orderdetails: Product#, Order#, Qty, Price
Dimension hierarchy:
  Orders: Order#, Customer#, Date
  Customers: Customer#, Customer-name

In contrast to a star schema, a snowflake schema may have dimension hierarchies. Describe the advantages and disadvantages of using dimension hierarchies/snowflake schemas.
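Aggregating through the dimension hierarchy above (Orderdetails, Orders, Customers) can be sketched as follows: moving one level up the hierarchy means joining one more table and changing the GROUP BY columns. This is a minimal sketch in Python with the standard sqlite3 module. The underscore column names and all sample rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_no INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_no INTEGER PRIMARY KEY, customer_no INTEGER, date TEXT);
CREATE TABLE orderdetails (product_no INTEGER, order_no INTEGER, qty INTEGER, price REAL);
-- Hypothetical sample rows.
INSERT INTO customers VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO orders VALUES (10, 1, '2023-01-05'), (11, 1, '2023-01-09'), (12, 2, '2023-02-01');
INSERT INTO orderdetails VALUES (7, 10, 2, 5.0), (8, 10, 1, 20.0), (7, 11, 4, 5.0), (8, 12, 3, 20.0);
""")

# First level of the hierarchy: aggregate the fact rows to the Orders level.
per_order = con.execute("""
    SELECT o.order_no, SUM(d.qty * d.price) AS turnover
    FROM orderdetails d
    JOIN orders o ON o.order_no = d.order_no
    GROUP BY o.order_no
    ORDER BY o.order_no
""").fetchall()

# Second level: one more join reaches Customers, and GROUP BY moves up with it.
per_customer = con.execute("""
    SELECT c.customer_name, SUM(d.qty * d.price) AS turnover
    FROM orderdetails d
    JOIN orders o ON o.order_no = d.order_no
    JOIN customers c ON c.customer_no = o.customer_no
    GROUP BY c.customer_no, c.customer_name
    ORDER BY c.customer_no
""").fetchall()

print(per_order)
print(per_customer)
```

Each extra level costs one more join at query time, which is one side of the trade-off the question above asks about.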
Snowflake schema with branches:
A snowflake schema may have branches in the dimension hierarchies:

Fact table Orderdetails: Product#, Order#, Qty
Dimension hierarchy:
  Orders: Order#, Customer#, Date
  Customers: Customer#, Customer-name
Snowflake hierarchy:
  Salesmen: Salesman#, Salesman-name, Branch-office#
  Branch offices: Branch-office#, Region#
  Regions: Region#, Region-name
Dimension hierarchy:
  Products: Product#, Product-name, Price, Group#
  Product groups: Group#, Group-name, Department#
  Departments: Department#, Department-name

Are Customers related to the regions?

The aggregation level is given by the arguments of the GROUP BY clause.

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Orders: Order#, Ordertype
Dimension Products: Product#, Product-name, Price
Dimension Salesmen: Salesman#, Salesman-name, Branch-Office#
Dimension Time: Date#, Date-Name

Salesman#   Productname   Turnover   Branch-office#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       Screw         20,000     SF
Jones       Nut           40,000     SF
...

Figure labels: Aggregated data, Non-aggregated data.
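The statement that the GROUP BY arguments define the aggregation level can be sketched as follows. This is a minimal sketch in Python with the standard sqlite3 module; the flat table sales, its column names and the helper aggregate are hypothetical. Only the five rows visible in the table above are loaded, so the totals cover those rows only.

```python
import sqlite3

# The five rows shown in the table above (Salesman#, Productname, Turnover, Branch-office#).
ROWS = [
    ("Smith", "Screw", 10000, "LA"),
    ("Smith", "Bolt", 30000, "LA"),
    ("Smith", "Nut", 60000, "LA"),
    ("Jones", "Screw", 20000, "SF"),
    ("Jones", "Nut", 40000, "SF"),
]
ALLOWED = ("salesman", "productname", "branch_office")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (salesman TEXT, productname TEXT, turnover INTEGER, branch_office TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)

def aggregate(con, levels):
    """Sum turnover at the aggregation level named by `levels` (GROUP BY columns)."""
    for col in levels:
        if col not in ALLOWED:
            raise ValueError(f"unknown level: {col}")
    if not levels:
        # No GROUP BY arguments at all: everything collapses to the top level.
        return con.execute("SELECT SUM(turnover) FROM sales").fetchall()
    cols = ", ".join(levels)
    sql = f"SELECT {cols}, SUM(turnover) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return con.execute(sql).fetchall()

print(aggregate(con, ["salesman", "productname"]))  # most detailed level
print(aggregate(con, ["salesman"]))                 # one argument removed
print(aggregate(con, ["branch_office"]))            # a different level
print(aggregate(con, []))                           # top level
```

Adding a column to the list drills down and removing one rolls up; the query text is otherwise unchanged.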
Drilling in dimension hierarchies:

Figure: the snowflake schema with fact table Orderdetails (Product#, Order#, Qty), the dimension hierarchy Orders and Customers, the snowflake hierarchy Salesmen and Branch offices, and the dimension hierarchy Products, Product groups and Departments.

Salesman level:
Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF

Branch-office level:
Branch-office#   Turnover
LA               400,000
SF               200,000

Drilling between dimension hierarchies:

Figure: the snowflake schema with fact table Orderdetails (Product#, Order#, Qty), the dimension hierarchy Orders and Customers, the dimension Products, and the snowflake hierarchy Salesmen and Branch offices.

Salesman and product level:
Salesman#   Productname   Turnover   Branch-office#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       Screw         20,000     SF
Jones       Nut           40,000     SF
...

Salesman level:
Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF

Roll up to the top level:
Roll-up can be executed by removing one or more arguments from the GROUP BY clause. Starting from the salesman and product level above:

Roll up to the product level:
Productname   Turnover
Screw         100,000
Bolt          200,000
Nut           300,000

Roll up to the top level:
Turnover
600,000

Non-linear dimensions, e.g. the Date dimension:

Levels in the figure: Fiscal Year, Calendar Year, Fiscal Quarter, Calendar Quarter, Fiscal Month, Calendar Month, Fiscal Week, Calendar Week, Type of Day, Day of Week, Day.

• The granularity is day.
• Many different hierarchies.
• Two major problems:
  – Calendar weeks do not aggregate to years.
  – Type of Day distinguishes between working days and holidays. However, holidays (e.g. Easter) are independent of the other dimensions.

What aggregation level would you use to calculate the average sale on non-holiday Mondays per month?

The time dimension:

Levels in the figure: Day Part, AM/PM Flag, Hour, Minute.

• The granularity is minute.
• The top level is a whole day.

Why do you think Kimball recommends separating the date and time dimensions?

Degenerate dimension = a dimension that is not created because nobody wants to aggregate data to the degenerate level.
Example: The Order dimension should be deleted, while the Time and Customer attributes should be created as new dimensions to which it is meaningful to aggregate data.

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Orders: Order#, Time, Customer#
Products: Product#, Product-name, Price
Salesmen: Salesman#, Salesman-name

Exercise:
The figure illustrates an ER diagram of a car rental company like Hertz or Avis.

Entities in the figure: Customers, Branch offices, Orders, Contracts, Pick up, Reservations, Car return, Cars, Car types, Garage services, Garages.

Design a snowflake schema, star schema or galaxy for the car rental company!

Major problems in data warehouse design:
• Drilling in many-to-many relationships and tree structures.
• Inconsistencies caused by "slowly changing dimensions".

Slowly Changing Dimensions (SCD):
If the attributes of a dimension are dynamic (i.e. they may be updated), we say that they are slowly changing.
May the Branch-size of a Branch-office change after e.g. a renovation?
May the Branch-name of a Branch-office change?

Fact table Bank accounts: Account#, Interest-last-year, Cost-last-year, Branch#
Dimension Branch-offices: Branch#, Branch-name, Branch-size

Exercise in SCD:
Suppose the attribute Branch-size is dynamic and aggregations are made to the levels (Branch-size, Year) or (Branch-size, Month). Does this aggregation make sense, and how would you solve possible problems?
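One way to see the problem raised in this exercise is to compare two treatments of a changed Branch-size: overwriting the dimension record, and keeping a new version of the record for each change. The following is a minimal sketch in Python with the standard sqlite3 module and is not meant as the model answer. The year column on the fact table, the surrogate key branch_key, the size values and all sample rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Overwrite: one record per branch, holding only the current size.
CREATE TABLE branch_overwrite (branch_no INTEGER PRIMARY KEY, branch_size TEXT);
-- Versioning: one record per branch version, identified by a surrogate key.
CREATE TABLE branch_version (branch_key INTEGER PRIMARY KEY, branch_no INTEGER, branch_size TEXT);
-- Fact rows carry both keys so the two treatments can be compared side by side.
CREATE TABLE account_facts (account_no INTEGER, year INTEGER, interest REAL,
                            branch_no INTEGER, branch_key INTEGER);
-- Hypothetical history: branch 1 was 'small' in 2022 and 'large' after a renovation.
INSERT INTO branch_overwrite VALUES (1, 'large');
INSERT INTO branch_version VALUES (1, 1, 'small'), (2, 1, 'large');
INSERT INTO account_facts VALUES (500, 2022, 100.0, 1, 1), (500, 2023, 120.0, 1, 2);
""")

QUERY = """
SELECT b.branch_size, f.year, SUM(f.interest)
FROM account_facts f JOIN {dim} b ON b.{key} = f.{key}
GROUP BY b.branch_size, f.year
ORDER BY f.year
"""

# Overwritten dimension: the 2022 interest is reported under the current size.
overwritten = con.execute(QUERY.format(dim="branch_overwrite", key="branch_no")).fetchall()
# Versioned dimension: each fact row keeps the size that was valid at the time.
versioned = con.execute(QUERY.format(dim="branch_version", key="branch_key")).fetchall()

print(overwritten)
print(versioned)
```

With the overwritten record, the 2022 figures move to the 'large' group even though the branch was small that year; the versioned record keeps them under 'small' at the cost of storing one extra dimension row per change.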
Exercise:

Figure: the ER diagram of the car rental company (Customers, Branch offices, Orders, Contracts, Pick up, Reservations, Car return, Cars, Car types, Garage services, Garages).

Is the region of the customer a dynamic attribute of the customer?
Does it make sense to aggregate the rental revenue to the region of the customers?

It is possible to cheat the application generator. That is, special, very complicated data structures may function as many-to-many or network relationships even though they are dealt with as one-to-many relationships.

How would you recommend designing a data warehouse where it is possible to aggregate Sale to the Stock locations used for the sale?

Product: Product#, ProductName, Price
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Product-Stock: Product#, Location#, Qty
Order: Order#, OrderDate, Balance, State
Location: Location#, Address
Customer: Customer#, Kredit-Limit, Balance
UserSession: Session#, IPaddress, #Click, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount

Exercise:
Design a data warehouse for a travel agency.

Labels in the figure: Customers, Buyer, Orders, Traveler, Bookings, Reservations, Flight routes/Room types/Car types/service types, Departures/Hotel rooms/Car rentals/etc., Product owners.

Design a data warehouse (or galaxy) for an ERP system with as many meaningful dimensions as possible:

The sales module: Customers, Orders, Orderlines, Stocks per product per location, Products.
The account module offers services to the other ERP modules: Account items, Accounts.

End of session. Thank you!

Evaluation criteria per response type:

Response 1, where dimension records are overwritten:
• Is historical information preserved? No.
• Aggregation performance: In the evaluation, we define this solution to have average performance.
• Storage consumption: Only the current dimension record version is stored. No redundant data is stored.

Response 2, where new versions are created:
• Is historical information preserved? Yes.
• Aggregation performance: Version records make performance slower in proportion to the number of changes.
• Storage consumption: All old versions of dimension records are stored, often with redundant attributes.

Response 3, where only one historical version is saved:
• Is historical information preserved? The current version and a single history-destroying version are saved.
• Aggregation performance: No performance degradation occurs if either the current or the historical version is used in a query.
• Storage consumption: Normally, only a single extra attribute version is stored.

Response 4, which uses the top of a dynamic dimension hierarchy as a new static dimension:
• Is historical information preserved? Yes.
• Aggregation performance: Better or worse depending on whether both dimension tables are used in a query.
• Storage consumption: The relatively large fact table must have an extra foreign key attribute.

Response 5, with dimension data as fact data:
• Is historical information preserved? Yes.
• Aggregation performance: Better or worse depending on whether the new fact data are used in a query.
• Storage consumption: The relatively large fact table must have an extra attribute for each dynamic dimension attribute.

Response 6, which uses fine granularity in combination with response 1 or 3:
• Is historical information preserved? The finer the granularity, the more historical state information is preserved.
• Aggregation performance: The finer the granularity, the slower the performance.
• Storage consumption: The finer the granularity, the more storage consumption.

Response 7, which stores dynamic dimension data as static facts in another data mart:
• Is historical information preserved? Yes.
• Aggregation performance: Better or worse depending on whether both fact tables are used in a drill-across query.
• Storage consumption: This is the most storage-consuming solution, as at least a new fact and a foreign key are stored in the new fact table.

Where do the responses to SCDs store historic information?
• Response 1 does not store historic information.
• Response 2 stores historic information in a new record version.
• Response 3 stores one historic value in a new dimension attribute.
• Response 4 stores historic information in a new dimension relationship.
• Response 5 stores historic information in a new fact attribute.
• Response 6 can sometimes diminish the aggregation error of response 1 as finer granularity