Contents of this slideshow: • What is a datawarehouse? • Multi-dimensional data modelling • Data warehouse architecture The hidden slides of this slideshow may be important. However, I will focus on leaning by exercises and therefore, rattling off new concepts are often done in hidden slides. OLTP versus OLAP OLTP = On Line Transaction Processing OLAP = On Line Analytical Processing OLTP OLAP users clerk, IT professional knowledge worker/decision makers function day to day operations decision support DB design application-oriented data current, up-to-date detailed, flat relational isolated data. repetitive subject-oriented (Business functions) historical, summarized, multidimensional integrated, consolidated data. ad-hoc usage lots of scans unit of work read/write index/hash on prim. key short, simple transaction # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB access complex query An example of a Datawarehouse: A star shema datawarehouse has a central table (the Fact table) surrouded by dimension tables with on-to-many relationships towards the fact table. Dimension Orders - Order# - Ordertype Dimension Products - Product# - Product-name - Price The fixed data base structure implies that application programs (drilling functions/aggregates) can be generated automatically! Fact table Dimension Orderdetails - Product# - Order# - Qty - Date# - Salesman# Salesmen - Salesman# - Salesman-name Dimension Time - Date# - Date-Name Dimension hierarchies: A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table: Fact table Orderdetails - Product# - Order# - Qty - Price Dimension hierarchy Orders - Order# - Customer# - Date Customers - Customer# - Customer-name In a dimension hierarchiy it is possible to aggregate data from the fact table to the different levels of the hierachy. Roll-up = aggregate along one or more dimensions. Drill-down = “de-aggregate” = break an aggregate into its constituents. Two different types of drilling: -Drilling in dimension hierarchies -Drilling between dimensions. Dimension Orders - Order# - Ordertype Dimension Products - Product# - Product-name - Price Fact table Dimension Orderdetails - Product# - Order# - Qty - Date# - Salesman# Salesmen - Salesman# - Salesman-name Dimension Time - Date# - Date-Name Which star schemas or data marts can be build by using the illustrated integrated E-commerce/ ERP data model? Which star schema would you recommend to be implemented first? Product Product# ProductName Price Order-Detail Product# Order# Qty Price Timestamp Order-DetailHistory Inv-Item# Order# Seq# State Timestamp InvoyceHistory Invoice# Timestamp State Notes Product-Stock Product# Location# Qty Order Order# OrderDate Balance State Shipping Shipping# ShipMethod ShipCharge State ShipDate Address Address# Name Add1 Add2 City State Zip Shipping Invoice Invoice# CreationDate Location Location# Address Customer Customer# Kredit-Limit Balance UserSession Session# IPaddress #Click Timestamp UserAccount Salesman# PassWord Timestamp #visits #trans Ttl-tr-amount Payment Payment# Ammount State Timestamp Billing CreditCard Card# HolderName ExpireDate A galaxy is a set of star fact tables with conformed (fælles tilpassede) dimensions: Time dimension hierarchy Year - yy Month - yy - mm Day Fact table The value chain SaleOrderdetails - Product# - Sale-order# - Qty - Discount - Sale-price - Date# - yy - mm - dd Fact table - Date# Storage-per- Qty product Fact table Purchaseorderdetails - Product# - Date# - End-of-daystorage-qty - Product# - Purchase-order# - Purchase-price - Qty - Date# Products - Product# - Product-name Product groups - Product-group# - Product-group-name Product dimension hierarchy Conceptual Modeling of Data Warehouses – Star schema: A fact table in the middle connected to a set of dimension tables – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake – Galaxy schema: Multiple fact tables share dimension tables (Conformed dimensions), viewed as a collection of stars, therefore called galaxy schema or fact constellation The aggregating level is the argument to the GROUP BY statement: SELECT Product#, SUM(Qty*Price) AS Turnover FROM Orderdetails JOIN Products GROUP BY Product# Dimension Orders - Order# - Ordertype Dimension Products - Product# - Product-name - Price Fact table Dimension Orderdetails - Product# - Order# - Qty - Date# - Salesman# Salesmen - Salesman# - Salesman-name Dimension Time - Date# - Date-Name Drill down to the Product per Salesman level: SELECT Product#, Salesman#, SUM(Qty*Price) AS Turnover FROM Orderdetails JOIN Products JOIN Salesmen GROUP BY Product#, Salesman#; Dimension Orders - Order# - Ordertype Dimension Products - Product# - Product-name - Price Where should the Price be stored? Fact table Dimension Orderdetails - Product# - Order# - Qty - Date# - Salesman# Salesmen - Salesman# - Salesman-name Dimension Time - Date# - Date-Name Snowflake schema with branches: A Snowflake schema may have branches in the dimension hierarchies: Fact table Dimension hierarchy Orderdetails Orders - Product# - Order# - Qty - Order# - Customer# - Date Products - Customers Product# Product-name Price Group# - Customer# - Customer-name Snowflake hierarchy Salesmen Branch offices - Salesman# - Salesman-name – Branch-office# - Branch-office# - Branch-office# - Region# Product groups - Group# - Group-name - Department# Departments - Department# - Department-name Dimension hierarchy Regions - Region# - Region-name Are Customers related to the Regions? Drilling in dimension hierarchies: Dimension hierarchy Fact table - Product# - Order# - Qty Customers Orders Orderdetails - Customer# - Customer-name - Order# - Customer# - Date Snowflake hierarchy Products - Product# Product-name Price Group# Salesmen Branch offices - Salesman# - Salesman-name – Branch-office# - Branch-office# - Branch-office# - Region# Product groups - Group# - Group-name - Department# Departments Dimension hierarchy Salesman# Turnover Branch-office# Branch-office# Turnover Smith 100,000 LA LA 400,000 Jones 300,000 LA SF 200,000 Adams 200,000 SF Drilling between dimension hierarchies: Dimension hierarchy Fact table Customers Orders Orderdetails - Product# - Order# - Qty - Customer# - Customer-name - Order# - Customer# - Date Snowflake hierarchy Products Salesmen Branch offices - Salesman# - Salesman-name – Branch-office# - Branch-office# - Branch-office# - Region# Sales man# Productname Turnover Branchoffice# Salesman# Turnover Branchoffice# Smith Screw 10,000 LA Smith 100,000 LA Smith Bolt 30,000 LA Jones 300,000 LA Smith Nut 60,000 LA Adams 200,000 SF Jones Screw 20,000 SF Jones Nut 40,000 SF ... Roll up to the top level: Sales man# Productname Turnover Branchoffice# Smith Screw 10,000 LA Smith Bolt 30,000 LA Smith Nut 60,000 LA Jones Screw 20,000 SF Jones Nut 40,000 SF Roll up can be executed by removing one or more argument to the GROUP BY statement. ... Productname Turnover Screw 100.000 Bolt 200.000 Nut 300,000 Top level Roll up to the product level. Turnover 600.000 Roll up to the top level. Dimension Orders The aggregation level is the argument to the GROUP BY statement. - Order# - Ordertype Dimension Fact table Products Dimension Orderdetails - Product# - Product-name - Price - Product# - Order# - Qty - Date# - Salesman# Salesmen - Salesman# - Salesman-name - Branch-Office# Dimension Time - Date# - Date-Name Salesman# Productname Turnover Branch-office# Smith Screw 10,000 LA Smith Bolt 30,000 LA Smith Nut 60,000 LA Jones Screw 20,000 SF Jones Nut 40,000 SF Aggregated data Non-aggregated data ... x1 x2 … xn Dimension hierarchies: A dimension hierarchi is a set of tables connected by one-to-many relationships towards the fact table: Fact table Orderdetails - Product# - Order# - Qty - Price Dimension hierarchy Orders - Order# - Customer# - Date Customers - Customer# - Customer-name A Snowflake schema may in contrast to star schemas have dimension hierarchies. Describe the advantage/disadvantage of using dimension hierarchies or Snowflake schema? Exercise: Customers Branch offices Orders Contracts Pick up Reservations The figure illustrates an ER-diagram of a car rental company like Hertz or Avis. Car return Question 1. Design a star schema or Galaxy for the car rental company. Question 2. Is there advantages by storing suppliers as customers in e.g. an ecommerce data warehouse? Cars Car types Garage services Garages Contents of this slideshow: • What is a datawarehouse? • Multi-dimensional data modelling • Data warehouse architecture Data Models – Relational models/ER-diagram used for OLTP databases – Stars, snowflakes and galaxies used for OLAP databases – Cubes used for OLAP databases Et star schema DW can be illustrated as a multidimensinal cube: Describe advantages/disadvantages of storing data in a cube in memory? OLAP Cube operations: OLAP operations: Roll Up = Aggregatin to a higer level. For example from month to year) Drill Down = recalculation with more details Slice = Selecting a subset by using a fixed dimension value. Drill Across = Join of fact data across conformed dimensions Drill Through = Accessing related data from a OLTP system. Aggregating Pivoting = See next slide! Pivoting = Transforming SQL query output to user friendly two dimensional screen layout Fact table view: sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 Multi-dimensional cube: date 1 1 1 1 2 2 amt 12 11 50 8 44 4 day 2 day 1 p1 p2 c1 p1 12 p2 11 p1 p2 c1 56 11 c1 44 c2 4 c2 c3 c3 50 8 c2 4 8 c3 50 OLAP Server Architectures • Relational OLAP (ROLAP) – Use relational or extended-relational DBMS to store and manage warehouse data – Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services – greater scalability • Multidimensional OLAP (MOLAP) – Array-based multidimensional storage engine (sparse matrix techniques) – fast indexing to pre-computed summarized data • Hybrid OLAP (HOLAP) – Storage flexibility with mix of ROLAP and MOLAP • POLAP personel HOLAP Contents of this slideshow: • What is a datawarehouse? • Multi-dimensional data modeling • Data warehouse design/implementation architectures 1. Kimball has a bottom-up architecture 2. Inmon has a top-down architecture 3. Data Vault architecture is normalized tables extended with historic data tables. That is, the Data Vault can be used to generate any data mart when needed. Kimball’s Bottom-Up DW architecture: Kimball’s architecture uses conformed dimensions and conformed facts. Conformed dimensions makes it possible to drill across from one data mart to another to present data from different marts in the same view. Only the conformed data have top-down design. Kimball’s Data Warehousing Architecture Query side ETL side Metadata Data sources Presentation servers - Extract - Transform - Load Data Staging Area Query Services -Warehouse Browsing -Access and Security Data marts with aggregate-only data -Query Management - Standard Reporting Conformed -Activity Monitor Data Warehouse Bus Reporting Tools Desktop Data Access Tools Data mining dimensions and facts Data marts with atomic data Surrogate key (Surrogatnøgle) = A sequense number used as primary key. William Inmon’s DATA WAREHOUSE architecture from 1990 has top-down design without conformed data. and: EDS = Enterprise Data Warehouse. Department datawarehouses The DSA (Data Staging Area) where transformation takes place is not illustrated. The DATA VAULT architecture from 2002-2005 has full top-down design and buttom up implementation: Normalized Data Vault with historic data In the Data Vault database with historic information only the Extract activity has taken place. Therefore, the Data Vault architecture is not drowned in the design phase. Classical Data warehousing 1 Source OLTP Extraction Delta Detection 2 DSA Cleansing 3 EDW Transformation Business Rules DSA = Data Staging Area EDS = Enterprise Data Warehouse DM Filter Aggregate Error handling Classical Data warehousing 1 Source 2 DSA 3 EDW DM OLTP Extraction Delta Detection Cleansing Transformation Business Rules Filter Aggregate Error handling HANA from SAP is an In memory Data Warehouse product 1 Source OLTP Delta Cleansing Detection Transformation DSA Business Rules Error handling Aggregate Filter Extraction Classical Data warehousing 1 Source Extraction 2 DSA Delta Detection Cleansing 3 EDW Transformation Business Rules DM Filter Aggregate Error handling In memory Data warehousing 1 OLTP Source Delta Cleansing Detection Transformation DSA Business Rules How can OLTP and OLAP be integrated in a common In Memory database? Error handling Aggregate Filter Extraction Patients_____ Patient ID Name Address Patient admits ... Sympthoms and test results Exercise: Transform the OLTP Health records database to a Star schema DW for a Hospital. Health record subtypes Diagnoses/ diseases Prescriptions Patient discharges Treatments Prescription lines Basic Health records are above the dottet line Conseptual hospital entites in general are below the dottet line ERdiagram for a hospital. Patient admit type Employees ... Sympthom types Disease types Treatment types Medicin types Patient discharges type Airline Exercise: companies Design an Airline DW. Airports Flight routes Subroutes Departures Tickets Customers Travel arrangement Exercise: Design a Hotel DW. Hotel chains Hotels Rooms Room reservations Services/ tours/ car rentals Customer groups Check-in periods Customers Exercise. Design a datawarehouse for a travel agency. Customers Buyer Orders Traveler Bookings Reservations Flight routes/ Room types/ Car types/ service types Departures/ Hotel rooms/ Car rentals/ etc. Product owners End of session Thank you !!! Inmon versus Kimball’s DW definitions: Kimball and Inmon agree in that OLAP datawarehouses do not use the OLTP databases. However, what is the difference in the architectures? Why do you think Kimball’s DW architecture is used most in practice? Dates may be stored in different formats. As an example the First purchase date may be stored as a FK to a hierachical time dimension and Birth date as a SQL time stamp. Why is different Date formats used in the Customer table? OLAP • OLAP = On-Line Analytical Processing – Interaktiv analyse – Eksplorativ opdagelse – Kræver hurtige svartider • Data kan vises som multidimensionelle terninger – Terninger/kuber kan have et vilkårligt antal dimensioner – Dimensioner har hierarkier, f.eks. dag-måned-år • OLAP operationer – Aggregering = Sammentælling af data, f.eks. med SUM, AVG, COUNT… – Startniveau, (Kvartal, Produkt) – Roll Up: mindre detalje, Kvartal->År – Drill Down : mere detalje, Kvartal->Måned – Slice: Projektering/selektering, År=1999 – Drill Across: “join” på fælles dimensioner – Drill Through: Opsøgning af kildedataene i de operative systemer – Pivoting The Business Dimensional Lifecycle = Kimball’s activity model for DATAWAREHOUSE devellopment has three parallel tracks: Design af teknisk arkitektur Projekt planlægning Specifikation af krav Dimensionel modellering Valg af produkt og installation Fysisk design Specifikation af applikationer ETL: design og udvikling Udvikling af applikationer Projektledelse Ibrugtagning Vedligehold og vækst The Data Warehouse Bus Architecture = Arkitektur for design af en række data marts som tilsammen udgør virksomhedens data warehouse med fælles conformed dimensions og conformed facts. Data marts = afdelings data warehouse. Kimball bruger ordet mere generelt om en enkelt multidimensional database. Conformed dimensions = Fælles dimensioner, som er tilpasset kravere fra flere data marts. Stovepipe (kakkelovnsrør) = Skældsord for et data warehouse uden conformed dimensions. Kimball’s datawarehouse concepts: Query side ETL side Metadata Data sources Presentation servers -Warehouse Browsing -Access and Security Data marts with aggregate-only data -Query Management - Standard Reporting Conformed -Activity Monitor Data - Extract - Transform - Load Data Staging Area Query Services Warehouse Bus dimensions and facts Reporting Tools Desktop Data Access Tools Data mining Operationel systems Data marts with atomic data Data Service Element Inmonn does not use the conformed facts and dimension table concepts! In the DATA VAULT Architecture the data marts are loaded from a normalized database with historic information. Existing databases and systems (OLTP) New databases and systems (OLAP) Appl. DM DB OLAP Appl. DB Appl. DB DM ETL Data Vault … Appl. DB Appl. DB Data mining DM Visualization In the future the DATA VAULT may be the only database and stored In-Memory. Appl. DM DB OLAP Appl. DB Appl. DB DM ETL Data Vault … Appl. DB Data mining DM Appl. DB SAP has already developed an In-Memory OLAP database called HANA Visualization