ISQS 3358, Business Intelligence Dimensional Modeling Zhangxi Lin Texas Tech University 1 1 Outline Data Warehousing Approaches Dimensional Modeling Data Warehousing with Microsoft SQL Server 2005 Case: Adventure Works Cycles (AWC) : Data Warehouse Design Phases 2 Data Warehousing Approaches 3 Data Warehouse Development Approaches Data warehouse development approaches ◦ ◦ Inmon Model: EDW approach Kimball Model: Data mart approach Which model is better? ◦ ◦ There is no one-size-fits-all strategy to data warehousing One alternative is the hosted warehouse 4 General Data Warehouse Development Approaches “Big bang” approach Incremental approach: ◦ Top-down incremental approach ◦ Bottom-up incremental approach ISQS 6339, Data Mgmt & BI, Zhangxi Lin 5 “Big Bang” Approach Analyze enterprise requirements Build enterprise data warehouse Report in subsets or store in data marts ISQS 6339, Data Mgmt & BI, Zhangxi Lin 6 Incremental Approach to Warehouse Development Multiple iterations Shorter implementations Validation of each phase Increment 1 Strategy Definition Analysis Design Iterative Build Production ISQS 6339, Data Mgmt & BI, Zhangxi Lin 7 Top-Down Approach Analyze requirements at the enterprise level Develop conceptual information model Identify and prioritize subject areas Complete a model of selected subject area Map to available data Perform a source system analysis Implement base technical architecture Establish metadata, extraction, and load processes for the initial subject area Create and populate the initial subject area data mart within the overall warehouse framework ISQS 6339, Data Mgmt & BI, Zhangxi Lin 8 Bottom-Up Approach Define the scope and coverage of the data warehouse and analyze the source systems within this scope Define the initial increment based on the political pressure, assumed business benefit and data volume Implement base technical architecture and establish metadata, extraction, and load processes as required by increment Create and populate the initial subject areas within the overall warehouse framework ISQS 6339, Data Mgmt & BI, Zhangxi Lin 9 Dimensional Modeling 10 Dimensional Model Also called star schema ◦ Fact table is in the middle and dimensions serving as the points on the star. ◦ A normalized fact table plus denormalized dimension tables Facts ◦ Measurements associated with a specific business process. ◦ Most facts are additive (calculative); others are semi-additive, non-additive, or descriptive (e.g. factless fact table). ◦ Many facts can be derived from other facts. So, non-additive facts can be avoided by calculating it from additive facts. Grain ◦ The level of detail contained in the fact table ◦ The lowest level of detail is called atomic fact table 11 11 Dimensions The foundation of the dimensional model to describe the objects of the business The nouns of the DW/BI system ◦ Business processes (facts) are the verbs of the business Dimension tables link to all the business processes. A dimension shared across all processes is called conformed dimension The analysis involving data from more than one business process is called drill-across. 12 12 Data Cube Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another. Data cubes aren't restricted to just three dimensions. Most OLAP systems can build data cubes with many more dimensions allows up to 64 dimensions. In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions. 13 13 Determining Granularity YEAR? QUARTER? MONTH? WEEK? DAY? 14 Star Schema Model Product Table Store Table Product_id Product_disc,... Store_id District_id,... Sales Fact Table Central fact table Product_id Store_id Item_id Day_id Sales_amount Sales_units, ... Denormalized dimensions Time Table Day_id Month_id Year_id,... Item Table Item_id Item_desc,... 15 15 Snowflake Schema Model Product Table Product_id Product_desc Store Table Store_id Store_desc District_id District Table District_id District_desc Sales Fact Table Item_id Store_id Product_id Week_id Sales_amount Sales_units Time Table Item Table Dept Table Mgr Table Week_id Period_id Year_id Item_id Item_desc Dept_id Dept_id Dept_desc Mgr_id Dept_id Mgr_id Mgr_name 16 16 Snowflake Schema Model ◦ ◦ ◦ ◦ ◦ ◦ Country Direct use by some tools More flexible to change Provides for speedier data loading Can become large and unmanageable Degrades query performance More complex metadata State County City 17 17 Dimensional Modeling Process High level dimensional model design ◦ ◦ ◦ ◦ Choosing business model Declaring the grain Choosing dimensions Identifying the facts Detailed dimensional model development Dimensional model review and validation ◦ IS ◦ Core users ◦ Business community Final design iteration ISQS 6339, Data Mgmt & BI, Zhangxi Lin 18 Example: Commrex Real Estate Data Warehousing Analytic themes ◦ How to encourage realtors to use the online ASP services Value Chain ◦ Listors create their account ◦ Listors post their real estate properties to the web-based database services and pay listing fees ◦ Property buyers search the website-based database and buy properties from listors. This is the incentive for listors to use the ASP services Business Processes ◦ ◦ ◦ ◦ ◦ Listor sign up Listor account management Property data posting Property search Property database maintenance 19 19 IMW’s Database ERD Model Property Listing Database Membership Database Property ID M:1 Listor ID M:M Property Type Listor ID Listor Name Property Type Type Name Address Company ID Subtype 1 City Subtype 2 Chapter Chapter Subtype n Feature Legends Primary Key M:M Functions Specializations Company ID Comp Name Address Secondary Key Telephone # Link to a table ISQS 6339, Data Mgmt & BI, Zhangxi Lin 20 IMW’s Data Warehouse Dimensional Model Property Listing Fact Property SubType Dimension Membership Dimension Property ID Listor ID Listor ID Prop SubType Listor Name Prop SubType SubType Name Address Company ID Property Type City Chapter Property Type Type Name Property Type Dimension Chapter Functions Feature Specializations Company ID Legends Primary Key Secondary Key Comp Name Company Dimension Address Telephone # Link to a table ISQS 6339, Data Mgmt & BI, Zhangxi Lin 21 Data Warehousing with Microsoft SQL Server 2005 ISQS 6339, Data Mgmt & BI, Zhangxi Lin 22 Unified Dimensional Model (UDM) A SQL Server 2005 technology A UDM is a structure that sits over the top of a data mart and looks exactly like an OLAP system to the end user. Advantages ◦ ◦ ◦ ◦ No need for a data mart. Can be built over one or more OLTP systems. Mixed data mart and OLTP system data Can include data from database from other vendors and XMLformatted data ◦ Allows OLAP cubes to be built directly on top of transactional data ◦ Low latency ◦ Ease of creation and maintenance ISQS 6339, Data Mgmt & BI, Zhangxi Lin 23 Microsoft BI Toolset Relational engine (RDBMS) ◦ T-SQL ◦ .NET Framework Command Language Runtime (CLR) SQL Server Integration Services (SSIS) – ETL ◦ Data Transformation Pipeline (DTP) ◦ Data Transformation Runtime (DTR) SQL Server Analysis Service (SSAS) – queries, ad hoc use, OLAP, data mining ◦ Multi-Dimensional eXpressions (MDX) – a scripting language for data retrieval from dimensional database ◦ Dimension design ◦ Cube design ◦ Data mining SQL Server Reporting Services (SSRS) – ad hoc query, report building Microsoft Visual Studio .NET is the fundamental tool for application development 24 Structure and Components of Business Intelligence MS SQL Server 2005 SSMS SSIS SSAS BIDS SSRS SAS EG SAS EM 25 Understanding the Cube Designer Tabs Cube Structure: Use this tab to modify the architecture of a cube. Dimension Usage: Use this tab to define the relationships between dimensions and measure groups, and the granularity of each dimension within each measure group. Calculations: Use this tab to examine calculations that are defined for the cube, to define new calculations for the whole cube or for a subcube, to reorder existing calculations, and to debug calculations step by step by using breakpoints. KPIs: Use this tab to create, edit, and modify the Key Performance Indicators (KPIs) in a cube. Actions: Use this tab to create or modify drillthrough, reporting, and other actions for the selected cube.. Partitions: Use this tab to create and manage the partitions for a cube. Partitions let you store sections of a cube in different locations with different properties, such as aggregation definitions. Perspectives: Use this tab to create and manage the perspectives in a cube. A perspective is a defined subset of a cube, and is used to reduce the perceived complexity of a cube to the business user. Translations: Use this tab to create and manage translated names for cube objects, such as month or product names. Browser: Use this tab to view data in the cube. ISQS 6339, Data Mgmt & BI, Zhangxi Lin 26 Case: Adventure Works Cycles (AWC) 27 Case: Adventure Works Cycles (AWC) A fictitious multinational manufacturer and seller of bicycles and accessories Based on Bothell, Washington, USA and has regional sales offices in several countries http://www.msftdwtoolkit.com/ ISQS 6339, Data Mgmt & BI, Zhangxi Lin 28 Basic Business Information Product orders by category Product Orders by Country/Region Product Orders by Sales Channel Customers by Sales Channel Snapshot ISQS 6339, Data Mgmt & BI, Zhangxi Lin 29 AWC Business Requirements Interview summary Interviewee: Brian Welker, VP of Sales Sales to resellers: $37 million last year 17 people report to him including 3 regional sales managers Previous problem: Hard to get information out of the company’s system Major analytic areas: Sales planning Growth analysis Customer analysis Territory analysis Sales performance Basic sales reporting Price lists Special offers Customer satisfaction International support Success criteria Easy data access, Flexible reporting and analyzing, All data in one place What’s missing? – A lot – No indication of business value ISQS 6339, Data Mgmt & BI, Zhangxi Lin 30 Business Processes Purchase Orders Distribution Center Deliveries Distribution Center Inventory Store Deliveries Store Inventory Store Sales ISQS 6339, Data Mgmt & BI, Zhangxi Lin 31 Analytic Themes See the Excel file AW_Analytic_Themes_List.xls ISQS 6339, Data Mgmt & BI, Zhangxi Lin 32 AWC’s Bus Matrix Dimensions Customer (Reseller) Customer (Internet) Sales Territory Currency X X X X X Orders X X X X X X X Call Tracking X X X X X X Returns X X X X X X X Facility Employee X Call Reason Product X Promotion Date Sales Forecasting Channel Business Process X X X X ISQS 6339, Data Mgmt & BI, Zhangxi Lin 33 Prioritization Grid High Customer Profitability Product Profitability Orders Orders Forecast Business Value / Impact Call Tracking Manufacturing Costs Low Low Feasibility Exchange Rates Returns High ISQS 6339, Data Mgmt & BI, Zhangxi Lin 34 Exercise 2 – A quick walk through an SSAS application Learning Objectives ◦ How to design a data source view with SSAS based on an existing data warehouse ◦ How to design and deploy a cube. Tasks ◦ Analysis Service Tutorial Lesson 1: Defining a Data Source View within an Analysis Services Project ◦ Analysis Service Tutorial Lesson 2: Defining and Deploying a Cube Deliverable: ◦ A Word file with the screenshot of the star schema emailed to zhangxi.lin@hotmail.com ◦ The subject of the email is: “ISQS 3358 Exercise 2” ISQS 6339, Data Mgmt & BI, Zhangxi Lin 35 Supplemental Slides : Data Warehouse Design Phases 36 Data Warehouse Database Design Phases Phase 1: Defining the business model Phase 2: Defining the dimensional model Phase 3: Defining the physical model 37 37 Phase 1: Defining the Business Model ◦ Performing strategic analysis ◦ Creating the business model ◦ Documenting metadata 38 38 Performing Strategic Analysis Identify crucial business processes Understand business processes Prioritize and select the business processes to implement High Business Benefit Low Low Feasibility High 39 39 Creating the Business Model Defining business requirements: ◦ Identifying the business measures ◦ Identifying the dimensions ◦ Identifying the grain ◦ Identifying the business definitions and rules Verifying data sources 40 40 Business Requirements Drive the Design Process ◦ Primary input Business Requirements ◦ Secondary input Existing Metadata Production ERD Model Research 41 41 Identifying Measures and Dimensions Measures The attribute varies continuously: ◦ ◦ ◦ ◦ attribute is perceived as constant or discrete: Balance Units Sold Cost Sales The ◦ ◦ ◦ ◦ Product Location Time Size Dimensions 42 42 Using a Business Process Matrix Business Processes Business Sales Dimension s Returns Inventor y Customer Date Product Channel Promotion Sample of business process matrix 43 43 Determining Granularity YEAR? QUARTER? MONTH? WEEK? DAY? 44 44 Identifying Business Rules Location Geographic proximity 0 - 1 miles 1 - 5 miles > 5 miles Time Month > Quarter > Year Product Type Monitor Status PC Server 15 inch 17 inch 19 inch None New Rebuilt Custom Store Store > District > Region 45 45 Documenting Metadata Documenting metadata should include: ◦ Documenting the design process ◦ Documenting the development process ◦ Providing a record of changes ◦ Recording enhancements over time 46 46 Metadata Documentation Approaches ◦ Automated Data modeling tools ETL tools End-user tools ◦ Manual 47 47 Phase 2: Defining the Dimensional Model ◦ Identify fact tables: Translate business measures into fact tables Analyze source system information for additional measures ◦ Identify dimension tables ◦ Link fact tables to the dimension tables ◦ Model the time dimension 48 48 Star Dimensional Modeling Product Table Product_id Product_desc ... Time Table Day_id Month_id Period_id Year_id Sales Fact Table Product_id Store_id Item_id Day_id Sales_amount Sales_units ... Store Table Store_id District_id ... Item Table Item_id Item_desc ... 49 49 Fact Table Characteristics ◦ ◦ ◦ ◦ Contain numerical metrics of the business Can hold large volumes of data Can grow quickly Can contain base, derived, Sales Fact Table and summarized data Product_id Store_id ◦ Are typically additive Item_id ◦ Are joined to dimension Day_id Sales_amount tables through foreign keys Sales_units that reference primary ... keys in the dimension tables 50 50 Dimension Table Characteristics Dimension tables have the following characteristics: ◦ Contain textual information that represents the attributes of the business ◦ Contain relatively static data ◦ Are joined to a fact table through a foreign key reference 51 51 Star Dimensional Model Characteristics ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ The model is easy for users to understand. Primary keys represent a dimension. Nonforeign key columns are values. Facts are usually highly normalized. Dimensions are completely denormalized. Fast response to queries is provided. Performance is improved by reducing table joins. End users can express complex queries. Support is provided by many front-end tools. 52 52 Using Time in the Data Warehouse ◦ Defining standards for time is critical. ◦ Aggregation based on time is complex. 53 53 The Time Dimension Time is critical to the data warehouse. A consistent representation of time is required for extensibility. Sales fact Time dimension Where should the element of time be stored? 54 54 Using Data Modeling Tools ◦ Tools with a GUI enable definition, modeling, and reporting. ◦ Avoid a mix of modeling techniques caused by: Development pressures Developers with lack of knowledge No strategy ◦ Determine a strategy. ◦ Write and publish formally. ◦ Make available electronically. 55 55 Phase 3: Defining the Physical Model Why ◦ Huge amount of data must be effectively processed and retrieved in realtime. How ◦ Translate the dimensional design to a physical model for implementation. ◦ Define storage strategy for tables and indexes. ◦ Perform database sizing. ◦ Define initial indexing strategy. ◦ Define partitioning strategy. ◦ Update metadata document with physical information. 56 56 Storage and Performance Considerations Database sizing Data partitioning Indexing Star query optimization 57 57 Database Sizing - Test Load Sampling Analyze a representative sample of the data chosen using proven statistical methods. Ensure that the sample reflects: ◦ Test loads for different periods ◦ Day-to-day operations ◦ Seasonal data and worst-case scenarios ◦ Indexes and summaries 58 58 Data Partitioning Breaking up of data into separate physical units that can be handled independently Types of data partitioning ◦ Horizontal partitioning. ◦ Vertical partitioning 59 59 Indexing Indexing is used for the following reasons: ◦ It is a huge cost saving, greatly improving performance and scalability. ◦ It can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed. 60 60 Parallelism Sales table Customers table P1 P2 P3 P1 P2 P3 Parallel Execution Servers 61 61 Using Summary Data Designing summary tables offers the following benefits: ◦ Provides fast access to precomputed data ◦ Reduces use of I/O, CPU, and memory 62 62