ISQS 6339, Business Intelligence Dimensional Modeling Zhangxi Lin Texas Tech University 1 1 Outline Principles of Dimensional Modeling Data Warehousing Methodology Three Phases of Dimensional Modeling 2 PRINCIPLES OF DIMENSIONAL MODELING 3 Dimensional Model Also called star schema (but snowflake schema is also fine) ◦ Fact table is in the middle and dimensions serving as the points on the star. ◦ A normalized fact table plus denormalized dimension tables Reference: database normalization ◦ Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974. ◦ Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies. 4 4 Star Schema Model Product Table Store Table Product_id Product_disc,... Store_id District_id,... Sales Fact Table Central fact table Product_id Store_id Item_id Day_id Sales_amount Sales_units, ... Denormalized dimensions Time Table Day_id Month_id Year_id,... Item Table Item_id Item_desc,... 5 5 Snowflake Schema Model Product Table Product_id Product_desc Store Table Store_id Store_desc District_id District Table District_id District_desc Sales Fact Table Item_id Store_id Product_id Week_id Sales_amount Sales_units Time Table Item Table Dept Table Mgr Table Week_id Period_id Year_id Item_id Item_desc Dept_id Dept_id Dept_desc Mgr_id Dept_id Mgr_id Mgr_name 6 6 Snowflake Schema Model ◦ Direct use by some tools ◦ More flexible to change ◦ Provides for speedier data loading ◦ Can become large and unmanageable ◦ Degrades query performance ◦ More complex metadata Country State County City 7 7 Facts Definition ◦ Measure – a numeric quantity expressing some aspect of the organization’s performance ◦ Aggregate – formed by combining values from a given dimension or set of dimensions to create a single value. Measurements associated with a specific business process. Most facts are additive (calculative); others are semiadditive, non-additive, or descriptive (e.g. factless fact table). Many facts can be derived from other facts. So, nonadditive facts can be avoided by calculating it from additive facts. Fact Table Characteristics ◦ ◦ ◦ ◦ Contain numerical metrics of the business Can hold large volumes of data Can grow quickly Can contain base, derived, and summarized data Sales Fact Table ◦ Are typically additive Product_id ◦ Are joined to dimension Store_id tables through foreign keys Item_id Day_id that reference primary keys in the dimension tables Sales_amount Sales_units ... 9 9 The Three Fact Table Types Transaction fact table ◦ The most basic and fundamental ◦ “One row per line in a transaction", e.g., every line on a receipt ◦ A transactional fact table holds data of the most detailed level ◦ have a great number of dimensions associated with it Periodic snapshot fact table ◦ Takes a "picture of the moment“ ◦ Cumulative performance over specific time intervals ◦ Dependent on the transactional table ◦ Valuable to combine data across several business processes in the value chain Accumulating snapshot fact table ◦ Used to show the activity of a process that has a well-defined beginning and end ◦ Constantly updated over time 10 Types of facts Week Date Trans# 1 1A1-1 1 2A1-2 1 4A1-3 2 2A2-1 2 2A2-2 3 1A3-1 4 2A4-1 4 3A4-2 4 3A4-3 4 5A4-4 Change OldBal NewBal 100 1000 1100 -50 1100 1050 200 1050 1250 -120 1250 1130 200 1130 1330 -300 1330 1030 -20 1030 1010 100 1010 1110 250 1110 1360 -220 1360 1140 Transaction fact: each row Periodic snapshot fact: (OldBal, NewBal) on each transaction Accumulating snapshot fact: The average numbers in a week, such as average balance, number of transactions, average amount of transactions, the total amount of trading in a given period. Dimensions Definition: a categorization used to spread out an aggregate measure to reveal its constituent part The foundation of the dimensional model to describe the objects of the business The nouns of the DW/BI system ◦ Business processes (facts) are the verbs of the business Dimension tables link to all the business processes. A dimension shared across all processes is called conformed dimension The analysis involving data from more than one business process is called drill-across. 12 12 Attributes An additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member. Attributes can be used to more fully describe dimension members Dimension Table Characteristics Dimension tables have the following characteristics: ◦ Contain textual information that represents the attributes of the business ◦ Contain relatively static data ◦ Are joined to a fact table through a foreign key reference 14 14 Star Dimensional Model Characteristics ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ The model is easy for users to understand. Primary keys represent a dimension. Nonforeign key columns are values. Facts are usually highly normalized. Dimensions are completely denormalized. Fast response to queries is provided. Performance is improved by reducing table joins. End users can express complex queries. Support is provided by many front-end tools. 15 15 The Time Dimension Time is critical to the data warehouse. A consistent representation of time is required for extensibility. Sales fact Time dimension Where should the element of time be stored? 16 16 Hierarchies Meaningful, standard ways to group the data within a dimension ◦ Variable-depth hierarchies ◦ Frequently changing hierarchies Examples of hierarchy in a dimension ◦ Address: street, city, state, country ◦ Organization: section, division, branch, region ◦ Time: year, quarter, month, date 17 Data Cube Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another. Data cubes aren't restricted to just three dimensions. Most OLAP systems can build data cubes with many more dimensions allows up to 64 dimensions. In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions. 18 18 Data Cube Region Product Time OLAP system OLAP – allows users to retrieve information from data quickly for analysis purposes Features ◦ Multidimensional database ◦ Easily understood ◦ What is OLAP? 5’04” ◦ SQL OLAP Tutorial - Data Warehouse Schema Design 9’45” Dimensional Modeling Process High level dimensional model design ◦ ◦ ◦ ◦ Choosing business model in accordance with the analytic theme Declaring the grain Choosing dimensions Identifying the facts Detailed dimensional model development Dimensional model review and validation ◦ IS ◦ Core users ◦ Business community Final design iteration ISQS 6339, Data Mgmt & BI, Zhangxi Lin 21 DATA WAREHOUSING METHODOLOGY 22 Data Warehouse Development Approaches Data warehouse development approaches ◦ Kimball Model: Data mart approach ◦ Data marts - EDW Inmon Model: EDW approach EDW – Data Marts Which model is better? ◦ ◦ There is no one-size-fits-all strategy to data warehousing One alternative is the hosted warehouse 23 Comparison Kimball Model ◦ Kimball’s model follows a bottom-up approach. The Data Warehouse (DW) is provisioned from Datamarts (DM) as and when they are available or required. ◦ The Datamarts are sourced from OLTP systems are usually relational databases in Third normal form (3NF). ◦ The Data Warehouse which is central to the model is a de-normalized star schema. The OLAP cubes are built on this DW. Inmon Model ◦ Inmon’s model follows a top-down approach. The Data Warehouse (DW) is sourced from OLTP systems and is the central repository of data. ◦ The Data Warehouse in Inmon’s model is in Third Normal Form (3NF). ◦ The Datamarts (DM) are provisioned out of the Data Warehouse as and when required. Datamarts in Inmon’s model are in 3NF from which the OLAP cubes are built. Strengths and Weaknesses Scalable vs. structural ◦ Kimball’s model is more scalable because of the bottom-up approach and hence you can start small and scale-up eventually. The ROI is usually faster with Kimball’s model. Because of this approach it is difficult to created re-usable structures/ ETL for different data marts. ◦ On the other hand Inmon’s model is more structured and easier to maintain while it is rigid and takes more time to build. The significant advantage of Inmon’s model is because the DW is in 3NF; it is easier to build data mining models. Both Kimball and Inmon models agree and emphasis that DW is the central repository of data and OLAP cubes are built of denormalized star schemas. In conclusion, when it comes to data modeling, it is irrelevant which camp you belong to as long as you understand why you are adopting a specific model. Sometimes it makes sense to take a hybrid approach. General Data Warehouse Development Approaches “Big bang” approach Incremental approach: ◦ Top-down incremental approach ◦ Bottom-up incremental approach ISQS 6339, Data Mgmt & BI, Zhangxi Lin 27 “Big Bang” Approach Analyze enterprise requirements Build enterprise data warehouse Report in subsets or store in data marts ISQS 6339, Data Mgmt & BI, Zhangxi Lin 28 Incremental Approach to Warehouse Development Multiple iterations Shorter implementations Validation of each phase Increment 1 Strategy Definition Analysis Design Iterative Build Production ISQS 6339, Data Mgmt & BI, Zhangxi Lin 29 Top-Down Approach Analyze requirements at the enterprise level Develop conceptual information model Identify and prioritize subject areas Complete a model of selected subject area Map to available data Perform a source system analysis Implement base technical architecture Establish metadata, extraction, and load processes for the initial subject area Create and populate the initial subject area data mart within the overall warehouse framework ISQS 6339, Data Mgmt & BI, Zhangxi Lin 30 Bottom-Up Approach Define the scope and coverage of the data warehouse and analyze the source systems within this scope Define the initial increment based on the political pressure, assumed business benefit and data volume Implement base technical architecture and establish metadata, extraction, and load processes as required by increment Create and populate the initial subject areas within the overall warehouse framework ISQS 6339, Data Mgmt & BI, Zhangxi Lin 31 Note:There are many details about data warehouse design, which need a lot effort to learn. Because of limited time to spend for this part, here are only some of the details. THREE PHASES OF DATA WAREHOUSE DESIGN 32 Data Warehouse Database Design Phases Phase 1: Defining the business model Phase 2: Defining the dimensional model Phase 3: Defining the physical model 33 33 Phase 1: Defining the Business Model ◦ Performing strategic analysis ◦ Define business analytic theme ◦ Creating the business model ◦ Documenting metadata 34 34 Performing Strategic Analysis Identify crucial business processes Understand business processes Prioritize and select the business processes to implement High Business Benefit Low Low Feasibility High 35 35 Creating the Business Model Defining business requirements: ◦ Identifying the business measures ◦ Identifying the dimensions ◦ Identifying the grain ◦ Identifying the business definitions and rules Verifying data sources 36 36 Business Requirements Drive the Design Process ◦ Primary input Business Requirements ◦ Secondary input Existing Metadata Production ERD Model Research 37 37 Identifying Measures and Dimensions The Measures attribute is perceived as constant or discrete: ◦ Product ◦ Location ◦ Time ◦ Size ◦ ◦ ◦ ◦ attribute varies continuously: Balance Units Sold Cost Sales The Dimensions 38 38 Using a Business Process Matrix Business Processes Business Dimensions Sales Returns Inventory Customer Date Product Channel Promotion Sample of business process matrix 39 39 Determining Granularity YEAR? QUARTER? MONTH? WEEK? DAY? 40 40 Identifying Business Rules Location Geographic proximity 0 - 1 miles 1 - 5 miles > 5 miles Time Month > Quarter > Year Product Type Monitor Status PC Server 15 inch 17 inch 19 inch None New Rebuilt Custom Store Store > District > Region 41 41 Documenting Metadata Documenting metadata should include: ◦ Documenting the design process ◦ Documenting the development process ◦ Providing a record of changes ◦ Recording enhancements over time 42 42 Metadata Documentation Approaches ◦ Automated Data modeling tools ETL tools End-user tools ◦ Manual 43 43 Phase 2: Defining the Dimensional Model ◦ Identify fact tables: Translate business measures into fact tables Analyze source system information for additional measures ◦ Identify dimension tables ◦ Link fact tables to the dimension tables ◦ Model the time dimension 44 44 Illustrative case – IMW Data Transaction fact – The transaction table Periodic snapshot fact table – current records in Land & Office facts Accumulating snapshot fact table – N/A in this case Steps in designing a fact table Identify a business process for analysis (like sales). Identify measures or facts (sales dollar), by asking questions like 'What number of XX are relevant for the business process?', replacing the XX with various options that make sense within the context of the business. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension), by asking questions that make sense within the context of the business, like 'Analyse by XX', where XX is replaced with the subject to test. List the columns that describe each dimension (region name, branch name, business unit name). Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars). An alternative approach is the four step design process described in Kimball. – Check what it is Using Time in the Data Warehouse ◦ Defining standards for time is critical. ◦ Aggregation based on time is complex. 47 47 Using Data Modeling Tools ◦ Tools with a GUI enable definition, modeling, and reporting. ◦ Avoid a mix of modeling techniques caused by: Development pressures Developers with lack of knowledge No strategy ◦ Determine a strategy. ◦ Write and publish formally. ◦ Make available electronically. 48 48 Phase 3: Defining the Physical Model Why ◦ Huge amount of data must be effectively processed and retrieved in realtime. How ◦ Translate the dimensional design to a physical model for implementation. ◦ Define storage strategy for tables and indexes. ◦ Perform database sizing. ◦ Define initial indexing strategy. ◦ Define partitioning strategy. ◦ Update metadata document with physical information. 49 49 Storage and Performance Considerations Database sizing Data partitioning Indexing Star query optimization 50 50 Database Sizing - Test Load Sampling Analyze a representative sample of the data chosen using proven statistical methods. Ensure that the sample reflects: ◦ ◦ ◦ ◦ Test loads for different periods Day-to-day operations Seasonal data and worst-case scenarios Indexes and summaries 51 51 Data Partitioning Breaking up of data into separate physical units that can be handled independently Types of data partitioning ◦ Horizontal partitioning. ◦ Vertical partitioning 52 52 Indexing Indexing is used for the following reasons: ◦ It is a huge cost saving, greatly improving performance and scalability. ◦ It can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed. 53 53 Parallelism Sales table Customers table P1 P2 P3 P1 P2 P3 Parallel Execution Servers 54 54 Using Summary Data Designing summary tables offers the following benefits: ◦ Provides fast access to precomputed data ◦ Reduces use of I/O, CPU, and memory 55 55