Atlanta Microsoft Database Forum
Introduction to Data Warehousing Concepts
Presented by Brian Thomas, Solution Builders, Inc.
March 8, 2004
Brian.Thomas@SolutionBuilders.com

What is a Data Warehouse?
Data collected from one or many systems that exist within and outside the organization. The data is structured to reduce the time it takes to produce reliable information.

Why Build a Data Warehouse?
• To Provide a Consistent Common Source for Corporate Information
• To Store Large Volumes of Historical Detail Data from Mission Critical Applications
• To Improve the Ability to Access, Report Against, and Analyze Information
• To Solve or Improve Upon Business Processes

Turning Data into Information: Functional Data Warehouse
Sales Analysis is extrapolated from the System Reports.
Output: System Generated Reports
Source systems:
• Sales System

Turning Data into Information: Functional Data Warehouse
Sales Information is available to a wider audience of decision makers.
Output: Functional Data Warehouse of Sales Information
Source systems:
• Sales System

Turning Data into Information: Cross Organizational Functional Data Warehouse
Analysis is performed and Decisions are drawn from the Cross Organizational Sales Data.
Output: Centralized Data Warehouse of Sales Data from across the Organization
Source systems:
• Division A: Sales System
• Division B: Sales System
• Division C: Sales System

Turning Data into Information: Cross Functional Data Warehouse
Corporate Performance Analysis is extrapolated from the System Reports.
Output: System Generated Reports
Source systems:
• Marketing System
• Sales System
• Production Systems

Turning Data into Information: Cross Functional Data Warehouse
Corporate Performance Analysis is available to a wider audience.
Output: Cross Functional Data Warehouse of Information
Source systems:
• Marketing System
• Sales System
• Production Systems

Turning Data into Information: Cross Organizational & Cross Functional Data Warehouse
Analysis is performed and Decisions made from the Cross Functional Organizational Performance Data.
Output: Centralized Cross Functional Data Warehouse of Information
Source systems:
• Division A
• Division B
• Division C

Data Warehouse Architecture: Data Warehouse Components
Source Systems:
• Division A
• Division B
• Division C
• External Data
Extraction Transformation Load (ETL)
Enterprise Data Warehouse:
• Corporate Level: DW / DM, DM, DM
• Divisional Level: DW / DM, DM, DM
• Business Group Level: DW / DM, DM, DM
Scale labels on the diagram: Increased Level of Standardization; Increased Local Specifications
Data Access & Query Management Services
Management Systems:
• Planning & Forecasting
• Analytics & Modeling
• Performance Management
• Scorecards & Dashboards
• Query & Reporting
Access Methods:
• Portal / Web Interface
• Desktop Applications
• Printed Reports
• Email
• Mobile Devices

Data Warehouse Architecture
Components:
• Source Systems: Division A, Division B, Division C, External Data
• Data Staging Area
• Extract, Transformation and Load (ETL)
• Data Warehouse Repository

Data Warehouse Architecture: Data Staging Area
• Subject Area Oriented
• Data Structure more closely mirrors Operational System Data Layouts
• Supports Identification of Changed Data
• Acts as a Working Area to Support the Transformation Process

Data Warehouse Architecture: Extract, Transformation and Load (ETL)
• Perform Attribute Standardization and Cleansing
• Apply Business Rules and Calculations
• Consolidate using Matching and Merge / Purge Logic
• Ensure Proper Linking and Tracking of History

Data Warehouse Architecture: Extraction, Transformation & Load (ETL)
Lookup Function
• App. A: Male, Female
• App. B: 1, 0
• App. C: x, y
• App. D: m, f
• Result: Male, Female
Conversion Function
• App. A: pipeline (cm)
• App. B: pipeline (inches)
• App. C: pipeline (mcf)
• App. D: pipeline (yds)
• Result: pipeline (cm)
Formatting Function
• App. A: Date (julian)
• App. B: Date (yyyymmdd)
• App. C: Date (mm/dd/yyyy)
• App. D: Date (absolute)
• Result: Date (julian)
Merging Function
• App. A: Description
• App. B: Description
• App. C: Description
• App. D: Description
• Result: Description
Mapping Function
• App. A: balance on hand
• App. B: current balance
• App. C: cash in house
• App. D: balance
• Result: Balance

Data Warehouse Architecture: Data Warehouse Repository
• Organized around Conformed Dimensions and Facts
• Promotes Usability and Intuitiveness
• Consolidated and Cross-Functional
• Historical and Atomic Representation of Data
• Insulated from Source System Modifications and Additions

Data Warehouse Repository: Star Schema Concepts
Fact Table
This table is the core of the Star Schema Structure and contains the Facts or Measures available through the Data Warehouse. These Facts answer the questions of “What”, “How Much”, or “How Many”.
Some Examples: Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.

Data Warehouse Repository: Star Schema Concepts
Dimension Tables
These tables describe the Facts or Measures. These tables contain the Attributes and may also be Hierarchical. These Dimensions answer the questions of “Who”, “What”, “When”, or “Where”.
Some Examples:
• Day, Week, Month, Quarter, Year
• Sales Person, Sales Manager, VP of Sales
• Product, Product Category, Product Line
• Cost Center, Unit, Segment, Business, Company

Data Warehouse Repository: Star Schema Concepts
Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, Required Data (Business Metrics) or (Measures), . . .
Time_Dim: TimeKey, TheDate, . . .
Employee_Dim: EmployeeKey, EmployeeID, . . .
Product_Dim: ProductKey, ProductID, . . .
Customer_Dim: CustomerKey, CustomerID, . . .
Shipper_Dim: ShipperKey, ShipperID, . . .
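
As a rough illustration of the star schema above, the sketch below builds a cut-down Sales_Fact table with two of its dimensions in an in-memory SQLite database and answers a "How Much" question by joining the fact table to its dimensions. It is only a minimal sketch under stated assumptions: the Quarter column, the SalesDollars and UnitsSold measure columns, the sample rows, and the function names are hypothetical and stand in for the columns the slide elides with ". . .". SQLite is used here purely for convenience.

```python
import sqlite3


def build_star_schema(conn):
    """Create a reduced star schema: one fact table and two dimension tables."""
    conn.executescript(
        """
        -- Dimension tables describe the facts ("When", "What").
        CREATE TABLE Time_Dim (
            TimeKey INTEGER PRIMARY KEY,
            TheDate TEXT,
            Quarter TEXT              -- hypothetical hierarchy attribute
        );
        CREATE TABLE Product_Dim (
            ProductKey INTEGER PRIMARY KEY,
            ProductID  TEXT
        );
        -- The fact table holds dimension keys plus the measures.
        CREATE TABLE Sales_Fact (
            TimeKey      INTEGER REFERENCES Time_Dim(TimeKey),
            ProductKey   INTEGER REFERENCES Product_Dim(ProductKey),
            SalesDollars REAL,        -- hypothetical measure column
            UnitsSold    INTEGER      -- hypothetical measure column
        );
        """
    )
    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO Time_Dim VALUES (?, ?, ?)",
        [(1, "2004-01-15", "Q1"), (2, "2004-02-20", "Q1"), (3, "2004-04-10", "Q2")],
    )
    conn.executemany(
        "INSERT INTO Product_Dim VALUES (?, ?)",
        [(10, "Apples"), (11, "Grapes")],
    )
    conn.executemany(
        "INSERT INTO Sales_Fact VALUES (?, ?, ?, ?)",
        [(1, 10, 100.0, 5), (2, 10, 50.0, 2), (2, 11, 30.0, 3), (3, 11, 80.0, 8)],
    )


def sales_by_quarter_and_product(conn):
    """Sum the measures, grouped by attributes taken from the dimensions."""
    return conn.execute(
        """
        SELECT t.Quarter, p.ProductID, SUM(f.SalesDollars), SUM(f.UnitsSold)
        FROM Sales_Fact AS f
        JOIN Time_Dim    AS t ON t.TimeKey    = f.TimeKey
        JOIN Product_Dim AS p ON p.ProductKey = f.ProductKey
        GROUP BY t.Quarter, p.ProductID
        ORDER BY t.Quarter, p.ProductID
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
rows = sales_by_quarter_and_product(conn)
for row in rows:
    print(row)
```

The query never filters or groups on the fact table's own columns: the dimensions supply the "When" and "What", and the fact table supplies only the numbers being summed. The remaining dimensions from the slide (Employee_Dim, Customer_Dim, Shipper_Dim) would join in the same way through their keys.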

Data Warehouse Repository: Cube Concepts
[Figure: a cube of Sales Fact cells with one dimension per axis]
• Markets Dimension: Atlanta, Chicago, Denver, Dallas
• Time Dimension: Q1, Q2, Q3, Q4
• Third axis (not labeled in the extracted text): Grapes, Cherries, Melons, Apples

Data Warehouse Repository: Storage Concepts
• Relational On-Line Analytical Processing (ROLAP): The information that is stored in the Data Warehouse is held in a relational structure. Aggregations are performed on the fly, either by the database or in the analysis tool.
• Multidimensional On-Line Analytical Processing (MOLAP): The information is aggregated in a predefined manner based on the characteristics of the Measures and the defined hierarchy of the Dimensions. Since the data is preaggregated, navigating through the hierarchies is instantaneous. The user is simply navigating to a point within the Multidimensional Cube and not performing any on-the-fly aggregations.
• Hybrid On-Line Analytical Processing (HOLAP): This is a combination of MOLAP and ROLAP. A portion of the data is predefined and aggregated. This would typically be the set of information that is accessed most frequently. Additional detail can be held in a ROLAP structure, allowing a user to drill through the MOLAP structure into the ROLAP structure.

Data Warehouse Repository: Cube Concepts
Client perspective

                      MOLAP     HOLAP     ROLAP
Query performance     Fastest   Faster    Fast
Storage consumption   High      Medium    Low

Where does Microsoft fit in?
The Data Warehouse Components diagram (Source Systems, ETL, Enterprise Data Warehouse, Data Access & Query Management Services, Management Systems, Access Methods) is repeated with these Microsoft technologies labeled on it:
• SQL Server DTS
• SQL Server Relational Database and Analysis Services
• SQL Stored Procedures, SQL Views, MDX, and .NET Web Services
• Microsoft Office, Reporting Services and .NET Framework
• SharePoint Portal, Exchange, and .NET Framework

Q & A