Data Warehouse Why Data Warehouse and OLAP? The Intelligence Stage of Decision Making needs correct data Data also should be clean, organized with fast access Often the data should only be used for subsequent stages and should not be changed (in other words decision makers are users of data, not generators of data) The data for decision making comes from multiple sources OLTP databases, XML files, Flat files like CSV, PDF, etc. They need to be combined for one version of the truth These needs cannot be easily served by the traditional data sources, hence the concept of data warehouse/data marts Business Intelligence Process Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP DBA Data Warehouse: Definitions Data warehouse There is no single definition, as it can encompass many aspects of the data management. Some of the ones found in the literature are: A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization’s operational database. A consistent database source that bring together information from multiple sources for decision support queries Support information processing by providing a solid platform of consolidated, historical data for analysis Data warehousing A process by which data from multiple sources are extracted, transformed and loaded into a data warehouse in a planned operation Data Warehouse vs. Operational DBMS • OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. Aims at reliable and efficient processing of a large number of transaction and ensuring data consistency • OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Aims at efficient multidimensional processing of large data volumes Fast, interactive answers to large aggregate queries • Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries OLTP vs. OLAP User Clerk, IT Professional Knowledge worker Function Day to day operations Decision support DB Design Application-oriented Subject-oriented Data Current, Isolated Historical, Consolidated View Detailed, Flat relational Summarized and Multidimensional Usage Structured, Repetitive Ad hoc Unit of work Short, Simple transaction Complex query Access Read/write Read Mostly Operations Index/hash on prim. Key Lots of Scans # Rec. accessed Tens Millions #Users Thousands Hundreds Db size 100 MB-GB 100GB-TB Metric Trans. throughput Query throughput, response Data Warehousing Characteristics Subject oriented Integrated Time variant (time series) Nonvolatile Web based Relational/multidimensional Client/server Real-time Include metadata Data Warehouse Architecture Data Warehouse Architecture Inmon (Bill Inmon) Model: EDW approach Kimball (Ralph Kimball) Model: Data mart approach Which model is best? There is no one-size-fits-all strategy to data warehousing One alternative is the hosted warehouse Alternative DW Architectures Teradata’s DW Architecture DW Architecture Categories/Types of Data Warehouses Data Mart Operational Data Stores A departmental data A type of database often warehouse that stores only used as an interim area for relevant data a data warehouse, especially for customer Dependent Data Mart information files A subset that is created Operational Data Marts directly from a data warehouse An operational data mart. Independent Data Mart An operational data mart is a small-scale data mart A small data warehouse typically used by a single designed for a strategic department or functional business unit or a area in an organization department Various OLAPs ROLAP: Relational OLAP. Analysis cube supported by a Multi-Dimensional Data Warehouse (or Data Mart) based on Relational database. MOLAP: Multidimensional OLAP. Analysis cube supported by a multi-dimensional data warehouse based on another multi-dimensional data storage. HOLAP: Hybrid OLAP. Combination of the above two to optimize performance. Your task: Search the web to find the relative advantages and disadvantages of each. Data Warehousing Architectures Issues to consider when deciding which architecture to use: Which database management system (DBMS) should be used? Will parallel processing and/or partitioning be used? Will data migration tools be used to load the data warehouse? What tools will be used to support data retrieval and analysis? Data Warehouse Development and Implementation Structured data Data Warehouse and BI Life Cycle Business Intelligence Reporting, Ad-hoc Query Analysis OLTP Data Warehouse (Enterprise) Unstructured data Phase I Data Warehouse and BI Life Cycle Extract -Transform - Load Data Integration Data Warehouse (Enterprise) ODS Source Data Reporting Tools Develop Discover Deliver Phase II Data Warehouse and BI Life Cycle Data Marts (Departments) Data Warehouse (Enterprise) Multi-Dimensional Cubes (Subject wise) Develop Discover Business Intelligence Deliver Data Warehouse and BI Life Cycle Data Warehouse OLAP Cube Total Sales for Customer 3, Product C, for July. From: http://oraclezine.blogspot.com/2009/01/data-warehousingand-olap-cube.html, accessed January 20, 2011 Data Integration and the Extraction, Transformation, and Load (ETL) Process Extraction reading data from a database Transformation Converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database Load Putting the data into the data warehouse Data Warehouse Schema Dimensional Modeling The Star Schema Dimension Tables that contain the Dimension for Analysis Example: Time, Region, Salesperson, etc. Fact Tables that contains the measures and aggregates Example: Average sales, total commission, total sales, etc. The Snowflake Schema Very similar to Star-schema with a central fact table, but the dimensions are in hierarchical fashion. Example: Listing agent is a part of the listing company, one city can have multiple zip code etc. Reduces the redundant data but can be inefficient for queries that do not follow patterns. Steps for DW OLP Design Decide on your Information Needs I want to know the Total Sales by month, by region and by salesperson Decide the sources of data for each information need Total Sales is not available, must be calculated from unit and quantity sold Most of the unit and qty sold are in OLTP, but some are also in CSV files for some of the stores not connected to the database The Monthly information can be obtained from the date and is in OLTP The region information is also in OLTP transaction records vi a region code The Salesperson information is maintained in an Excel file and is coded by region Declare the grain of the fact table (preferably at the most atomic level) A grain is the lowest level of information you are interested in. The finer is the grain (tending towards atomic), the more are the dimensions. Atomic grain is preferred as it can allow easy roll-ups. However, it can take a lot of space and processing. Often the grains in the fact tables are much coarser with ability to drill down Example of Grains Sales of Each Customer, for each day for each store. Monthly total sales for a particular region for all female customers Add dimensions for "everything you know" about this grain Add numeric measured facts true to the grain. Creating a Star* Schema Identify the dimensions (typically this is the analysis by). In our real estate listing example, it can be the city name, bedrooms, listing agent, etc. Identify the measures. In our case, it can be the average price/sq ft, total number of houses in a city, the average price, etc. Identify the attributes of each dimension. Attributes are the properties of a dimension. For example, if the listing agent is a dimension, then the first name, last name, phone number etc, of a listing agent will be the attributes. * Snowflake schema design is very similar with hierarchy of the dimensions separated in another table. Star Schema continued. Create the dimension tables with a surrogate Primary Key (PK). Include all the necessary attributes. Decide on the measures and calculations. Those will be in the fact table. For each PK in the dimension tables, create foreign keys (FK) in the star table. Example of Star Schema Example of Snowflake Schema Implementing the Star Schema 1. Extract Data From Multiple Sources 2. Integrate, Transform, and Restructure Data 3. Load Data Into Dimension Tables and Fact Tables The Star Schema Data Load Heterogeneous Data Sources Data Warehouse Staging Area Northwind OLTP Financial External Internal Files Files Sales Star DTS DTS External Files Extracting Data From DTS Transforming DTS Heterogeneous Sources Data Inventory Star Loading the Star Schema Verifying the Dimension Source Data Verifying Accuracy of Source Data Integrating data from multiple sources Applying business rules Checking structural requirements Correcting Invalid Data Managing Invalid Data Transforming data Rejecting invalid data Reassigning data values Saving invalid data to a log Dimension Data Load Examples: buyer_name reg_id Barr, Adam 2 Chai, Sean 4 O’Melia, Erin 6 ... ... buyer_code buyer_last reg_id Barr 2 A123 Chai 4 B456 O’Melia 6 ... ... ... buyer_name Barr, Adam Chai, Sean reg_id II IV buyer_name Smith, Jane Paper, Anne reg_id 2 4 DTS DTS DTS buyer_first Adam Sean Erin ... buyer_code U999 A123 B456 ... buyer_last reg_id Barr 2 Chai 4 O’Melia 6 ... ... buyer_last reg_id Barr 2 Chai 4 O’Melia 6 ... ... buyer_name Barr, Adam Chai, Sean Smith, Jane Paper, Anne reg_id 2 4 2 4 Maintaining Integrity of the Dimension Assigning a Surrogate Key to Each Record Defines the dimension’s primary key Relates to the foreign key fields of the fact table Loading One Record Per Application Key Maintains uniqueness in the dimension Depends on how you manage changing dimension data Maintains integrity of the fact table Managing Changing Dimension Data One of the problem in DW is to deal with the changing data on the reload and refresh. OLTP is operational and thus does not face this problem. These change areas are often called SCD or Slow Changing Dimensions Dimensions with Changing Column Values Inserts of new data Updates of existing data Slowly-Changing Dimension Design Solutions Type 1: Overwrite the dimension record Type 2: Write another dimension record Type 3: Add attributes to the dimension record Examining the Star Schema Sales Star Schema Fact Table Dimension Tables Dimension Table Type 1: Overwriting the Dimension Slide Product Dimension product key product name product size product package product dept product cat product subcat ... Before 001 Rice Puffs 10 oz. Bag Grocery Dry Goods Snacks ... After 001 Rice Puffs 12 Oz oz. Bag Grocery Dry Goods Snacks ... Existing record is changed Type 2: Writing Another Dimension Record Product Dimension product key product name product size product package product dept product cat product subcat effective_date … Before 001 Rice Puffs 10 oz. Bag Grocery Dry Goods Snacks 05-01-1995 ... 001 Rice Puffs 10 Oz oz. Bag Grocery Dry Goods Snacks 05-01-1995 ... After 731 Rice Puffs 12 Oz oz. Bag Grocery Dry Goods Snacks 10-15-1998 ... Adds a new record Type 3: Adding Attributes in the Dimension Record Product Dimension product key product name product productsize size product package product dept product cat product subcat current product size date previous product previous product size size previous product size size date date previous product 2nd previous product size 2nd previous product size date ... Before 001 Rice Puffs 10 10 Oz oz. Bag Grocery Dry Goods Snacks 05-01-1995 11 11 Oz oz. 03-20-1994 03-20-1994 (null) (null) ... After 001 Rice Puffs 12 12 oz. oz Bag Grocery Dry Goods Snacks 10-15-1998 10-15-1998 10 oz. 05-01-1995 05-01-1995 11 11Oz oz. 03-20-1994 03-20-1994 ... Additional information is stored in an existing record Verifying the Fact Table Source Data Verifying Accuracy of Source Data Integrating data from multiple sources Applying business rules Checking structural requirements Creating calculated fields Correcting Invalid Data Managing Invalid Data Transforming data Rejecting invalid data Reassigning data values Saving invalid data to a log Assigning Foreign Keys Dimension Tables customer_dim 201 ALFI Alfreds Source Data customer id ALFI product id order date quantity_sales amount_sales 123 1/1/2000 400 10,789 product_dim 25 123 Chai Sales Fact Data cust_key time_dim 134 1/1/2000 201 prod_key time_key 123 25 1/1/2000 134 quantity_sales amount_sales 400 10,789 Defining Measures Loading Measures from the Source System Calculating Additional Measures customer_id VINET ALFI HANAR ... product_id 9GZ 1KJ 0ZA ... price .55 1.10 .98 ... qty 32 48 9 ... Source System Data customer_key 100 238 437 ... product_key 512 207 338 ... qty 32 48 9 ... Fact Table Data total_sales 17.60 52.80 8.82 ... Maintaining Data Integrity Adhering to the Fact Table Grain A fact table can only have one grain You must load a fact table with data at the same level of detail as defined by the grain Enforcing Column Constraints NOT NULL constraints FOREIGN KEY constraints Implementing Staging Tables Centralize and Integrate Source Data Break Up Complex Data Transformations Facilitate Error Recovery market_stage shipments_stage Staging Area sales_stage inventory_stage DTS Functionality Accessing Heterogeneous Data Sources Importing, Exporting, and Transforming Data Creating Reusable Transformations and Functions Automating Data Loads Managing Metadata Customizing and Extending Functionality Defining DTS Packages Identifies Data Sources and Destinations Defines Tasks or Actions Implements Transformation Logic Defines Order of Operations Identifying Package Components Connections: Access Data Sources and Destinations Tasks: Describe Data Transformations or Functions Steps: Define the Order of Task Operations or Workflow Global Variables: Store Data that Can Be Shared Across Tasks Creating Packages Using the DTS Import / Export Wizard Perform ad-hoc table and data transfers Develop a prototype package Using DTS Package Designer Edit packages created with the DTS Import/Export Wizard Create packages with a wide range of functionality Programming DTS Applications Directly access the functionality of the DTS Object Model Requires Microsoft Visual Basic or Microsoft Visual C++ Using DTS to Populate the Sales Star Populating the Sales Star Dimensions Populating the Sales Star Fact Table Populating the Sales Star Dimensions Product Tab Delimited Files product_dim DTS customer_dim Northwind OLTP SQL Server Stored Procedure DTS time_dim DTS Populating the Sales Star Fact Table Sales Data File product_dim DTS sales_stage DTS customer_dim sales_stage time_dim sales_fact Why not Just Excel Pivot Tables? For small project, Pivot Tables are excellent However, if you insert/delete records, you have to refresh the pivot table and may have to change the table reference (though Table features make it easy) Calculation cannot be done for one field that will be reflected for all records. You have to copy it for all the records! No pre-processing can be done. The pivot table is refreshed and recalculated every time. Can lead to inefficiencies No built-in integration mechanism with a data source (can be done by connecting to cube servers and ODBC compliant databases, but again an extra step and work) Selective drill-down is difficult, mainly all or none (again grouping can help, but you have to be skilled in Pivot Tables) Data Warehouse Implementation Implementation issues Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods There are many facets to the project lifecycle, and no single person can be an expert in each area Major Tasks for Successful Implementation of a DW 1. Establishment of 2. 3. 4. 5. service-level agreements and datarefresh requirements Identification of data sources and their governance policies Data quality planning Data model design ETL tool selection 6. Relational database 7. 8. 9. 10. 11. software and platform selection Data transport Data conversion Reconciliation process Purge and archive planning End-user support From Solomon, 2005 Data Warehouse Implementation Implementation factors that can be categorized into three criteria Organizational issues Project issues Technical issues User participation in the development of data and access modeling is a critical success factor in data warehouse development Data Warehouse Management Data warehouse administrator (DWA) A person responsible for the administration and management of a data warehouse Security Concerns “All eggs in one basket” Unauthorized access Usage Concerns Users need to understand how to use the data correctly Why was data collected, how was it stored, how do these affect the current use of the data Data Warehouse Management Effective security in a data warehouse should focus on four main areas: Establishing effective corporate and security policies and procedures Implementing logical security procedures and techniques to restrict access Limiting physical access to the data center environment Establishing an effective internal control review process with an emphasis on security and privacy Data Warehouse Benefits Direct benefits of a data warehouse Allows end users to perform extensive analysis Allows a consolidated view of corporate data Better and more timely information Enhanced system performance Simplification of data access Data Warehouse Benefits Indirect benefits result from end users using these direct benefits Enhance business knowledge Present competitive advantage Enhance customer service and satisfaction Facilitate decision making Help in reforming business processes Data Warehouse Vendor Selection Important guidelines for selecting a DW vendor Financial strength ERP linkages Qualified consultants Market share Industry experience Established partnerships DW Development Best Practices Project must fit with The project must be corporate strategy and business objectives There must be complete buy-in to the project by executives, managers, and users It is important to manage user expectations about the completed project The data warehouse must be built incrementally Build in adaptability managed by both IT and business professionals Develop a business/supplier relationship Only load data that have been cleansed and are of a quality understood by the organization Do not overlook training requirements Be politically aware From Weir, 2002 DW Development: What NOT to do Cultural issues being ignored Inappropriate architecture Unclear business objectives Missing information Unrealistic expectations Low levels of data summarization Low data quality Believing promises of performance, capacity, and scalability Believing that your problems are over when the data warehouse is up and running Believing that DW database design is the same as transactional database design Choosing a DW manager who is technology oriented rather than user oriented Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video Focusing on ad hoc data mining and periodic reporting instead of alerts Delivering data with overlapping and confusing definitions