Introduction to Data Warehousing
Pasquale LOPS
Gestione della Conoscenza d’Impresa, Academic Year 2003-2004

Introduction
- Data warehousing and decision support have given rise to a new class of databases.
- Design strategies for OLAP databases differ significantly from those for OLTP systems.
- Today’s decision support systems must deliver multidimensional analysis capabilities.

Terminology - What is a Data Warehouse?
A database
- Typically read-only
- Data stored in relational or multidimensional format
- Multidimensional database often populated from a relational database
Populated from existing source systems
- Secondary source of data
- Populated from existing internal or external data sources
- It is possible to build a DSS on top of operational systems
Used for reporting purposes, not transaction-based
- Used primarily for reporting purposes
- Must be designed for analysis purposes
- OLAP, not OLTP

Terminology - Decision Support and Multidimensional Analysis
Decision Support Systems (DSS)
- Facilitate business analysis
- Support business decision makers by providing various types of analysis: trend, comparison and ad hoc reporting
Multidimensional Analysis
- Allows analysis by business dimension
- Equates to ‘Flexible Reporting’
- Allows for drill down, drill up and iterative data analysis

Terminology - OLTP and OLAP
OLTP: On-Line Transaction Processing
- Supports a specific application
- Maintains integrity of data
OLAP: On-Line Analytical Processing
- Supports business analysis
Points of Difference
- Orientation or alignment of data
- Integration
- History (time horizon of data)
- Data access and manipulation
- Usage patterns

OLTP vs. OLAP - Orientation or Alignment of Data
OLTP: Organized around Applications
- Different systems hold different types of data.
- Data is inherently organized by application.
- Different information in a different system.
OLAP: Organized for Business Dimension
- All types of data are integrated into one system.
- Data is organized by defined dimensions of the business.
- Information from different systems is stored in a single database.

OLTP vs. OLAP - Integration
OLTP: Typically Not Integrated
- Different key structures
- Different naming conventions
- Different file formats
- Different hardware platforms
OLAP: Must Be Integrated
- Standard key structures
- Standard naming conventions
- Standard file format
- One warehouse server (logical server)

OLTP vs. OLAP - History
OLTP: Recent or Current Data
- 60-90 days
- Current values only
- No time key
- No time series analysis
- Primary source
OLAP: Historical Data
- 2 or more years
- Historical snapshots of OLTP data
- Time key
- Time series analysis
- Secondary source, until data is purged or lost from OLTP (after 60-90 days)

OLTP vs. OLAP - Data Access and Manipulation
OLTP: Transactions
- Inserts, Updates, Deletes, Selects
- Small amount of data involved in each transaction
- Highly ‘indexable’
- RDBMS focus: locking, concurrency, logical unit of work
OLAP: Bulk Processes
- Selects only
- Large amount of data involved in each process
- Not always ‘indexable’
- RDBMS focus: parallel loader and query, star join, bitmapped indexes

OLTP vs. OLAP - Usage Patterns
OLTP: Fairly Consistent
- Maintains a constant system utilization pattern
OLAP: Spiked or Uneven
- Long periods of light use with spikes of heavy usage
[Graphs: system resource utilization for OLTP and for OLAP]

OLTP vs. OLAP - Summary
              OLTP                       OLAP
Alignment:    Aligned by Application     Aligned by Dimension
Integration:  Typically Not Integrated   Must Be Integrated
History:      Recent or Current Data     Historical Data
Data Access:  Transactions               Bulk Processes
Usage:        Fairly Consistent          Spiked or Uneven
Warehouse Headaches: Batch, Maintenance, Tuning

Intro to ERM and ERD
Terms and Concepts

ENTITIES

ERM Terminology
- ERM - Entity Relationship Model (design)
- ERD - Entity Relationship Diagram (graphical)
- Entity - things of interest to the business, represented by boxes and implemented as tables
- Attributes - things to know about an entity, implemented as columns in tables
- Relationships - how entities relate, represented by lines on the ERD and implemented as foreign keys

Entity Paradigms
- Rounded corners - ERD, Square corners - Relational
- Naming
  - Should be singular in nature
  - Consistency, communication, compatibility

RELATIONSHIPS

Relationships and Business Rules
[Diagram: RX connected to RX Transaction by a line with a crow’s foot]
- Relationship - the line and “crow’s foot” represent a foreign key relationship from RX Transaction to RX
- Cardinality - a crow’s foot means “one or more”, its absence means “one”

Relationships and Business Rules
[Diagram: RX “allows” RX Transaction; RX Transaction “is allowed by” RX]
Optionality
- A solid bar means that the relationship MUST exist
- A circle means that the relationship MAY exist
- Use the words near the entities together with the optionality symbols to complete sentences for the definition
  - RX may allow RX Transactions
  - RX Transactions must be allowed by an RX

Relationships
[Diagram: Department “be managed by” / “manage” Vice President]
One-to-One relationship:
- Each department must be managed by one VP.
- Each VP may manage one department.

Relationships
[Diagram: Management Team “contains” / “is contained in” Vice President]
One-to-Many relationship:
- Each management team must contain many VPs.
- Each VP may be contained in one management team.

Relationships
[Diagram: Employee “hold” / “is held by” Degree]
Many-to-Many relationship:
- Each Employee must hold one or more degrees.
- Each Degree may be held by one or more employees.
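The mapping described above (entities become tables, attributes become columns, relationships become foreign keys) can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the table and column names (rx, rx_transaction, drug_name, fill_date, employee_degree and so on) are assumptions made for the example and do not come from the slides. The mandatory side of a relationship ("RX Transactions must be allowed by an RX") is expressed as a NOT NULL foreign key, and the many-to-many relationship is resolved with an associative table.

```python
import sqlite3


def build_schema(conn):
    """Create tables for the RX and Employee/Degree examples (hypothetical columns)."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- Entity RX: may exist without any transaction ("RX may allow RX Transactions").
        CREATE TABLE rx (
            rx_id     INTEGER PRIMARY KEY,
            drug_name TEXT NOT NULL
        );
        -- Entity RX Transaction: the NOT NULL foreign key encodes
        -- "RX Transactions must be allowed by an RX".
        CREATE TABLE rx_transaction (
            rx_transaction_id INTEGER PRIMARY KEY,
            rx_id             INTEGER NOT NULL REFERENCES rx(rx_id),
            fill_date         TEXT NOT NULL
        );
        -- Many-to-many: Employee and Degree are linked through an associative table.
        -- The "must hold one or more" side cannot be enforced by these keys alone.
        CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE degree   (degree_id   INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE employee_degree (
            employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
            degree_id   INTEGER NOT NULL REFERENCES degree(degree_id),
            PRIMARY KEY (employee_id, degree_id)
        );
        """
    )


def insert_is_rejected(conn, sql):
    """Return True if the statement violates a key or NOT NULL constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.execute("INSERT INTO rx (rx_id, drug_name) VALUES (1, 'aspirin')")
conn.execute("INSERT INTO rx_transaction (rx_id, fill_date) VALUES (1, '2004-01-15')")
# A transaction that points at a non-existent RX is refused by the foreign key.
print(insert_is_rejected(
    conn, "INSERT INTO rx_transaction (rx_id, fill_date) VALUES (99, '2004-01-16')"
))
```

The optional side needs no constraint at all: a row in rx with no matching rows in rx_transaction is simply allowed.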
ATTRIBUTES

Attributes - Terminology
- Attributes are the information we wish to keep about a particular entity
- Example: Inventory
  [Diagram: Inventory entity with attributes Store_id, Item_id, Amt, Units]

Attributes - Implementation
- Entities and their attributes are shown as:
    ENTITY NAME (Attribute1, Attribute2, Attribute3)
    Inventory (Store_id, Item_id, Amount, Units)
- To specify a primary key for an entity/table, underline the appropriate attribute(s):
    DRUG (Store_id, Item_id, Amount, Units)
- For the purposes of normalization, repeating groups of attributes may be shown in curly brackets:
    SALES (ITEM_ID, DATE, SALES_AMT, {ITEM_NAME, CLASS})

Warehouse Architecture Overview

Warehouse Overview
Basic components
- Source Systems
- Warehouse Server
- Warehouse Access Tool
[Diagram labels: Source Systems, Warehouse Server, Warehouse Access Tool, Design Strategies]

Warehouse Overview
- Designers must consider and understand the unique characteristics and requirements of all three previous components.
- Ideally, a project team should pick the best-of-class tools for storing and accessing data.
- In reality, all three pieces should be selected with regard to the others, to ensure that each component will complement the others.

Source Systems
- One or more operational systems will be the source(s) of the data stored in the data warehouse.
[Diagram: Source System A, Source System B, Source System C, each with its own user group (User Group A, User Group B, User Group C)]
- Source systems are typically not integrated.
  - Have unique key structures and unique naming conventions
  - Possess overlapping data
- Source systems hold current value data.
- Source systems will indirectly define the scope of a warehouse.
  - Only data found in source systems can be included in the data warehouse; no “new” data can be created.
  - Each operational system will have unique characteristics (levels of detail or granularity of data, types of data or metrics available)

Warehouse Server
[Diagram labels: Distributed Architecture A, Distributed Architecture B, DWH RDBMS, Gateway, HW Platform (typically UNIX-based)]

Warehouse Access Tool / Architecture
[Diagram labels: MOLAP (Client, MDDB calls, MDDB, SQL); ROLAP, 2-3 tier (Client, Messaging, App Server, SQL); DWH RDBMS; HW Platform]

Design Overview

The Warehouse Trade-Off Triangle
[Diagram labels: Query Performance, Data Warehouse Maintenance, User Requirements, Schema]

The ETL Process
ETL = Extraction, Transformation and Loading

Batch Process - Overview
[Diagram: on the source system server, Source System -> Extract Program -> Extract File; File Transfer; on the warehouse server, Load File in the Landing Space -> DWH RDBMS]

Batch Process - Extracts
Extracts are programs that generate data files.
- Perform data transformations and data cleaning.
- Perform key conversions.
- Reformat data to the standards of the warehouse.
- Must produce data in a file format suitable for loading into the data warehouse (delimiters, capitalization, etc.).
- May build aggregation tables.

Basic Types of Extracts
1) Fact tables
   Must provide load files for the following tables:
   - Base tables
   - Historical tables
   - Aggregate tables
2) Lookup tables
   Must provide data to populate the following tables:
   - Lookup tables
   - Relationship tables

Extraction modes
- Static extraction, for the first loading of the DWH
- Incremental extraction, for the update of the DWH

Batch Process - File Transfers
File Transfer: the process of moving data files to the data warehouse server.
After extracts, the generated files must be moved from the source systems to the data warehouse server.
Design Considerations
- Transfer method and network impact
  - Usually transferred via FTP
  - Data volumes are usually large; therefore the impact on the network and the transfer rate should be tested and understood
  - The landing space must have enough disk space on the data warehouse to temporarily store extract files before they are loaded into the warehouse
- Scheduling routines
  - If there is not enough landing space, scheduling routines must be designed.
  - Routines must coordinate file transfers and database loads, transferring a new data file only after an existing file has been loaded and is no longer needed.

Batch Process - Data Loads
Data Load: data loaded from the extract file into the database.
- Post-load processes
- Aggregation routines
Basic Types of Load Procedures
- Append new records to an existing table.
- Drop the table and reload the updated data file.
- Update existing records.
Post-Load Processes
- Must update table indexes after data loads.
- For statistics-based optimizers, must update table and index statistics after loads.
Aggregation Routines
- Must run aggregation routines if aggregate data preparation is performed in the data warehouse database.

Batch Jobs - Overview
Basic Refresh Jobs: necessary to update tables in the warehouse with current information
- Lookup Table
- Fact Table
- Aggregate Table
Maintenance Jobs: necessary to maintain tables in the warehouse
- Updating fact data
- Re-organizing data

Basic Refresh Jobs - Lookup Tables
Purpose
- Apply changes in existing “organizational” systems to lookup data in the data warehouse.
- Changes include addition of new items or changes to descriptive information.
- No changes to attribute keys or attribute relationships.
Basic Methods
- No refresh
  - Extract is run once to populate the DWH
  - Often used in pilot or prototype systems
- Drop and reload
  - Existing table is dropped or emptied
  - Extract is re-run to capture current information
  - Table is loaded with the new extract
- Append to existing table
  - Extract is re-run to capture current information
  - New extract and “old” or “master” lookup file are compared.
  - New “Delta” file is generated.
  - Delta is applied to the master lookup file and to the lookup table in the warehouse.
  - Delta file may be loaded into the warehouse directly, OR
  - Delta may be applied to the master lookup file, which is then loaded using the Drop and Reload method.
  - Sophisticated batch routines, normally used in production
[Diagram, drop and reload: Org Source System -> Extract Program -> Extract File 1/96, Extract File 2/96 -> Lookup Table in DWH RDBMS]
[Diagram, append: Org Source System -> Extract Program -> Extract File 2/96 -> Compare Program (with Master Lookup) -> Delta File -> Lookup Table in DWH RDBMS]

Basic Refresh Jobs - Fact Tables
Purpose
- Refresh or update fact data in the DWH with the new data from source systems.
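Before the fact-table methods, the compare step behind the "append to existing table" method for lookup tables can be sketched as follows. This is a minimal sketch under stated assumptions, not the routine used in any particular warehouse: the lookup data is modelled as a Python dict from key to description, the function names compute_delta and apply_delta are hypothetical, and keys that are missing from the new extract are assumed to stay in the master file because the slides only mention additions and changes to descriptive information.

```python
def compute_delta(master, extract):
    """Return the rows of the new extract that are new or have a changed description.

    Both arguments map a lookup key (for example a store id) to its description.
    """
    return {
        key: desc
        for key, desc in extract.items()
        if key not in master or master[key] != desc
    }


def apply_delta(master, delta):
    """Return a new master lookup with the delta applied (keys are never removed)."""
    updated = dict(master)
    updated.update(delta)
    return updated


# Hypothetical master lookup file and a newer extract from the source system.
master_lookup = {13: "San Fran", 21: "Boston"}
new_extract = {13: "San Francisco", 21: "Boston", 24: "Dallas"}

delta = compute_delta(master_lookup, new_extract)
print(delta)  # {13: 'San Francisco', 24: 'Dallas'}
print(apply_delta(master_lookup, delta))
```

Only the delta has to be transferred and loaded, which is why this method suits production systems where the lookup tables are large and change little between runs.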
Basic methods
- Bulk or historical insert
  - Extract is run to capture all data existing in source systems
  - Data is bulk-loaded into data warehouse fact tables
  - Simple batch routine
  - Used to “start” the warehouse or provide initial data sets
  - Often doesn’t perform any cleansing or integration
- Drop and reload
  - Historical or bulk extract is re-run to capture all available data
  - Existing warehouse table is emptied or truncated
  - File is inserted into the empty fact table
  - Simple batch routine
  - Used in prototypes or pilots
  - Not feasible for large data sets
- Append to existing table
  - Extract is re-run to capture current information
  - New extract is added to the “end” of the existing fact table
  - Most common method used in production systems
[Diagram, bulk insert: Fact Source System -> Extract Program -> Extract File for 9/95 thru 1/96 -> Fact Table in DWH RDBMS]
[Diagram, append: Fact Source System -> Extract Program -> Extract File 2/96 -> Append 2/96 to Fact Table 9/95 - 1/96 in DWH RDBMS]

Basic Refresh Jobs - Aggregate Tables
Purpose
- Refresh aggregate tables
Basic methods
- Aggregate in warehouse RDBMS
  - Produce atomic extract
  - Transfer and load atomic extract into atomic fact table
  - Produce aggregate values using SQL accessing the atomic fact table
  - Insert aggregate values into aggregate fact table
- Aggregate in batch (on source systems or warehouse server)
  - Produce atomic extract
  - Transfer and load atomic extract into atomic fact table
  - Produce aggregate extract from atomic extract
  - Transfer and load aggregate extract into aggregate fact table
[Diagram: on the source system server, Source System -> Extract Program -> Atomic Extract -> Aggregate Program -> Aggregate Extract; in the warehouse RDBMS, Atomic Fact Table -> Aggregate SQL Routines -> Aggregate Fact Table]

Maintenance Jobs - Updating Fact Data
Purpose
- As changes are made to source system data, they should be reflected in the data warehouse.
Basic Methods
- Ignore changes.
- Wait until audited data is available.
- Drop and reload the day’s extract.
- Capture and apply changes.
- Transfer changes.
Scenario
- The audit process produces a clean data set 3 days after the initial set is posted.
[Diagram: weekly timeline, Sun to Sat. Sunday: Sun Pre Audit Data. Wednesday: Wed Pre Audit Data and Sun Post Audit Data]
[Diagram: Fact Source System -> Sunday Extract Program -> Sun Pre Data and Sun Post Data -> Compare Program -> Delta File -> Fact Table in DWH RDBMS; Wed Pre Data]

Maintenance Jobs - Re-Organizing Data
The relationship between Region and Store changes.
  Region Lookup (Region_id, Region_desc)
  Store Lookup (Store_id, Store_desc, Region_id)
Must update the foreign key in Store Lookup. Dallas is moved to the East Region (Region 1):

  Store_id  Store_desc  Region_id
  13        San Fran    2
  21        Boston      1
  24        Dallas      2 -> 1
  27        Philly      1
  35        DC          1
  57        Las Vegas   2

Maintenance Jobs - Re-Organizing Data
  Lookup Region (Region_id, Region_desc)
  Lookup Store (Region_id, Store_id, Store_desc)
  Fact Sales (Region_id, Store_id, Item_id, Week_id, Sales_dollars, Sales_units)
Must update the key for Store in all tables.

  Best Case
  Region_id  Store_id  Store_desc
  1          04        Boston
  1          07        Philly
  1          11        DC
  2          03        San Fran
  2          08        Dallas
  2          11        Las Vegas

  Worst Case
  Region_id  Store_id  Store_desc
  1          04        Boston
  1          07        Philly
  1          11        DC
  2          03        San Fran
  2          07        Dallas
  2          11        Las Vegas

In the best case, moving Dallas to Region 1 gives the key (1, 08), which is still unique. In the worst case it gives (1, 07), which already belongs to Philly.

Maintenance Jobs - Re-Organizing Data
  Lookup Region (Region_id, Region_desc)
  Lookup Store (Store_id, Store_desc, Region_id)
  Region Sales (Region_id, Item_id, Date, Sales_Dollars, Sales_Units)
  Store Sales (Store_id, Item_id, Date, Sales_Dollars, Sales_Units)
With the same Store Lookup change as above (Dallas, Store_id 24, moves from Region 2 to Region 1), it is necessary to re-aggregate table values.

Batch Process - Frequency
- Batch job frequencies differ with data sources and level of detail.
- Typically there will be a set of batch routines dedicated to each level of time detail (daily, weekly and monthly batch jobs).
- Frequency is often but not necessarily tied to the level of time detail included in the data files to be loaded during that batch routine.

Frequency vs. Detail Chart

                     Frequency
  Level of Detail    Daily    Weekly
  Daily              A        B
  Weekly             C??      A

Current week problem
[Diagram labels: Weekly Tables, Daily Tables, Current Week]

References
Golfarelli, M., Rizzi, S., Data Warehouse: Teoria e pratica della progettazione, McGraw-Hill, 2002.