Data Warehousing
Dale-Marie Wilson, Ph.D.

Evolution of Data Warehousing
• Since the 1970s, organizations have gained competitive advantage through:
  • Automated business processes
  • More efficient and cost-effective services to customers
• Resulted in accumulation of growing amounts of data in operational databases
• Increased focus on ways to use operational data to support decision-making
  • Means of gaining competitive advantage
• Operational systems not designed to support such business activities
  • Typically numerous operational systems with overlapping and contradictory definitions
• Organizations need to turn archives of data into source of knowledge
• Goal: single integrated / consolidated view of organization’s data presented to user
• Solution: Data Warehouse
  • Provides system capable of supporting decision-making, receiving data from multiple operational data sources

Data Warehousing Concepts
• A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993)

Subject-oriented Data
• Warehouse organized around major subjects of the enterprise (e.g. customers, products, and sales)
• Not major application areas (e.g.
customer invoicing, stock control, and product sales)
• Stores decision-support data, not application-oriented data

Integrated Data
• Integrates corporate application-oriented data from different source systems
• Source data often inconsistent
• Integrated data source made consistent
• Presents unified view of data to users

Time-variant Data
• Data accurate and valid only at an instant in time or over a time interval
• Time-variance shown in:
  • Extended time data held
  • Implicit/explicit association of time with data
  • Data represents series of snapshots

Non-volatile Data
• Data not updated in real time
• Refreshed from operational systems on regular basis
• New data added as supplement, not replacement

Data Webhouse
• Web is source of behavioral data
• Clickstream: user’s path through Website and Web history
• Data webhouse is a distributed data warehouse with no central data repository, implemented over the Web to harness clickstream data

Benefits of Data Warehouse
• Potential high returns on investment
• Competitive advantage
• Increased productivity of corporate decision-makers

Comparison of OLTP Systems and Data Warehousing

Data Warehouse Queries
• Queries range from relatively simple to highly complex
• Dependent on end-user access tools used
• End-user access tools:
  • Reporting, query, and application development tools
  • Executive information systems (EIS)
  • OLAP tools
  • Data mining tools

Examples of Typical Data Warehouse Queries
• What was the total revenue for Scotland in the third quarter of 2004?
• What was the total revenue for property sales for each type of property in Great Britain in 2003?
• What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?
• What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?
• What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?
• Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?
• What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?

Problems of Data Warehousing
• Underestimation of resources for data loading
• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• Data homogenization
• High demand for resources
• Data ownership
• High maintenance
• Long duration projects
• Complexity of integration

Typical Architecture of Data Warehouse

Operational Data Resources
• Mainframe first-generation hierarchical and network databases
• Departmental proprietary file systems (e.g. VSAM, RMS)
• Relational DBMSs (e.g.
Informix, Oracle)
• Private workstations and servers
• External systems
  • Internet
  • Commercially available databases
  • Databases associated with organization’s suppliers or customers

Operational Data Store (ODS)
• Repository of current and integrated operational data used for analysis
• Structured and supplied with data like data warehouse
• May act as staging area for data to be moved into warehouse
• Created when legacy operational systems incapable of achieving reporting requirements
• Benefits:
  • Provides users with ease of use of relational database
  • Remains distant from decision support functions of data warehouse

Load Manager
• Performs operations associated with extraction and loading of data
• Size and complexity vary between data warehouses
• Constructed using combination of vendor data loading tools and custom-built programs

Warehouse Manager
• Performs operations associated with management of data
• Constructed using vendor data management tools and custom-built programs
• Operations:
  • Data analysis to ensure consistency
  • Transformation and merging of source data from temporary storage
  • Creation of indexes and views on base tables
  • Generation of denormalizations (if necessary)
  • Generation of aggregations (if necessary)
  • Backing-up and archiving data
• Generates query profiles to determine which indexes and aggregations are appropriate
• Query profile
  • Can be generated for each user, group of users, or the data warehouse
  • Describes characteristics of queries:
    • Frequency
    • Target table(s)
    • Size of results set

Query Manager
• Performs operations associated with management of user queries
• Constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs
• Complexity determined by facilities provided by end-user access tools and database
• Operations:
  • Directing queries to appropriate
tables
  • Scheduling execution of queries
• Can generate query profiles
  • Allows warehouse manager to determine appropriate indexes and aggregations

Detailed Data
• Detailed data stored in database schema
• Not stored online
• Aggregated to next level of detail
• Regularly added to warehouse to supplement aggregated data

Lightly and Highly Summarized Data
• Stores pre-defined lightly and highly aggregated data generated by warehouse manager
• Transient: changes to respond to changing query profiles
• Purpose of summary information:
  • Improve query performance
  • Removes requirement to continually perform summary operations in answering user queries
• Summary data updated continuously as new data loaded into warehouse

Archive/Backup Data
• Stores detailed and summarized data for archiving and backup
• Data transferred to storage archives such as magnetic tape or optical disk

Metadata
• Stores metadata (data about data) definitions used by all processes in warehouse
• Used for:
  • Extraction and loading processes
    • Used to map data sources to common view of information within warehouse
  • Warehouse management process
    • Used to automate production of summary tables
  • Query management process
    • Used to direct query to most appropriate data source
• Metadata structure differs between processes
  • Different purposes
• Issues:
  • Multiple copies of metadata describe same data item
  • Vendor tools and end-user data access tools use own versions of metadata
  • Copy management tools use metadata to understand mapping rules that are applied to convert source data into common form
  • End-user access tools use metadata to understand how to build a query
• Management of metadata within data warehouse is a very complex task that should not be underestimated

End-User Access Tools
• Principal purpose of data warehousing: to provide information to business users for strategic decision-making
• Users interact with warehouse using end-user access tools
• Data warehouse must efficiently support ad hoc and routine analysis
• High performance achieved by:
  • Pre-planning requirements for joins, summations, and periodic reports by end-users (where possible)
• Main groups of access tools:
  • Data reporting and query tools
  • Application development tools
  • Executive information system (EIS) tools
  • Online analytical processing (OLAP) tools
  • Data mining tools

Data Warehouse Information Flows
• Inflow: processes associated with extraction, cleansing, and loading data from source systems
• Upflow: processes associated with adding value to data in warehouse through summarizing, packaging, and distribution
• Downflow: processes associated with archiving and backing-up/recovery of data
• Outflow: processes associated with making data available to end-users
• Metaflow: processes associated with management of metadata

Data Warehousing Tools and Technologies
• Building data warehouse is complex task
• No vendor provides an ‘end-to-end’ set of tools
• Data warehouse must therefore be built using multiple products from different vendors
• Major challenge: ensuring products work well together and are fully integrated
• Tasks of capturing data from source systems, cleansing and transforming it, and loading results into target system can be carried out either by separate products or by a single integrated solution
• Integrated solutions include:
  • Code generators
  • Database data replication tools
  • Dynamic transformation engines

Data Warehouse DBMS Requirements
• Load performance
• Load processing
• Data quality management
• Query performance
• Terabyte scalability
• Mass user scalability
• Networked data warehouse
• Warehouse administration
• Integrated dimensional analysis
• Advanced query functionality

Administration and Management Tools
• Monitoring data loading from multiple sources
• Data quality and integrity checks
• Managing and updating metadata
• Monitoring database performance to ensure efficient query response times and resource utilization
• Auditing data warehouse usage to provide user chargeback information
• Replicating, subsetting, and distributing data
• Maintaining efficient data storage management
• Purging data
• Archiving and backing-up data
• Implementing recovery following failure
• Security management

Typical Data Warehouse and Data Mart Architecture

Data Mart
• A subset of a data warehouse that supports the requirements of a particular department or business function
• Characteristics:
  • Focuses on requirements of one department or business function
  • Unlike a data warehouse, does not normally contain detailed operational data
  • More easily understood and navigated

Reasons for Creating a Data Mart
• Give users access to data they need to analyze most often
• Provide data in form that matches collective view of data by group of users in a department or business function area
• Improve end-user response time
  • Reduction in volume of data to be accessed
• Provide appropriately structured data as dictated by requirements of end-user access tools
• Building data mart is simpler than establishing corporate data warehouse
• Cost of implementing data marts is less than that required to establish data warehouse
• Potential users of data mart more clearly defined
  • More easily targeted to obtain support for data mart project

Designing Data Warehouses
• Initially, need answers for questions such as:
  • Which user requirements are most important and which data should be considered first?
  • Should the project be scaled down into something more manageable?
  • Should the infrastructure for a scaled-down project be capable of ultimately delivering a full-scale enterprise-wide data warehouse?
• Use of data marts avoids complexities associated with designing a data warehouse
• Difficult to commit to enterprise-wide design that must meet all user requirements
• Interim solution: build data marts
• Goal: creation of data warehouse that supports requirements of enterprise
• Requirements collection and analysis stage:
  • Involves interviewing appropriate members of staff (such as marketing users, finance users, and sales users)
    • Identify prioritized set of requirements data warehouse must meet
  • Interviews conducted with members of staff responsible for operational systems
    • Identify which data sources can provide clean, valid, and consistent data that will remain supported over next few years
• Interviews provide necessary information for top-down view (user requirements) and bottom-up view (available data sources)
• Database component of data warehouse described using technique called dimensionality modeling

Dimensionality Modelling
• Logical design technique that aims to present data in standard, intuitive form that allows for high-performance access
• Uses Entity-Relationship modeling concepts with important restrictions:
  • Every dimensional model (DM) composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables
  • Each dimension table has simple (non-composite) primary key that corresponds exactly to one component of composite key in fact table
• Forms ‘star-like’ structure called star schema or star join
• Natural keys replaced with surrogate keys
  • Every join between fact and dimension tables based on surrogate keys, not natural keys
  • Surrogate key: generalized structure based on integers
  • Makes data in warehouse independent of data used and produced by OLTP systems

Figure: Star schema for property sales of DreamHome

• Star schema: logical structure
  • Has fact table containing factual data in center
  • Surrounded by dimension tables containing
reference data, which can be denormalized
• Facts generated by events that occurred in the past
  • Unlikely to change, regardless of how analyzed
• Fact tables:
  • Hold bulk of data in data warehouse
  • Can be extremely large
  • Important to treat fact data as read-only reference data that will not change over time
  • Most useful fact tables contain one or more numerical measures, or ‘facts’, that occur for each record and are numeric and additive
• Dimension tables:
  • Usually contain descriptive textual information
  • Dimension attributes used as constraints in data warehouse queries
• Star schemas speed up query performance by denormalizing reference information into single dimension table
• Snowflake schema
  • Variant of the star schema where dimension tables do not contain denormalized data
• Starflake schema
  • Hybrid structure that contains mixture of star (denormalized) and snowflake (normalized) schemas
  • Allows dimensions to be present in both forms to cater for different query requirements

Figure: Property sales with normalized version of Branch dimension table

• Advantages of predictable, standard form of underlying dimensional model:
  • Efficiency
  • Ability to handle changing requirements
    • Star schema handles ad hoc user queries well
  • Extensibility
    • Supports changes e.g.
adding new dimensions or facts
  • Ability to model common business situations
  • Predictable query processing

Comparison of DM and ER models
• ER model
  • Reduces data redundancy
  • Beneficial to transaction processing
• Single ER model normally decomposes into multiple DMs
• Multiple DMs are associated through ‘shared’ dimension tables

Database Design Methodology for Data Warehouses
‘Nine-Step Methodology’:
1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes

Step 1: Choosing the process
• The process (function) refers to subject matter of particular data mart
• First data mart built should be the one:
  • Most likely to be delivered on time
  • Most likely to be delivered within budget
  • That answers the most commercially important business questions

Business process of DreamHome case study

Example – Chosen Data Mart

Step 2: Choosing the grain
• Decide what a record of fact table represents
• Identify dimensions of fact table
• Grain decision for fact table also determines grain of each dimension table
• Include time as core dimension
  • Always present in star schemas

Step 3: Identifying and conforming the dimensions
• Dimensions set context for asking questions about the facts in fact table
• If any dimension occurs in two data marts:
  • Must be exactly same dimension
  • Or one must be mathematical subset of other
• Dimension used in more than one data mart referred to as being conformed

Figure: Star schemas for property sales and property advertising

Step 4: Choosing the facts
• Grain of fact table determines which facts can be used in data mart
• Facts should be numeric and additive
• Unusable facts include:
  • Non-numeric facts
  • Non-additive facts
  • Facts at different granularity from other facts in table

Figure: Property rentals with a badly structured fact table

Figure: Property rentals with fact table corrected

Step 5: Storing pre-calculations in the
fact table
• Once facts selected, re-examine them to determine whether there are opportunities to use pre-calculations

Step 6: Rounding out the dimension tables
• Text descriptions are added to dimension tables
• Text descriptions should be intuitive and understandable to users
• Usefulness of data mart determined by scope and nature of attributes of dimension tables

Step 7: Choosing the duration of the database
• Duration measures how far back in time fact table goes
• Very large fact tables raise two very significant data warehouse design issues:
  • Often difficult to source increasingly old data
  • Mandatory that old versions of important dimensions be used, not the most current versions (the ‘Slowly Changing Dimension’ problem)

Step 8: Tracking slowly changing dimensions
• Slowly changing dimension problem:
  • Proper description of old dimension data must be used with old fact data
  • Generalized key assigned to important dimensions
  • Allows distinction between multiple snapshots of dimensions over period of time
• Three basic types of slowly changing dimensions:
  • Type 1: changed dimension attribute is overwritten
  • Type 2: changed dimension attribute causes new dimension record to be created
  • Type 3: changed dimension attribute causes alternate attribute to be created
    • Both the old and new values of attribute simultaneously accessible in the same dimension record

Step 9: Deciding the query priorities and the query modes
• Most critical physical design issues affecting end-user’s perception include:
  • Physical sort order of fact table on disk
  • Presence of pre-stored summaries or aggregations
• Additional physical design issues:
  • Administration
  • Backup
  • Indexing performance
  • Security

Database Design Methodology for Data Warehouses
• Methodology designs data mart that:
  • Supports requirements of particular business process
  • Allows easy integration with other related data marts to form enterprise-wide data warehouse
• A dimensional model that contains more than one fact table sharing one or more conformed
dimension tables is referred to as a fact constellation

Fact and dimension tables for each business process of DreamHome

Figure: Dimensional model (fact constellation) for the DreamHome data warehouse

Chapters 31 & 32
• Omit material specific to Oracle
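
Worked sketch: star schema, surrogate keys, and a Type 2 slowly changing dimension

The dimensional modelling ideas above can be tried out with a small script. The following is a minimal sketch using Python's standard sqlite3 module, not the DreamHome schema from the textbook: the table names (dim_time, dim_branch, fact_property_sale), the columns, the is_current flag, and all sample rows are assumptions made up for illustration. It shows a fact table with a composite key built from integer surrogate keys, a Type 2 change that creates a new dimension record instead of overwriting the old one, and a typical warehouse query (total revenue for a region in a given quarter).

```python
import sqlite3

# Hypothetical star schema: one fact table plus two dimension tables.
SCHEMA = """
CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,      -- surrogate key
    year     INTEGER NOT NULL,
    quarter  INTEGER NOT NULL
);
CREATE TABLE dim_branch (
    branch_id  INTEGER PRIMARY KEY,    -- surrogate key assigned by the warehouse
    branch_no  TEXT NOT NULL,          -- natural key from the operational system
    city       TEXT NOT NULL,
    region     TEXT NOT NULL,
    is_current INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE fact_property_sale (
    time_id       INTEGER NOT NULL REFERENCES dim_time(time_id),
    branch_id     INTEGER NOT NULL REFERENCES dim_branch(branch_id),
    property_id   INTEGER NOT NULL,
    selling_price REAL NOT NULL,       -- numeric, additive fact
    PRIMARY KEY (time_id, branch_id, property_id)
);
"""


def add_branch(conn, branch_no, city, region):
    """Insert a dimension record and return its new surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_branch (branch_no, city, region) VALUES (?, ?, ?)",
        (branch_no, city, region),
    )
    return cur.lastrowid


def apply_type2_change(conn, branch_no, new_city, new_region):
    """Type 2 change: retire the current record and create a new one."""
    conn.execute(
        "UPDATE dim_branch SET is_current = 0 "
        "WHERE branch_no = ? AND is_current = 1",
        (branch_no,),
    )
    return add_branch(conn, branch_no, new_city, new_region)


def build_demo():
    """Create an in-memory warehouse with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                     [(1, 2004, 2), (2, 2004, 3)])
    glasgow = add_branch(conn, "B003", "Glasgow", "Scotland")
    london = add_branch(conn, "B005", "London", "England")
    sales = [(1, glasgow, 101, 100000.0)]   # Q2 sale, old branch version
    # Branch B003 relocates: old facts keep pointing at the old record.
    edinburgh = apply_type2_change(conn, "B003", "Edinburgh", "Scotland")
    sales += [(2, edinburgh, 102, 150000.0),
              (2, edinburgh, 103, 50000.0),
              (2, london, 104, 300000.0)]
    conn.executemany("INSERT INTO fact_property_sale VALUES (?, ?, ?, ?)",
                     sales)
    return conn


def revenue_for_region(conn, region, year, quarter):
    """Total revenue for a region in one quarter (star join on surrogate keys)."""
    row = conn.execute(
        """SELECT COALESCE(SUM(f.selling_price), 0.0)
           FROM fact_property_sale f
           JOIN dim_time t   ON t.time_id = f.time_id
           JOIN dim_branch b ON b.branch_id = f.branch_id
           WHERE b.region = ? AND t.year = ? AND t.quarter = ?""",
        (region, year, quarter),
    ).fetchone()
    return row[0]


def revenue_by_city(conn, branch_no):
    """Revenue per city for one natural branch key, across all its versions."""
    return conn.execute(
        """SELECT b.city, SUM(f.selling_price)
           FROM fact_property_sale f
           JOIN dim_branch b ON b.branch_id = f.branch_id
           WHERE b.branch_no = ?
           GROUP BY b.city ORDER BY b.city""",
        (branch_no,),
    ).fetchall()
```

Calling revenue_for_region(build_demo(), "Scotland", 2004, 3) answers a query of the same shape as "total revenue for Scotland in the third quarter of 2004". Because the Type 2 change keeps the old branch record, revenue_by_city still attributes the earlier sale to the branch's old city; a Type 1 change would instead overwrite the city and lose that history.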