Designing the data warehouse / data marts Methodologies and Techniques Basic principles Life cycle of the DW First time load Operational Databases Warehouse Database Refresh Refresh Purge or Archive Refresh Oracle Warehouse Any Source Components Any Data Any Access Operational data Relational / Multidimensional Relational tools Oracle Medi` External data Text, image Spatial Web Audio, video OLAP tools Applications/ Web Oracle Intelligence Tools IS develops user’s Views Current Oracle Reports Business users Tactical Oracle Discoverer Analysts Strategic Oracle Express Oracle Data Mart Suite Data Modeling Oracle Data Mart Designer OLTP Databases OLTP Engines Warehousing Engines Data Mart Database Oracle8 SQL*PLUS Data Extraction Data Management Data Access & Analysis Oracle Data Mart Builder Oracle Enterprise Manager Discoverer & Oracle Reports • “Big Bang” Approach: Advantages and Disadvantages Advantages: – warehouse built as part of major project (eg: BPR) – Having a “big picture” of the data warehouse before starting the data warehousing project • Disadvantages: – Involves a high risk, takes a longer time – Runs the risk of needing to change requirements Incremental Approach to Warehouse Development Strategy Definition Analysis Design Build Production • Multiple iterations • Shorter implementations • Validation of each phase Benefits of an Incremental Approach • Delivers a strategic data warehouse solution through incremental development efforts • Provides extensible, scalable architecture • Quickly provides business benefits and ensures a much earlier return of investment • Allows a data warehouse to be built based on a subject or application area at a time • Allows the construction of an integrated data mart environment Data Mart • A subset of a data warehouse that supports the requirements of a particular department or business function. • Characteristics include: – Do not normally contain detailed operational data unlike data warehouses. – May contain certain levels of aggregation Dependent Data Mart Flat Files Operational Systems Marketing Marketing Sales Finance Human Resources Data Warehouse Sales Finance Data Marts External Data Independent Data Mart Operational Systems Flat Files Sales or Marketing External Data Reasons for Creating a Data Mart • To give users more flexible access to the data they need to analyse most often. • To provide data in a form that matches the collective view of a group of users • To improve end-user response time. • Potential users of a data mart are clearly defined and can be targeted for support Reasons for Creating a Data Mart • To provide appropriately structured data as dictated by the requirements of the end-user access tools. • Building a data mart is simpler compared with establishing a corporate data warehouse. • The cost of implementing data marts is far less than that required to establish a data warehouse. Data Marts Issues • • • • Data mart functionality Data mart size Data mart load performance Users access to data in multiple data marts • Data mart Internet / Intranet access • Data mart administration • Data mart installation Example of DW tool OLAP • Rotate and drill down to successive levels of detail. • Create and examine calculated data interactively on large volumes of data. • Determine comparative or relative differences. • Perform exception and trend analysis. • Perform advanced analytical functions for example forecasting, modeling, and regression analysis Original OLAP Rules 1. Multidimensional conceptual view 2. Transparency 3. Accessibility 4. Consistent reporting performance 5. Client-server architecture Original OLAP Rules 6. Multiuser support 7. Unrestricted cross-dimensional operations 8. Intuitive data manipulation 9. Flexible reporting 10. Unlimited dimensions and aggregation levels Relational Database Model Attribute 1 Attribute 2 Attribute 3 Attribute 4 Name Age Gender Emp No. Row 1 Anderson 31 F 1001 Row 2 Green 42 M 1007 Row 3 Lee 22 M 1010 Row 4 Ramos 32 F 1020 The table above illustrates the employee relation. Multidimensional Database Model Customer Store Store Time SALES Product Time FINANCE GL_Line The data is found at the intersection of dimensions. Two dimensions Three dimensions Specialised Multidimensional tool • Benefits: – Quick access to very large volumes of data – Extensive and comprehensive libraries of complex functions • analysis • Strong modeling and forecasting capabilities – Can access multidimensional and relational database structures – Caters for calculated fields • Disadvantages: – Difficulty of changing model – Lack of support for very large volumes of data – May require significant processing power MOLAP Server • The application layer stores data in a multidimensional structure DSS client • The presentation layer provides the MOLAP multidimensional view Engine • Efficient storage and processing Application layer • Complexity hidden from the user • Analysis using preaggregated summaries and precalculated Warehouse measures ROLAP Server • The warehouse stores DSS client atomic data. • The application layer ROLAP generates SQL for the engine three- dimensional view. Application • The presentation layer Multiple layer SQL provides the multidimensional view. Warehouse server MOLAP MDDB Query Periodic load Warehouse Data Express Server Express user ROLAP Cache Live fetch Query Data cache Warehouse Data Express Server Express user Also Hybrid (HOLAP) Choosing a Reporting Architecture • • • • • • • Business needs Good Potential for growth Query interface Performance enterprise architecture OK Network architecture Speed of access Openness MOLAP ROLAP Simple Complex Analysis Data Acquisition • Identify, extract, transform, and transport source data • Consider internal and external data • Perform gap analysis between source data and target database objects • Plan move of data between sources and target • Define first-time load and refresh strategy • Define tool requirements • Build, test, and execute data acquisition modules Modeling • Warehouses differ from operational structures: – Analytical requirements – Subject orientation • Data must map to subject oriented information: – Identify business subjects – Define relationships between subjects – Name the attributes of each subject • Modeling is iterative • Modeling tools are available Modeling the Data Warehouse 1 1. Defining the business model 2. Creating the dimensional model 2, 3 3. Modeling summaries 4. Creating the physical model 4 Physical model Select a business process Identifying Business Rules Location Geographic proximity 0 - 1 miles 1 - 5 miles > 5 miles Time Month > Quarter > Year Product Type Monitor Status PC Server 15 inch 17 inch 19 inch None New Rebuilt Custom Store Store > District > Region Creating the Dimensional Model Identify fact tables – Translate business measures into fact tables – Analyze source system information for additional measures – Identify base and derived measures – Document additivity of measures Identify dimension tables Link fact tables to the dimension tables Create views for users Dimension Tables Dimension tables have the following characteristics: • Contain textual information that represents the attributes of the business • Contain relatively static data • Are joined to a fact table through a foreign key reference Product Channel Facts (units, price) Customer Time Fact Tables Fact tables have the following characteristics: • Contain numeric measures (metrics) of the business • May contain summarized (aggregated) data • May contain date-stamped data • Are typically additive • Have key value that is typically a concatenated key composed of the primary keys of the dimensions • Joined to dimension tables through foreign keys that reference primary keys in the dimension tables Dimensional Model (Star Schema) Fact table Product Channel Facts (units, price) Customer Time Dimension tables Star Schema Model Product Table Product_id Product_desc … • Central fact table • Radiating dimensions • Denormalized model Time Table Day_id Month_id Period_id Year_id Store Table Store_id District_id ... Sales Fact Table Product_id Store_id Item_id Day_id Sales_dollars Sales_units ... Item Table Item_id Item_desc ... Star Schema Model • • • • • • • Easy for users to understand Fast response to queries Simple metadata Supported by many front end tools Less robust to change Slower to build Does not support history Snowflake Schema Model Product Table Product_id Product_desc Store Table Store_id Store_desc District_id District Table District_id District_desc Sales Fact Table Item_id Store_id Sales_dollars Sales_units Time Table Week_id Period_id Year_id Item Table Item_id Item_desc Dept_id Dept Table Dept_id Dept_desc Mgr_id Mgr Table Dept_id Mgr_id Mgr_name Snowflake Schema Model • • • • Direct use by some tools More flexible to change Provides for speedier data loading May become large and unmanageable • Degrades query performance • More complex metadata Using Summary Data Phase 3: Modeling summaries • Provides fast access to precomputed data • Reduces use of I/O, CPU, and memory • Is distilled from source systems and precalculated summaries • Usually exists in summary fact tables Designing Summary Tables • Average • Maximum • Total • Percentage Units Product A Total Product B Total Product C Total Sales(€) Store Summary Tables Example SALES FACTS Sales Region Month 10,000 North Jan 99 12,000 South Feb 99 11,000 North Jan 99 15,000 West Mar 99 18,000 South Feb 99 20,000 North Jan 99 10,000 East Jan 99 2,000 West Mar 99 SALES BY MONTH/REGION Month Region Tot_Sales$ Jan 99 North 41,000 Jan 99 East 10,000 Feb 99 South 40,000 Mar 99 West 17,000 SALES BY MONTH Month Tot_Sales Jan 99 51,000 Feb 99 40,000 Mar 99 17,000 Summary Management in Oracle8i Sales summary Sales Region State City Product Time Summary advisor Summary usage Summary recommendations Space requirements The Time Dimension • Time is critical to the data warehouse. • A consistent representation of time is required for extensibility. Sales fact Time dimension How and where should it be stored?