Data Warehouse Architecture
Lecture #3
IMRAN KHAN, IBA

What & Why Architecture?

“An architecture is a set of rules to adhere to when building something”

- Because a data warehouse can become quite large and complex, using an architecture is essential for success.
- Strict rules for how to architect a data warehouse do not exist, but over the last 15 years a few common architectures have emerged.

Imran Khan IBA-FCS City Campus

According to research conducted in 2006 by The Data Warehousing Institute (TDWI), there are five possible ways to architect a data warehouse:

1. Independent data marts: Each data mart is built and loaded individually; there is no common or shared metadata. This is also called a stovepipe solution.
2. Data mart bus: The Kimball solution with conformed dimensions.
3. Hub and spoke (corporate information factory): The Inmon solution with a centralized data warehouse and dependent data marts.
4. Centralized data warehouse: Similar to hub and spoke, but without the spokes; i.e., all end-user access is directly targeted at the data warehouse.
5. Federated: An architecture where multiple data marts or data warehouses already exist and are integrated afterwards. A common approach is to build a virtual data warehouse where all data still resides in the original source systems and is logically integrated using special software solutions.

The Big Debate: Inmon Versus Kimball

In the beginning there were basically two approaches to modeling the data warehouse.

- Inmon popularized the term "data warehouse" and is a strong proponent of a centralized and normalized approach.
- Kimball took a different perspective with his data marts and conformed dimensions.

Differences between the Inmon and Kimball approach

1. Data warehouse versus data marts with conformed dimensions
2. Centralized approach versus iterative/decentralized approach
3. Normalized data model versus dimensional data model

Conceptual DW Architectures

- Direct data mart: short term, quick results
- Architected, enterprise data warehouse: long-term foundation for future development

Data Mart

- A subset of data (from the data warehouse) designed to answer specific business questions
- Also called:
  - Departmental Data Warehouse (Silverston & Graziano)
  - Dimensional Data Warehouse (Kimball)

Direct Data Mart

[Figure: Sources 1, 2 and 3 feed the Sales, Financial and Customer Service data marts directly through transformation routines (ETL).]

Direct Data Marts

Pros:
- Build individual data marts faster
- Good for prototyping

Cons:
- Requires redundant coding
- Must transform each source multiple times, once for each data mart
- New data marts require a new transform for each source
- New sources require multiple transformations
- If business rules change, code must change in multiple routines
- Increased number of routines may require more processing power
- Multiple points of failure for ETL
- Data marts can get out of sync

Architected Data Warehouse

- Core enterprise data warehouse design
  - Based on corporate (logical) data model
  - May include an ODS
- Specific departmental data marts
  - Based on business needs

[Figure: Sources 1, 2 and 3 feed the Enterprise Data Warehouse, which in turn feeds the Sales, Financial and Customer Service data marts.]

Pros:
- Reduced long-term maintenance
- Complex source transformations occur once, from source to staging area (or ODS)
- If business rules change, code changes are required in only one place
- Reduces points of failure
- Second set of ETL routines handles simple aggregations and data segmentation
- Easier to create new data marts
- Enterprise DW becomes the source of historical data

Cons:
- Requires more disk space
- Requires 2 sets of ETL routines

Corporate Information Factory

[Figure: Corporate Information Factory (CIF). Labels in the diagram: operational systems (ERP, Internet, Legacy, External, Other) connected through APIs; Data Acquisition; Data Warehouse; Operational Data Store (TrI); Data Management; Data Delivery; Exploration Warehouse (DSI); Data Mining Warehouse (DSI); OLAP Data Mart (DSI); Oper Mart (DSI); Information Workshop (Library & Toolbox, Workbench); Information Feedback; Meta Data Management; Operation & Administration (Systems Management, Data Acquisition Management, Service Management, Change Management).]

Multi-Tiered Architecture

[Figure: Multi-tiered architecture with four tiers. Data Sources (operational DBs, other sources); Data Storage (Extract/Transform/Load/Refresh, Monitor & Integrator, Metadata, Data Warehouse, Data Marts); OLAP Engine (OLAP Server, serving the warehouse data); Front-End Tools (Analysis, Query, Reports, Data mining).]

Source Data Component

- Production Data: data comes from the various operational systems of the enterprise, e.g. financial systems, manufacturing systems, systems along the supply chain, and customer relationship management systems.
- Internal Data: “private” spreadsheets, documents, customer profiles, and sometimes even departmental databases.
- Archived Data: Some data is archived after a year. Sometimes data is left in the operational system databases for as long as five years.
- External Data: For example, the data warehouse of a car rental company contains data on the current production schedules of the leading automobile manufacturers. This external data in the data warehouse helps the car rental company plan for its fleet management.
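The maintenance trade-off between the Direct Data Marts and Architected Data Warehouse slides can be made concrete by counting ETL routines. The following is only a rough sketch: it assumes exactly one routine per source-to-target pair, and the function names are hypothetical, not part of any tool. Real counts depend on the ETL tooling and on how much logic is shared.

```python
def direct_routines(sources: int, marts: int) -> int:
    """Direct data marts: each source is transformed once per data mart."""
    return sources * marts


def architected_routines(sources: int, marts: int) -> int:
    """Architected warehouse (assumed model): one complex routine per source
    into the enterprise warehouse, plus one simpler routine per data mart."""
    return sources + marts


if __name__ == "__main__":
    # (sources, marts) pairs; the first matches the three-source,
    # three-mart figures in the slides, the others are illustrative.
    for sources, marts in [(3, 3), (3, 4), (10, 8)]:
        print(
            f"sources={sources} marts={marts} "
            f"direct={direct_routines(sources, marts)} "
            f"architected={architected_routines(sources, marts)}"
        )
```

Under this assumption, adding a fourth data mart to three sources costs three new routines in the direct design but only one in the architected design, which is the "easier to create new data marts" point in the pros list.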
Data Staging Component

- Data Extraction
- Data Transformation
- Data Loading

Data staging provides a place and an area with a set of functions to clean, change, combine, convert, deduplicate, and prepare source data for storage and use in the data warehouse.

Types of Metadata

- Operational metadata
- Extraction & transformation metadata
- End-user metadata

Why is metadata especially important in a data warehouse?

- First, it acts as the glue that connects all parts of the data warehouse.
- Next, it provides information about the contents and structures to the developers.
- Finally, it opens the door to the end-users and makes the contents recognizable in their own terms.

OLAP Server Architectures

- Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
  - Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - Specialized support for SQL queries over star/snowflake schemas

OLAP: On-Line Analytical Processing

- An environment for the analysis of multi-dimensional data: dice, rotate, drill-down, roll-up
- OLAP provides advanced database support involving attribute selection, attribute encoding, row sampling, and data cleansing, and allows the use of multiple different search engines
- Easy-to-use user interface
- Open system architecture using local processing power

Roll-up, Drill-down, Slicing, Dicing

Drill-Down

Summary level:

pop92    | state
         | NOR_EAS   NOR_CEN   SOUTH     WEST      Total
---------------------------------------------------------
LAR_CITY |  3.62%     8.59%    15.68%    13.28%    41.17%
MED_CITY |  3.35%     5.36%     5.18%     7.02%    20.91%
SMA_CITY |  2.58%     5.66%     4.85%     5.16%    18.25%
SUP_CITY |  8.30%     3.54%     2.54%     5.29%    19.67%
---------------------------------------------------------
Total    | 17.84%    23.15%    28.25%    30.75%   100.00%

Drilled down on state:

pop92    | state
         | E_N_CEN   E_SO_CE   MID_ATL   ...
---------------------------------------------
LAR_C    |  5.46%     2.76%     2.09%    ...
MED_C    |  3.84%     0.44%     1.38%    ...
SM_C     |  4.12%     0.92%     1.49%    ...
SUP_C    |  3.54%     0.00%     8.30%    ...
---------------------------------------------
Total    | 16.96%     4.12%    13.26%    ...

Dicing

pop92       | state
            | MID_ATL   NEW_ENG   NOR_EAS
------------------------------------------
50000~60000 | 12.26%    13.69%    25.96%
60000~70000 | 10.93%     7.13%    18.05%
70000~80000 | 10.52%    14.83%    25.35%
80000~90000 |  4.89%     9.56%    14.45%
90000~99999 |  2.79%    13.40%    16.19%
------------------------------------------
MED_CITY    | 41.39%    58.61%   100.00%

Slicing

pop92    | state
         | MID_ATL   NEW_ENG   NOR_EAS
---------------------------------------
LAR_C    | 11.72%     8.56%    20.28%
MED_C    |  7.76%    10.99%    18.75%
SM_C     |  8.34%     6.11%    14.45%
SUP_C    | 46.52%     0.00%    46.52%
---------------------------------------
Total    | 74.34%    25.66%   100.00%

Data Warehouse Design Process

- Top-down, bottom-up approaches, or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, with short turnaround time
- Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measures that will populate each fact table record

A practical approach (blend of top-down & bottom-up)

The steps in this practical approach are as follows:

1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of supermarts, one at a time
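The four choices in the typical design process (business process, grain, dimensions, measures) and the OLAP operations from the earlier slides can be tried out on a tiny in-memory fact table. This is a minimal sketch under stated assumptions: the sample rows, the column names and the helper functions `roll_up`, `slice_rows` and `dice_rows` are all hypothetical illustrations, not a real OLAP server API.

```python
from collections import defaultdict

# Step 1: business process = orders (made-up sample data).
# Step 2: grain = one row per order line.
# Step 3: dimensions = month, state, product.
# Step 4: measure = amount.
FACT_ORDERS = [
    {"month": "2024-01", "state": "NY", "product": "widget", "amount": 100.0},
    {"month": "2024-01", "state": "NJ", "product": "widget", "amount": 50.0},
    {"month": "2024-02", "state": "NY", "product": "gadget", "amount": 75.0},
    {"month": "2024-02", "state": "NY", "product": "widget", "amount": 25.0},
]


def roll_up(rows, dims, measure="amount"):
    """Sum the measure grouped by dims; all other dimensions are aggregated away.

    Passing fewer dimensions is a roll-up; passing more is a drill-down.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return [row for row in rows if row[dim] == value]


def dice_rows(rows, allowed):
    """Dice: keep rows whose values fall inside the allowed set for each dimension."""
    return [
        row for row in rows
        if all(row[dim] in values for dim, values in allowed.items())
    ]


if __name__ == "__main__":
    print(roll_up(FACT_ORDERS, ["state"]))            # rolled up to state
    print(roll_up(FACT_ORDERS, ["state", "month"]))   # drilled down by month
    print(slice_rows(FACT_ORDERS, "state", "NY"))
    print(dice_rows(FACT_ORDERS, {"state": {"NY"}, "product": {"widget"}}))
```

Because the grain is the order line, every coarser view is derived by aggregation rather than stored separately; a MOLAP engine would typically pre-compute such summaries instead of scanning the rows each time.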