F4: DW Architecture and Lifecycle
Erik Perjons, DSV, SU/KTH
perjons@dsv.su.se

The data warehouse architecture
[Figure: the data warehouse architecture, divided into the back room and the front room. Components, left to right: operational source systems (RK) (legacy systems, OLTP/TP systems) and external sources; data staging area (RK) (Extract, Transform, Load; back end tools); data presentation area (RK), "The data warehouse" (data warehouse, data marts, presentation (OLAP) servers; Serve); data access tools (RK) (end user applications, Business Intelligence tools: analysis/OLAP, query/reporting, data mining).]

Operational Source Systems
Operational source systems characteristics:
• the source data is often in OLTP (Online Transaction Processing) systems, also called TPS (Transaction Processing Systems)
• high level of performance and availability
• often one-record-at-a-time queries
• already occupied by the normal operations of the organisation
OLTP vs. DSS (Decision Support Systems)
OLTP vs.
OLAP (Online analytical processing)

Operational Source Systems
More operational source systems characteristics:
• an OLTP system may be reliable and consistent, but there are often inconsistencies between different OLTP systems
• different types of data formats and data structures in different OLTP systems, and different semantics

Operational Source Systems
Kimball et al's assumptions (p 7):
• Source systems are not queried in broad and unexpected ways
• Source systems maintain little historical data
• Each source system is often a natural stovepipe application

DW architecture: Data staging area
[Figure: the data warehouse architecture, as above: operational source systems, data staging area, data presentation area, data access tools.]

The Data Staging Area
Often the most complex part of the architecture, and involves...
• Extraction (E)
• Transformation (T)
• Load (L)
• Indexing
ETL tools can be used
Scripts for extraction, transformation and load are implemented

Data staging area
Extraction means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation, i.e.
transformation

Data staging area
Transformation involves…
• data conversion/transformation (specify transformation rules to convert to a common data format and common terms/semantics)
• data cleaning/cleansing
  – data scrubbing (use domain-specific knowledge, e.g. postal addresses, to check the data)
  – data auditing (discover suspicious patterns, discover violations of stated rules)
• combining data from multiple sources
• assigning warehouse (surrogate) keys
• data aggregation

Data staging area
A debated question:
Should the data in the data staging area be stored in a 3NF relational database and loaded into the presentation area for querying and reporting?
Kimball (p 8-9): a 3NF relational database in the data staging area requires more time and resources for development, periodic loading and updating, and more capacity for storing the multiple copies of the data

A Real World Example
Figure labels (data flow from the sources to the target DW):
Sources: flat file (C); DB2 table(s) (D') via DB2Connect; various source files: customer data (F), customer data (G), start balance (H), fees (manually adjusted to individual agreements) (I)
Staging area for checking, analysing, cleaning, complementing etc. of transaction data
Three star/join schemas comprising altogether 8 tables
Fact tables:
- transactions (10 attributes)
- fees (7 attributes)
- start balance (4 attributes)
Dimensional tables:
- time (7 attr)
- customer (> 40 attr)
- company (> 90 attr)
- product (13 attr)
- "Service charged" (2 attr)
SQL, C++ ??
Some cleansing and scrubbing may be needed here
DB2: Preliminary target DW (E)
+ aggregation (new program)
DB2: Final target DW (E'): E complemented with some aggregated tables

DW architecture: Data presentation area
[Figure: the data warehouse architecture, as above: operational source systems, data staging area, data presentation area, data access tools.]

Data presentation area
• What is OLAP?
• Dimensional modelling vs. 3NF modelling
• Data marts
• ROLAP/MOLAP servers

What is OLAP?
• Acronym for “On-line analytical processing”
• A decision support system (DSS) that supports ad-hoc querying, i.e. enables managers and analysts to interactively manipulate data. The idea is to allow the users to easily and quickly manipulate and visualise the data through multidimensional views, i.e. different perspectives.
[Figure: multidimensional view with the labels Service, Quarter, Office, Product and Facts. Kimball: Dimensional modelling]

Dimensional modelling
Each dimension row (1) relates to many fact rows (0..*).

Service Dimension
Key   Service        Service group
S1    Local call     Group A
S2    Intern. call   Group A
S3    SMS            Group B
S4    WAP            Group C

Time Dimension
Date/Key   Month   Quarter   Year
991011     9910    4 - 99    99
991012     9910    4 - 99    99

Sales Dimension
Key   Seller     Office
F11   Anders C   Sundsvall
F12   Lisa B     Sundsvall
F13   Janis B    Kista

Customer Dimension
Key    Customer   Address     Region      Income group
C210   Anna N     Stockholm   Stockholm   B
C211   Lars S     Malmö       Skåne       B
C212   Erik P     Rättvik     Dalarna     C
C213   Danny B    Stockholm   Stockholm   A
C214   Åsa S      Stockholm   Stockholm   A

Fact table - Transactions
Customer   Service   Sales   Date     Sum     Number of calls
C210       S1        F11     991011   25:00   3
C210       S3        F11     991011   05:00   1
C212       S2        F13     991011   89:00   1
C213       S1        F13     991011   12:00   1
C214       S4        F13     991012   08:00   1

Dimensional modelling
(Same dimension and fact tables as on the previous slide.)
Query: For how much did customers in Sthlm use service “Local call” in October 1999?
Answer: the matching fact rows are C210/S1 (25:00) and C213/S1 (12:00), so Σ = 37:00

3NF modelling vs. Dimensional modelling
Key difference between 3NF and dimensional modelling:
- the degree of normalisation
3NF modelling
- a logical design technique that eliminates data redundancy to maintain consistency and storage efficiency, and makes transactions simple and deterministic
- ER models for an enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables
Dimensional modelling
- a logical design technique that presents data in an intuitive way, i.e.
easier to navigate for the user
- allows high performance access/queries (the complexity of 3NF models overwhelms the database system's optimizer, which means bad performance) [Kimball et al, p 10-11]
- aims at modelling decision support data

Data presentation area – Data marts
Kimball et al (p.10-12 and 396)
“we refer to the presentation area as a series of integrated data marts”
“a data mart is a flexible set of data, ideally based on the most atomic (granular) data possible to extract from operational source, and presented in a symmetric (dimensional) model that is resilient when faced with unexpected user queries”
“in its most simplistic form a data mart represents data from a single business process” (business process = purchase order, store inventory and so on)

Data marts
[Figure: two data marts, Calls and Subscription orders, each with the dimensions Service, Quarter and Office.]

The data warehouse bus architecture
[Figure: data marts (e.g. Orders, Production) plugged into a bus of shared dimensions: Time, Sales Rep, Customer, Promotion, Product, Plant, Distr. Center.] [Kimball et al, p 78-79]

Data marts
• A dimensional model for a large data warehouse consists of between 10 and 25 similar-looking data marts. Each data mart will have 5 to 15 dimension tables.
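
The query from the dimensional modelling slide (customers in Stockholm, service “Local call”, October 1999) can be sketched as a star join over the example rows. This is a minimal sketch, not the course's own implementation: the table and column names, the helper `usage_minutes`, and the reading of the “Sum” column as whole minutes (25:00 stored as 25) are all assumptions made for illustration. The Sales dimension is left out because the query does not need it.

```python
import sqlite3

# Rows copied from the telecom example slides. Table and column names are
# assumptions; "Sum" is assumed to be minutes of usage (25:00 -> 25).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service_dim  (service_key TEXT PRIMARY KEY, service TEXT, service_group TEXT);
CREATE TABLE time_dim     (date_key TEXT PRIMARY KEY, month TEXT, quarter TEXT, year TEXT);
CREATE TABLE customer_dim (customer_key TEXT PRIMARY KEY, customer TEXT, address TEXT,
                           region TEXT, income_group TEXT);
CREATE TABLE transactions (customer_key TEXT, service_key TEXT, sales_key TEXT,
                           date_key TEXT, sum_minutes INTEGER, number_of_calls INTEGER);
""")
conn.executemany("INSERT INTO service_dim VALUES (?, ?, ?)", [
    ("S1", "Local call", "Group A"), ("S2", "Intern. call", "Group A"),
    ("S3", "SMS", "Group B"), ("S4", "WAP", "Group C"),
])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
    ("991011", "9910", "4-99", "99"), ("991012", "9910", "4-99", "99"),
])
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)", [
    ("C210", "Anna N", "Stockholm", "Stockholm", "B"),
    ("C211", "Lars S", "Malmö", "Skåne", "B"),
    ("C212", "Erik P", "Rättvik", "Dalarna", "C"),
    ("C213", "Danny B", "Stockholm", "Stockholm", "A"),
    ("C214", "Åsa S", "Stockholm", "Stockholm", "A"),
])
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)", [
    ("C210", "S1", "F11", "991011", 25, 3),
    ("C210", "S3", "F11", "991011", 5, 1),
    ("C212", "S2", "F13", "991011", 89, 1),
    ("C213", "S1", "F13", "991011", 12, 1),
    ("C214", "S4", "F13", "991012", 8, 1),
])


def usage_minutes(conn, region, service, month):
    """Star join: constrain on dimension attributes, aggregate the fact."""
    row = conn.execute("""
        SELECT COALESCE(SUM(f.sum_minutes), 0)
        FROM transactions f
        JOIN customer_dim c ON c.customer_key = f.customer_key
        JOIN service_dim  s ON s.service_key  = f.service_key
        JOIN time_dim     t ON t.date_key     = f.date_key
        WHERE c.region = ? AND s.service = ? AND t.month = ?
    """, (region, service, month)).fetchone()
    return row[0]


total = usage_minutes(conn, "Stockholm", "Local call", "9910")
print(total)  # 37, matching the slide's sum of 37:00
```

All constraints are placed on dimension attributes and only the fact table is aggregated, which is the query pattern the dimensional model is meant to keep simple.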
The Data marts
Kimball et al’s strong opinions (p.10-12)
• all data in the presentation area should be presented, stored and accessed in dimensional models
• the data marts must contain detailed, atomic data (it is unacceptable that the detailed data should be locked up in 3NF models for drill-down)
• the data marts' dimensions should be conformed for drill-across techniques, which tie the data marts together in the data warehouse bus architecture

The Data marts
More about data marts:
• far smaller data volumes, fewer data sources
• easier data cleaning process, faster roll-out
• allows a “piecemeal” approach to some of the enormous integration problems involved in creating an enterprise-wide data model, but complex integration in the long term

Dependent vs. Independent Data marts
[Figure: independent data marts and dependent data marts, each shown in relation to the data warehouse.]

The presentation/OLAP servers
Extended Relational DBMS (ROLAP servers)
– data stored in RDB
– star-join schemas
– support SQL extensions
– index structures
Multidimensional DBMS (MOLAP servers)
– data stored in arrays (n-dimensional array)
– direct access to array data structure
– excellent indexing properties
– poor storage utilisation, especially when the data is sparse.
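
The storage trade-off between the two server types can be illustrated with a small sketch that holds the five example transactions once as relational rows (ROLAP style) and once in a dense n-dimensional array (MOLAP style). The cube layout, the row-major `offset` function and the use of whole minutes as the measure are assumptions for illustration only; real MOLAP servers use more elaborate storage.

```python
# Dimension members from the telecom example; the cube layout is an assumption.
customers = ["C210", "C211", "C212", "C213", "C214"]
services = ["S1", "S2", "S3", "S4"]
dates = ["991011", "991012"]

# ROLAP style: one relational row per fact that actually occurred.
fact_rows = [
    ("C210", "S1", "991011", 25),
    ("C210", "S3", "991011", 5),
    ("C212", "S2", "991011", 89),
    ("C213", "S1", "991011", 12),
    ("C214", "S4", "991012", 8),
]

# MOLAP style: a dense array with one cell per combination of members.
cust_pos = {member: i for i, member in enumerate(customers)}
svc_pos = {member: i for i, member in enumerate(services)}
date_pos = {member: i for i, member in enumerate(dates)}


def offset(customer, service, date):
    """Row-major position of a cell; no search or join is needed."""
    return (cust_pos[customer] * len(services) + svc_pos[service]) * len(dates) + date_pos[date]


cube = [None] * (len(customers) * len(services) * len(dates))
for customer, service, date, minutes in fact_rows:
    cube[offset(customer, service, date)] = minutes

# Direct access to a single cell by its coordinates.
value = cube[offset("C212", "S2", "991011")]

# Storage utilisation: share of allocated cells that hold a fact.
utilisation = sum(cell is not None for cell in cube) / len(cube)

print(len(fact_rows), len(cube), value, utilisation)
```

With only 5 of the 40 cells filled, the array is mostly empty, while the relational form stores exactly 5 rows; the array in turn answers a single-cell lookup with one index computation.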
More about presentation servers
Characteristic features of data warehouse servers, according to Chaudhuri & Dayal:
• Index structures (bit map indexes, join indexes)
• SQL extensions (operators like Cube, Crossjoin)
• Materialised views (pre-aggregations)

DW architecture: Metadata repository
[Figure: the data warehouse architecture, as above, extended with Monitoring & Administration, the metadata repository, OLAP servers and a Refresh step after Extract, Transform, Load.]

What is metadata?
“Data about data” / “Information about data”
Main functions are to give...
• data definitions
• the origin of data
• the structure of data
• rules for the selection and transfer of data
• qualitative and quantitative data about data
Contained in the metadata repository

The metadata repository
An integrated, complete source of metadata
• is at the heart of the data warehouse architecture
• supports the information needs of...
  – system developers
  – data administrators
  – system administrators
  – users
  – applications on the data warehouse
• very complex data structure
• must contain full version history
• must always be up to date

Metadata life cycle activities
• Collection
  – identify and capture metadata in a central repository
• Maintenance
  – establish processes to synchronise metadata with the changing data structure
• Deployment
  – provide metadata to users in the right form and with the right tools

Different types of metadata
• Administrative metadata (includes all information necessary for setting up and using a DW, e.g.
information about source databases, DW schemas, dimensions, hierarchies, predefined queries, physical organisation, rules and scripts for extraction, transformation and load, back-end and front-end tools)
• Business metadata (business terms and definitions, ownership of data)
• Operational metadata (information collected during the operations of the DW, e.g. usage statistics, error reports)

DW architecture: End user applications
[Figure: the data warehouse architecture with Monitoring & Administration and the metadata repository, as above; the operational source systems are labelled Operational DBs.]

End user applications
• OLAP tools, BI apps, DSS
• Query/Reporting tools
• Data mining

Spreadsheet output of OLAP tool
Dimension hierarchies shown: product → product group, month → quarter, office → region

Product Group   Region   First Quarter - 1997
Group A         ABC      1245
Group A         XYZ      34534
Group B         ABC      45543
Group B         XYZ      34533

Annotations in the figure: column headers (join constraints), column header (application constraint), row headers, answer set representing focal event.

Graphical output of OLAP tool
[Figure: graphical output of an OLAP tool.]

Functionalities of OLAP tools
• Drill-down - decreasing the level of aggregation
• Drill-up/Roll-up/Consolidation - increasing the level of aggregation
• Drill-across - moving between different star-join schemas using conformed dimensions and joins
• Slicing and dicing - the ability to look at the database from different views, e.g. one slice shows all sales of a product type within regions, another slice shows all sales by sales channel within each product type
• Pivoting - e.g.
change columns to rows, rows to columns
• Ranking - sorting
“Think of an OLAP data structure as a Rubik's Cube of data that users can twist and twirl in different ways to work through what-if and what-happened scenarios” [Lee Thé]

Business Intelligence (BI) apps
Strategic
Who: strategic leaders
What: formulate strategy and monitor corporate performance
Examples: Balanced scorecard, Strategic Planning
Operational
Who: operational managers
What: execution of strategy against objectives
Examples: Budgeting, Sales forecasting
Analytical
Who: analysts, knowledge workers, controllers
What: ad-hoc analysis
Examples: Financial and Sales Analysis, Customer Segmentation, Clickstream analysis

Problems of Data Warehousing
• Complexity of integration
  – Hidden problems with source systems
  – Data homogenisation
  – Underestimation of resources for data loading
• Required data not captured
• High maintenance
• Long duration projects
• Why not integrate the legacy applications (OLTP systems) instead?

Operational Data Store (ODS)
No single universal definition...
ODS definition 1: Implemented to deliver operational reporting, especially when neither the legacy nor the modern OLTP systems provide adequate operational reports – for fixed queries and for tactical decision making
ODS definition 2: Built to support real-time interactions, especially in Customer Relationship Management applications – the traditional data warehouse typically is not in a position to support the demand for near-real-time data

OMG’s standards
Layer      Content           Example
M3 layer   Meta metamodel    Meta Object Facility (MOF)
M2 layer   Metamodel         UML Metamodel, CWM Metamodel
M1 layer   Model
M0 layer   Instances         Helen Nagy, Invoice no 34

Common Warehouse Metamodel (CWM)
[Figure: a data warehousing environment: data sources, operational data store, ETL, data warehouse, data marts, and end-use tools for analysis, reporting, visualization and data mining.]
The collection of metamodels in CWM can be used to model the whole data warehousing environment, i.e. from data sources to end-use analysis, and data warehouse management

Common Warehouse Metamodel
• Common Warehouse Metamodel (CWM) is a language specifically designed to model data warehousing and data mining applications, i.e. integrating data warehousing and business analysis (business intelligence) tools
• CWM has a lot in common with the UML metamodel but has a number of special metamodels (metaclasses), e.g. for modelling relational databases, multidimensional databases, OLAP, schema transformations, XML
[Kleppe et al, p.139-140 (2003)]

Why metamodelling?
[Figure: three modelling levels. Meta metamodel level or reference model: Event, State and Transformation, related by "consists of", "precedes" and "succeeds". Metamodel level: Function, State, Activity and Event, related by "precedes" and "succeeds". Model level: two models of the same order process, with elements such as "Order received", "Capture ordered items", "Ordered item [captured]" / "Ordered item captured", "Check material on stock", "Material on stock [checked]", "Material is on stock" / "Material is not on stock".] [Rosemann, Green, 2002]

CWM packages
Packages/Metamodels, by layer:
Management:     Warehouse Process, Warehouse Operation
Analysis:       Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
Resource:       Relational, Record, Multi-Dimensional, XML
Foundation:     Business Information, Data Types, Expressions, Keys and Indexes, Type Mapping, Software Deployment
Object Model:   Core, Behavioral, Relationships, Instance

CWM packages layers
• Object layer - base metamodels/packages, which are (re)used by the other metamodels/packages
• Foundation layer - extends the object layer with required services, which are (re)used by the other metamodels/packages, e.g. “unique key” in the Keys and Indexes metamodel/package is used by relational databases, OO databases and record-oriented databases
• Resource layer - defines metamodels/packages for various types of data resources
• Analysis layer - analysis-oriented metadata
• Management layer - describes the data warehousing process as a whole
[Poole et al, p.36-40 (2002)]

CWM packages relations
[Figure: class diagram spanning the Core package, the Datatype package and the Relational package. Classes shown: Element, ModelElement, Namespace, Classifier, Feature, StructuralFeature, Class, Attribute, Expression, ProcedureExpression, QueryExpression, ColumnSet, NamedColumnSet, QueryColumnSet, Table, View, Column.]

CWM classifier equality
Object                Relational   Record        Multi-Dimensional                XML
Package               Schema       Record file   Schema                           Schema
Classifier (Class)    Table        RecordDef     Dimension, Dimensioned Object    Element Type
Feature (Attribute)   Column       Field                                          Attribute

More about CWM
[Figure: the metamodels of Tool X, Tool Y and Tool Z are each related to a common representation, the CWM packages (<<metamodels>>).]

Business Dimensional Lifecycle
[Figure: lifecycle diagram with the following boxes:]
• Project Planning
• Business Requirement Definition
• Technical Architecture Design
• Product Selection & Installation
• Dimensional Modeling
• Physical Design
• Data Staging Design & Development
• End-User Application Specification
• End-User Application Development
• Deployment
• Maintenance and Growth
• Project Management

The Data Warehouse Architecture Framework
Architecture areas (columns): Data, Back room, Front room, Infrastructure. Levels of detail (rows):

Business reqs and audit
• Data: Info needed for better decisions. Enterprise models
• Back room: How get, transform, make available data
• Front room: Major business issues. How measure. How analyse
• Infrastructure: HW/SW capabilities needed vs what we have

Architecture models and documents
• Data: Focal events, facts, dimensions. Dimensional models
• Back room: Capabilities needed to get and transform data. Major data stores
• Front room: User’s needs. Major classes of analyses. Priorities
• Infrastructure: Where is data coming from. Calc and storage reqs

Detailed models and specs
• Data: Logical and physical models. Domains, derivation rules
• Back room: Standards, prods to provide capabilities. How hook together
• Front room: Report layouts, derivation. For whom, when
• Infrastructure: How interact with capabilities. System utilities, calls, APIs ...

Implementation
• Data: DB, indexes, backup ...
• Back room: Write extracts, loads. Automate process
• Front room: Implement report and analysis env. Build rpt. Train users
• Infrastructure: Install, test infrastructure. Connect sources to targets to desktop