Data Warehouses
BUAD/American University Data Warehouses

Definition
• Data warehouse: an integrated and consistent store of subject-oriented data that is obtained from a variety of sources and formatted into a meaningful context to support decision-making in an organization.

Need for Data Warehousing
• Integrated, company-wide view of high-quality information.
• Separation of operational and informational systems and data
  – Operational system: a system that is used to run a business in real time, based on current data.
  – Informational system: a system designed to support decision making based on stable point-in-time or historical data.

Factors Allowing Data Warehousing
• Relational DBMS.
• Advances in hardware: speed and storage capacity.
• End-user computing interfaces and tools.

Data Warehouse Architectures
• Two-level
  – Source system files containing operational data.
  – Transformed and integrated data warehouse.
• Three-level
  – Operational data.
  – Enterprise data warehouse (EDW): single source of data for decision making.
  – Data marts: limited scope; data selected from the EDW; customized decision support for individual user groups.

Figure: Generic data warehouse architecture

Figure: Three-layer architecture

Reasons for the Three-Level Architecture
• The EDW and data marts have different purposes and data architectures.
• Data transformation is complex and is best performed in two steps.
• Data marts provide customized decision support for different groups.
• Data architecture: operational data, reconciled data, derived data.

Figure: Three-layer data architecture

Data Characteristics
• Status vs. event data.
  – A transaction is a business activity that triggers one or more business events; event data capture those events.
• Transient vs. periodic data.
  – Transient data: data in which changes to existing records are written over previous records, thus destroying the previous data content.
  – Periodic data: data that are never physically altered or deleted once added.

Figure: Example of DBMS log entry

Figure: Transient operational data

Reconciled Data Characteristics
• Detailed
• Historical
• Normalized
• Enterprise-wide
• Quality controlled

The Data Reconciliation Process
• Capture: extract the relevant data from source files to fill the EDW.
  – Static: initial load.
  – Incremental: ongoing update.
• Scrub (data cleansing)
  – Missing data, name reconciliation.
  – Pattern recognition and other artificial intelligence techniques.

Figure: Steps in data reconciliation

The Data Reconciliation Process (continued)
• Transform
  – Convert the data format from the source to the target system.
  – Record-level functions
    • Selection.
    • Joining.
    • Aggregation (for data marts).
  – Field-level functions
    • Single-field transformation.
    • Multi-field transformation.

The Data Reconciliation Process (continued)
• Load and index
  – Refresh mode
    • When the warehouse is first created.
    • Static data capture.
  – Update mode
    • Ongoing update of the warehouse.
    • Incremental data capture.

Derived Data Characteristics
• Type of data
  – Detailed, possibly periodic.
  – Aggregated.
• Distributed to departmental servers.
• Implemented in star schema.

Star Schema
• Also called the dimensional model.
• Fact and dimension tables.
  – Fact table: consists of factual or quantitative data about the business.
  – Dimension table: holds descriptive data.
• Grain of a fact table: the level of detail of each record, such as the time period it covers.

Figure: Components of a star schema

Figure: Star schema example

Figure: Star schema with sample data

Figure: Example of snowflake schema

Size of the Fact Table
• Total number of stores: 1,000
• Total number of products: 10,000
• Total number of periods: 24
• Total rows: 1,000 * 10,000 * 24 = 240,000,000
• If on average 50% of items record sales:
  – Number of rows = 120,000,000

Types of Data Marts
• Dependent: populated from the EDW.
• Independent: data taken directly from the operational databases.

The User Interface
• The role of metadata.
• Traditional query and reporting tools.
• On-line analytical processing (OLAP)
  – The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.

The User Interface (continued)
  – Slicing a cube.
  – Pivot
    • Rotate the view for a particular data point to obtain another perspective.
    • E.g., take a value from the units column and obtain by-store values.
  – Drill-down

Figure: Slicing a data cube
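
The fact table sizing and the slice, pivot, and drill-down operations from the slides can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the toy sales rows and the function names `estimate_fact_rows`, `slice_cube`, and `roll_up` are hypothetical and do not come from the slides, and a real OLAP tool would run these operations against the warehouse rather than an in-memory list.

```python
from collections import defaultdict

# Figures taken from the "Size of the Fact Table" slide.
STORES = 1_000
PRODUCTS = 10_000
PERIODS = 24
SALES_FRACTION = 0.5  # share of combinations that actually record sales


def estimate_fact_rows(stores, products, periods, fraction=1.0):
    """Upper bound on fact rows, scaled by the fraction of combinations that occur."""
    return int(stores * products * periods * fraction)


# Hypothetical toy fact table: one row per (store, product, period) with sales.
# The dimension values and unit counts are invented for illustration.
SALES_FACTS = [
    {"store": "S1", "product": "P1", "period": 1, "units": 30},
    {"store": "S1", "product": "P2", "period": 1, "units": 10},
    {"store": "S2", "product": "P1", "period": 1, "units": 20},
    {"store": "S2", "product": "P1", "period": 2, "units": 25},
    {"store": "S1", "product": "P1", "period": 2, "units": 5},
]


def slice_cube(facts, **fixed):
    """Slice: keep only the rows where each named dimension has the given value."""
    return [row for row in facts if all(row[d] == v for d, v in fixed.items())]


def roll_up(facts, dimension, measure="units"):
    """Total a measure along one dimension, e.g. units by store."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dimension]] += row[measure]
    return dict(totals)


if __name__ == "__main__":
    print("Maximum rows:", estimate_fact_rows(STORES, PRODUCTS, PERIODS))
    print("Expected rows:", estimate_fact_rows(STORES, PRODUCTS, PERIODS, SALES_FRACTION))

    # Slice: fix the product dimension to a single value.
    p1_slice = slice_cube(SALES_FACTS, product="P1")

    # Pivot-style view: units for that product broken out by store.
    print("P1 units by store:", roll_up(p1_slice, "store"))

    # Drill-down: from one store's total to its per-period detail.
    s2_detail = slice_cube(SALES_FACTS, product="P1", store="S2")
    print("P1 units at S2 by period:", roll_up(s2_detail, "period"))
```

Keeping the facts as one row per combination of dimension values mirrors the star schema idea: each operation is just a filter or a grouped total over the same fact rows.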