1 Course Objectives
The main objective of this course is to provide students with key knowledge of data warehousing. Other objectives are as follows:
• Understand the concepts and principles of the data warehouse.
• Reconcile different views of the same data.
• Evaluate the fundamental theories and requirements that influence the design of modern data warehouses.
• Critically evaluate alternative designs and architectures for data warehouses.
• Analyze the background processes involved in queries and transactions, and explain how these impact database operation and design.

2 Course Learning Outcomes
Upon completion of the course, students will be able to:
• Explain accepted data warehouse terminology
• Explain the goals of data warehousing
• Identify the stages of the data warehousing lifecycle
• Apply the star schema model to a business case problem
• Denormalize relational tables into high-level summary tables
• Design and implement a multi-dimensional data cube using SQL Server Analysis Services

3 About Theory Course
Course Code: CSC-454
Course Title: Data Warehousing
Credit Hours: 3
Abbreviation: DWH
Prerequisite: Database Management System (CSC 220)
Type of Course: Elective
Course Description: Introduction to Data Warehousing, Normalization, De-Normalization, De-Normalization Techniques, Issues of De-Normalization, Online Analytical Processing (OLAP), Extract Transform Load (ETL), Data Cleansing, Data Quality Management (DQM), DWH Lifecycle

4 Course Assessment
Quizzes: 10%
Assignments (Theoretical): 20%
Midterm Examination: 20%
Final Examination: 50%
Total: 100%

5 Books
• “Data Warehousing Fundamentals for IT Professionals”, 2nd ed., by Paulraj Ponniah, 2010
• “The Data Warehouse Toolkit”, latest edition, by Ralph Kimball and Margy Ross, 2013
• “Mastering Data Warehouse Design”, latest edition, by Claudia Imhoff

6 Reference Books
• “Building the Data Warehouse” by William H. Inmon
• “The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design” by Bill Inmon and Francesco Puppini, 2020
• “The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform” by Matt How, 2020

7 Today’s Outline
• Why DWH?
• What is DWH?
• History
• Inmon vs Kimball data cycle
• Characteristics of DWH
• Traditional DWH
• Problems of data warehousing

8 Introduction to Data Warehousing

10 Data, data everywhere yet…

11 Why study Data Warehouse?
• New technologies (multidimensional modeling, business intelligence, OLAP, querying models, etc.)
• Research potential (data mining, business intelligence, ETL algorithms, multidimensional data analysis, query optimization, etc.)
• Industry demand
• High market value of DW experts
• Fulfill degree requirements
• Easy course, want to sleep through it

12 What is a Data Warehouse?
• A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

13 What is a Data Warehouse?
• A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

14 What is a Data Warehouse?
• A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.
• It usually contains historical data derived from transaction data, but it can include data from other sources.
• It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

15 What is DWH
• A collection of small databases.
• A central location where consolidated data from multiple locations (databases) is stored.
• Strategic information/data is saved in the DWH.
  – E.g. total sales, most popular products, least popular products.
• Note: the data warehouse is not loaded every time new data is added to a database.

16 Data Warehouse
• In addition to a relational database, a data warehouse environment includes:
  1. an Extraction, Transformation, and Loading (ETL) solution.
  2. an Online Analytical Processing (OLAP) engine.
  3. client analysis tools.
  4. other applications that manage the process of gathering data and delivering it to business users.

17 History of Data Warehousing

19 Definition by Founder of DWH
Bill Inmon
• “A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”
Ralph Kimball
• “A copy of transaction data specifically structured for query and analysis.”
• Dimensional modelling focuses on ease of end-user accessibility.

22 Inmon vs Kimball
Parameter | Kimball | Inmon
Introduced by | Ralph Kimball | Bill Inmon
Approach | Bottom-up approach to implementation | Top-down approach to implementation
Data Integration | Focuses on individual business areas | Focuses on enterprise-wide areas
Building Time | Efficient; takes less time | Complex; consumes a lot of time
Cost | Iterative steps; cost-effective | Initial cost is huge; development cost is low
Skills Required | No specialized skills needed; a generic team will do the job | Specialized skills are needed to make it work
Maintenance | Maintenance is difficult | Maintenance is easy
Data Model | Prefers data in a de-normalized model | Prefers data in a normalized model
Data Store Systems | Source systems are highly stable | Source systems have a high rate of change
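
The Data Model row of the comparison can be made concrete with a small sketch. The example below builds a tiny de-normalized (dimensional, Kimball-style) star schema in an in-memory SQLite database and runs one analytical query against it. Every table name, column name and data value here (`fact_sales`, `dim_product`, `dim_date`, and so on) is a hypothetical illustration and not part of the course material; a normalized design would typically move `category_name` into its own table and need an extra join.

```python
import sqlite3


def build_star_schema(conn):
    """Create and load a tiny star schema. All names and values are hypothetical."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key   INTEGER PRIMARY KEY,
            product_name  TEXT,
            category_name TEXT   -- de-normalized: category is stored in the dimension row
        );
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            year     INTEGER,
            month    INTEGER
        );
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key    INTEGER REFERENCES dim_date(date_key),
            quantity    INTEGER,
            amount      REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
        (1, "Laptop", "Electronics"),
        (2, "Phone", "Electronics"),
        (3, "Desk", "Furniture"),
    ])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
        (20230115, 2023, 1),
        (20240220, 2024, 2),
    ])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
        (1, 20230115, 2, 2000.0),
        (2, 20230115, 1, 500.0),
        (3, 20240220, 4, 800.0),
        (1, 20240220, 1, 1000.0),
    ])


def sales_by_category_and_year(conn):
    """Total sales amount per product category and year (one join per dimension)."""
    return conn.execute("""
        SELECT p.category_name, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date    AS d ON d.date_key = f.date_key
        GROUP BY p.category_name, d.year
        ORDER BY p.category_name, d.year
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
for row in sales_by_category_and_year(conn):
    print(row)
```

The fact table holds only keys and measures, while descriptive attributes sit in the dimension tables, so a question such as "total sales per category per year" needs just one join per dimension.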

24 Characteristics of Data Warehousing
• For the characteristics, we refer to Inmon’s definition (1993):
  “A data warehouse is a Subject-Oriented, Integrated, Time-Variant and Non-Volatile collection of data in support of management’s decision-making process.”
• So the characteristics of a good data warehouse are:
  1. Subject-Oriented (detailed information about a particular subject)
  2. Integrated (different source systems)
  3. Non-Volatile (unchangeable)
  4. Time-Variant (historical data, change over time)

25 1. Subject Oriented:
• A data warehouse is organized around the major subjects of the enterprise or specific business areas (e.g. customers, products, sales) rather than the major application areas (e.g. customer invoicing, stock control, product sales).
• This is reflected in the need to store decision-support data rather than application-oriented data.
• Example:
  – To learn more about your company’s sales data, you can build a warehouse that concentrates on SALES.
  – Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"
  – This ability to define a data warehouse by subject matter, SALES in this case, makes the data warehouse subject oriented.

26 2. Integrated Data:
• The data warehouse integrates corporate application-oriented data from different source systems, which often includes data that is inconsistent.
• The integrated data source must be made consistent to present a unified view of the data to the users.

27 3. Nonvolatile:
• Nonvolatile means that, once entered into the warehouse, data should not change.
• Data in the warehouse is not updated in real time but is refreshed from operational systems on a regular basis.
• New data is always added as a supplement to the database, rather than as a replacement.
• This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

28 4. Time Variant:
• In order to discover trends in business, analysts need large amounts of data.
• This is very much in contrast to Online Transaction Processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
• A data warehouse’s focus on change over time is what is meant by the term time variant.

29 Traditional DWH
• Central repository for all internal data in a company
• Overall relational schema
• Predictable data structure and quality optimize processing and reporting
• Data is stored in disk-block format
• The fundamental operation is reading a row
• Indexing via B-tree

30 Key Trends Breaking Traditional DWH
• Big data technology enables IT to leverage multiple sources of data. Following are some of the sources

31 Problems of Data Warehousing
1. Underestimation of resources for data loading:
  – Many developers underestimate the time required to extract, clean and load the data into the warehouse.
  – This process may account for a significant proportion of the total development time, although better data cleansing and management tools should ultimately reduce the time and effort spent.
2. Hidden problems with source systems:
  – Hidden problems associated with the source systems feeding the data warehouse may be identified, possibly after years of being undetected.
  – The developer must decide whether to fix the problem in the data warehouse and/or fix the source systems.
  • For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data, even when the data is available and applicable.

32 Problems of Data Warehousing
3. Required data not captured:
  – Data warehouse projects often highlight a requirement for data not being captured by the existing source systems.
  – The organization must decide whether to modify the OLTP systems or create a system dedicated to capturing the missing data.
4. Increased end-user demands:
  – After end users receive query and reporting tools, requests for support from IS (Information Systems) staff may increase rather than decrease.
  – This is caused by the users’ increasing awareness of the capabilities and value of the data warehouse.
  – This problem can be partially resolved by investing in easier-to-use, more powerful tools, or by providing better training for the users.

33 Problems of Data Warehousing
5. High demand for resources:
  – A data warehouse can use a large amount of disk space.
  – Many relational databases used for decision support are designed around star, snowflake and starflake schemas. These approaches result in the creation of very large fact tables.
6. High maintenance:
  – Data warehouses are high-maintenance systems.
  – Any reorganization of the business processes and source systems may affect the data warehouse.
  – To remain a valuable resource, a data warehouse must remain consistent with the organization that it supports.

34 Problems of Data Warehousing
7. Long-duration projects:
  – A data warehouse represents a single data resource for the organization.
  – However, building a warehouse can take up to three years, which is why some organizations build data marts instead.
  – Data marts support only the requirements of a particular department or functional area and can therefore be built more rapidly.
8. Complexity of integration:
  – The most important area for the management of the data warehouse is its integration capabilities.
  – This means an organization must spend a significant amount of time determining how well the various data warehousing tools can be integrated into the overall solution that is needed.

35 Modern Data Warehousing