Data Warehousing Concepts
Dr. Awad Khalil
Computer Science Department, AUC
Object_Oriented Databases, by Dr. Khalil

Content
- Why Data Warehousing?
- What Is Data Warehousing?
- Benefits of Data Warehousing
- Problems of Data Warehousing

Why Data Warehousing?
- Corporate decision-makers require access to all the organization’s data, wherever it is located.
- Comprehensive analysis of the organization, its business, its requirements, and any trends requires access not only to the current values in the database but also to historical data.
- To facilitate this type of analysis, the data warehouse was created to hold data drawn from several data sources, maintained by different operating units, together with historical and summary transformations.
- However, decision-makers also require powerful analysis tools. Two main types of analysis tools have emerged over the last few years:
  - Online Analytical Processing (OLAP)
  - Data Mining

What Is Data Warehousing?
Data warehousing: A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
- Subject-oriented: the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales).
- Integrated: source data from different enterprise-wide application systems is brought together.
- Time-variant: data in the warehouse is only accurate and valid at some point in time or over some time interval.
- Non-volatile: the data is not updated in real time but refreshed from operational systems on a regular basis.

Benefits of Data Warehousing
- Potential high returns on investment
- Competitive advantage
- Increased productivity of corporate decision-makers
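
The time-variant and non-volatile properties in the definition of data warehousing can be illustrated with a small sketch. It is not taken from the lecture: the names `CustomerSnapshot`, `refresh`, and `city_as_of`, and the sample data, are hypothetical. The assumption is that each periodic refresh appends dated snapshots to the warehouse instead of overwriting values in place, so earlier values stay queryable.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerSnapshot:
    """One warehouse row: a value plus the load date from which it is valid."""
    customer_id: int
    city: str
    valid_from: date


def refresh(warehouse, operational, load_date):
    """Periodic refresh: append a snapshot for every new or changed customer.

    Existing warehouse rows are never modified (non-volatile).
    """
    latest = {}
    for snap in warehouse:  # rows are kept in load order
        latest[snap.customer_id] = snap
    added = [
        CustomerSnapshot(cid, city, load_date)
        for cid, city in operational.items()
        if cid not in latest or latest[cid].city != city
    ]
    return warehouse + added


def city_as_of(warehouse, customer_id, when):
    """Return the city that was valid for a customer on a given date (time-variant)."""
    candidates = [
        s for s in warehouse
        if s.customer_id == customer_id and s.valid_from <= when
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.valid_from).city


# Hypothetical operational (OLTP) data: holds current values only.
operational = {1: "Cairo"}
warehouse = refresh([], operational, date(2024, 1, 1))

operational[1] = "Giza"  # the operational system overwrites in place
warehouse = refresh(warehouse, operational, date(2024, 2, 1))
```

After the second refresh the operational store only knows the current city, while the warehouse can still answer a query such as `city_as_of(warehouse, 1, date(2024, 1, 15))` from its historical rows.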

OLTP versus Data Warehousing

OLTP Systems                          | Data Warehousing
--------------------------------------+----------------------------------------------------
Holds current data                    | Holds historical data
Stores detailed data                  | Stores detailed, lightly, and highly summarized data
Data is dynamic                       | Data is largely static
Repetitive processing                 | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput  | Medium to low level of transaction throughput
Predictable pattern of usage          | Unpredictable pattern of usage
Transaction-driven                    | Analysis-driven
Application-oriented                  | Subject-oriented
Supports day-to-day decisions         | Supports strategic decisions
Serves a large number of users        | Serves a relatively low number of managerial users

Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long-duration projects
- Complexity of integration

Data Warehouse Architecture
- Operational Data
- Operational Data Store (ODS)
- Load Manager
- Warehouse Manager
- Query Manager
- Detailed Data
- Lightly and Highly Summarized Data
- Archive/Backup Data
- Metadata
- End-User Access Tools

Data Warehouse Architecture (Cont’d)
Operational Data
- Mainframe operational data held in first-generation hierarchical and network databases.
- Departmental data held in proprietary file systems such as VSAM and RMS, and in relational DBMSs such as Oracle and DB2.
- Private data held on workstations and private servers.
- External systems such as the Internet, commercially available databases, or databases associated with an organization’s suppliers or customers.
Operational Data Store (ODS)
The ODS is a repository of current and integrated operational data used for analysis.
It is often structured and supplied with data in the same way as the data warehouse, but may in fact act simply as a staging area for data to be moved into the data warehouse.

Data Warehouse Architecture (Cont’d)
Load Manager
- The load manager (also called the frontend component) performs all the operations associated with the extraction and loading of data into the warehouse.
- The operations performed by the load manager may include simple transformations of the data to prepare it for entry into the warehouse.
- This component varies in size and complexity between data warehouses and may be constructed using a combination of vendor data loading tools and custom-built programs.

Data Warehouse Architecture (Cont’d)
Warehouse Manager
- The warehouse manager performs all the operations associated with the management of the data in the warehouse.
- This component is constructed using vendor data management tools and custom-built programs.
- The operations performed by the warehouse manager include:
  - Analysis of data to ensure consistency.
  - Transformation and merging of source data from temporary storage into data warehouse tables.
  - Creation of indexes and views on base tables.
  - Generation of denormalizations (if necessary).
  - Generation of aggregations (if necessary).
  - Backing up and archiving data.
Query Manager
- The query manager (also called the backend component) performs all the operations associated with the management of user queries.
- This component is typically constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs.

Data Warehouse Architecture (Cont’d)
Detailed Data
This area of the warehouse stores all the detailed data in the database schema.
In most cases, the detailed data is not stored online but is made available by aggregating the data to the next level of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.
Lightly and Highly Summarized Data
- This area stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
- It is a transient area, as it is subject to change on an ongoing basis in order to respond to changing query profiles.
- The purpose of summary information is to speed up query performance.

Data Warehouse Architecture (Cont’d)
Archive/Backup Data
This area of the warehouse stores the detailed and summarized data for the purpose of archiving and backup.
Metadata
This area stores all the metadata (data about data) definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes, including:
- The extraction and loading processes.
- The warehouse management process.
- The query management process.

Data Warehouse Architecture (Cont’d)
End-User Access Tools
- The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the warehouse using end-user access tools.
- The data warehouse must efficiently support ad hoc and routine analysis.
- Although the definitions of end-user access tools can overlap, the tools can be categorized into five main groups:
  - Reporting and query tools;
  - Application development tools;
  - Executive Information Systems (EIS) tools;
  - Online Analytical Processing (OLAP) tools;
  - Data mining tools.

Data Warehouse Data Flows
Data warehousing focuses on the management of five primary data flows, namely the inflow, upflow, downflow, outflow, and metaflow:
- Inflow: Extracting, cleansing, and loading of the source data.
- Upflow: Adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
- Downflow: Archiving and backing up the data in the warehouse.
- Outflow: Making the data available to end-users.
- Metaflow: Managing the metadata.

Data Warehouse Tools and Technologies
Extraction, Cleansing, and Transformation Tools
- Selecting the correct extraction, cleansing, and transformation tools is a critical step in the construction of a data warehouse.
- The tasks of capturing data from a source system, cleansing and transforming it, and then loading the results into a target system can be carried out either by separate products or by a single integrated solution. Integrated solutions fall into one of the following categories:
  - Code generators;
  - Database replication tools;
  - Dynamic transformation engines.

Data Warehouse Tools and Technologies (Cont’d)
Data Warehouse DBMS Requirements
The specialized requirements for a relational DBMS suitable for data warehousing are as follows:
- Load performance and load processing
- Data quality management
- Query performance
- Terabyte scalability and mass user scalability
- Networked data warehouse
- Warehouse administration
- Integrated dimensional analysis
- Advanced query functionality
Parallel DBMSs
- Data warehousing requires the processing of enormous amounts of data, and parallel database technology offers a way to provide the necessary growth in performance.
- The success of parallel DBMSs depends on the efficient operation of many resources, including processors, memory, disks, and network connections.
- The aim behind using a parallel DBMS is to solve decision-support problems using multiple nodes working on the same problem.
- The major characteristics of parallel DBMSs are scalability, operability, and availability.
- The parallel DBMS performs many database operations simultaneously, splitting individual tasks into smaller parts so that tasks can be spread across multiple processors.
- Parallel DBMSs must be capable of running parallel queries.
- Parallel DBMSs must be capable of parallel data loading, table scanning, and data archiving and backup.

Data Warehouse Tools and Technologies (Cont’d)
Data Warehouse Metadata
- The major purpose of metadata is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse.
- Metadata has several functions within the warehouse that relate to the processes associated with data transformation and loading, data warehouse management, and query generation.
- The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data.
- The metadata associated with data management describes the data as it is stored in the warehouse.
- The metadata is also required by the query manager to generate appropriate queries. In turn, the query manager generates additional metadata about the queries that are run, which can be used to generate a history of all the queries and a query profile for each user, group of users, or the data warehouse.
- There is also metadata associated with the users of queries that includes, for example, information describing what the term ‘price’ or ‘customer’ means in a particular database and whether the meaning has changed over time.
Synchronizing Metadata
- The major integration issue is how to synchronize the various types of metadata used throughout the data warehouse.
- The various tools of a data warehouse generate and use their own metadata, and to achieve integration, these tools must be capable of sharing their metadata.
- The challenge is to synchronize metadata between different products from different vendors using different metadata stores.

Data Warehouse Tools and Technologies (Cont’d)
Administration and Management Tools
A data warehouse requires tools to support the administration and management of such a complex environment. These tools must be capable of supporting the following tasks:
- Monitoring data loading from multiple sources;
- Data quality and integrity checks;
- Managing and updating metadata;
- Monitoring database performance to ensure efficient query response times and resource utilization;
- Auditing data warehouse usage to provide user chargeback information;
- Replicating, subsetting, and distributing data;
- Maintaining efficient data storage management;
- Purging data;
- Archiving and backing up data;
- Implementing recovery following failures;
- Security management.

Data Marts
A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function. The characteristics that differentiate data marts from data warehouses include:
- A data mart focuses only on the requirements of users associated with one department or business function;
- Data marts do not normally contain detailed operational data, unlike data warehouses;
- As data marts contain less data than data warehouses, they are more easily understood and navigated.

Reasons for Creating a Data Mart
- To give users access to the data they need to analyze most often;
- To provide data in a form that matches the collective view of the data by a group of users in a department or business function;
- To improve end-user response time due to the reduction in the volume of data to be accessed;
- To provide appropriately structured data as dictated by the requirements of end-user access tools such as OLAP and data mining tools, which may require their own internal database structures.
- Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse.
- The cost of implementing data marts is normally less than that required to establish a data warehouse.
- The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project than for a corporate data warehouse project.

Data Mart Issues
- Data mart functionality: The capabilities of data marts have increased with the growth in their popularity. Rather than being simply small, easy-to-access databases, some data marts must now be scalable to hundreds of gigabytes and provide sophisticated analysis using OLAP and/or data mining tools.
- Data mart size: Users expect faster response times from data marts than from data warehouses; however, performance deteriorates as data marts grow in size.
- Data mart load performance: A data mart has to balance two critical components: end-user response time and data loading performance.
- Users’ access to data in multiple data marts: One approach is to replicate data between different data marts or, alternatively, to build virtual data marts. Virtual data marts are views of several physical data marts or of the corporate data warehouse, tailored to meet the requirements of a specific group of users.
- Data mart Internet/Intranet access: Internet/Intranet technology offers users low-cost access to data marts and the data warehouse using Web browsers.
- Data mart administration: As the number of data marts in an organization increases, so does the need to centrally manage and coordinate data mart activities.
- Data mart installation: Data marts are becoming increasingly complex to build. Vendors are offering products referred to as “data marts in a box” that provide a low-cost source of data mart tools.
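
The idea of a virtual data mart as a view over the corporate data warehouse can be sketched with Python’s built-in `sqlite3` module. This is only an illustration under stated assumptions: the table `warehouse_sales`, the view `retail_mart`, and the sample rows are hypothetical and do not come from the lecture, and a real warehouse would use a full-scale DBMS rather than an in-memory SQLite database.

```python
import sqlite3

# In-memory database standing in for the corporate data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detailed sales data held in the warehouse.
    CREATE TABLE warehouse_sales (
        sale_date  TEXT,
        department TEXT,
        product    TEXT,
        amount     REAL
    );
    INSERT INTO warehouse_sales VALUES
        ('2024-01-05', 'retail',    'laptop', 1200.0),
        ('2024-01-09', 'retail',    'mouse',    25.0),
        ('2024-01-11', 'wholesale', 'laptop', 9000.0),
        ('2024-02-02', 'retail',    'laptop', 1100.0);

    -- Virtual data mart: a view restricted to one department and
    -- summarized by month and product. No data is copied.
    CREATE VIEW retail_mart AS
        SELECT substr(sale_date, 1, 7) AS month,
               product,
               SUM(amount) AS total
        FROM warehouse_sales
        WHERE department = 'retail'
        GROUP BY substr(sale_date, 1, 7), product;
""")

# Users of the retail data mart query the view, not the warehouse table.
rows = conn.execute(
    "SELECT month, product, total FROM retail_mart ORDER BY month, product"
).fetchall()
for month, product, total in rows:
    print(month, product, total)
```

Because the view stores no data of its own, it avoids replicating data between marts, but every query against it is evaluated on the underlying warehouse table, which is one side of the response-time trade-off noted under data mart size.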

Thank you