CHAPTER 7: ARCHITECTURAL COMPONENTS

CHAPTER OBJECTIVES
- Understand data warehouse architecture
- Examine how the architectural framework supports the flow of data
- Study the functions and services of the architectural components
- Revisit the five major architectural types

UNDERSTANDING DATA WAREHOUSE ARCHITECTURE
  Architecture: Definitions
  Architecture in Three Major Areas
ARCHITECTURAL FRAMEWORK
  Architecture Supporting Flow of Data
  The Management and Control Module
TECHNICAL ARCHITECTURE
  Data Acquisition
  Data Storage
  Information Delivery
ARCHITECTURAL TYPES
  Centralized Corporate Data Warehouse
  Independent Data Marts
  Federated
  Hub-and-Spoke
  Data-Mart Bus

UNDERSTANDING DATA WAREHOUSE ARCHITECTURE

Earlier, we were introduced to the building blocks of the data warehouse. At that stage, we quickly looked at the list of components and reviewed each very briefly. In this chapter, we want to review the data warehouse architecture from different perspectives. You will study the architectural components in the order in which they enable the flow of data from the sources to the end-users as business intelligence. Then you will be able to look at each area of the architecture and examine the functions, procedures, and features in that area.

Architecture: Definitions

The structure that brings all the components of a data warehouse together is known as the architecture. In your data warehouse:
- The architecture includes the integrated data that is the centerpiece.
- The architecture includes everything that is needed to prepare the data and store it.
- The architecture includes all the means for delivering information from your data warehouse.
- The architecture is further composed of the rules, procedures, and functions that enable your data warehouse to work and fulfill the business requirements.
- The architecture is made up of the technology that empowers your data warehouse.

What is the general purpose of the data warehouse architecture?
The architecture provides the overall framework for developing and deploying your data warehouse; it is a comprehensive blueprint. The architecture defines the standards, measurements, general design, and support techniques.

Architecture in Three Major Areas

As you already know, the three major areas in the data warehouse are:
- Data acquisition
- Data storage
- Information delivery

Figure 7-1 groups the major architectural components into these three areas.

ARCHITECTURAL FRAMEWORK

Architecture Supporting Flow of Data

The data collected from the various sources moves to the staging area. What happens next? The extracted data goes through a detailed preparation process in the staging area before it is sent forward to the data warehouse to be properly stored. From the data warehouse storage, data transformed into useful information is retrieved by the users or delivered to the user desktops as required. Figure 7-2 shows the flow of data from beginning to end and also highlights the architectural components that enable the flow as the data moves along.

Figure 7-2 Architectural framework supporting the flow of data.

The Management and Control Module

This architectural component is an overall module managing and controlling the entire data warehouse environment. It is an umbrella component working at various levels and covering all the operations.

Major functions:
- Monitor all the ongoing operations.
- Recover from problems when things go wrong.
- Manage and control the data acquisition functions, ensuring that extracts and transformations are carried out correctly and in a timely fashion.
- Manage backing up significant parts of the data warehouse and recovering from failures.
- Monitor the growth of the data warehouse and periodically archive data from it.
- Govern data security and provide authorized access to the data warehouse.
- Interface with the end-user information delivery component to ensure that information delivery is carried out properly.
Figure 7-3 shows how the management component relates to and manages all of the data warehouse operations.

Figure 7-3 The management and control component.

TECHNICAL ARCHITECTURE

Let us now consider the technical architecture in each of the three major areas of the data warehouse: data acquisition, data storage, and information delivery.

Data Acquisition

This area covers the entire process of extracting data from the data sources, moving all the extracted data to the staging area, and preparing the data for loading into the data warehouse repository.

Figure 7-4 Data acquisition: technical architecture.

1) Data Flow

Flow
In the data acquisition area, the data flow begins at the data sources and pauses at the staging area. After transformation and integration, the data is ready for loading into the data warehouse repository.

Data Sources
1. Usually, the operational source systems are supported by relational DBMSs. Here you may use an SQL-based language for extracting data.
2. A fairly large number of companies have adopted ERP (enterprise resource planning) systems. ERP data sources provide an advantage in that the data from these sources is already consolidated and integrated.
3. To include data from outside sources, you will have to create temporary files to hold the data received from those sources. After reformatting and rearranging the data elements, you will have to move the data to the staging area.

Intermediary Data Stores
As data gets extracted from the data sources, it moves through temporary files. Sometimes, extracts of homogeneous data from several source applications are pulled into separate temporary files and then merged into another temporary file before being moved to the staging area. The general practice is to use flat files to extract data from operational systems.

Staging Area
This is the place where all the extracted data is put together and prepared for loading into the data warehouse.
The staging area is like an assembly plant or a construction area. In this area, you examine each extracted file, review the business rules, perform the various data transformation functions, sort and merge data, resolve inconsistencies, and cleanse the data. When the data is finally prepared either for an enterprise-wide data warehouse or one of the conformed data marts, the data temporarily resides in the staging area repository waiting to be loaded into the data warehouse repository.

2) Functions and Services

The list of functions and services in this section relates to the data acquisition area and is broken down into three groups. This is a general list. It does not indicate the extent or complexity of each function or service. For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.

List of Functions and Services

Data Extraction
- Generate automatic extract files from operational systems using replication and other techniques.
- Create intermediary files to store selected data to be merged later.
- Transport extracted files from multiple platforms.
- Reformat input from outside sources.
- Reformat input from departmental data files, databases, and spreadsheets.
- Generate common application codes for data extraction.

Data Transformation
- Map input data to data for data warehouse repository.
- Clean data, de-duplicate, and merge/purge.
- Convert data types.
- Calculate and derive attribute values.
- Aggregate data as needed.
- Resolve missing values.
- Consolidate and integrate data.

Data Staging
- Provide backup and recovery for staging area repositories.
- Sort and merge files.
- Create files as input to make changes to dimension tables.
- If data staging storage is a relational database, create and populate database.
- Resolve and create primary and foreign keys for load tables.
- If staging area storage is a relational database, extract load files.
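
To make a few of the data acquisition functions above more concrete, here is a minimal Python sketch of how extracts from two source systems might be mapped, transformed, de-duplicated, sorted, and keyed in a staging step. Everything in it is an assumption made for illustration: the sample records, the column names, the mappings `MAP_A` and `MAP_B`, and the helper functions `map_fields`, `transform`, and `stage` do not come from the chapter or from any particular tool, and a real staging area would normally work on flat files or database tables rather than in-memory lists.

```python
from datetime import date

# Hypothetical extracts from two operational systems (illustrative data only).
ORDERS_SYSTEM_A = [
    {"cust": "C001", "amount": "120.50", "order_date": "2024-01-15", "region": "east"},
    {"cust": "C002", "amount": "80.00", "order_date": "2024-01-16", "region": None},
]
ORDERS_SYSTEM_B = [
    # Same order as the second row of system A, under different column names.
    {"customer_id": "C002", "amt": "80.00", "date": "2024-01-16", "region": None},
    {"customer_id": "C003", "amt": "45.25", "date": "2024-01-17", "region": "WEST"},
]

# Assumed mappings from source column names to warehouse column names.
MAP_A = {"cust": "customer_id", "amount": "amount",
         "order_date": "order_date", "region": "region"}
MAP_B = {"customer_id": "customer_id", "amt": "amount",
         "date": "order_date", "region": "region"}


def map_fields(row, mapping):
    """Map input data to the column names used in the warehouse repository."""
    return {target: row[source] for source, target in mapping.items()}


def transform(row):
    """Convert data types, resolve missing values, and derive an attribute."""
    order_date = date.fromisoformat(row["order_date"])
    return {
        "customer_id": row["customer_id"],
        "amount": float(row["amount"]),                    # convert data type
        "order_date": order_date,                          # convert data type
        "region": (row["region"] or "UNKNOWN").upper(),    # resolve missing value
        "order_month": order_date.strftime("%Y-%m"),       # derived attribute
    }


def stage(extracts):
    """Merge the extracts, de-duplicate, sort, and assign surrogate keys."""
    merged = [
        transform(map_fields(row, mapping))
        for rows, mapping in extracts
        for row in rows
    ]
    # De-duplicate on an assumed natural key: customer, date, and amount.
    unique = {(r["customer_id"], r["order_date"], r["amount"]): r for r in merged}
    ordered = sorted(unique.values(),
                     key=lambda r: (r["order_date"], r["customer_id"]))
    # Assign a surrogate primary key to each row of the load table.
    return [{"order_key": key, **r} for key, r in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for record in stage([(ORDERS_SYSTEM_A, MAP_A), (ORDERS_SYSTEM_B, MAP_B)]):
        print(record)
```

The sketch keeps mapping, transformation, and staging as separate functions so that each group in the list above corresponds to one small step; the content and complexity of each step in a real data warehouse would be far greater.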

Data Storage

This covers the process of loading the data from the staging area into the data warehouse repository. All functions for transforming and integrating the data are completed in the data staging area. Figure 7-5 shows a summarized view of the technical architecture for data storage.

Figure 7-5 Data storage: technical architecture.

1) Data Flow

Flow
For data storage, the data flow begins at the data staging area. The transformed and integrated data is moved from the staging area to the data warehouse repository. If the data warehouse is an enterprise-wide data warehouse being built in a top-down fashion, then there could be movements of data from the enterprise-wide data warehouse repository to the repositories of the dependent data marts.

Data Groups
Prepared data waiting in the data staging area falls into two groups. The first group is the set of files or tables containing data for a full refresh. This group of data is usually meant for the initial loading of the data warehouse. Occasionally, some data warehouse tables may be refreshed fully. The other group is the set of files or tables containing ongoing incremental loads.

The Data Repository
Almost all of today’s data warehouse databases are relational databases. All the power, flexibility, and ease-of-use capabilities of the RDBMS become available for the processing of data.

2) Functions and Services

The general list of functions and services given in this section is for your guidance. It relates to the data storage area and covers the broad functions and services, without indicating the extent or complexity of each. For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.

List of Functions and Services
- Load data for full refreshes of data warehouse tables.
- Support loading into multiple tables at the detailed and summarized levels.
- Optimize the loading process.
- Provide automated job control services for loading the data warehouse.
- Provide backup and recovery for the data warehouse database.
- Provide security.
- Monitor and fine-tune the database.
- Periodically archive data from the database.

Information Delivery

This area spans a broad spectrum of methods for making information available to users. For your users, the information delivery component is the data warehouse. They do not come into contact with the other components directly. For the users, the strength of your data warehouse architecture is mainly concentrated in the flexibility of the information delivery component. The information delivery component makes it easy for the users to access the information directly from the enterprise-wide data warehouse, from the dependent data marts, or from the set of conformed data marts.

Almost all modern data warehouses provide for online analytical processing (OLAP). In this case, the primary data warehouse feeds data to proprietary multidimensional databases (MDDBs) where summarized data is kept as multidimensional cubes of information. The users perform complex multidimensional analysis using the information cubes in the MDDBs. Refer to Figure 7-6 for a summarized view of the technical architecture for information delivery.

Figure 7-6 Information delivery: technical architecture.

1) Data Flow

Flow
For information delivery, the data flow begins at the enterprise-wide data warehouse and the dependent data marts when the design is based on the top-down technique. When the design follows the bottom-up method, the data flow starts at the set of conformed data marts. Generally, data transformed into information flows to user desktops during query sessions. Sometimes, the result sets from individual queries or reports are held in proprietary data stores of the query or reporting tool vendors. More recently, progressive organizations have implemented dashboards and scorecards as part of information delivery.
Dashboards are real-time or near-real-time information display devices.

Service Locations
In your information delivery component, you may provide query services from the user desktop, from an application server, or from the database itself. This will be one of the critical decisions for your architecture design.

Data Stores
For information delivery, you may consider the following intermediary data stores:
- Proprietary temporary stores to hold results of individual queries and reports for repeated use
- Data stores for standard reporting
- Proprietary multidimensional databases

2) Functions and Services

Review the general list of functions and services given below and use it as a guide to establish the information delivery component of your data warehouse architecture. The list relates to information delivery and covers the broad functions and services. Again, it does not indicate the extent or complexity of each function or service. For the technical architecture of your data warehouse, you have to determine the content and complexity of each function or service.

- Provide security to control information access.
- Monitor user access to improve service and for future enhancements.
- Allow users to browse data warehouse content.
- Simplify access by hiding internal complexities of data storage from users.
- Automatically reformat queries for optimal execution.
- Govern queries and control runaway queries.
- Provide self-service report generation for users, consisting of a variety of flexible options to create, schedule, and run reports.
- Store result sets of queries and reports for future use.
- Make provision for the users to perform complex analysis through online analytical processing (OLAP).

ARCHITECTURAL TYPES

1) Centralized Corporate Data Warehouse

In this architecture type, a centralized enterprise data warehouse is present. There are no data marts, whether dependent or independent. Therefore, all information delivery is from the centralized data warehouse.

Figure 7-7 Overview of the components of a centralized data warehouse.

2) Independent Data Marts

In this architecture type, the data warehouse is a collection of unconnected, disparate data marts, each serving a specific department or purpose. Each data mart delivers information to its own group of users.

Figure 7-8 Overview of the components of independent data marts.

3) Federated

In the federated architectural type, common data elements in the various data marts and even data warehouses that compose the federation are integrated physically or logically. The goal is to strive for a single version of truth for the organization.

Figure 7-9 Overview of the components of a federated data warehouse.

4) Hub-and-Spoke

In this architecture type, a centralized enterprise data warehouse is present. In addition, there are data marts that depend on the enterprise data warehouse for data feed. Information delivery can, therefore, be both from the centralized data warehouse and the dependent data marts.

Figure 7-10 Overview of the components of a hub-and-spoke type of data warehouse.

5) Data-Mart Bus

In this architecture type, no distinct, single data warehouse exists. The collection of all the data marts forms the data warehouse because the data marts are conformed “super-marts”.

CHAPTER SUMMARY

- Architecture is the structure that brings all the components together.
- Data warehouse architecture consists of distinct components with the read-only data repository as the centerpiece.
- A few typical data warehouse architectural types are in use at various organizations. Broadly, these types reflect how data is stored and made available: centrally as the single enterprise data warehouse database or as a collection of cohesive data marts.
- The architectural components support the functioning of the data warehouse in the three major areas of data acquisition, data storage, and information delivery.
- Data warehouse architecture is wide, complex, and expansive, and it has several distinguishing characteristics.
- The architectural framework enables the flow of data from the data sources at one end to the user’s desktop at the other.
- The technical architecture of a data warehouse is the complete set of functions and services provided within its components. It includes the procedures and rules needed to perform the functions and to provide the services.
- The flow of data from the source systems to end-users as business intelligence depends on the architectural type.