REQUIREMENTS AS THE DRIVING FORCE FOR DATA WAREHOUSING CHAPTER OBJECTIVES Understand why business requirements are the driving force Discuss how requirements drive every development phase Specifically learn how requirements influence data design Review the impact of requirements on architecture Note the special considerations for ETL and metadata Examine how requirements shape information delivery DEFINING THE BUSINES REQUIREMENTS In several ways, building a data warehouse is very different from building an operational system. This becomes notable especially in the requirements gathering phase. Because of this difference, the traditional methods of collecting requirements that work well for operational systems cannot be applied to data warehouses. Dimensional Nature of Business Data In data warehousing system, the users are generally unable to define their requirements clearly. users cannot define precisely what information they really want from the data warehouse, nor can they express how they would like to use the information or process it. Managers think of the business in terms of business dimensions. If your users of the data warehouse think in terms of business dimensions for decision making, you should also think of business dimensions while collecting requirements. Dimensional Nature of Business Data Figure 5-2 shows the analysis of sales units along the three business dimensions of product, time, and geography. Examples of business dimensions information package diagrams. Information Packages – novel idea for determining and recording information requirements for a data warehouse. New methodology for determining requirements for a data warehouse system based on business dimensions. Using the new methodology, you come up with the measurements and the relevant dimensions that must be captured and kept in the data warehouse.You come up with what is known as an information package for the specific subject. Example :information package for analyzing sales for a certain business. Example :information package for analyzing sales for a certain business The subject here is sales. The measured facts or the measurements that are of interest for analysis are shown in the bottom section of the package diagram. In this case, the measurements are actual sales, forecast sales, and budget sales. The business dimensions along which these measurements are to be analyzed are shown at the top of diagram as column headings. In our example, these dimensions are time, location, product, and demographic age group. Each of these business dimensions contains a hierarchy or levels. For example, the time dimension has the hierarchy going from year down to the level of individual day. The other intermediary levels in the time dimension could be quarter, month, and week. These levels or hierarchical components are shown in the information package diagram. IPD enables you to…. Define the common subject areas Design key business metrics Decide how data must be presented Determine how users will aggregate or roll up Decide the data quantity for user analysis or query Decide how data will be accessed Establish data granularity Estimate data warehouse size Determine the frequency for data refreshing Ascertain how information must be packaged Dimension Hierarchies/Categories When a user analyzes the measurements along a business dimension, the user usually would like to see the numbers first in summary and then at various levels of detail. What the user does here is to traverse the hierarchical levels of a business dimension for getting the details at various levels. the hierarchy of the time dimension consists of the levels of year, quarter, and month. The dimension hierarchies are the paths for drilling down or rolling up in our analysis. Within each major business dimension there are categories of data elements that can also be useful for analysis. Hierarchies and categories are included in the information packages for each dimension. Business dimensions for automobile manufacturer Structure for Key Measurements Key measurements are the measures that are used for business analysis and monitoring. Users measure performance by using and comparing key measurements. For example, if an information package diagram has product, customer, time, and location as the business dimensions, these four dimensions will be four distinct components in the structure of the data model. Fundamental principle Figure 6-1 Business requirements as the driving force. DATA DESIGN In the data design phase you come up with the data model for the following data repositories: The staging area where you transform, cleanse, and integrate the data from the source systems in preparation for loading into the data warehouse repository. The data warehouse repository itself. Figure 6-2 Requirements driving the data model. Figure 6-2. The base of the pyramid represents the data model for the enterprise-wide data repository and the top half of the pyramid denotes the dimensional data model for the data marts. What do you need in the requirements definition to build the two halves of the pyramid? Two basic pieces of information are needed: the source system data models and the information package diagrams. If you are adopting the practical approach of building your data warehouse as a conglomeration of conformed data marts, your data model at this point will consist of the dimensional data model for your first set of data marts. On the other hand, your company may decide to build the large corporate-wide data warehouse first along with the initial data mart fed by the large data warehouse. In this case, your data model will include both the data model for the large data warehouse and the data model for the initial data mart. THE ARCHITECTURAL PLAN Planning the architecture: 1) Involves reviewing each of the major architectural components 2) Involves the interfaces among the various components 3) How can the management and control module be designed to coordinate and control the functions of the different components? 4) What is the information you need to do the planning? 5) How will you know to size up each component and provide the appropriate infrastructure to support it? Of course, the answer is business requirements. All the information you need to plan the architecture must come from the requirements definition. Composition of the Components Let us review each component and ascertain what exactly is needed in the requirements definition to plan for the data warehouse architecture. In the following list, the points under each component indicate the type of information that must be contained in the requirements definition to drive the architectural plan. Figure 6-4 provides a useful summary of the architectural components driven by requirements. The figure indicates the impact of business requirements on the data warehouse architecture. Figure 6-4 Impact of requirements on architecture Special Considerations Having reviewed the impact of requirements on the architectural components in some detail, we now turn our attention to a few functions that deserve special consideration. We need to bring out these special considerations because if these are missed in the requirements definition, serious consequences will occur. When you are in the requirements definition phase, you have to pay special attention to these factors. Data Extraction/Transformation/Loading (ETL). Data Extraction: Clearly identify all the internal data sources. Specify all the computing platforms and source files from which the data is to be extracted. If you are going to include external data sources, determine the compatibility of your data structures with those of the outside sources. Also indicate the methods for data extraction. Data Extraction/Transformation/Loading (ETL). Data Transformation. Many types of transformation functions are needed before data can be mapped and prepared for loading into the data warehouse repository. functions include input selection, separation of input structures, normalization and denormalization of source structures,and conversions of names and addresses. this turns out to be a long and complex list of functions. Examine each data element planned to be stored in the data warehouse. Data Extraction/Transformation/Loading (ETL). Data Loading. Define the initial load. Determine how often each major group of data must be kept up-to-date in the data warehouse. How much of the updates will be nightly updates? Does your environment warrant more than one update cycle in a day? How are the changes going to be captured in the source systems? Define how the daily, weekly, and monthly updates will be initiated and carried out. Data Quality Bad data leads to bad decisions. if the data quality of your data warehouse is suspect, the users will quickly lose confidence and flee the data warehouse. Data quality in a data warehouse is sacrosanct. Therefore, right in the early phase of requirements definition, identify potential sources of data pollution in the source systems. Also, be aware of all the possible types of data quality problems likely to be encountered in your operational systems. Data Pollution Sources System conversions and migrations Heterogeneous systems integration Inadequate database design of source systems Data aging Incomplete information from customers Input errors Internationalization/localization of systems Lack of data management policies/procedures Types of Data Quality Problems Dummy values in source system fields Absence of data in source system fields Multipurpose fields Cryptic data Contradicting data Improper use of name and address lines Violation of business rules Reused primary keys Nonunique identifiers Metadata. Metadata in a data warehouse is much more than details that can be carried in a data dictionary or data catalog. Metadata acts as a glue to tie all the components together. When data moves from one component to another, that movement is governed by the relevant portion of metadata. When a user queries the data warehouse, metadata acts as the information resource to connect the query parameters with the database components. we had categorized the metadata in a data warehouse into three groups: operational, data extraction and transformation, and end-user. Figure 6-5 displays the impact of business requirements on the metadata architectural component. Figure 6-5 Impact of requirements on metadata. DATA STORAGE SPECIFICATIONS If your company is adopting the top-down approach , define the storage specifications for: The data staging area The overall corporate data warehouse Each of the dependent data marts, beginning with the first Any multidimensional databases for OLAP if your company adopting the bottom-up approach, you need specifications for: The data staging area Each of the conformed data marts, beginning with the first Any multidimensional databases for OLAP DBMS Selection Whatever your choice of the database management system may be that system will have to interact with back-end and front-end tools. The back-end tools are the products for data transformation, data cleansing, and data loading. The front-end tools relate to information delivery to the users. If you are trying to find the best tools to suit your environment, the chances are these tools may not be from the same vendors who supplied the database products. Therefore, one important criterion for the database management system is that the system must be open. It must be compatible with the chosen back-end and frontend tools. Broadly, the following elements of business requirements affect the choice of the DBMS: The following elements of business requirements affect the choice of the DBMS: Level of User Experience. If the users are totally inexperienced with database systems, the DBMS must have features to monitor and control runaway queries. If many of your users are power users, then they will be formulating their own queries. Types of Queries. The DBMS must have a powerful optimizer if most of the queries are complex and produce large result sets. Alternatively, if there is an even mix of simple and complex queries, there must be some sort of query management in the database software to balance the query execution. Data Loads. The data volumes and load frequencies determine the strengths in the areas of data loading, recovery, and restart. Metadata Management. If your metadata component does not have to be elaborate, then a DBMS with an active data dictionary may be sufficient. Data Repository Locations. Is your data warehouse going to reside in one central location, or is it going to be distributed? The answer to this question will establish whether the selected DBMS must support distributed databases. Data Warehouse Growth. Your business requirements definition must contain information on the estimated growth in the number of users, and in the number and complexity of queries. Storage Sizing How big will your data warehouse be? How much storage will be needed for all the data repositories? What is the total storage size? Answers to these questions will impact the type and size of the storage medium. How do you find answers to these questions? You need to estimate the storage sizes for the following in the requirements definition phase: Data Staging Area. Calculate storage estimates for the data staging area of the overall corporate data warehouse from the sizes of the source system data structures for each business subject. Overall Corporate Data Warehouse. Estimate the storage size based on the data structures for each business subject. You know that data in the data warehouse is stored by business subjects. For each business subject, list the various attributes, estimate their field lengths, and arrive at the calculation for the storage needed for that subject. Data Marts—Conformed, Independent, Dependent, or Federated. While defining requirements, you create information diagrams. A set of these diagrams constitutes a data mart. Each information diagram contains business dimensions and their attributes. The information diagram also holds the metrics or business measurements that are meant for analysis. Use the details of the business dimensions and business measures found in the information diagrams to estimate the storage size for the data marts. Multidimensional Databases. These databases support OLAP or multidimensional analysis. How much online analytical processing (OLAP) is necessary for your users? The corporate data warehouse or the individual conformed or dependent data mart supplies the data for the multidimensional databases. INFORMATION DELIVERY STRATEGY The impact of business requirements on the information delivery mechanism in a data warehouse is straightforward. During the requirements definition phase, users tell you what information they want to retrieve from the data warehouse. . The broad areas of the information delivery component directly impacted by business requirements are: Queries and reports Types of analysis Information distribution Real time information delivery Decision support applications Growth and expansion Queries and Reports Find out who will be using predefined queries and preformatted reports. Get the specifications for the production and distribution frequency for the reports. How many users will be running the predefined queries? The second type of queries is the users formulate their own queries and they themselves run the queries. The set of reports in which the users supply the report parameters and print fairly sophisticated reports themselves. Get as many details of this type of queries and this type of report sets as you can. Types of Analysis Most data warehouse and business intelligence environments provide several features to run interactive sessions and perform complex data analysis. Analysis encompassing drill-down and roll-up methods is fairly common. Review with your users all the types of analysis they would like to perform. Get information on the anticipated complexity of the types of analysis. In addition to the analysis performed directly on the data marts, most of today’s data warehouse and business intelligence environments equip users with OLAP. Using the OLAP facilities, users can perform multidimensional analysis and obtain multiple views of the data from multidimensional databases. This type of analysis is called slicing and dicing. CHAPTER SUMMARY Accurate requirements definition in a data warehouse project is many times more important than in other types of projects. Clearly understand the impact of business requirements on every development phase. Business requirements condition the outcome of the data design phase. Every component of the data warehouse architecture is strongly influenced by the business requirements. In order to provide data quality, identify the data pollution sources, the prevalent types of quality problems, and the means to eliminate data corruption early in the requirements definition phase itself. Data storage specifications, especially the selection of the DBMS, are determined by business requirements. Make sure you collect enough relevant details during the requirements phase. Business requirements strongly influence the information delivery mechanism. Requirements define how, when, and where the users will receive information from the data warehouse.