Warehouse Design Goals and Objectives

Enabling data integration. The data warehouse concept seeks to integrate data across time and across different subject areas in such a way that users of the warehouse can easily obtain facts about the university's business. This integration is key to making university administration more fact-driven and more responsive to the changing environment. Many subsidiary goals for data design, and these guidelines as a whole, stem from the integration concept of a data warehouse.

Integrating mainframe data across time is difficult for three reasons: old records must be purged from live application systems (after they have been written to archive files); live application systems must evolve over time (making old records incompatible with current ones); and the current generation of application systems is designed to transact current business, not to facilitate historical comparisons. The most promising approach to all of these problems is to design the data warehouse to correspond to today's view of the world and to load data from yesterday's files (going back in history as far as possible), making an effort to be as faithful as possible to the data as it was written. It is inappropriate and impractical to re-format the authoritative back-up or frozen files that are used to populate the data warehouse; it is not easy to change the Oracle tables and data extracts for the historical tables in the data warehouse either, but it is well worth the effort.

Integrating data across subject areas is difficult for two main reasons. The first is that there are inherent differences between subject areas (e.g., the payroll and instructional calendars really are different). Integration across subject areas requires a concerted effort at the data modeling stage to look below the surface and find common concepts and common data.
The second difficulty stems from the fact that the University purchases application packages from different software vendors. This has warehouse design implications at many different levels, as discussed elsewhere in these guidelines. The implications range from a requirement that the modeling process discover data elements from different sources that are really "the same" (although they may be named differently on the mainframe), to the requirement that data values be sufficiently understood so that they can be put in common terms in the data warehouse.

Data standardization and normalization. Standardization and normalization of data are the fundamental means of making a data warehouse really useful. They enable integration across time and subject areas, make the data warehouse easier to learn and to understand, and simplify its design and maintenance. As the following pages make clear, both standardization and normalization are complex subjects.

"Standardizing data" means that a series of explicit rules is applied to the data as it is designed, entered, documented and used. Following the guidelines set forth in this document would be one step toward standardizing the data in the data warehouse.

"Normalizing" the data is a fundamental aspect of data standardization. Data normalization is a process of logical analysis that determines the simplest, most parsimonious and most stable structure for a database. The example in exhibits 1 and 2 suggests that data normalization is really very intuitive; a detailed discussion and further definition can be found in Durell (1985). Although it is required by a relational database management system (RDBMS) such as Oracle, the significance and utility of normalization go far beyond the requirements of the RDBMS.
Putting the data from many different application systems (and many different data formats) into one format, aside from any other inherent advantages, permits correlation of data from different subject areas. Properly normalizing the data removes many of the traps that are found in other data formats and makes it easier to draw meaningful connections. Most people have a much easier time thinking about data in a normalized form than they do about data in other formats. Documenting normalized data is easier than documenting "the same data" when it is in another format. Current hardware and software technology permit us to take advantage of these characteristics of normalized data.

Normalization example. The following made-up example illustrates the idea of normalizing mainframe data and suggests some of the processes that are involved. Consider two records from a mainframe file belonging to an application which lists the phone numbers of university employees.

Exhibit 1

    | PHONE NUMBER APPLICATION FILE -----------------------------------
    |50803920601303492SMITH   9473BEAVER  9126DEWENDER9290LACOMBE 9998...
    |50160891220303492ANDERSON9479ACHATZ  9955BELLOWS 9477BERRY   9328...

This record design reflects economic assumptions that prevailed a decade or more ago: 1) storage (on disk or in memory) was very expensive, so that data compression was essential, and 2) the data would be viewed (and updated) only through one screen (one that was smart enough to compress and decompress the data correctly). The data is designed for COBOL programs (and programmers). As a consequence, few casual data users could, or actually did, access such records directly.

Today storage is much less expensive (especially compared to the cost of human labor). Reusability and portability of data are major design goals, and the RDBMS is commonplace. Here is "the same" data, normalized and transformed into two separate tables, as they might appear in the data warehouse.
Exhibit 2

    |-- PERSON TABLE ----------------
    |ACHATZ    CENTRAL  UMS  303 492-9955
    |ANDERSON  CENTRAL  UMS  303 492-9479
    |BEAVER    CENTRAL  IRM  303 492-9126
    |BELLOWS   CENTRAL  UMS  303 492-9477
    |BERRY     CENTRAL  UMS  303 492-9328
    |DEWENDER  CENTRAL  IRM  303 492-9290
    |LACOMBE   CENTRAL  IRM  303 492-9998
    |SMITH     CENTRAL  IRM  303 492-9473
    |...

    |-- DEPARTMENT TABLE ------
    |50803  CENTRAL  IRM  01JUN1992
    |50160  CENTRAL  UMS  20DEC1989
    |...

In this example, all of the information contained in the mainframe record has been retained while the structure of the data has changed considerably. The structure has changed from a "tree-like" record with repeating segments to a collection of rectangular tables that can be connected to each other. In addition, department numbers have been parsed and translated to department and campus names (note that "campus" and "department" are in both tables); dates have been translated from a string of six digits to an internal number that makes it easy to do date arithmetic and to print the date in any one of multiple date formats; and the concept of extension number has been changed to phone number by propagating the area code and exchange number from the root segment to the rows in the person table.

The RDBMS hides the physical location and structure of the data so that it is retrieved by name. The importance of names is therefore underscored in an RDBMS, compared to an environment where specifying the location is almost always an alternative way of retrieving a specific piece of data. This example does not show column names, even though giving meaningful and systematic column names is one of the important ways in which the data warehouse adds value to the data. As far as SQL is concerned, the order of the rows and columns in an RDBMS table is arbitrary.

Building complexity on a firm foundation. The effort to design an integrated, standardized and normalized data warehouse is justified by the ease and power of casual, ad hoc queries.
The design effort is also a necessary prerequisite to Decision Support Systems (DSS) and Executive Information Systems (EIS). Exhibit 3 suggests how higher-level constructs rest on lower-level foundations:

Exhibit 3. Information systems at different levels of aggregation

    *---------------------------------------------------------------------*
    | Generic                      Delivery                               |
    | Name           Example       Unit       Subjects       Activities   |
    |---------------------------------------------------------------------|
    | EIS                          TABLES     groups         projection   |
    | DSS                                     departments    analysis     |
    | data           CIW                      months/years   comparison   |
    | warehouse                                                           |
    |                                                                     |
    | application,   FRS, FMS,     SCREENS    individuals    editing      |
    | system of      SIS, HRS                 transactions   retrieval    |
    | record                                  real time      entry        |
    *---------------------------------------------------------------------*

The data warehouse is a foundation for higher-level constructs for a third reason as well. Giving people access to their data through the data warehouse encourages them to think about issues such as data consistency and completeness across multiple rows of a table. This has a positive effect on data quality as a whole, which pays off during the construction of DSS and EIS applications.
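
Returning to the normalization example, the transformation from the mainframe records of Exhibit 1 into person and department tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the warehouse's actual load procedure: the field widths (5-digit department number, 6-digit YYMMDD date, 3-digit area code, 3-digit exchange, then repeating 8-character names with 4-digit extensions) are inferred from the exhibit, and the DEPT_NAMES lookup and the normalize function are hypothetical names introduced here.

```python
from datetime import date

# The two records of Exhibit 1, without the elided trailing segments.
# Assumed layout: dept no (5), YYMMDD date (6), area code (3), exchange (3),
# then repeating 12-character segments of name (8) + extension (4).
RECORDS = [
    "50803920601303492SMITH   9473BEAVER  9126DEWENDER9290LACOMBE 9998",
    "50160891220303492ANDERSON9479ACHATZ  9955BELLOWS 9477BERRY   9328",
]

# Hypothetical lookup from department number to (campus, department) names.
DEPT_NAMES = {
    "50803": ("CENTRAL", "IRM"),
    "50160": ("CENTRAL", "UMS"),
}


def normalize(records):
    """Split tree-like records into department rows and person rows."""
    departments, persons = [], []
    for rec in records:
        # Root segment: fixed-position fields.
        dept_no, ymd = rec[0:5], rec[5:11]
        area, exchange = rec[11:14], rec[14:17]
        campus, dept = DEPT_NAMES[dept_no]
        # Both example dates fall in the 1900s, so the century is assumed.
        when = date(1900 + int(ymd[0:2]), int(ymd[2:4]), int(ymd[4:6]))
        departments.append((dept_no, campus, dept, when))

        # Repeating segments: one person row each, with the area code and
        # exchange propagated down from the root segment.
        body = rec[17:]
        for i in range(0, len(body), 12):
            segment = body[i:i + 12]
            name, extension = segment[:8].rstrip(), segment[8:12]
            persons.append((name, campus, dept, f"{area} {exchange}-{extension}"))
    # Row order is arbitrary in a relational table; sorting is for display only.
    return departments, sorted(persons)


departments, persons = normalize(RECORDS)
for row in persons:
    print(*row)
for dept_no, campus, dept, when in departments:
    print(dept_no, campus, dept, when.isoformat())
```

Holding the date as a date value rather than a six-digit string is what makes date arithmetic and alternative print formats straightforward, and carrying the campus and department names into each person row is what allows the two tables to be connected by name.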