Warehouse design goals and objectives
Enabling data integration. The data warehouse concept seeks to integrate data across
time and across different subject areas in such a way that users of the warehouse can
easily obtain facts about the university's business. This integration is key to making
university administration more fact-driven and more responsive to the changing
environment. Many subsidiary goals for data design and these guidelines as a whole stem
from the integration concept of a data warehouse.
Integrating mainframe data across time is difficult because old records must be purged
from live application systems (after they have been written to archive files), because live
application systems must evolve over time (making old records incompatible with the
current ones), and because the current generation of application systems is designed to
transact current business, not facilitate historical comparisons. The most promising
approach to all of these problems is to design the data warehouse to correspond to today's
view of the world and load data from yesterday's files (going back in history as far as
possible), making an effort to be as faithful as possible to the data as it was written. It is
inappropriate and impractical to re-format the authoritative back-up or frozen files that
are used to populate the data warehouse; it is not easy to change Oracle tables and data
extracts for the historical tables in the data warehouse, but it is well worth the effort.
The difficulty of integrating data across subject areas occurs for two main reasons. The
first is that there are inherent differences between subject areas (e.g., the payroll and
instructional calendars really are different). Integration across subject areas requires a
concerted effort at the data modeling stage to look below the surface and find common
concepts and common data. The second difficulty stems from the fact that the University
purchases application packages from different software vendors. This has warehouse
design implications at many different levels, as discussed elsewhere in these guidelines.
The implications range from a requirement that the modeling process discover data
elements from different sources that are really "the same" (although they may be named
differently on the mainframe), to the requirement that data values be sufficiently
understood so that they can be put in common terms in the data warehouse.
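The two requirements above (matching data elements that are named differently, and
putting data values in common terms) can be pictured as a small mapping step applied
before loading. The following is only a sketch in Python. The source names, field names
and code values are all hypothetical; they stand in for whatever the modeling process
actually discovers.

```python
# Hypothetical mappings: none of these source or field names come from a real system.
# Each source system's field names are mapped to one agreed common name.
FIELD_MAP = {
    "payroll": {"EMP_NM": "person_name", "CAMPUS_CD": "campus"},
    "student": {"NAME": "person_name", "CAMP": "campus"},
}

# Hypothetical value translations, keyed by (source, common field name).
VALUE_MAP = {
    ("payroll", "campus"): {"1": "CENTRAL"},
    ("student", "campus"): {"C": "CENTRAL"},
}


def to_common_terms(source, record):
    """Rename one source record's fields and translate its coded values."""
    common_record = {}
    for field, value in record.items():
        common_name = FIELD_MAP[source].get(field)
        if common_name is None:
            continue  # no agreed common name yet, so the field is not loaded
        value = value.strip()
        translations = VALUE_MAP.get((source, common_name), {})
        common_record[common_name] = translations.get(value, value)
    return common_record


payroll_row = to_common_terms("payroll", {"EMP_NM": "SMITH   ", "CAMPUS_CD": "1"})
student_row = to_common_terms("student", {"NAME": "SMITH", "CAMP": "C"})
```

Under these assumed mappings, both calls yield the same record, which is the point of
the exercise: once names and values are in common terms, rows from different vendors'
packages can be compared directly.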
Data standardization and normalization. Standardization and normalization of data are
the fundamental means of making a data warehouse really useful. They enable integration
across time and subject areas, make the data warehouse easier to learn and to understand,
and simplify its design and maintenance. As the
following pages make clear, both standardization and normalization are complex subjects.
"Standardizing data" means that a series of explicit rules is applied to the data as it is
designed, entered, documented and used. Following the guidelines set forth in this
document would be one step toward standardizing the data in the data warehouse.
"Normalizing" the data is a fundamental aspect of data standardization. Data
normalization is a process of logical analysis that determines the simplest, most
parsimonious and most stable structure for a database. The example in exhibits 1 and 2
suggests that data normalization is really very intuitive; a detailed discussion and further
definition can be found in Durell (1985). Although it is required by a relational database
management system (RDBMS) such as Oracle, the significance and utility of
normalization go far beyond the requirements of the RDBMS. Putting the data from many
different application systems (and many different data formats) into one format, aside
from any other inherent advantages, permits correlation of data from different subject
areas. Properly normalizing the data removes many of the traps that are found in other
data formats and permits making meaningful connections more easily. Most people have
a much easier time thinking about data in a normalized form than they do about data in
other formats. Documenting normalized data is easier than documenting "the same data"
when it is in another format. Current hardware and software technology permit us to take
advantage of these characteristics of normalized data.
Normalization example. The following made-up example illustrates the idea of
normalizing mainframe data and suggests some of the processes that are involved.
Consider two records from a mainframe file belonging to an application which lists the
phone numbers of university employees.
Exhibit 1
PHONE NUMBER APPLICATION FILE
------------------------------------------------------------------------
50803920601303492SMITH   9473BEAVER  9126DEWENDER9290LACOMBE 9998...
50160891220303492ANDERSON9479ACHATZ  9955BELLOWS 9477BERRY
This record design reflects economic assumptions that prevailed a decade or more ago: 1)
storage (on disk or in memory) was very expensive, so that data compression was
essential and 2) it was assumed that the data would be viewed (and updated) only through
one screen (that was smart enough to compress and decompress the data correctly). The
data is designed for COBOL programs (and programmers). As a consequence, few casual
data users could, or did, access such records directly.
Today storage is much less expensive (especially compared to the cost of human labor).
Reusability and portability of data are major design goals, and the RDBMS is
commonplace. Here is "the same" data, normalized and transformed into two separate
tables, as they might appear in the data warehouse.
Exhibit 2
PERSON TABLE
------------------------------------
ACHATZ  CENTRAL UMS 303 492-9955
        CENTRAL IRM 303 492-9126
        CENTRAL UMS 303 492-9328
        CENTRAL IRM 303 492-9473

50160   CENTRAL UMS 20DEC1989
In this example, all of the information contained in the mainframe record has been
retained while the structure of the data has changed considerably. The structure has
changed from a "tree-like" record with repeating segments to a collection of rectangular
tables that can be connected to each other. In addition, department numbers have been
parsed and translated to department and campus names (note that "campus" and
"department" are in both tables); dates have been translated from a string of six digits
to an internal number that makes it easy to do date arithmetic and print the date in any one
of multiple date formats; and the concept of extension number has been changed to phone
number by propagating the area code and exchange number from the root segment to the
rows in the person table.
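The transformation just described can be sketched in a few lines of Python. This is an
illustration only, not the actual load program. The field widths are inferred from the
records in Exhibit 1 (5-digit department number, 6-digit date, 3-digit area code,
3-digit exchange, then repeating segments of an 8-character name and a 4-digit
extension), the department lookup table is hypothetical, and the dates are assumed to
fall in the 1900s.

```python
from datetime import date

# Assumed fixed-width layout, inferred from Exhibit 1 (not a documented format).
ROOT_LEN = 17      # department (5) + date YYMMDD (6) + area code (3) + exchange (3)
SEGMENT_LEN = 12   # name (8) + extension (4), repeated once per person

# Hypothetical lookup table: department number -> (campus, department).
DEPARTMENTS = {"50803": ("CENTRAL", "IRM"), "50160": ("CENTRAL", "UMS")}


def normalize(record):
    """Split one tree-like record into a department row and a list of person rows."""
    dept_no = record[0:5]
    yymmdd = record[5:11]
    area_code, exchange = record[11:14], record[14:17]
    campus, department = DEPARTMENTS[dept_no]

    # Assumption: every date is in the 1900s. A date object supports arithmetic
    # and can be printed in any format.
    as_of = date(1900 + int(yymmdd[0:2]), int(yymmdd[2:4]), int(yymmdd[4:6]))
    department_row = (dept_no, campus, department, as_of)

    # Propagate campus, department, area code and exchange from the root segment
    # to every repeating segment, so each person row stands on its own.
    person_rows = []
    body = record[ROOT_LEN:]
    for start in range(0, len(body) - SEGMENT_LEN + 1, SEGMENT_LEN):
        segment = body[start:start + SEGMENT_LEN]
        name, extension = segment[:8].rstrip(), segment[8:12]
        person_rows.append(
            (name, campus, department, area_code, f"{exchange}-{extension}")
        )
    return department_row, person_rows


# The complete segments of the second record in Exhibit 1.
record = "50160891220303492ANDERSON9479ACHATZ  9955BELLOWS 9477"
department_row, person_rows = normalize(record)
```

Here `department_row` carries the department number, campus, department and date, and
each entry in `person_rows` carries a full phone number rather than a bare extension.
Any trailing partial segment is ignored by this sketch.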
The RDBMS hides the physical location and structure of the data so that it is retrieved by
name. The importance of names is therefore underscored in an RDBMS, compared to an
environment where specifying the location is almost always an alternative way of
retrieving a specific piece of data. This example does not show column names even
though giving meaningful and systematic column names is one of the important ways in
which the data warehouse adds value to the data. As far as SQL is concerned, the order
of the rows and columns in an RDBMS table is arbitrary.
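A short sketch may make the point about names concrete. It uses Python's built-in
sqlite3 module as a stand-in for Oracle, and the table and column names are
hypothetical, since the exhibits do not show any. Data is requested by column name, in
whatever column order the query asks for, and row order is guaranteed only when the
query states it.

```python
import sqlite3

# SQLite in memory stands in for the warehouse RDBMS; names below are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "person_name TEXT, campus TEXT, department TEXT, "
    "area_code TEXT, phone_number TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
    [
        ("SMITH", "CENTRAL", "IRM", "303", "492-9473"),
        ("ACHATZ", "CENTRAL", "UMS", "303", "492-9955"),
    ],
)

# Columns are named, not located: the query picks them in its own order.
# ORDER BY is needed because the stored row order is arbitrary.
rows = conn.execute(
    "SELECT phone_number, person_name FROM person ORDER BY person_name"
).fetchall()
```

The query never mentions where the data sits or how wide a field is, which is why
meaningful and systematic column names matter so much more here than in the mainframe
file.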
Building complexity on a firm foundation. The effort to design an integrated, standardized
and normalized data warehouse is justified by the ease and power of casual, ad
hoc queries. The design effort is also a necessary prerequisite to Decision Support
Systems (DSS) and Executive Information Systems (EIS). Exhibit 3 suggests
how higher-level constructs rest on lower-level foundations:
Exhibit 3
Information systems at different levels of aggregation
[Figure not recoverable from the source. Activities, top to bottom: projection,
analysis, comparison, retrieval. System labels: warehouse; application, system of;
real time.]
The data warehouse is a foundation for higher-level constructs for a third reason, as well.
Giving people access to their data through the data warehouse encourages them to think
about issues such as data consistency and completeness across multiple rows of a table.
This has a positive effect on data quality as a whole, which pays off during the
construction of DSS and EIS applications.
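As an illustration of the kind of cross-row question that warehouse access invites,
the following Python sketch checks a set of rows for consistency (one department
number should carry only one campus and department name) and for completeness (no
empty values). The row layout and function names are hypothetical.

```python
from collections import defaultdict


def find_inconsistent_departments(rows):
    """Return department numbers that appear with more than one name."""
    names_by_number = defaultdict(set)
    for dept_no, campus, department in rows:
        names_by_number[dept_no].add((campus, department))
    return sorted(
        dept_no for dept_no, names in names_by_number.items() if len(names) > 1
    )


def find_incomplete_rows(rows):
    """Return rows in which any value is missing or blank."""
    return [row for row in rows if any(value in (None, "") for value in row)]


# Made-up rows: the second spells the department differently, the last lacks one.
rows = [
    ("50160", "CENTRAL", "UMS"),
    ("50160", "CENTRAL", "U.M.S."),
    ("50803", "CENTRAL", "IRM"),
    ("50999", "CENTRAL", ""),
]
inconsistent = find_inconsistent_departments(rows)
incomplete = find_incomplete_rows(rows)
```

Checks of this kind are hard to run against records that can be seen only one screen
at a time; against a table they are a few lines.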