Understanding Data Warehouse Management
A Primer

John Deitz
For Viasoft, Inc.
Status: 02 September 1998
Preface
In this primer, I describe a business management practice called “data warehousing” and many of the
management issues that contribute to making data warehousing successful. This is a fairly non-technical
coverage of the subject, which should be useful to Sales, Services, Development and Practices personnel
who work with Viasoft’s solutions.
In one respect, the advent and evolution of data warehousing marks a return to basic values for running a
successful business. IT organizations have long been caught up in the technology and application systems
needed to automate the business. Although these applications are called “information systems”, it is closer
to the truth that they are merely “business transaction” systems -- generating little real “information” that
could be useful for analyzing the business.
Savvy business managers, who rise above the whir and confusion of the IT technology engine, recognize
the need to accurately define the key concepts and processes at play in their business, and to begin
measuring business operations against them. It takes little effort to see that the reports produced, and raw
data processed, while running the business offer virtually no insights into:

- Customer segmentation, profiling and retention,
- Market basket or cross-product linkage analysis,
- Churn and trend analysis,
- Fraud and abuse analysis, or
- Forecasting and risk analysis.
These are not the “raw” elements of data found in highly normalized transaction system database schema;
rather, they are aggregations and synthesis of data into “information” … and ideally into useful knowledge.
Data warehousing is the process of creating useful information that can be used in measuring and managing
the business.
Table of Contents

1. What's a Data Warehouse or Data Mart?
   1.1 How are Data Warehouses Produced?
   1.2 What is the Purpose of the Data Warehouse?
       Measurement
       Discovery
       Trend Analysis
2. How do Data Warehouse Databases Differ from Production Ones?
   2.1 Decision Support versus Transactional Systems
   2.2 Data Quality and Understanding
   2.3 General Data Usage Patterns
       Time Series Data
       Summarized or Aggregated Data
   2.4 Separate DSS and OLTP Systems
   2.5 DSS-specific Usage Patterns
   2.6 Operational Data Stores (ODS)
3. Data Warehouse Architectures
   3.1 Marts and Warehouses
       What do you mean by "Architecture"?
   3.2 Physical Warehouse Architecture
       Centralized
       Independent Data Mart
       Dependent Data Marts, with Distribution Data Warehouse
       Operational Data Stores - with Marts and Data Warehouse
       Virtual Data Warehouse
       "Hub-and-Spoke" Architecture
   3.3 Warehouse Framework Architecture
       Production or External Database Layer
       Data Access Layer
       Transformation Layer
       Data Staging Layer
       Data Warehouse Layer
       Information Access Layer
       Application Messaging Layer
       Meta Data Directory Layer
       Process Management Layer
4. What is Data Mart/Warehouse Management?
   4.1 Managing Data Administration: Defining "Information"
       Terms and Semantics Directory (Glossary)
       Value Domains and Abbreviations
       Naming Conventions
       Data Type and Format Standards
       Automated Data Quality Measurement
       Information Owners
       Managing IT-related Meta Data
   4.2 Managing Business Segment Analysis & Mart Design
       Exploration
       Schema Design
       Locating Data Sources
       Prototyping
   4.3 Managing Database Administration
       Managing Production Data Store Schema
       Managing Data Warehouse, Mart and ODS Schema
       Applying Data Types, Naming Conventions and Format Descriptors
   4.4 Managing Data Valid Value Domains
   4.5 Managing Data Extraction
       Identifying and Qualifying Candidate Stores to Draw From
       Managing Queries and Data Extraction
   4.6 Managing Data Transformation
       Data Conjunction
       Data Aggregation
       Data Cleansing
       Data Dimensioning
   4.7 Managing Data Traceability
   4.8 Managing Schedules and Dependencies
   4.9 Managing Data Quality and Usage
       Use-based Data Quality Audits
       Data Quality Re-Design
       Data Quality Training
       Data Quality Continuous Improvement
       Data Quality S.W.A.T. Tactics
       Data Cleansing Management
   4.10 Managing Data Warehouse Architectures
   4.11 Managing Operations, Facilities and End User Tools
   4.12 Managing Basic Database Operations
   4.13 Managing Rules, Policies and Notifications
   4.14 Managing Meta Data Definition, Population and Currency
   4.15 Managing History, Accountability and Audits
   4.16 Managing Internal Access, External Access and Security
   4.17 Managing Systems Integration Aspects
   4.18 Managing Usage, Growth, Costs, Charge-backs and ROI
       Usage
       Growth
       Cost Management
       Return on Investment
   4.19 Managing Change – the Evolving Warehouse
5. Future of Data Warehousing
       Objectives
       Business Factors
       Technology Factors
       Knowledge Engineering
       More Effective Infrastructure & Automation
       More Effective Communication
6. Summary
       Conclusion
1. What's a Data Warehouse or Data Mart?
A “data warehouse” is a database, or group of databases, where business people can access
business-related data; a data warehouse may aggregate business information from several
organizations, and/or may serve as a central repository for standards, definitions, value domains,
business models, and so on.
Bill Inmon, recognized as the father of the data warehouse concept, defines a data warehouse as
“a subject-oriented, integrated, time variant, non-volatile collection of data in support of
management’s decision-making process”. Richard Hackathorn, another data warehouse pioneer,
says the “goal of data warehouse is to provide a single image of business reality”. Both of these
definitions have merit, as this primer will illustrate.
A “data mart” is a (usually smaller) specialized data warehouse that covers one subject area, such
as finance or marketing, and may serve only one distinct group of users. Since a data mart focuses
on meeting specific user business needs, it can capture the commitment and excitement of users.
Also, the limited scope makes it easier for IT to grasp the business requirements and work with the
users effectively. Most successful data warehousing efforts begin through the successes of one or
more marts.
A data mart or warehouse is a collection of "clean data". This data usually originates in
the production transaction systems that automate the business. But it is not the same
"raw" data processed by transaction systems, and more specifically it is not in the same
transaction-optimal form; rather it is very selective data, organized into specific "dimensions",
and optimized for access by decision support systems (DSS), sometimes also called executive
information systems (EIS).
These are the simple definitions for data warehouse and mart. The remainder of this paper will
illustrate that the process of data warehousing is far from simple. Recent history provides many
examples of unsuccessful warehouse implementations costing millions of dollars.
A number of things - warehouse/mart architecture, data understanding, data access, data
movement, merging of data from heterogeneous sources, data cleansing and transformation, mart
database design, and business user access/query tools - all have to come together successfully for
a data warehouse to succeed. In short, data warehousing cannot be done successfully without
significant management measures in place.
For the remainder of this paper, I will use the term “data warehouse” to refer equally to marts or
warehouses – except in discussions of architecture where I draw distinctions between the two.
1.1 How are Data Warehouses Produced?
Data warehouses are usually populated with data originating from the production business systems
- in particular, from the databases underlying those systems. That means existing data must be
extracted from its source, cleansed, transformed (modified, aggregated and dimensioned), staged,
distributed to remote sites (perhaps) and finally loaded into the data stores comprising the data
warehouse or marts. The following figure illustrates this flow:
Figure 1a: Steps in Building a Data Warehouse
Actually, this figure grossly simplifies the effort that goes into designing, creating and managing
a data warehouse. Depending on the nature and quality of existing production systems, and the
needs of business users querying the data warehouse, each step may require significant research,
planning and coordination.
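To make the flow concrete, here is a minimal Python sketch of the extract-cleanse-transform-load sequence. The record layout, cleansing rules and aggregation are invented for illustration; real pipelines involve far more steps and tooling.

```python
# A minimal sketch of the extract -> cleanse -> transform -> load flow
# described above. The record layout and cleansing rules are invented
# for illustration; real pipelines are far richer.

def extract(source_rows):
    """Pull raw records from a production source (here, an in-memory list)."""
    return list(source_rows)

def cleanse(rows):
    """Drop records with missing keys and normalize obvious anomalies."""
    cleaned = []
    for row in rows:
        if not row.get("product_key"):
            continue  # reject: unusable without its key
        row["units_sold"] = max(0, int(row.get("units_sold", 0)))
        cleaned.append(row)
    return cleaned

def transform(rows):
    """Aggregate detail rows into the dimensioned form the mart expects."""
    facts = {}
    for row in rows:
        key = (row["product_key"], row["sale_date"])
        fact = facts.setdefault(key, {"units_sold": 0, "dollars_sold": 0.0})
        fact["units_sold"] += row["units_sold"]
        fact["dollars_sold"] += row["units_sold"] * row["unit_price"]
    return facts

def load(facts, warehouse):
    """Load the staged, aggregated facts into the warehouse store."""
    warehouse.update(facts)

warehouse = {}
raw = [
    {"product_key": "P1", "sale_date": "1998-08-01", "units_sold": 3, "unit_price": 10.0},
    {"product_key": "P1", "sale_date": "1998-08-01", "units_sold": 2, "unit_price": 10.0},
    {"product_key": None, "sale_date": "1998-08-01", "units_sold": 1, "unit_price": 4.5},
]
load(transform(cleanse(extract(raw))), warehouse)
print(warehouse)  # {('P1', '1998-08-01'): {'units_sold': 5, 'dollars_sold': 50.0}}
```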
1.2 What is the Purpose of the Data Warehouse?
Basically, the purpose of a data warehouse is to put accurate information about the business into
the hands of competent analysts or decision-makers. This information must present very crisp and
unambiguous facts about business subjects (such as customer, product, brand, contract,
representative, region, and so on) and provide useful perspectives on the current operations of the
business. Here I look at just a couple of the roles a warehouse plays.
Measurement
The analysts use the data warehouse to measure the performance of the business; these
measurements are often computed from millions of detailed records of data, or sometimes
from aggregations (summaries) of the data. When business performance is weak or
unexpected, the analysts “drill down” to deeper levels of detail to find out why; this can
mean looking at the data (i.e. querying) in a new and different way – perhaps on a one-time
basis.
Here’s an example: a well-designed mart might allow an analyst to measure the
effectiveness of the company’s investment in inventory. Called GMROI, or Gross Margin
Return on Inventory, this measurement is computed as follows:
GMROI = [(Total Quantity Shipped) * (Value at Latest Selling Price – Value at Cost)]
        / [(Daily Average Quantity on Hand) * (Value at Latest Selling Price)]
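To make the formula concrete, here is a small Python calculation using invented figures; the variable names are assumptions, not a particular mart's schema:

```python
# GMROI computed from invented sample figures; names are assumptions.
total_qty_shipped = 1200          # units shipped over the period
latest_selling_price = 24.99      # per-unit value at latest selling price
unit_cost = 14.50                 # per-unit value at cost
daily_avg_qty_on_hand = 800       # average inventory level over the period

gmroi = (total_qty_shipped * (latest_selling_price - unit_cost)) / \
        (daily_avg_qty_on_hand * latest_selling_price)
print(f"GMROI = {gmroi:.2f}")     # GMROI = 0.63
```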
Obviously, this measurement could not be made without the right, accurate detailed data
organized in a very particular pattern, and the ready ability to query the data in a number of
interesting ways. Most successful data warehouses are based on a “star schema” database
design that minimizes the amount of data to be stored, while optimizing the potential of
query performance. (I will discuss this further later on, because not all star schema designs
have the same performance potential.)
Discovery
Analysts also use data warehouse data for a relatively new form of analysis called “data
mining”. Data mining involves deep analysis of potentially huge volumes of data to
discover patterns and trends that no one would think to look for. Certain business events or
trends can occur due to the convergence of highly unrelated facts; data mining makes it
possible to see new facts, and how they are correlated. Data mining is “new” because only
recently have the database and query technologies matured enough to support it with
acceptable performance.
Trend Analysis
The data in a data warehouse is usually organized into “dimensions”. For example, a data
mart designed to store information about stores selling products might be organized into the
following dimensions: Store, Product, and Time. Time is a very common and important
dimension, because it allows the measurements taken today to be compared with the same
measurements taken last week, last month or last year. This is what Bill Inmon referred to
as “time variant” in the quote I used earlier to define data warehouses. When a number of
time-phased measurements are considered in succession, we can do trend analysis.
One significance of trend analysis is to better understand the past. Trend analysis is very
effective in highlighting a declining market or shift in sales from one brand to another.
Perhaps a more powerful use of trend analysis is to project the future; this is done by
extending a trend into the future through the application of various statistical models.
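As a toy illustration of projecting a trend, the sketch below fits a least-squares line to four quarters of invented sales totals and extends it one quarter; real trend analysis would draw on much richer statistical models.

```python
# Least-squares line through quarterly sales totals, extended one quarter.
# The sales figures are invented for illustration.
quarters = [1, 2, 3, 4]
sales = [410.0, 455.0, 490.0, 540.0]  # e.g. thousands of dollars

n = len(quarters)
mean_x = sum(quarters) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, sales)) \
        / sum((x - mean_x) ** 2 for x in quarters)
intercept = mean_y - slope * mean_x

next_quarter = 5
forecast = slope * next_quarter + intercept
print(f"Projected Q{next_quarter} sales: {forecast:.1f}")  # 580.0
```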
It should be easy to see how the notions of Measurement, Discovery and Trend Analysis
contribute (in very meaningful ways) to assessing how effectively a business is operating, and
where management actions should be focused.
2. How do Data Warehouse Databases Differ from Production Ones?
What’s the difference between a data warehouse database, and all the other production databases
that the business has? Why not just use the production databases as the basis for data warehouse
queries?
These are good questions - and important ones – as we explore the success criteria for data
warehousing initiatives. In this section, I’ll discuss the differences and rationales.
2.1 Decision Support versus Transactional Systems
Decision Support systems and Executive Information systems have different usage and
performance characteristics from production systems. Production systems such as order entry,
general ledger or materials control generally access and update the record of a single business
object or event: one order, one account, or one lot. Transactions are generally pre-defined, and
require the database to provide very fast access to, and locking for, one record at a time.
In fact, the requirement that production databases be “transaction-optimal” heavily influences the
schema design of these databases. These database schemas are often highly normalized to
maximize the opportunity for concurrent access to database data; that is, they are designed so that
the smallest amount of data will be locked (i.e. unavailable to others) at any one time. In this way,
transactions can have pinpoint accuracy and be highly efficient.
In contrast, databases supporting DSS must be able to retrieve large sets of aggregate or historical
data within a reasonable response time. The data must be well-defined, clean, and organized into
specific dimensions that support business analysis.
2.2 Data Quality and Understanding
The data records processed by production systems are usually concatenations of the master records
of the key databases with contextual information. IT systems that have evolved over the years
have been tuned to cater to the data anomalies found in this data; that is, they “correct” anomalies
“on the fly”. Such anomalies include missing or inaccurate codes, discrepancies between order
header and detail records, and garbage found in fields due to electronic transmission errors. The
programmers of these systems are usually so far removed from the data entry points to the system,
that it is easier (and more convenient) to adjust values during processing than to correct the source
of the data.
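The kind of "on the fly" adjustment described above looks roughly like the following Python sketch; the specific anomalies and repairs are invented examples, not any particular production system's logic.

```python
# Illustrative 'on the fly' corrections of the kind production programs
# accumulate over the years; the anomalies and repairs are invented.
VALID_REGION_CODES = {"N", "S", "E", "W"}

def patch_order_record(rec):
    # Missing or unknown region code: force a catch-all rather than fail.
    if rec.get("region") not in VALID_REGION_CODES:
        rec["region"] = "W"  # historical default nobody remembers choosing
    # Header/detail discrepancy: trust the sum of the detail lines.
    detail_total = sum(line["amount"] for line in rec["lines"])
    if rec["header_total"] != detail_total:
        rec["header_total"] = detail_total
    return rec
```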
It is common for data in production stores to get “tainted” - becoming application specific. For
example, a primary datastore of customer information may be pre-filtered to contain only “active
accounts”, while to the casual observer (outside IT) it may appear to encompass all accounts.
Also, programmers and early database designers have traditionally been lazy about naming
standards, assuming that only a small technical audience would see the names they contrived for
data elements. Hence, an “outsider” cannot trust the meaning of terms such as customer, order,
product, and so on – such terms were often used loosely in the past.
These subtleties and miscellaneous filters make production stores the wrong, or at least
incomplete, sources of information for most decision-making purposes.
2.3 General Data Usage Patterns
As noted above, transaction systems are tuned for item-at-a-time processing, and more
importantly, update processing. These systems are called on-line transaction processing (OLTP)
systems. The intent of such systems to update, and to minimize the scope of database locks, places
significant constraints on how the data is laid out (schema design) and how it is accessed.
By contrast, decision support systems operate against schemas that facilitate querying [perhaps
very granular] information in a myriad of ways; this is a read-only mode of access. The tools
used for these systems are called on-line analysis processing (OLAP) tools. A particular pattern of
schema design, called “star schemas”, has proven very powerful in decision support systems. Star
schemas can be used to design dimensional data stores. The term “star” was coined because the
schema configuration consists of a core fact table which has relationships (foreign keys, in
relational OLAP systems) to a number of dimension tables.
  Sales Fact:        time_key, product_key, store_key,
                     dollars_sold, units_sold, dollars_cost
  Time Dimension:    time_key, day_of_week, month, quarter, year, holiday_flag
  Product Dimension: product_key, description, brand, category
  Store Dimension:   store_key, store_name, address, floor_plan_type

Figure 2a: Example of a Star Schema
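For readers who prefer code to diagrams, the following sketch builds the Figure 2a schema in SQLite via Python's standard sqlite3 module; the column types are assumptions, since the figure shows only names.

```python
# The star schema of Figure 2a, built in SQLite for illustration.
# Column types are assumed; the figure specifies only column names.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, day_of_week TEXT,
                          month INTEGER, quarter INTEGER, year INTEGER,
                          holiday_flag INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT,
                          brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          address TEXT, floor_plan_type TEXT);
-- Core fact table: a foreign key to every dimension, plus the measures.
CREATE TABLE sales_fact  (time_key INTEGER REFERENCES time_dim,
                          product_key INTEGER REFERENCES product_dim,
                          store_key INTEGER REFERENCES store_dim,
                          dollars_sold REAL, units_sold INTEGER,
                          dollars_cost REAL);
""")
conn.close()
```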
Another schema pattern, snowflake schema, is sometimes used for data warehouses. The
snowflake approach is an extension of the star schema idea; in contrast to star schemas, which are
commonly de-normalized, snowflake schemas are often highly normalized. This creates the
“snowflake” schema pattern as illustrated below:
Figure 2b: Pattern of a Snowflake Schema
Several pioneers of data warehousing have heavily discouraged the usage of snowflake schemas
for data warehouses of appreciable size. Too many joins are necessary to access the data, which
leads to unacceptable performance. Also, the schema is more difficult for users to learn and use.
Still, there will be sites where you run across the term, so I mention it here for completeness.
It should be clear to IS professionals that these schemas differ considerably from production
database schemas in key respects:
Time Series Data
As mentioned earlier, the Time dimension allows analysts to view how business behaviors
change over time; it is very common to find a Time dimension in OLAP star schemas.
This dimension is essential for measuring whether the business will reach its business goals,
and to compare today’s state of some segment of the business to a past state (last month,
last year, etc.) of the identical segment.
Summarized or Aggregated Data
In our example above, note that the central fact table contains elements (dollars_sold,
units_sold) that are aggregations of more detailed information (perhaps gathered from
production data stores). A more subtle observation is that these “facts”, by themselves,
have no meaning. The meaning of each fact is entirely dependent on the dimensions
attached to it, which provide a valid “context” for the fact. In our example above, the
dollars_sold fact is more accurately: dollars_sold of a Product at a Time in a Store.
Lastly, it is worth pointing out that, from a usage point of view, transactions are often rigidly
defined and mechanically executed. By contrast, DSS/EIS users need the flexibility to form ad
hoc queries which “cut” the facts into any combination of the available dimensions. In his report,
“Data Warehouses: An Architectural Perspective”, Jack McElreath nicely summarizes the
difference of data usage:
Production Data                               Warehouse Data
---------------------------------------       ----------------------------------------
Short-lived, rapidly changing                 Long-lived, static
Requires record-level access                  Data is aggregated into sets **
Repetitive standard transactions              Ad hoc queries; some periodic reporting
Updated in real time                          Updated periodically with mass loads
Event-driven – process generates data         Data-driven – data governs process

** Which is why warehouse data is friendly to relational DBs.

Figure 2c: Production and Warehouse Data are Very Different
In summary, DSS/EIS users need data that accurately describes the organization (as they visualize
it), is accessible and internally consistent, and is organized for access by analytical tools. DSS
users are analysts or managers that think about the big picture long term. Typical queries might
list total sales for each of the last five years, items that have been out of stock more than 15% of
the time, or customers with the most orders in 1997.
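Against the star schema of Figure 2a, such queries simply "cut" the facts along whichever dimensions the analyst chooses. The sketch below shows two plausible cuts in SQL, held as Python strings; the tables and columns are those of the earlier figure, and the date range is invented.

```python
# Sample ad hoc cuts against the Figure 2a star schema (SQLite syntax).
# The year range is invented for illustration.
total_sales_by_year = """
    SELECT t.year, SUM(f.dollars_sold) AS total_sales
    FROM   sales_fact f JOIN time_dim t ON f.time_key = t.time_key
    WHERE  t.year BETWEEN 1993 AND 1997
    GROUP  BY t.year
    ORDER  BY t.year;
"""

# The same facts cut a different way: units per brand per store.
units_by_brand_and_store = """
    SELECT p.brand, s.store_name, SUM(f.units_sold) AS units
    FROM   sales_fact f
    JOIN   product_dim p ON f.product_key = p.product_key
    JOIN   store_dim   s ON f.store_key   = s.store_key
    GROUP  BY p.brand, s.store_name;
"""
# Both queries run against the tables created in the earlier sketch.
```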
2.4 Separate DSS and OLTP Systems
There are very rare cases in which the databases designed for OLTP are also accessed by DSS systems.
In these situations, the production database serves as a virtual data warehouse. To be successful,
this requires a fairly technical warehouse user, very clean data in the production store, and excess
bandwidth in the processing window of that store. In addition, the analyst must concede that the
data is not stable (it is changing constantly), is not time phased, and performance will likely be
degraded by normal business processing. Given these limitations, it is easy to see why this
strategy is not employed in serious, large-scale data warehousing solutions. However, there are a
few OLAP tools designed to enable virtual data warehouses.
In even fewer cases, there is a need for a production system to use DSS-type functions. This is
called on-line complex processing (OLCP) – yes, there's an acronym for everything! These cases
are rare indeed.
A defining characteristic of successful data warehousing is a clean separation of production
and decision support functionality. By separating these two very different processing patterns,
the data warehouse architecture enables both production and decision support systems to focus on
what they do best, and thereby provide better performance and functionality for each.
2.5 DSS-specific Usage Patterns
Within the decision support system community, users in various roles place different demands on
the warehouse. A few of the recognized roles, and their demands of the DSS system, are
discussed below:
Information Farmer: The information farmer is a consumer of well-understood data through
pre-defined queries. This user knows what to look for, and accesses small amounts of data
with predictable patterns.

Information Tourist: The information tourist is a consumer with unpredictable access to
data. The tourist is a regular user of the warehouse/mart, so the load placed on the system
is known; but s/he accesses larger amounts of data, and potentially different data each
time. Custom queries are used.

Information Explorer: The information explorer is involved with information mining. This
user has very erratic access patterns, and can access very large amounts of very detailed
data. This can place very unreasonable performance loads on a warehouse or mart.

Mart Prototyper: The mart prototyper is involved with experimenting with warehouse and
mart schemas. Like the information explorer, his/her usage patterns are erratic. Also,
several schemas may be implemented and discarded during the development process, placing
administrative loads on the DBMS.
It is easy to see that the explorer and prototyper users place unreasonable demands on the typical
data warehouse environment; a random relational query, that is not supported by well-tuned
indices and access plans, can overload the DBMS and bring the decision support system to its
knees. For this reason, many companies isolate these functions onto entirely different database
platforms, where they do not threaten the performance of the mainstream warehouse.
One new technology specifically addresses the explorer and prototyper audience. The technology
is called Nucleus, and it is packaged into two new products, Nucleus Exploration Warehouse/
Mart and Nucleus Prototype Warehouse/Mart, marketed by Sand Technology, Inc. Nucleus is a
hybrid database technology that uses a “tokenization” scheme to store and automatically index all
data that is loaded; the virtue of tokenization is that each unique data value is stored only once,
thereby providing data compression. Nucleus boasts highly-tuned algorithms to access data,
automated server fault recovery, and greatly reduced database administration requirements.
Nucleus looks and acts like a relational database, accessed through a standard ODBC read/write
interface. But under the covers, the storage technology has the effect of compressing data (up to
60 percent), while supercharging the performance of data access. Bill Inmon, of Pinecone
Systems, and other data warehouse experts are already voicing support for the technology –
remarkable, since the products have been on the market less than a year.
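To illustrate the general idea of tokenization (dictionary encoding, where each unique value is stored once and rows hold small tokens), here is a toy Python sketch; it shows the principle only, and says nothing about Nucleus's actual internals.

```python
# Toy dictionary encoding: each distinct value stored once, rows hold tokens.
# This illustrates the general principle only, not Sand's Nucleus internals.
def tokenize_column(values):
    dictionary = []            # token -> value
    index = {}                 # value -> token
    tokens = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        tokens.append(index[v])
    return dictionary, tokens

dictionary, tokens = tokenize_column(["AZ", "CA", "AZ", "AZ", "NY", "CA"])
print(dictionary)  # ['AZ', 'CA', 'NY'] -- each unique value stored once
print(tokens)      # [0, 1, 0, 0, 2, 1]
```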
2.6 Operational Data Stores (ODS)
In the last few years, the data warehousing industry has begun to mature, and with this maturity
has come new ways to view and integrate decision support technology. The concept of
Operational Data Stores (ODSs) has evolved out of this industry experience, and has recently been
promoted heavily by Bill Inmon.
An ODS is not a warehouse, nor is it a mart. Rather it is another store used, in conjunction with
warehouses and marts, within some warehousing architectures. An ODS is a regularly-refreshed
container of detailed data used for very short term, tactical decision making, and is often used as
one of the feeds to a central data warehouse hub. I discuss ODSs further in the next section.
3. Data Warehouse Architectures
Quite recently, some of the venerable pioneers and gurus of data warehousing history have taken
the proverbial “step back” to assess what works and what doesn’t. Billions of dollars have been
lost in unsuccessful data warehousing initiatives; and there are certainly a lot of ways to create
poor ones. A key focus of these leaders today is: data warehouse architecture.
Shaku Atre, acknowledged expert in the data warehousing, client-server and database fields,
speaks about architecture in her latest 1998 report:
“Because data warehousing is playing a more critical role, organizations need to ensure
that their data warehousing capability is able to meet requirements that change rapidly.
You need an approach that delivers quick results now, but provides a flexible and
extendible framework for the future; this means you need to build the right architecture
from the beginning.”
“Without the right architecture, there can be no effective long term strategy.”
In Data Warehousing Fundamentals: What You Need to Know to Succeed, Bob Lambert states:
“A data warehouse is a very complex system that integrates many diverse components:
personal computers, DSS/EIS software, communications networks, servers, mainframes,
and different database management system packages, as well as many different people
and organizational units with different objectives. The overall requirement for data
warehouse is to provide a useful source of consistent information describing the
organization and its environment. Even though this requirement can be simply stated, it
is a moving target, buffeted by accelerating change in business conditions and
information technology. Successful data warehouse projects require architectural
design.”
In this section, we look at data warehouse architectures through the eyes of the industry's leading
experts. To do this, we need to again separate the ideas of data mart and data warehouse.
3.1 Marts and Warehouses
Based on the nature of star schemas, it is easy to deduce that an average business could have
several data marts, each focused on a specific area of analysis. One analyst reported in 1997 that
most companies have already developed three or more data marts. Many data marts are operated
more-or-less autonomously within departments until they prove successful. But after a few
successful marts are in full swing, it makes sense to pool resources and refine basic processes used
in populating warehouses. (I’ll explore these processes in a later section.)
In businesses with a strong central IT organization, there is often a desire to create a central data
warehousing “infrastructure” that mart-builders can take advantage of. In fact, some central
organizations have had the clout to prevent marts from being created until a set of central services
is established, thus impeding some business objectives.
There is a fair amount of controversy between these two camps. In part, it comes down to whether
marts are independent or dependent (on a central warehouse). Sometimes this is just a “turf war”;
but there are many other factors to consider having to do with time and resources.
Suppose a body of information is essential to 5 marts, and this body of information is aggregated
from 4 different sources and cleaned in preparation for use. If each mart is independent, then each
mart performs the extractions from the 4 sources, merges the data, performs its own cleaning,
and populates the mart database; that's a total of 5 x 4 = 20 extractions, 5 merge functions and 5
cleanings. Besides the duplication of effort, what are the chances that these operations are
conducted in precisely the same way, yielding precisely the same data?

Suppose instead that the same data were processed once (centrally) and then distributed to the 5
marts. This would result in 4 extractions, 1 merge function, 1 cleaning, and 5 distributions. An
important side benefit is that the data would be consistent, and have precisely the same meaning,
in each of the marts. Therefore, it would be safe to combine information from the different marts
into a higher-level report.
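The arithmetic generalizes: with M marts and S shared sources, the independent approach costs M x S extractions plus M merges and M cleanings, while the central approach costs S extractions, one merge, one cleaning and M distributions. A tiny Python sketch of the comparison:

```python
# Operation counts for M independent marts vs. one central preparation
# facility feeding the same M marts from S shared sources.
def independent_ops(m_marts, s_sources):
    return {"extractions": m_marts * s_sources,
            "merges": m_marts, "cleanings": m_marts, "distributions": 0}

def centralized_ops(m_marts, s_sources):
    return {"extractions": s_sources,
            "merges": 1, "cleanings": 1, "distributions": m_marts}

print(independent_ops(5, 4))  # 20 extractions, 5 merges, 5 cleanings
print(centralized_ops(5, 4))  # 4 extractions, 1 merge, 1 cleaning, 5 distributions
```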
Aside from redundant processing, data replication is another important issue to consider. Data
that is aggregated centrally (for processing) and then distributed to marts naturally resides in
multiple locations at a time. This can consume significant DASD resources that must be planned
for and managed.
What do you mean by "Architecture"?
In the following sections, we’ll look at two aspects of architecture relating to data warehouses.
First we explore the physical architecture – basically the decision of whether the warehouse (with
marts) is centralized or distributed. Second, we explore the framework architecture – that is, the
relationships between data, operations on data, and movement of data.
3.2 Physical Warehouse Architecture
There are several architectures that data warehouses can be based on. Each has pros and
cons. One key to data warehousing success is to first understand the possible options, then
understand the unique business needs of the company, and finally select architectures that meet
those needs.
Centralized
In this approach, a company gathers production data from across the organizations of the
company into a central store. This data covers many different subject areas, and/or lines of
business.
The advantages of the centralized approach include the degree of control, accuracy and
reliability it provides, along with economies of scale.
But one of the main problems with the centralized approach is that it runs directly up
against whatever organizational or political issues are causing difficulties for the company.
Building such a data warehouse will commonly require great cooperation between IT and
users, between business units and central management, and among various business units.
This is a big reason, without even considering technology challenges, why centralized,
monolithic data warehouses tend to be expensive, complex and take a long time to build. In
other words, in order to serve an enterprise, the data warehouse must reflect its complexity.
Usually, a centralized warehouse will draw from a wide variety of data sources. It will be a
big job for IT to transform and combine the data into forms better suited for analysis,
especially when large volumes of non-relational mainframe data is used. This data will also
need to be “cleaned” – an effort-intensive process of removing inconsistency and
redundancy that needs to involve users. In a centralized data warehouse, it can be difficult
to motivate users to participate in cleaning data if they don’t see an immediate payoff for
themselves. (Note: I’ll discuss later how data usage directly contributes to data quality).
The pros and cons of the central warehouse are:
Pros:
- Engineered by IT as an enterprise system to ensure quality of design.
- Able to share components and thus justify industrial-strength tools for data cleansing and transformation.
- Generally use high-performance platforms able to scale and handle growth.
- Highly manageable.

Cons:
- Big and complex to maintain.
- Take a long time to plan and implement, because projects try to accomplish too much.
- Expensive (often three to five million dollars).
- Come up against organizational and political walls.
Independent Data Mart
The data mart is typically a specialized data warehouse that covers one subject area, such as
Finance or Marketing. Since a data mart is focused on meeting specific business user
needs, it is generally easier to get users more involved in the data quality, cleansing and
transformation activities. Because the audience is usually limited to a group of end users
performing the same function, a data mart avoids much of the interdepartmental conflicts
that occur between business units with different business processes, different world views
and different priorities.
In this architecture, a company builds a series of data marts that each model a particular
subject area. By narrowing the scope, a data mart avoids much of the complexity of
centralized warehousing – like trying to model the entire company. The initial data volume
and number of data sources may also be far less than the centralized architecture, which
may mean the marts can run on smaller, more cost-effective machines.
Figure 3a: Centralized vs. Mart Architecture
An independent data mart is a stand-alone system that does not connect to other data
marts. The advantage of an independent mart is that its politics and technical complexity
are localized, making it the quickest and least expensive to deploy. The primary risk with
an independent mart arises when you build a series of them. When a mart reflects the
different business assumptions of its department, it can become an island of information
within the organization. A secondary risk is that the marts will each require diverse data
feeds from legacy sources. Also, each mart stores its own transformations of the corporate
data (along with summaries and indexes) driving overall data volumes up 3 to 7 times.
The pros and cons of independent data marts are:
Pros:
- Narrower and more contained scope.
- Can be built relatively quickly.
- Lower initial costs for hardware, software and staff.

Cons:
- Hidden costs, which multiply as you build a series of marts.
- Potential duplication of effort.
- Can be poor in quality and inconsistent.
- Tend not to scale up very well.
Dependent Data Marts, with Distribution Data Warehouse
Another architecture, the dependent data mart, shares most characteristics of the
independent marts – that is, it still concentrates on one subject area. The primary difference
is that a dependent mart “depends” on a central data warehouse that stages and transforms
data for many marts. This architecture does a good job of balancing department needs (for
flexibility and focus) with enterprise needs (consistency, control and manageability).
The big problem with dependent marts comes when management decides they will not
actualize any marts until the central warehousing facilities are in place – which could take
years.
Operational Data Stores - with Marts and Data Warehouse
Yet another architecture incorporates the use of Operational Data Stores (ODSs) with
dependent marts and a distribution warehouse. ODSs are usually deployed for tactical
reasons.
An ODS is basically a subject-oriented, time-variant and volatile (frequently changing or
updated) image of production system data. An ODS may be implemented over a star
schema, but this schema may be quite different from downstream warehouse schemas; the
ODS schema and content tends to be geared towards very short-term tactical analysis rather
than trend analysis. The data in an ODS may, or may not, be cleansed.
An ODS may be implemented for one or more of a variety of purposes:

- Better organized and more accessible view of production data. Production data is
often stored in indexed files or old DBMS types, such as IMS. Users accessing these
stores require specialized skills, and are usually programmers. When production data
is surfaced in an ODS based on relational technology, it becomes more available to
growing numbers of analysts through highly friendly ODBC-based tools.

- Production system extension. Once production data is available in a more accessible
form, production system analysts often want to leverage it. This can lead to extensions
to production systems that operate on ODS data. These applications may generate
"new data" that never existed in the production systems; when this occurs it can be
useful to propagate the new data "back" into production databases. When ODSs are
used in this manner, it is easy to see how "the line gets blurred" between production
systems and the ODS, leading to some very significant management challenges.

- Production system migration. An ODS can be used as part of a strategy to migrate
production systems from antiquated DBMSs to new (usually relational) ones. Using a
populated ODS that "mirrors" the content of production data, new applications can be
developed and tested, while the old applications continue to serve their existing
functions. Eventually, the old applications and DBMS can be phased out.

- Short-window tactical analysis. The mainstream purpose for most ODSs is to
provide "current moment" tactical analysis of the business. This differs from typical
data mart analysis, which often covers longer-term trends, and depends on stable,
immutable data.

- Staging point on the way to a distribution warehouse. Very often, data from an ODS
gets loaded into the data warehouse (subject to established data cleansing and
transformation rules). This data propagation must be managed carefully, and on a very
time-sensitive basis, to ensure that the data contains a proper subset (periodic slice) of
information; in other words, such propagation must be planned around the fact that the
ODS data is frequently refreshed and updated.
The following figure illustrates the options available when an ODS is part of the data
warehousing architecture:
Figure 3b: Environment with Operational Data Store
The operational data store concept can be worked into most data warehouse architectures.
The decision for-or-against an ODS is usually based on the unique needs of the particular
data warehousing environment – especially the quality of, and accessibility to, production
system data.
Virtual Data Warehouse
A virtual data warehouse isn’t really a physical data warehouse at all, but rather a unified
way to access production data that resides on diverse systems. Thus, although it appears to
the warehouse users that they are working with a dedicated system, the data does not really
get moved into one – it remains in the operational stores.
This approach may meet the needs of less dynamic organizations that do not really need a full
fledged warehouse, or any organization not quite ready for real data warehousing. Perhaps
the most serious limitation of the virtual warehouse approach is that the data is not
cleansed, transformed or reformatted in ways that support better analysis, thus seriously
hampering its effectiveness.
Virtual warehouses may find use when a real data warehouse can't be cost-justified, for
proofs of concept, when an interim solution is needed while a warehouse is built, or when
only infrequent access to production stores is needed.
“Hub-and-Spoke” Architecture
The “hub-and-spoke” architecture attempts to combine the speed and simplicity of
independent data marts with the efficiency, manageability and economies of scale of the
centralized data warehouse. Shaku Atre terms this architecture the “managed data mart”
approach.
The managed data mart approach is very similar to the distribution warehouse and
dependent marts architecture. The company deploys marts across the enterprise and
manages them through a central facility where data is staged and transformed. This central
facility can be built incrementally over time.
The hub-and-spoke architecture is a state of mind about how data warehousing should be
managed. The “hub” delivers common services and performs common tasks – most
notably data preparation and management. The “spokes” include data sources (inputs),
central staging facilities, and the target warehouses, marts and user query applications
(destinations).
Figure 3c: Managed Mart Approach
This is a hybrid approach that combines data marts with a central data warehousing
component. The “hub” is not a full-fledged data warehouse. For example, it isn’t designed
to provide user data access; rather, the hub serves as a central "clearing house" and
management center for the enterprise's data warehousing capability. The hub receives data
from various sources; then cleans, transforms and integrates it – according to the needs of
the data warehouses, data marts, and user applications that it services.
The hub can function in two ways. It can serve as a transient data hub that does not
maintain any long-term data; in this mode, it holds data only long enough to clean it,
transform it and stage it until marts can accept it. Alternately, it can be a distribution data
warehouse that maintains cleansed data for the data marts. In this mode, it can grow in size
and function over time. Either way, because services are centralized, the hub can help a
company enforce consistent standards for the design of marts, and can dramatically simplify
the creation of new marts from existing data pools.
Of the architectures discussed, the hub-and-spoke approach allows scalability of every part of the
data warehousing capability: data sources, central component, and data marts. For large corporate
undertakings, experts seem to agree that it is the best approach. But it can also be noted that the
other architectures may have their place during the evolution of a data warehousing solution.
3.3 Warehouse Framework Architecture
A data warehouse “framework” is a means to define and understand the locations of data, access
to data, movement of data and operations on data. At a high level, the data warehouse
environment can be segmented into several interconnected "layers" of functionality:

- Production or External Database Layer
- Information Access Layer
- Data Access Layer
- Transformation (aggregation, cleansing, transformation and dimensioning) Layer
- Meta Data Directory Layer
- Process Management Layer
- Application Messaging Layer
- Data Warehouse Layer
- Data Staging Layer
These layers are illustrated below:
[Figure: operational and external databases feed, through data access and transformation
layers, into data staging and then the ODS, data warehouse and marts, which users query
through the information access layer; application messaging, the meta data directory
(repository) functions and process & activity management span the entire environment.]
Figure 3d: Data Warehouse Framework
I’ll discuss each of these layers briefly:
Production or External Database Layer
Production systems process data to meet daily business operation needs. Historically these
databases have been created to provide efficient processing for a relatively small number of
well-defined transactions; therefore, they are difficult to access for general information
query purposes. As noted earlier, there are also significant problems with data quality in
these stores.
Increasingly, large organizations are acquiring additional data from outside databases or
electronic transmissions. This information can include demographic, econometric,
competitive and purchasing trends.
Data Access Layer
The figure above portrays two different Data Access layers: one over the data warehouse,
and another over the production databases. Predominantly, SQL is the access language used
by Information Access tools to retrieve data warehouse information through this layer. In
contrast, the data access methods used over production and external data sources may be
quite different – and possibly quite archaic.
The Data Access layer not only spans different DBMSs and file systems running on the
same hardware, it also spans manufacturers and network protocols as well.
Transformation Layer
The Transformation Layer is responsible for the aggregation, cleansing, transformation and
dimensioning of data gathered from production and external sources. This is where the
bulk of the data preparation activities take place. This layer may use services of the Staging
layer to privately store intermediate data results during the transformation processes.
Data Staging Layer
An important component of the framework is the Data Staging layer. This layer handles the
necessary staging, copying and replication of data between the production systems,
preparation tools and data warehouse stores; in distributed warehouse architectures (such
as distribution marts or hub-and-spoke), this layer can also stage data movement between
marts.
Data Warehouse Layer
The core (physical) data warehouse layer is where the cleansed, organized data used as the
basis for DSS and OLAP is stored. This data is commonly stored in relational databases,
although multi-dimensional and OLAP databases are also used.
Information Access Layer
The end users of the data warehouse deal directly with the Information access layer. In
particular, it represents the tools that end users employ to access information, and the office
tools used to graph, report and analyze information retrieved from the data warehouse.
A number of OLAP and business query tools are available in this layer.
Application Messaging Layer
The Application Messaging layer is pervasive across the production system and data
warehousing environments. It is responsible for transporting data around the enterprise’s
computing network. The Messaging layer may include “middleware” – technology used to
transfer data across different platforms and “equalize” data formats between tools/
applications. Messaging can also be used to collect transactions or messages and later
deliver them to a specific location at a particular time.
Meta Data Directory Layer
It is hard to imagine a large-scale data warehousing solution that does not take significant
advantage of a repository to manage meta data about the warehousing and production
system environments. Meta data is the “data about data” within the enterprise. Meta data
about data structures, data models, production databases, value domains, transformation
rules, authorities, schedules (and many other subjects) is necessary to effectively manage
even a small data warehouse.
Some solutions from data warehouse tool vendors include a meta data repository, but the
common weakness of vendor-included repositories is that the scope of their meta data
collection is limited to the scope of the tool. This means that a large data warehousing
environment may end up with more than one "partial solution" meta data repository.
The Meta Data Directory layer provides access to the meta data repository (perhaps more
than one). Ideally, a common meta data repository can be implemented to aggregate data
from (or coordinate access to) the repositories that occur in the environment.
Process Management Layer
The Process Management layer is involved in sequencing and scheduling the various tasks
involved in building and maintaining the data warehouse, and in building and maintaining
the meta data repository.
Together, these layers provide the infrastructure for data warehousing activities and management.
4. What is Data Mart/Warehouse Management?
In this section, we take a closer look at data warehouse management. What gets managed, and
why is it important? Many roles and responsibilities are involved with managing successful data
warehouse implementations. Depending on the architecture selected, a significant number of
people may be involved; in small environments, a few people may serve multiple roles.
4.1 Managing Data Administration: Defining “Information”
Recall that the objective of data warehousing is to create an accurate and consistent view of the
business that can be analyzed, and from which decisions can be made. This means that the
business must be well understood, and the goal(s) of the data warehouse should be crystal clear.
But in large organizations, and even within departments, the definitions of the information to be
analyzed may be poorly understood.
Many enterprise resource planning (ERP) systems begin by defining the core business “objects”
involved in any business, and then proceed to define systems around these themes. A good
data warehousing solution does the same. This is about getting back to basics: clear and
unambiguous themes, single definitions, specified ranges of acceptable values, and so on.
Managing the definition of information sits at the heart of data quality management. Companies
are often reluctant to manage information quality, because it is a very cumbersome, largely manual
process unless it is supported by active repository services.
What does it mean to manage the “definition” of information? Let’s discuss some of the key
initiatives.
Terms and Semantics Directory (Glossary)
For most companies, it is essential that a corporate “glossary” or directory of terms be
compiled. This directory stores information about the terms that describe the business,
especially core business themes such as Customer, Vendor, Order, Line Item,
Invoice, Sales Region, Service Region, Mail Drop, Zip Code, Country and so on. As
companies grow, these terms can shift in meaning; for example, a term like “customer”
may come to refer to an internal, as well as external, entity. Good business communication
depends on everyone using well-understood terms that convey consistent semantics.
The term directory stores a semantic description of each business term, and may also
provide a cross reference to related information, including: sub-classifications of the term
(i.e. terms which refine the theme), business models the theme appears in, business
processes the theme is involved in, common value domains (range of valid values) for the
term, data types or structures or elements that implement the theme, business rules which
involve the theme, and so on.
The directory also stores terms that are company internal. Some of these terms include
acronyms for key management information systems, coined “business lingo” phrases,
various codes used in business systems, and so on.
Ideally, the terms glossary is part of a comprehensive business information directory (BID).
Value Domains and Abbreviations
Value domains specify the range of valid values for business terms and technical elements.
Value ranges for business terms may be general, while the value domains for technical
elements may be specific to the technical element.
Why are value domains important? Suppose you are querying a production database, and
filtering on the value of a column. You may (or may not) have visibility to the data type –
for example, that it is a YES/NO binary flag. Assuming you have this information, what
predicates should you use in your query? Probably only the production system
programmers know what values to expect in this column: NULL, a HIGH-VALUES code, a
LOW-VALUES code, Y, y, N, n, 1, 0, space, T, t, F, f, and perhaps other characters. This
explains the problem, and why an analyst constructing a query needs to know that a
YES/NO flag has valid values Y or N (or perhaps Y or unknown) … period.
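To make the contrast concrete, consider this sketch of two queries (the table and column names are hypothetical). The first must enumerate every variant the production programmers ever tolerated; the second runs against data conformed to a documented Y/N domain:

    -- Against the raw production store: enumerate every tolerated
    -- variant, and hope none was missed.
    SELECT COUNT(*)
      FROM order_master
     WHERE rush_flag IN ('Y', 'y', '1', 'T', 't');

    -- Against data cleansed to the documented value domain (Y or N):
    -- the predicate is trivial and trustworthy.
    SELECT COUNT(*)
      FROM order_fact
     WHERE rush_flag = 'Y';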
Value domains play a critical role in data warehousing practices, because they provide the
basis for assessing the quality of data from production systems, and “cleansing” it, before it
is made available in data warehouses. To support the cleansing operation, it is also useful
to identify the set of common erroneous values, and associate each value with the
appropriate correct value (or a single standard “unknown” value); this information can be
generated into “data conversion maps” employed by the data cleansing tools.
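A data conversion map can be as simple as a two-column table that the cleansing step joins against. The following is a minimal sketch, with hypothetical table and column names, defaulting unmapped values to a standard “unknown” code:

    -- The map pairs each observed erroneous value with its standard value.
    CREATE TABLE flag_conversion_map (
        raw_value CHAR(1),
        std_value CHAR(1)     -- 'Y', 'N', or 'U' for unknown
    );
    INSERT INTO flag_conversion_map VALUES ('y', 'Y');
    INSERT INTO flag_conversion_map VALUES ('1', 'Y');
    INSERT INTO flag_conversion_map VALUES ('n', 'N');
    INSERT INTO flag_conversion_map VALUES ('0', 'N');
    INSERT INTO flag_conversion_map VALUES (' ', 'U');

    -- The cleansing step substitutes the standard value; anything
    -- unmapped becomes 'U' and can be reviewed as an exception.
    SELECT o.order_id,
           COALESCE(m.std_value, 'U') AS rush_flag
      FROM order_master o
      LEFT OUTER JOIN flag_conversion_map m
        ON m.raw_value = o.rush_flag;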
Value domains also play a critical role in the development of on-line processing validation
routines. Poorly crafted or incomplete validation at data entry is a key cause of inaccurate
data in production databases in the first place.
Identifying a definitive set of corporate abbreviations for each primary business term is also
highly useful. These abbreviations may be organized by size, i.e. a 5-character
abbreviation, a 4-character one, a 3-character one, and so on. (Some organizations only
allow ONE abbreviation to be defined.) The valid sets of abbreviations are used by quality
standards initiatives, such as the standard naming conventions used by programmers and
data structure/schema designers.
Naming Conventions
Naming conventions apply the rigor of business terms, semantics and abbreviations to the
technical IT world. Naming conventions can be defined for various (usually technical)
elements such as file names, module names, data element/structure names; ideally they are
also defined for less technical things, such as business processes. Ideally, consistent
naming conventions should be outlined for virtually all items that comprise “information”.
Although naming conventions are usually attached to specific technical elements, it should
be clear that they also have important linkages to business terms, abbreviations and data
types. Good conventions are built around an ordered sequence that includes some
combination of type (or term), name, qualification, and context; these are sometimes called
object label standards. In such standards, abbreviations are used for types/terms and
contexts, and entity abbreviations (see next paragraph) are used for name and qualification.
These conventions have been employed by application generation tools for years.
Some organizations establish a standard for data element definitions that include a standard
“abbreviated name”. These abbreviated names are highly useful in the construction of
naming conventions that have length constraints (such as COBOL items).
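To illustrate how such a standard plays out, here is a sketch of a hypothetical object label convention (entity abbreviation + qualifier + type abbreviation) applied to a schema definition:

    -- Hypothetical convention: <entity abbrev>_<qualifier>_<type abbrev>
    -- where CUST = Customer, ORD = Order, DT = Date, AMT = Amount, CD = Code
    CREATE TABLE cust_order (
        cust_id       INTEGER,        -- Customer + Identifier
        ord_entry_dt  DATE,           -- Order + Entry + Date
        ord_total_amt DECIMAL(11,2),  -- Order + Total + Amount
        ship_mode_cd  CHAR(2)         -- Ship Mode + Code
    );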
Data Type and Format Standards
Data type standards define the common data forms (and formats) that your organization
understands as “information”. Some examples include: Text, Number, Date, Time, Time
Stamp, Identifier (ID), Code, URL, and so on. The standards also define the content
guidelines for each respective type.
Certainly most “year 2000” analysis projects can illustrate how many forms DATEs
and TIME STAMPs appear in. However, the data quality issues relating to other data types
remain largely unexplored - until now, in the face of data warehousing. For example, a
large manufacturing organization may have several different formats for the thing called
“part number” – perhaps some alpha and some numeric.
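A simple profiling query can expose how many format variants actually occur before a standard is imposed; this sketch assumes a hypothetical part_master table and groups part numbers by length as a first, crude cut:

    -- Count occurrences of each part_number length; each length
    -- cluster can then be browsed for its distinct value patterns.
    SELECT CHAR_LENGTH(part_number) AS pn_length,
           COUNT(*)                 AS occurrences
      FROM part_master
     GROUP BY CHAR_LENGTH(part_number)
     ORDER BY 2 DESC;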
It is highly useful to relate standard data types with the various enterprise assets that
implement them, including business themes, codes, data structures/schema, and so on.
Automated Data Quality Measurement
Data type, form, value and naming standards are high priorities for IT groups – but are very
difficult to manage unless adherence to the standards can be easily measured, and there are
ways for the standards to be actively deployed in IT systems; such problems can lead to
waning effort and attention. However, today’s e-commerce and data warehousing
initiatives are increasing corporate awareness of the impact that data quality has on the
ability to understand (query and analyze) and modify the behaviors (shift tactics) of the
business. These capabilities are highly correlated with competitive advantage.
Information Owners
It is common for the responsibility for data quality to go unassigned in sizeable
organizations. IT managers refuse to accept responsibility for the content of data that
“flows through“ their systems, business managers disavow knowledge of the data values
“stored in” IT systems, database administrators can’t control what applications put in data
stores, and users perpetuate and suffer from poor information content. Who should be
responsible?
Corporations serious about business engineering, and needing to leverage data quality in
data marts, are assigning explicit ownership for specific segments of information; a data
administration group typically takes on this role. The responsibility for quality is most
effectively assigned at the business unit or department level, although it can also be
assigned more centrally. There are some key benefits to assigning this responsibility:
 A central point of contact is established that immediately understands the quality issue,
and can prioritize the business value of multiple issues.
 A role with authority for the business unit gets first-hand, intimate knowledge of the
information available (and unavailable) to the unit, and a keen understanding of which
information is used (and which is not).
 A stronger link is forged between the goals of the business unit, and the IT data,
systems and marts that support them.
The “data quality analysis” and “data cleansing” operations of the data warehouse are key
opportunities to establish these standards, and to put them to active work in critical decision
systems.
Managing IT-related Meta Data
There is a significant amount of other IT-related meta data to be managed besides the
database schemas. These include, but may not be limited to:
 Meta data from traditional application analysis tools (such as ESW or EPM)
 Meta data from vendor-specific repositories
 Meta data from business process, conceptual design, or database design (CASE)
modeling tools
 Meta data from application environment and management tools (such as Tivoli)
 Meta data from middleware or messaging tools (such as MQSeries)
 Meta data from message brokers or component managers (ORBs, etc.)
 … and varied other sources.
Managing this varied body of meta data, in a meaningful way (by inter-relating it), is
perhaps the greatest challenge for IT (if not business management) in the coming years. It
also represents one of the greatest opportunities for companies in the repository and
information directory industry.
The challenge is in organizing and presenting the key information while avoiding the
common problems of myopia (seeing too little, too small a scope) and macro-phobia
(seeing a picture too big to be meaningful). The vehicle for presenting this information is
sometimes called a “business information directory”, a BID, or some similar name.
Certain industry professionals, including John Zachman, have spent careers in pursuit of a
highly usable information directory. Other organizations, such as Enterprise Engines, Inc.
have pursued an alternate strategy of integrating business application generation (Java-based) with a strategic business design framework. Both of these approaches serve to
shorten the cycle between a business decision, and a corresponding shift in the policies
enforced by business systems. This is the stuff that makes enterprises truly agile.
Without meta data correlation spanning business process definition, such agility is a pipe
dream for large corporations. A BID solution, or similar integration between business
planning and application system engineering, brings agility into the realm of feasibility.
But a BID, by itself, is not the full solution. Enterprises that are steeped in traditional
values and archaic management structures may experience great difficulty “re-engineering
themselves” to take advantage of this new wave of information.
The data administration function should play a role in defining the data transformation maps and
rules used in data warehousing; these management subjects are discussed in later sections. The
data administration role should also be held accountable for maintaining the meta data linkages
between production (or data mart) schemas and the associated business themes; this subject, too,
is discussed later.
4.2
Managing Business Segment Analysis & Mart Design
Good data mart design begins with analyzing the business segment that the mart will service. I
group the management of these practices together because they are heavily linked during data
warehouse development.
Exploration
Data exploration involves rummaging through, and creatively assessing, the ODSs or production
data available to build warehouses and marts from. This task requires a keen sense of the
business objectives and issues, and an open mind about how [seemingly obscure] data might be
used to derive facts and insights about the business. Exploration is similar to data mining; the
explorer may run a series of queries and look for patterns, or analyze the distinct values found in
columns of interest. But exploration doesn’t usually involve special technology; rather it is done
through schema analysis and basic database queries.
Schema Design
In his book, The Data Warehouse Toolkit, Ralph Kimball describes a number of data mart star
(dimensional) schemas geared toward analyzing different aspects of the business. These schemas
can be used as analysis models, and serve as a good starting point for thinking outside the
traditional IT “transaction-based, fully-normalized” box. Below are just a few, ranging from the
simple to more complex:
 Grocery Store (Retail analysis)
 Warehouse (Inventory analysis)
 Shipments (Discounts, Ship Modes and other analyses)
 Value Chain (demographics)
 Others include Financial Services, Subscriptions, Insurance, and non-fact tracking methods.
Building a dimensional data warehouse is a process of matching the needs of the business user
community to the realities of available data. That sounds remarkably like what IT business
systems are supposed to do in the first place. This disconnect - between what business users need
to run the business, and what IT systems typically supply – is the driver for the entire data
warehousing opportunity.
In practice, most data warehousing implementations become evolutionary, long-term projects.
The “facts” (synthesized out of production data and deposited in marts) offer new views of the
business. This leads to further analysis, confirmations and information mining, which in turn
ultimately lead to refinements in core IT business systems.
Why won’t data warehouses simply replace IT systems? The answer to this question remains: the
inherent differences between “transaction processing” and “analysis and decision support” keep
these two worlds from converging.
At least nine decision points affect the database design for a dimensional data warehouse:
 The business processes, and hence the identity of the core fact tables
 The grain (data granularity) of each fact table
 The dimensions of each fact table
 The facts, including pre-calculated facts
 The dimension attributes, with complete descriptions and proper terminology
 How to track slowly changing dimensions
 The aggregations, heterogeneous dimensions, mini-dimensions, query modes and other
physical storage decisions
 The historical duration of the database, and
 The urgency (frequency) with which the data is extracted and loaded into the data warehouse.
These factors must be addressed essentially in the order given. As fact tables and dimensions are
determined, the mart designer can begin to work with Database Administration to develop the
mart’s database schema.
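As a rough sketch of where these decisions land, a simple retail star schema (names hypothetical, keys simplified) might be declared as follows:

    -- Dimension tables: one row per time period, product and store.
    CREATE TABLE time_dim (
        time_key      INTEGER PRIMARY KEY,
        sale_date     DATE,
        fiscal_period CHAR(6)
    );
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        sku         CHAR(12),
        brand_name  CHAR(30)
    );
    CREATE TABLE store_dim (
        store_key    INTEGER PRIMARY KEY,
        store_name   CHAR(30),
        sales_region CHAR(20)
    );
    -- Fact table; the grain is one row per product, per store, per day.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim,
        product_key  INTEGER REFERENCES product_dim,
        store_key    INTEGER REFERENCES store_dim,
        units_sold   INTEGER,
        dollars_sold DECIMAL(11,2),
        PRIMARY KEY (time_key, product_key, store_key)
    );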
Locating Data Sources
The search for relevant data sources is another responsibility that the mart designer may
undertake. I discuss this topic later in Managing Data Acquisition.
Prototyping
A significant role of mart design is prototyping the mart and testing its behavior. Queries must be
developed to use the new schema; these access paths will drive requirements for how the mart
tables will be indexed. Indexing is one of the key performance tuning mechanisms available to
database designers, and one that results in the most DASD resource consumption. Finally, queries
must be tested to determine that they are using optimum data access paths, and using indices as
expected.
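The prototyping loop might look like the following sketch (names hypothetical): run a representative query, index the attribute it filters on, then confirm with the DBMS’s explain facility that the index is actually chosen.

    -- A representative analysis query filters facts through a
    -- dimension attribute.
    SELECT SUM(f.dollars_sold)
      FROM sales_fact f, store_dim s
     WHERE f.store_key = s.store_key
       AND s.sales_region = 'SOUTHWEST';

    -- Index the filtering attribute; re-check the access path after.
    CREATE INDEX store_region_ix ON store_dim (sales_region);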
4.3 Managing Database Administration
Several kinds of database administration (DBA) are necessary to support large-scale data
warehousing projects. In this context, I refer mainly to the schema management of production
systems, the schema management of warehouses and marts, and cooperative administration of
meta data models (which reflect enterprise and technical themes) in the corporate or tactical
repositories.
Database administration should get involved with the valid data value domains for technical data
structures, since the understanding of possible value ranges and scope can contribute to data
normalization decisions. And the database administration group should be held accountable for
actuating naming conventions and format standards in the schemas and data structures produced
for IT.
Managing Production Data Store Schema
The Database Administration function should have accountability for the schema design,
implementation and tuning of production data stores. I use the term “accountability”,
because in some organizations this function may simply oversee these operations - due to
staffing constraints of the DBA function.
This group must promote and police the adherence to standards for naming conventions – especially for new schemas and shared record layouts, but also for existing schemas … if
management and resources allow for this re-engineering.
Managing Data Warehouse, Mart and ODS Schema
The Database Administration function should have accountability for the schema design,
implementation and tuning of warehouse and mart data stores as well.
In organizations new to data warehousing, the fact and dimension theme designs should
ideally be done by (or in conjunction with) the mart designer; the mart designer will have
keen sensibilities about the analysis that the mart must support, and care should be taken
that these insights are not contaminated early in pursuit of blind data optimizations.
As the database administration function matures in the management and design of mart
information, it can play a key role in re-using the schema design patterns that have been
deployed in the past, and any persistent “clean data” pools in the warehouse.
It is even more critical that the DBA group adheres to standards for the mart and data
warehouse schemas. A less technical class of user (a business analyst) will be using these
table and column names in countless queries. Queries against different marts may later be
correlated into aggregate reports, so semantics must be consistent.
To accelerate usage of mart schemas, the DBA can prototype a number of basic queries that
illustrate valid usage patterns to prospective end users.
Applying Data Types, Naming Conventions and Format Descriptors
The DBA group, in cooperation with the data operation group, shares responsibility for
managing data types. This function includes applying the data standards of the organization
(including valid content and format), and perhaps more importantly, developing an
auditing and continuous improvement plan for data structures used in implemented systems.
This can be a political issue: the time the organization spends improving its data quality
and understanding may be viewed (by some) as better spent on activities that more
directly contribute to strategic initiatives. In other words: do we decide to prepare
ourselves for growth, or skip over that detail in order to grow sooner?
Also see the related management topics, Managing Data Valid Value Domains and Managing Data
Traceability, below.
4.4 Managing Data Valid Value Domains
An administrative group must be assigned to pay special attention to the valid value domains for
technical elements (codes, IDs and so on). These domains, and supplementary cross-references of
“bad values” to appropriate domain values, will greatly aid the data cleansing functions of the data
warehouse.
This responsibility may be assigned by business unit; if the central data warehouse “distribution
hub” is adopted, it could be coordinated centrally. The most essential aspect is that the
accountability for data quality remains close to the end users of the data. See the related
Managing Data Quality topic below.
4.5 Managing Data Extraction
Data extraction is the process of pulling the data from production data stores that will
ultimately be used by one or more marts. Very often, this is a complex process that involves using
3rd-party extraction tools, which are usually applicable to commercial database sources. It may
also involve home-grown solutions to extract data from complex data structures, or odd-form data
stores peculiar to a given IT environment.
I break this section into two segments: qualification and extraction. These functions typically
occur at different times during the data warehouse development project, and may be performed by
different organizations.
Identifying and Qualifying Candidate Stores to Draw From
Ahead of the task of extraction is the task of identifying and qualifying the “candidate”
production data stores that data could be drawn from. The enormity of this task grows with the
size, number and distribution of the production system data stores. In exceptionally large global
organizations, these stores could be managed in multiple geographic locations.
This task may not be performed by the data extraction team; if the mart designer is fairly technical,
he/she may conduct this research. However, there are real benefits to having the data extraction
team involved with “data sourcing”. First, this gives the extraction team early information about
the formation of a mart - most importantly, some insight into the problem to be solved; this
knowledge can be important for selecting the right sources. Secondly, the extraction team will
tend to be a fairly technical group with intimate knowledge of problems or incompleteness in
certain data pools; this should greatly help in the selection process.
In a complex environment, data mining might be employed to locate suitable data sources. Data
mining relies on specialized tools to search production stores for certain clusters of information.
Managing Queries and Data Extraction
A key challenge in data extraction is to ensure that the data content extracted matches the expected
scope of prospective mart users; this means applying a number of filters or selection criteria.
Close communication is important on these subjects; this is another reason why the extraction
team should have a close relationship with the mart design team.
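An extraction query typically encodes the mart’s scope directly in its selection criteria; a hedged sketch with hypothetical names and filter values:

    -- Extract only the slice the mart will analyze: one region,
    -- completed orders, and the current monthly load window.
    SELECT order_id, customer_id, order_date, order_total
      FROM order_master
     WHERE sales_region = 'SOUTHWEST'
       AND order_status = 'COMPLETE'
       AND order_date >= DATE '1998-08-01'
       AND order_date <  DATE '1998-09-01';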
The extraction team manages the queries and tools, and/or develops programs to accomplish the
extractions. For these tasks, intimate knowledge of the data structures, or ready access to accurate
meta data about the structures, is crucial.
For most data warehousing situations, a given data extraction will be performed on a very precise
schedule, with very particular dependencies on completion of specific operation phases. For
example, extractions of data from the inventory database might be dependent on completion of
data entry for today’s warehouse shipments. Honoring these dependencies is key to providing
accurate and semantically complete information to later processing steps. See the related topic
Scheduling and Dependencies below.
Data extraction for data warehouse purposes is necessarily “downstream” from the data
administration of the production system data stores. When data structures in the production data
stores change, the relevant DW extraction process must immediately be altered to stay in sync;
otherwise, the data warehouse population mechanisms will break down.
4.6 Managing Data Transformation
Data Transformation is perhaps the “muddiest” topic associated with data warehousing. I discuss
it separately, so that I can be clear about its definitions. But in fact, data transformations may
occur in concert with other operations, such as Data Extraction. In this section, I cover several
transformations common to data warehousing.
Before proceeding, it is worth mentioning that a number of other utilities come into play that don’t
do actual transformations. The most obvious ones are simple sorting and merging utilities; others
include database unloads and proprietary 4GL extractions. While not complex, these utilities
consume time, and must be represented in the overall dependency scheme that populates a
warehouse.
The management of any data transformation activity, such as those discussed below, requires
knowledge of data element and record key offsets within the data stores being processed. Also
required are the sequences of, and dependencies between, transformation activities, the schedules
of when they are performed, and the associations between the originating, intermediate and final
data stores involved during any transformation thread. Management needs specific to particular
transformations are discussed in relevant sections below.
Data Conjunction
Data conjunction is the process of gathering data from different sources, and potentially different
formats, and transforming it into a common record profile. For example, customer information
may be gathered from several business subsystems or regional databases; each of these sources
may impose different constraints or value formats on the data. In order to make this data
available, in total, to subsequent data warehouse processing, it must be transformed into a
consistent format.
The management of conjunction requires knowledge of: each data schema that will be merged,
the target schema, mappings of the source schemas to the target one, incidental transformations
associated with the mapping, the job streams or programs that perform the operation, and the
scheduling of the operation.
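For illustration, a minimal conjunction of two hypothetical regional customer stores might normalize formats in-line while merging the rows into one common profile:

    -- Each regional source imposes its own names and formats;
    -- normalize both into the common customer record profile.
    SELECT cust_no          AS customer_id,
           UPPER(cust_name) AS customer_name,
           'EAST'           AS source_region
      FROM east_customer
    UNION ALL
    SELECT CAST(client_key AS INTEGER) AS customer_id,
           UPPER(client_nm)            AS customer_name,
           'WEST'                      AS source_region
      FROM west_customer;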
Data Aggregation
Data aggregation is the process of making meaningful summarizations of data before the data is
made available to subsequent data warehouse processing. Typically, these summarizations are
made along the “dimensions” that will ultimately be used in data marts. See the related topic Data
Dimensioning below.
For example, in Figure 2a I discussed a fact table, called Sales Fact, that served as the
foundation of a star schema. Columns in this table, such as dollars_sold and units_sold represent
aggregations of information along the dimensions of time, product and store.
It is possible for data to go through several levels of aggregation, especially if it is used in multiple
marts with different focuses. For example, detailed sales data might be aggregated to feed a
regional sales analysis mart; then later re-aggregated to feed an enterprise-global inventory
movement analysis mart.
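In SQL terms, an aggregation step is essentially a GROUP BY at the fact table’s grain; a sketch using the hypothetical tables from the star schema sketch above:

    -- Roll staged detail transactions up to the fact table grain
    -- (product, store, day).
    INSERT INTO sales_fact
        (time_key, product_key, store_key, units_sold, dollars_sold)
    SELECT time_key, product_key, store_key,
           SUM(quantity), SUM(quantity * unit_price)
      FROM sales_detail_staged
     GROUP BY time_key, product_key, store_key;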
The management of aggregation requires knowledge about the schema of the detailed data,
schema of the target data (usually a fact table and dimension tables), mappings of the source
schemas to the target one, the job streams or programs that perform the operation, and the
scheduling of the operation.
Data Cleansing
Data cleansing is the process of making data conform to established formats, lengths and value
domains so that it can be effectively ordered, merged, aggregated and so on for use in data marts.
Data cleansing is necessary because, typically, production data is subject to a number of quality
problems. These problems are introduced through poor data entry validation, electronic data
transmission errors, and business processing anomalies. Most aging business systems have “hard
coded” business rules (informally maintained over the years) which reject, tolerate or fix data
quality anomalies on the fly (during processing). But in order to make this data usable for
decision support systems, or any relational query mechanism, the values in the data must be made
to conform to consistent standards.
The management of data cleansing requires additional knowledge of the valid value domains for
technical data elements. The linkage between a data element and a value domain may be through
a business element or technical element definition, if such information is gathered in a repository.
See also Managing Data Valid Value Domains above, and Managing Data Quality and Usage below.
Data Dimensioning
Data dimensioning is the process of deriving a fact table, and corresponding dimension tables,
from cleansed data; it is a specialized kind of aggregation (described above). Dimensioned data is
the cornerstone of most data warehouse analysis.
The management of data dimensioning relies on knowledge and semantics of the dimensions, and
of their relationship to a particular “fact”. This meta data should originate from the business
model, and be correlated with the “mart themes” which support that model. Dimensions should
also be associated with specific star schemas that implement them. All these relationships can be
maintained in a repository.
4.7 Managing Data Traceability
Providing data traceability is perhaps the most challenging aspect of data warehousing. Users of a
particular warehouse or mart require knowledge of the production stores and ODSs that the data
was drawn from, and of the transformations made to it. Other trace information may also be
required, such as the schedule on which the warehouse is augmented, and the cut-off times for the
source data stores and processing.
There are two sides to this information: the “operations plan” for the warehouse, and the history
of activities involved in warehouse operations (i.e. how the plan was executed). In this section, I
discuss the operations plan, and defer the subject of tracking history to the section Managing
Operational History below.
Virtually every data movement in the preparation and operation of the warehouse, in a sense,
defines the warehouse. This is why traceability is so important. Information must be associated
with specific sources, and be confirmed to follow well-defined semantics, in order to be useful in
decision support systems. Otherwise, the information may carry unacceptable risks.
In this section, I identify many of the facts and activities that must be traceable:
 Extraction or Query. Any program or query definition used to extract data from a data
source according to well-defined criteria.
 Movement. Any movement of data into/from a process, onto/off archival media (tape), or
transmitted through an electronic channel.
 Transformation. Any transformation of data into a similar or different configuration. Also
the systematic cleansing of data.
 Dependencies. Any dependency of one process on another process, of any process on data,
and of any data on process. Also, the correlation between business concepts and the domains
and marts in the warehouse.
 Schedules. The identified times (and durations if appropriate) when source data must be
ready, when an operations activity shall take place, when warehouse data is made
available and when warehouse access will be predictably suspended. Planned dates/times of
changes to current schema, activities, roles, ownership, dependencies, and so on (engineering
change control).
 Content Ownership. The person, role or organization responsible for the definition of the
content, and the policies that govern its quality.
 Operation Ownership. The person, role or organization responsible for a particular
movement or transformation conducted by warehouse operations.
 History. A transactional history of activities managing the warehouse, including
consummated engineering change control milestones, exceptions, corrections, reorganizations
or disaster recovery.
 Exceptions. Any deviation from the operations schedule (beyond policy tolerance) or failure
(beyond tolerance) of any planned activity or process.
 Corrections, Recovery and/or Re-loads. Any adjustments to warehouse or mart data (which
is typically immutable). Such adjustments may be required to recover from a processing
exception or database disaster.
 Rules and Policies. The rules that define warehouse operations, especially those
implemented through the operations schedule. The policies that describe the acceptable
tolerances for activity completions (such as return codes) and the actions to be taken upon any
exception.
 Tools Used. The tools used in any operation, activity or aspect of the warehouse. These
range from schema design tools, meta data gathering and management tools, utilities, and any
extraction, transformation or movement tools. This also includes the tool that implements the
scheduler.
These traceable elements are described in greater detail in the adjoining sections of this paper.
4.8 Managing Schedules and Dependencies
All operation activities, engineering change control, data readiness deadlines and data availability
windows are best managed through a central (or regional) scheduling system. Ideally, the
scheduling system should reflect the operations and plans of the warehouse, expressed in terms of
the meta data defining the objects and activities; likewise, historical information (that may be
gathered from the scheduling system and execution environment) should refer to the same meta
data definitions. For this reason, common “job schedulers” may be inadequate for the purposes of
operating the warehouse.
The schedule will reflect a number of critical dependencies that must be maintained between
warehouse activities, in order to maintain the integrity of the warehouse and associated marts.
These dependencies must be expressed and accessible so that a “recovery schedule” can be
constructed on-the-fly (manually or automatically) when a processing exception disrupts
subsequent portions of the schedule.
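One way to make dependencies expressible and accessible is to record them as meta data that can be queried when an exception occurs; a minimal sketch with hypothetical names:

    -- One row per dependency: the successor cannot start until the
    -- predecessor completes.
    CREATE TABLE activity_dependency (
        predecessor CHAR(30),
        successor   CHAR(30)
    );

    -- When EXTRACT_INVENTORY fails, list the directly impacted
    -- successors that a recovery schedule must re-sequence.
    SELECT successor
      FROM activity_dependency
     WHERE predecessor = 'EXTRACT_INVENTORY';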
An intelligent schedule management system may allow tailoring according to rules. Examples of
rules include: election of an alternate processing flow if a specific exception occurs, or automatic
resumption of a pending process when a required data store becomes available.
A good scheduling system is essential for managing the day-to-day “threads” of activities involved
in data extraction, transformation, movement or transmission, warehouse loading, and warehouse-to-mart data distributions.
4.9 Managing Data Quality and Usage
Without data quality, even the most rigorously managed warehouse has little inherent value. Data
quality is influenced by every extraction, conjunction, transformation, movement, and any other
data warehouse related activity. It is influenced by the right data being ready at the right time;
and it is influenced by the valid value domains used in cleansing the data. Considering all the
things that can go wrong, most data warehousing sites agree that data quality must be managed as
a specific risk.
In Data Quality and Systems Theory, Ken Orr discusses managing data quality through a
feedback-control system (FCS). “From the FCS standpoint, data quality is easy to define: Data
quality is the measure of agreement between the data views presented by an information system
and that same data in the real world.” Orr discusses a number of data quality rules deduced from
the FCS research:
 Data that is not used cannot be correct for very long.
 Data quality in an information system is a function of its use, not its collection.
 Data quality will, ultimately, be no better than its most stringent use.
 Data quality problems tend to become worse with the age of a system.
 The less likely some data attribute (element) is to change, the more traumatic it will be when it finally does change.
 Laws of data quality apply equally to data, and to meta data (data about data).
Clearly, if an organization is not using data, then over time, real world changes will be ignored and
the quality of that data will decline. Orr relates this behavior to the scientific phenomenon of
atrophy, i.e. if you don’t use a part of the body, it atrophies. Something similar happens to
unused data – if no one uses it, then there is no basis for ensuring its quality.
What does this have to do with data quality? Production system schemas are chock full of elements
that were included “in case someone might want the information later.” This data, even if populated
with some consistency, has inherently poor quality because no users rely on it (perhaps since
the inception of the system). It is a common mistake for warehouse designers to assume all
elements existing in production system schemas are fair game for DSS; in truth, only a handful of
production system programmers know anything about the quality of many elements.
Not only does data quality suffer as a system ages, so does the quality of its meta data. This
begins as the people responsible for entering the data learn which fields are not used; they then
either make little effort to enter correct data, or they begin using the data elements for other
purposes. The consequence is that the data and the meta data cease to agree with the real world.
Very often, corporate accountability for data quality is assigned within the business units that
consume it. Successful efforts follow a “use-based” approach of audits, re-design, training and
continuous measurement.
Use-based Data Quality Audits
Auditing can be done through statistical sampling of the data pool. Elements can be ranked
according to use-based criteria including:
 How interested are users in the data?
 What is the data model, data design and meta data?
 Who uses the data today, how is it used, and how often is it used?
 How do data values compare to real world perceptions?
 How current is the data?
Quality audits should be conducted on a regular basis; besides addressing gross
anomalies, auditors should also identify trends from a sequence of audits.
Data Quality Re-Design
In order to improve data quality, Orr says it is mandatory to improve the linkage between data
throughout the system. The first step is a careful examination of which data serves a critical role,
and how that data is used. Data usage is typically manifest in two areas: in the basic business
processes, and in decision support.
The goal of use-based redesign is to eliminate the flow of extraneous data through the system, and
identify inventive ways of ensuring that mainstream data is used more strenuously. Only so much
resource is available to the quality assurance effort. The bottom line is this: If certain data
cannot be maintained correctly, then it is questionable whether that data provides any value to
the enterprise; perhaps it should be eliminated.
Data Quality Training
Both users and managers must come to understand the fundamentals of data quality before quality
can improve. These parties must understand the steps taken in the organization to refine data
quality, and understand the risks to data quality when data is not used or used infrequently. It is
unreasonable to expect that users and managers will intuit these facts on their own. General
training ensures that a common perspective of data quality is promoted across organizations.
Data Quality Continuous Improvement
Management controls must be established to ensure that Data Quality policies and procedures are
followed; a key component of these controls is measurement. Measurement and quality programs
go hand-in-hand.
As Data Quality Re-Design occurs, the Data Quality Audits must be re-done for the redesigned
systems. This begins the cycle of improvement for the new system. Data that is truly vital must
be physically sampled and audited, and ideally these audits should be verified by authorities external
to the local organization responsible for quality.
Data Quality S.W.A.T. Tactics
The departmental data quality authority must be mobilized to address quality problems quickly,
especially when these problems are polluting the warehouse data pools. Accountability must be
assigned to review and trace data anomalies that appear – errant data in the data warehouse pools,
exceptions that shake out of the data cleansing and transformation processes, and data redundancy
that should get resolved during data conjunction.
In summary, the problems with data quality that must be actively managed are:
 Dirty data – inaccurate, invalid or missing data
 Redundant data – comes from redundant records, with or without common keys, often with
conflicting values
 Inconsistent data – resulting from inconsistent transformation processes or rules.
An organization that is not dedicated to the active management of data quality is likewise not
dedicated to a fruitful data warehouse environment.
Data Cleansing Management
In the section Managing Data Transformation above, I discussed the management of data
cleansing processes. It is important to draw a distinction between managing the data cleansing
activities, and managing the goals of those activities. The cleansing activity itself is largely
mechanical; but the definition of the cleansing process (including the application of standards,
value domains and conjunction matrices) is fairly creative work that often requires significant
subject area expertise. The latter is better managed outside the operations group, and with
departmental (users of the data) accountability.
4.10 Managing Data Warehouse Architectures
Data warehousing environments of any size must be designed with a particular architecture in
mind. A common problem occurs when successful mart implementations are re-scaled, or
combined in some fashion, to become centralized data warehouses. The success of the data
warehouse project is significantly compromised when the warehouse architecture is simply left to
“happen”.
Architecture must be a conscious effort. The independence or dependence of specific marts must
be well known. The economies of scale possible through hub-and-spoke architectures must be
calculated. The short-term and long-term usage patterns of marts and ODSs must be assessed and
managed.
In many respects, data warehouse environments are “living, breathing animals”. They consume
massive amounts of physical and people resources. Some portions dynamically grow while other
portions decline. Requirements are continually shifting based on DSS feed-back loops. DSS
systems are often the most expensive initiatives in the enterprise, and therefore the most heavily
scrutinized.
Some of the architecture aspects to be managed are:
 Logical network of warehouse, marts, ODSs and staging domains to identify dependencies
and key information flows
 Functional and logistical analysis of where and when data extraction, transformation, staging
and distribution occurs
 Physical network of the platforms, facilities and middleware in the environment used to
implement the warehouse plan
 Resource load planning and budgeting for initial and subsequent growth phases of the
warehouse project. Coordination of MIS, IT and Data Administration resources
 Selection of appropriate tools, or design of home-grown solutions, to enable the operations
and management of the warehouse
 Strategic planning and oversight of the growth plan
It is often necessary to dedicate specific resources to the development and growth of the data
warehouse architecture; otherwise, resources can get torn between the priorities of the warehouse
environment, and other operational priorities in the enterprise.
4.11 Managing Operations, Facilities and End User Tools
Once planning and implementation are underway, running the warehouse on a day-to-day basis is
largely a job of scheduling and logistics. Highly integrated activities cannot be simply “farmed
out” to satellite organizations without a strong infrastructure in place. Even when a strong
infrastructure exists, the rate of change during early warehousing efforts can be significant.
The logistics manager of the warehousing project needs a clear view of the hardware, software,
tools, utilities, middleware, physical resources, intranet and internet capabilities, messaging
services and other facilities that are employed in running the warehouse. Secondarily, s/he needs
current information about the loading, or the available capacity, of these resources. These
requirements are a tall order. Facilities are augmented all the time, and capacities can fluctuate
dynamically.
By definition, managing these operations includes estimating, planning and monitoring transaction
and query workloads. Capacity planning is crucial for ensuring that growth is managed, and that
intermittent crises don’t bring the warehouse environment to its knees.
The tools employed by the end users of the warehouse (primarily query and OLAP tools), and the
catalog of existing queries, are perhaps the most visible elements of the warehouse on a daily
basis; this is where “the rubber hits the road” and the value potential of the warehousing effort is
realized. Users continually and subjectively assess the quality and performance of the warehouse.
Since satisfying users is the objective of the data warehouse project, it is important that the user
community has some visibility to the logistics involved in operating the warehouse. This visibility
includes the planned refreshment cycle, operational milestones (stages of processing), availability
of certain data to the central warehouse, availability of new data to marts (distribution schedule
and duration), and the “release“ of data for official use following verification. This visibility
allows the user community to better understand the operational capabilities and constraints of the
“warehouse engine”, implicitly sets expectations about quality and availability, and engenders a
team dynamic between the users and deliverers of the warehouse.
4.12 Managing Basic Database Operations
Data warehouses, ODSs and marts are largely implemented using commercial database
technology. This technology requires particular patterns of administration including: backups,
replication, reorganization, data migrations, column additions, index adjustments, recovery,
maintenance and so on.
In the data warehouse environment, the administrative overhead of these activities can be
significantly higher than that of the production system environment, because of the dynamics of
the warehouse and logistical complexity. Additional resources should be allocated, if not
dedicated, to handle this workload; the time and resource overhead of these activities must also
be factored into logistics plans.
4.13 Managing Rules, Policies and Notifications
A data warehousing environment is managed through a number of rules and policies that affect
day-to-day decision making and ultimately contribute to information quality. For example, what
happens if one of the sources for a data conjunction is not available due to some catastrophe?
Should the data conjunction still be performed, and the data warehouse loaded anyway? Do
users need to be notified?
Situations such as these need to be considered, and policies established to handle them, well in
advance of their occurrence. Policies must reflect the critical nature of the data, and the effects on
the business users when current data is not available. Such policies may vary by data mart, since
criticality of data and sophistication of users likewise vary. The important thing is that they be
established before a crisis occurs.
Communication becomes paramount when the operations of the warehouse are spread across a
number of organizations. Peer organizations must be notified when problems disrupt the
operations schedule. Data Quality administrators must be notified when quality thresholds are
exceeded. Users must be notified when there is a change in delivery schedule or databases are
unavailable due to failure or maintenance.
Meta data about the data warehouse and mart environments, and the people/roles responsible for
critical events, makes effective communication possible. Ideally, the complete “supply chain” of
data delivery should be available to a particular participant in this environment. For example, a
mart user should be able to determine what queries are available, where the mart data comes from,
who is responsible along the way, and the cycles of data availability and refreshment. An
operation task manager should be able to determine which predecessor operations he is impacted
by, which successor operations he affects, who the quality authority is for an operation, and which
marts are ultimately affected. A production systems manager should be able to map production
data to business themes, and trace the threads of business data from the production system to the
data warehouse and marts. Each of these roles needs to know who to notify when something goes
wrong (and who to praise when things go right).
Rules and policies take special effort to administer because they can be so difficult to automate.
4.14 Managing Meta Data Definition, Population and Currency
Throughout this paper I’ve identified the importance of meta data in managing the data warehouse
environment. Meta data identifies the objects to be managed, the processes, the responsibilities,
and the higher critical thinking about business themes and processes, data quality, and mart
objectives. Meta data must be managed with much the same (or greater) rigor as business data,
because it ultimately influences how business data is utilized and understood.
The previous section, Managing Rules, Policies and Notifications, discussed the critical role that
meta data about supply chains, roles and responsibilities can play in keeping the warehousing
community well-informed. An active meta data repository can be a powerful vehicle for active and
passive communication across management, operations and end users. When key roles in the
warehouse environment keep their meta data current, anyone else can assess the state of the
environment for themselves: this is passive communication. Conversely, when a crisis occurs,
meta data can be used to ensure that impacted roles are actively notified.
The value of the data warehouse is built on credibility. Such credibility is possible only when
there is a common understanding of business themes and processes, and of the data that business
decisions are made from.
Many of the same decisions made regarding data mart data must likewise be made for meta data.
When an enterprise-wide repository is populated, it essentially becomes the “meta data mart”. So
the following decisions must be made about the data:
 Who is the user of the data, and what is that user trying to accomplish?
 How should the data be organized to make it more accessible to the target users?
 How shall change control be managed? As meta data evolves, how important is versioning of historical configurations?
 How often, and through what methods, will meta data be refreshed or augmented?
 How should the threads between business themes and initiatives, production systems, database schema, standards, content quality and responsibilities be modeled in the meta data? How will these threads (relationships) get established? Which are most crucial?
 What interfaces are necessary to make the relevant meta data available to the global audience?
 How will sensitive meta data be protected?
 Who will manage the integrity of the meta data? The Data Administration group?
… and so on.
Effective meta data management affects the entire operation. The information systems (IS) group,
which often gets saddled with most of the data warehouse operation responsibilities, is most
directly impacted. It is a common perception that IS is responsible for the integrity of the data and
how it is used, so IS has a vested interest in making sure everyone uses it the best way they can;
otherwise, repercussions will come back to IS.
4.15 Managing History, Accountability and Audits
Data warehouse projects are some of the most mission critical, expensive and visible initiatives in
the corporation. A significant number of operations go into the day-to-day care and feeding of a
large warehousing environment; there are many opportunities for the objectives of the warehouse
to be compromised. As DSS activities become more crucial to the business, it is necessary to
establish certain “assurances”.
Corporate auditors may perform regular operation audits (over and above those performed within
the immediately responsible organizations). Such audits may require that logs be captured of
critical processing, and that a record of local audits, change control and oversight be maintained.
Ideally, logs should capture measurement criteria. For example, daily reports from the data
cleaning operations might indicate the volume of spurious values in crucial value domains. These
reports can be logged; any exceptions over prescribed thresholds can be escalated for immediate
attention. If the steps taken to resolve these problems are also logged, then the auditors have a
check and balance, and a measure of performance (time to correct the problem). Logs such as
these must be retained for a prescribed period, perhaps through specific archival procedures that
facilitate the audits.
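If the daily measurements land in a queryable log, threshold escalation reduces to a simple report; a sketch assuming a hypothetical cleansing_log table and a 2% policy tolerance:

    -- Flag any run whose reject volume exceeds the policy tolerance.
    SELECT run_date, domain_name, reject_count, record_count
      FROM cleansing_log
     WHERE reject_count > record_count * 0.02
     ORDER BY run_date;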
Because of the critical importance of meta data, the management practices involved with
maintaining meta data should also be audited.
4.16 Managing Internal Access, External Access and Security
By definition, the data used in critical decision support systems is “sensitive”. The definitions
of the warehouse and marts paint a picture of how the business is strategically managed. If this
information falls in the wrong hands, competitive advantage can be compromised. Conversely, if
the information is not placed into the right hands, it quickly diminishes in strategic value.
More and more progressive companies are exploring the value of making certain mart data
available to the supply chain (also called the business ecosystem) they are involved with. For
example, a regional distributor may be granted access to the parts delivery marts of the
manufacturers it buys parts from, and may in turn grant to its customers access to its own regional
warehouse stock and delivery queue marts. These collaborations save organizations money
(reduced inventory) and allow them to provide greater customer service (visibility to the event
horizon).
The management of data accessibility can be complex in “open” environments. Internal
information consumers must be administered on a need-to-know basis, while mart prototypers
and data miners may need special (more comprehensive) access. Data access for outside entities
must be administered across firewalls, and through web or virtual private network (VPN)
mechanisms. It may be necessary to encrypt especially sensitive data that travels beyond
protected boundaries. Decisions about data access are necessarily made at business unit levels, but
must be administered at technical IT levels.
Administering how internal and external users access DSS information is essential to the success
of the warehouse. This includes monitoring usage, which we explore in a later section.
4.17 Managing Systems Integration Aspects
A complex data warehousing environment can require significant systems integration. Each
interface between applications, databases, ERP systems, EDI sources and OLAP tools represents a
potential risk for compromising the value of information, or impeding performance. Even the
localized architecture between the production systems, operational data stores, warehouse and
marts can yield many system integration issues.
The team responsible for the data warehouse architecture (as discussed in section Managing Data
Warehouse Architectures) can handle many of the systems integration issues. However, this team
may lack the skills necessary to interface with ERP systems, or to integrate significant new
technologies. Additional resources with specialized skills may need to be assigned to the
warehouse project on a one-time, periodic or full-time basis.
Managing the interfaces with ERP packages, OLAP tools and other special utilities may
necessitate additional kinds of meta data. Where possible, meta data extraction tools should be
purchased or developed to keep the meta data in the corporate or tactical repositories in step with
meta data stored independently by packages, tools or management subsystems.
4.18 Managing Usage, Growth, Costs, Charge-backs and ROI
Data warehouse projects are some of the most expensive and most visible initiatives in the
enterprise, so it is no surprise that they come under heavy scrutiny.
Planners and mart designers need to consider usage and growth. Project managers, data
administrators and database administrators need to accrue various project costs (relating to design
and operations) to the warehousing effort. Accountants need to distribute costs across the
benefiting organizations, because no one department should have to carry the budget
responsibility for the overall warehouse. And executive management wants to ensure that return
on investment is realized.
This section discusses a few of the issues involved.
Usage
Departments want to know who is using the marts, how often, and through which queries. In
the earlier section Managing Data Quality, we noted that data which goes unused represents
the greatest quality risk.
The data query environment can be monitored with various devices (such as SQL exits) to
measure usage; where such monitors are unavailable, periodic polls can be taken. Unused queries
can be removed from general availability, and unused data can be eliminated from relevant
schemas at an appropriate change control juncture.
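For illustration, here is a minimal Python sketch of the unused-query review; the log entries, query names and review window are hypothetical, and in practice the log would be populated by the SQL exits or polls mentioned above.

    from datetime import date, timedelta

    # Hypothetical usage log of (query_name, user, run_date), as captured by
    # an SQL exit or a periodic poll.
    usage_log = [
        ("monthly_churn", "analyst1", date(1998, 8, 3)),
        ("monthly_churn", "analyst2", date(1998, 8, 20)),
        ("region_rollup", "analyst1", date(1998, 5, 2)),
    ]

    published_queries = {"monthly_churn", "region_rollup", "basket_linkage"}

    def unused_queries(log, published, within_days=90, today=date(1998, 9, 2)):
        """Queries with no executions in the review window become candidates
        for removal at the next change control juncture."""
        cutoff = today - timedelta(days=within_days)
        recent = {query for query, _, run_date in log if run_date >= cutoff}
        return published - recent

    print(unused_queries(usage_log, published_queries))
    # -> {'region_rollup', 'basket_linkage'} (set order may vary)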
Growth
Data warehouse projects are known to experience dynamic growth in resource consumption.
Unless huge volumes of DASD are idle and available, advance planning is a must.
At least some of the growth planning can be done from a combination of planned “mart starts” and
historical experience of warehouse and mart growth; of course the latter is not available for the
first warehousing project, so industry metrics or peer experience can be substituted.
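A back-of-the-envelope projection along these lines can be sketched in a few lines of Python; the growth rate, allocation per mart and schedule of “mart starts” below are hypothetical placeholders for historical or industry figures.

    # Hypothetical planning inputs: current DASD in use, observed monthly
    # growth, a fixed allocation per new mart, and planned "mart starts".
    current_gb = 400.0
    monthly_growth_rate = 0.05          # 5% per month, from past experience
    gb_per_new_mart = 30.0
    planned_mart_starts = {3: 1, 7: 2}  # month number -> marts started then

    def project_dasd(months):
        """Print a month-by-month DASD projection for advance planning."""
        gb = current_gb
        for month in range(1, months + 1):
            gb *= 1 + monthly_growth_rate                       # organic growth
            gb += planned_mart_starts.get(month, 0) * gb_per_new_mart
            print(f"month {month:2d}: {gb:7.1f} GB")

    project_dasd(12)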
Cost Management
Cost management is an accounting function, but actual cost data must be gathered from IT
operations, contributing data administration and database administration groups, and end users.
Various tools may be capitalized and depreciated, while others may be expensed against one or
more startup projects.
Centralized data warehouse costs can be aggregated; then a “slice” of the expenses can be charged
back to each benefiting business organization. This distribution of expenses can be done simply
(with an equal distribution), but most participating organizations argue for distribution according to use.
This makes the usage monitoring efforts all the more important.
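A minimal charge-back computation can be sketched as below (Python, with hypothetical departments and query counts standing in for the real usage measurements):

    # Hypothetical monthly warehouse cost and measured usage per department.
    total_monthly_cost = 120_000.00
    usage_by_dept = {"marketing": 4200, "finance": 2100, "logistics": 700}

    def charge_back(cost, usage):
        """Distribute a centralized cost in proportion to measured use."""
        total_use = sum(usage.values())
        return {dept: round(cost * count / total_use, 2)
                for dept, count in usage.items()}

    print(charge_back(total_monthly_cost, usage_by_dept))
    # -> marketing carries 60%, finance 30%, logistics 10% of the cost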
Return on Investment
Executive management is most interested in which new business initiatives were made possible by
the warehouse and marts, and in ensuring that the business investment in warehousing yields real
value. This can be hard to quantify after the fact, so early planning of how ROI will be computed is
very important. For example, metrics about key business processes or line-of-business behaviors
must be gathered before the warehouse projects begin, so that they can be measured against
similar post-implementation metrics.
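As a worked illustration (all figures hypothetical), the computation itself is simple – the hard part is having the baseline in hand before the project begins:

    # Hypothetical metrics gathered before and after the warehouse project.
    baseline_annual_revenue = 4_000_000   # from the targeted line of business
    post_annual_revenue     = 6_500_000   # same metric, post-implementation
    warehouse_investment    = 1_800_000

    gain = post_annual_revenue - baseline_annual_revenue
    roi = (gain - warehouse_investment) / warehouse_investment
    print(f"incremental revenue: {gain:,}; ROI: {roi:.0%}")
    # -> incremental revenue: 2,500,000; ROI: 39%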
4.19 Managing Change – the Evolving Warehouse
As business users start to work with integrated business data and discover new value in the
information, they often change the requirements. Keeping pace with these demands means
developers must make enhancements quickly and cost-effectively, and facilities planners must
ensure that adequate CPU and DASD resources are available.
Change management is especially important. There are a number of activities, across departments
or organizations, that must be coordinated. When schema or file layouts change, the programs
and utilities that use those resources must be changed in coordination; indeed, changes to business
elements can and should ripple changes all the way through warehouse processing and ultimately
influence or alter end user queries. These large-scale changes must be staged for implementation
in the future – at a point when all affected parties can be ready. Ripple effects of uncoordinated
implementations can be catastrophic and take weeks to undo.
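The coordination problem is essentially one of ordering changes through a dependency graph. Here is a minimal sketch in Python (using the standard-library graphlib; the artifact names and dependencies are hypothetical, and in practice would come from the meta data repository):

    from graphlib import TopologicalSorter  # standard library in Python 3.9+

    # Hypothetical dependencies: each artifact maps to what it depends on.
    depends_on = {
        "extract_job":    {"customer_schema"},
        "load_utility":   {"extract_job"},
        "mart_rollup":    {"load_utility"},
        "end_user_query": {"mart_rollup"},
    }

    # A topological order gives a staging sequence in which no artifact is
    # changed before the artifacts it depends on have been changed.
    staging_sequence = list(TopologicalSorter(depends_on).static_order())
    print(staging_sequence)
    # -> ['customer_schema', 'extract_job', 'load_utility',
    #     'mart_rollup', 'end_user_query']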
Strong planning and management communication is essential for keeping politics at bay, and
keeping the warehouse focused on key business initiatives. The warehouse environment has the
potential to become the “hub” of business decision making; left unchecked, it can easily spawn a
new and highly unmanageable breed of applications running against ODS and warehouse data.
Excited business areas can exert political forces that warp the original objectives of the warehouse,
and lead the warehouse project beyond its constraints of processing window, data quality and
general feasibility.
The data warehouse project requires a strong management presence that can maintain and evolve
the warehouse vision, while keeping pace with target business needs. The holder of the vision
must continually re-cast the role of the data warehouse as it evolves in concert with production
systems and business initiatives.
5. Future of Data Warehousing
We can expect data warehousing to continue evolving in the next few years. Perhaps the best way
to deduce how data warehousing will evolve is to revisit why it originated in the first place, then
evaluate the factors that continue to influence it.
Objectives
Recall that data warehousing was initially invented to work around a serious problem: the
operational data buried in production systems was not available, nor clean enough, nor
organized properly, for use by decision support systems. Data warehousing solves this problem
with a number of solutions that (not surprisingly) retrieve, clean, organize and stage data – and
finally make that data available for business queries.
In some ways, current data warehousing is a brute force methodology that works around the
problems that mainstream technology can’t solve today. These problems include:
• A single structure of data is insufficient to support varied audiences with separate interests.
• A single body of data cannot support concurrent access from all potential information
consumers, and their respective usage patterns, with any semblance of performance.
• Operational data is forever evolving, while data used for analysis must remain relatively
stable.
• Organizations require a closer bond between business initiatives and the IT systems that
activate them – in order to be more responsive and competitive.
At the heart of data warehousing is “reaching deeper truths”. Many corporations are finally asking
the question: “What do we really know about the most important thing in our business – the
customer?”. Knowledge that a customer exists is not enough; nor, perhaps, are simple
demographics enough. The deeper truths are discovered by understanding the behavior of the
customer - and this means observing patterns: buying trends, shifts in preferences, and so on. It
has been said that 90% of the proprietary data we’re protecting today has no value; yet we spend
an enormous amount to protect it. Data warehousing is about wringing value from data.
Business Factors
Changes to the nature and positioning of data warehousing will be driven by business factors.
Issues such as competitive advantage, gaining market share, focus on strategic objectives, and
streamlining business operations will continue to be the main reasons for data warehousing.
But the nature of business is changing due to new opportunities like e-commerce and the advent of
more powerful enterprise resource planning (ERP) systems. We can expect increasing
interest in the integration of data warehouses or marts into “supply chain” scenarios, since mart
data is the most relevant, cleanest, and most semantically pure data that the enterprise can offer to
the outside world.
Over time, I anticipate growing pressure to actually replace production systems with ODS- or data
warehouse-like systems that serve both the production and DSS worlds. There are real
technology barriers that make this impossible today; but as these barriers dissolve, people are
going to gravitate to the data that is the most accessible and easiest to understand. And this body of
people will increasingly come with a business perspective rather than a technical one, i.e., with a
predisposition to ODS or mart-like data.
Deeper integration with ERP systems will also drive a shift to better data organization. ERP
systems visualize the business in a well-defined, clean and organized fashion. This “information
map” corresponds well to ODS or mart data structures, but maps poorly to most production system
data structures. So integration with ERP systems will more likely happen through the ODSs or
marts.
All this said, the cost of migrating from an established data warehousing architecture to
“something new” will be an inhibitor for many organizations – unless the migration can be tied to
a specific business initiative with measurable returns that can cover the costs.
Technology Factors
Advancements in technology, and the decreasing costs of CPU and DASD hardware, will continue
to drive changes to the data warehouse infrastructure. Data warehousing fundamentally
comprises many low-level technology tasks (extraction, cleaning, moving, distribution, etc.).
We can expect technology to improve in remarkable ways; the Nucleus technology mentioned in
an earlier section is a good example.
In addition to doing the current processes of data warehousing better, we can expect technology to
allow data warehousing to be done differently, and perhaps more succinctly. For example, data
technology is being developed that allows common access, through standards like SQL via
ODBC, to databases of different kinds and structures. It is not that far-fetched to anticipate
technologies that will allow us to combine and view archaic legacy data stores through a “data
mart lens”.
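As an illustrative sketch of such common access (in Python, assuming the third-party pyodbc package and two hypothetical ODBC data source names – one modern mart, one legacy store fronted by an ODBC driver):

    import pyodbc  # third-party ODBC bridge; assumed available

    def query_all(dsn_list, sql):
        """Run the same SQL against every configured source; collect results."""
        results = {}
        for dsn in dsn_list:
            connection = pyodbc.connect(f"DSN={dsn}")
            try:
                results[dsn] = connection.cursor().execute(sql).fetchall()
            finally:
                connection.close()
        return results

    # Hypothetical usage: the same query, viewed through a "data mart lens",
    # over both a relational mart and a legacy store.
    # query_all(["SALES_MART", "LEGACY_PARTS"],
    #           "SELECT region, SUM(qty) FROM shipments GROUP BY region")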
Web systems are perhaps on the fringes of a new frontier of consumer behavior monitoring.
Remote web sites are able to collect increasingly significant information from the clients that
access them. And the new automated “information gathering” services and specialty search
engines are a natural proving ground for gathering hard data about consumer behaviors. We can
expect this “behavior demographics” information to be sold back into vertical industries, and
incorporated into data mining and warehousing initiatives.
We can also expect new information delivery mechanisms, such as intranet- or Internet-based
publishing or search facilities, to be more commonly utilized on the back end of data warehouses.
Knowledge Engineering
Data warehousing should enjoy increasing synergies with “sister initiatives” such as knowledge
engineering and data mining. There is a growing overlap between these fields, and for good
reason – they are all involved in deducing knowledge from data.
As discussed in this paper, central knowledge engineering is a critical success factor in allowing
global organizations to act globally, rather than just locally. Acting globally only happens when
people can communicate clearly on important subjects, and cooperate (rather than bump into each
other) on enterprise-wide projects.
More Effective Infrastructure & Automation
Data warehousing is a complex environment. It requires cooperation and coordination among
many departments; it utilizes many different tools, and requires significant ongoing management
and administration. It is also one of the most dynamically changing environments in organizations
today.
The data warehousing scenarios are simply too complex to be administered manually or through
trivial scheduling tools. A significant portion of the infrastructure, and the administration, must be
automated. Since the quality of warehouse data is inherently tied to the consistency of the
processes that create it, there is a growing opportunity for process flow automation.
The goal of automation is to reduce human interactions (human error) and leverage the
consistency and predictability of the computer. Ultimately, the goal is for humans to make fewer
“mechanical” decisions, and rather respond to events and deal with exceptions.
For data warehousing, this implies that a very strong repository of meta data is available about all
the business information, activities and deliverables occurring in the warehouse environment –
complete with detailed dependencies between warehouse processes and assigned responsibilities.
Few repository technologies available today can meet this challenge; but without such
technology, the risk to data warehouse projects will remain very high.
More Effective Communication
The number of parties that must communicate effectively in the design, planning, day-to-day
operations, distribution and mart management activities of the data warehouse can be staggering.
This highlights a significant challenge that will undoubtedly be addressed in the future:
communication.
Effective communication requires a sound information foundation. Certainly, meta data
repositories play a key role in laying this foundation, and enabling collaboration. Various work
group tools (such as Lotus Notes) also provide some communication infrastructure, but may not
preserve the necessary continuity between a communication and its context.
Being able to communicate more easily with more people is not the problem. Rather, the challenge
is communicating more discretely, and more contextually, with just the right people. In order for
this to occur, a more automated process infrastructure must be in place. People must be trained to
leverage it, and to communicate through it. And the process infrastructure must be capable of
monitoring events, raising exception conditions for review, and notifying responsible people or
subsystems.
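A minimal sketch of that monitor/notify loop, in Python with hypothetical event kinds, severities and responsible parties:

    # Hypothetical assignment of responsible parties to event kinds.
    RESPONSIBLE = {"load_failure": "dba_oncall", "late_feed": "ops_manager"}

    def notify(person, message):
        """Stand-in for a pager, e-mail or workflow notification."""
        print(f"notify {person}: {message}")

    def monitor(events):
        """Raise exception conditions for review; notify responsible people."""
        for event in events:
            if event["severity"] >= 2:  # only exceptions escalate to humans
                notify(RESPONSIBLE.get(event["kind"], "warehouse_admin"),
                       f"{event['kind']}: {event['detail']}")

    monitor([
        {"kind": "late_feed",    "severity": 2, "detail": "EDI feed 40 minutes late"},
        {"kind": "load_failure", "severity": 1, "detail": "retried successfully"},
    ])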
We can expect such an infrastructure to evolve as organizations strive to “close the gap” between
the business objective, and the implementation of that objective as realized in IT and DSS
applications.
6. Summary
Data warehousing is a significant undertaking for most organizations. Data warehousing projects
are notoriously expensive, and history has shown that overly ambitious projects are prone to
failure. The risks are high, but so are the rewards when data warehouses and marts are conceived
and employed strategically. As with e-commerce, many companies will reach the point where
they can no longer afford not to do some strategic decision support; the competition alone will
make it imperative.
This primer has discussed a number of management issues inherent in the data warehouse
environment. Chapter 4 identified nearly twenty management aspects that require attention; even
if several of these can be consolidated, it is easy to see the breadth of processes, roles, control and
communication that must come together to make data warehousing succeed. We are using the
breakdown of management issues, outlined in Chapter 4, for the development of use cases and
solution points for data warehouse management.
A number of roles (actors, in “use case” parlance) are involved in the data warehouse problem
space. The spectrum of different needs, ranging from the very technical to the very non-technical,
is perhaps the most striking aspect. Just some of the identified roles include:
• End User (Common, Explorer, Prototyper)
• Application developer
• Database administrator
• Data administrator
• Systems programmer
• Network administrator
• Managers (Business unit, DW Operations and IT)
• Data warehouse project administrator.
It is clear that a single solution cannot hope to satisfy all these roles; but it is likewise clear that a
common information infrastructure must support them all.
There is an obvious need for meta data solutions to support the data warehouse environment. In
fact, many of the existing warehouse management tools come with narrowly focused meta data
repositories. This is good and bad news: it is good news that meta data is being leveraged (data
warehousing would not be feasible without it); but it is bad news that there are many scattered
repositories with no coordination between them. The ongoing challenge for meta data delivery is
to provide excellent integration of meta data sources, intuitive organization of meta data subjects,
high availability to the meta data store, and focused views for different roles.
Conclusion
It is worth restating the initial premise of data warehousing: Data warehousing is the process of
creating useful information that can be used in measuring and managing the business.
Whether data warehousing survives in its current form, or evolves to a more optimal form, will
depend on emerging technology and the ingenuity of leaders in this market; regardless, it is clear
that the business requirements for strategic decision support will not go away. It is also clear that
the complex issues of data warehouse management will not go away either.