Data warehouse

Unit 5 A – Data warehouse
What is a data warehouse?
A simple answer to this question is that a data warehouse is managed
data situated after and outside operational systems. This answer is
based on the concept that fundamentally distinguishes an
organization's operation data (data that used to run the organization)
and information data (data used to manage the organization.) For
example, a bank's operation data are the most up-to-date financial
data for each customer, which are stored in different banking hosts.
A given customer's saving account data will be stored in a saving
account database, while his/her stock trading data will be stored in a
stock exchange database. Operation data can be dispersed amongst
different places in an organization, and they are usually overwritten
by new data as transactions go on. On the other hand, the same bank
may have a data warehouse, i.e. a single data repository that stores
all the financial data (all the transaction records) of all its customers
over a period of time (say several years). This huge amount of
accumulated archival data will serve as the bank's information data
on which business intelligence analyses can be done.
From an application point of view, a data warehouse can be defined
as any centralized data repository that can be queried for business
benefit. Warehousing makes it possible to:
extract archived operational data;
overcome inconsistencies between different legacy data
integrate data throughout an enterprise, regardless of
location, format or communication requirements; and
incorporate additional or expert information.
Characteristics of data warehouses
As defined by Bill Inmon (1992)(an information system guru and a
well-known data warehousing expert), one can easily recognize the
following characteristics of a data warehouse:
Subject-oriented -- data in the warehouse are organized by
subject instead of application, e.g. an insurance company
would organize their data by customer, premium and claim,
instead of by different products (auto, life, etc.). A data
warehouse contains only the information necessary for
decision support processing.
Integrated -- encoding of data is often inconsistent, e.g.
gender might be coded as 'm' and 'f' or 0 and 1, but when data
are moved from the operational environment into the data
warehouse they assume a consistent coding convention.
Time-variant -- a data warehouse contains a place for storing
data that are 5 to 10 years old, or older. These data are used
for comparisons, trends and forecasting, and are not updated.
Non-volatile -- data are not updated or changed in any way
once they enter a data warehouse. They are only loaded and
These unique characteristics usually pose many challenges to the
warehouse designer or architect when she considers how to
implement a data warehouse for an organization. These challenges
include the considerations of the following criteria for a data
Load Performance -- as data warehouses require incremental
loading of new data on a periodic basis, they must not
artificially constrain the volume of data.
Data Quality Management -- a data warehouse must ensure
local consistency, global consistency and referential integrity
despite 'dirty' sources and massive database size
Query Performance -- a data warehouse must not be slowed
or inhibited by the performance of its RDBMS.
Terabyte Scalability -- data warehouse sizes are growing at
astonishing rates so their RDBMSs must not have any
architectural limitations. They must support modular and
parallel management.
Mass User Scalability -- access to warehouse data must not be
limited to an elite few. A data warehouse has to support
hundreds, even thousands, of concurrent users while
maintaining acceptable query performance.
Networked Data Warehouses -- data warehouses rarely exist
in isolation -- users must be able to look at and work with
multiple warehouses from a single client workstation.
Warehouse Administration -- since a data warehouse is
large-scale and time-cyclic, it must offer administrative ease
and flexibility.
Integration of dimensional analysis by RDBMS -- dimensional
support must be inherent in the warehouse RDBMS to provide
the highest performance for relational OLAP tools.
Advanced Query Functionality -- end users require advanced
analytic calculations, sequential and comparative analysis,
and consistent access to detailed and summarized data
Data warehousing architecture
Data warehouses, as we have discussed, are different from corporate
databases in the sense that they serve an organization's information
needs instead of its operational needs. Depending on various factors
such as the availability of resources and business needs, there are a
number of different ways to structure data warehouses. In this unit,
we are not going to look into the technical details of data warehouse
architecture -- i.e. the hardware, software and physical layout.
Instead, we will look into the architecture of data warehouses from
the organization perspective as well as a business-needs perspective.
Organization architecture and implementation choices
There are a number of architecture choices for data warehouses as
seen from the organization perspective. These architecture choices
will be based on managerial factors such as:
The organization's existing physical and management
structure -- Is the organization highly centralized in one
location or highly dispersed with many branch offices? Is the
management structure highly centralized or is there relatively
independent departmental decision making?
Technical capability and resources availability -- Does the IS
department control all the technical expertise and resources
or are there departmental computing capabilities and
Commitment and style of management -- Is senior
management highly committed to data warehouse
development or is it only driven by needs from frontline users?
Depending on these various organizational factors, there are three
broad types of data warehouse architecture:
Global architecture -- this architecture is built upon a fully
integrated data warehouse across all the departments and
lines of business of an enterprise. The architecture is global in
the sense of data access instead of the physical organization
of the data warehouse. A globalized data warehouse can be
either physically centralized or distributed. However, a global
architecture can only be effectively managed and maintained
by a centralized IS department.
Independent data mart architecture -- the term 'data mart'
refers to a data warehouse on a smaller scale, usually small,
stand-alone data warehouses developed by workgroups,
departments or lines of business within an organization.
Implementation and maintenance of these data marts are
usually drawn from a department's own resources and
Interconnected data mart architecture -- several independent
data marts within an organization can be integrated or
interconnected to provide sharing of access to data marts
across departments or workgroups.
In addition to these architectural choices, there are also several
different approaches for the implementation of data warehouses that
also depend on various organizational factors. These choices are:
Top-down approach -- this implementation approach usually
leads to a global architecture, and usually features more
senior management involvement and more resources from
the organization. More consistent data definition and business
rules should be the result. However, the cost for
implementation will be high and more time will be required.
Bottom-up approach -- this approach is more flexible in the
sense that it is usually driven by business needs in individual
departments and lines of business. This approach does not
necessarily end up in data mart architectures. The usual
scenario is that a global data warehouse be built incrementally
with initial data mart implementations expanding and being
joined up.
Combined approach -- this approach is in a sense the best
approach since, with good project management control, this
approach can be a 'balancing act' that combines the
advantages of both the top-down and bottom-up approaches.
Data architecture for data warehouses
In the previous subsection, we talked about the architecture of data
warehouses from the organization's perspective. You learned how a
data warehouse could be organized in terms of the ownership,
management and control of data and usage. This architectural design
is also directly related to how a data warehouse is implemented. In
this subsection, we will look at the architecture of a data warehouse
from another perspective; that is, how the data contained in a data
warehouse should be organized to meet different business needs.
To begin our consideration of architecture from this business-needs
perspective, we first take a look at the different types of data that can
be stored in a data warehouse. There are three basic types of data,
real-time data
derived data
reconciled data
As you've seen, there are different methodologies for data
architecture design in data warehouses. The Enterprise Data Model
(EDM) is one of the methodologies that enables all the data elements
to be defined consistently throughout the whole organization. Usually
an EMD exercise is divided into phases. Each phase will include
different amounts of information. A full-scale EDM exercise can be
very resource intensive and time-consuming and may not be
practical for some organizations. Hence, a trimmed-down version of
EDM can be adopted in which some of the core components that are
required for data warehouse modelling can be extracted and grouped.
The following reading presents the details of EMD and a simplified
version of it. The benefits and drawbacks of EMD are also presented.
In the physical design phase of data modelling, a very important aspect of the
design is the 'granularity' of the data. This concept is concerned with the degree of
summarization of the data elements. The level of data granularity will directly affect
the ability to answer queries in future data analysis and data mining. The following
reading gives you the details of how to proceed with a choice of data granularity.
Data partitioning is very important in data architecture design. It affects the
efficiency and flexibility of accessing data, and also affects the maintenance aspects
of the data warehouse such as the scalability, portability and ease of sharing and
archiving data.
There are two different perspectives in data partitioning: physical partitioning and
logical partitioning. Physical partitioning refers to how data are structured and
grouped according to the physical design of a data warehouse. Logical partitioning
refers to how data are structured and grouped based on the characteristics of the
data, such as customer, product, time period, accounts, etc. Logical data
partitioning will facilitate access of data according to business needs. However,
physical data partitioning can overlap with logical data partitioning.