Data Warehousing - An Overview

advertisement
Data Warehousing - An Overview
Information Technology (IT) has historically influenced organizational performance
and competitive standing. The increasing processing power and sophistication of
analytical tools and techniques have put the strong foundation for the product called
data warehouse. There are a number of reasons that any organization should
consider a data warehouse, which can be the critical tool for maximizing the
organization’s investment in the information it has collected and stored throughout
the enterprise. IT managers need to understand the rationale and benefits of data
warehouses because they may need to design and implement, or procure this
kingpin of business intelligence.
The data warehouses are supposed to provide storage, functionality and
responsiveness to queries beyond the capabilities of today's transaction-oriented
databases. Also data warehouses are set to improve the data access performance of
databases. Traditional databases balance the requirement of data access with the
need to ensure integrity of data. In present day organizations, users of data are
often completely removed from the data sources. Many people only need readaccess to data, but still need a very rapid access to a larger volume of data than can
conveniently by downloaded to the desktop. Often such data comes from multiple
databases. Because many of the analyses performed are recurrent and predictable,
software vendors and systems support staff have begun to design systems to
support these functions. Currently there comes a necessity for providing decision
makers from middle management upward with information at the correct level of
detail to support decision-making. Data warehousing, online analytical processing
(OLAP) and data mining provide this functionality.
Here, we are to discuss what is all about data warehouse, how it helps to gain a
competitive edge for an organization.
Data Warehouse - An Introduction
A data warehouse is defined as a subject-oriented, integrated, nonvolatile, timevariant collection of data in support of management's decisions. More generally, data
warehousing is a collection of decision support technologies, aimed at enabling the
knowledge worker, such as executive, manager, and analyst, to arrive at better and
faster decisions. Data warehouses provide access to data for complex analysis,
knowledge discovery, and decision-making. style="mso-spacerun: They support high
performance demands on an organization's data and information. It provides an
enormous amount of historical and static data from three tiers:
1. Relational databases
2. Multidimensional OLAP applications
3. Client analysis tools
Several types of applications such as online analytical processing (OLAP),
decision-support systems (DSS) and data mining are being supported. OLAP is a
term used to describe the analysis of complex data from the data warehouse.
OLAP is a software technology that allows users to easily and quickly analyze and
view data from multiple points-of-view. OLAP provides dynamic and multi-
dimensional support to executives and managers who need to understand different
aspects of the data. Activities that are supported include:





Analyzing financial trends
Creating slices of data
Finding new relationships among the data
Drilling down into sales statistics
Doing calculations through different dimensions where each category of data
(that is, product, location, sales numbers, time period, etc.) is considered a
dimension.
There are OLAP tools that use distributed computing capabilities for analyses that
require more storage and processing power than can be economically and efficiently
located on an individual desktop.
DSS support an organization's leading decision makers with higher-level data for
complex and critical decisions. A DSS queries a data warehouse or an OLAP database
for relevant information that can be compared in order to make a business decision
and predict the impact of that decision.
Finally, data mining is being used for knowledge discovery, the process of searching
data for unanticipated new knowledge.
Knowledge workers and decision makers use tools ranging from parametric queries
to ad hoc queries to data mining. Thus, the access component of the data warehouse
must provide support of structured queries (both parametric and ad hoc). These
together make up a managed query environment.
Databases Vs Data Warehouses
A database is a collection of related data and a database system is a database and
database software together.
A data warehouse is also a collection of information as well as a supporting system.
Databases are transactional such as relational, object-oriented, network or
hierarchical. Traditional databases support on-line transaction processing (OLTP),
which includes insertions, updates, and deletions, while also supporting information
query requirements. Traditional databases are optimized to process queries that may
touch a small part of the database and transactions that deal with insertions or
updates of a few tuples per relation to process.
Thus databases must strike a balance between efficiency in transaction processing
and supporting query requirements (ad hoc user requests), That is, they can't
further optimized for the applications such as OLAP, DSS and data mining.
But a data warehouse is typically optimized for access from a decision maker's
needs. Data warehouses are designed specifically to support efficient extraction,
processing and presentation for analytic and decision-making purposes.
In contrast to databases, data warehouses generally contain very large amounts of
data from multiple sources that may include databases from different data models
and sometimes files acquired from independent systems and platforms.
Multidatabases provide access to disjoint and usually heterogeneous databases and
are volatile. Whereas a data warehouse is frequently a store of integrated data from
multiple sources, processed for storage in a multidimensional model and nonvolatile.
Data warehouses also support time-series and trend analysis, both of which require
more historical data.
In transactional systems, transactions are the unit and are the agent of change to
the database, but data warehouse information is much more coarse-grained and is
refreshed according to a careful choice of incremental refresh policy. Warehouse
updates are handled by the warehouse's acquisition component that provides all
required processing. As data warehouses encompass large volumes of data, they are
more or less double the size of source databases.
The sheer volume of data likely to be in terabytes is an issue that has been dealt
with through enterprise-wide data warehouses, virtual data warehouses and data
marts. Enterprise-wide data warehouses are huge projects in need of massive
investment of time and resources. Virtual data warehouses are bound to provide
views of operational databases that are materialized for efficient access. A data mart
is an easy-to-access repository of a subset of highly focused data for a single
function or department (i.e., finance, sales, marketing) and is considerably smaller
than a data warehouse. The data comes form operational information that is needed
by a particular group of employees for analysis, content, presentations all in terms
that are familiar to them. Data for a data mart is derived from a data warehouse or
from more specialized access.
Distinctive Characteristics of Data Warehouses
a. Data warehouses are supposed to be blessed with the following unique
features.
b. Multidimensional conceptual view and generic dimensionality,
c. Unlimited dimensions and aggregation levels and unrestricted crossdimensional operations,
d. Dynamic sparse matrix handling,
e. Client/server architecture and multi-user support,
f. Accessibility and transparency, intuitive data manipulation and consistent
reporting performance
As data warehouses are not much particular about transaction processing, there is
an increased efficiency in query processing. There are some specialized tools and
techniques. They are query transformation, index intersection and union, special
ROLAP (Relational OLAP), MOLAP (multidimensional OLAP), DOLAP (Database
OLAP) and WOLAP (Web OLAP) functions, SQL extensions, advanced join methods,
and intelligent scanning.
Traditional OLAP products are also known as multidimensional OLAP. Relational OLAP
tools take data from traditional two-dimensional or relational databases and create
multidimensional views upon request rather than being prepared in advance as in
OLAP. ROLAP is often used on complex data with a wide number of fields, such as
customer data. DOLAP is a relational database management system designed to
perform OLAP calculations. WOLAP refers to OLAP data that can be reached from a
Web server.
Parallel processing can enhance the performance of data warehouse. Parallel server
architectures include symmetric
Data Modeling for Data Warehouses
Multidimensional models highly take advantage of inherent relationships existing in
data to populate data in multidimensional matrices referred to as data cubes. If the
dimensional of the matrix is greater than three, then it is called hypercubes. Query
performance in multidimensional matrices for data that lend themselves to
dimensional formatting can be much better than in the relational data model. For a
corporate data warehouse, three examples of dimensions would be the corporation's
fiscal periods, products and regions.
Multidimensional models lend themselves readily to hierarchical views such as roll-up
display and drill-down display. Roll-up display moves up the hierarchy, grouping into
larger units along a dimension. A drill-down display provides the opposite capability,
furnishing a finer-grained view through disaggregating process.
The Multidimensional storage model involves two types of tables: dimension tables
and fact tables. A dimension table consists of tuples of attributes of the dimension.
A fact table can be thought of as having tuples, one per a recorded fact. This fact
contains some measured or observed variables and identifies them with pointers to
dimension tables.
Two common multidimensional schemas are the star schema and the snowflake
schema. The star schema consists of a fact table with a single table for each
dimension. The snowflake schema is a variation on the star schema in which the
dimensional tables from a star schema are organized into a hierarchy by normalizing
them. A fact constellation is a set of fact tables that share some dimension tables.
Data warehouse storage also utilizes indexing techniques to support high
performance access. A technique called bitmap indexing constructs a bit vector for
each value in a domain (column) being indexed. It does well for domains of lowcardinality. Bitmap indexing can provide considerable input/output and storage space
advantages in low-cardinality domains. With bit vectors a bitmap index can provide
dramatic improvements in comparison, aggregation, and join performance.
In a star schema, dimensional data can be indexed to tuples in the fact table by join
indexing. Join indexes are traditional indexes to maintain relationships between
primary key and foreign key values. They relate the values of a dimension of a star
schema to rows in the fact table. Data warehouse storage can facilitate access to
summary data by taking further advantage of the nonvolatility of data warehouses
and a degree of predictability of the analyses that will be performed using them.
Building a Data Warehouse
Warehouse builders should take a broad view of the anticipated use of the
warehouse. The design should specifically support adhoc querying that is, accessing
data with any meaningful combination of values for the attributes in the dimension
or fact tables. The following steps are being involved during the data acquisition
phase.
1. The data must be extracted from multiple, heterogeneous sources such as
databases or other data feeds.
2. Data must be formatted for consistency within the data warehouse. Names,
meanings and domains of data from unrelated sources must be reconciled.
3. The data must be cleaned to ensure validity. Data cleaning is an important part in
building a data warehouse and it is being touted as the largest labor-demanding one.
For input data, cleaning process has to be performed before the data are loaded in
the warehouse. Data warehouse builders have to check for validity and quality when
the input data must be examined and formatted consistently.
4. The data must be fitted into the data model of the warehouse. Data from the
various sources must be installed in the data model of the warehouse. Data may
have to be converted from relational, object-oriented, or legacy databases.
5. The data must be loaded into the warehouse. The sheer volume of data in the
warehouse makes loading the data a significant task. Monitoring tools for loads as
well as methods to recover from incomplete or incorrect loads are required. With the
huge volume of data in the warehouse, incremental updating is usually the only
feasible approach. The refresh policy will probably emerge as a compromise that
takes into account the answers to the following questions.
How up-to-date the data must be?
Can the warehouse go off-line, and for how long?
What are the data interdependencies?
What is the storage availability?
What are the distribution requirements such as for replication and partitioning?
What is the loading time including cleaning, formatting, copying, transmitting and
overhead such as index rebuilding?
Data storage in a data warehouse involves the following processes:
Storing the data according to the data model of the warehouse
Creating and maintaining required data structures
Creating and maintaining appropriate access paths
Providing for time-variant data as new data are added
Supporting the updating of warehouse data
Refreshing and purging the data
The sheer volume of data in the warehouse generally makes it impossible to simply
reload the warehouse in its entirety later on. There are a couple of alternatives for
this problem such as selective refreshing of data and separate warehouse versions.
When the warehouse uses an incremental data refreshing mechanism, data may
need to be periodically purged.
Data warehouses should also be designed with by taking the environment in which
they reside into account. The important points behind the data warehouse design are
. Usage projections
. The fit of the data model
. Characteristics of available sources
. Design of the metadata component
. Modular component design
. Design for manageability and change
There are mainly two types of data warehouses. They are distributed warehouse and
federated warehouse. For a distributed data warehouse, all the issues such as
replication, partitioning, communication and consistency, of distributed databases
are taken into account. As usual, the benefits of distribution, such as load balancing,
scalability of performance and higher availability, are available with distributed data
warehouse. A single replicated metadata repository would reside at each distribution
site.
The idea of the federated warehouse is like that of the federated database: a
decentralized confederation of autonomous data warehouses, each with its own
metadata repository. Given the magnitude of the challenge inherent to data
warehouses, it is likely that such federations will consist of smaller-scale
components, such as data marts. Large organizations may choose to federate data
marts rather than build huge data warehouses.
Functionality of Data Warehouses
Data warehouses exist to facilitate complex, data-intensive and frequent adhoc
queries. Data warehouses must provide far greater and more efficient query support
than is demanded of transactional databases. The data warehouse access component
supports enhanced spreadsheet functionality, efficient query processing, structured
queries, adhoc queries, data mining and materialized views. Particularly enhanced
spreadsheet functionality includes support for state-of-the art spreadsheet
applications as well as for OLAP applications programs. These provide
preprogrammed functionalities such as the following:
Roll-up: Data is summarized with increasing generalization
Drill-down: Increasing levels of detail are revealed
Pivot: Cross tabulation that is, rotation, performed
Slice and dice: Performing projection operations on the dimensions
Sorting: Data is sorted by ordinal value
Selection: Data is available by value or range
Derived or computer attributes: Attributes are computed by operations on stored
and derived values.
Benefits of Data Warehouses
The core benefits include



Historical information for comparative and competitive analysis
Enhanced data quality and completeness
Supplementing disaster recovery plans with another data back up source.
Among the greatest benefits of a data warehouse is the ability to analyze and
execute business decisions based on data from multiple sources. For example, an
organization has collected valuable data and stored it in 30 databases. A data
warehouse is not only a convenient way to analyze and compare data in all the
databases, but it can also give historical data and perspective. Thus data warehouse
is a one-stop shop, but it is also a one-stop shop from an historical perspective as
well. Using data warehouse, one can look at past trends, whether they be product
sales or customers or whatever and may be do some predictions of what is going to
happen in the future.
Also data retrieved from multiple databases is not constrained by the tables in each
of those databases. A data warehouse receives application neutral data. Whatever
database application is supplying the information to the data warehouse is not
preconditioning the data to be presented in a way the originator of the data requires.
That means, the data from the inventory system, the financial system, or the sales
system is sent to the data warehouse for processing as application neutral data that
is not formatted to answer only queries from an inventory database, finance
database, or sales database program. If not for application-neutral data, the data
warehouse would be nothing more than a collection of data marts.
A data warehouse by itself does not create value, but value comes from the use of
the data in the warehouse. In support of a low cost strategy, the data warehouse can
provide savings in billing processes, reduce fraud losses, and reduce the cost of
reporting. The data warehouses can provide analysts with precalculated reports and
graphs. This increases the productivity of business analysts.
Most companies can benefit from a data warehouse when the proper tools are in
place and users are trained in analysis of results.
Conclusion
However, data warehouses are still an expensive solution and typically found in large
firms. The development of a central warehouse is a huge undertaking and capital
intensive with large, potentially unmanageable risks.
Download