What is Data Warehouse?

FindSkills Inc.
Data Warehouse
Concepts Training
DW – ETL Training
Mahesh
2/8/2011
Contents
What is Data Warehouse?
OLAP vs OLTP:
OLTP (On-line Transaction Processing)
Elements of Data Warehouse:
Source Systems
Staging Area
Enterprise Data Warehouse
Data Mart
Dimensional Modeling:
Fact Table
Dimension Table
Star & Snowflake Schema:
What is Star Schema?
What is Snowflake Schema?
Slowly Changing Dimensions:
Type One (Overwrite History)
Type Two (Preserve History)
Type Three (Preserve a Version of History)
Fact Table Types:
What is Data Warehouse?
In data warehousing, you create stores of informational data, that is, data that is extracted from
operational data and then transformed for reporting and decision making. For example, a data
warehousing tool might copy all the sales data from the operational database, perform
calculations to summarize the data, and write it to a new database. End-users can query the new
database (the warehouse) without impacting the operational databases.
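The copy, summarize, and write flow described above can be sketched with two in-memory SQLite databases. This is only an illustration under stated assumptions: the table names (sales, daily_sales), the column names, and the sample rows are hypothetical and do not come from any particular warehousing tool.

```python
import sqlite3

# Hypothetical operational (OLTP) database holding detailed sales rows.
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
operational.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2011-02-07", "tea", 4.0),
        ("2011-02-07", "tea", 6.0),
        ("2011-02-07", "coffee", 5.0),
        ("2011-02-08", "tea", 3.0),
    ],
)

# Extract and transform: summarize the detailed rows per day and product.
summary = operational.execute(
    "SELECT sale_date, product, SUM(amount) FROM sales "
    "GROUP BY sale_date, product ORDER BY sale_date, product"
).fetchall()

# Load: write the summary into a separate database (the warehouse).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (sale_date TEXT, product TEXT, total_amount REAL)"
)
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", summary)
warehouse.commit()

# End users query the warehouse; the operational database is not touched.
daily_totals = warehouse.execute(
    "SELECT sale_date, SUM(total_amount) FROM daily_sales "
    "GROUP BY sale_date ORDER BY sale_date"
).fetchall()
print(daily_totals)
```

In a real deployment the two connections would point at different servers and the load would typically run as a scheduled batch job.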
To summarize:
- The purpose of a data warehouse is to store data consistently across the organization and to
make the organizational information accessible.
- It is an adaptive and resilient source of information. When new data is added to the data
warehouse, the existing data and technologies are not disrupted. The design of the separate
data marts that make up the data warehouse must be distributed and incremental.
Anything else is a compromise.
- The data warehouse not only controls access to the data, but also gives its owners great
visibility into the uses and abuses of the data, even after it has left the data warehouse.
- The data warehouse is the foundation for decision-making.
OLAP vs OLTP:
We can divide IT systems into transactional (OLTP) and analytical (OLAP) systems. In general, we
can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help
to analyze it.
OLTP (On-line Transaction Processing) is characterized by a large number of short online
transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query
processing, maintaining data integrity in multi-access environments, and effectiveness measured
by the number of transactions per second. An OLTP database holds detailed, current data, and the
schema used to store transactional data is the entity model (usually 3NF).
OLAP (On-line Analytical Processing) is characterized by a relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP systems,
response time is the effectiveness measure. OLAP applications are widely used by data mining
techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional
schemas (usually a star schema).
The following table summarizes the major differences between OLTP and OLAP system design.
| | OLTP System: Online Transaction Processing (Operational System) | OLAP System: Online Analytical Processing (Data Warehouse) |
|---|---|---|
| Source of data | Operational data; OLTPs are the original source of the data | Consolidation data; OLAP data comes from the various OLTP databases |
| Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support |
| What the data reveals | A snapshot of ongoing business processes | Multi-dimensional views of various kinds of business activities |
| Inserts and updates | Short and fast inserts and updates initiated by end users | Periodic long-running batch jobs refresh the data |
| Queries | Relatively standardized and simple queries returning relatively few records | Often complex queries involving aggregations |
| Processing speed | Typically very fast | Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes |
| Space requirements | Can be relatively small if historical data is archived | Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP |
| Database design | Highly normalized with many tables | Typically de-normalized with fewer tables; use of star and/or snowflake schemas |
Elements of Data Warehouse:
Source Systems
Typically in any organization the data is stored in various databases, usually divided up by the
systems. There may be data for marketing, sales, payroll, engineering, etc. These systems might
be legacy/mainframe systems or relational database systems.
Staging Area
The data coming from various source systems is first kept in a staging area. The staging area is
used to clean, transform, combine, de-duplicate, household, archive, and prepare source data
for use in the data warehouse. The data coming from a source system is kept as it is in this area.
The staging area need not be based on relational technology. Sometimes the managers of the data
are comfortable with a normalized set of data; in these cases, a normalized structure for the
staging storage is certainly acceptable. Also, the staging area does not provide querying or
presentation services.
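As a rough illustration of the cleaning and de-duplication work done in a staging area, the sketch below standardizes a few rows and keeps one row per email address. The field names, the sample rows, and the choice of email as the matching key are assumptions made for this example only.

```python
# Hypothetical raw customer rows as they might arrive from two source systems.
raw_rows = [
    {"source": "sales", "email": " Ann@Example.com ", "name": "Ann Lee"},
    {"source": "marketing", "email": "ann@example.com", "name": "Ann Lee"},
    {"source": "sales", "email": "bob@example.com", "name": "bob ray"},
]


def clean(row):
    """Standardize the fields used for matching and presentation."""
    return {
        "email": row["email"].strip().lower(),
        "name": row["name"].strip().title(),
    }


def deduplicate(rows):
    """Keep the first cleaned row seen for each email address."""
    unique = {}
    for row in rows:
        cleaned = clean(row)
        unique.setdefault(cleaned["email"], cleaned)
    return list(unique.values())


staged = deduplicate(raw_rows)
print(staged)
```

Real staging logic usually needs fuzzier matching rules than an exact key comparison; the point here is only the order of operations: clean first, then match.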
Enterprise Data Warehouse
A central database for decision support throughout the enterprise.
Data Mart
A data mart is a logical subset of the complete data warehouse. It is often viewed as the
restriction of the data warehouse to a single business process or to a group of related business
processes targeted toward a particular business group. For example, an organization may have a
data mart for Sales or Inventory.
Dimensional Modeling:
Dimensional modeling is the design concept used by many data warehouse designers to build
their data warehouse. The dimensional model is the underlying data model used by many of the
commercial OLAP products available today. Designing a data warehouse is very
different from designing an online transaction processing (OLTP) system. In contrast to an
OLTP system, in which the purpose is to capture high rates of data changes and additions, the
purpose of a data warehouse is to organize large amounts of stable data for ease of analysis and
retrieval. Because of these differing purposes, there are many considerations in data warehouse
design that differ from OLTP database design. In a dimensional model, all data is contained in
two types of tables: fact tables and dimension tables.
Fact Table
Each data warehouse or data mart includes one or more fact tables. The fact table captures the
data that measures the organization's business operations. A fact table might contain business
sales events such as cash register transactions or the contributions and expenditures of a
nonprofit organization. Fact tables usually contain large numbers of rows, sometimes in the
hundreds of millions of records when they contain one or more years of history for a large
organization. A key characteristic of a fact table is that it contains numerical data (facts) that can
be summarized to provide information about the history of the operation of the organization.
Each fact table also includes a multipart index that contains as foreign keys the primary keys of
related dimension tables, which contain the attributes of the fact records. Fact tables should not
contain descriptive information or any data other than the numerical measurement fields and the
index fields that relate the facts to corresponding entries in the dimension tables. An example of
a fact table is a Sales_Fact table, which might contain measures such as sale_amount, unit_price,
and discount.
Dimension Table
Dimension tables contain attributes that describe fact records in the fact table. Some of these
attributes provide descriptive information; others are used to specify how fact table data should
be summarized to provide useful information to the analyst. Dimension tables contain hierarchies
of attributes that aid in summarization. For example, a dimension containing product information
would often contain a hierarchy that separates products into categories such as food, drink, and
non-consumable items, with each of these categories further subdivided a number of times until
the individual product is reached at the lowest level.
Consider a Sales_Fact table whose facts are described by Store, Product, Time, and Sales Person.
In this case we will have four dimension tables: Store_Dimension, Product_Dimension,
Time_Dimension, and Sales_Person_Dimension.
Figure 1
You may notice that all of these dimensions contain a Key field. This is called a surrogate key.
It is a substitute for the natural key in a dimension (e.g., in Sales_Person_Dimension, the
natural key is ID). In a data warehouse, a surrogate key is a generalization of the natural
production key and is one of the basic elements of data warehouse design.
Because the fact table is described by the four dimension tables above, it contains the
surrogate keys of all these dimensions. This is how the Sales_Fact table will look:
Figure 2
Now, if you look carefully at the structure of the above tables and how they are linked, the
schema looks like this:
Figure 3
You can easily tell that this looks like a star. Hence it is known as a Star Schema.
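Since the figures are not reproduced here, the following sketch shows one way the four dimension tables and the Sales_Fact table could be declared and queried in SQLite. Only the table names, the surrogate key idea, the ID/Name/Region attributes, and sale_amount come from the text; every other column name and all sample rows (including the second sales person) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables: each has a surrogate key plus descriptive attributes.
# Attribute names such as Store_Name or Category are illustrative assumptions.
conn.executescript(
    """
    CREATE TABLE Store_Dimension (
        Store_Key INTEGER PRIMARY KEY, Store_Name TEXT);
    CREATE TABLE Product_Dimension (
        Product_Key INTEGER PRIMARY KEY, Product_Name TEXT, Category TEXT);
    CREATE TABLE Time_Dimension (
        Time_Key INTEGER PRIMARY KEY, Calendar_Date TEXT);
    CREATE TABLE Sales_Person_Dimension (
        Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, Region TEXT);

    -- The fact table holds only surrogate keys and numeric measures.
    CREATE TABLE Sales_Fact (
        Store_Key INTEGER REFERENCES Store_Dimension (Store_Key),
        Product_Key INTEGER REFERENCES Product_Dimension (Product_Key),
        Time_Key INTEGER REFERENCES Time_Dimension (Time_Key),
        Sales_Person_Key INTEGER REFERENCES Sales_Person_Dimension (Sales_Person_Key),
        sale_amount REAL);

    INSERT INTO Store_Dimension VALUES (1, 'Downtown');
    INSERT INTO Product_Dimension VALUES
        (1, 'Green Tea', 'drink'), (2, 'Bread', 'food');
    INSERT INTO Time_Dimension VALUES (1, '2011-02-07'), (2, '2011-02-08');
    INSERT INTO Sales_Person_Dimension VALUES
        (100, 203234, 'Rob Doe', 'ASIA'), (101, 203235, 'Ann Lee', 'EUROPE');
    INSERT INTO Sales_Fact VALUES
        (1, 1, 1, 100, 10.0), (1, 2, 1, 100, 5.0), (1, 1, 2, 101, 7.5);
    """
)

# A typical star join: one hop from the fact table to a dimension,
# then summarize the measure by a dimension attribute.
sales_by_region = conn.execute(
    "SELECT d.Region, SUM(f.sale_amount) "
    "FROM Sales_Fact AS f "
    "JOIN Sales_Person_Dimension AS d ON d.Sales_Person_Key = f.Sales_Person_Key "
    "GROUP BY d.Region ORDER BY d.Region"
).fetchall()

sales_by_category = conn.execute(
    "SELECT p.Category, SUM(f.sale_amount) "
    "FROM Sales_Fact AS f "
    "JOIN Product_Dimension AS p ON p.Product_Key = f.Product_Key "
    "GROUP BY p.Category ORDER BY p.Category"
).fetchall()

print(sales_by_region)
print(sales_by_category)
```

Every analytical query follows the same shape: join the fact table to whichever dimensions are needed and group by their attributes.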
Star & Snowflake Schema:
What is Star Schema?
Star Schema is a relational database schema for representing multi-dimensional data. It is the
simplest form of data warehouse schema and contains one or more fact tables and dimensions. It
is called a star schema because the entity-relationship diagram between dimensions and fact
tables resembles a star, where one fact table is connected to multiple dimensions. The center of
the star schema consists of a large fact table, which points towards the dimension tables. The
advantages of a star schema are easy slicing of data, better query performance, and easy
understanding of the data.
Steps in designing Star Schema
- Identify a business process for analysis (like sales).
- Identify measures or facts (sales amount).
- Identify dimensions for facts (product dimension, time dimension).
- List the columns that describe each dimension.
- Determine the lowest level of summary in a fact table (sales amount).
Advantages of having Star Schema
- Star Schema is very easy to understand, even for non-technical business managers.
- Star Schema provides better performance and smaller query times.
- Star Schema is easily extensible and will handle future changes easily.
What is Snowflake Schema?
A snowflake schema is a star schema structure normalized through the use of outrigger tables,
i.e., dimension table hierarchies are broken into simpler tables.
The snowflake schema is a variation of the star schema used in a data warehouse.
The snowflake schema (sometimes called snowflake join schema) is a more complex schema
than the star schema because the tables which describe the dimensions are normalized.
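To make the difference concrete, here is a small sketch in which the product hierarchy is moved out of the product dimension into a separate outrigger table. The table name Category_Outrigger, the column names, and the sample rows are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Outrigger table: the category level of the product hierarchy.
    CREATE TABLE Category_Outrigger (
        Category_Key INTEGER PRIMARY KEY, Category_Name TEXT);

    -- The product dimension now stores a key instead of the category name.
    CREATE TABLE Product_Dimension (
        Product_Key INTEGER PRIMARY KEY,
        Product_Name TEXT,
        Category_Key INTEGER REFERENCES Category_Outrigger (Category_Key));

    CREATE TABLE Sales_Fact (
        Product_Key INTEGER REFERENCES Product_Dimension (Product_Key),
        sale_amount REAL);

    INSERT INTO Category_Outrigger VALUES (1, 'drink'), (2, 'food');
    INSERT INTO Product_Dimension VALUES
        (1, 'Green Tea', 1), (2, 'Bread', 2), (3, 'Coffee', 1);
    INSERT INTO Sales_Fact VALUES (1, 10.0), (2, 5.0), (3, 7.5);
    """
)

# Summarizing by category now needs two joins instead of one.
sales_by_category = conn.execute(
    "SELECT c.Category_Name, SUM(f.sale_amount) "
    "FROM Sales_Fact AS f "
    "JOIN Product_Dimension AS p ON p.Product_Key = f.Product_Key "
    "JOIN Category_Outrigger AS c ON c.Category_Key = p.Category_Key "
    "GROUP BY c.Category_Name ORDER BY c.Category_Name"
).fetchall()
print(sales_by_category)
```

The extra join is the price of normalization; in a star schema the category name would sit directly in Product_Dimension.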
Drawbacks of "snowflaking"
- Normalizing the dimension tables saves storage space, but in a data warehouse the fact table,
in which data values (and their associated indexes) are stored, is typically responsible for 90%
or more of the storage requirements, so the benefit here is normally insignificant.
- Normalization of the dimension tables ("snowflaking") can impair the performance of a data
warehouse. Whereas conventional databases can be tuned to match the regular pattern of usage,
such patterns rarely exist in a data warehouse. Snowflaking will increase the time taken to
perform a query, and the design goal of many data warehouse projects is to minimize these
response times.
Benefits of "snowflaking"
- If a dimension is very sparse (i.e. most of the possible values for the dimension have no data)
and/or a dimension has a very long list of attributes which may be used in a query, the dimension
table may occupy a significant proportion of the database and snowflaking may be appropriate.
- A multidimensional view is sometimes added to an existing transactional database to aid
reporting. In this case, the tables which describe the dimensions will already exist and will
typically be normalized. A snowflake schema will hence be easier to implement.
- A snowflake schema can sometimes reflect the way in which users think about data. Users may
prefer to generate queries using a star schema in some cases, although this may or may not be
reflected in the underlying organization of the database.
- Some users may wish to submit queries to the database which, using conventional
multidimensional reporting tools, cannot be expressed within a simple star schema. This is
particularly common in data mining of customer databases, where a common requirement is to
locate common factors between customers who bought products meeting complex criteria. Some
snowflaking would typically be required to permit simple query tools such as Cognos Powerplay
to form such a query, especially if provision for these forms of query wasn't anticipated when the
data warehouse was first designed.
In practice, many data warehouses will normalize some dimensions and not others, and hence
use a combination of snowflake and classic star schema.
Slowly Changing Dimensions:
Handling changes to dimensional data across time is one of the most important aspects of
designing a data warehouse. In dimensional modeling, a dimension rarely remains static over
time. For example, a customer address may change; a company may phase out
old products and introduce new products. What if a customer name changes, a sales person
changes his region of sale, or a company assigns a new sales territory? How do we record the
history or preserve the old version of history? This is where the concept of Slowly Changing
Dimensions comes in.
The term Slowly Changing Dimension refers to variation in dimensional attributes over time. The
word slowly, in this context, might seem incorrect: a sales person may change his territory
rapidly. But in general, when compared to the measures in a fact table, the changes in dimensions
occur slowly.
Types of Slowly Changing Dimensions
In reference to Figure 3 above, let's say a sales person changes his region of sale. We may handle
this change in several ways. These methods fall into categories based on the company’s need
to preserve an accurate history of dimensional changes. Ralph Kimball categorized the
dimensional changes into three categories:
- Type One: Changes that overwrite history
- Type Two: Preserve history
- Type Three: Preserve a version of history
Type One (Overwrite History)
A Type One change overwrites an existing dimensional attribute with new information. In the
Sales Person Region change example, the old region name will be overwritten by the new region.
Say a sales person, Rob, has ASIA as his territory.
Sales_Person_Dimension
| Sales_Person_Key | ID | Name | Region | ... |
|---|---|---|---|---|
| 100 | 203234 | Rob Doe | ASIA | ... |
Now, if he starts looking after the NorthWest region, implementing a Type One change makes the
dimension table look like this:
Sales_Person_Dimension
| Sales_Person_Key | ID | Name | Region | ... |
|---|---|---|---|---|
| 100 | 203234 | Rob Doe | NorthWest | ... |
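A Type One change amounts to a plain in-place update. The sketch below assumes a SQLite table with only the four columns shown above; the helper name apply_type_one is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, Region TEXT)"
)
conn.execute(
    "INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA')"
)


def apply_type_one(conn, natural_id, new_region):
    """Overwrite the Region attribute; the old value is not kept anywhere."""
    conn.execute(
        "UPDATE Sales_Person_Dimension SET Region = ? WHERE ID = ?",
        (new_region, natural_id),
    )


apply_type_one(conn, 203234, "NorthWest")
rows = conn.execute("SELECT * FROM Sales_Person_Dimension").fetchall()
print(rows)
```

The surrogate key stays at 100, so existing fact rows now appear under the new region.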
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is
no need to keep track of the old information.
Disadvantages:
- All history is lost. With this methodology, it is not possible to trace back in
history. For example, in this case, the company would not be able to know that Rob Doe
was previously assigned to ASIA.
Type Two (Preserve History)
A Type Two change writes a record with the new attribute information and preserves a record of
the old dimensional data. Type Two changes let you preserve historical data. Implementing Type
Two changes within a data warehouse might require significant analysis and development. Type
Two changes accurately partition history across time more effectively than other types. However,
because Type Two changes add records, they can significantly increase the database's size.
In our example, let's say we identify Region as a Type Two attribute. The change can then be
handled this way:
Sales_Person_Dimension
| Sales_Person_Key | ID | Name | Region | ... |
|---|---|---|---|---|
| 100 | 203234 | Rob Doe | ASIA | ... |
| 153 | 203234 | Rob Doe | NorthWest | ... |
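A Type Two change can be sketched as "retire the current row, insert a new one with a fresh surrogate key". The Is_Current flag and the helper name apply_type_two are assumptions added for this example; the table above does not show how the current row is marked, and effective-date columns are another common choice. The new surrogate key is assigned by SQLite here, so it comes out as 101 rather than the 153 shown in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, "
    "Region TEXT, Is_Current INTEGER)"  # Is_Current is an assumed helper column
)
conn.execute(
    "INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA', 1)"
)


def apply_type_two(conn, natural_id, new_region):
    """Keep the old row as history and add a new current row."""
    current = conn.execute(
        "SELECT Sales_Person_Key, Name FROM Sales_Person_Dimension "
        "WHERE ID = ? AND Is_Current = 1",
        (natural_id,),
    ).fetchone()
    if current is None:
        raise ValueError(f"no current row for ID {natural_id}")
    old_key, name = current
    conn.execute(
        "UPDATE Sales_Person_Dimension SET Is_Current = 0 WHERE Sales_Person_Key = ?",
        (old_key,),
    )
    cursor = conn.execute(
        "INSERT INTO Sales_Person_Dimension (ID, Name, Region, Is_Current) "
        "VALUES (?, ?, ?, 1)",
        (natural_id, name, new_region),
    )
    return cursor.lastrowid  # surrogate key of the new row


new_key = apply_type_two(conn, 203234, "NorthWest")
rows = conn.execute(
    "SELECT * FROM Sales_Person_Dimension ORDER BY Sales_Person_Key"
).fetchall()
print(new_key, rows)
```

Facts loaded before the change keep pointing at key 100 (ASIA), while new facts use the new key, which is how history ends up partitioned correctly.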
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for
the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Type Three (Preserve a Version of History)
You usually implement Type Three changes only if you have a limited need to preserve and
accurately describe history, such as when someone gets married and you need to retain the
previous name. Instead of creating a new dimensional record to hold the attribute change, a Type
Three change places a value for the change in the original dimensional record. You can create
multiple fields to hold distinct values for separate points in time. In the case of a region change
example, you could create an OLD_REGION and NEW_REGION field and a
REGION_CHANGE_EFF_DATE field to record when the change occurs. This method
preserves the change. But how would you handle a second name change, or a third, and so on?
The side effects of this method are increased table size and, more important, increased
complexity of the queries that analyze historical values from these old fields. After more than a
couple of iterations, queries become impossibly complex, and ultimately you're constrained by
the maximum number of attributes allowed on a table.
This is how the table will look after a Type Three change:
Sales_Person_Dimension
| Sales_Person_Key | ID | Name | Old Region | New Region | ... |
|---|---|---|---|---|---|
| 100 | 203234 | Rob Doe | ASIA | NorthWest | ... |
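A Type Three change can be sketched as an update that shifts the current value into the "old" column. The column names OLD_REGION, NEW_REGION, and REGION_CHANGE_EFF_DATE follow the text; the helper name apply_type_three, the dates, and the third region (EUROPE) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, "
    "OLD_REGION TEXT, NEW_REGION TEXT, REGION_CHANGE_EFF_DATE TEXT)"
)
conn.execute(
    "INSERT INTO Sales_Person_Dimension "
    "VALUES (100, 203234, 'Rob Doe', NULL, 'ASIA', NULL)"
)


def apply_type_three(conn, natural_id, new_region, effective_date):
    """Move the current region into OLD_REGION and store the new one."""
    conn.execute(
        "UPDATE Sales_Person_Dimension "
        "SET OLD_REGION = NEW_REGION, NEW_REGION = ?, REGION_CHANGE_EFF_DATE = ? "
        "WHERE ID = ?",
        (new_region, effective_date, natural_id),
    )


def fetch_row(conn):
    return conn.execute("SELECT * FROM Sales_Person_Dimension").fetchone()


apply_type_three(conn, 203234, "NorthWest", "2011-02-08")  # hypothetical date
after_first = fetch_row(conn)

# A second change pushes ASIA out of the row entirely.
apply_type_three(conn, 203234, "EUROPE", "2011-06-01")  # hypothetical values
after_second = fetch_row(conn)

print(after_first)
print(after_second)
```

After the second call the row no longer mentions ASIA at all, which is the limitation discussed for this type.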
Advantages:
- This does not add rows to the table, since the existing record is updated in place.
- This allows us to keep some part of history.
Disadvantages:
- Type Three will not be able to keep all history where an attribute is changed more than once.
For example, if Rob is later reassigned to a third region, the ASIA
information will be lost.
Because most business requirements include tracking changes over time, data warehouse
architects commonly implement Type Two changes. A data warehouse might use Type Two
changes for all attributes in all tables. As an alternative, you can implement a mix of Type One
and Type Two changes at an attribute level by implementing Type Two changes only for attributes
whose historical values are important when you're slicing and dicing. For example, users might
not need to know an individual's previous name if a name change occurs, so a Type One change
would suffice. Users might want the system to show only the person's current name. However, if
the company reassigns sales territories, users might need to track who sold what, at what time,
and in what territory, necessitating a Type Two change.
Although most data warehouses include Type Two changes, you need to seriously examine the
business need to record historical data. Implementing Type Two changes might be necessary, but
those changes will increase the database size, degrade performance, and lengthen the
development time. You need to carefully evaluate using a Type Two implementation, a Type
One implementation, or a hybrid implementation.
Fact Table Types:
There are three types of facts:
- Additive: Additive facts are facts that can be summed up through all of the dimensions in
the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that
we are a retailer, and we have a fact table with the following columns:
| Date | Store | Product | Sales_Amount |
The purpose of this table is to record the sales amount for each product in each store on a daily
basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can
sum up this fact along any of the three dimensions present in the fact table -- date, store, and
product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales
amount for that week.
Say we are a bank with the following fact table:
| Date | Account | Current_Balance | Profit_Margin |
The purpose of this table is to record the current balance for each account at the end of each day,
as well as the profit margin for each account for each day. Current_Balance and
Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add
it up for all accounts (what's the total current balance for all accounts in the bank?), but it
does not make sense to add it up through time (adding up all current balances for a given
account for each day of the month does not give us any useful information). Profit_Margin is a
non-additive fact, for it does not make sense to add it up at the account level or the day
level.
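The semi-additive behaviour of Current_Balance can be illustrated with a few snapshot rows. The sample values are hypothetical, and taking the latest balance per account is just one common alternative to summing across time, not something prescribed by the text.

```python
from collections import defaultdict

# Hypothetical end-of-day snapshot rows: (Date, Account, Current_Balance).
snapshots = [
    ("2011-02-07", "A", 100.0),
    ("2011-02-07", "B", 50.0),
    ("2011-02-08", "A", 120.0),
    ("2011-02-08", "B", 50.0),
]

# Summing across accounts for one date is meaningful: the bank-wide balance.
total_by_date = defaultdict(float)
for date, account, balance in snapshots:
    total_by_date[date] += balance

# Summing one account across dates is not meaningful (A never held 220.0).
# Instead, keep the balance from the most recent date for each account.
latest_by_account = {}
for date, account, balance in sorted(snapshots):
    latest_by_account[account] = balance

print(dict(total_by_date))
print(latest_by_account)
```

An additive fact such as Sales_Amount would not need this care: its sum is meaningful along every dimension, including time.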
Based on the above classifications, there are two types of fact tables:
- Cumulative: This type of fact table describes what has happened over a period of time. For
example, this fact table may describe the total sales by product by store by day. The facts for
this type of fact table are mostly additive facts. The first example presented here is a
cumulative fact table.
- Snapshot: This type of fact table describes the state of things at a particular instant in time,
and usually includes more semi-additive and non-additive facts. The second example presented
here is a snapshot fact table.