Unit No.01 Data Warehouse Fundamentals
What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single or
multiple sources.
Bill Inmon defines a data warehouse as "a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management's decisions."
A Data Warehouse is kept separate from the operational DBMS. It stores a huge amount of data,
typically collected from multiple heterogeneous sources such as files, DBMSs, etc. The goal is
to produce statistical results that can help in decision making. For example, a college
might want to see at a glance how the placement of CS students has improved over the last
10 years, in terms of salaries, counts, etc.
A data warehouse is an information system that contains historical and cumulative data from single
or multiple sources. It simplifies the reporting and analysis process of the organization. It is also a
single version of truth for any company for decision making and forecasting.
Characteristics of Data warehouse
 Subject-Oriented
 Integrated
 Time-variant
 Non-volatile
Subject-Oriented
A data warehouse is subject-oriented because it offers information regarding a theme instead of a
company's ongoing operations. These subjects can be sales, marketing, distribution, etc. A data
warehouse never focuses on ongoing operations; instead, it puts emphasis on the modelling and
analysis of data for decision making. It also provides a simple and concise view of the specific
subject by excluding data that is not helpful to the decision process.
Integrated
In a data warehouse, integration means establishing a common unit of measure for all
similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a
common and universally acceptable manner. A data warehouse is developed by integrating data
from varied sources such as mainframes, relational databases, flat files, etc. Moreover, it must keep
consistent naming conventions, formats, and coding.
This integration helps in effective analysis of data. Consistency in naming
conventions, attribute measures, encoding structure, etc. has to be ensured.
Time-Variant
The time horizon of a data warehouse is quite extensive compared with
operational systems. The data collected in a data warehouse is associated with a
particular period and offers information from a historical point of view. It
contains an element of time, explicitly or implicitly. One place where data
warehouse data displays time variance is in the structure of the record key:
every primary key contained in the data warehouse should have, implicitly or
explicitly, an element of time such as the day, week, month, etc. Another aspect of
time variance is that once data is inserted into the warehouse, it cannot be updated or
changed.
Non-volatile
A data warehouse is also non-volatile, meaning previous data is not erased when
new data is entered into it. Data is read-only and periodically refreshed. This also
helps in analysing historical data and understanding what happened and when. It does
not require transaction processing, recovery, or concurrency control mechanisms.
Activities like delete, update, and insert, which are performed in an operational
application environment, are omitted in the data warehouse environment. Only two
types of data operations are performed in the data warehouse: the initial loading of data
and the access (reading) of data.
Need for Data Warehouse
An ordinary database can store MBs to GBs of data, and that too for a specific purpose.
For storing data in the TB range, storage shifts to a data warehouse. Besides this, a
transactional database does not lend itself to analytics. To perform analytics effectively, an
organization keeps a central data warehouse to closely study its business by organizing,
understanding, and using its historical data for taking strategic decisions and analysing
trends.
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation that exists for end-client computing within the
enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control
are designed for online transaction processing (OLTP). Such applications gather detailed data from
day-to-day operations.
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity
recently dubbed online analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In
contrast, a warehouse database is updated from operational systems periodically, usually during
off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and
then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is
populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies,
and new fields and keys are added to reflect the needs of the users for sorting, combining, and
summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's
situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic
Operational System
In data warehousing, an operational system refers to a system that processes the day-to-day
transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified, and
file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated)
data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized
data is updated continuously as new information is loaded into the warehouse.
End-User access Tools
The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-client access
tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data warehouse Schemas
A schema is defined as a logical description of the database where fact and dimension tables are joined
in a logical manner. A data warehouse is maintained in the form of Star, Snowflake, and Fact
Constellation schemas.
Star Schema
A Star schema contains a fact table and multiple dimension tables. Each dimension is represented
by only one dimension table, and these tables are not normalized. Each dimension table contains a set of
attributes.
Characteristics
• In a Star schema, there is only one fact table and multiple dimension tables.
• In a Star schema, each dimension is represented by one dimension table.
• Dimension tables are not normalized in a Star schema.
• Each dimension table is joined to a key in the fact table.
The following illustration shows the sales data of a company with respect to the four dimensions,
namely Time, Item, Branch, and Location.
There is a fact table at the center. It contains the keys to each of four dimensions. The fact table also
contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For
example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state, country}. This constraint may cause data redundancy.
For example − "Vancouver" and "Victoria" both the cities are in the Canadian province of British
Columbia. The entries for such cities may cause data redundancy along the attributes
province_or_state and country.
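To make the layout concrete, here is a minimal sketch of this sales star schema built in an in-memory SQLite database from Python. The table and column names (sales_fact, dim_time, dim_item, dim_branch, dim_location) are chosen for illustration and are not prescribed by the example above.

```python
import sqlite3

# Build the sales star schema from the example above in an in-memory SQLite
# database. Table and column names are illustrative, not prescribed by the text.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized dimension tables: one table per dimension, no further splitting.
cur.execute("""CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY,
                                      day TEXT, month TEXT, quarter TEXT, year INTEGER)""")
cur.execute("""CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY,
                                      item_name TEXT, brand TEXT, type TEXT,
                                      supplier_type TEXT)""")
cur.execute("""CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY,
                                        branch_name TEXT, branch_type TEXT)""")
cur.execute("""CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY,
                                          street TEXT, city TEXT,
                                          province_or_state TEXT, country TEXT)""")

# Central fact table: a foreign key to each dimension plus the two measures.
cur.execute("""CREATE TABLE sales_fact (
                   time_key INTEGER REFERENCES dim_time(time_key),
                   item_key INTEGER REFERENCES dim_item(item_key),
                   branch_key INTEGER REFERENCES dim_branch(branch_key),
                   location_key INTEGER REFERENCES dim_location(location_key),
                   dollars_sold REAL,
                   units_sold INTEGER)""")
conn.commit()
```

A typical star-join query then joins the fact table to only those dimension tables the analysis needs.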
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized. The normalization splits up the
data into additional tables, as shown in the following illustration.
Unlike in the Star schema, the dimension tables in a Snowflake schema are normalized.
For example − The item dimension table of the Star schema is normalized and split into two
dimension tables, namely the item and supplier tables. Now the item dimension table contains the
attributes item_key, item_name, type, brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains
the attributes supplier_key and supplier_type.
Note − Due to the normalization in the Snowflake schema, redundancy is reduced; therefore, it
becomes easier to maintain and saves storage space.
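Continuing the earlier sketch, the snowflaked item dimension might look as follows; the table names are again illustrative.

```python
import sqlite3

# Snowflake version of the item dimension: the supplier attributes are moved
# out into their own table and referenced by supplier_key. Names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY,
                                          supplier_type TEXT)""")
cur.execute("""CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY,
                                      item_name TEXT, brand TEXT, type TEXT,
                                      supplier_key INTEGER
                                          REFERENCES dim_supplier(supplier_key))""")
conn.commit()

# A query against the snowflaked dimension now needs one extra join, e.g.:
# SELECT i.item_name, s.supplier_type
# FROM dim_item i JOIN dim_supplier s ON i.supplier_key = s.supplier_key;
```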
Fact Constellation Schema (Galaxy Schema)
A fact constellation has multiple fact tables. It is also known as a Galaxy Schema.
The following illustration shows two fact tables, namely Sales and Shipping −
The sales fact table is the same as that in the Star Schema. The shipping fact table has five
dimensions, namely item_key, time_key, shipper_key, from_location, to_location. The shipping
fact table also contains two measures, namely dollars sold and units sold. It is also possible to share
dimension tables between fact tables.
For example − The time, item, and location dimension tables are shared between the sales and
shipping fact tables.
Online Analytical Processing (OLAP)
OLAP stands for Online Analytical Processing. OLAP systems have the capability to
analyse database information from multiple systems at the same time. The primary goal of
an OLAP service is data analysis, not data processing.
Online Analytical Processing (OLAP) tools are a type of software used for data analysis
to support business decisions.
OLAP Examples
Any type of data warehouse system is an OLAP system. The uses of the OLAP system are
described below.
• Spotify analyses the songs played by users to come up with a personalized homepage of
songs and playlists.
• Netflix's movie recommendation system.
Online Transaction Processing (OLTP)
OLTP (online transactional processing) enables the rapid, accurate data processing
behind ATMs and online banking, cash registers and ecommerce, and scores of other
services we interact with each day.
OLTP, or online transactional processing, enables the real-time execution of large
numbers of database transactions by large numbers of people, typically over the internet.
Online transaction processing provides transaction-oriented applications in a 3-tier
architecture. OLTP administers the day-to-day transactions of an organization.
OLTP Examples
A typical example of an OLTP system is an ATM center: the person who authenticates first
receives the amount first, and the condition is that the amount to be withdrawn must be
present in the ATM. The uses of the OLTP system are described below.
• An ATM center is an OLTP application.
• OLTP handles the ACID (Atomicity, Consistency, Isolation, and Durability) properties during
data transactions via the application, as sketched below.
• It is also used for online banking, online airline ticket booking, sending a text
message, and adding a book to the shopping cart.
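As a rough illustration of this transactional behaviour, the sketch below performs an ATM-style transfer in SQLite, where the two balance updates either both commit or both roll back (atomicity). The account table and the amounts are invented for illustration.

```python
import sqlite3

# Toy OLTP-style transfer: either both the debit and the credit are applied,
# or neither is (atomicity). Table name and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.commit()

def withdraw_and_deposit(conn, from_id, to_id, amount):
    try:
        with conn:  # opens a transaction, commits on success, rolls back on error
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (from_id,)).fetchone()
            if balance < amount:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, from_id))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, to_id))
    except ValueError:
        pass  # the rollback has already undone any partial change

withdraw_and_deposit(conn, 1, 2, 200.0)   # succeeds
withdraw_and_deposit(conn, 2, 1, 10_000)  # fails; balances stay consistent
```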
Benefits of OLTP Services
• OLTP services allow users to perform read, write, and delete data operations quickly.
• OLTP services support growing numbers of users and transactions, which helps real-time
access to data.
• OLTP services help to provide better security by applying multiple security features.
• OLTP services help in better decision making by providing accurate and current data.
• OLTP services provide data integrity, consistency, and high availability of data.
Drawbacks of OLTP Services
• OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.
• OLTP has high maintenance costs because of frequent maintenance, backups, and recovery.
• OLTP services get hampered whenever there is a hardware failure, which leads to the
failure of online transactions.
• OLTP services often experience issues such as duplicate or inconsistent data.
Difference between OLAP and OLTP
Definition
OLAP: It is well-known as an online database query management system.
OLTP: It is well-known as an online database modifying system.

Data source
OLAP: Consists of historical data from various databases.
OLTP: Consists of only operational current data.

Method used
OLAP: It makes use of a data warehouse.
OLTP: It makes use of a standard database management system (DBMS).

Application
OLAP: It is subject-oriented. Used for data mining, analytics, decision making, etc.
OLTP: It is application-oriented. Used for business tasks.

Normalized
OLAP: In an OLAP database, tables are not normalized.
OLTP: In an OLTP database, tables are normalized (3NF).

Usage of data
OLAP: The data is used in planning, problem-solving, and decision-making.
OLTP: The data is used to perform day-to-day fundamental operations.

Task
OLAP: It provides a multi-dimensional view of different business tasks.
OLTP: It reveals a snapshot of present business tasks.

Purpose
OLAP: It serves the purpose of extracting information for analysis and decision-making.
OLTP: It serves the purpose of inserting, updating, and deleting information from the database.

Volume of data
OLAP: A large amount of data is stored, typically in TB or PB.
OLTP: The size of the data is relatively small as the historical data is archived, in MB or GB.

Queries
OLAP: Relatively slow as the amount of data involved is large. Queries may take hours.
OLTP: Very fast as the queries operate on about 5% of the data.

Update
OLAP: The OLAP database is not often updated. As a result, data integrity is unaffected.
OLTP: The data integrity constraint must be maintained in an OLTP database.

Backup and Recovery
OLAP: It only needs backup from time to time as compared to OLTP.
OLTP: The backup and recovery process is maintained rigorously.

Processing time
OLAP: The processing of complex queries can take a lengthy time.
OLTP: It is comparatively fast in processing because of simple and straightforward queries.

Types of users
OLAP: This data is generally managed by CEOs, MDs, and GMs.
OLTP: This data is managed by clerks and managers.

Operations
OLAP: Only read and rarely write operations.
OLTP: Both read and write operations.

Updates
OLAP: With lengthy, scheduled batch operations, data is refreshed on a regular basis.
OLTP: The user initiates data updates, which are brief and quick.

Nature of audience
OLAP: The process is focused on the customer.
OLTP: The process is focused on the market.

Database Design
OLAP: Design with a focus on the subject.
OLTP: Design with a focus on the application.

Productivity
OLAP: Improves the efficiency of business analysts.
OLTP: Enhances the user's productivity.
Online Analytical Processing Server (OLAP)
Online Analytical Processing (OLAP) servers are based on the multidimensional data model. They
allow managers and analysts to get an insight into the information through fast, consistent, and
interactive access to information. This section covers the types of OLAP servers and the operations
performed on OLAP data.
Types of OLAP Servers
We have four types of OLAP servers −
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between the relational back-end server and the client front-end tools. To store
and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following −
• Implementation of aggregation navigation logic.
• Optimization for each DBMS back end.
• Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
With multidimensional data stores, storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense and
sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP
and the faster computation of MOLAP. HOLAP servers allow storing large volumes of
detailed information. The aggregations are stored separately in the MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations
on multidimensional data.
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
The following diagram illustrates how roll-up works.
• Roll-up is performed by climbing up a concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of
city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
The following diagram illustrates how drill-down works −
• Drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of
month.
• When drill-down is performed, one or more dimensions from the data cube are added.
• It navigates the data from less detailed data to highly detailed data, as mimicked in the
sketch below.
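Roll-up and drill-down can be mimicked on a small pandas DataFrame, as sketched below. The sales figures are invented, and pandas is used only to imitate what an OLAP server does internally.

```python
import pandas as pd

# Detailed sales records: invented numbers, one row per (city, month).
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "city":    ["Vancouver", "Vancouver", "Victoria", "Victoria"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "units_sold": [605, 580, 210, 225],
})

# Roll-up: climb the location hierarchy from city to country.
by_country = sales.groupby("country", as_index=False)["units_sold"].sum()

# Drill-down: descend the time hierarchy from quarter to month.
by_month = sales.groupby(["quarter", "month"], as_index=False)["units_sold"].sum()

print(by_country)
print(by_month)
```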
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
• Here Slice is performed for the dimension "time" using the criterion time = "Q1".
• It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions.
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an
alternative presentation of data. Consider the following diagram that shows the pivot operation.
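A pivot can likewise be imitated with a pandas pivot table; the data below is invented for illustration.

```python
import pandas as pd

# Same kind of flattened cube data; pivoting swaps which dimensions sit on
# the rows and which on the columns.
cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "units_sold": [605, 825, 400, 300],
})

# Items on rows, locations on columns ...
view1 = cube.pivot_table(index="item", columns="location", values="units_sold")
# ... and the rotated view: locations on rows, items on columns.
view2 = cube.pivot_table(index="location", columns="item", values="units_sold")
```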
ETL (Extract, Transform, and Load) Process
What is ETL Process?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.
The ETL process requires active input from various stakeholders, including developers, analysts,
testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change with
business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system
and needs to be agile, automated, and well documented.
How ETL Works?
ETL consists of three separate phases (extraction, transformation, and loading), with a cleansing
stage that is usually treated as part of the transformation work:
Extraction
o Extraction is the operation of extracting information from a source system for further use in
a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus determining
which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data to
the warehouse and keep it up-to-date, as sketched below.
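A minimal sketch of such periodic (incremental) extraction is shown below. The source table orders and its updated_at column are assumptions made for illustration; real source systems expose changed data in different ways.

```python
import sqlite3

# Incremental extraction sketch: pull only rows changed since the last run.
# The source table "orders" and its "updated_at" column are assumptions made
# for illustration.
def extract_changed_rows(source_conn, last_run: str):
    cur = source_conn.execute(
        "SELECT order_id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > ?", (last_run,))
    return cur.fetchall()

# Example usage against a toy source system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
            "amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 10, 99.0, "2024-01-01T08:00:00"),
    (2, 11, 45.5, "2024-01-02T09:30:00"),
])
changed = extract_changed_rows(src, last_run="2024-01-01T12:00:00")
```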
Cleansing
The cleansing stage is crucial in a data warehouse because it is supposed to improve data
quality. The primary data cleansing features found in ETL tools are rectification and
homogenization. They use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate
associations between values.
The following examples show why data cleansing is essential:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate, and up-to-date list of
contact addresses, email addresses, and telephone numbers must be available.
If a client or supplier calls, the staff responding should be able to quickly find the person in the
enterprise database, but this requires that the caller's name or his/her company name is listed in the
database.
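A small rectification/homogenization sketch is shown below; the synonym dictionary and the sample contact record are invented for illustration.

```python
# Rule-based cleansing sketch: a small synonym dictionary rectifies spelling
# variants so the same city is recognised across sources. The sample record
# and the synonym table are invented for illustration.
CITY_SYNONYMS = {"bombay": "Mumbai", "mumbai": "Mumbai",
                 "new delhi": "Delhi", "delhi": "Delhi"}

def clean_contact(record: dict) -> dict:
    cleaned = dict(record)
    cleaned["name"] = record["name"].strip().title()      # fix casing and spaces
    cleaned["email"] = record["email"].strip().lower()     # homogenize emails
    city = record["city"].strip().lower()
    cleaned["city"] = CITY_SYNONYMS.get(city, record["city"].strip().title())
    return cleaned

raw = {"name": "  jane DOE ", "email": "Jane@Example.COM ", "city": "bombay"}
print(clean_contact(raw))
# {'name': 'Jane Doe', 'email': 'jane@example.com', 'city': 'Mumbai'}
```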
Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational
source format into a particular data warehouse format. If we implement a three-layer architecture,
this phase outputs our reconciled data layer.
The following points must be rectified in this phase:
o Loose texts may hide valuable information. For example, "XYZ PVT Ltd" does not explicitly show
that this is a private limited company.
o Different formats can be used for individual data. For example, a date can be saved as a string or as
three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
o Conversion and normalization that operate on both storage formats and units of measure to make
data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
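The sketch below illustrates conversion/normalization, matching, and selection on a single invented source record; the field names and mappings are assumptions made for illustration.

```python
from datetime import date

# Conversion/normalization: one source stores a date as a string, another as
# three integers; both are converted to a single uniform representation.
def to_iso_date(value):
    if isinstance(value, str):                      # e.g. "31/01/2024"
        day, month, year = (int(p) for p in value.split("/"))
    else:                                           # e.g. (31, 1, 2024)
        day, month, year = value
    return date(year, month, day).isoformat()

# Matching: associate equivalent fields that carry different names per source.
FIELD_MAP = {"cust_no": "customer_id", "customerId": "customer_id"}

def match_fields(record: dict) -> dict:
    return {FIELD_MAP.get(k, k): v for k, v in record.items()}

# Selection: keep only the fields the reconciled layer actually needs.
KEEP = {"customer_id", "order_date", "amount"}

def select_fields(record: dict) -> dict:
    return {k: v for k, v in record.items() if k in KEEP}

src = {"cust_no": 42, "order_date": "31/01/2024", "amount": 99.0, "fax": "-"}
rec = select_fields(match_fields(src))
rec["order_date"] = to_iso_date(rec["order_date"])
```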
Loading
The load is the process of writing the data into the target database. During the load step, it is
necessary to ensure that the load is performed correctly and with as few resources as possible.
Loading can be carried out in two ways:
1. Refresh: Data warehouse data is completely rewritten. This means that the older data is
replaced. Refresh is usually used in combination with static extraction to populate a data
warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying pre-existing
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
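Both loading modes can be sketched against a toy warehouse table, as below; the table name and columns are illustrative.

```python
import sqlite3

# Loading sketch: "refresh" rewrites the target table completely, while
# "update" only appends the changed rows. Table and column names are
# illustrative, not prescribed by the text.
def refresh(dw_conn, rows):
    with dw_conn:
        dw_conn.execute("DELETE FROM sales_fact")          # old content replaced
        dw_conn.executemany(
            "INSERT INTO sales_fact (item_key, units_sold) VALUES (?, ?)", rows)

def update(dw_conn, changed_rows):
    with dw_conn:                                          # pre-existing rows kept
        dw_conn.executemany(
            "INSERT INTO sales_fact (item_key, units_sold) VALUES (?, ?)",
            changed_rows)

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (item_key INTEGER, units_sold INTEGER)")
refresh(dw, [(1, 605), (2, 825)])      # initial population (static extraction)
update(dw, [(3, 400)])                 # periodic incremental load
```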