UNIVERSITY OF HOUSTON-CLEAR LAKE
DATA WAREHOUSING AND DATA MINING
Case Study
on
Building a Data Warehouse
Submitted by:
Luis (Lead)
Sonal Dandwate (1411210)
Aniruddha
Manjil
University of Houston-Clear Lake
ISAM 5332
Dr. Rob
Sprint: Case Study on Building a Data Warehouse
Abstract
This paper discusses data warehouse concepts and their use in our project to develop business
scenarios, formulate strategic decisions, design a dimensional model and a cube, and propose
outcomes.
Introduction
Sprint is an American multinational telecommunications corporation and ranks fourth among mobile
phone providers. Remaining one of the industry giants and holding its market position each year
requires a lot of data analysis. It was observed that Sprint had started to lose market presence to
its competitors, with AT&T offering better rates and smarter plans. Through in-depth market
research and analysis, Sprint developed an idea of what types of plans customers want, which
plans sell well in which region, what type of customer base a particular region has, and so on. To
break the problem down, Sprint decided to organize its analysis by the geography of its stores.
Based on the resulting reports, management can decide what steps to take to improve sales and
increase the customer base and customer satisfaction. Being a very large company, Sprint
generates a large amount of data every day. A transactional system alone is not adequate to process
such data; a data warehouse is the better option. A data warehouse gives a better understanding of
the data, answers many questions, and helps in strategizing and formulating future decisions.
Data in the data warehouse has to be collected continuously, transformed and cleaned. By
implementing a data warehouse, analysts can gather data from different departments and use it to
better understand the business, develop strategy and help management take the necessary steps to
grow the business.
Case Study: Sprint
1. Business Scenario
Sprint Corporation, also known as Sprint, is an American telecommunications company that provides
wireless services and is a major Internet carrier. Sprint is the fourth-largest wireless network
provider in the USA and served 58.6 million customers as of November 2015. The company is
headquartered in Overland Park, Kansas. Sprint is one of the largest long-distance providers in
the United States. Every business aims to increase its market presence; to achieve that, Sprint is
expanding its stores and improving its coverage all over the country. Sprint also has to focus on
many questions, such as which plans to offer, which age groups to target, and which areas have the
highest data usage, the highest cellular usage or poor cellular connectivity.
To formulate strategic decisions, Sprint can set up a data warehouse. Once the data warehouse is
set up, Sprint can manage and analyze its data through the reports the system generates. Let us
look at a few business needs that Sprint may have:
• Identify the types of customers to target (demographic segmentation)
• Decide which plans to offer that will sell the most, based on demographic segmentation
• Generate quarterly and yearly sales and usage reports based on various factors
• Generate region-wise sales data to expand the business and open more stores
• Study usage patterns of customers based on various factors and create new combo plans
• Expand the market and target more customers
• Formulate future plans based on customer feedback
2. Why a Data Warehouse?
A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data
in support of management’s decisions. For decades it was difficult to obtain strategic information
from operational and informational systems, and the results were disappointing. A huge amount of
data was generated, but appropriate tools to deal with it were lacking. There was a strong need for
a new system that could combine newly generated data with archived or historical data to provide
valuable strategic information. A few important characteristics of a data warehouse are:
• It is a concept that helps you pick appropriate data (existing data, historical data), clean it
and process it to support strategic decision making.
• It is an environment, not a product.
• In a data warehouse environment, users can access whichever data they need.
Sprint is one of the industry leaders in the telecommunications business, and it has to deal with
large volumes of data on a daily basis. The data could relate to its stores, network towers,
inventory, employee information, customer information, or daily, weekly, monthly and yearly
transactions.
To solve these problems and use the data appropriately, Sprint built a data warehouse. This
warehouse contained data from different departments, stores, customers, employees and daily
transactional systems. This data helps us understand sales patterns, plan inventory, manage
resources and, most importantly, grow and improve the customer base. Once we analyze our data, we
can identify our grey areas. We can see which stores and regions are generating a good amount of
sales, and which regions, stores and plans are performing badly. In short, we can assess the
health of the business, and the data warehouse helps keep the business healthy.
4. Methodology
Steps for creating a data warehouse:
• Determine Business Objectives: Sprint is a well-established company. To understand
management’s questions, we have to identify which items would define the success of the business.
• Collect and Analyze Information: To understand the requirements and needs of management, the
easiest way is to ask questions. Questions let us understand what information might be needed for
decision making. Generating reports also helps in this step.
• Circle out Core Business Processes: In this step we have to identify the entities that are
required to understand and create the key performance indicators.
• Develop a Conceptual Data Model: In this step, we determine the subjects that are to be
expressed as fact tables. We also identify the dimensions that will be related to these facts.
• Identify Data Sources: Now that we have a conceptual data model, we have to identify where the
critical information lies and how to move it or relate it into a data warehouse structure. Data
transformations must be planned in this step.
• Set Historical Data Limits: A data warehouse deals with a large amount of information, which may
include historical data. We must determine how much historical data we want to store in our data
warehouse.
• Implement the Plan: Once all the steps mentioned above are complete, we have to define estimates
for the work and for project completion.
5. Dimensional Modeling
Dimensional modeling incorporates business dimensions into a logical data model. It is a logical
design technique that structures the business dimensions and the metrics that are analyzed along
these dimensions. This modeling technique is intuitive for that purpose, and queries and analysis
run against this model can achieve high performance. Data in this model is held in fact tables and
dimension tables.
A fact table is used to maintain measurements. Each row in a fact table represents data which
may relate to a particular customer, a particular product, or sales in a particular state or region.
Measurements recorded for an entity in an application appear as rows in the fact table, but some
fields may have no data. For example, where there are no sales for a particular product in a given
month, the value for it may not be present in the fact table. These missing values create gaps in
the fact table. The fact table for the Sprint data warehouse may contain Usage as a measure, with
a record such as “Usage_For_January”, for example.
Dimension tables are used to represent the business dimensions along which metrics are analyzed.
An important characteristic of a dimension table is that it is wide: it consists of many columns,
or attributes. Laid out like a normal table with columns and rows, a dimension table spans out
horizontally.
Dimension attributes are the various columns in a dimension table. In the Stores dimension, the
attributes can be Store ID, Store City, Store State, Store Country and Zip code. Generally,
dimension attributes are used as report labels and in query constraints such as where
Country='USA'. Dimension attributes also contain one or more hierarchical relationships.
Before designing your data warehouse, you need to decide what the data warehouse will contain.
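As a rough illustration of how dimension attributes serve as report labels and query constraints,
the following minimal sketch uses Python's built-in sqlite3 module. The table names, column names
and all rows are hypothetical and invented for this example; they are not taken from the project
database.

```python
import sqlite3

# Hypothetical Store dimension and Sales fact rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (
        store_id      INTEGER PRIMARY KEY,
        store_city    TEXT,
        store_state   TEXT,
        store_country TEXT,
        zip_code      TEXT
    );
    CREATE TABLE sales (
        store_id INTEGER REFERENCES store(store_id),
        amount   REAL
    );
    INSERT INTO store VALUES
        (1, 'Houston', 'Texas',   'USA',    '77058'),
        (2, 'Austin',  'Texas',   'USA',    '78701'),
        (3, 'Toronto', 'Ontario', 'Canada', 'M5H');
    INSERT INTO sales VALUES (1, 100.0), (2, 50.0), (3, 70.0), (1, 25.0);
""")

# store_country is the query constraint; store_state is the report label.
rows = conn.execute("""
    SELECT s.store_state, SUM(f.amount)
    FROM sales AS f
    JOIN store AS s ON s.store_id = f.store_id
    WHERE s.store_country = 'USA'
    GROUP BY s.store_state
    ORDER BY s.store_state
""").fetchall()
print(rows)
```

The fact table stores only keys and measures; everything a report needs to filter on or label by
comes from the dimension table through the join.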
Business Dimensions:
The following are the business dimensions that we can use in our data warehouse for Sprint. The
major dimensions are Time, Customer, Store and Product. Each of these has several attributes.
Customer     Time       Product       Store
Cust_ID      Day        Product_ID    Store_ID
Location     Month      Prod_Desc     Location
Name         Year       Type
Age          Quarter    Price
Our data warehouse may contain the following fields in the fact table:
• Sales
• Usage
A dimensional hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. The levels in a hierarchy form a tree-like structure: members
at the lowest level are called leaf members, and each member rolls up to a single member at the
next higher level. Dimensional hierarchies are the various levels of detail contained within a
business dimension. Analysts, data experts and higher-level management can use the dimensional
hierarchies as the paths for drilling down or rolling up in analysis. Modern software is very
useful when designing fact tables and dimension tables and establishing the relationships between
them.
There are two types of schemas generally used in a data warehouse: the STAR schema and the
snowflake schema. The STAR schema gets its name from the way its component tables, the dimension
tables, are arranged around a centralized fact table, making it look like a star. The dimension
tables in a STAR schema are not normalized. Snowflaking is a method of normalizing the dimension
tables in a STAR schema. When normalized, the resultant structure resembles a snowflake.
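The idea of rolling up along a dimensional hierarchy can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the city, state and country members and the
sales figures below are hypothetical, and a real OLAP tool performs this aggregation internally.

```python
from collections import defaultdict

# Hypothetical Store hierarchy: city -> state -> country. Cities are the leaf members,
# and every member maps to exactly one parent at the next higher level.
PARENT = {
    "Houston": "Texas",
    "Dallas": "Texas",
    "Miami": "Florida",
    "Texas": "USA",
    "Florida": "USA",
}


def roll_up(totals):
    """Aggregate totals one level up the hierarchy."""
    rolled = defaultdict(float)
    for member, value in totals.items():
        rolled[PARENT[member]] += value
    return dict(rolled)


# Invented leaf-level sales figures.
city_sales = {"Houston": 120.0, "Dallas": 80.0, "Miami": 50.0}
state_sales = roll_up(city_sales)
country_sales = roll_up(state_sales)
print(state_sales)
print(country_sales)
```

Drilling down is the reverse path: starting from the country total and moving back to the state
and city figures that produced it.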
6. Defining Data
When defining our data, we chose the star schema, which we felt was best suited to the data we
were providing. The star schema can be seen below.
We have defined the relationships between each table, and determined what each table must
contain to achieve our overall goal. The tables that were used in our schema are:
Dimensions:
• Product
• Time
• Customer
• Store
Fact Tables:
• Sales
• Usage
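A star schema with these four dimensions and two fact tables might be declared roughly as follows.
This is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The
dimension attributes follow the business dimensions listed earlier, while the surrogate key
time_id, the table name time_dim and the measure columns (amount, minutes_used, text_count,
data_mb) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- Dimension tables
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, location TEXT);
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                           quarter INTEGER, year INTEGER);
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, prod_desc TEXT, type TEXT, price REAL);
    CREATE TABLE store    (store_id INTEGER PRIMARY KEY, location TEXT);

    -- Fact tables: one foreign key per dimension plus the measures (measure names assumed)
    CREATE TABLE sales (
        cust_id    INTEGER NOT NULL REFERENCES customer(cust_id),
        time_id    INTEGER NOT NULL REFERENCES time_dim(time_id),
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        store_id   INTEGER NOT NULL REFERENCES store(store_id),
        amount     REAL
    );
    CREATE TABLE usage (
        cust_id      INTEGER NOT NULL REFERENCES customer(cust_id),
        time_id      INTEGER NOT NULL REFERENCES time_dim(time_id),
        product_id   INTEGER NOT NULL REFERENCES product(product_id),
        store_id     INTEGER NOT NULL REFERENCES store(store_id),
        minutes_used INTEGER,
        text_count   INTEGER,
        data_mb      REAL
    );
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Each fact table sits at the center of its own star and shares the same dimension tables, so sales
and usage can be compared along a common Time, Customer, Product or Store axis.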
7. Data Cleaning
Errors occur in all data. Sometimes errors are easy to identify, but at times it is very
difficult to find the slightest mistake in the data. Data coming from different sources may be
entered in different formats. This can cause errors, as it can change the context of the data.
Data cleansing is a rigorous process that can be time-consuming and tedious, but it is essential
to the overall goal of the business. When decision makers use the data warehouse to analyze the
business, they depend on accurate and correctly formatted data in the data warehouse. That is why,
before the data is made available in the data warehouse, it must be cleansed and transformed
correctly. Transformation is important because it builds consistency of format across the data
warehouse, so that decision makers can be sure that the data they pull will not contain errors.
Since data warehouses are time sensitive, the transformation process also makes sure that the
dates provided are correct and accurate. One big issue in businesses today is information
overload, which can overwhelm decision makers. It is important to focus only on the data that is
relevant to the business goal. Data that is not important to the decision maker is removed in the
data cleansing process.
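The two cleansing steps described above, bringing dates from different sources into one format and
dropping fields that are not relevant, can be sketched as follows. The list of date formats, the
field names and the sample record are hypothetical; a real feed would need its own rules, and
ambiguous dates such as 12/07/2015 are assumed here to be month/day/year.

```python
from datetime import datetime

# Hypothetical source formats; extend this tuple to match the actual feeds.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(raw):
    """Return the date as ISO 8601 text, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(record, keep):
    """Keep only the relevant fields and normalize the sale date."""
    cleaned = {key: record[key] for key in keep if key in record}
    cleaned["sale_date"] = normalize_date(record.get("sale_date", ""))
    return cleaned


# Invented raw record with stray whitespace and an irrelevant field.
raw = {"sale_date": " 12/07/2015 ", "store_id": 7, "amount": "49.99", "cashier_note": "n/a"}
cleaned = clean_record(raw, keep=("store_id", "amount"))
print(cleaned)
```

Records whose date cannot be parsed come back with sale_date set to None, so they can be reviewed
rather than silently loaded into the warehouse.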
8. Implementing in SQL Server Analysis Services 2012
Tables in SQL Server 2012
Once the cleansing and transformation of the data are complete, it is time to move our data from
Microsoft Access to SQL Server 2012. There are many advantages to having your data in SQL Server
2012 rather than MS Access. One of the major advantages is that SQL Server supports much larger
databases, so it can serve small and large companies alike. The most important advantage is the
ability to create a data cube. A cube consists of dimensions, fact tables and any data related to
them. The obvious advantage of creating a data cube is the ability to organize your data, which
makes it much easier for an analyst to query. Within the data cube, you can also use techniques
such as slice and dice, roll-up and drill-down. These techniques were essential when it was time
to view our data while creating our business plans. When developing the cube, it is important
first to define hierarchies in each dimension. This is essential for delivering the correct
information to the decision makers. Depending on the type of business or the managerial position,
different levels of detail are needed: executives might want information at the state level, while
a store manager might want information only at the city level. With SQL Server 2012, you are able
to work with the data at whichever level is needed. A cube is created when relationships are
defined between dimensions and when hierarchies are set.
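To make the cube operations concrete, the sketch below imitates slice and roll-up over a handful of
fact rows in plain Python. It is not how SQL Server Analysis Services is implemented; the rows, the
unit counts and the helper names aggregate and slice_facts are all hypothetical and exist only for
illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, state, city, product, units sold). Counts are invented.
FACTS = [
    ("Q1", "California", "Los Angeles", "Accessories", 40),
    ("Q1", "California", "San Diego", "Add on messaging", 35),
    ("Q1", "Pennsylvania", "Philadelphia", "5GB Data", 20),
    ("Q2", "Texas", "Houston", "Add on messaging", 50),
    ("Q2", "Texas", "Dallas", "5GB Data", 15),
]
DIMS = ("quarter", "state", "city", "product")


def aggregate(facts, by):
    """Sum units grouped by the named dimensions; fewer dimensions means a roll-up."""
    positions = [DIMS.index(dim) for dim in by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[pos] for pos in positions)] += row[-1]
    return dict(totals)


def slice_facts(facts, **fixed):
    """Keep only the rows that match the fixed dimension values (slice or dice)."""
    return [
        row for row in facts
        if all(row[DIMS.index(dim)] == value for dim, value in fixed.items())
    ]


by_state = aggregate(FACTS, by=("quarter", "state"))         # executive view
by_city = aggregate(FACTS, by=("quarter", "state", "city"))  # store-manager view (drill-down)
q1_products = aggregate(slice_facts(FACTS, quarter="Q1"), by=("product",))
print(by_state)
print(q1_products)
```

Grouping by state gives the executive view, adding the city level drills down to the store
manager's view, and fixing the quarter slices the cube to a single period.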
9. Browsing the Data Cube
The data cube is the most important part of data warehousing. An OLAP cube is a method of storing
data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data
(measures) are categorized by dimensions. OLAP cubes are often pre-summarized across
dimensions to drastically improve query time over relational databases. Now let us look at the
outcome from our cube:
The above figure shows which products sold the most per quarter per state.
In the First Quarter:
• Accessories and Add on messaging sold the most in California
• 5GB Data and the Unlimited talk time plan were popular in Pennsylvania
• The most popular plans were 3000min Talk time only, Add on messaging and
Unlimited talk time
• The least popular was 5GB Data
The objective was to identify where and when products are sold.
The above figure shows which products sold the most per quarter per state.
In the Second Quarter:
• Add on messaging sold the most in Texas
• The second most popular product was Unlimited talk time in Arizona, Florida, Georgia and
Washington
• The third most popular products were the 5GB Data plan and Accessories in Arizona, Colorado,
Florida, North Carolina, Texas and Washington
The objective was to identify where and when products are sold.
10. Report Generation
The main goal of implementing a data warehouse was to make use of our available data to support
crucial business decisions. The information generated by the data cube helps managers make
strategic decisions.
The following are a few of the reports that were generated from our cube:
Report - To determine usage per quarter
• Usage was highest in the Third Quarter and lowest in the Fourth Quarter
• The maximum usage was of 3000min talk time only in the First Quarter, whereas 5 GB data
had maximum
The objective was to identify the highest peak of usage.
Report - Usage Report by Age and Plans
• Age group 25-35 has the most Data usage in 5GB Data Plan.
• Text count and minutes used were highest for age group 32 and 44.
Solution - Determine which plans are popular with a certain age group.
11. Conclusion
The process of data warehousing has taught us the technique of building a cube and using it to
make crucial, strategic decisions. Solving various challenges along the way gave us a better
understanding of the cube and of how to implement it in a real scenario.