Data Warehouse on Retail Store
By: Siddharth Chaudhary
X16137001
MSc in Data Analytics
National College of Ireland
Table of Contents
Introduction
Data Sources
  Data Source 1
  Data Source 2
  Data Source 3
Data Warehouse Design and Architecture
Design of Data Warehouse
  Dim_Customer
  Dim_Product
  Dim_Location
  Dim_Source
  Dim_Month
  Fact_Table
  Star Schema of Project
Extract Transform Load (ETL) process
  Extraction
  Transformation
  Loading
  Deploying the CUBE
Business Analytics
  Case Study:1
    Analysis
  Case Study:2
    Analysis
  Case Study:3
    Analysis
  Case Study:4
    Analysis
Conclusion
Introduction:
Fifteen years ago, information technology gave the world a gift: e-commerce. Since then, businesses
large and small have used it to improve their outreach, customer count, sales, profit and every other
aspect of their operations. But this was not sufficient. As data grew from MB to GB to PB, these
businesses felt a need to store the data efficiently and to use it to improve various aspects of the business.
One such domain is retail, where customers and products are the key aspects. Which product is needed
by what type of customer, and when, are the key questions of the retail business. If they are answered
well, they can take a retail business to new heights. A Data Warehouse plays an important role in
answering these questions. It helps to analyze the key aspects that improve the sales of retail stores.
To know what a customer buys and in which season, we need to look over the whole data. So first we
need to collect the whole historical data in one place in a standard format. This is done by preparing
a data warehouse. Many platforms support this, such as Teradata, Netezza, Oracle and Hadoop. Once the
warehouse is prepared, we can use it in many ways to answer a wide range of queries. In this project
I have simulated the preparation of a real-world data warehouse and the answering of business queries.
Data Sources:
Data is the basic requirement of any data warehouse. Data for this data warehouse is collected from
three different datasets. The first is from a global supermarket store, from which I took the data of
five stores in different locations of the USA for the year 2012. The second is the revenue collected
by each store in each month. The third states which month falls in which season in the USA.
The three datasets are easily joined together, as each of them carries the same month_id, which is
used in the SQL lookup query to populate data in the fact table as shown in fig.12
Data Source 1-
This dataset was fetched from www.Kaggle.com . Kaggle is a repository of thousands of datasets.
This dataset contains supermarket data from across the globe. The data used in this data warehouse
covers five states of the USA: New York, New Jersey, New Hampshire, Utah and Texas.
Link of the dataset:
https://kaggle2.blob.core.windows.net/datasets/1048/1903/global_superstore_2016.xlsx.zip?sv=2015-12-11&sr=b&sig=V6MbJAh5QVwQC8wLLiPrsC8dKochxZ354VLclEnFuWM%3D&se=2017-04-07T08%3A21%3A15Z&sp=r
Data Source 2-
This is a dummy dataset generated with Mockaroo. It contains the revenue of each month for each
state.
Data Source 3-
This is the unstructured dataset, which I scraped from the site:
https://www.englishclub.com/vocabulary/time-months-of-year.htm
The data was loaded into Excel, where it looks as shown in Fig.3.1, and was then cleaned and
structured as shown in Fig.3. This dataset holds the seasons of the USA.
Fig.1
Fig.2
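The cleaning step for the scraped month and season data can be sketched roughly as below. This is only an illustration in Python, not the procedure used in the project (which was done in Excel): the raw rows, the function name clean_rows and the output layout (month id, month name, season) are assumptions.

```python
import calendar

# Map lower-case English month names to their month number (1-12).
MONTH_IDS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}

def clean_rows(raw_rows):
    """Turn messy (month, season) text pairs into sorted (month_id, month, season) rows."""
    cleaned = []
    for raw_month, raw_season in raw_rows:
        month = raw_month.strip().lower()
        if month not in MONTH_IDS:
            continue  # skip header lines, blanks and other junk rows
        cleaned.append(
            (MONTH_IDS[month], month.capitalize(), raw_season.strip().capitalize())
        )
    return sorted(cleaned)

# Hypothetical scraped rows with a header, stray spaces, mixed case and a blank line.
raw_rows = [("Month", "Season"), (" june ", "SUMMER"), ("January", " winter"), ("", "")]
season_rows = clean_rows(raw_rows)
print(season_rows)
```

The month id produced here plays the role of the shared month_id that the three datasets are joined on.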
Data Warehouse Design and Architecture:
Kimball’s approach is used to build this Data Warehouse, in order to analyse the retail store across
different states of the USA, for example how much revenue is generated and how much product is sold
in each month and in each season.
Design Tool for this Data Warehouse:
● SQL Server Management Studio
● SQL Server Integration Services
● SQL Server Analysis Services
I have followed Kimball’s architecture, which consists of the following steps:
• Identification of the Business Process:- We need to define the main business processes, such as
acquiring customers, acquiring products and the sales process. We also need to understand
at what level the sales data is summarized: daily, weekly or monthly. This step
helps in determining the entities and their relationships as per the business requirement. Later on
these entities become the dimensions of the business. The most important entities are
Customer, Product, Location and Time.
• Defining the Grain:- The grain is the level of detail at which we store the data for these
dimensions. It defines the granularity of the system. In this project we store the sales
of products at month level.
• Defining the Dimensions:- Once the entities and the grain are decided, we can decide the
dimensions. This data warehouse contains five dimensions:
Dimension Name    Primary Key     Example
Customer          Customer_Key    Sam
Product           Product_Key     Jeans
Location          Location_Key    Chicago
Season            season_Key      Summer
Month             Month_Key       June
Table -1
These dimensions contain descriptive and textual data.
• Deciding the fact of the Data Warehouse:- The fact table defines the measurable data we are
going to store for the dimensions. It is the pivot of the star schema and contains all the primary
keys of the dimensions, along with the measurable quantities used to carry out business queries.
The fact data is designed so that it helps in identifying who our regular customers are,
how to improve the retail business given that product sales vary from season to season,
how much revenue is generated in which state and, last but not least, which product is the highest
selling.
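To make the idea of grain concrete, the sketch below rolls transaction-level rows up to month level. It is a minimal illustration using Python's sqlite3 module in place of SQL Server; the sales table, its columns and the sample rows are hypothetical and are not taken from the project's staging tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical transaction-level rows; the real staging columns may differ.
conn.execute("CREATE TABLE sales (order_date TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2012-06-03", "Jeans", 2), ("2012-06-21", "Jeans", 1), ("2012-07-02", "Jeans", 4)],
)

# Roll the rows up to the chosen grain: one row per (month, product).
monthly_sales = conn.execute(
    """
    SELECT CAST(strftime('%m', order_date) AS INTEGER) AS month_id,
           product,
           SUM(quantity) AS product_quantity
    FROM sales
    GROUP BY month_id, product
    ORDER BY month_id, product
    """
).fetchall()
print(monthly_sales)
```

Choosing a month-level grain means the two June purchases collapse into a single row, so questions about individual days can no longer be answered from the warehouse.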
Advantage of Kimball’s Model: The Kimball model takes a slightly different approach to building a data
warehouse, as it follows a bottom-up approach, which helps in merging small datasets.
• Performance of the Kimball model is better
• More focus is on dimensions, which play an important role in analysis
• Focus of this approach is on the process of building the DW
• Less time is needed to create the Data Warehouse
Overview of building data warehouse to carry out Business intelligence queries:
ETL is done in an SSIS package. The three datasets, held in Excel sheets, are extracted into the staging
tables. From the staging tables, data is populated into the dimension tables. With the help of the
lookup tool (join), data is populated into the fact table. The cube is deployed in SSAS, and business
queries are carried out in Power BI, as shown in Fig.a
Fig.3
Star Schema: A Star Schema looks like a star in which the Fact Table acts as the pivot, residing in the
center, while multiple Dimensions are attached to the fact table in a star-like form through
foreign keys. A simple Star Schema usually has one Fact Table and multiple Dimensions, but a
complex Star Schema can contain more than one Fact Table. Generally, Fact Tables are in 3NF.
Fact Table: A Fact Table consists of two types of column: (i) measure columns and (ii) foreign key columns.
Measure columns hold numeric values that can be measured or counted, while foreign key columns
hold the values that act as primary keys in the dimension tables. Measure columns can be used with
or without aggregation when analyzing a business query.
Dimension Table: A Dimension table consists of textual and descriptive values. Each dimension table
has its own primary key, which uniquely identifies the row holding the other column values. The
surrogate columns that appear as foreign key columns in the Fact Table are simply the primary key
columns of the Dimension Tables.
Fig.4
Advantage of Star Schema: The star schema has several merits that show its efficiency and its
suitability for building a Data Warehouse.
• Easy to generate an ETL process
• Complexity is low, as queries follow direct relationships between tables
• Reduces the effort of normalizing, as dimension tables are usually kept denormalized
• It is very efficient for carrying out metric analysis
• Each Dimension table is directly connected to the Fact Table
• Navigation of data is fast because of the direct connection between fact and dimension tables.
Design of Data Warehouse:
For this Retail Data warehouse five dimensions and one fact table have been created.
Dim_Customer:
The Customer dimension consists of Customer name, Customer id and Customer key. Customer key is the
primary key of this dimension. It is generated when I create the dimension with the column definition
[Customer_Key] INT IDENTITY(1,1) PRIMARY KEY. Now the question is why I generated this when I already
had customer_id. A primary key must be unique: no value may be repeated. But
when a customer appears more than once, their id repeats as well, so that column is not unique. To
remove this problem, Customer_key is auto-generated as the primary key of this dimension.
Customer_name contains the name of the customer and the customer_id column contains the id of the customer.
With this dimension we can analyse who our regular customers are.
Fig 5
Fig 6
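The surrogate key idea can be tried out with the short sketch below. It uses Python's sqlite3 module, where INTEGER PRIMARY KEY AUTOINCREMENT stands in for SQL Server's INT IDENTITY(1,1); the sample customers and the id format are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY AUTOINCREMENT plays the role of INT IDENTITY(1,1) here.
conn.execute(
    """
    CREATE TABLE Dim_Customer (
        Customer_Key  INTEGER PRIMARY KEY AUTOINCREMENT,
        Customer_id   TEXT NOT NULL,
        Customer_name TEXT NOT NULL
    )
    """
)
# Hypothetical rows: customer C-001 appears twice, as a repeat customer would.
conn.executemany(
    "INSERT INTO Dim_Customer (Customer_id, Customer_name) VALUES (?, ?)",
    [("C-001", "Sam"), ("C-002", "Ann"), ("C-001", "Sam")],
)
customer_rows = conn.execute(
    "SELECT Customer_Key, Customer_id FROM Dim_Customer ORDER BY Customer_Key"
).fetchall()
print(customer_rows)
```

Every row receives its own key even though customer_id repeats. A common alternative, not described in this report, is to load only distinct customers so that each customer maps to exactly one key.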
Dim_Product:
The Product dimension has product_key as the primary key. Product_id contains the id of the product.
Product_name contains the name of the product sold. With the help of this dimension we can analyze
which product is the highest selling and which customer buys what product.
Fig 7
Dim_Location:
The Location dimension has Location_Key as the primary key. State_id is the id of the state. State_name
contains the name of the state where the store is located. Region name contains the region of the
country. This dimension is helpful in analyzing which state or region has the highest number of
customers and which state has the highest sales. It also helps in analyzing the revenue earned in
each state or region.
Fig 8
Dim_Source:
This dimension is built from the unstructured dataset. It contains Season_key as the primary key.
Se_month_id is the id of a particular month. This dimension helps in analyzing which month
shows the highest sales and which product is the highest selling in each season.
Fig 9
Dim_Month:
This dimension contains Month_Key as the primary key. S_month_id contains the id of a particular month.
Month_name contains the name of the month. This dimension can be used to analyze the highest sales in
a state by month, or the highest selling product in a month.
Fig 10
Fact_Table:
For our retail superstore we have created one fact table, which is connected to each dimension table
through a foreign key relationship. It has three measure columns.
(i) product_quantity- It contains the quantity of product sold.
(ii) total_sale- It contains the sale amount per customer visit.
(iii) revenue- It contains the revenue generated in the store per month.
Fig 11
Star Schema of Project:
The dimension tables and the fact table are connected together in a star schema as shown in Fig 12.
Fig.12
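The star schema described above could be declared roughly as follows. This is a sketch in Python's sqlite3 module rather than the project's SQL Server scripts: the table and column names follow the text, the data types are guesses, and Season_name is an assumed column that the report does not list explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Dim_Customer (Customer_Key INTEGER PRIMARY KEY, Customer_id TEXT, Customer_name TEXT);
    CREATE TABLE Dim_Product  (Product_Key  INTEGER PRIMARY KEY, Product_id TEXT, Product_name TEXT);
    CREATE TABLE Dim_Location (Location_Key INTEGER PRIMARY KEY, State_id TEXT, State_name TEXT,
                               Region_name TEXT);
    -- Season_name is an assumed column; the report only names the key and the month id.
    CREATE TABLE Dim_Source   (Season_Key   INTEGER PRIMARY KEY, Se_month_id INTEGER, Season_name TEXT);
    CREATE TABLE Dim_Month    (Month_Key    INTEGER PRIMARY KEY, S_month_id INTEGER, Month_name TEXT);

    -- The fact table sits in the center: one foreign key per dimension plus the measures.
    CREATE TABLE Retail_Fact (
        Customer_Key     INTEGER REFERENCES Dim_Customer (Customer_Key),
        Product_Key      INTEGER REFERENCES Dim_Product  (Product_Key),
        Location_Key     INTEGER REFERENCES Dim_Location (Location_Key),
        Season_Key       INTEGER REFERENCES Dim_Source   (Season_Key),
        Month_Key        INTEGER REFERENCES Dim_Month    (Month_Key),
        product_quantity INTEGER,
        total_sale       REAL,
        revenue          REAL
    );
    """
)

# List the dimension tables that the fact table points at.
fact_parents = sorted(row[2] for row in conn.execute("PRAGMA foreign_key_list(Retail_Fact)"))
print(fact_parents)
```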
Extract Transform Load(ETL) process:
To build a data warehouse, data is first extracted, then transformed in the staging area and
finally loaded into the destination. This is known as the ETL process. The SSIS toolbox is used to
carry out the ETL process. Data from the external source is extracted into the staging database.
The next step is the transformation stage. The loading stage is the end of the ETL process, in which
data is loaded into the fact table. At the end of the ETL process, data is populated in the fact table
as well as in the dimension tables, as shown in Fig.6.
Fig.13
Extraction:
In this phase data is extracted from the external source. For this project the external sources are
Excel sheets, but they could equally be any database or OLTP server. The extraction loads the data
into the staging database, which is the OLE DB destination, as shown in Fig 14. All the data in the
Excel files is extracted into the database. The data that arrives in the staging phase is stored in
the following tables, as shown in Fig 15:
(i) dbo.Main_Stage
(ii) dbo.season_stage
(iii) state_stage
A TRUNCATE query is run in the staging phase so that repeated runs do not generate duplicate data,
as shown in Fig 16.
Fig.14
Fig.15
Fig.16
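The effect of truncating before each load can be demonstrated with the sketch below. It is an illustration with Python's sqlite3 module, not the SSIS task itself: SQLite has no TRUNCATE TABLE, so DELETE FROM is used instead, and the staging columns and the load_staging helper are hypothetical.

```python
import sqlite3

def load_staging(conn, rows):
    """Empty the staging table, then reload it from the source rows."""
    with conn:  # commit both statements together, or roll both back on error
        conn.execute("DELETE FROM Main_Stage")  # SQLite has no TRUNCATE TABLE
        conn.executemany("INSERT INTO Main_Stage VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
# Hypothetical staging layout; the real dbo.Main_Stage has more columns.
conn.execute("CREATE TABLE Main_Stage (customer_id TEXT, product_id TEXT, quantity INTEGER)")
source_rows = [("C-001", "P-10", 2), ("C-002", "P-11", 1)]

# Running the load twice leaves the same two rows, not four.
load_staging(conn, source_rows)
load_staging(conn, source_rows)
staged_count = conn.execute("SELECT COUNT(*) FROM Main_Stage").fetchone()[0]
print(staged_count)
```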
Transformation:
After the data is extracted from Excel into the staging database, the next step is
transformation. For transformation I have used the lookup tool (join) and a SQL query, as shown in
Fig.19.2, for loading the data from the dimension tables. We have five dimension tables and one fact
table in our database:
(i) dbo.Dim_Customer
(ii) dbo.Dim_Location
(iii) dbo.Dim_Month
(iv) dbo.Dim_Product
(v) dbo.dim_Source
(vi) dbo.Retail_Fact
These tables are shown in Fig.17. Dimensions are one of the most important factors in analyzing data.
The column mappings must match, as a mismatch terminates the ETL flow.
Fig.17
Fig.18
Loading:
After populating the dimension tables, the next step is to populate the fact table. The fact table
contains all the primary keys of the dimension tables and the measures, which are used for analysis
with some aggregation rule. The lookup tool (joins) is used to populate the dimension keys and the
measures in the fact table.
Fig.19.1
Fig.19.2
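The lookup step can be pictured as a join from the staging table to each dimension on its business id, keeping the surrogate key. The sketch below shows this with Python's sqlite3 module; it is a simplification that uses only three of the five dimensions, and all sample rows and key values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Dim_Customer (Customer_Key INTEGER PRIMARY KEY, Customer_id TEXT);
    CREATE TABLE Dim_Product  (Product_Key  INTEGER PRIMARY KEY, Product_id TEXT);
    CREATE TABLE Dim_Month    (Month_Key    INTEGER PRIMARY KEY, S_month_id INTEGER);
    CREATE TABLE Main_Stage   (customer_id TEXT, product_id TEXT, month_id INTEGER,
                               quantity INTEGER, sale REAL);
    CREATE TABLE Retail_Fact  (Customer_Key INTEGER, Product_Key INTEGER, Month_Key INTEGER,
                               product_quantity INTEGER, total_sale REAL);

    -- Invented sample rows, for illustration only.
    INSERT INTO Dim_Customer VALUES (1, 'C-001'), (2, 'C-002');
    INSERT INTO Dim_Product  VALUES (10, 'P-10'), (11, 'P-11');
    INSERT INTO Dim_Month    VALUES (101, 6), (102, 7);
    INSERT INTO Main_Stage   VALUES ('C-001', 'P-10', 6, 2, 59.98),
                                    ('C-002', 'P-11', 7, 1, 19.99);
    """
)

# Each join is one "lookup": match the business id, keep the dimension's surrogate key.
conn.execute(
    """
    INSERT INTO Retail_Fact
    SELECT c.Customer_Key, p.Product_Key, m.Month_Key, s.quantity, s.sale
    FROM Main_Stage AS s
    JOIN Dim_Customer AS c ON c.Customer_id = s.customer_id
    JOIN Dim_Product  AS p ON p.Product_id  = s.product_id
    JOIN Dim_Month    AS m ON m.S_month_id  = s.month_id
    """
)
fact_rows = conn.execute("SELECT * FROM Retail_Fact ORDER BY Customer_Key").fetchall()
print(fact_rows)
```

Because these are inner joins, a staging row whose id has no match in a dimension is silently dropped, which is one reason the dimensions are populated before the fact table.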
Deploying the CUBE:
In this phase a multidimensional representation of the data is built with the help of a cube in SSAS.
The cube is then used to analyze the data on the basis of the measures present in the fact table and
the descriptive, textual data present in the dimension tables. Here, Project.Cube is successfully
deployed as shown in Fig.20 & Fig.21. After the cube is deployed, the analysis and reporting phase
starts, in which the business intelligence queries are carried out.
Fig.20
Fig.21
Business Analytics
Tool Used for Business Query-: Power BI
Power BI is used to carry out the analysis of this Data Warehouse. For the analysis, the cube is
imported into Power BI. The business queries are carried out with the help of the descriptive,
textual and measurable data.
The following business queries can be analyzed with the help of our database.
Case Study:1
Do the seasons (summer, spring, winter, autumn) in 3 different regions of the USA affect the retail
store business in terms of revenue collection?
This query touches all three datasets. To answer it we take revenue, season
name and region name. The graph below shows how much revenue is generated in which region and in
which season.
Fig.22
Analysis:
From the clustered bar chart we can see that the highest revenue is generated in the
summer season, followed by autumn and then winter, while spring is responsible for the least revenue
in each region of the USA. The graph also shows that in all seasons the store earns most of its
revenue from the Eastern US, while the Western region comes last. This graph gives the marketing and
sales teams a quick insight: they need to work on the Western region to increase sales, and to find
the reason why spring is so slow.
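Behind a chart like this sits an aggregation of revenue by region and season. The sketch below expresses that aggregation as a SQL query run through Python's sqlite3 module; it is not the Power BI model used in the project, the tables are cut down to the columns needed, and every figure in it is made up rather than taken from the report's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Dim_Location (Location_Key INTEGER PRIMARY KEY, State_name TEXT, Region_name TEXT);
    CREATE TABLE Dim_Source   (Season_Key INTEGER PRIMARY KEY, Season_name TEXT);
    CREATE TABLE Retail_Fact  (Location_Key INTEGER, Season_Key INTEGER, revenue REAL);

    -- Made-up figures for illustration only; they are not the report's data.
    INSERT INTO Dim_Location VALUES (1, 'New York', 'Eastern US'), (2, 'Utah', 'Western US');
    INSERT INTO Dim_Source   VALUES (1, 'Summer'), (2, 'Spring');
    INSERT INTO Retail_Fact  VALUES (1, 1, 500.0), (1, 1, 250.0), (1, 2, 100.0), (2, 1, 300.0);
    """
)

# Sum the revenue measure for every (region, season) pair.
revenue_by_region_season = conn.execute(
    """
    SELECT l.Region_name, s.Season_name, SUM(f.revenue) AS total_revenue
    FROM Retail_Fact AS f
    JOIN Dim_Location AS l ON l.Location_Key = f.Location_Key
    JOIN Dim_Source   AS s ON s.Season_Key   = f.Season_Key
    GROUP BY l.Region_name, s.Season_name
    ORDER BY l.Region_name, total_revenue DESC
    """
).fetchall()
for row in revenue_by_region_season:
    print(row)
```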
Case Study:2
Sales generated in different states on the basis of seasons
This query draws on all three datasets. To answer it, Total sale, State and Season
are used. The pie chart in Fig.23 below represents the sales of the different states in each season.
Fig.23
Analysis:
This pie chart is used to analyse the sales of the store in different states in different seasons.
Fig.23 shows that sales in Texas are the highest in the summer season, followed by New York. The pie
chart also shows that New York has the highest sales in the autumn season, followed by Texas. So New
York and Texas are the biggest buyers in any season, while the rest of the states are slow in all
seasons. State therefore appears to be a very important factor in terms of sales. We need to
understand the needs of the Western US states that our store is not able to cater for. Either we need
to change the products or increase the offers, or perhaps the store manager is not very efficient.
Season and state are both very important factors in the US. A product that is suitable for New York
in winter might not be suitable for Utah at the same time. This kind of variation needs to be taken
into account while planning store products.
Case Study:3
Analytical Targeting of customers
To answer the above query we need to check which customer buys the maximum number of products
in which season. Product quantity, Customer Name and Season are used for targeting specific
customers.
Fig.24
Analysis:
The donut chart in Fig.24 represents the customers who buy the maximum number of products in the four
seasons. The figure shows which customer bought what quantity of product in which season. From a
business point of view, we can target these specific customers and provide some more offers to
improve our sales.
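Finding the customer to target in each season amounts to summing quantities per customer and season and keeping the largest total. The sketch below does this with Python's sqlite3 module plus a small loop; the purchases table, the customer names and the quantities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical purchase rows: (season, customer, quantity bought).
conn.execute("CREATE TABLE purchases (season TEXT, customer TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("Summer", "Sam", 5), ("Summer", "Ann", 3), ("Summer", "Sam", 2),
        ("Winter", "Ann", 4), ("Winter", "Sam", 1),
    ],
)

# Total quantity per customer within each season.
totals = conn.execute(
    "SELECT season, customer, SUM(quantity) FROM purchases GROUP BY season, customer"
).fetchall()

# Keep the customer with the largest total for every season.
top_customer = {}
for season, customer, quantity in totals:
    best = top_customer.get(season)
    if best is None or quantity > best[1]:
        top_customer[season] = (customer, quantity)
print(top_customer)
```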
Case Study:4
Seasons affecting the revenue of States
This query also touches all three datasets. To analyze it we used seasons, revenue and
states to check the amount of revenue generated by each state in every season.
Fig.25
Analysis:
The graphical representation in Fig.25 above shows how much revenue is collected in each state in
each season. New York has generated the highest revenue in each season, while New Hampshire
has generated the least. From a business perspective, revenue generation in New York and Texas is
significantly high.
Conclusion:
This data warehouse can help in showing how we can target specific customers in each region of
the country. New York and Texas have the highest sales and the highest revenue generation, while New
Hampshire has significantly less than each of the other states, so effort is needed to improve sales
in New Hampshire, Utah and New Jersey. Seasons also play an important role in the retail business, as
sales in the summer season are the highest of all. With the help of this Data Warehouse we can also
examine which product is sold in which month, so that we can give some extra offers on that particular
product.