Data Warehouse On Retail Store
By: Siddharth Chaudhary, X16137001
MSc in Data Analytics, National College of Ireland

Table of Contents
Introduction
Data Sources
    Data Source 1
    Data Source 2
    Data Source 3
Data Warehouse Design and Architecture
Design of Data Warehouse
    Dim_Customer
    Dim_Product
    Dim_Location
    Dim_Source
    Dim_Month
    Fact_Table
    Star Schema of Project
Extract Transform Load (ETL) Process
    Extraction
    Transformation
    Loading
    Deploying the Cube
Business Analytics
    Case Study 1
        Analysis
    Case Study 2
        Analysis
    Case Study 3
        Analysis
    Case Study 4
        Analysis
Conclusion
Introduction:
Fifteen years ago, information technology gave the world a gift: e-commerce. Since then, every business, small and large, has used it to improve its outreach, customer base, sales, profit and almost every other aspect of its operations. But that alone was not sufficient. As data grew from megabytes to gigabytes to petabytes, these businesses felt the need to store the data efficiently and to use it to improve the different aspects of the business. One such domain is retail, where customers and products are the key entities. Which product is needed by what type of customer, and when, are the key questions of the retail business; if they are answered well, they can take a retail business to new heights. Answering these questions is where a data warehouse plays an important role: it helps to analyse the key factors that improve the sales of retail stores. To know what a customer buys and in which season, we need to look at the whole of the data, so the first step is to collect all the historical data in one place in a standard format. This is done by building a data warehouse. Many platforms support this, such as Teradata, Netezza, Oracle and Hadoop. Once the warehouse is prepared, the dataset can be used in many ways to answer a wide range of queries. In this project I have simulated the preparation of a real data warehouse and the answering of business queries on top of it.

Data Sources:
Data is the basic requirement of any data warehouse. The data for this data warehouse is collected from three different datasets. The first one comes from a global superstore dataset, from which I took the data of five different stores in different locations in the USA for the year 2012. The second contains the revenue collected by each store in each month. The third indicates which month falls in which season in the USA. The three datasets are easily coerced together because they all share the same month_id, which is used in a SQL lookup query to populate the fact table, as shown in Fig. 12 (a hedged sketch of this linkage is given at the end of this section).

Data Source 1: This dataset has been fetched from www.Kaggle.com. Kaggle is a repository of thousands of datasets. This dataset contains data for supermarkets across the whole globe; the data used in this data warehouse covers five different states of the USA: New York, New Jersey, New Hampshire, Utah and Texas. Link to the dataset: https://kaggle2.blob.core.windows.net/datasets/1048/1903/global_superstore_2016.xlsx.zip?sv=2015-12-11&sr=b&sig=V6MbJAh5QVwQC8wLLiPrsC8dKochxZ354VLclEnFuWM%3D&se=2017-04-07T08%3A21%3A15Z&sp=r

Data Source 2: This is a dummy dataset generated with Mockaroo. It contains the revenue of each month for each state.

Data Source 3: This is an unstructured dataset which I scraped from the site https://www.englishclub.com/vocabulary/time-months-of-year.htm. The data was loaded into Excel, where it initially looked as shown in Fig. 3.1, and was then cleaned and made structured as shown in Fig. 3. This dataset holds the seasons of the USA.

Fig. 1
Fig. 2
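To make the linkage between the three sources concrete, the minimal T-SQL sketch below relates them through the shared month_id. It assumes the three Excel files have already been landed in the staging tables named later in the Extraction section (dbo.Main_Stage, dbo.state_stage, dbo.season_stage); the column names used in the joins are assumptions for illustration only, not taken from the project files.

    -- Hedged sketch: how the three sources relate through the shared month_id.
    -- Table names follow the staging tables named in the Extraction section;
    -- column names (month_id, state_name, season_name, revenue, etc.) are assumed.
    SELECT  ms.customer_name,
            ms.product_name,
            ms.state_name,
            ms.month_id,
            se.season_name,      -- from the scraped month/season source
            st.revenue           -- from the Mockaroo revenue source
    FROM    dbo.Main_Stage   AS ms
    JOIN    dbo.season_stage AS se ON se.month_id = ms.month_id
    JOIN    dbo.state_stage  AS st ON st.month_id  = ms.month_id
                                  AND st.state_name = ms.state_name;

In the actual package this matching is performed by SSIS Lookup transformations rather than a hand-written query; the sketch only illustrates the shared key.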
Data Source 3This is the unstructured data set which I Scraped from the site: https://www.englishclub.com/vocabulary/time-months-of-year.htm This data has been uploaded into excel which looks like as shown in Fig.3.1 which is cleaned in and made structured as shown in Fig.3.This dataset have seasons of USA. Fig.1 Fig.2 Data Warehouse Design and Architecture: To carry out the analysis of retail store in different state of USA like how much is the revenue generation, amount of product sold in what month and in which season Kimball’s approach is used to build this Data Warehouse. Design Tool for this Data Warehouse:● Sql Server Management Studio ● Sql Server Integration Services ● Sql Server Analysis Services I have followed the Kimball’s architecture which consist of the following procedures :• Identification of the Process of Business:- We need to define the main process of business like acquiring customer, acquiring the products, then sale process. We also need to understand at what level sales data is summarized. Whether it is daily, weekly or monthly level. This step helps in determining the entities and their relationship as per business requirement. Later on these entities becomes the dimensions of the business. The most important entities are Cusotmer, Product, Location, and time. • Defining the Grain:- Grains mean at what depth we need to store the data for these dimension. It defined the granularity of the system. In this project we are going to store sales of the product at month level. • Defining the Dimensions :- Once entities and grains are decided we can decide the dimension. This dataset contains five dimensions - Dimension Name Primary Key Example Customer Customer-Key Sam Product Product_Key Jeans Location Location_Key Chicago Season season_Key Summer Month Month_Key Table -1 June These dimensions contain descriptive and textual data. • Deciding the fact of the Data Warehouse:-Fact table defines the measurable data we are going to store for the dimesions. It is the pivot of star schema which contain all the primary keys of dimensions and the measurable quantities which are used to carry out business queries. This fact data is designed in such a manner that it helps in identifying which is our regular customer, how to improve retail business as each season have variation in selling of product, how much revenue is generated in which state and last but not least which is the highest selling product. Advantage of Kimball’s Model: Kimball model has slight different approach to build data warehouse as it follows bottom up approach which help in merging small datasets. 
Advantage of Kimball's model: The Kimball model takes a slightly different approach to building a data warehouse, following a bottom-up approach that helps in merging small datasets.
• Performance of the Kimball model is better.
• More focus is placed on the dimensions, which play an important role in analysis.
• The focus of the approach is on the process of building the data warehouse.
• It is less time consuming when creating the data warehouse.

Overview of building the data warehouse to carry out the business intelligence queries: The ETL is performed in an SSIS package. The three datasets are Excel sheets, which are extracted into the staging tables; from the staging tables the data is populated into the dimension tables; and with the help of the Lookup tool (a join) the data is populated into the fact table. The cube is deployed in SSAS, and the business queries are carried out in Power BI, as shown in Fig. 3.

Fig. 3

Star Schema: A star schema looks like a star in which the fact table acts as the pivot, residing in the centre, while multiple dimensions are attached to it in a star-like form through foreign key relationships. A simple star schema usually has one fact table and multiple dimensions, but a complex star schema can contain more than one fact table. Generally, fact tables are in 3NF.

Fact Table: A fact table consists of two types of columns: (i) measure columns and (ii) foreign key columns. Measure columns hold numeric values that can be measured or counted, while the foreign key columns reference the primary keys of the dimension tables. Measure columns can be used with or without aggregation when analysing a business query.

Dimension Table: A dimension table consists of textual and descriptive values. Each dimension table has its own primary key, a unique surrogate value that identifies the rest of the row. The foreign key columns in the fact table are nothing other than the primary key columns of the dimension tables.

Fig. 4

Advantages of the star schema: The star schema has several merits which prove its efficiency as well as its suitability for building a data warehouse.
• It is easy to build an ETL process around it.
• Complexity is low, as queries follow direct relationships between the tables.
• It reduces the effort of normalisation, as the dimension tables are kept in a denormalised form.
• It is very efficient for metric analysis.
• Each dimension table is directly connected to the fact table.
• Navigation of the data is fast because of the direct connection between the fact and dimension tables.

Design of Data Warehouse:
For this retail data warehouse, five dimension tables and one fact table have been created.

Dim_Customer: The customer dimension consists of Customer_name, Customer_id and Customer_Key, where Customer_Key is the primary key of the dimension. It is generated when the dimension is created, by declaring the column as [Customer_Key] INT IDENTITY(1,1) and making it the primary key. The question is why this key is generated at all, given that Customer_id already exists. A primary key must be unique, with no repeated values; because a customer can appear many times, their id repeats and would not keep the column unique, so to remove this redundancy Customer_Key is auto-generated as the primary key of this dimension (a hedged DDL sketch is given after the Dim_Product description below). Customer_name contains the name of the customer and Customer_id contains the id of the customer. With this dimension we can analyse who our regular customers are.

Fig. 5
Fig. 6

Dim_Product: The product dimension has Product_Key as its primary key. Product_id contains the id of each product and Product_name contains the name of the product sold. With the help of this dimension we can analyse which is the highest-selling product and which customer buys which product.

Fig. 7
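Below is a minimal T-SQL sketch of the Dim_Customer and Dim_Product dimensions described above, with the surrogate key declared as INT IDENTITY(1,1) as mentioned in the Dim_Customer section. The column data types and lengths are assumptions for illustration, not taken from the project files.

    -- Hedged sketch of two dimension tables; surrogate keys use IDENTITY(1,1)
    -- as described above. Data types and lengths are assumed.
    CREATE TABLE dbo.Dim_Customer (
        Customer_Key  INT IDENTITY(1,1) PRIMARY KEY,  -- auto-generated surrogate key
        Customer_id   NVARCHAR(50),                   -- source customer id (may repeat in the source data)
        Customer_name NVARCHAR(100)
    );

    CREATE TABLE dbo.Dim_Product (
        Product_Key  INT IDENTITY(1,1) PRIMARY KEY,
        Product_id   NVARCHAR(50),
        Product_name NVARCHAR(200)
    );

The remaining dimensions (Dim_Location, Dim_Source, Dim_Month) follow the same pattern of a surrogate IDENTITY key plus the descriptive columns listed in the sections that follow.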
Dim_Location: The location dimension contains Location_Key as its primary key. State_id is the id of the state, State_name contains the name of the state where the store is located and Region_name contains the region of the country. This dimension is helpful in analysing which state or region has the highest number of customers and which state records the highest sales. It also helps in analysing the revenue earned in each state or region.

Fig. 8

Dim_Source: This dimension is fetched from the unstructured dataset. It contains Season_Key as its primary key and Se_month_id as the id of a particular month. This dimension helps in analysing which month shows the highest sales and which season has which highest-selling product.

Fig. 9

Dim_Month: This dimension contains Month_Key as its primary key, S_month_id as the id of a particular month and Month_name as the name of the month. It can be used to analyse the highest sales in a state by month, or the highest-selling product in a month.

Fig. 10

Fact_Table: For our retail superstore we have created one fact table, which is connected to each dimension table through a foreign key relationship. It has three measure columns: (i) product_quantity, the quantity of product sold; (ii) total_sale, the sale amount per customer visit; and (iii) revenue, the amount of revenue generated by the store per month.

Fig. 11

Star Schema of Project: The dimension tables and the fact table are connected together in a star schema, as shown in Fig. 12.

Fig. 12

Extract Transform Load (ETL) process:
To build a data warehouse, the key steps are extracting the data, transforming it in the staging area and finally loading it into the destination. This is known as the ETL process, and it is carried out here with the SSIS toolbox. Data from the external sources is extracted into the staging database, the transformation stage is then carried out, and the loading stage ends the ETL process by loading the data into the fact table. At the end of the ETL process the data is populated in the fact table as well as in the dimension tables, as shown in Fig. 13.

Fig. 13

Extraction: In this phase, data is extracted from the external sources. For this project the external sources are Excel sheets; otherwise they could be any database or OLTP server. The extraction loads the data into the staging database through an OLE DB destination, as shown in Fig. 14. All the data from these Excel files is extracted into the database; the staged data is stored as (i) dbo.Main_Stage, (ii) dbo.season_stage and (iii) dbo.state_stage, as shown in Fig. 15. A truncate query is run in the staging phase so that no duplicate data is generated by multiple runs, as shown in Fig. 16.

Fig. 14
Fig. 15
Fig. 16

Transformation: After the data is extracted from Excel into the staging database, the next step is transformation. For the transformation I have used the Lookup tool (a join) and SQL queries, as shown in Fig. 19.2, to load the data into the dimension tables. There are five dimension tables and one fact table in the database: (i) dbo.Dim_Customer, (ii) dbo.Dim_Location, (iii) dbo.Dim_Month, (iv) dbo.Dim_Product, (v) dbo.Dim_Source and (vi) dbo.Retail_Fact. These tables are shown in Fig. 17. The dimensions are one of the most important factors in analysing the data, and the column mappings must not be mismatched, as a mismatch will terminate the ETL flow.

Fig. 17
Fig. 18

Loading: After the dimension tables are populated, the next step is to populate the fact table. The fact table contains all the primary keys of the dimension tables together with the measures that are used for analysis, with aggregation rules where required. The Lookup tool (joins) is used to resolve the dimension keys and load the measures into the fact table; a hedged SQL sketch of this step is given after Fig. 19.2.

Fig. 19.1
Fig. 19.2
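As an illustration of what the Lookup-based fact load does, here is a minimal T-SQL sketch that joins the main staging table to the dimension tables to resolve the surrogate keys and inserts the measures into dbo.Retail_Fact. The staging column names used in the joins (customer_id, product_id, state_id, month_id, quantity, sales) are assumptions based on the columns described in the dimension sections; the real package performs the equivalent work with SSIS Lookup transformations.

    -- Hedged sketch of the fact load: resolve dimension surrogate keys via joins
    -- (the SQL equivalent of the SSIS Lookup transformations) and insert measures.
    -- Staging column names are assumed for illustration.
    INSERT INTO dbo.Retail_Fact
            (Customer_Key, Product_Key, Location_Key, Season_Key, Month_Key,
             product_quantity, total_sale, revenue)
    SELECT  c.Customer_Key,
            p.Product_Key,
            l.Location_Key,
            se.Season_Key,
            m.Month_Key,
            ms.quantity,          -- quantity of product sold
            ms.sales,             -- sale amount for the customer visit
            st.revenue            -- monthly revenue of the state, from the second source
    FROM    dbo.Main_Stage   AS ms
    JOIN    dbo.Dim_Customer AS c  ON c.Customer_id  = ms.customer_id
    JOIN    dbo.Dim_Product  AS p  ON p.Product_id   = ms.product_id
    JOIN    dbo.Dim_Location AS l  ON l.State_id     = ms.state_id
    JOIN    dbo.Dim_Month    AS m  ON m.S_month_id   = ms.month_id
    JOIN    dbo.Dim_Source   AS se ON se.Se_month_id = ms.month_id
    JOIN    dbo.state_stage  AS st ON st.state_id    = ms.state_id
                                  AND st.month_id    = ms.month_id;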
Deploying the CUBE: This is the phase in which a multidimensional representation of the data is built as a cube in SSAS, which is then used to analyse the data on the basis of the measures present in the fact table and the descriptive, textual data present in the dimension tables. Here, Project.Cube is successfully deployed, as shown in Fig. 20 and Fig. 21. After the cube is deployed, the analysis and reporting phase starts, in which the business intelligence queries are carried out.

Fig. 20
Fig. 21

Business Analytics
Tool used for the business queries: Power BI. Power BI is used to carry out the analysis of this data warehouse. For the analysis, the cube is imported into Power BI, and with the help of the descriptive, textual and measurable quantities the business queries are carried out. The following business queries can be answered with the help of our database.

Case Study 1: Do the seasons (summer, spring, winter, autumn) in the three different regions of the USA affect the retail store business in terms of revenue collection?
This query touches all three datasets. To answer it we use revenue, season name and region name. The graph below shows how much revenue is generated in each region and in each season.

Fig. 22

Analysis: From the clustered bar chart we can see that the highest revenue is generated in the summer season, followed by autumn and then winter, while spring is responsible for the least revenue in each region of the USA. The graph also shows that in every season the store earns most of its revenue from the Eastern US, with the Western US last. This gives the marketing and sales teams a quick insight that they need to work on the Western region to increase sales and to find out why spring is so slow.

Case Study 2: Sales generated in different states on the basis of seasons.
This query is also built from all three datasets. To answer it, total sale, state and season are used. The pie chart in Fig. 23 represents the sales of the different states in the different seasons.

Fig. 23

Analysis: This pie chart is used to analyse the sales of the store in the different states across the seasons. Fig. 23 shows that sales in Texas in the summer season are the highest, followed by New York, and that New York records the highest sales in the autumn season, followed by Texas. So New York and Texas are the biggest buyers in any season, while the rest of the states are slow in all seasons, which suggests that the state is a very important factor in terms of sales. We need to understand the needs of the Western US states that our store is not able to cater for: either we need to change the products, introduce more offers, or perhaps the store manager is not very efficient. Season and state are both very important factors in the US; a product that suits New York in winter might not suit Utah during the same period, and this kind of variation needs to be considered when planning store products.

Case Study 3: Analytical targeting of customers.
To answer this query we need to check which customer buys the maximum number of products in which season. Product quantity, customer name and season are used for targeting specific customers (an equivalent SQL sketch over the star schema is given after the analysis below).

Fig. 24

Analysis: The donut chart in Fig. 24 represents the customers who buy the maximum number of products in the four different seasons; it explains which customer bought what quantity of product in which season. From a business point of view we can target these specific customers and provide them with further offers to improve our sales.
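Although the report runs this analysis on the SSAS cube through Power BI, an equivalent aggregation can be expressed directly against the star schema. The sketch below ranks customers by quantity bought per season; it assumes a season_name column in Dim_Source (the report names only Season_Key and Se_month_id, so this column name is an assumption) and the Customer_name and product_quantity columns described earlier.

    -- Hedged sketch of Case Study 3 over the star schema (the report itself
    -- uses the SSAS cube in Power BI). The season_name column is assumed.
    SELECT  se.season_name,
            c.Customer_name,
            SUM(f.product_quantity) AS total_quantity
    FROM    dbo.Retail_Fact  AS f
    JOIN    dbo.Dim_Customer AS c  ON c.Customer_Key = f.Customer_Key
    JOIN    dbo.Dim_Source   AS se ON se.Season_Key  = f.Season_Key
    GROUP BY se.season_name, c.Customer_name
    ORDER BY se.season_name, total_quantity DESC;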
Case Study 4: Seasons affecting the revenue of the states.
This query also touches all three datasets. To analyse it, season, revenue and state are used to check the amount of revenue generated by each state in every season.

Fig. 25

Analysis: The graphical representation in Fig. 25 shows how much revenue is collected in each state in each season. New York has generated the highest amount of revenue in every season, while New Hampshire has generated the least. From a business perspective, the revenue generation of New York and Texas is significantly high.

Conclusion:
This data warehouse can help to show how we can target specific customers in particular regions of the country. New York and Texas have the highest sales and the highest revenue generation, while New Hampshire generates significantly less than each of the other states, so efforts should be made to improve sales in New Hampshire, Utah and New Jersey. Seasons also play an important role in the retail business, as sales in the summer season are the highest of all. With the help of this data warehouse we can also examine which product is sold in which month, so that we can offer extra promotions on that particular product.