UNIVERSITY OF HOUSTON-CLEAR LAKE
DATA WAREHOUSING AND DATA MINING
Case Study on Building a Data Warehouse
Submitted by: Luis (Lead), Sonal Dandwate (1411210), Aniruddha Manjil
University of Houston-Clear Lake
ISAM 5332
Dr. Rob

Sprint: Case Study on Building a Data Warehouse

Abstract
This paper discusses the concepts of a data warehouse and its use in our project to develop business scenarios, formulate strategic decisions, design a dimensional model and a cube, and propose outcomes.

Introduction
Sprint is an American multinational telecommunications corporation. Sprint ranks fourth among mobile phone providers. Sprint is one of the industry giants, and holding its market position each year requires a great deal of data analysis. It was observed that Sprint had started to lose market presence to its competitors, with AT&T offering better rates and smarter plans. Through in-depth market research and analysis, Sprint formed an idea of what types of plans customers want, which plans sell well in which region, what type of customer base a particular region has, and so on. To follow a breakdown approach, Sprint decided to break down its analysis by store geography. Based on the resulting reports, management can decide what steps to take to improve sales and to increase the customer base and customer satisfaction.

Sprint, being a giant company, generates a large amount of data every day. A transactional system alone is not adequate to process such data; a data warehouse is the best option for handling it. A data warehouse gives a better understanding of the data, answers many questions, and helps in strategizing and formulating future decisions. Data in the data warehouse has to be collected continuously, transformed, and cleaned. By implementing a data warehouse, analysts can gather data from different departments and use it to better understand the business, develop strategy, and help management take the necessary steps to grow the business.
Case Study: Sprint

1. Business Scenario
Sprint Corporation, also known as Sprint, is an American telecommunications company that provides wireless services and is a major Internet carrier. Sprint is the fourth-largest wireless network provider in the USA and served 58.6 million customers as of November 2015. The company is headquartered in Overland Park, Kansas. Sprint is one of the largest long-distance providers in the United States.

Every business aims to increase its market presence; to achieve that, Sprint is expanding its stores and improving its coverage across the country. Sprint also has to decide which plans to offer and which age groups to target, and it has to identify the areas with the highest data usage, the highest cellular usage, and poor cellular connectivity. To formulate strategic decisions, Sprint can set up a data warehouse. Once the data warehouse is set up, the company can manage and analyze its data by looking at the reports the system generates.

Let us look at a few business needs that Sprint may have:
• Identify the types of customers to target (demographic segmentation).
• Offer the plans that will sell the most, based on demographic segmentation.
• Generate quarterly and yearly sales and usage reports based on various factors.
• Generate region-wise sales data to expand the business and open more stores.
• Study customer usage patterns based on various factors and create new combo plans.
• Expand the market and target more customers.
• Formulate future plans based on customer feedback.

2. Why a Data Warehouse?
A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions. Over the past few decades, operational and informational systems proved very difficult to work with for this purpose and gave disappointing results. A huge amount of data was generated, but appropriate tools to deal with it were not available.
There was a great need for a new system that could help generate new data and use archived or historical data to provide valuable, strategic information.

A few important characteristics of a data warehouse are:
• It is a concept that helps you pick appropriate data (existing data, historical data), clean it, and process it, which helps in strategic decision making.
• It is an environment, not a product.
• In a data warehouse environment, users can access whichever data they need.

Sprint is one of the industry leaders in the telecommunications business, where it has to deal with large volumes of data on a daily basis. The data may relate to its stores, network towers, inventory, employee information, customer information, and daily, weekly, monthly, or yearly transactions. To solve these problems and use the data appropriately, Sprint built a data warehouse. This warehouse contains data from different departments, stores, customers, employees, and daily transactional systems. This data helps us understand sales patterns, plan inventory, manage resources, and, most importantly, increase and improve the customer base.

Once we analyze our data, we can identify our grey areas. We can see which stores and regions are producing a good volume of sales, and we can also see which regions, stores, and plans are performing badly. Taken together, this tells us the health of the business. A data warehouse helps the business stay healthy.

References
1. Data Warehouse. (2015, December 3). Retrieved December 7, 2015, from https://en.wikipedia.org/wiki/Data_warehouse

4. Methodology
Steps for creating a data warehouse:
Determine Business Objectives: Sprint is a well-established company. To understand management's questions, we have to identify which items define the success of the business.
Collect and Analyze Information: To understand the requirements and needs of management, the easiest way is to ask questions.
Questions let us understand what information might be needed for decision making. Generating reports also helps in this step.
Identify Core Business Processes: In this step we identify the entities required to understand and create the key performance indicators.
Develop a Conceptual Data Model: In this step we determine the subjects that are to be expressed as fact tables. We also identify the dimensions related to these facts.
Identify Data Sources: Now that we have a conceptual data model, we have to identify where the critical information lies and how to move it or relate it into a data warehouse structure. Data transformations must be planned in this step.
Set Historical Data Limits: A data warehouse deals with a large amount of information, which may include historical data. We must determine how much historical data we want to store in our data warehouse.
Implement the Plan: Once all the steps above are complete, we define estimates for the work and for project completion.

5. Dimensional Modeling
Dimensional modeling results from incorporating business dimensions into a logical data model. It is a logical design technique that structures the business dimensions and the metrics that are analyzed along these dimensions. The technique is intuitive for that purpose, and queries and analysis against the model can achieve high performance. Data in this model is held in a fact table and dimension tables.

A fact table is used to maintain measurements. Each row in a fact table represents data that may relate to a particular customer, a particular product, or sales in a particular state or region. For each entity in an application, there will always be a row associated with that entity in the fact table. Some fields in a fact table may not have data.
For example, when there are no sales for a particular product in a given month, the corresponding value may be absent from the fact table. Such rows create gaps in the fact table. The fact table for the Sprint data warehouse may contain Usage as an attribute, with records such as “Usage_For_January”.

Dimension tables represent the business dimensions along which metrics are analyzed. An important characteristic of a dimension table is that it is wide: it consists of many columns and attributes. When a dimension table is laid out like a normal table with columns and rows, it spans out horizontally.

Dimension attributes are the columns of a dimension table. In the Stores dimension, the attributes can be Store ID, Store City, Store State, Store Country, and Zip Code. Generally, dimension attributes are used in report labels and in query constraints such as where Country='USA'. Dimension attributes also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what it will contain.

Business Dimensions: The following are the business dimensions that we can use in our data warehouse for Sprint. The major dimensions are Time, Customers, Stores, and Products. Each of these may have multiple attributes.

Customer    Time      Product      Store
Cust_ID     Day       Product_ID   Store_ID
Location    Month     Prod_Desc    Location
Name        Year      Type
Age         Quarter   Price

Our data warehouse may contain the following fields in the fact table:
• Sales
• Usage

A dimensional hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. The levels in a hierarchy form a tree-like structure. Members at the lowest level are called leaf members, and they are connected to a single member at the highest level.
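To make the fact table, dimension tables, and hierarchy ideas above more concrete, the following is a minimal Python sketch, not the project's actual implementation. It assumes a tiny in-memory layout: the sample rows, city names, sales amounts, and the helper names `roll_up` and `slice_by_product` are all hypothetical and chosen only for illustration. The Store hierarchy (city, state, country) follows the Store attributes named in the text.

```python
from collections import defaultdict

# Dimension tables keyed by ID. All rows are made-up sample data.
store_dim = {
    1: {"city": "Houston", "state": "Texas", "country": "USA"},
    2: {"city": "Dallas", "state": "Texas", "country": "USA"},
    3: {"city": "San Diego", "state": "California", "country": "USA"},
}
time_dim = {
    10: {"month": "January", "quarter": "Q1", "year": 2015},
    11: {"month": "April", "quarter": "Q2", "year": 2015},
}
product_dim = {
    100: {"prod_desc": "5GB Data", "type": "Plan"},
    101: {"prod_desc": "Add on Messaging", "type": "Add-on"},
}

# Fact table: each row points at one member of every dimension
# and carries the Sales measure (hypothetical amounts).
sales_fact = [
    {"store_id": 1, "time_id": 10, "product_id": 100, "sales": 120.0},
    {"store_id": 2, "time_id": 10, "product_id": 101, "sales": 80.0},
    {"store_id": 3, "time_id": 10, "product_id": 101, "sales": 50.0},
    {"store_id": 1, "time_id": 11, "product_id": 101, "sales": 70.0},
]


def roll_up(facts, store_level, time_level):
    """Sum sales grouped by one level of the Store and Time hierarchies."""
    totals = defaultdict(float)
    for row in facts:
        store = store_dim[row["store_id"]]
        period = time_dim[row["time_id"]]
        totals[(store[store_level], period[time_level])] += row["sales"]
    return dict(totals)


def slice_by_product(facts, prod_desc):
    """Keep only the fact rows for a single product description."""
    return [
        row for row in facts
        if product_dim[row["product_id"]]["prod_desc"] == prod_desc
    ]


# Leaf level (city) versus a higher level (state) of the Store hierarchy.
by_city = roll_up(sales_fact, "city", "quarter")
by_state = roll_up(sales_fact, "state", "quarter")
messaging_by_state = roll_up(
    slice_by_product(sales_fact, "Add on Messaging"), "state", "quarter"
)
```

In this sketch, `by_city` keeps Houston and Dallas as separate entries, while `by_state` combines them under Texas, which is the roll-up from a lower level of the hierarchy to a higher one. Reading the results in the opposite direction, from state back to city, corresponds to drilling down.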
Dimensional hierarchies are the various levels of detail contained within a business dimension. Analysts, data experts, and higher-level management can use them as paths for drilling down or rolling up in analysis. Modern software is very useful for designing fact tables and dimension tables and for establishing the relationships between them.

There are two types of schemas generally used in a data warehouse: the STAR schema and the Snowflake schema. The STAR schema gets its name from the way the dimension tables are arranged around a centralized fact table, making the layout look like a star. The dimension tables in a STAR schema are not normalized. Snowflaking is a method of normalizing the dimension tables in a STAR schema. When normalized, the resultant structure resembles a snowflake, with the fact table at the center. The component tables are also known as dimension tables.

6. Defining Data
When defining our data, we chose the star schema, which we felt was best suited to the data we were providing. The star schema can be seen below. We defined the relationships between the tables and determined what each table must contain to achieve our overall goal. The tables used in our schema are:
Dimensions: Product, Time, Customer, Store
Fact Tables: Sales, Usage

7. Data Cleaning
Errors occur in all data. Some errors are easy to identify, but at times it is very difficult to find the slightest mistake in the data. Data coming from different sources may be entered in different formats, which can cause errors because it can change the context of the data. Data cleansing is a rigorous process that can be time-consuming and tedious, but it is essential to the overall goal of the business. When decision makers use the data warehouse to analyze the business, they depend on accurate and correctly formatted data in the data warehouse.
That is why, before the data is made available in the data warehouse, it must be cleansed and transformed correctly. Transformation is important because it builds consistency of format across the data warehouse, so that decision makers can be sure the data they pull does not contain errors. Since data warehouses are time-sensitive, the transformation process also makes sure that the dates provided are correct and accurate.

One big issue in businesses today is information overload, which can overwhelm decision makers. It is important to focus only on the data that is relevant to the business goal. Data that is not important to the decision maker is removed in the data cleansing process.

8. Implementing in SQL Server Analysis Services 2012
Tables in SQL Server 2012
Once the cleansing and transformation of the data is complete, it is time to move our data from Microsoft Access to SQL Server 2012. There are many advantages to having your data in SQL Server 2012 rather than MS Access. One major advantage is that SQL Server supports a much larger volume of data, so it can serve companies from small to large. The most important advantage is the ability to create a data cube. A cube consists of dimensions, fact tables, and the data related to them. The obvious advantage of a data cube is that it organizes your data, which makes it much easier for an analyst to query. Within the data cube, you can also use techniques such as slice and dice, roll-up, and drill-down. These techniques were essential when it was time to view our data while creating our business plans.

When developing the cube, it is important to first define hierarchies in each dimension.
This is essential for delivering the correct information to decision makers. Depending on the type of business or the managerial position, it is important to be able to determine what data is needed. Executives might want information at the state level, while a store manager might want information only at the city level. With SQL Server 2012, you can manipulate the data as needed. A cube is created once relationships are defined between dimensions and hierarchies are set.

9. Browsing the Data Cube
The data cube is the most important part of data warehousing. An OLAP cube is a method of storing data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data (measures) are categorized by dimensions. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.

Now, looking at the outcome from our cube:

The above figure shows which products sold more per quarter per state. In the first quarter:
• Accessories and Add on Messaging sold the most in California
• The 5GB Data and Unlimited talk time plans were popular in Pennsylvania
• The most popular plans were 3000min Talk time only, Add on Messaging, and Unlimited talk time
• The least popular was 5GB Data
The objective was to identify where and when products are sold.

The above figure shows which products sold more per quarter per state. In the second quarter:
• Add on Messaging sold the most in Texas
• The second most popular product was Unlimited talk time in Arizona, Florida, Georgia and Washington
• The third most popular products were the 5GB Data plan and Accessories in Arizona, Colorado, Florida, North Carolina, Texas and Washington
The objective was to identify where and when products are sold.

10. Report Generation
The main goal of implementing a data warehouse was to make use of our available data to support crucial business decisions.
This information generated by the data cube helps managers make strategic decisions. The following are a few of the reports generated by our cube:

Report - To determine usage per quarter
Usage was highest in the third quarter and lowest in the fourth quarter.
The maximum usage was of 3000min Talk time only, in the first quarter. Whereas, 5 GB data had maximum
The objective is to identify this highest peak of usage.

Report - Usage Report by Age and Plans
• Age group 25-35 has the most data usage in the 5GB Data plan.
• Text count and minutes used were highest for age group 32 and 44.
Solution - Determine which plans are popular with a certain age group.

11. Conclusion
The process of data warehousing has taught us the crucial technique of building a cube and using it to support crucial decisions. It has given us knowledge about making strategic decisions with the help of the data cube. Solving various challenges gave us a better understanding of the cube and how to implement it in a real scenario.