UNIVERSITY OF HOUSTON-CLEAR LAKE Ford Motor Company A Case Study on Building a Data Warehouse Prepared by: BalaNavya Karuturi Vineela Tatineni Jeff Rix Prepared for: Dr. Rob ISAM 5332 12/4/2011 University of Houston-Clear Lake ISAM 5332 Dr. Rob Ford Motor Company: A Case Study on Building a Data Warehouse Jeff Rix BalaNavya Karuturi Vineela Tatineni Abstract This paper discusses the process of building a data warehouse for Ford Motor Company, a global automotive company which is the second largest automaker in the U.S. Introduction Ford Motors is a wide-spread company expanded over all regions and states of the United States and manufactures and sells a wide variety and range of vehicles. The sales and production volumes of the company has been increasing significantly over the past 10 years. But over the recent years, there have been inconsistencies with the sales across various regions. It plans to reduce overhead costs by taking note of the dealerships that are not performing well enough. But however, they would want to do the classification based on region, so as not to eradicate or adversely affect the customer-base off an entire geographical region. Ford Motors currently does maintain multiple transactional systems at various dealerships. These systems are used to run the transactional and maintenance operations of the business like sales, services and accessories/parts sales of the various vehicle types. But these systems do not provide them with the above-needed information that Ford needs in order to obtain answers to certain business questions like which dealership amongst those in a particular state have the lowest sales, so as for Ford to reduce storage space for the dealership and in turn reduce overhead cost at that dealership. Many more of such strategic questions need to be answered in order for management of Ford to take decisions pertaining to the improvement/enhancement of the business over time. For the same purpose, Ford would need what is known as a Data Warehouse that houses information in a format so as to be able to provide answers to the strategic questions that Ford needs answers to. The purpose of a data warehouse is to collect desired data from various transactional systems, convert it into a common format, cleanse it, and allow it to be queried for useful strategic information in the future. Data must be continually collected, transformed, and cleansed from each of the different transactional systems in order to keep the warehouse properly maintained. Case Study: Ford Motor Company 1. Business Scenario Ford Motors is a wide-spread company expanded over all regions and states of the United States and manufactures and sells a wide variety and range of vehicles. The sales and production volumes of the company has been increasing significantly over the past 10 years. Ford is increasing rapidly in terms of 1|Page volumes and dealerships. The problem with an enterprise as big as Ford is that the ownership and the management face a huge problem, which is that with the number of locations and dealerships and the rate at which new dealerships are allocated and the service and sales centers are expanded, it is difficult for them to conduct timely visits to each of the locations in order to monitor dealership sales, employee performance, branch revenue, and ensure that the best care in terms of “aftersales-service” is being offered to the customers. It is thus imperative for an enterprise of this magnitude to get to find a way to extract the data that he needs from each of these systems he will be able to maintain greater control over his business. We will define sales for new vehicles by dealership, region and state for each product and vehicle type for each month and quarter in one year. 2. Why a Data Warehouse? Being an automotive company, the inventory, sales and location do matter for ford and this case study helps management keep track on sales at different locations. The data should help in decisions regarding inventory management, process management and cutting down on dealerships with less efficiency. Also, this study helps in concentrating on the dealerships with good success rates. So, the usual initial purpose of a Data Warehouse is simply to give end users more flexible access to data they already have. The Date Warehouse grows from there. Eventually it will fulfill the higher management’s in Data Mining: finding profitable customer demographics that could have been ignored, finding releases of certain models of vehicles that they have not tried at a particular geographical location and so on. Some points to look at to make key decisions: What dealership has the lowest sales in Texas? We could determine which dealerships to close first to reduce overhead. What vehicle types sell the most in each region? We would know what types of vehicles to provide in each region. What vehicle name sells the most in each state? We would know to order more of that product in that state. What quarter has the highest sales? We would know when to order more vehicles for the dealerships. What region sells the most in the busiest quarter? We could use this data to find out where we would like to create new dealerships. What vehicle type is selling the least in each state? We would know to order less of that product in that state. What month is least successful by city? We could enhance marketing in that quarter. The management needs a system where data is stored with respect to time as another dimension, and through the same/related query this data is used to obtain reports that will provide them with the strategic information that is needed from questions like those in previous section. This can include reports such as revenue for each dealer by state, revenue by vehicle category, customers who have bought high-end models and sales persons who are pivoting most sales in the dealership. The ability to query this kind of system online from his office is the perfect solution to his problem. 2|Page 3. Methodology Procedures of creating a DW that make it advantageous: With a Data warehouse, the queries can be run without putting load on the production databases. Data of different vehicles is extracted region wise by external websites which are transformed and sorted accordingly. This loaded the Data into Data Source and connected to the Analysis Services. Multiple data-sources can be mapped with the data ware house which eliminates a single source query issues. Multiple levels of filtering can be avoided by using a data warehouse which is a more streamlined approach. The database is not dependent on the operational infrastructure and hence decreases the criticality of having downtime due to "Single point of failure". This also decreases the chances of slowing down the IT infrastructure due to higher loads on the system. From business perspective, the operational and customer relationship applications are more easy to use. 3|Page Dimensional Modeling - Dimension Table 4. Dimensional Modeling and defining data structure Data Management - Dimensional Modeling Definition Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse. Dimensional model is the underlying data model used by many of the commercial OLAP products available today in the market. In this model, all data is contained in two types of tables called Fact Table and Dimension Table. In a Dimensional Model, context of the measurements are represented in dimension tables. You can also think of the context of a measurement as the characteristics such as who, what, where, when, how of a measurement (subject). In your business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What). The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships. Before designing your data warehouse, you need to decide what this data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products then your dimensions are: Location Time Product Query: Dimensional Modeling - Fact Table In a Dimensional Model, Fact table contains the measurements or metrics or facts of business processes. If your business process is Sales, then a measurement of this business process such as "monthly sales number" is captured in the fact table. In addition to the measurements, the only other things a fact table contains are foreign keys for the dimension tables. 4|Page modeling. Modern software is very useful when designing fact tables, dimension tables, and establishing the relationships between them. When you have finished modeling the dimensions and establishing the relationships, you end up with a database schema. The two types of schemas that are generally used in a data warehouse are the STAR schema and the snowflake schema. The STAR schema is a simple database schema for data design using a dimensional model. This schema consists of a fact table in the center that is directly related to the dimension tables that surround it. Although the STAR schema is a relational model, it is not a normalized model. The snowflake method normalizes the dimension tables in a STAR schema. As mentioned earlier, it is very important to employ the correct approach when designing the schema for a data warehouse. This is where we made our first mistake. As a result of this, we have two different database schemas. The first schema that we designed was a complex snowflake schema which is depicted in figure 1.1. Each dimension table contains data for one dimension. In the above example you get all your vehicle location information and put that into one single table called Location. Your store location data may be spanned across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all that data into one single table. Dimensional attributes: Within each of the dimensions. These dimensional hierarchies are the various levels of detail contained within a business dimension. Managers can use the dimensional hierarchies as the paths for drilling down or rolling up in analysis. Dimensional modeling is the technique that is in designing a data warehouse. Many software vendors have expanded their modeling case tools to include dimensional Figure 1.1 When the dimensions in a STAR schema are completely normalized the resulting structure resembles a snowflake with the fact table in the middle. In the case of Benchmark, the most important fact to analyze is the sales commission which is the revenue for the company. The transaction table is the fact table which contains commission as an attribute. Transactions are analyzed base on dimensions such as Customer, Employee, Investment, Time, Date, and the Commission collected. 5|Page At first, we approached this project from the wrong direction, and because of that we normalized the tables in our schema. This was done because we were thinking in terms of a transactional system where we would need to track changes in the price of stocks, bonds, and other marketable securities. We stored detail information about employees such as position, and salary. After we realized that we weren’t looking at project correctly, we recognized that this information was not necessary for the data warehouse. Although a data warehouse is based on a relational database like one that would be used in an operational system what we were building was a decision-support system. The important component to the entire project is the ability to track and analyze the commission. We do not need to worry about the price of investments or employee salary. In order to correct our error, we redesigned our database schema and arrived at a Star schema which is depicted in figure 1.2. In this schema, we have the Transaction as the fact table with four dimension tables: Employee, Customer, Investment and Date_Time. We kept these dimensions because we needed to analyze commission by state, by employee, by customer, by investment type, and by investment risk class. Figure 1.2 From our new Star schema, we were able to define the data format that we need for the warehouse. We defined table name, attribute name, data type of each attribute in each table, and relationship among tables. From Benchmark’s operational system, we extracted data into an excel file that is depicted in figure 1.3. Many of the fields that were extracted are necessary for an operational system, but were not needed in the data warehouse. This is where cleansing data becomes important. In order to keep the warehouse efficient, we used Excel to remove the extraneous data before it was imported. After cleansing the data, we had attributes that were important to the structure of our system. An example of the cleansed data is found in figure 1.4. When we were satisfied that are data was formatted and cleansed correctly, we moved to our next step which was to implement our database schema in Microsoft Access. This schema is displayed in figure 1.5. We used Access to define our relationships and make sure that the system functioned before importing the database into SQL Server 2008. 6|Page Figure 1.3 Figure 1.4 7|Page Figure 1.5 5. Implementation in SQL Server 2008 In order to browse the data that is contained within the data warehouse, you must design a data structure called a cube. Constructing the cube is done in SQL Server Analysis Services. A cube is comprised of the fact table and all of the data that is directly related to it. The cube organizes the data into a format that can be easily queried, rolled up, drilled down, and sliced and diced based on the measures and hierarchies that are applicable to your particular data set. Importing a database from Access to SQL Server is supposed to be an easy process, but trust us it is anything but. Trying to figure out how to get the program to accept your data and process it turned out to be one of the biggest challenges of the whole project. According to Scott Cameron’s SQL Server 2008 Analysis Services: Step by Step, if you have an existing relational database such as Access, Teradata, Oracle, IBM DB2, as well as some others, you should be able to select the appropriate driver and connect to your data source without any difficulties.(Cameron 39) If only this were true. Due to not having sufficient security clearance to upload our database onto the University of Houston-Clear Lake (UHCL) server, we decided to use a personal laptop to run the software. Operating system compatibility was one of the first problems that we encountered. The solution to this problem was to download the applicable service pack from Microsoft Update. Once the software was installed we attempted to import the database from Access. The next attempt to import the database resulted in being able to import the data, but this time we could not build or deploy our cube. Do not get frustrated when you encounter this problem. We have chosen not to outline the steps that we took to get the program to function correctly as they will be different for each application. Once we had the database imported and functioning properly, we commenced building our cube. The ability to roll up, and drill down your data is based on the hierarchies that are defined within your dimensions. This very important step is depicted in figure 1.6. 8|Page Figure 1.6 Without defining the hierarchies in each of your dimensions, you will not have access to all of the data. When a manager is looking for information, he may want a very high level of granularity, or a very low level of granularity. These types of details are very important when deciding how to define the dimensions that are contained within your data warehouse. When all of the hierarchies are defined, you must set up the relationships that are contained within the dimensions. An example of these relationships can be seen in figure 1.7. Figure 1.8 6. Browsing the Cube Keeping in mind that the ultimate goal of the data warehouse is to provide strategic information to managers and business owners, it is now time to browse the cube that you have created. This is the process where you are actually designing the queries that will provide the reports the end user is looking for. The Benchmark project is concerned primarily with commission that is collected from each transaction. In order to get a picture of the business as a whole it is more reasonable to query the data for commission from a particular region, or in our case, by each state. In figure 1.9, we have shown commission by state as it is presented in the cube browser. Figure 1.7 When all of hierarchies and relationships are set up, the cube can be launched. A fully implemented cube will look something like figure 1.8. Figure 1.9 This is a very high level of detail. If you were to add all of the hierarchies that are available to this query, you can drill the data down to provide commission for each employee in each zip code, as it relates to each type and name of investment. This is shown in figure 1.10. 9|Page Figure 1.10 7. Generating useful reports Being able to browse the cube and design queries is a very powerful and useful tool. Unfortunately, to the end user of the system, some of these queries are almost unreadable within in the cube browser. Remember that the final result of this project is to provide strategic information that will be useful to management in making decisions that will affect the future health of the organization. These reports are not going to be provided to a member of the IT staff who would be comfortable viewing the format in the browser. Management will want a report that can be read and interpreted easily. Providing these kinds of reports is easily done once you have a functional cube. The cube that was initially created within Analysis Services can also be accessed with Reports Service which is another very powerful tool that is included in SQL Server 2008. By creating a Reports Services project, we were able to generate reports from the Benchmark warehouse that will be useful to the owner and management. The same information that is depicted in figure 1.9 is again displayed in figure 1.11 in a much easier to read format. Figure 1.11 Another report that the Benchmark management wanted was Customers with High Risk Investments, figure 1.12, which would allow them to find customers who have money in an investment that is now considered to be high risk. Figure 1.12 10 | P a g e Even though this particular report does not directly track commission, it is directly related to the amount of commission that the company collects. The goal of Benchmark is help grow the retirement funds of their customers, and if they were to ignore these risky investments, they would lose money, ruin the reputation they have strived to build, and drive new and existing customers away. When there are no customers, there is no commission to keep track. Sales by State by Vehicle Name Lowest sale by dealers in a state Sales by Quarter Highest Sale by vehicle type for fourth quarter Sales by Region 11 | P a g e Conclusion Sales by Month Entering into the process of constructing a data warehouse with no prior knowledge of the subject proved to be quite a challenge. It also turned into an exceptional learning experience. We learned to carefully analyze the project that has been presented before diving into it head first. It is essential to do this so that you can be sure that the correct approach is being taken in regards to the end result. Starting with the desired result and working backwards turned out to be the direction that we ultimately took with this project, and is probably a viable approach to take when designing a data warehouse. A data warehouse is a report centric system, so beginning with an understanding of the desired output will lead to a much more efficient design plan. We believe that we have constructed a system that Benchmark will be able to rely on for their reporting needs for the foreseeable future. 12 | P a g e Works Cited Cameron, Scott. (2009) Microsoft SQL Server 2008 Analysis Services Step by Step. Redmund, Washington: Microsoft Press. Ponniah, Paulraj. (2001) Data WarehousingFundamentals: A Comprehensive Guide for IT Professionals. New York, New York: John Wiley & Sons. 13 | P a g e