04-1 Design Case 2 Notes

Case: Mail out

This modifies the previous exercise.

A part of the day-to-day operations (OLTP system) is kept on an Oracle database. The section we will look at is known as the Order Entry System.

Drawing of the system (tables shown):
- Categories
- Customer
- Warehouse Location
- Salesrep
- Product
- Orders
- Items or Order line

The following shows some of the attributes found in the above tables.

CUSTOMER TABLE        PRODUCT TABLE             SALESREP TABLE
CID                   PID                       SID
Customer Name         Product Name              Salesrep Name
Shipping Address      Category ID               Street Address
Billing Address       Brand                     City
Phone                 Cost (unit cost)          Province
Credit Limit          Sell (unit price)         Phone Home
YTD Sales             Size                      Phone Cell
Last Year Sales       Weight                    Email Address
Contact Person        Supplier ID               Date Of Birth
Comments              Warehouse Location ID     Life Insurance Y or N
Terms of Sale         Taxable PST               Long Term Disability Y or N
Sales Rep ID          Taxable GST               Salary
                      Minimum Order Quantity    Commission Rate
                      ETC …                     Hire Date

OBJECTIVE: Design and draw the STAR SCHEMA. Shown above are the attributes of the tables.

Document1 by rt -- 12 March 2016

DEVELOPING A DESIGN FOR A DATA WAREHOUSE

STEP 1 - Understanding the business

What do you want to analyze?
Example:
- Who is the best customer?
- Are sales up or down?
- Who makes the most sales?
- And others …

Where does it come from?
Talk to the business executives to see how THEY measure the performance of the business.

Note: Design is not usually done from an ERD. Management wouldn't understand a complicated ERD (the one above is an oversimplification). For our purposes in this tutorial it is easier to use the ERD to see the system, since there is no management to meet with and you are familiar with ERDs.

*** Also need to know
1. How big is the business?
   # of salesreps
   # of orders (especially this one): 100 per day or 15,000 per day? This affects the choice of grain.
2. What measures do they look at? How does the company measure its business?

Assumption:
We will assume the business is a wholesaler of sporting goods to sporting and outdoor stores, as well as to large department stores that have sporting goods sections. Examples might be Canadian Tire, Sears, Wal-Mart, or smaller stores like outdoor-oriented camping stores.

The company has 3 warehouses across the country where it stores stock for faster shipment to its customers. The company does not manufacture any product but buys it from the suppliers it deals with. The company has over 500 orders a day from each warehouse. Some orders are very big and some orders have only a few lines of products on them. The size of the orders from retail stores varies from $50 (but could be smaller) to over $100,000.

Measures:
The normal business measures are Revenue, Costs of Product, Costs of Operations, Profits and Quantity Sold.

ASIDE: In the example above, Total Revenue (or Total $ Sales) and Quantity Sold would be measures in the system, but sometimes costs of operations are not. The total $ cost of product sold is in the system, but the cost of processing an order, or the cost of operations, is not so easily found. It is more likely that those costs will be lumped into a bigger category of expenses. For example, it takes vehicles, people, and warehouse heating, just to name a few, to process an order. That kind of information may still be analyzed, but the information needed to do that will come from a different part of the OLTP system.

It is harder to know what percentage of the President's salary is apportioned to each product sold. As a simple example, if the President earns $1000 a week and we sold 1000 items, you could say that $1 has been added to the cost of each item. However, if next week the business has sales of 275 items, or zero because it is seasonal, that salary has to be apportioned differently. This is far too difficult to track in this way and would distort reporting, which is the basis of analysis.
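
The apportioning arithmetic in the aside can be worked through in a few lines. This is only an illustrative sketch: the function name overhead_per_unit is made up for this example, and the figures are the ones from the aside (1000 a week, then 1000, 275 and zero items sold).

```python
from typing import Optional


def overhead_per_unit(weekly_salary: float, units_sold: int) -> Optional[float]:
    """Salary cost apportioned to each unit sold, or None when nothing sold.

    Hypothetical helper for illustration only; not part of the OLTP system.
    """
    if units_sold == 0:
        return None  # the salary cannot be spread over zero units
    return weekly_salary / units_sold


weekly_salary = 1000.0
for units in (1000, 275, 0):
    # The same salary gives a different per-unit cost every week.
    print(units, overhead_per_unit(weekly_salary, units))
```

The per-unit figure moves from 1.00 to roughly 3.64 and then becomes undefined, which is why this kind of cost is a poor candidate for a measure stored per product.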

STEP 2 - GRAIN

How to determine the grain?
The easiest way is probably to use the measures.

Revenue is a measure ($ Sales or Total $ Sales for the period of time).
What time periods does management want to compare the measure over?
Would management want the measure reported on a monthly, daily, weekly, or yearly basis?
What is a reasonable time period that would show enough detail for decision making?

There has to be a balance between too much detail and too little detail. For example, yearly revenue is too little detail. Yearly revenue can also be obtained any time from the OLTP system, so having a data warehouse would not add extra value. Having sales reported on a daily basis, and certainly on an hourly basis, would be far too much detail and would not improve the decision making. It is really hard to compare revenue on a daily basis over a five-year period. That's over 1500 entries.

For this example we have decided to choose the time period as Month.

Do we also want Revenue broken down by SALESREP, CUSTOMER, LOCATION?
The designer, in consultation with management, needs to understand what information would be of value to the management of the company. Since we don't have management available to ask for our exercise, I would suggest that the company would like the measures broken down by customer, by customer location, by salesrep, by product and by product category.

The choices of how the Revenue or Total $ Sales is broken down determine the grain.

Assume the GRAIN chosen is as follows:
Monthly Revenue (or Total $ Sales), by product, by product category, by salesrep, by customer and by customer location.

Based on the above grain, can you do analysis by order#?
NO -- data will be kept on a MONTHLY basis, not by order.
If you want analysis by order, then use the OLTP system, because the data there is already by order. You will discover that the data warehouse system is not broken down by order, because there's little decision value to be gained.
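
To make the chosen grain concrete, here is a minimal sketch of rolling order lines up to one row per month, customer, salesrep and product. It uses Python's built-in sqlite3 rather than Oracle, and the order_lines table, its column names and its rows are all assumptions invented for the illustration, not the real OLTP layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified stand-in for the OLTP Orders + Items tables.
CREATE TABLE order_lines (
    order_no INTEGER, order_date TEXT, cid INTEGER, sid INTEGER,
    pid INTEGER, qty INTEGER, unit_price REAL
);
INSERT INTO order_lines VALUES
    (1, '2016-01-05', 10, 1, 100, 2, 50.0),
    (2, '2016-01-20', 10, 1, 100, 1, 50.0),
    (3, '2016-01-21', 11, 1, 100, 4, 50.0),
    (4, '2016-02-02', 10, 1, 100, 3, 50.0);
""")

# Roll the order lines up to the grain: month, customer, salesrep, product.
# order_no is not selected, so analysis by order is no longer possible.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS month, cid, sid, pid,
           SUM(qty) AS qty_sold, SUM(qty * unit_price) AS revenue
    FROM order_lines
    GROUP BY month, cid, sid, pid
    ORDER BY month, cid, sid, pid
""").fetchall()

for row in monthly:
    print(row)
```

Orders 1 and 2 collapse into a single January row for customer 10, which is exactly the loss of order-level detail described above.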

STEP 3 - Determine Dimension tables and attributes

DISCUSS DIMENSIONS - meaning WHAT TO KEEP from the OLTP.

Given the following 6 breakdowns of the measures, as decided upon from the grain, we need to determine the dimensions.

TIME
PRODUCT
PRODUCT CATEGORY
SALES REP
LOCATION OF CUSTOMER
CUSTOMER

Remember that in a star schema we would like all the dimension tables to be connected to the fact table by a single join in order to enhance performance. Looking at the above six items, are there any that are closely related on a one-to-many basis?

You can see that product and product category are related on a one-to-many basis. That would mean that an analysis by product category would require a join between the product category table, the product table, and the fact table, along with any other table required in the analysis. We can remove one of the joins by denormalizing the product category and product tables.

When looking at the ERD we can often see where there are areas to be denormalized. You will notice in the OLTP system that when you see a hierarchy of 1:M relationships, these can be denormalized. If we decide to use those tables in a DW, then they should be denormalized.

Looking at the remaining five tables, ask yourself again: is there any kind of close relationship between two of the breakdowns? It would appear that customer and location of the customer, such as city, will be closely related. In fact, that detail in the OLTP is stored in one table and would also be one table in the data warehouse.

The remaining dimensions are as follows:

TIME
PRODUCT includes PRODUCT CATEGORY
SALES REP
CUSTOMER includes LOCATION

Let us now look at each of the tables to determine what attributes they would have.

DIMENSION TIME

TIME DIMENSION TABLE
Since there is no TIME table shown in the OLTP system, what do we use as dimension attributes?
Going back to the grain, the requirement was for MONTHLY analysis. The business would also like to analyze things on a less detailed basis. The following choices might be made for the TIME dimension table:

TIME
  TIME-KEY
  YEAR       UP: summary views (drill up)
  QUARTER    hierarchy or levels
  MONTH      DOWN: detail views (drill down)

DIMENSION CUSTOMER - What to keep from the list of attributes available.

The company sells sporting goods to stores, not directly to people. As an aside, students often want to keep things like GENDER. First, stores don't have gender, and second, there is no gender in the OLTP system. You have to look at what is available in the OLTP system and determine what is useful based on what management wants to look at.

Example of a few of the attributes in the OLTP table for customer:
CID, Customer Name, Shipping Address, Billing Address, Phone, Credit Limit, YTD Sales, Last Year Sales, Contact Person, Comments, Terms of Sale, Sales Rep ID

Which ones would you keep???

NOTE: I will deal with LOCATION later, treating it for a moment as a separate dimension, so you can see the analysis of why it is included in CUSTOMER.

CID                Yes
Customer Name      Yes   On reports, customer names have more meaning than showing an ID.
Shipping Address   No    Maybe you want to keep postal code, city or province, but not street. The depth of detail depends on how many customers there are in a city. Example: The Toronto Star newspaper may want a breakdown of businesses by postal code. A supplier of lumber would not have a lot of customers (lumber stores) in a postal code.
Billing Address    No
Phone              No    Any analysis by phone number will be the same as analysis by CID. The phone and ID have a 1 to 1 relationship. <-- important hint on what to keep
Credit Limit       ???   Maybe, if the business wants revenue by credit limit. It is more likely to be wanted if it is by GOOD, POOR or several descriptive or range values, and not by specific numbers. Example: revenue by specific credit limit would generate lines like "11,000 limit: $285,634".
YTD Sales          No    This is data kept in the OLTP system so that when an employee like the sales rep, credit manager or marketing manager looks at a customer, they can see what sales are YTD. Perhaps the customer wants an extra discount for a special anniversary sale they have. The amount of business they have done so far this year may determine the amount of discount. This is not needed in the data warehouse.
Last Year Sales    No    Same reasoning as YTD Sales.
Contact Person     No    The contact person is needed by the sales rep but does not provide any additional analysis. It is not required in the data warehouse.
Comments           No    Only if the comment is GOOD, POOR, EXCELLENT, as in credit limit.
Terms of Sale      ???   This is things like NET 30. Probably not.
Sales Rep ID       No    This is a foreign key connection done in the FACT table for analysis. It is not stored in the customer dimension table.

DIMENSION LOCATION

We decided earlier that this had to do with customer location and that it will be included in the customer table. But I wanted to go through the analysis to show you what would happen if it was originally thought to be a separate table.

In this case, are we talking about analysis by customer location?

IF the grain was
Monthly Revenue (or Total $ Sales), by product, by salesrep, by location, by customer

The grain certainly indicates that we need to keep location information, so we probably need to keep
CITY
PROVINCE

Do we keep city and province in a separate dimension table?
Example: LOCATION ID, CITY and PROVINCE ????

We could keep it in a separate table, but we need to ensure that in the fact table the customer is associated with the correct customer location. This means the 2 fields must be validated as working together. See the fact table example below.

TIME | CUSTOMER | LOCATION | PRODUCT | SALES REP | MEASURE | MEASURE | MEASURE

You can see that it would require extra coding to ensure that every entry in the fact table, and there are millions of entries, has the customer ID and location ID matching for that customer. It would be easier to keep the LOCATION data with the customer.

By keeping it in the customer table, the CUSTOMER table would have a built-in hierarchy for analysis:
By CUSTOMER
By CITY
By PROVINCE
plus whatever measure they use.

Example: Revenue or Units shipped by Month by Province.
The PROVINCE value can be specifically limited in the WHERE clause to any combination of provinces, and so can the TIME periods.

DIMENSION SALESREP

Here are some sample attributes. Again, decide what to extract from the OLTP and move to the DW dimension table for SALESREP.

SID        Y
NAME       Y
ADDRESS    N      City maybe, but not likely. We don't need analysis of where the sales person lives. Would analysis of student revenue by where the professor lives make sense?
SALARY     ????
HIREDATE   N      All different
EXT#       N
…
DOB        N      May keep a range

SALARY
Should you keep salary? If you do, then this is what you analyze:
Monthly revenue by salesrep salary
Is that important?
If all salesreps have different salaries, then a GROUP BY on salary will give the same result as a GROUP BY on salesrep.

HIREDATE?
Might use a range value to denote experience.
Only of value if you were looking for analysis similar to: Are dollar sales affected by the experience level of the sales rep?

Another question that needs to be addressed: it might be nice to analyze by age, but is it done often enough to warrant keeping that data?

Here is an idea to consider. Keeping hire date in the salesrep dimension table is not a lot of data. It is when a decision adds a column to the fact table that the amount of data stored increases considerably for little return.
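
The point about SALARY and HIREDATE can be checked with a small sketch. Everything here is invented for the illustration: the salesrep_dim and sales_fact tables, the three reps and their figures, the reference year 2016, and the 5-year cutoff for the experience band. It uses Python's built-in sqlite3 rather than Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension and fact tables with made-up rows.
CREATE TABLE salesrep_dim (sid INTEGER PRIMARY KEY, name TEXT,
                           salary REAL, hire_year INTEGER);
CREATE TABLE sales_fact (sid INTEGER, revenue REAL);
INSERT INTO salesrep_dim VALUES
    (1, 'Rep A', 52000, 2004),
    (2, 'Rep B', 48500, 2012),
    (3, 'Rep C', 61000, 2014);
INSERT INTO sales_fact VALUES
    (1, 100.0), (1, 250.0), (2, 75.0), (3, 300.0), (3, 25.0);
""")


def totals(group_expr):
    """Total revenue grouped by the given salesrep expression."""
    sql = f"""SELECT {group_expr} AS grp, SUM(f.revenue)
              FROM sales_fact f JOIN salesrep_dim s ON s.sid = f.sid
              GROUP BY grp ORDER BY grp"""
    return conn.execute(sql).fetchall()


by_rep = totals("s.name")
by_salary = totals("s.salary")  # every salary is unique, so same totals as by_rep
by_band = totals(
    "CASE WHEN 2016 - s.hire_year >= 5 THEN '5+ years' ELSE 'under 5 years' END"
)

print(by_rep)
print(by_salary)
print(by_band)
```

Grouping by salary just relabels the per-salesrep totals, while the experience band produces a different, coarser breakdown, which is the only form in which hire date adds something to the analysis.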

DIMENSION PRODUCTS

PID                 Y
PRODUCT NAME        Y
CATID               See note 2
BRAND               Y - It is part of naming a product
COST (unit cost)    See note 1
SELL (unit price)   See note 1
SIZE                N
WEIGHT              N
SUPPLIER ID         N
WHSE LOCAT'N ID     N
TAXABLE             N
ETC

1 Should we keep the unit price? I.e. the selling price for one unit of product.

Ask the question: What is the measure? Revenue.
What is the descriptive attribute or dimension to create the grain?
Revenue by unit price: is this meaningful? NO
Revenue by unit cost: is this meaningful? NO

Here is what an analysis might mean:
Sales in January for products costing 45.23 were $4,389.97

Would there be a meaningful question management may want answered?
Maybe: does high price stuff generate more revenue than low price stuff?

Since there are a lot of values for unit price, it might be better to create a range.
Values such as 10, 100, 500, 1000, meaning values between 0 and 10, 11 to 100, and so on.

This is the same for CREDIT LIMIT in customer. It may be important to see if customers with low credit limits or small customers generate more revenue. However, this is a decision made by the business, so on the surface it looks like you wouldn't keep it.

2 Should we keep Categories?

In the OLTP system CATID is an ID referencing a CATEGORY table. This is done because tables are normalized for transaction performance and to reduce redundancy or space. Since the relationship of category to product is 1:M, we can put the category name back into the product dimension, that is, denormalize.

Some OTHER ATTRIBUTES you may want to consider from other tables.

In the ORDER TABLE
Would ORDER STATUS be kept?
Example: Cancelled orders, Back ordered
NO, the orders are not kept.

WAREHOUSE LOCATION?
The relationship of WAREHOUSE LOCATION to PRODUCT is 1:M.
You could put WAREHOUSE LOCATION into PRODUCT and denormalize the two tables, if you wanted analysis by warehouse.
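
Notes 1 and 2 above can be sketched together: the category name is folded back into the product dimension, and the unit price is replaced by a range. This is a rough illustration only, using Python's built-in sqlite3 rather than Oracle. The table layouts, product rows and the price_range column name are assumptions; the band edges follow the 10, 100, 500, 1000 values suggested in note 1, and the last band is simply treated as "over 500" here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized OLTP tables with made-up rows.
CREATE TABLE category (catid INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product (pid INTEGER PRIMARY KEY, product_name TEXT,
                      catid INTEGER, brand TEXT, sell REAL);
INSERT INTO category VALUES (1, 'Camping'), (2, 'Hockey');
INSERT INTO product VALUES
    (100, 'Two-person tent', 1, 'BrandX', 249.99),
    (101, 'Camp stove',      1, 'BrandY',  89.50),
    (200, 'Hockey puck',     2, 'BrandZ',   4.25);

-- Denormalized product dimension: the category name replaces CATID and
-- a price range replaces the exact unit price.
CREATE TABLE product_dim AS
SELECT p.pid, p.product_name, c.category_name, p.brand,
       CASE WHEN p.sell <= 10  THEN '0-10'
            WHEN p.sell <= 100 THEN '11-100'
            WHEN p.sell <= 500 THEN '101-500'
            ELSE 'over 500' END AS price_range
FROM product p JOIN category c ON c.catid = p.catid;
""")

product_rows = conn.execute("SELECT * FROM product_dim ORDER BY pid").fetchall()
for row in product_rows:
    print(row)
```

An analysis by category now needs only one join, from the fact table to product_dim, and a GROUP BY on price_range gives a handful of groups instead of one per distinct price.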

Once decisions have been made, now draw the STAR SCHEMA.

NOTE: STAR SCHEMA has little meaning to management. Grain has meaning.

Example: Monthly revenue by product, by salesrep, by location of warehouse, by customer …

These can be stated in terms that management can evaluate.

Monthly revenue
- By product
- By category of product
- By brand

Monthly revenue
- By brand by city
- By brand by customer
… etc.

Once the grain is approved, the implementation is next.
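
As a rough idea of where the implementation might start, the dimensions decided on in these notes can be sketched as a star schema. This is not the actual design: it uses Python's built-in sqlite3 rather than Oracle, and every table name, column name, key format (for example a time_key such as 201601) and data row is an assumption made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Four dimension tables, each one join away from the fact table.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY,
                       year INTEGER, quarter INTEGER, month INTEGER);
CREATE TABLE customer_dim (cid INTEGER PRIMARY KEY, customer_name TEXT,
                           city TEXT, province TEXT);
CREATE TABLE salesrep_dim (sid INTEGER PRIMARY KEY, salesrep_name TEXT);
CREATE TABLE product_dim (pid INTEGER PRIMARY KEY, product_name TEXT,
                          category_name TEXT, brand TEXT);

-- Fact table at the monthly grain: one row per month, customer,
-- salesrep and product, carrying the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    cid INTEGER REFERENCES customer_dim(cid),
    sid INTEGER REFERENCES salesrep_dim(sid),
    pid INTEGER REFERENCES product_dim(pid),
    revenue REAL,
    qty_sold INTEGER,
    PRIMARY KEY (time_key, cid, sid, pid)
);

-- Made-up sample rows.
INSERT INTO time_dim VALUES (201601, 2016, 1, 1), (201602, 2016, 1, 2);
INSERT INTO customer_dim VALUES
    (10, 'Store A', 'Toronto', 'ON'), (11, 'Store B', 'Calgary', 'AB');
INSERT INTO salesrep_dim VALUES (1, 'Rep A');
INSERT INTO product_dim VALUES (100, 'Two-person tent', 'Camping', 'BrandX');
INSERT INTO sales_fact VALUES
    (201601, 10, 1, 100, 500.0, 2),
    (201601, 11, 1, 100, 250.0, 1),
    (201602, 10, 1, 100, 750.0, 3);
""")

# Revenue by month by province, limited to chosen provinces in the WHERE clause.
report = conn.execute("""
    SELECT t.year, t.month, c.province, SUM(f.revenue) AS revenue
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN customer_dim c ON c.cid = f.cid
    WHERE c.province IN ('ON')
    GROUP BY t.year, t.month, c.province
    ORDER BY t.year, t.month
""").fetchall()

for row in report:
    print(row)
```

Because city and province live in customer_dim, the fact table needs no separate location key, and no extra validation is required to keep customer and location in step.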