02-5-HO Data Warehousing Fundamentals

This is a nice summary so far, but it is hard to understand if you have had no prior exposure to the subject. If you have had exposure, it will provide a different way of saying things. The more explanations you read, the better you will understand the subject.

What is it about?

Scenario for a relatively small business (not the size of Sears, Bell, Wal-Mart, or TD Bank)

The company needs to keep growing in order to survive. The information that suggests this need came from sources such as the business sections of the Star, the Globe & Mail and the National Post. The trend in this industry is for mid-size corporations to expand their markets by buying up competitors.

The information about trends comes mainly from internal sources within the company, and also from external sources such as trade magazines, business papers, business organizations and the Government of Canada (Statistics Canada).

Most information that allows for comparisons, trends or answers to questions is numerical. A data warehouse, then, is software that stores all this data from different sources in a numerical format that allows business analysis to take place.

Requirements would be:
1. The information is available in a timely manner (fast).
2. Users must be able to manipulate the data to obtain the needed analysis, without needing IT to produce reports.
3. The data must be presented in a form that is easily understood. Charts, tables, graphs and lists have been the common forms in the past. Spreadsheets are an excellent way of portraying data.
4. The values in the data warehouse must be consistent with those from the day-to-day system.
5. Warehouse data must draw from multiple sources within the company to ensure a complete enterprise picture.
6. Processes need to be in place to gather the data and convert it to this more usable form.

Later, when you use SQL Server Analysis Services, you will find that it provides the visual or presentation tools that retrieve data fast and allow for data manipulation, so that you can see data from different perspectives such as levels of detail. With Excel and PivotTables it is possible to browse data in different ways.

The requirement to ensure the integrity of the data, so that decisions are based on reliable data, is met in the process of storing data into the data warehouse. That means the RDBMS that holds the data is loaded before the analysis tools can present the information.

Document1 by rt -- March 22, 2016

Comparing transaction database with a data warehouse

Transaction processing, or OLTP (Online Transaction Processing), is the operation of running the business on a day-to-day basis. The system will tell me how many are enrolled in a class, what room is assigned and how many people have paid tuition as of this date. The processing of payments, enrollment or room assignments is handled by application software that optimizes the transaction process.

The data warehouse database handles questions about historical trends.

Example of DW knowledge:
If classes can hold a maximum of 40 students, then the historical trend of 180 students enrolling in September would require 4.5 classes, which isn’t possible. You would need 5 classes of 30 to 40 students. However, suppose that a long-term analysis over the last 5 years showed the following: 5% of enrolled students, for some reason, either never show up or attend only one class. Another 6% drop within one or two weeks, and an additional 15% are shown to have almost no attendance in the last 7 weeks. Knowing this means that each class can be overfilled to 45 students, because the 11% who stop attending at the beginning of the semester will bring each class down to about 40 students (45 x 0.89 is roughly 40).
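The class-size arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this handout, and the 11% early drop-out figure is the 5% plus 6% quoted in the example.

```python
import math

MAX_CLASS_SIZE = 40          # room capacity from the example
ENROLLED = 180               # historical September enrollment
EARLY_DROP_PERCENT = 5 + 6   # 5% never attend + 6% drop in the first two weeks

def classes_needed(enrolled: int, max_size: int) -> int:
    """Round up, because 4.5 classes cannot be scheduled."""
    return math.ceil(enrolled / max_size)

def expected_attendance(enrolled_per_class: int, drop_percent: int) -> float:
    """Students still attending once the early drop-outs have left."""
    return enrolled_per_class * (100 - drop_percent) / 100

sections = classes_needed(ENROLLED, MAX_CLASS_SIZE)
attending = expected_attendance(45, EARLY_DROP_PERCENT)
print(sections, attending)
```

With these inputs the sketch gives 5 sections, and a class loaded with 45 students settles at roughly 40 attending.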
In fact, analysis shows that the later classes (after 3 PM) and the 8 AM classes tend to have the highest dropout rates. These could be overloaded even further.

In a transaction database the values change often. The quantity on hand of a product changes every time the product is ordered (shipped out) and every time the product level is replenished (received or built). These changes can occur many times per day or hour. This type of change makes the database volatile.

A data warehouse has data that changes, but only at set periodic updates. Over time, there is new data to add to the data warehouse. An example might be adding the sales for the day or month for each product. The new values added do not change the existing values.

The transaction database deals in details. The Sales Order department wants to know how many units of product X are in stock because a customer requires 10 of them. The sales order person does not need to know the average dollar investment tied up in product X during the last year, or the average turnover of the product line that X belongs to.

Data warehouses focus on higher-level questions rather than questions such as: did customer 2345 buy product 52756 (lawn mowers), and how many were purchased? Questions about averages and total sales by time period, or profitability by grocery line within a region or by store, mean that a lot of summarizing is needed. To do this type of analysis, the values need to be numeric.

Transaction databases provide the source data to feed the data warehouse. The transaction database could be one or many databases.

Dimensions and Measures in Data Analysis

Since data needs to be summarized, and the summary data provides information about how the business performs, these numeric values are called measures. Examples of measures would be Total Dollar Sales, Total Dollar Cost, Total Units Sold, Units Manufactured per Day, and Weight of Goods Shipped through a Port.
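As a small illustration of how detail rows from a transaction system become measures, the sketch below sums a handful of order lines into Total Units Sold and Total Dollar Sales. The order lines, prices and variable names are invented for the example; only the customer and product numbers echo the text above.

```python
# Hypothetical order lines from a transaction (OLTP) system:
# (customer_id, product_id, units, unit_price_in_dollars)
order_lines = [
    (2345, 52756, 2, 300),
    (2345, 60400, 1, 450),
    (1001, 52756, 1, 300),
]

# A measure is a numeric summary of the detail rows.
total_units_sold = sum(units for _cust, _prod, units, _price in order_lines)
total_dollar_sales = sum(units * price for _cust, _prod, units, price in order_lines)

print(total_units_sold, total_dollar_sales)
```

The transaction system cares about each individual row; the warehouse cares about the two totals.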
When you want to do analysis, the question becomes which measure should be used. Suppose we wanted to know about units sold. The answer one would get might be 3 or it might be 3000. On their own, both of these figures are meaningless. Both are numeric and both are summaries or accumulations, but they are still of little value.

Units Sold   3000
Units Sold   3

To make it more informative, the question might become: how many units were shipped by year for the last 4 years?

2009   3000
2008   2507
2007   2618
2006   2300

Suppose the question is sales by month for last year:

January    100
February    98
March      215
etc., to a total of 3000

Suppose, for example, that the company sells 3 products:

Product            January   February   March
Chicken Perogies        25         30     120
Cheese Perogies         35         35      40
Potato Perogies         40         33      55

The example above is more informative to some people in the business. You can imagine that if there are 60,000 products, many senior levels of management won’t require all of the detail. Note that even at this more detailed level, the value of 25 for Chicken Perogies is still a summary of the quantity sold for that product in January.

As an aside, the data could also be displayed differently. Which would you prefer?

Product            Month      Quantity Sold
Chicken Perogies   January               25
Chicken Perogies   February              30
Chicken Perogies   March                120
Cheese Perogies    January               35
Cheese Perogies    February              35
Cheese Perogies    March                 40
Potato Perogies    January               40
Potato Perogies    February              33
Potato Perogies    March                 55

Now suppose you had 25 kiosks in malls that sell these products. The report would need to be tailored to handle the change.

Columns: Store 1 (Chicken Perogies, Cheese Perogies, Potato Perogies), Store 2 (Chicken Perogies, Cheese Perogies, Potato Perogies)

January     2    3    4    3    4
February    3   35    3    3    3
March      12    4    5    2    5
etc.

The more variables you wish to show, the more values there are in the tables.
For example:
Total sales by month generates 12 values.
Total sales by month for the 3 products gives 36 values in one year.
Total sales for 25 stores for the 3 products by month would be 25 * 3 * 12 = 900 values.

Do the same for the 60,000 products that might be carried by a grocery chain, and the number of cells holding values becomes large. The values are too difficult for a human to grasp.

The three variables used (product, store and month) are called dimensions. Although the above report has 3 dimensions, the number of dimensions can be higher. The question becomes how useful the report is. If you looked at Potato Perogies by month for store 17, the report would be manageable and would still contain 3 dimensions, with a restriction on some of the dimensions. (Using SQL for the retrieval, that would mean a WHERE condition.) Note that each dimension added simply subdivides the same numeric measure.

Interesting example
If you have 128 members in a single dimension (128 products in a product dimension), then there are 128 possible subdivisions of the value of a Total Dollar Sales measure. If the same 128 members were instead spread across 64 separate dimensions (just two members in each), the report would have 18,446,744,073,709,551,616 possible values, which is 2 to the power of 64. It is perfectly fine to have dimensions, or ways to analyze the measures, but be careful when adding dimensions to a report.

Hierarchies

If there are few products and the analysis covers only a few months, a single report may be easy to comprehend. With more products and longer time spans, humans need to reduce the number of values they are looking at in order to comprehend the data. Grouping things does this. Another term for it is aggregating.

Things that belong together form a dimension. Products belong in one dimension and months in another. These dimensions are independent column and row labels for values.
If one is used as a column heading and the other as a row heading, a value will exist at the intersection of the two. Suppose we have product line and product as two labels. Are they independent?

Product Line   18” gas   18” electric   Saw   Hammer   25’ rubber   25’ green   50’ rubber
Mowers              12             25
Tools                                     25       63
Hoses                                                         100        1234          540

Looking at the report, there can be no values at the intersection of Hoses and 18” gas mowers. This generates a ridiculous report. Adding Product Line as a separate dimension means that the report grows substantially (see the earlier example).

Product Line and Product are not completely independent. A Product Line total is a grouping or aggregation of the products within the same Product Line. Therefore they belong in the same dimension. The relationship between Product and Product Line is called a hierarchy. Each represents a “level” of summarization, and they are referred to as levels.

One of the dimensions you will almost always have is the Time dimension. Depending upon the level of detail or summarization, you could have Day, Week, Month, Quarter and Year as the hierarchy structure. Each of them is a level.

Members

The term member applies to the members of a level in a hierarchy or to all members of the dimension. The members at the lowest level of detail are called leaf members.

Types of Hierarchies

Balanced
A balanced hierarchy has members at each level. The Time dimension is the easiest example. Under every Year is a Quarter and under every Quarter is a Month. If a single Month has Days below it, then so do all other Months. A balanced hierarchy does not require every parent to have the same number of children, although in the Time example each Year has four Quarters and each Quarter has three Months.

Unbalanced
These hierarchies may be similar to an employee hierarchy. A graphic might portray this better.
President
    VP Corporate
    VP Marketing
        Mgr Print Media
        Mgr TV Media
    VP Sales
        Mgr Sales
            Mgr Eastern Canada
            Mgr Western Canada

Notice that at some levels, such as VP, not all members have “children” below them. Notice also that it is harder to give some of the levels an appropriate name that describes the level. You can see that Manager applies to 2 different levels. Suppose also that there were salespeople under Mgr Sales, and salespeople under Mgr Eastern Canada and Mgr Western Canada as well. Then some salespeople are at a level directly in line with the regional managers, and it is hard to call this level Regional Managers. What would you call this level? Leaf members are always the ones with no children below them.

Balanced Ragged
Some hierarchies appear to be unbalanced but also appear to be balanced.

North America
    USA
        Eastern Group
            New York
            Pennsylvania
    Canada
        Ontario
    Mexico
        Yucatan

In Analysis Services, you can define balanced and unbalanced hierarchies, whether they are ragged or not.

A dimension will always have leaf members; otherwise it would be a table with no content. The hierarchy structure simply defines how the values of the leaf members are summarized.

DATA WAREHOUSE STRUCTURE

Analysis Services makes it easy for a client application to create reports that use multiple dimensions, but the values displayed in the report must come from a relational data warehouse. Analysis Services assumes that you already have a relational data warehouse. The process of gathering data, transforming and cleaning the data, and loading the data into a data warehouse (ETL) requires considerable skill and work.

For Analysis Services to work, it needs a number of measures in a table. It needs a fact table in the data warehouse, and it needs the fact table to be in a certain form.

FACT TABLE

A fact table is a table in the relational data warehouse that stores the detailed values for measures, or facts.
A fact table that stores Dollar Sales and Units Sold by Province, by Product and by Month has five columns like the following:

Province   Product                      Month   Dollar Sales   Units Sold
BC         18” 4HP Gas Lawn Mower       June         7600.00           38
BC         18” 4.5 HP Gas Lawn Mower    June         3000.00           10
ON         18” 4HP Gas Lawn Mower       June        22000.00         1100
ON         18” 4.5 HP Gas Lawn Mower    June         3000.00           10
ON         22” 5 HP Gas Lawn Mower      June         1000.00            2
BC         18” 4HP Gas Lawn Mower       July         2600.00           13
BC         18” 4.5 HP Gas Lawn Mower    July         2400.00            8
Etc ..

Although the above table contains descriptive names to make it easier for us to understand right now, in practice each of the key columns would contain an integer value. The descriptive names, if they were needed in a report, would come from the dimension tables. Because the fact table contains many rows, possibly millions, the use of integers can substantially reduce the space requirements.

In the sample table above, the three columns Province, Product and Month represent keys (the 3 parts of a composite primary key). The remaining two columns, Dollar Sales and Units Sold, are called measures.

The number of measures found in a fact table depends upon the application. For example, a sales warehouse might contain two measure columns, Dollar Sales and Units Sold. A manufacturing operation may have different measures. One of the measures might be units manufactured; another might be the number of defects.

To be usable by Analysis Services, the fact table must contain rows at the lowest level of detail that you might want to retrieve for a measure. That means the fact table contains rows for the leaf members in the dimension tables. If the fact table contains aggregates or summaries, such as quarter and year totals, then Analysis Services cannot use the fact table. As an example, if a Province table includes the hierarchy Province, Region, and Country, only the members from the Province level appear in the fact table. Analysis Services will create all the summarized values.

In the fact table, specifying a single leaf member for each dimension should identify a single row.

DIMENSION TABLES

A fact table contains only the lowest level of detail. If you have eight products grouped into two product lines, the fact table does not contain any rows for the product lines, only rows for the products. The same is true of quarter totals or year totals: the fact table holds only values for the months, which is its lowest level of detail. The information necessary to create summaries by product line is contained in the Product table. The information needed to summarize by quarter or year is stored in the Time table.

A dimension table contains one row for each leaf member of the dimension. A product dimension table with three products will have three rows. A dimension table contains a numeric key column, known as the primary key, that uniquely identifies each member. A dimension table also contains the names of the members. If the dimension table is involved in a balanced hierarchy, the dimension table will also have an additional column that gives the parent for each member.

Example: Note that there are several levels of hierarchy in the table below.

PRODUCT TABLE

Product ID   Product Description          Product Sub Category   Product Category   Department
57400        18” 4HP Gas Lawn Mower       Lawn Mower             Lawn & Garden      Hardware
57450        18” 4.5 HP Gas Lawn Mower    Lawn Mower             Lawn & Garden      Hardware
57500        22” 5 HP Gas Lawn Mower      Lawn Mower             Lawn & Garden      Hardware
60400        18” Gas Snow Blower          Snow Blower            Lawn & Garden      Hardware
Etc …
99607        12.6 Volt Cordless Drill     Cordless Tools         Power Tools        Hardware

In the above Product table, Product ID is the primary key for the table, and for performance purposes this primary key column should be indexed.

The primary key column of each dimension table must match one of the key columns in the related fact table. A key such as 57400 in the product table will appear once.
But in the fact table it may appear many thousands of times. The relationship between a dimension table and a fact table is one-to-many.

Notice that in the above product table, the descriptions under Product Sub Category, Product Category and Department repeat themselves. Normally, in an OLTP system the tables would be normalized, and what is a single product table above would in fact be four different tables. The product table above is denormalized, and the hierarchy forms a chain of one-to-many relationships. The denormalization is meant to increase the speed of processing queries, as it requires fewer joins. In some data warehouses, the tables are kept separate rather than put into a single table (snowflake design). Both of these methods (star and snowflake) will be discussed in this course, along with the theoretical advantages of each.

OTHER DIMENSION TABLE STRUCTURES (within Analysis Services)

Most data warehouses include a time dimension. Sometimes the fact table will store a date column as the key for the Time dimension, rather than an integer key. In that case there is no need for a Time dimension table. Analysis Services can still use the date column from the fact table as the source for a Time dimension, and can also build hierarchies such as month, quarter, and year.

An employee table has an unbalanced hierarchy. The employee dimension contains all leaf members, as all members are employees. Each employee has a manager; however, the manager is also an employee. If we were using a snowflake design, the manager would be in a separate table. In the employee dimension table, the parent member simply points back to another row of the same dimension table. This is called a parent-child dimension, because the parent member and the child member are in the same table. A parent-child dimension provides a great deal of flexibility in how a hierarchy is organized.

In some organizations an employee will report to two different managers.
Analysis then requires that the employee's information be aggregated under two or more different managers. This means there are two or more hierarchies. In Analysis Services, you can create these separate hierarchies, which use the same leaf members but aggregate using different parent members. How to do this is not covered in this course.

A dimension table can also contain columns that are not part of a hierarchy structure. For example, a product dimension may contain columns such as color and type of packaging. A column that is not part of a hierarchy is called a member property.

CONCEPT OF A CUBE

A cube in Analysis Services is a logical construct. It allows a client application to retrieve values as if every possible summarized value existed in the cube. Conceptually, the cube is a fact table, but with a few significant differences.

Like a fact table, a cube contains one column for each dimension and one column for each measure. Also like a fact table, a cube contains a row for each possible combination of members for all dimensions. A fact table, however, contains only the lowest-level members for each dimension. A cube contains the same lowest level plus the rows based on summarizations. The summarized rows would never appear in a fact table.

The cube conceptually contains a value for each measure summarized at each possible hierarchy level for each dimension, but again, that doesn't mean that the cube will actually store all those possible summarized values. The cube can calculate any value by dynamically summarizing leaf-level values. For example, in a structure with Country, Province and City, with cities as the leaf level, the cities can quickly be summarized into provinces, and the province summaries can then be summarized into countries.

Analysis Services allows the cube designer the flexibility to control how many aggregations are physically created. The more aggregations there are, the more space is required to store them.
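The dynamic summarization described above (City to Province to Country) can be sketched as follows. This is a minimal illustration in plain Python, not how Analysis Services is implemented; the cities, sales figures and the roll_up function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical dimension table: leaf member (city) -> (province, country)
city_dim = {
    "Toronto": ("Ontario", "Canada"),
    "Ottawa": ("Ontario", "Canada"),
    "Vancouver": ("British Columbia", "Canada"),
}

# Hypothetical leaf-level fact rows: (city, dollar_sales)
facts = [("Toronto", 500), ("Ottawa", 200), ("Vancouver", 300), ("Toronto", 100)]

def roll_up(facts, city_dim):
    """Summarize leaf values by city, then provinces from cities, then countries."""
    city_totals = defaultdict(int)
    for city, sales in facts:
        city_totals[city] += sales

    province_totals = defaultdict(int)
    for city, total in city_totals.items():
        province_totals[city_dim[city][0]] += total

    # Each province belongs to exactly one country in this dimension.
    province_country = {prov: country for prov, country in city_dim.values()}
    country_totals = defaultdict(int)
    for province, total in province_totals.items():
        country_totals[province_country[province]] += total

    return dict(city_totals), dict(province_totals), dict(country_totals)

cities, provinces, countries = roll_up(facts, city_dim)
print(provinces, countries)
```

Only the leaf rows are stored here; every higher-level total is computed on request. Storing province_totals ahead of time would correspond to a physically created aggregation, trading space for speed.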
Aside from performance differences, whether or not the cube physically stores a particular summarized value is completely invisible to a user of the cube.

When storing values in a cube, Analysis Services stores values for only simple aggregations, such as sums, counts, and minimum or maximum values. However, business analysis requires more than simple aggregations. You can create calculated members that perform calculations on aggregated values. Calculated members make it easy to create values such as average sales. An average sale is the sum of all the sales divided by a count, and the cube holds both the sum and the count.

Virtual Cube

A virtual cube combines measures from cubes that share at least one common dimension.
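Returning to calculated members: the idea that an average is derived from a stored sum and a stored count can be sketched as below. The rows are taken from the sample fact table earlier in this handout; the function names are hypothetical, and this is plain Python rather than anything Analysis Services itself runs.

```python
from collections import defaultdict

# Rows from the sample fact table:
# (province, product, month, dollar_sales, units_sold)
fact_rows = [
    ("BC", "18in 4HP Gas Lawn Mower", "June", 7600.00, 38),
    ("BC", "18in 4.5HP Gas Lawn Mower", "June", 3000.00, 10),
    ("ON", "18in 4HP Gas Lawn Mower", "June", 22000.00, 1100),
    ("ON", "18in 4.5HP Gas Lawn Mower", "June", 3000.00, 10),
    ("ON", "22in 5HP Gas Lawn Mower", "June", 1000.00, 2),
    ("BC", "18in 4HP Gas Lawn Mower", "July", 2600.00, 13),
    ("BC", "18in 4.5HP Gas Lawn Mower", "July", 2400.00, 8),
]

def stored_aggregations(rows):
    """Simple aggregations only: a sum and a count of Dollar Sales per province."""
    agg = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for province, _product, _month, dollar_sales, _units in rows:
        agg[province]["sum"] += dollar_sales
        agg[province]["count"] += 1
    return agg

def average_sale(agg, province):
    """Calculated member: derived from the stored sum and count, never stored itself."""
    cell = agg[province]
    return cell["sum"] / cell["count"]

agg = stored_aggregations(fact_rows)
print(average_sale(agg, "BC"))
```

Because the average is computed from the sum and the count, it stays correct at any level of the hierarchy. Averaging pre-computed averages would not.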