DISTINCT COUNT and Basket Analysis with Microsoft SQL Server OLAP Services The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT. © 1999 Microsoft Corporation. All rights reserved. Microsoft, the BackOffice logo, and PivotTable are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Other trademarks and tradenames mentioned herein are the property of their respective owners. The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent any real individual, company, product, or event, unless otherwise noted. Part number: 098-83777 2 Contents Introduction ............................................................................................................................................. 4 DISTINCT COUNT Analysis ................................................................................................................. 5 Understanding the Problem ................................................................................................................ 5 The Solution ....................................................................................................................................... 6 Basket Analysis ........................................................................................................................................ 8 Performance Considerations ................................................................................................................ 10 The DISTINCT Cube ....................................................................................................................... 10 Aggregations .................................................................................................................................... 10 Execution Location .......................................................................................................................... 11 Sampling .......................................................................................................................................... 12 Rendering ......................................................................................................................................... 12 Conclusion.............................................................................................................................................. 13 Finding More Information ................................................................................................................... 14 3 Introduction Microsoft® SQL Server™ OLAP Services version 7.0 provides powerful tools for data analysis. Some of the capabilities are apparent from the user interface. These include the ability to aggregate data and categorize data into dimensions and levels. Other analysis capabilities, usually the more advanced, are not obvious from the user interface and may require more expertise if the user wants take full advantage of OLAP Services. These advanced capabilities involve the use of calculated members and multidimensional expressions (MDX) expressions to achieve the desired analysis. For example, suppose you have a cube that analyzes sales transactions. It has dimensions that describe customers (geography, education, income level, gender), products (classification, color, size), time, and the sales rep through the organizational structure. The measures include information about revenue, quantity, and discounts. One of the most common questions would be, “How many customers bought a specific product?” An even better and more general question might be, “How many customers are buying each product?” Although this last question seems simple, it is not. A regular COUNT measure will not provide correct results because double counts may occur. If a single customer buys a product more than once, a regular COUNT with the measure will count the product sale by customer twice. In order to get the correct results, each customer needs to be counted only once. This is the classic DISTINCT COUNT problem, and it requires a fairly complex resolution in the online analytical processing (OLAP) environment. The problem may become even more interesting if the question becomes, “How many customers bought a specific basket of products?” Take the “Diapers & Beer” example, “How many customers bought both diapers and beer?” This type of question falls under the Basket Analysis problem category. This document discusses the techniques to solve these two classic problems, DISTINCT COUNT and Basket Analysis. It assumes that the reader has a basic understanding of the concepts of OLAP in general, OLAP Services in particular, and MDX. 4 DISTINCT COUNT Analysis DISTINCT COUNT analysis is one of the most popular types of analyses by users and one of the toughest problems for an OLAP system. Some users refer to the problem as the many-tomany problem because analysis of the relationship between entities that have many-to-many relationships is involved. A few of the more typical applications for DISTINCT COUNT analysis are: Sales and marketing, especially counting distinct number of customers. Insurance claims relating policies to damages. One claim may have many damages. Quality control data relating causes to defects. A defect can be caused by multiple factors. Consider the following query: SELECT { [Sales], [Distinct Customers Count] } On Columns, Products.Members On Rows From Sales A typical query result may look like this. Sales Distinct Customers Count All products 8000 200 Hardware 3300 80 Computers 2000 70 Monitors 800 60 Printers 500 30 Software 4700 150 Home 1500 100 Business 2500 100 Games 700 80 Understanding the Problem In the Sales column, the numbers add up to subtotals and their totals. This is the expected behavior of a SUM measure. However, in the Distinct Customers Count column, the numbers do not add up. 5 In this example, 70 customers bought computers, 60 customers bought printers, and 30 customers bought monitors. However, the total number of bought who bought hardware, according to the result set, is not 160, or 70+60+30, as shown in the table. The query results display an actual count of 80 total hardware customers. The reason for this irregularity is simple: many of the customers bought more than one product. Some customers bought both computers and monitors, others bought the whole three-piece package, some replaced just the monitor, and so on. The end result is that there is no way to infer directly from the lower level results what the customer subtotal really is. This discrepancy continues through the upper levels as well: 80 customers bought hardware, 150 bought software, and all together, All Products totals only 200 customers. These kinds of irregularities pose challenges for OLAP systems. Nonadditive measures pose the following problems on a typical OLAP system: Roll-ups are not possible. When precalculating results during cube processing, the system cannot deduce summaries from other summaries. All results must be calculated from the detail data. This situation places a heavy burden in processing time. All results must be precalculated. With nonadditive measures, there is no way to deduce the result for a higher-level summary query from one precalculated aggregation. Failure to precalculate the results in advance means that the results are not available. It is next to impossible to perform and maintain incremental updates to the system. A single transaction added to the cube usually invalidates huge portions of previously precalculated results. In order to recover from this, a nearly complete recalculation is needed. OLAP Services takes a very different approach to the solution to these kinds of problems. All basic measures in the cube must be additive. These include SUM, MIN, MAX, and simple COUNT. More problematic measures that are not additive are handled through calculated members, which are calculated at run time. The Solution You can define the calculated member [Distinct Customers Count] using an MDX expression. Use the following expression to deduce the number of customers who bought a product by counting the customers where non-NULL sales exist: Count(CrossJoin({[Sales]}, [Customer Names].Members), ExcludeEmpty) This expression evaluates each Sales-Customer Name tuple and counts the number of tuples that are not NULL. The number of tuples being evaluated will always equal the number of customers. This expression works with any set of coordinates in any dimension (except Customers). If the current member in the products dimension is [Hardware], the NULL evaluation will be for the [Sales] of [Hardware] for each [Customer Name]. If you slice by a specific month, January for example, the count will be for all non-NULL values for the [Sales] of [Hardware] in [January] for each [Customer Name]. 6 However, this expression does not work well with the Customers dimension itself. The calculated member defined here counts for all of the Customer Names, no matter what the current member on the customer dimension is. For example, to perform a distinct count on the customers in California, you might expect that if you slice by [California] in the [Customers] dimension, only the customers in this state will be counted. However, the calculated member created here has no such limitation. It counts all of the customers in all of the countries/states/cities without any limitation. To fix this problem, change the expression to the following: Count(CrossJoin( {[Sales]}, Descendants([Customers].CurrentMember, [Customer Names])), ExcludeEmpty) The modified expression helps ensure that only the customers under the current member in the [Customers] dimensions are counted. This generic expression solves the distinct count problem and provides the correct answers. The only problem with this method lies in performance. In many businesses, the number of customers may be very large. The need to evaluate each customer individually at run time places a significant calculation burden on the system. A later section of this document discusses techniques to optimize these calculations and ease some of the load on performance. It is important to remember that even with these optimizations, DISTINCT COUNTS are much slower than other additive measures. 7 Basket Analysis Basket Analysis goes one step further than DISTINCT COUNT. With Basket Analysis, the idea is to count the number of intersected occurrences. For example, how many customers bought a computer and a printer together? A more generic query result is shown here. Sales Distinct Customers Count Customers Who Bought Printers All Products 8000 200 30 Hardware 3300 80 30 Computers 2000 70 20 Monitors 800 60 25 Printers 500 30 30 Software 4700 150 15 Home 1500 100 7 Business 2500 100 10 Games 700 80 5 The last column in the table shows how many customers bought both the corresponding product and a printer for each product/category. This query investigates the relationships between members of the same dimension. The combination of each product and a printer creates a basket of products. Understanding the occurrences of these baskets is one of the most important insights into the purchasing habits of customers. It is usually a good basis for cross-promotions, direct mail, and other focused marketing activities. This kind of analysis has wide applicability in other areas beyond marketing. For example, in quality control it is important to learn about the relationships between failed components or causes of failure. The definition of [Customers Who Bought Printers] is: Sum(Descendants([Customers].CurrentMember, [Customer Names]), Iif(IsEmpty(Sales, Printers) Or IsEmpty(Sales), 0, 1)) This expression sums one (1) for each customer who bought the current product in addition to purchasing a printer. Suppose you want to analyze baskets that contain more than two products (current and printer in out example.) You can extend the basket to {current, Printer, Computer} using the following expression: [Customers Who Bought Printers & Computers]: Sum(Descendants([Customers].CurrentMember, [Customer Names]), Iif(IsEmpty(Sales, Printers) or IsEmpty(Sales, Computers) Or IsEmpty(Sales), 0, 1)) 8 Yes… But Were They Bought Together? The expression in the previous section will count the number of customers that bought a set of products (Computer, Printer, and another product). However, there is no indication in the expression as to whether the products were bought together. In some cases, it is important to know not only when a customer bought several products, but also whether the customer bought them together at the same time or at different times. “Together” deserves a definition. At first reaction you may think that the products were ordered or delivered together on the same invoice. However, in business intelligence, “together” usually has a definition that spans time rather than invoice numbers. There are two reasons for this: In OLAP cubes, maintaining information about specific invoices is difficult and inefficient compared to the management of a time dimension. The number of invoices may be several orders of magnitude larger than the number of time periods the system is tracking. There is usually a time span during which multiple transactions by the same customer are considered to be related. The separation between the transactions may be due to supplementary purchases, merchandise returns or replacements, clerical error, payment methods, or other reasons. For many businesses, multiple transactions that were made in the same day by a single customer are deemed to be related and so are considered as a single transaction. In other businesses, multiple transactions made by a customer in the same week or even the same month are considered as one transaction. When working with OLAP Services, it is strongly recommended that you work with time periods instead of invoices when analyzing concurrent purchases. The following expression counts the number of customers who bought the current product and a printer in the same week: [Customers Who Bought Printers] = Sum(Descendants([Customers].CurrentMember, [Customer Names]), Iif(0=Sum( Filter(Descendants([Time].CurrentMember, [Week]), Not IsEmpty(Sales)), (Sales, Printers)) ,1, 0)) This complex expression sums one (1) for each customer who bought the current product and a printer in the same week. To make certain that the printer was bought in the same week as the current product, filter out all of the weeks to find only the weeks where the current customer bought the current product. You can use the following clause: Filter(Descendants([Time].CurrentMember, [Week]), Not IsEmpty(Sales)) The Descendants function limits the scan of the weeks according to the slicing member of the time dimension. This returns the set of weeks. The expression then sums all sales of printers for the current customer during these weeks. If the sum returns NULL, this customer did not buy the product together with a printer. If the sum is not NULL, the expression adds one (1) to the count of customers. 9 Performance Considerations For both DISTINCT COUNT and Basket Analysis, calculation of results poses demanding computation loads. These computations must scan vast quantities of data in order to calculate a single number. For example, in the query illustrated in the table of the DISTINCT COUNT example, the system must query the results of the sales for each customer per product. With even medium-sized databases, both dimensions may have tens of thousands of members. The combination of these dimensions generates a huge result set that needs to be analyzed. There is no one solution to solve the performance problem. However, using several techniques, the scale of the problem can be managed. The following sections discuss three approaches to working with performance issues. Throughout, a reference to DISTINCT COUNT measures applies also to Basket Analysis. The DISTINCT Cube One of the most efficient ways to optimize the performance of these two analysis techniques is to isolate the DISTINCT functionality into a separate cube. This cube should have a single COUNT measure (a long integer). The rest of the measures will reside in a separate cube that contains the exact dimensions found in the DISTINCT cube. The two cubes will be joined together to form a virtual cube with which the user will work. The user will not experience any difference between the functionality of the virtual cube and the functionality of a unified physical cube. However, performance and memory consumption can improve dramatically. The reason for the improvement is simple. When a user asks for the DISTINCT COUNT measure, the virtual cube helps ensure that that only the DISTINCT cube will be queried for the detailed result set that is needed for the calculation. Because the distinct cube has only a single long measure, it is usually much smaller than the cube that contains the rest of the measures. Therefore, querying that cube involves less I/O. In addition, the cache size needed for the result set is much smaller than a cache containing all of the measures would be, and the net transport is also much smaller. Separating the DISTINCT COUNT into another cube also enables fine-grained control of the aggregations. Aggregations As mentioned before, DISTINCT COUNTS are not additive (and this is the main reason why these measures are so problematic). Therefore, the aggregations, which are all derived from additive operators, are completely useless; however, there is one exception: the property dimensions of the counted dimension. If the entity you want to count is “customers,” there may be several other dimensions that describe properties of the customers. For example, gender, education level, and income level are all dimensions that are actually describing the customers. 10 When a query involves only those dimensions (the rest of the dimensions are on ALL), the DISTINCT COUNT measure behaves like a regular SUM measure. For example, if you know that you have 100 distinct male customers and 120 distinct female customers, you can say for sure that you have 220 customers all together. Therefore, when working with an isolated DISTINCT cube, it is worthwhile to create aggregations that are limited only to the customer dimensions and its property dimensions. To do that, use the Cube editor in the OLAP Manager to limit aggregations. In the Property pane, set the Aggregation Mode property of the rest of the dimensions to Top Level Only. This helps ensure that all of the aggregations designed for the distinct cube are additive and useful. An opposite approach is to set the Aggregation Mode property of the counted dimension and its property dimensions to Bottom Level Only. This helps ensure that all of the aggregations created are detailed enough to be useful in the DISTINCT calculations. When using this approach, you need to work around a limitation of the size estimation algorithm of Decision Support Objects (DSO). When DSO calculates an estimated size for an aggregation, it assumes that all of the dimensions are independent; therefore, in DSO, the maximum theoretical size of the aggregation is the product of the cardinality of each dimension. For example, 1,000 customers and 2,000 products have a maximum theoretical size of 2,000,000 cells. However, the property dimensions are not independent from the customer dimension. Two genders, six education levels, eight income levels, and 1,000 customers will be calculated to 96,000 possible cells. However, because the dimensions are dependent, the actual maximum number of cells is only 1,000. This miscalculation is important if all of the customer dimensions are set to Bottom Level Only. All calculations of the possible aggregations will be inflated 96 fold. The system will decide that most of these are not useful because the aggregations are too large. To put the system back on the right path, you need to tell DSO that the fact table contains far more records than it actually contains. In this example, if the fact table has 1,000,000 rows, set the (estimated) Fact Table Size property to 96,000,000. This will compensate for the miscalculation. Execution Location The execution location may be the most significant factor in the performance of the DISTINCT COUNT queries. OLAP Services supports both client-side and server-side query execution. Executing queries on the client allows the server to scale up to support many more users and queries. However, for some queries, it is more appropriate to do the calculation on the server. Those queries may work with very large dimensions (such as “top 10 customers out of 1,000,000”). They may also aggregate vast volumes of data to return a small answer table. DISTINCT COUNT analysis usually falls into both of these categories. Server-side execution takes two forms: Axes resolution: The axes of a dataset may be relayed to the server for resolution if the axes involve large dimension levels (usually 1,000 or more). PivotTable ® Service automatically detects whether relaying to the server is needed and performs it without client application intervention. Dataset resolution: The cells of the dataset may also be calculated on the server side. However, this applies only to snapshot queries. With a snapshot query, PivotTable Service decides automatically whether the query needs to be resolved on the server side. 11 It is strongly recommended that all queries involving DISTINCT COUNT measures are snapshot queries so they can be relayed to the server. Failure to create snapshot queries may result in huge memory consumption on the client computer, vast quantities of data transported over the network, and very slow response times. Sampling In cases where the data volumes are very large, and the main interest is in relationships, proportions, and ratios rather than absolute numbers, sampling can reduce the magnitude of the problem. However, this document will not deal with sampling techniques for OLAP Services. Rendering The last technique pertains to the behavior of the user interface on the client application side. The client application should recognize that some of the queries might be very slow when this technique is used. Most OLAP browsing tools assume very fast response time and therefore work in “auto recalc” mode. This means that a query is generated for every action on the user’s part. Users do not have to initiate “Execute” operations to populate the views with which they are working. However, this mode is not appropriate for DISTINCT COUNT measures. A query for each user operation will cause the user interface to work very slowly and will try the user’s patience considerably. The best way to avoid this situation is to allow the user to move into “manual recalc” mode. In this mode, the user first positions the dimensions on the axes and performs all of the drill-downs and slice-and-dice operations to set the view. After the view is set, the user explicitly asks for the population of the view with numbers. 12 Conclusion The questions posed by DISTINCT COUNT and Basket Analysis are important ones in business intelligence. Although the OLAP environment does not provide simple ways to answer these questions, the methods outlined in this document offer viable ways to work around the limitations of OLAP. By using features provided by OLAP Services and following a few simple guidelines, you can leverage the power of OLAP to address these and other business analysis scenarios. 13 Finding More Information For more information about DISTINCT COUNT, see your structured query language (SQL) documentation. For more information about MDX, calculated members, virtual cubes, DSO, and member properties, see OLAP Services Books Online. 14