DISTINCT COUNT and Basket Analysis
with Microsoft SQL Server OLAP Services
The information contained in this document represents the current view of
Microsoft Corporation on the issues discussed as of the date of publication.
Because Microsoft must respond to changing market conditions, it should not be
interpreted to be a commitment on the part of Microsoft, and Microsoft cannot
guarantee the accuracy of any information presented after the date of publication.
This document is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.
© 1999 Microsoft Corporation. All rights reserved.
Microsoft, the BackOffice logo, and PivotTable are either registered trademarks or
trademarks of Microsoft Corporation in the United States and/or other countries.
Other trademarks and tradenames mentioned herein are the property of their
respective owners.
The names of companies, products, people, characters, and/or data mentioned
herein are fictitious and are in no way intended to represent any real individual,
company, product, or event, unless otherwise noted.
Part number: 098-83777
Contents
Introduction
DISTINCT COUNT Analysis
    Understanding the Problem
    The Solution
Basket Analysis
Performance Considerations
    The DISTINCT Cube
    Aggregations
    Execution Location
    Sampling
    Rendering
Conclusion
Finding More Information
Introduction
Microsoft® SQL Server™ OLAP Services version 7.0 provides powerful tools for data
analysis. Some of the capabilities are apparent from the user interface. These include the
ability to aggregate data and categorize data into dimensions and levels. Other analysis
capabilities, usually the more advanced ones, are not obvious from the user interface and
require more expertise if the user wants to take full advantage of OLAP Services. These
advanced capabilities involve the use of calculated members and multidimensional expressions
(MDX) to achieve the desired analysis.
For example, suppose you have a cube that analyzes sales transactions. It has dimensions that
describe customers (geography, education, income level, gender), products (classification,
color, size), time, and sales representatives (through the organizational structure). The measures include
information about revenue, quantity, and discounts.
One of the most common questions would be, “How many customers bought a specific
product?” An even better and more general question might be, “How many customers are
buying each product?”
Although this last question seems simple, it is not. A regular COUNT measure will not provide
correct results because double counts may occur. If a single customer buys a product more
than once, a regular COUNT measure counts that customer once for each purchase.
In order to get the correct results, each customer needs to be counted only once. This is the
classic DISTINCT COUNT problem, and it requires a fairly complex resolution in the online
analytical processing (OLAP) environment.
The problem may become even more interesting if the question becomes, “How many
customers bought a specific basket of products?” Take the “Diapers & Beer” example, “How
many customers bought both diapers and beer?” This type of question falls under the Basket
Analysis problem category.
This document discusses the techniques to solve these two classic problems, DISTINCT
COUNT and Basket Analysis. It assumes that the reader has a basic understanding of the
concepts of OLAP in general, OLAP Services in particular, and MDX.
DISTINCT COUNT Analysis
DISTINCT COUNT analysis is one of the most popular types of analysis among users and one of
the toughest problems for an OLAP system. Some users refer to it as the many-to-many problem
because it involves analyzing the relationship between entities that have many-to-many
relationships.
A few of the more typical applications for DISTINCT COUNT analysis are:
• Sales and marketing, especially counting the number of distinct customers.
• Insurance claims relating policies to damages. One claim may have many damages.
• Quality control data relating causes to defects. A defect can be caused by multiple factors.
Consider the following query:
SELECT
{ [Sales], [Distinct Customers Count] } On Columns,
Products.Members On Rows
From Sales
A typical query result may look like this.
                   Sales    Distinct Customers Count
All Products        8000    200
  Hardware          3300     80
    Computers       2000     70
    Monitors         800     60
    Printers         500     30
  Software          4700    150
    Home            1500    100
    Business        2500    100
    Games            700     80
Understanding the Problem
In the Sales column, the numbers add up to subtotals and their totals. This is the expected
behavior of a SUM measure. However, in the Distinct Customers Count column, the numbers
do not add up.
In this example, 70 customers bought computers, 60 customers bought monitors, and 30
customers bought printers. However, the total number of customers who bought hardware
is not 160 (70+60+30), as adding up the rows would suggest. The query results
display an actual count of 80 total hardware customers. The reason for this irregularity is
simple: many of the customers bought more than one product. Some customers bought both
computers and monitors, others bought the whole three-piece package, some replaced just the
monitor, and so on. The end result is that there is no way to infer directly from the lower level
results what the customer subtotal really is. This discrepancy continues through the upper
levels as well: 80 customers bought hardware and 150 bought software, yet All
Products totals only 200 customers.
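The effect can be reproduced on a very small scale. The following Python sketch is not MDX and does not use OLAP Services; it is only a model of the arithmetic, and the fact rows and function names in it are invented for illustration.

```python
# Hypothetical fact rows: (customer, category, product, sales amount).
FACTS = [
    ("Ann", "Hardware", "Computers", 1000),
    ("Ann", "Hardware", "Monitors", 300),
    ("Ann", "Hardware", "Monitors", 250),   # repeat purchase by the same customer
    ("Bob", "Hardware", "Computers", 900),
    ("Bob", "Hardware", "Printers", 200),
    ("Cid", "Hardware", "Monitors", 280),
]

def rows_for(product):
    """Fact rows for a single product."""
    return [row for row in FACTS if row[2] == product]

def plain_count(rows):
    """Regular COUNT: one per fact row, so repeat buyers are counted again."""
    return len(rows)

def distinct_customers(rows):
    """DISTINCT COUNT: each customer is counted once."""
    return len({row[0] for row in rows})

per_product = {p: distinct_customers(rows_for(p))
               for p in ("Computers", "Monitors", "Printers")}
sum_of_children = sum(per_product.values())
hardware_total = distinct_customers(FACTS)

print("per product:", per_product)
print("sum of children:", sum_of_children, "actual hardware total:", hardware_total)
```

In this model the per-product counts sum to more than the number of distinct hardware customers, because customers who bought several products are counted once under each product.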
These kinds of irregularities pose challenges for OLAP systems. Nonadditive measures pose
the following problems on a typical OLAP system:
• Roll-ups are not possible. When precalculating results during cube processing, the system
  cannot deduce summaries from other summaries. All results must be calculated from the
  detail data. This situation places a heavy burden on processing time.
• All results must be precalculated. With nonadditive measures, there is no way to deduce
  the result for a higher-level summary query from one precalculated aggregation. Failure to
  precalculate the results in advance means that the results are not available.
• It is next to impossible to perform and maintain incremental updates to the system. A
  single transaction added to the cube usually invalidates huge portions of previously
  precalculated results. In order to recover from this, a nearly complete recalculation is
  needed.
OLAP Services takes a very different approach to the solution to these kinds of problems. All
basic measures in the cube must be additive. These include SUM, MIN, MAX, and simple
COUNT. More problematic measures that are not additive are handled through calculated
members, which are calculated at run time.
The Solution
You can define the calculated member [Distinct Customers Count] using an MDX
expression. Use the following expression to deduce the number of customers who bought a
product by counting the customers where non-NULL sales exist:
Count(CrossJoin({[Sales]}, [Customer Names].Members), ExcludeEmpty)
This expression evaluates each Sales-Customer Name tuple and counts the number of tuples
that are not NULL. The number of tuples being evaluated will always equal the number of
customers.
This expression works with any set of coordinates in any dimension (except Customers). If the
current member in the products dimension is [Hardware], the NULL evaluation will be for the
[Sales] of [Hardware] for each [Customer Name]. If you slice by a specific month, January for
example, the count will be for all non-NULL values for the [Sales] of [Hardware] in [January]
for each [Customer Name].
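As a rough illustration of what the expression computes, the following Python sketch models the cube as a dictionary of cells. It is not MDX; the cell data, the customer list, and the function names are assumptions made for this example, and a missing key stands in for a NULL cell.

```python
# Hypothetical cells: (customer, product, month) -> sales.
# A missing key models a NULL (empty) cell.
CELLS = {
    ("Ann", "Hardware", "January"): 500,
    ("Ann", "Software", "February"): 120,
    ("Bob", "Hardware", "February"): 700,
    ("Cid", "Software", "January"): 90,
}
CUSTOMER_NAMES = ["Ann", "Bob", "Cid", "Dee"]   # Dee never bought anything

def sales(customer, product=None, month=None):
    """Sales for one customer at the current coordinates.

    None for product or month models the All member of that dimension.
    Returns None (NULL) when no matching cell exists.
    """
    values = [value for (c, p, m), value in CELLS.items()
              if c == customer
              and (product is None or p == product)
              and (month is None or m == month)]
    return sum(values) if values else None

def distinct_customers_count(product=None, month=None):
    """Evaluate one (Sales, customer) tuple per customer; count the non-NULL ones."""
    tuples = [sales(customer, product, month) for customer in CUSTOMER_NAMES]
    return sum(1 for value in tuples if value is not None)

print(distinct_customers_count())
print(distinct_customers_count(product="Hardware"))
print(distinct_customers_count(product="Hardware", month="January"))
```

One tuple is evaluated per customer regardless of the slice, which is why the cost of the calculation grows with the size of the customer level.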
However, this expression does not work well with the Customers dimension itself. The
calculated member defined here counts for all of the Customer Names, no matter what the
current member on the customer dimension is. For example, to perform a distinct count on the
customers in California, you might expect that if you slice by [California] in the [Customers]
dimension, only the customers in this state will be counted. However, the calculated member
created here has no such limitation. It counts all of the customers in all of the
countries/states/cities without any limitation.
To fix this problem, change the expression to the following:
Count(CrossJoin( {[Sales]},
Descendants([Customers].CurrentMember, [Customer Names])),
ExcludeEmpty)
The modified expression helps ensure that only the customers under the current member in the
[Customers] dimension are counted.
This generic expression solves the distinct count problem and provides the correct answers.
The only problem with this method lies in performance. In many businesses, the number of
customers may be very large. The need to evaluate each customer individually at run time
places a significant calculation burden on the system. A later section of this document
discusses techniques to optimize these calculations and ease some of the load.
It is important to remember that even with these optimizations, DISTINCT COUNT measures are
much slower than additive measures.
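The difference between scanning every customer name and scanning only the descendants of the current member can be modeled with a short Python sketch. This is not MDX; the two-level hierarchy, the set of buyers, and the function names are hypothetical.

```python
# Hypothetical Customers hierarchy: state -> customer names (the leaf level).
CUSTOMERS_BY_STATE = {
    "California": ["Ann", "Bob"],
    "Oregon": ["Cid"],
}
# Customers with non-NULL sales at the current coordinates of the other dimensions.
BUYERS = {"Ann", "Cid"}

def descendants(current_member):
    """Leaf customers under the current member; "All" means the whole dimension."""
    if current_member == "All":
        return [name for names in CUSTOMERS_BY_STATE.values() for name in names]
    if current_member in CUSTOMERS_BY_STATE:
        return list(CUSTOMERS_BY_STATE[current_member])
    return [current_member]   # already a leaf customer

def count_all_members(current_member):
    """Models the first expression: the slice is ignored, every customer is scanned."""
    return sum(1 for name in descendants("All") if name in BUYERS)

def count_descendants(current_member):
    """Models the modified expression: only customers under the current member."""
    return sum(1 for name in descendants(current_member) if name in BUYERS)

print(count_all_members("California"), count_descendants("California"))
```

Slicing by California changes the second count but not the first, which is the behavior the modified expression is meant to correct.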
Basket Analysis
Basket Analysis goes one step further than DISTINCT COUNT. With Basket Analysis, the
idea is to count the number of intersected occurrences. For example, how many customers
bought a computer and a printer together? A more generic query result is shown here.
                   Sales    Distinct Customers    Customers Who
                            Count                 Bought Printers
All Products        8000    200                   30
  Hardware          3300     80                   30
    Computers       2000     70                   20
    Monitors         800     60                   25
    Printers         500     30                   30
  Software          4700    150                   15
    Home            1500    100                    7
    Business        2500    100                   10
    Games            700     80                    5
For each product or category, the last column in the table shows how many customers bought
both that product and a printer.
This query investigates the relationships between members of the same dimension. The
combination of each product and a printer creates a basket of products. Understanding the
occurrences of these baskets is one of the most important insights into the purchasing habits of
customers. It is usually a good basis for cross-promotions, direct mail, and other focused
marketing activities.
This kind of analysis has wide applicability in other areas beyond marketing. For example, in
quality control it is important to learn about the relationships between failed components or
causes of failure.
The definition of [Customers Who Bought Printers] is:
Sum(Descendants([Customers].CurrentMember, [Customer Names]),
    Iif(IsEmpty((Sales, Printers)) Or IsEmpty(Sales), 0, 1))
This expression sums one (1) for each customer who bought the current product in addition to
purchasing a printer.
Suppose you want to analyze baskets that contain more than two products (current and printer
in our example). You can extend the basket to {current, Printer, Computer} using the
following expression:
[Customers Who Bought Printers & Computers]:
Sum(Descendants([Customers].CurrentMember, [Customer Names]),
    Iif(IsEmpty((Sales, Printers)) Or IsEmpty((Sales, Computers)) Or IsEmpty(Sales), 0, 1))
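The basket expressions above follow one pattern: a customer contributes 1 only if none of the required sales cells is empty. The following Python sketch models that pattern for a basket of any size. It is not MDX, and the purchase data and function name are invented for illustration.

```python
# Hypothetical purchases: customer -> set of products with non-NULL sales.
PURCHASES = {
    "Ann": {"Computers", "Printers", "Games"},
    "Bob": {"Computers", "Games"},
    "Cid": {"Printers", "Games"},
    "Dee": {"Computers", "Printers"},
}

def customers_with_basket(current_product, fixed_products):
    """Count customers who bought the current product and every fixed product.

    A customer scores 1 only when the required set is a subset of what they
    bought, which mirrors the chain of IsEmpty tests joined with Or.
    """
    required = set(fixed_products) | {current_product}
    return sum(1 for bought in PURCHASES.values() if required <= bought)

# Two-product basket: {current, Printers}
print(customers_with_basket("Games", ["Printers"]))
# Three-product basket: {current, Printers, Computers}
print(customers_with_basket("Games", ["Printers", "Computers"]))
```

Each product added to the basket adds one more test per customer, so the count can only stay the same or shrink as the basket grows.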
Yes… But Were They Bought Together?
The expression in the previous section will count the number of customers that bought a set of
products (Computer, Printer, and another product).
However, there is no indication in the expression as to whether the products were bought
together. In some cases, it is important to know not only when a customer bought several
products, but also whether the customer bought them together at the same time or at different
times.
“Together” deserves a definition. Your first reaction may be that it means the products were
ordered or delivered together on the same invoice. However, in business intelligence,
“together” usually has a definition that spans time rather than invoice numbers.
There are two reasons for this:
• In OLAP cubes, maintaining information about specific invoices is difficult and
  inefficient compared to the management of a time dimension. The number of invoices
  may be several orders of magnitude larger than the number of time periods the system is
  tracking.
• There is usually a time span during which multiple transactions by the same customer are
  considered to be related. The separation between the transactions may be due to
  supplementary purchases, merchandise returns or replacements, clerical error, payment
  methods, or other reasons. For many businesses, multiple transactions that were made in
  the same day by a single customer are deemed to be related and so are considered as a
  single transaction. In other businesses, multiple transactions made by a customer in the
  same week or even the same month are considered as one transaction.
When working with OLAP Services, it is strongly recommended that you work with time
periods instead of invoices when analyzing concurrent purchases.
The following expression counts the number of customers who bought the current product and
a printer in the same week:
[Customers Who Bought Printers] =
Sum(Descendants([Customers].CurrentMember, [Customer Names]),
    Iif(0=Sum(
        Filter(Descendants([Time].CurrentMember, [Week]), Not IsEmpty(Sales)),
        (Sales, Printers)), 0, 1))
This complex expression sums one (1) for each customer who bought the current product and a
printer in the same week. To make certain that the printer was bought in the same week as the
current product, filter the weeks to keep only those in which the current customer
bought the current product. You can use the following clause:
Filter(Descendants([Time].CurrentMember, [Week]), Not IsEmpty(Sales))
The Descendants function limits the scan of the weeks according to the slicing member of the
time dimension, and the filter returns the set of weeks in which the current product was
bought. The expression then sums all sales of printers
for the current customer during these weeks. If the sum returns NULL, this customer did not
buy the product together with a printer. If the sum is not NULL, the expression adds one (1) to
the count of customers.
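The filter-then-sum logic can be traced in a small Python model. This is not MDX; the sales data, week numbers, and function names are assumptions made for this sketch, and a missing key stands in for a NULL cell.

```python
# Hypothetical facts: (customer, product, week) -> sales.
SALES = {
    ("Ann", "Computers", 1): 1000,
    ("Ann", "Printers", 1): 200,    # same week as the computer
    ("Bob", "Computers", 1): 900,
    ("Bob", "Printers", 3): 180,    # different week from the computer
    ("Cid", "Printers", 2): 210,    # never bought a computer
}
CUSTOMERS = ["Ann", "Bob", "Cid"]
WEEKS = [1, 2, 3]

def bought_with_printer_same_week(customer, current_product, weeks):
    """True if the customer bought a printer in a week with a current-product sale."""
    # Filter step: keep only the weeks in which the customer bought the current product.
    active_weeks = [w for w in weeks if (customer, current_product, w) in SALES]
    # Sum step: printer sales for this customer over those weeks only.
    printer_sales = sum(SALES.get((customer, "Printers", w), 0) for w in active_weeks)
    return printer_sales != 0

def customers_who_bought_printers_same_week(current_product, weeks=WEEKS):
    """Add 1 for each customer whose filtered printer sales are not empty."""
    return sum(1 for customer in CUSTOMERS
               if bought_with_printer_same_week(customer, current_product, weeks))

print(customers_who_bought_printers_same_week("Computers"))
```

In this data Bob bought both products but in different weeks, so he is excluded by the weekly test even though the simpler basket expression would have included him.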
Performance Considerations
For both DISTINCT COUNT and Basket Analysis, calculating the results imposes a demanding
computational load. These computations must scan vast quantities of data in order to calculate a
single number. For example, in the query illustrated in the table of the DISTINCT COUNT
example, the system must query the results of the sales for each customer per product. With
even medium-sized databases, both dimensions may have tens of thousands of members. The
combination of these dimensions generates a huge result set that needs to be analyzed.
There is no single solution to the performance problem. However, by using several
techniques, the scale of the problem can be managed. The following sections discuss these
techniques. Throughout, a reference to DISTINCT
COUNT measures applies also to Basket Analysis.
The DISTINCT Cube
One of the most efficient ways to optimize the performance of these two analysis techniques is
to isolate the DISTINCT functionality into a separate cube.
This cube should have a single COUNT measure (a long integer). The rest of the measures will
reside in a separate cube that contains exactly the same dimensions as the DISTINCT cube.
The two cubes will be joined together to form a virtual cube with which the user will work.
The user will not experience any difference between the functionality of the virtual cube and
the functionality of a unified physical cube. However, performance and memory consumption
can improve dramatically.
The reason for the improvement is simple. When a user asks for the DISTINCT COUNT
measure, the virtual cube helps ensure that only the DISTINCT cube will be queried for
the detailed result set that is needed for the calculation. Because the DISTINCT cube has only a
single long measure, it is usually much smaller than the cube that contains the rest of the
measures. Therefore, querying that cube involves less I/O. In addition, the cache size needed
for the result set is much smaller than a cache containing all of the measures would be, and the
network traffic is also much smaller.
Separating the DISTINCT COUNT into another cube also enables fine-grained control of the
aggregations.
Aggregations
As mentioned before, DISTINCT COUNTS are not additive (and this is the main reason why
these measures are so problematic). Therefore, the aggregations, which are all derived from
additive operators, are completely useless; however, there is one exception: the property
dimensions of the counted dimension. If the entity you want to count is “customers,” there
may be several other dimensions that describe properties of the customers. For example,
gender, education level, and income level are all dimensions that are actually describing the
customers.
When a query involves only those dimensions (the rest of the dimensions are on ALL), the
DISTINCT COUNT measure behaves like a regular SUM measure. For example, if you know
that you have 100 distinct male customers and 120 distinct female customers, you can say for
sure that you have 220 customers all together.
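The reason a property dimension is additive is that each customer belongs to exactly one member of it. The following Python sketch contrasts a property (gender) with a dimension in which a customer can appear under several members (products). The customer data and function names are hypothetical.

```python
# Hypothetical customers with one property (gender) and the products they bought.
CUSTOMERS = {
    "Ann": {"gender": "F", "products": {"Computers", "Monitors"}},
    "Bob": {"gender": "M", "products": {"Computers"}},
    "Cid": {"gender": "M", "products": {"Monitors", "Printers"}},
    "Dee": {"gender": "F", "products": {"Printers"}},
}

def distinct_by_gender():
    """Distinct customers per gender; each customer falls under exactly one member."""
    counts = {}
    for info in CUSTOMERS.values():
        counts[info["gender"]] = counts.get(info["gender"], 0) + 1
    return counts

def distinct_by_product():
    """Distinct customers per product; a customer may fall under several members."""
    counts = {}
    for info in CUSTOMERS.values():
        for product in info["products"]:
            counts[product] = counts.get(product, 0) + 1
    return counts

total_customers = len(CUSTOMERS)
print(sum(distinct_by_gender().values()), "of", total_customers)
print(sum(distinct_by_product().values()), "of", total_customers)
```

Summing the per-gender counts reproduces the total number of customers, while summing the per-product counts overshoots it.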
Therefore, when working with an isolated DISTINCT cube, it is worthwhile to create
aggregations that are limited to the customer dimension and its property dimensions. To
do that, use the Cube editor in the OLAP Manager to limit aggregations. In the Property pane,
set the Aggregation Mode property of the rest of the dimensions to Top Level Only. This helps
ensure that all of the aggregations designed for the distinct cube are additive and useful. An
opposite approach is to set the Aggregation Mode property of the counted dimension and its
property dimensions to Bottom Level Only. This helps ensure that all of the aggregations
created are detailed enough to be useful in the DISTINCT calculations.
When using this approach, you need to work around a limitation of the size estimation
algorithm of Decision Support Objects (DSO). When DSO calculates an estimated size for an
aggregation, it assumes that all of the dimensions are independent; therefore, in DSO, the
maximum theoretical size of the aggregation is the product of the cardinality of each
dimension. For example, 1,000 customers and 2,000 products have a maximum theoretical size
of 2,000,000 cells.
However, the property dimensions are not independent of the customer dimension. Two
genders, six education levels, eight income levels, and 1,000 customers are estimated at
96,000 possible cells (2 x 6 x 8 x 1,000). However, because the dimensions are dependent, the
actual maximum number of cells is only 1,000. This miscalculation is important if all of the
customer dimensions are set to Bottom Level Only. All size estimates for the possible
aggregations will be inflated 96-fold, and the system will decide that most of them are not
useful because they appear too large. To put the system back on the right path, you need to
tell DSO that the fact table contains far more records than it actually contains. In this
example, if the fact table has 1,000,000 rows, set the (estimated) Fact Table Size property to
96,000,000. This will compensate for the miscalculation.
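The arithmetic behind this adjustment can be checked with a few lines of Python. The sketch only reproduces the numbers from the example above; the variable and function names are illustrative and do not correspond to any DSO API.

```python
from math import prod

def independent_estimate(cardinalities):
    """Size estimate that assumes every dimension is independent of the others."""
    return prod(cardinalities)

customers = 1_000
property_cardinalities = [2, 6, 8]   # gender, education level, income level

estimated_cells = independent_estimate(property_cardinalities + [customers])
actual_max_cells = customers         # each customer has exactly one value per property
inflation = estimated_cells // actual_max_cells

fact_rows = 1_000_000
adjusted_fact_table_size = fact_rows * inflation

print(estimated_cells, actual_max_cells, inflation, adjusted_fact_table_size)
```

The inflation factor is simply the product of the property-dimension cardinalities, so it should be recomputed whenever a property dimension is added or its number of members changes.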
Execution Location
The execution location may be the most significant factor in the performance of the
DISTINCT COUNT queries. OLAP Services supports both client-side and server-side query
execution. Executing queries on the client allows the server to scale up to support many more
users and queries. However, for some queries, it is more appropriate to do the calculation on
the server. Those queries may work with very large dimensions (such as “top 10 customers out
of 1,000,000”). They may also aggregate vast volumes of data to return a small answer table.
DISTINCT COUNT analysis usually falls into both of these categories.
Server-side execution takes two forms:
• Axes resolution: The axes of a dataset may be relayed to the server for resolution if the
  axes involve large dimension levels (usually 1,000 or more). PivotTable® Service
  automatically detects whether relaying to the server is needed and performs it without
  client application intervention.
• Dataset resolution: The cells of the dataset may also be calculated on the server side.
  However, this applies only to snapshot queries. With a snapshot query, PivotTable Service
  decides automatically whether the query needs to be resolved on the server side.
It is strongly recommended that all queries involving DISTINCT COUNT measures be
snapshot queries so they can be relayed to the server. Failure to create snapshot queries may
result in huge memory consumption on the client computer, vast quantities of data transported
over the network, and very slow response times.
Sampling
In cases where the data volumes are very large, and the main interest is in relationships,
proportions, and ratios rather than absolute numbers, sampling can reduce the magnitude of the
problem. However, this document will not deal with sampling techniques for OLAP Services.
Rendering
The last technique pertains to the behavior of the user interface on the client application side.
The client application should recognize that some queries might be very slow when DISTINCT
COUNT measures are involved. Most OLAP browsing tools assume very fast response times and therefore
work in “auto recalc” mode. This means that a query is generated for every action on the
user’s part. Users do not have to initiate “Execute” operations to populate the views with
which they are working.
However, this mode is not appropriate for DISTINCT COUNT measures. A query for each
user operation will cause the user interface to work very slowly and will try the user’s patience
considerably. The best way to avoid this situation is to allow the user to move into “manual
recalc” mode. In this mode, the user first positions the dimensions on the axes and performs all
of the drill-downs and slice-and-dice operations to set the view. After the view is set, the user
explicitly asks for the population of the view with numbers.
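A client application might implement manual recalc along the following lines. This Python sketch is only one possible design, not a description of any particular OLAP browsing tool; the class, its methods, and the stand-in query function are all hypothetical.

```python
class ManualRecalcView:
    """Collects layout changes and runs the query only when execute() is called."""

    def __init__(self, run_query):
        self._run_query = run_query   # hypothetical callable: layout dict -> result
        self.layout = {"rows": [], "columns": [], "slicers": {}}
        self.queries_issued = 0

    def place(self, axis, dimension):
        """Position a dimension on the "rows" or "columns" axis without querying."""
        self.layout[axis].append(dimension)
        return self

    def slice_by(self, dimension, member):
        """Set a slicer member without querying."""
        self.layout["slicers"][dimension] = member
        return self

    def execute(self):
        """Explicit user request: issue exactly one query for the current layout."""
        self.queries_issued += 1
        return self._run_query(self.layout)


def fake_query(layout):
    """Stand-in for a slow DISTINCT COUNT query; it just echoes the layout."""
    return (tuple(layout["rows"]), tuple(layout["columns"]), dict(layout["slicers"]))


view = ManualRecalcView(fake_query)
view.place("rows", "Products").place("columns", "Measures").slice_by("Time", "January")
queries_before_execute = view.queries_issued
result = view.execute()
print(queries_before_execute, view.queries_issued, result)
```

In auto recalc mode each of the three layout operations would have triggered its own query; here only the explicit execute call does.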
Conclusion
The questions posed by DISTINCT COUNT and Basket Analysis are important ones in
business intelligence. Although the OLAP environment does not provide simple ways to
answer these questions, the methods outlined in this document offer viable ways to work
around the limitations of OLAP. By using features provided by OLAP Services and following
a few simple guidelines, you can leverage the power of OLAP to address these and other
business analysis scenarios.
Finding More Information
For more information about DISTINCT COUNT, see your structured query language (SQL)
documentation. For more information about MDX, calculated members, virtual cubes, DSO,
and member properties, see OLAP Services Books Online.