A Trio of Interesting Snowflakes

advertisement
A Trio of Interesting
Snowflakes
Beat three common modeling challenges with
extensions of the dimensional model
By Ralph Kimball
"When can I use a snowflake?" is a question data warehouse designers have
asked me hundreds of times. I usually answer that it's a bad idea to expose the
end users to a physical snowflake design, because it almost always compromises
understandability and performance. But in certain situations a snowflake design is
not only acceptable, but recommended.
CLASSIC SNOWFLAKE
The way to create a classic snowflake, let us remind ourselves, is to remove low
cardinality attributes from a dimension table and place these attributes in a
secondary dimension table connected by a snowflake key. In cases where a set of
attributes form a multilevel hierarchy, the resulting string of tables looks a little
like a snowflake - hence the name.
A classic physical snowflake design may be useful in the backroom staging area
as a way to enforce the many-to-one relationships in a dimension table. But in
the front room presentation part of your data warehouse, you have to
demonstrate to me that the end users find the snowflake easier to understand
and, moreover, that queries and reports run faster with the snowflake, before I
am comfortable with the snowflake design.
But having issued this warning, I have found three cases where variations on a
snowflake are not only acceptable, but are the keys to a successful design.
LARGE CUSTOMER DIMENSIONS
The customer dimension is probably the most challenging dimension in a data
warehouse. In a large organization, the customer dimension can be huge, with
millions of records, and wide, with dozens of attributes.
To make matters worse, the biggest customer dimensions commonly contain two
categories of customers, which I will call "visitor" and "customer."
Visitors are anonymous. You may see them more than once, but you don't know
their names or anything else about them. On a Web site, the only knowledge you
have about visitors is a cookie indicating they have returned. In a retail operation,
a visitor engages in an anonymous transaction.
Customers, conversely, are reliably registered with your company. You know
customers' names, addresses, and as much demographic and historical data as
you care to elicit directly from them or purchase from third parties.
Let us assume that at the most granular level of your data collection, 80 percent
of the fact table measurements involve visitors and 20 percent involve customers.
You accumulate just two simple behavior scores for visitors consisting only of
recency (when they last visited you) and frequency (how many times they have
visited).
On the other hand, let us assume you have 50 attributes and measures for a
customer, covering all the components of location, payment behavior, credit
behavior, directly elicited demographic attributes, and purchased demographic
attributes.
Now you combine visitors and customers into a single logical dimension called
shopper. You give the visitor or customer a single, permanent shopper ID, but
make the key to the table a surrogate key so that you can track changes to the
shopper over time. Logically, the shopper dimension has the following attributes.
The attributes for both visitors and customers are:




-
Shopper surrogate key
Shopper ID (fixed ID for each physical shopper)
Recency
Frequency.
Attributes for customers only are:




-
Five name attributes
10 location attributes
10 behavior attributes
25 demographic attributes.
Note the importance of including the recency and frequency information as
dimensional attributes rather than as facts and overwriting them as time
progresses. This decision makes the shopper dimension very powerful. You can
do classic shopper segmentation directly off the dimension without navigating a
fact table in a complex application. See the discussion of this kind of
segmentation in my book, The Data Webhouse Toolkit, starting on page 73.
Assuming that many of the final 50 customer attributes are textual, you could
have a total record width of 500 bytes or more. Suppose you have 20 million
shoppers (16 million visitors and four million registered customers). Obviously,
you are worried that in 80 percent of your records, the trailing 50 fields contain
no data! In a 10GB dimension, this condition gets your attention.
This is a clear case where, depending on the database, you want to introduce a
snowflake. You should break the dimension into a base dimension and a
snowflake subdimension. All the visitors share a single record in the
subdimension, which contains special null attribute values. (See FIGURE 1.)
FIGURE 1. A Shopper dimension where 80 percent of the records have 50 null
attributes
In a fixed-width database, using our previous assumptions, the base shopper
dimension is 20 million x 25 bytes=500MB, and the snowflake dimension is 4
million x 475 bytes=1.9GB. You save 8GB by using the snowflake. If you have a
query tool that insists on a classic star schema with no snowflakes, then you can
hide the snowflake under a view declaration.
FINANCIAL PRODUCT DIMENSIONS
Banks, brokerage houses, and insurance companies all have trouble modeling
their product dimensions because each of the individual products has a host of
special attributes not shared by other products. Except for a set of common
"core" attributes, a checking account doesn't look very much like a mortgage or
certificate of deposit. They even have different numbers of attributes.
If you try to build a single product dimension with the union of all possible
attributes, you end up with hundreds of attributes, most of which are empty in a
given record.
The answer in this case is to build a context-dependent snowflake. You isolate the
core attributes in a base product dimension table, and include a snowflake key in
each base record that points to its proper extended product subdimension. (See
Figure 2.)
FIGURE 2. A Financial Product dimension with a subdimension for each product
type.
This solution is not a conventional relational join! The snowflake key must
connect to the particular subdimension table that a specific product type defines.
Usually you can accomplish this task by constructing a relational view for each
product type that hardwires the correct join path.
MULTIENTERPRISE CALENDAR DIMENSIONS
Building a calendar dimension in a distributed data warehouse spanning multiple
organizations is difficult because each organization has idiosyncratic fiscal periods,
seasons, and holidays. Although you should make a heroic effort to reduce
incompatible calendar labels, many times you want to look at the overall
multienterprise data through the eyes of just one of the organizations.
Unlike the financial products dimensions, each of the separate calendars can have
the same number of attributes describing fiscal periods, seasons, and holidays.
But there may be hundreds of separate calendars. An international retailer may
have to deal with a calendar for each foreign country.
In this case you modify the snowflake design to let the snowflake key join to a
single calendar subdimension. (See Figure 3.) But the subdimension has higher
cardinality than the base dimension! The key for the subdimension is both the
snowflake key and the organization key.
FIGURE 3. A Calendar dimension with a higher cardinality subdimension.
In this situation, you must specify a single organization in the subdimension
before evaluating the join between the tables. When done correctly, the
subdimension has a one-to-one relationship with the base dimension as if the two
tables were a single entity. Now the entire multienterprise data warehouse can be
queried through the calendar of any constituent organization.
Download